
MEAP Edition

Manning Early Access Program


How Machine Learning Works
Version 5

Copyright 2020 Manning Publications

For more information on this and other Manning titles go to


manning.com



welcome
Thank you for purchasing the MEAP of How Machine Learning Works. I hope this book will
serve as a step forward in your career and an aid in your journey of making your products
smarter.
Machine learning is one of the hottest topics out there. From autonomous cars to
intelligent personal assistants and smart business analysis and decision making, you can find
machine learning almost anywhere. With that abundance, it makes perfect sense that there are
a lot of resources out there teaching machine learning and making it easier by the day for
anyone to get up and running with a functional machine learning product. However, machine
learning is quite different from other kinds of programming; it's an intersection of multiple
fields that include programming, mathematics, statistics, and computer science. Unfortunately,
for an average software engineer like me, when I started learning ML myself, I had a lot of
difficulty finding a resource that presented ML from all these different aspects, showing how all
these fields work together coherently and in a principled manner to give us all these ML tools
and methods. What I was able to find was either guides on how to use off-the-shelf libraries,
with neat programmatic recipes that hide all the meat of the algorithms, or harsh academic
treatments whose mathematical foundations seem distant from what one would use in day-to-
day work. That challenging link, from understanding the mathematical theory to the internals
of how the various ML algorithms work and how they are implemented, seemed missing to
me. This book takes on that challenge and attempts to provide the missing link: an
introduction to machine learning in which practice and theory work together to give you
a deep and working understanding of the field.
No fancy mathematical knowledge is required from you; only basic algebra. The book
teaches the math it needs along the way, and it doesn't teach how to churn numbers like a
regular math textbook. Not that learning to churn numbers is unimportant, but the book is
more concerned with introducing the meaning and intuition behind the math in order to
understand how it serves as the foundation of machine learning algorithms. And as the
book is mainly written for Python software engineers and developers with little to no
knowledge of machine learning, the book takes a practice-first approach; we start by
encountering a real-world problem and see how to create a product that solves it using
machine learning software, and from our practice we start poking under the hood and
discovering the basis of the algorithms and how and why they work; thus completing the circle
of a deep and practical understanding of machine learning.
As we said before, the book takes on a challenging task, and it takes time to write
the chapters in a manner that is worthy of your reading time. So we thank you for
your patience and support while the book is being developed. Your feedback will be
invaluable in improving the book as we go, so please do not hesitate to share your questions,
thoughts, comments and suggestions in the liveBook Discussion Forum.



Thanks again for your interest in How Machine Learning Works. We hope you have a
fruitful read!
— Mostafa Samir



brief contents
PART 0: SETTING THE STAGE
1 The Traveling Diabetes Clinic: A first take at the problem
2 Grokking the Problem: What does the data look like?
3 Grokking Deeper: Where did the data come from?
4 Setting the Stage
PART 1: SIMILARITY BASED METHODS
Prelude: Uniformly Continuous Targets
5 K-Nearest Neighbors Method
6 K-means Clustering
PART 2: TREE-BASED METHODS
7 Decision Trees
8 Hierarchical Clustering
PART 3: LINEAR METHODS
9 Linear and Logistic Regression
10 Support Vector Machines
11 Principal Component Analysis
Appendix A


preface
Machine learning is one of the hottest topics out there. From autonomous cars to intelligent
personal assistants, smart business analysis, and decision making, you can find machine
learning almost anywhere. With that abundance, it makes perfect sense that there are a lot of
resources out there teaching machine learning and making it easier by the day for anyone to
get up and running with a functional machine learning product. However, machine learning is
quite different from other kinds of programming; it's an intersection of multiple fields that
include programming, mathematics, statistics, and computer science. Unfortunately, for an
average software engineer like me, when I started learning ML by myself, I had a lot of
difficulty finding a resource that presented ML from all these different aspects, showing how all
these fields work together coherently and in a principled manner to give us all this ML magic.
What I was able to find was either guides on how to use off-the-shelf libraries, with neat
programmatic recipes that hide all the meat of the algorithms, or harsh academic treatments
whose mathematical foundations seem distant from what one would use in day-to-day work.
That challenging link, from understanding the mathematical theory to the internals of how the
various ML algorithms work and how they are implemented, seemed missing to me.
This book takes on that challenge and attempts to provide the missing link: an introduction to
machine learning in which practice and theory work together to give you a deep and
working understanding of the field.

Why this book


Well, we said earlier that we're writing this book to try and provide a picture of machine
learning where we can use the programmatic tools while understanding their foundations and
how they work on the inside, but a question remains: why?! Why is understanding the
internals of machine learning so important? Why do we need to write a book about it? Why
can't we simply use libraries where we specify a model, train it and use its predictions without
worrying about what the library is doing under the hood? The answer to that question is that
these tools and libraries are leaky abstractions. Leaky abstractions are abstractions that
leak aspects of their hidden details, usually when something goes wrong with them.
Think about the brake system in a car; to slow down a car or bring it to a stop, all you have
to do on your end is simply step on the pedal. Under the hood, that pedal is abstracting a
complex network of pistons, pipes, hoses, hydraulic fluids, and discs that all work together to
bring your car to a stop. The pedal is shielding you from all these intricate inner workings by
simply requiring you to step on it. Unfortunately, this is not the case when something goes
wrong with the underlying mechanism; if a pipe gets pinched or the hydraulic fluid leaks out,
then the system is going to stop working and the pedal can't do anything for you at that point.
The brake pedal is an example of a leaky abstraction.


In a 2002 article 1 , Joel Spolsky, the co-founder of Stack Overflow and Trello, coined the
law of leaky abstractions, which states that:

“All non-trivial abstractions, to some degree, are leaky”

This law states that the more complexity an abstraction is hiding, the more likely it is to be
leaky. In software development, abstractions are inevitable: if we want to efficiently manage
the ever growing complexity of a software system, then there is no escape from using
abstractions. At the same time, we still don't want to drown in the leakage of our abstractions
(pardon the pun); hence, we need to have some understanding of how that abstraction is
working under the hood. Think back to the car brake system: if you are the one who is
making the car or maintaining it, you can't afford not to know how the brake system works and
just treat it as a black box. If the slightest mistake happens during installation or
operation, you're probably going to be in trouble.
Machine learning libraries are no exception to that law. If you use a library to write
something like some_complex_model.train and get a trained model and
some_complex_model.predict to get a prediction on new data, then this library is an
extremely non-trivial abstraction; it's hiding a lot of number crunching and data manipulation
through cleverly designed data structures in order to get your results. By the law of leaky
abstractions, these machine learning libraries are leaky. So, just like the case with car brakes,
if you're the one creating or maintaining the system that uses these libraries, you can't afford
not knowing how they work. This is why this book is being written.

Who is This Book For


This book is for software engineers who are working, going to work, or wishing to work on
machine learning software and want to learn about it. Although the book involves some math,
you are not expected to have a strong mathematical background. You're only required to have
three things:

1. Working knowledge of Python,


2. Some of the very basic algebra you learned in high school, and
3. A computer, a pencil, and some paper to play around with the math.

All the extra python libraries and the more advanced math will be covered gradually as we
move through the book.

1 The article can be found here: https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/


How is This Book Written


Because this book is written for software engineers, we adopt a practice-first approach. We
start each chapter with an example of a real-world problem woven around one of the publicly
available datasets online. We then start to gradually build up a working solution using Python.
In parallel, we explore and motivate parts of the theory behind what we do until we reach a
well-rounded understanding of the theoretical aspects of our solution. The only exception to
that rule is the first three chapters, which all work on the same problem
introduced in Chapter 1. These first three chapters were designed to showcase how a deeper
understanding of theoretical aspects yields a better solution than black-box approaches, and to
show with a working example why this book is written.
When we discuss how to implement a machine learning solution in practice, you may find
sometimes that we start by implementing versions of that solution from scratch before we
resort to using an off-the-shelf library. This is sometimes important for the same reason we
believe that knowing the mathematical foundations is important; these libraries abstract a
lot of programmatic ideas and algorithms just as they abstract the mathematical parts, and that
abstraction is also leaky. So we believe that, in some cases, working on a solution by hand is
very beneficial in understanding how the libraries we're eventually going to use actually work.

What to Expect From This Book


This book will not make you a master of machine learning. We don't believe that any one
book can do such a thing. You become a master of machine learning by reading a lot more than
one book, by working on a lot of machine learning problems and growing your experience, by
experimenting with a solution once and failing, then again and maybe still failing, and trying
one more time and getting it right. The road to becoming a machine learning master is
long. What this book does is put you at the beginning of that road, and maybe walk a few
miles with you to guide you along the way. After that, it's all up to you. But fear not, the
road is full of other companions that can continue the journey with you; whether they are
other books 2 , courses, or even fellow travelers who walked that path before you. You'll
always find help.
By the end of our journey together, you can expect that:

• You have developed a working experience of Python's ML stack, which includes: scikit-
learn, numpy, pandas, matplotlib, and others.
• You have acquired a diverse tool set of machine learning models and algorithms.
• You'd be able to apply, debug, and evaluate a machine learning system for a real-world
problem (like the ones found in Kaggle competitions) using the tools you have through
the above-mentioned stack.

2 The Manning library is full of other very good companion books; you should check them out.


And we hope that by our discussion of the mathematical foundations, you'd develop a
working sense of mathematical maturity that would expand your problem-solving skills
and make it easier for you to tackle more advanced methods and ideas in the field.

How to Run the Book's Code


All of the code that exists within the chapters can be found on the book's Github repository
here: https://github.com/Mostafa-Samir/How-Machine-Learning-Works. Appendix A will help
you set up the required software environment to run this code.


1
The Traveling Diabetes Clinic: A
first take at the problem

“Essentially, all models are wrong, but some are useful”

- George E. P. Box, British Mathematician

This chapter covers:

• The pandas library and how to use it in reading and manipulating data
• The scikit-learn library and how to use it to train ML models

If we want to solidify the reason behind knowing the inner workings of machine learning, there
is nothing better than working through a concrete example ourselves. In part 0, we’ll work
through the Traveling Diabetes Clinic problem, which is an example of a classification problem.
We'll start with a high-level, black-box solution and then gradually dive into a deeper one.
This will allow us to see how a deeper understanding can give us better solutions.

1.1 The Traveling Diabetes Clinic Problem


Diabetes is a serious chronic disease in which glucose, the main source of energy for the
human body, accumulates in the bloodstream without being consumed – hence, it becomes
toxic rather than energetic. It's estimated that around 30 million people in the United States
alone have diabetes, and about 24% of those people are undiagnosed. In order to identify and
address those undiagnosed patients, a group of doctors decided to initiate the Traveling
Diabetes Clinic project.


The goal behind this project is for doctors to travel the country and stay in a certain
location for a couple of days to find people who are undiagnosed, and then initiate their
treatment plans and coach them on how to live an easy and pleasant life with this chronic
disease. The problem is that they have very limited resources (on both the human and
equipment levels) to perform the necessary medical examinations on all who visit the clinic,
and they want to make sure that these resources are efficiently utilized with those who
probably have the disease. What they'd like to have is a way to quickly prioritize the visitors
who probably have diabetes, so that they can utilize their resources more efficiently on
confirming their condition and providing them with the necessary coaching and treatment.
There are two resources they have plenty of information on because they test almost every
patient who visits the clinic: the instant blood glucose (BG) monitoring finger-sticks and the
digital scale that calculates the Body Mass Index (BMI) of the patient. So these are the only
two resources that can be used in any prioritizing system.
The doctors prefer an automated system, so they can cut the time needed to go through
all the records themselves and prioritize the patients by hand. This seems like a job for a
software engineer, and that's why they come to you for help! You need to create a software
system that takes two inputs (the BG value of the patient and their BMI), then produces
one output: whether this patient probably has diabetes or not.
A possible way to solve this problem is for the doctor to design some rules that decide
whether a patient has diabetes given the values of their BG and BMI values. These rules can
then be programmed and used to predict whether the patient is diabetic or not. To create a
reliable set of rules though, the doctors would need to go through mounds of data by hand in
order to figure out the relation of BG and BMI to the existence of diabetes. This is not
possible because they, as stated before, don't have the time for such manual labor.
However, the fact that there is a mound of data that they could go through is interesting;
maybe there's a way that this data can be utilized automatically.
This is what Machine Learning (ML) is all about: using mounds of data to automatically
figure out the underlying relations and structures!

1.1.1 Reading the data with pandas


To start with anything, we first need a sample of people (both diabetics and non-diabetics)
with their BG and BMI recorded. The sample we're going to work with here is the sample
recorded in the Pima Indians Diabetes Dataset.

If the term dataset seems unfamiliar, you can think of it as the dumped content of a database
table.

This dataset contains a sample of 768 adult women of Native American Pima Indian
heritage, each member with records of various medical information as well as whether or not
she has diabetes. The dataset was originally owned by the National Institute of Diabetes and
Digestive and Kidney Diseases, and was donated by The Johns Hopkins University's school of

medicine (which conducted research on the same data) to the UCI machine learning
repository, a source of datasets that we will often use throughout the book. The dataset can
be found in a csv format in the datasets directory of the book's GitHub repository.

The acronym csv stands for “comma-separated values,” a very simple text file format used to
store tabular data in which the columns of each row are separated by a comma, and the rows
themselves are separated by newlines. We can see below the first few lines of the
PimaIndiansDiabetes.csv file that we're going to work with here.

Figure 1.1: structure of a csv file
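
If you don't have the file open in front of you, its general shape is along these lines (an abbreviated, illustrative snippet showing only the three columns we'll care about; the actual file contains several more medical columns per row):

Blood Glucose,BMI,Class
148,33.6,1
85,26.6,0
183,23.3,1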

It's a very convenient way of storing tabular data that makes it both human-readable as well
as machine-readable. We'll find that most of the datasets out there are in csv format, or its cousin
tsv format, where columns are separated by tabs instead of commas. Each line can be thought
of as a row in a table; this row represents one data item, in this case a Pima Indian woman
(except for the header row, of course). Each row has multiple values separated by a comma
(or a tab in tsv files); each value can be thought of as a column's value in a table, and each
column contains some information about the data item. In our case here, each column is one
medical measurement/test for the Pima Indian woman, and among these columns we can see
the two pieces of info we're interested in: Blood Glucose and BMI.
To start working with this information, we need to load this file into our Python program.
This can be achieved using the pandas library, which is one of the most commonly used libraries in
machine learning and data analysis, and is provided by default with the Anaconda distribution.
pandas is a very fast and powerful library providing an easy-to-use set of data structures
designed specifically to be used in the analysis of data stored in different formats, supporting
a set of advanced indexing and querying operations. One of its operations that we're going
to be using a lot is the read_csv method, which allows us to read a csv file (or any variant
of it, like tsv) into our code.


Here we use the read_csv method with a single parameter that specifies the location of
the file to read, but the function accepts many more parameters that we can learn about in
the pandas official docs.
import pandas as pd

data = pd.read_csv("../datasets/PimaIndiansDiabetes.csv")
data

This small piece of code, when you run it in this chapter's notebook in the book's code repository, uses
the read_csv method to read our dataset and dumps it to the cell's output as a pretty,
formatted table. read_csv works by reading our csv data into a data structure called a
DataFrame. If we look at the output of the cell containing the last piece of code inside our
notebook, we'll see that a DataFrame is basically a table, with indexed rows and labeled
columns; so it follows intuitively that if we want to select some data within that table, we'll do
so by providing info about which rows and columns it's in. This is exactly what
DataFrame.loc[row_labels, column_labels] does.

Figure 1.2: Indexing into a pandas DataFrame

The values of these access labels can take multiple forms: we can use single values to retrieve
single entries, we can use lists to index multiple entries at once, or we can use Python's slice
operator ":". These indexing options are essential for us here because we need to retrieve only
the few pieces of information we need from this big table. To do that, we'll utilize the slice
operator to retrieve all the rows (that is, all the patients' records) and we'll use a list
containing just "Blood Glucose", "BMI", and "Class" to retrieve these pieces of data only. The
"Class" column contains a 0 or 1 value, indicating whether the record's subject is diabetic or
not, which is a necessary piece of information for us, as that's what we're trying to predict in
the first place.
data_of_interest = data.loc[:, ["Blood Glucose", "BMI", "Class"]]
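
As a side note, here is a minimal sketch of the other label forms that .loc accepts (the specific row indices and columns picked below are just for illustration):

# a single row label and a single column label give back one value
one_value = data.loc[0, "BMI"]

# lists of labels select several rows and/or columns at once
some_rows = data.loc[[0, 5, 10], ["Blood Glucose", "BMI"]]

# the slice operator ":" selects everything along that axis
all_bmi_values = data.loc[:, "BMI"]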


Now that we have all the data we're interested in, we can start exploring how we can utilize
ML in solving the problem and creating that prioritizing software.

1.2 A Simple ML Attempt with scikit-learn


We're starting simple; we need a simple model to use and we also want a simple and efficient
way to use it. But before we can choose any model, we first need to understand what a model
is exactly.

1.2.1 Choosing a Model


Looking at a diabetic person, we implicitly know that there's some process going on inside
their bodies that links their measurements to their disease. For example, for someone who has
diabetes, we know that there is something going on in their body that causes their BG values
to be above normal; we might not know the exact details of that process, but we know that
there's a process. This is generally the way we look at the world: between the pieces of data
we gather from any phenomenon, we expect that there is a process that leads one piece to
the other. However, these processes can be very complicated to the extent that we can't
capture all their aspects and their details, and that's why we create a model for them.
A model is an assumption we make about how the process under investigation works in
order to approximate it. One of the most important characteristics of a model is that we can
work with it – that is, we have the tools (both mathematical and computational) that allow us
to get some useful information out of it. This characteristic usually means that the model will
lose details from the actual phenomenon; but if the model is good enough we can capture a
lot of the important details.
For example, one of the simplest models is the linear model. In a linear model, we assume
the internal process works through a line – that is, we can differentiate between a diabetic and
a non-diabetic person by determining on which side of a straight line this person appears.


Figure 1.3: A linear model for diabetics

The figure above shows how a linear model for our diabetes problem would look if it were a
good fit to the data; the line separates the diabetics (black dots above the line) from
the non-diabetics (below the line). Such a line is specified by the following equation:

w1·x1 + w2·x2 + b = 0

where x1 and x2 correspond to the BMI and BG values of a patient, and the other quantities w1, w2,
and b are those that define the slope of the line and where it sits on the plane; these
unknown values are where the learning in "machine learning" happens. What the machine
learns here is the values of these unknowns that best fit the data, so that everything
above the line is diabetic and everything underneath it is not – or, to put it
mathematically, for any patient that has BMI and BG values x'1, x'2:

y = 1 if w1·x'1 + w2·x'2 + b > 0, and y = 0 otherwise

where y corresponds to the Class value of the patient and determines whether they're diabetic or
not. In ML jargon, the x values are usually called the features, the y value is called the label,
the w values are called the weights, and the b value is called the bias. Both the weights and
the bias represent the model's parameters. This gives us another way to define exactly what
a model is: it's the relation (or the assumption of the relation) between our phenomenon's
features and its label, parameterized by a set of defined parameters whose value is to be
fitted with the data. This is a very general way of describing what a model is; that's why we'll
find some cases that do not align with it in one way or another, but it remains a helpful
definition for understanding what a model is actually doing.
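
To make the decision rule concrete, here is a tiny sketch of it in plain Python. The values of w1, w2, and b below are made up purely for illustration; finding values that actually fit the data is the learning part we're about to hand over to scikit-learn.

# hypothetical, hand-picked parameters -- not learned from any data
w1, w2, b = 0.8, 0.05, -100.0

def predict(bmi, bg):
    # returns 1 (probably diabetic) if the point falls on one side of the line, 0 otherwise
    score = w1 * bmi + w2 * bg + b
    return 1 if score > 0 else 0

print(predict(33.6, 148))  # some example BMI and BG values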
This linear model assumption seems simple enough to start with for our traveling diabetes
clinic; we now need to put it into action and test it. This is where scikit-learn comes into
play.

1.2.2 Implementing the Model with scikit-learn


scikit-learn is the machine learning library in Python! With a huge community of users,
maintainers, and contributors, it provides a comprehensive, consistent and easy-to-use
machine learning framework covering a wide range of models and algorithms across the
machine learning world. Among the diversity of models that scikit-learn provides, the
simple linear model we chose resides there under the name Perceptron.
Models in scikit-learn are organized by their category; because our perceptron model is
essentially a linear model, we'll find it under the linear_model module.

from sklearn.linear_model import Perceptron


from sklearn.model_selection import train_test_split

X = data_of_interest.loc[:, ["Blood Glucose", "BMI"]]
y = data_of_interest.loc[:, "Class"]


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

classifier = Perceptron()
classifier.fit(X_train, y_train)

The above code basically creates an instance of the Perceptron model and fits the data we
have to it, but what are those three lines in the middle?
The first two lines are self-explanatory; we're just mapping our features data to the
variable name X and the label data to the variable name y, following the mathematical
convention we introduced earlier. In the third line, we take these two variables and split them
randomly into two groups: a group of X_train and y_train, and another of X_test and
y_test. It's obvious from the naming that one group will be used for the model training (that is,
the model fitting) and the other will be used to test the model, but why such a split?
Say that you're teaching a kid how to add two numbers. You'll probably start by
explaining the method and then give some examples to solidify the kid's understanding. Now
you want to test if the kid actually understands the method, so you decide to give an exercise.
Would you repeat one of your examples as an exercise or give something the kid hasn't seen
before? If you used an example that you have used before in the teaching process, you run
the risk that the kid will remember it from before and easily solve it, and your exercise won't
be able to determine if the kid actually understands anything. The only valid solution is to
test the kid on a totally new problem that hasn't been seen before.
The same argument holds when you're teaching a machine learning model; to test its
performance, you need to try the model on data it hasn't seen during training. This is what the
train_test_split method allows us to do. It shuffles the data and splits it into two groups:
a bigger one for training and a smaller one that the model doesn't see until test time.
Evaluating the model on the test data allows us to decide whether to choose this model to
work with or go on and look for a better one; that's why the train_test_split method is
part of scikit-learn's model_selection module. The random_state argument used in the
call to train_test_split is to ensure that the random shuffler is seeded with the same value
across runs; hence making the same split when anyone runs it. We set it to 42 because “it's
the answer to the ultimate question of life, the universe, and everything.” 3
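
If you'd like to see the split in numbers, you can inspect the sizes of the two groups right after the call (a quick check; by default, train_test_split keeps roughly 75% of the rows for training and 25% for testing):

print(len(X_train), len(X_test))  # the training group is the larger of the two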
Now that we've fitted the model to the training data, testing time comes. Will the model be
able to predict diabetic people correctly and with enough accuracy so that we can use it for the
prioritizing task? scikit-learn's models provide us with the score method to do that task; it
takes the test data and returns the portion of correctly predicted samples in the test set.

accuracy = classifier.score(X_test, y_test)
print("Prediction Accuracy: {:.2f}%".format(accuracy * 100))

> Prediction Accuracy: 50.52%

Our linear model is only able to predict about 50% of the patients in the test data
correctly. That's not very good. Even worse, if we try to get the model's score on the training
data itself (which the model is supposed to ace completely), we'll find that it can only predict
about 46% correctly!

train_accuracy = classifier.score(X_train, y_train)

print("Training Prediction Accuracy: {:.2f}%".format(train_accuracy * 100))

> Training Prediction Accuracy: 46.35%

This suggests that the linear model we chose is not suitable for the problem at all. We can
further prove that by seeing that it's performing much worse than a dummy baseline.

1.2.3 Establishing a Baseline


Establishing a baseline is one of the first steps that should be done in any machine learning
project. A baseline is a simple model we train on the data in order to determine a reference
accuracy to compare against the real models we're going to try. This helps us determine
whether the models we try are actually providing any kind of improvement or not.

3 From The Hitchhiker's Guide to the Galaxy by Douglas Adams
One type of model that we can use as a baseline is called a dummy model. Dummy
models do not learn anything from the data; they just generate their decisions by following a
rule that may or may not be related to the data. For example, a dummy model for our
problem here is one that outputs 0 or 1 at random with a 50% chance for each; this is an
example of a dummy rule that is not related to the data. Another dummy model is one that
always outputs the most frequent label in the training data; this dummy model is related to
the data, but it does not learn anything from it.
These kinds of dummy models are provided in scikit-learn under the dummy module. All
of them are implemented in the DummyClassifier class, which accepts a strategy
parameter at initialization. This strategy parameter determines which rule the model is going
to use. Here, we're going to use the most_frequent strategy, which always returns the most
frequent label in the training data.

from sklearn.dummy import DummyClassifier

dummy_baseline = DummyClassifier(strategy="most_frequent")
dummy_baseline.fit(X_train, y_train)

baseline_accuracy = dummy_baseline.score(X_test, y_test)


print("Dummy Prediction Accuracy: {:.2f}%".format(baseline_accuracy * 100))

> Dummy Prediction Accuracy: 64.06%

The dummy classifier is almost 14% better than the linear perceptron model we used! This
gives us stronger evidence against the perceptron model, and supports our hypothesis that it's
not suitable at all for the problem we're trying to solve. We now need to choose a better
model, but it doesn't seem like a good idea to just skim through ML models and try each one
until we hit a working one. We need a principled way to do so.
In the next two chapters, we're going to acquire the necessary tools to allow us to start
choosing the most suitable model (or possible models). These tools allow us to understand
what the data is saying to us, and how we can use what it's saying to our advantage. By the
end of Chapter 3, we will have reached a model that achieves an extra 13% accuracy over the
dummy baseline we just established, so fasten your seatbelts!


2
Grokking the Problem: What does
the data look like?

“Statistics may be defined as a body of methods for making wise decisions in the face of
uncertainty”

- W.A. Wallis, American economist and statistician

In this chapter and the next, we continue with the traveling diabetes clinic problem. We'll be
taking a deeper look inside the data we used, starting to understand its characteristics and see
what it looks like – that is, describe it. By describing the data, we'll be able to choose more
suitable models that fit the data better than blindly choosing a model, like we did in
Chapter 1.
To start to understand the data and describe it, we first need to understand what the data
represents and how it relates to the real-world phenomenon that we're trying to study. That's
what we're going to see in the next section.

2.1 Populations and Samples


According to the World Health Organization (WHO), the number of adults suffering from
diabetes around the world has grown from 108 million in 1980 to 422 million in 2014,
making diabetes a sort of epidemic of the modern world. The number of people who have
diabetes around the world today has probably increased over the number reported by WHO in
2014; but even if we took that number to be valid today, we're looking at a huge population of
diabetics. However, the data we use consists of only 768 records of people, and not all of
them are diabetic. What we have is just a sample out of the population. But a question
arises: does what we learn from a sample reflect the truth about the population?


Figure 2.1: Population and sample

To answer this question, we first need to realize that by using the word population, we don't
mean its demographic meaning – that is, a group of people. While this is certainly true in our
case (we are, after all, studying a group of people), the word population has a more general
meaning that is not specific to humans. The word population can be used to denote any
group of entities that share one or more common properties and attributes: a group of cars
with the same model and specifications is a population, a set of commercial transactions in a
specific market is a population, and, of course, a set of people with a specific health condition
is a population.
From that definition, because all the members of a population share common properties,
it's reasonable to assume that our findings on an unbiased sample selected from these
members are representative of the population. In an unbiased sample, we don't see any
favoring toward specific features in the data. For example, the sample in Figure 2.2(a) is a
biased sample, as it contains only overweight diabetics and normal-weight non-diabetics.


Figure 2.2: (a) A sample biased towards overweight diabetics and normal-weight non-diabetics. (b) An unbiased
sample with no favoring regarding weight.

If our sample includes only the overweight diabetics and the normal-weight non-diabetics, then the
game is obviously rigged! From such a sample, any model we use will predict “diabetic” for
anyone whose BMI value is high. We cannot consider such a conclusion to be representative of
the population. That's why only an unbiased sample can be representative of the original
population, such as a sample constructed by randomly choosing diabetics and non-diabetics
regardless of their BMI values, like the one we see in Figure 2.2(b).
Now, to start building a better model for the traveling diabetes clinic, we need to
investigate the Pima Indians data we have, see whether or not it's a biased sample, and determine
what the best model to fit it would be. That's what descriptive statistics allows us to do.


2.2 Descriptive Statistics


The word statistic can be simply defined as a single piece of info or data digested from a
large set of numerical values, and hence the field of mathematical statistics is concerned with
analyzing, extracting, and finally using these pieces of information in a principled, rigorous
fashion. There are two types of this statistical analysis. One is called inferential statistics,
which is the process of using these pieces of data and information to infer facts and rules that
govern the original population. This is the type we're going to use for building a better solution
to the traveling clinic problem in Chapter 3. The other type is what usually comes before the
inferential step and what we'll start with now, which is called descriptive statistics. As its
name suggests, it's the branch concerned with describing various aspects of the sample in order to get
insights into the data, its validity, characteristics, and shape, which leads us finally to a
principled process for choosing a model.

2.2.1 Mean, Mode, and Median


The question “What is the average value of [blank]?” comes to mind very naturally when we
try to get some insights into how something behaves. We tend to ask about the average price
of the houses in some specific area to get an idea of how expensive the houses are over there.
We also like to ask about the average number of students from a specific school who get into
college every year in order to understand how good the neighborhood’s school is. The power
that lies within the average value of some process comes from the fact that it tells us that
whatever we're going to get out of this phenomenon will probably be close to that average
value.
In statistics, the concept of average is generalized in what we call measures of central
tendency, which are a bunch of descriptive statistics outlining the central values that our
sample values tend to be around. The most famous of these measures is the mean, which
corresponds directly to the concept of average in our previous discussion. The mean is simply
calculated by summing up all the values of the sample, and finally dividing that summation by
the size of the sample. So if we have a sample of values x^(1), x^(2), …, x^(m), the sample mean,
denoted by x̄, is defined as:

x̄ = (x^(1) + x^(2) + … + x^(m)) / m

The number in the superscript of x is not a power, but the index of the example in the
sample. We differentiate it from a power by putting it between parentheses. We reserve
subscript numbers to distinguish different features.

We can easily write a Python function that takes the data from our pandas DataFrame, adds the
values, and then divides that by the number of rows in the data to get the mean value, but
pandas took care of that for us! By calling the mean() method on our DataFrame, we get the
mean values for each column in the data.
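
Just to ground the formula, a by-hand version for a single column could look like the following sketch; in practice we won't need it, as the next snippet shows.

bmi_values = data_of_interest.loc[:, "BMI"]
bmi_mean = sum(bmi_values) / len(bmi_values)  # sum of the values divided by the sample size
print(bmi_mean)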

mean_values = data_of_interest.mean()
print(mean_values)

> Blood Glucose    120.894531
  BMI               31.992578
  Class              0.348958
  dtype: float64

As we can see in the output of the previous code snippet, the mean values are: 120.89,
31.99, and 0.35 for the BG, BMI, and Class respectively. From this description, we can say
that the values of BMI for the examples in the sample tend to be around the value of 31.99.
The same can be said about BG. However, it's not very clear what we'd mean if we said that
the Class of the members (being diabetics or non-diabetics) is centered around 0.35. This field
of data can only take one of two values, 0 or 1. How can its average be a real number?
In the context of understanding the central tendency of a sample of discrete values like the
examples' Class, the mean is not a very suitable measure. So, instead, two other measures
can be used to give more meaningful information about the central tendency in that case. The
first one is the mode, which is simply the most frequent value across the sample. This
captures similar information to the mean but for discrete values, in the sense that what tends
to be the most common in a group is the average setting within the group. pandas also
provides us with the mode() method in order to calculate the mode of the values across the
columns of a DataFrame.

mode_values = data_of_interest.mode()
print(mode_values)

>    Blood Glucose   BMI  Class
  0             99  32.0    0.0
  1            100   NaN    NaN

We can see now that the mode of the Class field is 0, meaning that the most common
members in the sample are non-diabetics. And as we can see, the mode can also be calculated
for continuous data like the BMI; and in the case here, its value is close to the mean value (32
and 31.99). Another thing to notice is that the BG has two mode values, 99 and 100, unlike
the other fields, which have only one mode. This happens because the 99 and 100 BG values
are the two most common values in the set, and they happen to occur the same number of
times.
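
To see how such a tie produces two modes, here is a tiny sketch with made-up numbers in which 99 and 100 each appear twice:

import pandas as pd

tiny = pd.Series([99, 99, 100, 100, 117])
print(tiny.mode())  # both 99 and 100 come back, since they are equally frequent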


The remaining central tendency measure that we can use with discrete data is the
median, which is simply the value that splits the sample into two halves: a lower one and a
higher one. The median is the measure most related to the concept of “center,” as it's
basically the value at the center of the values, which gives a sense of where the other values
hang around. While calculating the median of a sample is only a matter of sorting the values
and reporting the value in the middle, pandas provides us with a median() method as well to
spare us the task of having to write something specifically to carry out this process.

median_values = data_of_interest.median()
print(median_values)

> Blood Glucose    117.0
  BMI               32.0
  Class              0.0
  dtype: float64

We can see from the output that the median and the mode agree for the Class field, both
being 0. However, the median would be more informative when working with discrete data
that have more than two possible values, as it gives a clearer view of a center compared to
the case with Class, which here has only two possible values. We can also see that like the
mode, the median is close to the mean value of the BMI, but that's not the case for the Blood
Glucose. This kind of discrepancy between the central tendency measures calls into question
the distribution of the data itself, which is responsible for whether these measures align or not.
The first step in understanding how the values of a sample are distributed is to understand
how these values vary from their central tendencies, which is provided by another set of
descriptive statistics called measures of dispersion.

2.2.2 Ranges, Sample Variance, and Sample Standard Deviation


Just like the question about the “average value” comes naturally to mind, the questions about
the minimum and maximum values usually follow. Knowing the lowest and the highest price of
houses in some area in addition to the average info gives us a simplified sense of how the
prices in this area vary. The same sense holds in descriptive statistics as the information about
central tendency, along with the minimum and maximum values of the sample, could start
giving us a view on how the values are varying. For this reason, pandas gives us the min()
and max() methods that allow us to find the minimum and maximum values of our data along
each column in the DataFrame.

min_values = data_of_interest.min()
max_values = data_of_interest.max()

print(min_values)
print("=========")
print(max_values)


> Blood Glucose      0.0
  BMI                0.0
  Class              0.0
  dtype: float64
  =========
  Blood Glucose    199.0
  BMI               67.1
  Class              1.0
  dtype: float64

Something interesting appears in the minimum values of BG and BMI. It seems that some
records have a BG of 0 and/or a BMI of 0. These values are not biologically acceptable; it
essentially means that the subject has zero weight or has no glucose whatsoever in their blood
– aka, they’re dead!
Usually, such cases of unreasonable measurements are due to errors in the
recording/measuring process. We consider such cases as missing data. The existence of
these missing data could throw off our analysis and give us wrong inferences afterward, so we
need to clean our data, get rid of these missing values, and use only the correct data we have;
this can be done by selecting all the entries that don't have a 0 BMI or a 0 BG from
data_of_interest.
pandas provides us with a very neat way of doing such filtering and querying on its data
frames. Back in Chapter 1, we saw that the selector labels passed to DataFrame.loc can be a
slice or a list; in addition to that, pandas supports the usage of lists of Boolean values as label
selectors. The idea is very simple: given a Boolean list that has the same length as the number
of labels along an axis, labels that correspond to a False element will be excluded from the result,
while those corresponding to a True element will be included. It's as if this Boolean list masks
out the DataFrame, with False elements hiding their corresponding entries and True
elements bringing them to the surface.
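
A tiny sketch with a made-up Series shows the masking idea before we apply it to our own DataFrame:

import pandas as pd

tiny = pd.Series([31.2, 0.0, 28.5])
mask = tiny != 0   # element-wise comparison gives [True, False, True]
print(tiny[mask])  # only the entries under a True survive the mask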
To utilize this feature for cleaning our data, we make use of the fact that we can run
arithmetic and logical operations (especially comparisons) on the pandas objects in an
element-wise fashion; that means the operation gets applied on every single element in the
object and the resulting object will contain the result for each element! So if we ran the
following statement:
bmi_zeros_mask = data_of_interest.loc[:, "BMI"] != 0

We get in bmi_zeros_mask a sequence of Booleans where there are Trues in place of nonzero
BMI entries and Falses otherwise. We do the same for the BG values and query
data_of_interest with their logical AND result to get our clean_data_of_interest:
bg_zeros_mask = data_of_interest.loc[:, "Blood Glucose"] != 0

clean_data_of_interest = data_of_interest[bmi_zeros_mask & bg_zeros_mask]


We can check the validity of our cleaning process by running the min() and max() methods
again, but this time against the new clean version. This will show us that the minimum values
of BMI and BG are no longer zeros.

min_values = clean_data_of_interest.min()
max_values = clean_data_of_interest.max()

print(min_values)
print("=========")
print(max_values)

> Blood Glucose     44.0
  BMI               18.2
  Class              0.0
  dtype: float64
  =========
  Blood Glucose    199.0
  BMI               67.1
  Class              1.0
  dtype: float64

Missing data: To impute or to amputate?


Missing data is one of the most common problems faced by people working with data. Missing data are not
good; they limit the power of the analysis we can make by giving us fewer data than we originally planned for,
and increase the chance of introducing biases in the results. Experiments and sampling should be designed in
the first place to minimize the amount of missing data, but some of it is inevitable. So anyone working with data
should know how to handle it.
There are two main categories of approaches in handling missing data: removal, in which the missing data is
discarded and amputated from the data set completely, and imputation, in which we try to fill in the missing
values. Which approach to use depends on the type of missing data we have, and there are three of them:

• Missing Completely at Random (MCAR): In this type, the fact that the data is missing has nothing to do with
any of the other measurements of the experiment, for example, Blood Glucose data missing because of
faulty measurement devices or recording mistakes.
• Missing at Random (MAR): In this type, the missing data depend on some other measurements in the
experiments, but these measurements are known. For example, if the BMI data was missing from
teenagers because they are self-conscious about their figures, and we can confirm this hypothesis from
known age data in the dataset, then these data are MAR.
• Not Missing at Random (NMAR): In this type, the reason the data is missing is related to our primary
measurement of interest. For example, if people with diabetes refused to step on a scale to get their BMI
calculated, then the data in this case is NMAR.

If the missing data is either MCAR or MAR, then it's safe to discard them provided that this cut doesn't
significantly reduce the size of the dataset. However, if the missing data is NMAR or there's a lot of it, then
discarding it would put the model at risk of bias, and an imputation strategy needs to be followed.
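
For a concrete (if simplified) picture of the two approaches, here is a sketch using our BMI column, with the zeros playing the role of missing values; the mean imputation shown is just one simple strategy among many:

import numpy as np

bmi = data_of_interest.loc[:, "BMI"].replace(0, np.nan)

# removal: drop the rows whose BMI is missing
bmi_removed = bmi.dropna()

# imputation: fill the missing BMIs with the mean of the observed ones
bmi_imputed = bmi.fillna(bmi.mean())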


Determining which of these types your missing data is isn't always straightforward; you can continue reading
here 4 on how to do that if you're interested.

4 https://measuringu.com/missing-data/, last visited July 21st, 2018

Though the min and max values allowed us to detect and fix a serious bug in our data, we
have to ask the question: Are they enough to describe how the data is distributed? Let's get
back to our house prices example. Let's say that we know that the min and max prices in the
area we're looking into are 50k and 750k USD. Wouldn't it be better if we also knew that 25%
of them are for 70k or less, while 75% of them are for 300k or less? With this information, we
can tap into a more detailed view of the prices distribution.

Figure 2.3: House prices values distributions across different ranges

The values under which a specific portion of the sample falls are called quantiles. We
usually couple the definition of a quantile with a proportion q, such that 0 < q ≤ 1, so a q-
quantile is the value under which a q-proportion of the sample falls. A 0.15-quantile is the value
which 0.15 of the sample (or 15%) is less than or equal to, and when we say that the 0.68-
quantile is 74, then 68% of the sample is less than or equal to 74. While we can calculate any
q-quantile we want, there are some special quantiles that are used more often than others,
like those at 0.25, 0.5, and 0.75, which have special names of their own: the 1st quartile
(Q1), 2nd quartile (Q2), and 3rd quartile (Q3) respectively (quartiles, as in they split the
sample into quarters). These quartiles are used to calculate a measure of dispersion called the
Inter-Quartile Range, or IQR = Q3 - Q1, which is the range in which the middle 50% of the
sample falls.
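
Here is a quick sketch of these quantities on a made-up sample, using pandas' quantile method (we'll apply the same method to our real data shortly):

import pandas as pd

toy = pd.Series([3, 5, 7, 8, 12, 13, 14, 18, 21])
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1  # the range covered by the middle 50% of the sample
print(q1, q3, iqr)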

As we saw visually in our housing prices analogy, quantiles help us visualize the dispersion of
the prices; the same visual representation is generalized into the field of descriptive statistics
to show how the quantiles, and the IQR derived from them, give a description of the values'
distribution. This visual representation is called a box plot.

Figure 2.4: (a) The box plots of our data of interest; (b) the interpretation of box plots.

The box plots are pretty simple to interpret once we understand what each item in the plot
means. As we can see in Figure 2.4(b), the plot consists mainly of a box with two whiskers:
the box boundaries are the 1st quartile and the 3rd quartile, and so it extends across the IQR.
Within the box, a line is drawn across the median value, and outside the box the whiskers
extend beyond the boundaries toward the minimum and maximum values with a reach no
longer than 1.5×IQR. Any values that lie outside the reach of the whiskers are marked with special
markers, like we see with the high BMI values in Figure 2.4(a).
With this in mind, we can start interpreting the box plot as a description of dispersion by
investigating the size of the box and the length of the whiskers. The smaller the size of the
box, the denser the values are around their central tendency, and vice versa. This is shown
in Figure 2.5, along with the fact that the lengths of the whiskers, together with the position of
the median line within the box, tell us how symmetric the values are: whether they are distributed
similarly on both ends (when the whiskers have similar lengths), or whether the values tend to be
skewed toward one end more than the other (the end with the shorter whisker).

Figure 2.5: What different box plots tell us about the distribution of the data.

Quantiles and box plots can be easily calculated and created with pandas, with no need to do
all the work by hand. Simply by calling the quantile() method on our
DataFrame, passing a list of the q values of the quantiles we desire, we get another frame with
the specified quantiles calculated for each column in the data. The box plot, on the other hand,
can be created simply by calling the box() method on our DataFrame.plot object; this will
create for us the box plots for each column in the data, grouped in a single figure.

quartiles = clean_data_of_interest.quantile([0.25, 0.5, 0.75])

print(quartiles)
clean_data_of_interest.plot.box()

>       Blood Glucose   BMI  Class
  0.25          99.75  27.5    0.0
  0.50         117.00  32.3    0.0
  0.75         141.00  36.6    1.0

The resulting plot from the above code looks exactly like the one in Figure 2.4(a), only
different in drawing style.
While quantiles and box plots give us a very good idea about the dispersion of the sample,
we're still missing some information. For example, we don't have any idea how the values
along the IQR are dispersed, and the box doesn't show us this information. One way to get this
information is to draw box plots for the values within the box, and then another box plot within
that new box to get finer results, and so on until we have ourselves a Russian nested box plot
like in Figure 2.6!

Figure 2.6: Russian nested box plots


This is clearly a bad idea! It's an inefficient way to capture the fine details of the variation
within the smaller groups of data. A much better idea is to leave the box plots just the way
they are, and to have a general estimate of how much the values differ from their central
tendency – that is, the average distance to the central tendency across the whole sample. This is
precisely the definition of what we call the sample variance s² and the sample standard
deviation s (just the square root of the sample variance). Using the mean value, we can
calculate s² and s as:

s² = [(x^(1) - x̄)² + (x^(2) - x̄)² + … + (x^(m) - x̄)²] / (m - 1),    s = √(s²)

The formula is fairly simple to understand: we just calculate the average of how much each
sample value deviates from the mean. We sum the squares of the deviations instead of the
deviations themselves to make the value of the sample variance positive all the time. The
significance of these values lies in the fact that they give a summary of the distribution of
the sample: the smaller these values get, the more condensed the sample is around its center,
and vice versa. As usual, pandas provides us with quick and easy methods to compute both
the sample variance and standard deviation: the var() and std() methods which return a
Series with the desired statistic for each column in the frame.
sample_vars = clean_data_of_interest.var()
sample_stds = clean_data_of_interest.std()

print(sample_vars)
print("==========")
print(sample_stds)

> Blood Glucose    936.433323
  BMI               48.010018
  Class              0.228121
  dtype: float64
  ==========
  Blood Glucose     30.601198
  BMI                6.928926
  Class              0.477621
  dtype: float64

The interpretation of standard deviation is fairly straightforward; when we say that the
standard deviation of BG is 30.6, this means that the BG values lie, on average, around 30.6 points
from the mean value.

So why m-1 and not m?


We described the equation of the sample's standard deviation as being an average of the squared deviations
from the sample mean across the m samples we have. However, when it was time to divide the sum of the
squared deviations by m, the size of the sample, we divided by m-1 instead! So what's the idea behind that?
To illustrate the intuition behind this modification from m to m-1, let's first go through a quick question: If we
were told that a sample of numbers of size 3, whose values are unknown, has a sample mean of 2, how many
members of this sample are free to vary and take any value?
Well, let's see! For the first member, we are free to choose any possible value, so let's give it the value 2. For
the second member, we still find no issue in assigning any arbitrary value to it, we're still free to choose
whatever we want for a value! Let's choose 3! Are we still free to choose any value for the third member? No!
To understand why, recall that we already know the sample mean of these 3 members, so with the possible
value x3 of the third member, we must have that:

2 = ⅓ (2 + 3 + x3)


Solving this for the value of x3 bounds us to x3=1! The third member must have the value of 1, and we cannot
assign it any arbitrary value. As we know the sample mean, we lose our freedom on that member.
The same argument can be made for any sample of any size. As long as we know the sample mean, we
lose the freedom to vary one member of the sample. So if we have m members, knowing the sample mean
constrains us to only m-1 degrees of freedom.
With the sample variance, we're trying to estimate the variation across the sample, so it makes sense to only
average over the size that can actually vary! Since we already know the sample mean and use it in the
calculation, we lose one degree of freedom and end up with only m-1 members that can actually vary freely, so

we average over m-1 and not m.

There's a more rigorous explanation of why we're using m-1 that we're going to see later. Until then, this
gives a valid intuitive explanation of such modification.
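A quick way to see the m-1 denominator in action is to compute the sample variance by hand and compare it with what pandas gives us. The sketch below assumes the clean_data_of_interest DataFrame we've been working with; pandas' var() divides by m - 1 (ddof=1) by default, while NumPy's np.var divides by m (ddof=0) unless told otherwise.

import numpy as np

bmi = clean_data_of_interest.loc[:, "BMI"]
m = len(bmi)

manual_var = ((bmi - bmi.mean()) ** 2).sum() / (m - 1)  # divide by m - 1, as in the formula above
print(manual_var, bmi.var())                            # the two values should match
print(np.var(bmi.to_numpy()))                           # NumPy divides by m, giving a slightly smaller value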

2.2.3 Histogram Plots


Let’s go back to our house pricing analogy. To get a better understanding of the distribution of
the prices of the houses, you start asking: “So how many houses cost $50k? And how many of
them are worth $120k? What about $300k?” And so on and so forth, but this soon becomes
problematic! There are probably so many houses with prices that fall between these values
that we could ask about. There's probably a house that's worth $55K, $88.5K, or $245.36K.
We could include these prices in our questions as well, but we still run the risk of having
houses that cost, say, $75K and fall between $55K and $88.5K! To get out of this dilemma,
we need to ask a different question.
Instead of asking about the counts of houses with specific prices, we start asking about the
count of houses that fall within a specific range of prices. For example, instead of asking “How
many houses have the price $50k? And how many of them are worth $120k?”, thus failing
to account for the houses with prices between these two, we instead ask, “How many houses
have a price between $50K and $120K?” and include all those houses with intermediate prices.
So what we basically do is divide the whole range of prices we're interested in into smaller
ranges, and start asking questions about the number of houses within each small range.
Depicting this information reveals an interesting visualization of the data, as we can see in
Figure 2.7.


Figure 2.7: Counting houses within specific price ranges

This kind of visualization is typically known as a histogram plot, in which the range of data
values is split into smaller ranges called bins, and the height of each bin indicates how common
the values inside it are in comparison to the other bins. In the sketch we have for the house
prices, the height of each bin is the count of houses that fall within it. We can easily
generate similar plots for our data via pandas plotting methods, by simply calling the hist()
method on the DataFrame.plot or Series.plot objects.
clean_data_of_interest.loc[:, "BMI"].plot.hist()

The result of this code is depicted in Figure 2.8 below.


Figure 2.8: The histogram plot of the BMI values

A histogram plot gives us a clearer visual representation of how the different values of our
sample are distributed.
In this form, where the vertical axis of the plot represents the counts of the data points,
the histogram has a striking similarity to simple bar charts. However, histograms were not
designed to be just bar charts; histograms were actually designed to use the sample data to
get an idea of how the distribution of the original population looks. So the histogram can be put
into another, more useful form.
Let's go back for a moment to the house prices analogy. Instead of sizing our bins at $70K
like we had before, let's make them smaller and say we have a histogram with bins of size
$2K. Within each of these small bins, we see that the variation of prices is so small that
we can attribute it to minor factors other than the house itself, like the utilities and
the number of floors and so on. But we can actually expect that all the houses within this bin
are similar in major characteristics like the area, the number of rooms, and so on.


Figure 2.9: A deeper look inside a histogram bin

Imagine that the bin between $50K and $52K is populated by seven houses that we found on
the market: two of them are worth $50K, three of them cost $51.5K, and the other two are
offered for $52K. By noticing that these houses are almost distributed similarly, and powered
by the assumption that they are mostly similar in major characteristics due to the small
variation within the small bin, we are motivated to think that had we explored more houses on
the market, we would've found other houses with unseen prices within that range! For
example, we could have found four at the price of $51K, or two worth $50.5K.
The power of this idea is the fact that it can get us to start making predictions about unseen
data values (aka the rest of the original population) given the limited sample we have. This is
the original purpose of the histogram, and it is the fundamental assumption behind the
alternative form of the histogram we're going to see now. To put it in different phrasing, we
assume that given an appropriate bin size, it's highly likely that all the unseen values within
that bin are distributed similarly to the observed ones. To capture that assumption into the
histogram plot, instead of using the actual counts, we divide the counts by the width of the
bin, giving us some kind of density (analogous to the physical density of matter), or how
many counts we'd expect to see within a small range of length 1 inside that bigger range –
even if this small range covers unseen values!


Figure 2.10: Histogram density vs. physical density.

There's no straightforward way to create a density histogram using pandas, but there is a very
easy workaround. The call to hist() can accept a weights argument, which should be a
sequence of the same size as the data and each of its elements specifies how much a single
data point should contribute to the counts. When this argument is omitted, each data point
simply contributes a 1 to the count. However, when present, the value of the weight element
is multiplied by that 1, and the result is the amount that the data point contributes to the
count. By noticing that dividing an integer A by a value B is exactly the same as summing
1/B A times, we can divide the counts in the regular histogram by the bin size by providing
a list of weights, each of which is 1 / bin_size.
bin_size = 4.89  # the bin size in the 10-bin histogram

data_size = len(clean_data_of_interest.loc[:, "BMI"])
weights_seq = [1 / bin_size] * data_size

clean_data_of_interest.loc[:, "BMI"].plot.hist(weights=weights_seq) \
    .set_ylabel("Density")  # changes the label on the y-axis

The resulting plot (figure 2.11) looks similar to the plot in figure 2.8, only with the vertical axis
now describing the density instead of the frequency.


Figure 2.11: Density histogram for the BMI values.

With the density form of the histogram, we get a view of the distribution that goes a step
further from the limited sample we have and into the original population; the density value
tells us how dense each range of values is compared to the others in the original population. So
we'd expect that there are more members in the population with a BMI around 40 than there
are with BMI around 20, as the density in the range containing 40 is higher than it is in the
range containing 20.

What is an appropriate bin size?


Back when we made the assumption that the values within a small enough bin are highly likely to be distributed
similarly, hence motivating the notion of density, we conditioned it on the bin size being
appropriate. So how can we choose an appropriate bin size?
There are many approaches that could be used to determine a good bin size. One of them, which is designed
to make the histogram look as much as possible like the original population distribution, is the Freedman-
Diaconis rule:

bin size = 2 · IQR / m^(1/3)

where IQR is the interquartile range of the data and m is the sample size.

We can instruct the hist() method to use this rule for the size of bins it draws via the bins argument, by
passing the value 'fd' to it.
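For example, a minimal sketch: depending on your pandas version you may be able to pass the string directly as bins='fd'; a more portable route is to compute the edges with NumPy first and hand them over.

import numpy as np

bmi = clean_data_of_interest.loc[:, "BMI"]
fd_edges = np.histogram_bin_edges(bmi, bins='fd')  # bin edges chosen by the Freedman-Diaconis rule
bmi.plot.hist(bins=fd_edges)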


With histograms and the accounting of other unseen values through the notion of density,
we're one step closer to breaking the shackles of the limited sample and into the world of the
original population it was drawn from, which will finally allow us to construct a better model
for the problem we set out to solve.

EXERCISE 2.1

Generate the histogram plots (both frequency and density) for the Blood Glucose values.

EXERCISE 2.2

Is the data sample we're using biased, or unbiased? Think about how you can determine that
using the tools you learned in this chapter.


3
Grokking Deeper: Where did the
data come from?

“Probability is the intersection of the most rigorous mathematics and the messiest of life”

Nassim Taleb, Lebanese-American statistician

In the previous chapter, we ended with a look at how the data within the sample is
distributed. Understanding how the data is distributed is key to choosing the correct model to
solve a problem, and because we are interested in models that can work well not just on the
sample data but on any other data from the same population, we take a leap from the
restraints of the sample and into the realm of the population from which the sample was
drawn.
This chapter uses the descriptive statistics we generated in the previous chapter and starts
to explore and approximate the original population of the data. By approximating the original
populations of our data, we acquire the tools that will allow us to create a much better model
to solve the diabetes clinic problem we have; a model that is +31% more accurate than the
perceptron one.
This chapter delves into probability theory, which can make it look a bit more mathy than
the chapters before it. But as we established before, mathy is not the equivalent of
inaccessible. Remember to be patient, take your time, and keep a piece of paper and pencil
beside you to doodle around with the equations and examples. At the end, you'll be very
rewarded by the results of our new model!


3.1 Probability and Distributions


Let's go back to the first form of the histogram, the one with the counts, and leave aside the
notion of density for a little while. Although such histograms give us a distribution of counts
over the data values, this count distribution is still tightly linked to the size of the sample we
have. In an attempt to decouple the count distribution from the limited size of the sample, we
may attempt to normalize it – that is, instead of having absolute counts like this, we divide all
the counts by the total sample size and express them as ratios. We can achieve this by again
using the weights argument of the hist() method, but this time setting all the weights to 1 /
sample_size instead of the bin size, which results in the histogram in Figure 3.1.
sample_size = len(clean_data_of_interest.loc[:, "BMI"])
weights_list = [1 / sample_size] * sample_size

clean_data_of_interest.loc[:, "BMI"].plot.hist(weights=weights_list)

Figure 3.1: The ratios histogram for the BMI data.

What's nice about having ratios instead of absolute counts is that it allows us to make small
predictions about the values we could get from the population. Remember the whole premise
here is that our sample is representative of the original population – so what we learn from
one sample, we expect to see in any other sample. For example, we can see that the first bin
(values around 20) contains about 0.07 of the data in the sample; if we took another sample
from the population with a different size, say 2000 samples, and measured their BMIs, we'd
expect to also see about 0.07 of them around 20 – that is, about 140 of them. This is an interesting
prediction about the behavior of the whole population in case we sample more data, but we
can take this a little bit further.


Imagine we obtain a sample of size 1 from our population. What does it mean that 0.07
of that sample has a BMI around 20? Following the earlier case, we'd have to conclude that
0.07 of that person has a BMI around 20. And to say that 0.25 of the sample has
a BMI around 30 means, in our case, that a quarter of that sampled person has a BMI
around 30! This seems a bit ridiculous. Obviously a person cannot be divided in such a way
that different parts of their body have different BMI values. We need to look at this ratio in
a different light to make sense of that 1-sample case.
Let's think about a sample of size 1000. About 70 of them will have a BMI of around 20,
and about 250 would have a BMI around 30. Assume that, from this sample of 1000, you
randomly selected one of them. Which do you think is more likely: that the member you
selected has a BMI around 30 or around 20? You're probably going to answer that it's more
likely that the member you selected will have a BMI around 30, as there are more of these
members than those with BMI around 20, thus raising their chances of being randomly
selected. The same argument will hold no matter how big the sample size is; the higher ratio
of 0.25 corresponds to higher degree of belief (aka more probable) that a randomly chosen
member will have BMI around 30, and the lower ratio of 0.07 corresponds to a lower degree of
belief (aka less probable) that a randomly chosen member will have a BMI around 20.
With this new interpretation of the ratios as a probability, or a measure of how much we
think or believe about the data, we can think about the case of a sample of size 1 by saying
that the ratio of 0.07 represents how probable it is that the single member of our sample will
have a BMI around 20 – that is, this ratio represents a probability for that event occurring.
We can now bring back the notion of density and couple it with the idea of probability to get
ourselves a distribution that is nearly free of any ties to the original limited sample we started
with: a histogram of probability density. Such a probability density histogram can be obtained
by dividing the counts by both the size of the sample and the width of the bin, which can be
done by setting the weights to 1 / (bin_size * data_size). A simpler way, however, is provided
by the hist() method itself: simply set the argument normed to True!
clean_data_of_interest.loc[:, "BMI"].plot.hist(normed=True) \
.set_ylabel("Probability Density")
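A quick caveat: in more recent versions of pandas and matplotlib, the normed argument has been removed in favor of density, so if the call above raises an error on your setup, the equivalent call is:

clean_data_of_interest.loc[:, "BMI"].plot.hist(density=True) \
    .set_ylabel("Probability Density")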


Figure 3.2: The probability density histogram.

With this representation of the probability density that is partially decoupled from what the
sample looks like, we can start connecting the dots toward a more general representation of
the BMI's population. If we attempt to connect the points at the heads of the bins as smoothly
as possible, we get an estimate of a smooth function that describes the probability density
distribution more generally, as we can see in Figure 3.3.

Figure 3.3: Approximated probability density function

Though we tried to connect the points as smoothly as we can, this estimate of the function is
still pretty rough! The number of points is so small that the segments connecting them are


basically lines, and the connections between each two lines create a sharp edge. We need
more points to get a better and smoother estimate of that function. We can get more points
by decreasing the bin size, allowing for more bins to emerge, and hence more points to
connect, which we can see in Figure 3.4.

Figure 3.4: Decreasing the bin size from top-left to bottom-right, allowing for a smoother estimate for the
probability density distribution.

We can see that as we gradually decrease the size of the bin, the function estimate gets
smoother and smoother, until we reach the bin size calculated by the Freedman-Diaconis rule
(bottom-right plot). Any decrease beyond this value would distort the histogram and give us bad
estimates of the probability density distribution (remember that the goal of the Freedman-
Diaconis rule is to give a bin size that would make the histogram look like the original
distribution).
Though we can't go any further than this in estimating the distribution given the limited
data we have, no one can say we cannot dream! Imagine that we have unlimited access to
more data, so we choose some more samples from the population, plot them in a histogram


with the smallest bins possible (like in Figure 3.5), and then do that again infinitely many
times, until the bin sizes become very, very, very small, so small that we cannot say "very"
enough, so infinitesimally small that they approach zero. Once we reach that point,
then we have it! The heart of the BMI's population: its probability distribution function.

Figure 3.5: How the estimate of the probability distribution would get smoother if we had more data
from the population.

Though everything we just did remains in the realm of dreams, it tells us that there is
something out there controlling the sample and the population from which we have drawn it:
the probability density function. While we can't reach that function starting from the sample,
we can go the other way around! We can study how such functions behave regardless of any
concrete data samples, and once we figure that out, we go back and plug in what we know
from the sample into them and finally get an approximation of the population, out of which we
can start making better models.


3.1.1 Random Variables, Distributions, and their Properties


The probability density function (or PDF), by virtue of being a function, must be defined over
some variable. In our case, here with the BMI, the variable represents, obviously, the BMI
value of any member sampled from the distribution. If we take a closer look at how we obtain
a value for that BMI variable, which we'll denote by X1 from now on, we'll see that the process
goes as follows: we randomly select a member from the population, measure their weight and
height, calculate their BMI, and assign that value to X1.

Figure 3.6: The road from the selection outcome to the BMI value

Choosing a random member from the population is by definition a random process that gives
us an outcome which is the person we get out of the population. This random process is called
a random experiment, because we're basically experimenting with what the population could give
us when we sample from it. With each outcome, we're interested in their BMI value, so we
calculate it and use it as a representative of the human being. This calculation maps that
outcome (the human being) into a representative numerical value, which is the BMI. This BMI
value is expected to vary each time we carry out the experiment; that's because we're
probably going to select a different person with a different height and weight. Because that
value varies, we denote it with a variable X1. And due to the fact that it represents an
outcome of a random experiment, this variable becomes a random variable.


If we denote the PDF associated with the BMI distribution by fX1, then we can write
fX1(X1 = 30) = 0.057; it's just a small, compact way of saying that the probability density of X1
taking the value 30 (that is, of a sampled person having a BMI of 30) is 0.057. Neat, isn't it? One of the beauties of math is that it
can say so much with so little, and this compact way of talking allows us to make more
complex statements without having to spell them out into huge paragraphs.

3.1.2 How to read math?


The fact that math is just a compact way of saying something that would take long textual
sentences to say is key to making mathematics more accessible to us. If we read the
mathematical statement fX1(X1 = 30) = 0.057 as "f of X one equaling 30 equals 0.057," it will
sound like incomprehensible nonsense. On the other hand, by recalling that we use X1 to
denote the BMI value, we can read it as "the probability density of the variable X1,
representing the BMI value, having the value of 30 is 0.057," which makes much more sense.
Looking beyond the symbols and into what the equations mean is the key to breaking the
barrier between us and math. That is why we focus in this book on the intuition and the
meaning behind the formulas, not on how to plug numbers into them and get results the way
you usually see it in a typical math textbook.
This approach to reading math proves itself to be really efficient and beneficial when the
formula we have doesn't have numbers in it. For example, let's take an abstract version of the
statement we just saw, one that contains no numbers:

fX1(X1 = x1) = a

If we read that as "f of capital x one equaling small x one equals a," then we're back to
the incomprehensible nonsense. Instead, by knowing that random variables are always
denoted with capital letters and their values with small ones, we read it as "the probability
density of the random variable X1 having the value x1 equals a," which makes sense
again. Practicing this way of reading math will finally make your brain automatically transcribe
formulas within text into their textual description, making them indistinguishable from the
nonthreatening text you know and love.
The same concept of random variables holds for data that do not have direct numeric
meaning, like Class in our dataset, which specifies whether the individual is diabetic or not.
This is done by mapping the “categorical value” of being diabetic or not with a random
variable Y to a representative numerical value, either 0 or 1. However, with random variables
that take such discrete values, the value of the probability density function is interpreted
differently compared to the case where the variable takes continuous values (like the BMI);
this is due to the inherent difference in nature between continuous and discrete values.


Figure 3.7: The difference between discrete and continuous values is that between every two continuous values,
there is an infinite number of other continuous values, which is not the case with discrete values.

With discrete values, even with an infinite amount of them, we can still recognize some empty
spaces between the values. With something like the variable Y, we can see that it can take
either 0 or 1, and no other values in between, as someone cannot be 0.23 diabetic and 0.77
not. But that's not the case with continuous values; each time we think that there is a space
between two values, we find out that this space is filled with even smaller fractions between the
two values. Between the BMI values 18.2 and 19.2 we can expect to find a BMI of 18.28, and
between 18.2 and 18.28 we can find 18.26005 and 18.27009, and so on and so forth.5 This is the
reason we call them continuous: they continue in a smooth stream without the
spacing we find with the discrete ones.
Now, let's think back to the process we used earlier to approximate the PDF of the BMI, by
imagining infinitely many samples from the distribution and reducing the bin size of their
histogram gradually to infinitesimal size.

5
This is of course in theory; in practice, we expect that our measuring devices will have a certain precision that we can't go beyond.


Figure 3.8: (left) Reducing the bin size with continuous values allows the bins to capture finer ranges of the
values, but reducing bin size beyond 1 with discrete values (right) is useless as there are no finer ranges to
cover.

In the continuous case, gradually reducing the bin size down to an infinitesimal size allows the
bins to cover finer and finer ranges of values that exist within the larger ranges, as we can see
in Figure 3.8. But with the discrete values, once we reach a bin with size 1 that includes our
value, there are no smaller values to cover by reducing the bin size, so reducing it would be
useless. Unlike the continuous case, the perfect bin size to approximate the original
distribution is 1. And because the value of the probability density function for continuous
variables is just probability / bin_size with an infinitesimal bin size, in the discrete case, where
the perfect bin size is 1, the function's value becomes the probability itself. And by drawing
on the analogy with physical density (mass / volume), the distribution function over discrete
values is hence called a probability mass function (or PMF).
To see an example of a PMF, let's imagine that we have a sack that contains 8 colored
cubes: 2 orange cubes, a blue one, 3 reds, and the remaining 2 are black. By considering the
idea of fractions as probability like we did earlier, we can easily say that if we reached into
the sack and grabbed one of these cubes:

• The probability that this cube is orange equals 2/8 = 0.25,


• The probability that it's blue is 1/8 = 0.125,
• The probability that it's red equals 3/8 = 0.375, and
• The probability that it's one of the remaining blacks is 2/8 = 0.25.


Figure 3.9: The random experiment of picking a cube out of a sack of multicolored cubes.

We can encode these outcomes into a representative random variable V by assigning each of
them a numerical value, so that we say: V = 1 if the cube is orange, 2 when it's blue, 3 in case
of a red one, and finally 4 for a black cube. Now, with this information, we can easily define a
PMF that distributes V and hence describes the random process of choosing a cube out of the
sack by saying that:

fV(1) = 0.25,   fV(2) = 0.125,   fV(3) = 0.375,   fV(4) = 0.25

The first thing we notice from such a PMF is that the probability that the variable takes one
of its possible values, which we denote by P(V=v) and which equals fV(v), is always between zero
and one, with zero for impossible outcomes and one for outcomes that are guaranteed to occur
every time we attempt to run the random experiment. The intermediate values represent how
much we believe that the corresponding outcome will happen.
Another thing we notice is that all the probabilities sum to exactly one. This is not
surprising if we think of probability as the ratio of outcomes sub-counts to the total count of all
outcomes, but this fact can be motivated with a more general observation. When we sum all
the probabilities together, we're essentially asking about the probability of the variable V
taking one of the values: 1, 2, 3, or 4. In other words, we're asking about the probability that
the cube we grab will be either orange, blue, red, or black. And no matter how many times we
try to grab a cube, we must get one of these colors, hence this outcome takes a probability
of 1. So the fact that all the probabilities must sum to one, or:

fV(1) + fV(2) + fV(3) + fV(4) = Σv P(V = v) = 1


is representative of the fact that the random variable must take one of its possible values, or
that one of the possible outcomes must occur!
This fact implies that the sum of individual probabilities gives the probability of either
one of those outcomes happening, and this opens the way for another type of question we can ask
out of the PMF. For example, we might be interested in knowing the probability of getting
either an orange or a blue cube, which is equivalently the probability that the random
variable V takes either the value 1 or 2. We can compactly express that probability as P(V≤2),
and this value can be calculated by simply adding P(V=1) and P(V=2) using the implied meaning
of summing individual probabilities we just observed. This tells us that P(V ≤2) = fV(1) + fV(2)
= 0.25 + 0.125 = 0.375, which we can verify by noticing that we're asking about the probability
of grabbing one of three cubes in the sack, which equals 3/8 = 0.375.
This kind of accumulation of probabilities gives rise to an important function of the
distribution that we call the cumulative distribution function (CDF), which we denote with
a capital F subscripted by the variable:

FV(v) = P(V ≤ v) = Σu≤v fV(u)

The importance of the CDF stems from the fact that it can give us the probabilities of the
variable being within a certain range. And even though it's defined for an open-ended range,
we can utilize it to get the probabilities within double-ended ranges as well, and this is done
simply by saying:

P(a < V ≤ b) = FV(b) - FV(a)
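To make these accumulation rules concrete, here's a minimal sketch that writes the cube PMF as a plain Python dictionary and builds the CDF on top of it (the names pmf and cdf are ours, not anything from a library):

# The PMF of the colored-cube variable V
pmf = {1: 0.25, 2: 0.125, 3: 0.375, 4: 0.25}

def cdf(v):
    # P(V <= v): accumulate the PMF over all values up to and including v
    return sum(p for value, p in pmf.items() if value <= v)

print(cdf(2))            # 0.375, the probability of grabbing an orange or blue cube
print(cdf(4))            # 1.0, one of the colors must come up
print(cdf(3) - cdf(1))   # P(1 < V <= 3) = 0.5, a blue or red cube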

EXERCISE 3.1

Can you prove the previous equation?

HINT: Think in terms of the sum of probabilities intuition we developed, what does P(a < V ≤ b) equal? It's
then a matter of manipulating the right hand side to get that value!

All the properties that we have seen with the PMF of the sack of colored cubes actually apply
to each and every probability distribution out there, as they are the essential properties for
any function to be a distribution function. Whether it's a discrete PMF or a continuous PDF, the


same set of properties and rules hold; however, they look a little bit different when working
with PDFs of continuous random variables. For example, to express that all the probabilities of
a continuous random variable sum to one, we say:

∫ fV(v) dv = 1   (with the integral running from -∞ to +∞)
If you do not know what that is, then it'll probably look scary at first glance! However, it turns
out to be pretty easy on the eye once we get to understand what it means. The symbol ∫ is
called an integral, and you can simply think of it as a continuous, smoothed-out
summation symbol!

And it serves exactly the same purpose as the summation symbol, but for those continuous
values that are infinitely packed against each other with no spaces between them. What the
integral does is that it hops across those infinitesimal bin sizes, starting from negative infinity
(the lowest possible value) to positive infinity. On each hop, it calculates the probability of V
having a value within that infinitesimal range via multiplying the probability density value
fV(v) by the infinitesimal size of the bin denoted by dv, and it accumulates these values until
it's done hopping across the continuous values. So that scary formula is semantically identical
to its discrete counterpart! We can even bridge the gap between them a little bit more by
writing the discrete version as:

Σv fV(v) · Δv = 1,   with the bin size Δv = 1

As we noticed before, the smallest bin size we need to reveal the distribution function in the
discrete case is 1, hence dv disappears from the discrete formulas and lingers in the
continuous ones.
Now that we understand the integral, we can simply define a CDF for a continuous
random variable by saying that:

FV(v) = P(V ≤ v) = ∫ fV(u) du   (with the integral running from -∞ up to v)


and the same formula for double-ended ranges holds with continuous variables.
The only real difference between the discrete case and the continuous one is that we can
no longer say that P(V=a) equals fV(a), because fV is now a probability density function, not a
mass function. The only way for us to calculate probabilities with continuous variables is via
the CDF. And actually, if we tried to calculate the value P(V=a) via its continuous CDF, we'd
find that:

P(V = a) = FV(a) - FV(a) = 0

You might now ask: since we can only calculate probabilities of a continuous variable via
its CDF, does that mean we need to know how to perform integral operations in order
to calculate them? The answer is no; all we need to know here is what an integral is
and what it represents, and since we already know that now, we'll let the computers do
their job and compute these values for us.

Is it really impossible for a continuous random variable to take a specific value?


Another thing that you may have noticed so far is that the probability of a continuous random variable V taking a
specific value a is always zero! This means that it is impossible for a continuous random variable to assume a
particular value in its domain!
This seemingly odd result can be justified by the fact that to calculate P(V=a), we need to multiply the density
function fV(a) by the size of the bin that contains only a and nothing else but a; if we recall our earlier discussion
about continuous values, no matter how small a range we have, we're sure that it contains an infinite amount
of values within it. The only way for a bin to hold just the value of a and nothing else is for that bin to have no
size at all – a zero-sized bin. And once we multiply fV(a) by 0, we simply get a 0 for P(V=a).
Even with this mathematical justification, it still seems counter-intuitive! One could say, for example, that if
the BMI value in our data set is an instance of a continuous random variable, then how is it impossible for it to
assume a specific value while we see in the dataset that it actually assumes specific values like 18.2?
We can alleviate that apparent conflict by taking a deeper look at how we obtained such a specific value
as 18.2 in our dataset. As we noted before, calculating the BMI requires both the height and the weight of the
individual to be measured. Let's focus on measuring the weight. Let's imagine that at first, when the member
steps on the scale, it records a weight of 82.1 kg; however, we notice that our scale's precision is only in the order of
0.1kg, so we throw that out and get a more precise scale. The new scale's precision is in the order of 0.001kg,
so our member now weighs 82.1253kg. We throw that one out as well and contact the particle physics
laboratory at CERN to deliver us a scale with a precision in the order of 10^-34, and now the candidate weighs 82
kilograms followed by some 34 digits after the decimal point. No matter how precise our scale gets, we'll
always leave something unmeasured.


The fact that a continuous variable took a very specific value in our data set is not evidence that the underlying
phenomenon took exactly this particular value; it's just a limited-precision measurement device collapsing it to this
specific value. There's no way we can pinpoint its value exactly, as our measurement devices will always lack precision against a
continuous phenomenon. If a phenomenon is truly continuous, then taking any one specific value among its infinitely many
possible values, that is a chance of 1/∞, happens with zero probability.

Leaving the distribution of that toy sack of colored cubes behind us now, there are a lot of
famous, well-defined, and well-studied probability distributions that are used to represent
many real-world phenomena, both discrete and continuous. A famous example of a discrete
probability distribution, and one of the simplest distributions that we'll be working with a lot
throughout the book, is the Bernoulli distribution.
The Bernoulli distribution is named after the Swiss mathematician Jacob Bernoulli and
it's used to describe a random variable that can take one of two values: 1 with a probability
p and 0 with a probability 1 - p. If a random variable V is distributed by a Bernoulli
distribution, then it has the following PMF:

fV(v; p) = p^v (1 - p)^(1 - v)   for v in {0, 1}

or, equivalently:

fV(1; p) = p,   fV(0; p) = 1 - p

Take a moment with your paper and pencil to understand what these formulas mean and how
the two are equivalent. The semicolon in the function's definition is read as “parameterized by”
and what comes after it is considered a parameter of the distribution; so in this case the
Bernoulli distribution is parameterized by a fraction p, which indicates the probability of its
random variable taking the value 1.

Figure 3.10: How the PMF of Bernoulli distribution looks.


The Bernoulli distribution is used, for example, to describe a situation like the flip of a coin
where the random variable takes the value 1 if the flipped coin turns out to be a head and 0 if
it turns out a tail. A more related example to our case here is the value of the Class variable
Y, which can be described through a Bernoulli distribution with p being the probability of a
population member being diabetic. Generally, we express the statement that “a random
variable V is distributed via a Bernoulli distribution parametrized by a parameter p” in the
following mathematical form:

V ~ Bernoulli(p)

which is read exactly as the earlier sentence between the quotes, with the tilde "~" as a short
math symbol for "is distributed via". This kind of mathematical notation,
VariableName ~ DistributionName(DistributionParameters), will hold with us for all random
variables and their possible distributions.
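As a small illustration (a sketch using scipy.stats, which we don't otherwise need in this chapter, and an arbitrary choice of p = 0.35), we can build a Bernoulli distribution, query its PMF, and simulate the random experiment by sampling from it:

from scipy.stats import bernoulli

V = bernoulli(p=0.35)                    # V ~ Bernoulli(0.35)
print(V.pmf(1), V.pmf(0))                # 0.35 and 0.65, i.e. p and 1 - p
print(V.rvs(size=10, random_state=42))   # ten simulated outcomes, a mix of 0s and 1s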
On the continuous side, the most famous distribution of them all is what we call the normal
distribution, or equivalently the Gaussian distribution, after the mathematician who introduced
it, Carl Friedrich Gauss. The two names are usually used interchangeably; however, we need
to be careful when we call it a "normal" distribution not to interpret the term as meaning "well-
behaved," thus rendering other distributions "abnormal" or "ill-behaved." The term "normal"
actually comes from a technical mathematical detail in the way Gauss introduced the
distribution; however, we're allowed to think of the term "normal" as meaning "conventional" or
"common," and that's because the distribution pops up everywhere in both practice and theory,
and what the distribution describes is pretty understandable and … normal (again, in no
morally superior tone to other distributions).
The normal distribution is parameterized by two values μ and σ², which represent the center
and the spread of the values in the distribution, as we're going to see from the definition of its
PDF:

fX(x; μ, σ²) = (1 / √(2πσ²)) · exp( -(x - μ)² / (2σ²) )

Where exp(x) is just the exponential function e^x. The idea behind this formula is pretty neat,
and we can see it by following the evaluation of the PDF for some value of x and seeing what is
actually being calculated here:

• Once we plug an x into the formula, the first expression to be evaluated is (x - μ)²,
which is the square of how far the value of x is from the value of μ.
• Then we divide (x - μ)² by σ², which is the square of how many σ's x is away from μ.
• This value is then used as an exponent after multiplying it by -0.5. If you remember
your algebra, we can say that:

exp( -0.5 · (x - μ)²/σ² ) = 1 / exp( 0.5 · (x - μ)²/σ² )


So as the number of sigmas x is away from the value of μ increases, the value of the
exponential function in the PDF's formula decreases, and consequently the probability density
of that value x decreases. So what the normal PDF is actually saying is that the random
variable X distributed by it has a higher probability of taking a value near μ, and that probability
decreases as the value gets further away (in a measure of σ) from μ, as we can see in Figure
3.11(a).

Figure 3.11: (a) Values far from μ in terms of σ have lower probability density values. (b) How the value of σ
controls the shape of the normal distribution.

The same observation can be seen in the plot of the PDF function, which has the form of a
bell curve, where we can see μ acting as a central value, and how large σ is determines how
rapidly the chances of values far from μ decrease, by making the curve narrower or wider, as
shown in Figure 3.11(b).
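To get a feel for how μ and σ shape the bell curve, here's a small sketch using scipy.stats.norm; the value μ = 32 and the three σ's are arbitrary choices for illustration, picked to sit roughly in the BMI range.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(0, 60, 200)
for sigma in (4, 7, 12):
    # loc is the mean mu, scale is the standard deviation sigma
    plt.plot(x, norm(loc=32, scale=sigma).pdf(x), label="sigma = {}".format(sigma))
plt.legend()
plt.xlabel("BMI")
plt.ylabel("Probability Density")
plt.show()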
We can see striking similarities between the normal distribution and the way our
BMI values are distributed in their histogram. Moreover, we can see how the normal PDF's
curve is so close to the density function we estimated from the BMI data earlier! The same
similarities can be seen between the PMF of a Bernoulli distribution and the histogram of our
Class data, and between the normal distribution and the blood glucose histogram. These striking
similarities suggest that the real-world populations from which the data samples were
drawn are actually distributed by these probability distributions.
From what these similarities suggest, we have a good reason to model these data using
their similar probability distributions. So we say that:

X1 ~ N(μ1, σ1²),    X2 ~ N(μ2, σ2²),    Y ~ Bernoulli(pY)


Where N is a shorthand notation for the normal distribution, and X2 is the random variable for
the BG measurements.
Now that we modeled our data using probability distribution classes, we need to make
these distributions concrete by trying to estimate the values of their parameters (those that
define their behavior) from the data itself; this can be done by exploiting a surprising
correspondence between these parameters and the descriptive statistics we saw back in
Chapter 2.

3.1.3 Expectation, Variance, and Estimations


Let's go back to how we calculated the sample mean back in Chapter 2. We defined the
sample mean as the sum of all our data values divided by the size of the sample:

x̄ = (x1 + x2 + … + xm) / m
Let's take a deeper look into how this equation works by doing a simple numerical example.
Assume we have some sample of numerical values, {1, 4, 5, 7, 5, 8, 1, 9, 4, 8, 8, 6, 9, 1, 7}. We can
calculate their mean by simply summing all these values and dividing the sum by the sample size,
but let's take a closer look at that operation: instead of just summing all the values, we can
group each unique value and just multiply it by its count in the data:

x̄ = (3·1 + 2·4 + 2·5 + 1·6 + 2·7 + 3·8 + 2·9) / 15
And now, instead of evaluating all the operations in the numerator and then dividing by 15,
we distribute the 15 in the denominator over each term in the numerator, which results in:

x̄ = 1·(3/15) + 4·(2/15) + 5·(2/15) + 6·(1/15) + 7·(2/15) + 8·(3/15) + 9·(2/15)
These fractions are the result of dividing each value's count over the total count of the data, which
we interpreted before as the probability of this value. So if the mean of the sample can be
expressed as the sum of the unique values in the sample times their sample probabilities, it's plausible
to think that the mean of the whole population would be the sum of the unique population values


times their population probabilities. That is exactly the idea behind the expected value (or the
expectation) of a random variable:

E[X] = Σx∈DX x · fX(x)
Where DX is the domain of the random variable X; in other words, all the distinct values that it
could possibly take. It's called an expected value or expectation because it weights all the
possible outcomes with their probabilities, so the end result would be the value that we expect
any outcome to be around. If you don't remember ∈, it means "in"; the summation reads as
"the sum, over all x in DX, of x times fX(x)". It's also similar to Python's in when you implement that
equation:

E = 0
for x in Dx:              # Dx holds all the possible values of X
    E += x * f(x)         # f(x) is the probability of the value x

The same inspection we did on the sample mean formula can be applied to the sample variance
equation, which would lead us to an equation of the distribution variance that looks like
this:

Var(X) = Σx∈DX (x - E[X])² · fX(x)
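Mirroring the little expectation loop above (with the same assumed names Dx, f, and the E computed there), the variance accumulates the squared deviations from the expectation instead:

Var = 0
for x in Dx:
    Var += (x - E) ** 2 * f(x)   # E is the expectation computed in the previous loop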

By the fact that we extrapolated the definitions of the distribution's mean and variance
from the sample's mean and variance, we have every reason to think that the sample
versions of these measures are good estimates of their distribution counterparts, and indeed
they are. These sample measures estimate the distribution that most likely generated the
sample data; that's why we call them maximum likelihood estimators.
While all this is very informative, we still haven't solved the issue we started this section with:
how do we estimate the parameters of the normal and Bernoulli distributions we modeled our
data upon? The answer to this question appears when we complete the circle by connecting
the distribution's expectation and variance with these parameters. If we were to apply the
expectation formula to our Y's Bernoulli distribution, we'd find that:

E[Y] = 1 · pY + 0 · (1 - pY) = pY

Which means that the expected value of the Bernoulli distribution is the same as its
parameter, and if we can estimate the expected value with the sample mean of Y, then the


sample mean of Y also estimates the distribution's parameter. A similar situation arises when
we apply the expectation and variance on the normal distribution and find that:

E[X] = μ,    Var(X) = σ²

Which means the sample means and variances of BMI and BG estimate the parameters of
their corresponding normal distributions. Verifying these results for the normal distribution
requires doing integrals; give it a try if you can, otherwise you can trust us. And voila!
We completed the circle and now we have a way to estimate the parameters of our model
distributions: through the sample's descriptive statistics.
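Putting that into code, a minimal sketch of these maximum likelihood estimates from our sample (assuming the clean_data_of_interest DataFrame) looks like this:

p_hat = clean_data_of_interest.loc[:, "Class"].mean()       # estimates the Bernoulli parameter p_Y
mu_hat = clean_data_of_interest.loc[:, "BMI"].mean()        # estimates the normal mu for the BMI
sigma2_hat = clean_data_of_interest.loc[:, "BMI"].var()     # estimates the normal sigma^2 for the BMI

print(p_hat, mu_hat, sigma2_hat)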

Figure 3.12: Because the sample mean estimates the expectation of a Bernoulli distribution, and that expectation is
the same as the Bernoulli distribution's parameter pY, the sample mean estimates the Bernoulli's
parameter pY. The same holds for the normal distribution's mean and variance.

Expectation and Variance for Continuous Variables


The formulas we defined in this section only work for discrete probability distributions, which is evident by the
presence of the sum in them. However, the formulas for continuous distributions are exactly the same with only
two differences: replacing the sum with an integral and multiplying the PDF value with the infinitesimal size to
get the probability itself. This makes the continuous expectation:

E[X] = ∫ x · fX(x) dx   (integrating over all possible values of x)

and the continuous variance:

Var(X) = ∫ (x - E[X])² · fX(x) dx
EXERCISE 3.2

With the definition of both expectation and variance, convince yourself that the following
equation is correct and ponder upon what it verbally means.

The Symbols μ and σ²


We saw earlier how the two parameters μ and σ² control the behavior of the normal distribution by providing a
center point μ from which the probability of finding a value decreases given the probability spread. Moreover,
we saw that these two parameters correspond to the mean and variance of the data. Due to both these facts,
the symbols μ and σ² are used to denote the mean (or the expectation) and the variance in general, regardless of the
distribution we're working with, which doesn't need to be a normal one. They give us a shorter notation than writing
E[X] and Var(X) each and every time. Though we didn't use that notation so far, you'll see it slowly creeping
up into our formulas from now on.

Up to now, we were able to model the data itself using probability distributions, and now we
have a way to estimate their parameters from the data we have. Still, we need a model that
links the value of the Class to the values of BMI and BG or, in other words, relates the value of
Y to the values of X1 and X2, and this is something we cannot do with these individual
probability models; we need the power of conditional probability to achieve that.

3.2 Conditional Probability


Remember the sack of colored cubes we had earlier? Now imagine that we have a smaller one
with only three cubes in it, one blue and two reds. We're going to draw one cube out of the
sack at random, one at a time, until the sack is empty. Assuming that the second cube we drew
was a red, what are the chances that the third one would also be red?


To answer this question, we first need to understand all the possible color orderings that could
occur when we draw the cubes one by one. This can be done with the tree diagram shown in
Figure 3.13.

Figure 3.13: The tree diagram for the colored cubes sack experiment.

The first column in the diagram (from left) represents the possible outcomes in the first draw,
which are either the blue cube (denoted with B) or one of the red cubes (denoted with R). The
second column represents the second draw with each two possible outcomes after the first
one. The third column lists the possible remaining cube in the last draw. By following the tree
paths from the first to the third column, we can enumerate all the possible outcomes (which
we call, the sample space) of the three draws together, and from this enumeration we can
start answering our question. It's called the sample space because any sample we might get
from that population, will have to be one of these outcomes; so it's the space of all possible
samples
We want to calculate the probability of having a red cube in the last draw after observing
the other red one in the second draw. If we look at the sample space, our eyes will quickly
catch the two outcomes BRR and BRR at the beginning of the set. These two outcomes are
exactly what we're looking for, where the two red cubes appear in the second and third draws,
so we might be tempted to say that the probability we're looking for is simply the number of
these outcomes divided over the number of all possible outcomes, or:

P(C3 = R given that C2 = R) = 2/6

Where C2 and C3 are the random variables we assigned to the outcomes of the second and third
draws (given that each draw is in fact a random experiment). A closer look, however, would


reveal that this answer is wrong, and the key to revealing that lies in the expression "given that
C2 = R". When we say "given that C2 = R", we mean that we know for sure that the second
cube was a red one, we have seen that with our own eyes, and it's not possible that the
second cube was a blue one. However, in our calculation above, we counted all 6 possible
outcomes, including the two outcomes that had a blue cube in the second draw, namely RBR
and RBR. This contradicts what "given that C2 = R" means, so it makes sense to discard these
two outcomes from the possible outcomes count and say that:

P(C3 = R given that C2 = R) = 2/4 = 0.5

or more compactly:

P(C3 = R | C2 = R) = 2/4 = 0.5
The fact that we know that C2 = R acts like a condition on our sample space, and this condition
filters out some outcomes from the calculation. This is why the vertical bar is read as “given
that” or “conditioned upon”, and that whole probability is called conditional probability. This
makes the probability we calculated at first to be a joint probability, the probability of both
outcomes happening together, without knowing that any of them has already occurred. We
simply denote joint probability with a comma:

EXERCISE 3.3

Using our PIMA Indian diabetes data, try to estimate the conditional probability P(BG> 140 |
Class = 1).

This approach we used to calculate the conditional probability depends on the fact that we
know the sample space, the number of outcomes in it, and the number of outcomes relevant
to our question. But what if we don't have any numbers? What if we have a probability
distribution and we know just the probabilities of the outcomes, how can we calculate
conditional probabilities? We can lose the dependency on outcome counts with a simple
trick; we divide the numerator and the denominator of the conditional probability formula over
the number of all outcomes:

P(C3 = R | C2 = R) = 2/4 = (2/6) / (4/6)


By dividing over the number of all outcomes, we were able to replace the numerator with the
joint probability of the two outcomes. By looking at the denominator, we can see that this
represents the probability P(C2 = R); we have four outcomes that have red cubes in the second
draw: BRR, BRR, RRB, RRB, which means that P(C2 = R) = 4 / 6. Hence:

P(C3 = R | C2 = R) = P(C3 = R, C2 = R) / P(C2 = R) = (2/6) / (4/6) = 2/4 = 0.5
This means that in general, we can calculate the conditional probability by dividing the joint
probability of the outcomes by the probability of the conditioning outcome. So for any two
outcomes or events A and B, we can say that:

P(A | B) = P(A, B) / P(B)
With the definition of conditional probability, we now have the tool to model the relation
between the Class Y and the features X1 and X2. If our goal is to predict whether the patient is
diabetic or not given their BG and BMI, then we can treat BG and BMI as the conditions for the
probability of Y. Hence, our relation can be modeled as:

P(Y | X1, X2) = P(Y, X1, X2) / P(X1, X2)
But another problem occurs here: how can we model the joint probability P(Y, X1, X2)?

3.2.1 The Bayes Rule


Modeling this triple joint distribution is difficult. Joint distributions are generally difficult to put
into a specific form, and the problem gets even more difficult when the variables involved are of
different types, like we have here (one discrete and two continuous). We can get out of one of
these problems with a little trick using the conditional probability formula. Let's look at what
happens when we flip the A and B on the conditional side:

P(B | A) = P(B, A) / P(A)
Because P(B, A) is the same as P(A, B), as the probability of B and A happening is the same as
the probability of A and B happening, we can say from the above equation that:

P(A, B) = P(B | A) · P(A)

By plugging in this factorization of the joint probability into the original P(A|B) formula, we
end up with one of the most famous formulas in probability theory, the Bayes Rule:

P(A | B) = P(B | A) · P(A) / P(B)
Using the Bayes Rule, we can re-express our conditional model for our traveling clinic problem
as:

P(Y | X1, X2) = P(X1, X2 | Y) · P(Y) / P(X1, X2)

We got rid of the problem of having a discrete variable along with continuous variables in the
joint probability, but we still need a way to model the joint probability of X1 and X2. We can get
around this requirement by making a small assumption that may sound naive, but it will result
in a much more accurate model than the one we have now.

3.2.2 Independent Random Variables


Conditional probability captures the dependency between random variables; as we have seen
in the sack of cubes example, the outcome of the second draw depends on the outcome of the first,
and the third depends on the second. But not all random variables are dependent. For
example, consider flipping two fair coins one after another. Clearly, the outcome of each coin
flip is independent of the other; there's no relation between the two separate coins. But how
does this independence reflect in the math?


Figure 3.14: The tree diagram of the two coins experiment.

Let's look at the tree diagram of these two experiments in Figure 3.14 and see for ourselves.
The first question we can ask is about the conditional probability of having heads (H) in the
second flip given that we had tails (T) in the first one. Using the ways we learned to
calculate conditional probabilities, we can find that:

P(B = H | A = T) = P(B = H, A = T) / P(A = T) = (1/4) / (1/2) = 0.5
Which is surprisingly the same as the probability P(B = H), which is 0.5 for a fair coin. It seems
that the conditioning event A = T didn't have any effect on the value of the probability, and this
makes sense given that the two events are independent. This rule actually generalizes for any
two independent events A and B:

P(A | B) = P(A)
This rule allows us to simplify the joint probability equations even more if the two events are
independent, by saying that:

P(A, B) = P(A | B) · P(B) = P(A) · P(B)
This means that the joint probability of independent events is simply the product of their
individual probabilities. We can see that in action with the coin experiment, by calculating the
probability of having the two coins at tails:

P(A = T, B = T) = P(A = T) · P(B = T) = 0.5 × 0.5 = 0.25
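We can also check this product rule numerically by simulating many pairs of independent fair-coin flips (a small sketch using NumPy's random generator):

import numpy as np

rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=(100_000, 2))   # 0 = tails, 1 = heads; two independent fair coins
both_tails = (flips == 0).all(axis=1).mean()    # fraction of trials where both coins land tails
print(both_tails)                               # should be close to 0.5 * 0.5 = 0.25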


With these interesting properties of independent variables, we can simplify our model by
assuming that the values of BG and BMI are independent, which makes P(X1, X2) = P(X1)P(X2)
and, applying the same independence assumption within each class, P(X1, X2 | Y) = P(X1 | Y)P(X2 | Y);
hence our model becomes:

P(Y | X1, X2) = P(X1 | Y) · P(X2 | Y) · P(Y) / ( P(X1) · P(X2) )
At first glance, BG and BMI might seem truly independent; the first is a property of
the blood and the other is a property of weight and height. But this is usually a naive way to
look at things because deep down there might be a relation between how the body looks and
how the blood looks – they're parts of the same human being after all. Usually, when we talk
about real-life phenomena, it's hardly the case that they are independent, which makes this
assumption of independence a naive one. This is the reason why this family of models is called
naive Bayes models.

3.3 Applying the Naive Bayes Model with scikit-learn


Out of this naive Bayes formula we have above, we already know how to calculate P(Y), P(X1),
and P(X2). We're still missing how to model P(X1|Y) and P(X2|Y). These, however, are fairly
simple to model, as we can choose the right model by looking at the data we have and how
they look.
To know how these distributions look, we need to isolate the features given each value of
the Class Y. To do that, we start by creating a Boolean mask that can extract the features
for the diabetic cases only.
diabetics_mask = clean_data_of_interest.loc[:, "Class"] == 1
diabetics = clean_data_of_interest[diabetics_mask]
non_diabetics = clean_data_of_interest[~diabetics_mask]

The histograms for the diabetics data would approximate the distribution of X|Y=1, and those
for the non_diabetics data would approximate the distribution of X|Y=0. We can plot these
histograms for the BMI, and for better visualization, we can plot the density function estimate as well to get a clearer view of how the distribution would look. Plotting the density estimate can
be easily done by calling the density() method on a DataFrame's plot object.
diabetics.loc[:, "BMI"].plot.hist(color='red', density=True)
diabetics.loc[:, "BMI"].plot.density(color='black')


Figure 3.15: The approximate distribution of BMI values given Class = 1.

non_diabetics.loc[:, "BMI"].plot.hist(color='green', density=True)
non_diabetics.loc[:, "BMI"].plot.density(color='black')

Figure 3.16: The approximate distribution of BMI given Class = 0.


From these plots, it seems that the normal distribution is a good modeling choice for the conditional distribution of X1|Y. However, because the distribution's shape changes with the value of Y, this suggests that the parameters of the conditional distribution of X1|Y differ with the value of Y as well. We can express that by saying:
X1 | Y = 0 ~ Normal(μ10, σ²10),    X1 | Y = 1 ~ Normal(μ11, σ²11)
Where μ10 and μ11 are the means of the BMI within the non-diabetic and diabetic groups respectively, and σ²10 and σ²11 are the variances of the BMI within each group. The same modeling decision can be made for the Blood Glucose conditional distribution X2|Y.
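To make these parameters concrete, here is a minimal sketch (diabetics and non_diabetics are the frames we built above; the exact numbers depend on your cleaned dataset) that estimates the class-conditional means and variances directly with pandas. scikit-learn's GaussianNB, which we use next, performs essentially this estimation internally, up to small details.
# Estimate the parameters of X1|Y (BMI given Class) from each group separately.
mu_10, var_10 = non_diabetics.loc[:, "BMI"].mean(), non_diabetics.loc[:, "BMI"].var()
mu_11, var_11 = diabetics.loc[:, "BMI"].mean(), diabetics.loc[:, "BMI"].var()

print("Non-diabetic BMI: mean = {:.2f}, variance = {:.2f}".format(mu_10, var_10))
print("Diabetic BMI:     mean = {:.2f}, variance = {:.2f}".format(mu_11, var_11))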

EXERCISE 3.4

Verify that the Blood Glucose conditional distribution X2|Y can be modeled with a normal
distribution.
Now that we have all our modeling details, we're all set to apply the naive Bayes formula.
The parameters of the models can be estimated directly from the data, and the parameters of
the conditional distributions X1|Y and X2|Y can be estimated from the diabetics and
non_diabetics data. But instead of doing all the work ourselves, we'll rely on scikit-learn's GaussianNB model to do all the work for us. GaussianNB can be found under the naive_bayes module, and it implements exactly the kind of Gaussian naive Bayes model we have here for our problem.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X = clean_data_of_interest.loc[:, ["Blood Glucose", "BMI"]]
y = clean_data_of_interest.loc[:, "Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

classifier = GaussianNB()
classifier.fit(X_train, y_train)

accuracy = classifier.score(X_test, y_test)
print("Prediction Accuracy: {:.2f}%".format(accuracy * 100))

> Prediction Accuracy: 79.79%

EXERCISE 3.5

Notice that the data here has changed from when we tried out the perceptron model in
Chapter 1 due to the cleaning of missing values that happened in Chapter 2. Run the
perceptron model again as well as the dummy classifier to make the results comparable to
what we've got from the naive Bayes model.


The naive Bayes model, despite being naive, was able to score almost 80% accuracy! That's about a 40% increase over the perceptron model and around an 18% increase over the dummy classifier that always returns the most frequent class. Now we have a decent solution for our traveling diabetes clinic to start with.

The Meaning of the Bayes Rule


There's a beautiful meaning behind how the Bayes rule works, and it can be revealed by distinguishing three
different parts of the formula, which we'll call the prior, the likelihood, and the posterior.

The prior, which corresponds to the unconditional distribution of the phenomenon we're trying to predict,
represents our initial belief about how this phenomenon behaves. For example, in our problem, our prior is our initial belief about how likely people are to be diabetic. For now, we know that about 30% of the population is diabetic,
without knowing any factors related to that.
After establishing our prior belief, we start to observe some data related to the phenomenon we're studying.
Like in our problem, the BMI and BG measurements are those related data. Given our prior belief about the
phenomenon, we start measuring the likelihood of the data we observed, which in our example is how likely
people with or without diabetes can have specific values of BG and BMI.
By studying the data we observed and how likely they are given our initial belief, we can update our prior
belief to reflect what we know after observing the data. In our example, after observing the BMI and BG, we
updated our understanding of the diabetes phenomenon to know that these are factors that contribute to the
condition, and this is what the posterior distribution captures.
Basically, the Bayes rule describes the thinking process that starts with some initial understanding or belief
about something, then observing the results of our experiments against that belief, in order to get updated
knowledge.
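To see the prior-likelihood-posterior mechanics in isolation, here's a tiny sketch with made-up numbers (they are illustrative only and are not computed from the diabetes dataset): we start from a 30% prior of being diabetic, assume how likely a particular (BG, BMI) observation is under each class, and let the Bayes rule update the belief.
prior_diabetic = 0.3                    # initial belief: P(Y = 1)
prior_healthy = 0.7                     # P(Y = 0)

likelihood_diabetic = 0.09              # assumed P(observed BG, BMI | Y = 1)
likelihood_healthy = 0.02               # assumed P(observed BG, BMI | Y = 0)

evidence = (likelihood_diabetic * prior_diabetic
            + likelihood_healthy * prior_healthy)   # P(observed BG, BMI)

posterior_diabetic = likelihood_diabetic * prior_diabetic / evidence
print("P(diabetic | observation) = {:.2f}".format(posterior_diabetic))   # about 0.66

The posterior (about 0.66 here) is exactly the updated belief described above: the prior, reshaped by how well each class explains the observation.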

EXERCISE 3.6

Try to implement the naive Bayes model on the traveling diabetes clinic problem yourself. It'll strengthen your understanding of how the different parts of the model work and how the estimation is done, and it'll also develop an appreciation for the work done behind scikit-learn and its supporting libraries. To help you in that implementation, note that the Bayes rule doesn't only apply to probabilities; it applies to PDFs and PMFs as well, so you can write the model for our problem as:
P(Y = y | X1 = x1, X2 = x2) = f(x1 | Y = y) f(x2 | Y = y) P(Y = y) / (f(x1) f(x2)), where f denotes the corresponding PDF

4
Setting the Stage

“The formulation of the problem is often more essential than its solution, which may be
merely a matter of mathematical or experimental skill”

Albert Einstein

In the past three chapters, we took a quick journey through machine learning and its
foundational mathematics by working through an example of a real-life problem. We saw in
Chapter 1 how we proposed a model for the traveling diabetes clinic problem by assuming that
the relation we're trying to learn has the form of a linear function. Afterward, in Chapters 2
and 3, we saw that by studying the data using statistics and probability theory we were able to get much more accurate results by modeling all the probability distributions directly. However, it's not always the case that modeling probability distributions is better than using a function as a model; both modeling approaches are widely adopted in the field, as one of them can work on some problems while the other can't.
This is known as the No Free Lunch Theorem in machine learning, which states that no one model will work best on every problem we can encounter. This is why a large number of machine learning models and algorithms exist out there, and we'll spend our time for the rest of the book studying some of these different models. In order to do so, we set the stage in this chapter by understanding how all these different models stem from the two modeling approaches we've already seen, and then we look at the different types of machine learning problems that we can encounter.

4.1 Generative and Discriminative Models


As we have seen from the past two chapters with the diabetes data, there were probability
distributions that represented the data, and there was a probability distribution that described
the relation between the label we're trying to predict and the features we have. This applies to


any problem or any data we have; there will always be some underlying probability
distributions describing the data and the relationships between them, and our goal in learning
the relationship between a label y and some features x would be to learn about the conditional
distribution P(Y|X). How we learn about that conditional distribution is what differentiates the
modeling approaches we saw.

4.1.1 Generative Models


The approach we used back in Chapter 3, where we calculated P(Y|X) via the Bayes rule, constitutes a class of models we call generative models. We can see the rationale behind such naming from the formula of the Bayes rule itself:
P(Y | X) = P(X | Y) P(Y) / P(X)
Calculating P(Y|X) via the Bayes rule starts by computing the product P(X|Y)P(Y), which, if we recall from Chapter 3, is just the joint distribution P(X, Y). So in this approach, we solve the learning problem by first learning about the joint probability distribution, from which we can sample many other points that look like the data we have in the dataset, as they come from the same estimated distribution. This means that what we learn is capable of generating more data like what we already have. That's why the class of models that follows this approach is called generative models.
On the other hand, there are the discriminative models. By virtue of the name, discriminative models attempt to learn how to discriminate between one data point and another, and they do that by learning how to assign labels to the given features. Discriminative models work by directly assuming a specific form of distribution for P(Y|X), without going through the whole process of learning P(X, Y) that occurs in generative models. The model we used in Chapter 1, where we modeled the two different values of Y as separated by a linear function, the perceptron model, is considered a discriminative model.
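As a rough code-level illustration (a sketch on synthetic data, not the book's diabetes pipeline), scikit-learn's GaussianNB is a generative model while Perceptron is a discriminative one. Both expose the same fit/predict interface, but only the generative one gives us the pieces of P(X, Y), which we can use to generate new feature vectors; the attributes theta_ and var_ hold the per-class means and variances in recent scikit-learn releases (older releases call the latter sigma_).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=42)

generative = GaussianNB().fit(X, y)        # learns P(X|Y) and P(Y)
discriminative = Perceptron().fit(X, y)    # learns a decision boundary directly

# Sample five new points that "look like" class 1, using the learned
# class-conditional Gaussians (this is the generative part in action).
rng = np.random.default_rng(0)
fake_class_1 = rng.normal(loc=generative.theta_[1],
                          scale=np.sqrt(generative.var_[1]),
                          size=(5, 2))
print(fake_class_1)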


Figure 4.1: Generative vs. discriminative models.

4.1.2 Discriminative Models and the Target Function


We said that discriminative models work by making assumptions about the conditional
distribution of Y|X. We also said that the perceptron model is a discriminative model.
However, we didn't mention anything about probability distributions or the conditional
probability P(Y|X) back when we used it in chapter 1. This might seem strange at first but it
will become clear when we learn that the modeling decision in discriminative models is rarely
about the distribution itself – it's more about its parameters.
When it comes to deciding which distribution to use, the choices are easy: if our label is
categorical and has discrete values, we model it with a Bernoulli distribution or some
generalization of it; otherwise, if the label takes real continuous values, we model it with a
normal distribution. The real modeling magic happens at choosing how to represent the
parameters of these distributions, specifically their expectation E[Y|X]. Let's go back to the
perceptron model we used in Chapter 1 and see how that works.
First, by choosing the Bernoulli distribution to model P(Y|X) for the diabetes data, we need
to remember that its expectation E[Y|X] equals the probability that the person will be diabetic.
Now let's look at how we defined our perceptron model:


From this definition, any person that lies above the line we learned was considered a diabetic person, so our function outputs 1. Otherwise, if the person was below the line, we consider them non-diabetic and output a 0. These 1s and 0s that our function outputs can be considered probability values, where 1 means that the person is 100% diabetic and 0 means that there's no way they're diabetic. This fits nicely with the Bernoulli distribution we assumed for Y|X, as we can write it as:
Y | X1 = x1, X2 = x2 ~ Bernoulli(p = f(x1, x2))
Which means that the distribution of Y, given that X has values x1 and x2 for BMI and BG, is a Bernoulli distribution whose p parameter is defined by the value of the linear function. This is how discriminative models usually work: not mainly by modeling the distribution, but by modeling how its E[Y|X] would look. Because the E[Y|X] we're trying to model is essentially a function, we usually call it the target function, as in that's what we're targeting to approximate. The modeling assumption we make, on the other hand, or the form we assume the target function takes, is called the hypothesis function, because we hypothesize that the target function takes such a form.
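A tiny sketch can make the idea concrete; the weights below are made up for illustration (they are not the parameters learned in Chapter 1), but they show how a perceptron-style hypothesis function plays the role of the Bernoulli parameter p = E[Y|X].
def hypothesis(bmi, bg, w1=0.05, w2=0.03, b=-6.0):   # hypothetical, made-up weights
    score = w1 * bmi + w2 * bg + b
    return 1.0 if score > 0 else 0.0                 # used as the Bernoulli parameter p

p = hypothesis(bmi=35.0, bg=160.0)
print("Modeled P(Y = 1 | x) =", p)    # prints 1.0 for this point: "surely diabetic"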

Another Road to the Target Function


We verbally motivated the usage of the target function in discriminative models by looking at how the perceptron model was actually the parameter for the Bernoulli distribution modeling Y|X. However, there's a more general way to motivate the concept of the target function being the expectation E[Y|X], and it starts with a simple manipulation of the most obvious fact that Y = Y.
What we do is add and subtract E[Y|X] on the right-hand side of Y = Y; this doesn't change the equation in any way because E[Y|X] - E[Y|X] is just 0, but it allows us to decompose the value of Y into meaningful parts:
Y = E[Y|X] + (Y - E[Y|X])
Given this decomposition, let's think about how the random variable Y varies and takes different values. The first term E[Y|X] varies as the value of X varies, so any variation in E[Y|X] can be explained by the value of X, which in turn allows us to say that this term accounts for the variation in Y that can be explained by the variation in X. This leaves the other term to capture the remaining variation of Y that X has nothing to do with.


But where does that unexplained portion come from? We can answer that question by understanding that any real-life phenomenon is complex; there are a lot of variables at play and many things can affect how something works. For example, a diabetic person may have a genetic predisposition for the disease, and this factor is not captured by the BMI and BG features we used. These are the sources of that unexplained term, the other factors that we don't know about. So the second term, as far as we're concerned with the relation between Y and X, is nothing but noise that we cannot reduce:
Y = E[Y|X] + noise
So by trying to learn a target function that corresponds to the conditional expectation E[Y|X], we're actually
learning all that we can about the relation between X and Y.

4.1.3 Which Is Better?


Discriminative models tend to be easier to use and implement than generative models;
generative models tend to be more mathematically involved, which means they need careful
handling and extra computation power. On the other hand, by the virtue of learning whole
joint probability distributions, generative models tend to understand better how to handle the
noise in the data as well as missing values in the features, which is something that
discriminative models can't do on their own. Moreover, discriminative models, if not handled
correctly, can mess up the learning and confuse the noise in the data as part of X's
contributions, which will lead them to fail miserably on new data.
So it's obvious that neither is always better; one will be more beneficial than the other in different contexts. This conforms with the no-free-lunch theorem in machine learning we talked about earlier. It also makes sense given the diverse set of machine learning models out there that are built on both approaches. Each different ML model makes either a different assumption about how the target function looks, or a different assumption about the nature of the probability distributions that control the data. This fact will be the fundamental base on which we'll continue exploring the different machine learning models in this book: each subsequent part/chapter of the book will start with some assumption about the target function or the probability distributions, and from that assumption we're going to see how different machine learning models arise for different problems. But before we can start doing that, we need to first understand the different problems we might face. It's impossible to count all the machine learning problems out there; basically anything with any sort of data can be a source of a machine learning problem. However, there are some general categories


under which fall most, if not all, machine learning problems, and that's what we're going to
explore next.

Figure 4.2: How all the machine learning models stem from the same source, just different assumption paths.

4.2 Types of Machine Learning Problems


There are three distinctive major types of machine learning problems under which any
problem we encounter will fall. The first type of those is what we call supervised learning.
The traveling diabetes clinic problem is an instance of supervised learning problem.
In supervised learning, we have a label that supervises what the model should learn; the label we try to predict is the piece of data that acts as the supervisor. For example, in the traveling diabetes clinic problem, the model was learning and changing its parameters' values toward assigning correct labels on the training data, so we can say that the label acted as a supervisor for the model by telling it how and where to change its parameters in order to achieve better and better accuracy. The diabetes clinic problem belongs to a sub-type of supervised learning problems called classification problems. This sub-type differs from


another sub-type called regression problems by how the label looks. In classification problems, the label is categorical and discrete, like whether a patient is diabetic or not, while the label in a regression problem is real-valued and continuous, like the price of a house for example.

Figure 4.3: Supervised learning problems.

Contrary to supervised learning, there exist unsupervised learning problems, where there are no supervisory labels and we just explore the data looking for hidden structures and interesting patterns. Now you might think that unsupervised learning is messy; there are so many different patterns and structures to look for in the data. This is partially true: unsupervised learning is a little less well-defined compared to supervised learning, and can take a variety of types and forms, unlike supervised learning where we mainly have classification and regression. However, just because it's unsupervised doesn't mean that we go through the data searching for just anything; we determine beforehand what sort of structure we're looking for, but we search for that structure without any supervision.
For example, we may want to see if groups exist within the data, in which some data points cluster together and represent a common pattern or behavior. This kind of unsupervised problem is called a clustering problem. Another example is when we have a problem with a large set of features; some may be relevant to the task we're trying to do and some may not be, and there's no way to manually determine what's relevant and what's not, so we'd want to explore the data for a smaller, compressed representation of the features that automatically keeps the most relevant parts and discards any irrelevant ones. This type of unsupervised


problem is called dimensionality reduction. Notice that in both types there are no labels guiding us: we have no clue which point belongs to which group, and we have no clue what's a relevant feature and what's not; we just look through the data for these hidden structures.

Figure 4.4: Unsupervised learning problems.

How Does the Framework of Generative/Discriminative Models Work with Unsupervised Learning if There Are No Labels?
The math we dealt with when we talked about generative and discriminative models involved both the features X and the label Y, and we said that this framework of making assumptions, upon which these two classes of models are built, is the way we're going to continue the book. There are no labels in unsupervised learning; does that mean we won't be working with unsupervised learning at all? Is there another framework for unsupervised problems? The answer to both questions is no! It does seem at first glance that unsupervised learning is outside the reach of the framework we established. However, with careful and closer inspection, we can see that the same framework holds and is effective for most unsupervised learning problems.
The goal of unsupervised learning is to discover some interesting pattern within the data, and if we look at it carefully, we can model such interesting patterns in a similar way to the discriminative models we established for supervised learning: we assume that there's some target function that takes the features X and spits out a variable Z. In the dimensionality reduction case, Z would be the lower-dimensional representation of X, and in the clustering case it would be the index of the cluster to which the data point x belongs.
And for generative modeling, we simply assume a probability distribution over Z and a conditional distribution P(X|Z), which we use with the Bayes rule to get P(Z|X). This way of introducing the structure we're looking for in the data as a hidden random variable Z, which we call a latent variable, allows us to extend the unifying framework to include unsupervised learning as well, giving us a systematic approach of making assumptions about the nature of a discriminant function or probability distributions. However, because we have


no data about that latent variable Z, we need to get a little creative when we try to learn about it. We'll see how
this is done when we work with unsupervised problems later.

The word latent comes from the Latin for "lying hidden", which makes it the perfect name!

The last major type of machine learning problems, which is outside the scope of this book but which we include here for completeness, is reinforcement learning. In reinforcement learning, we're basically trying to learn how to survive in an environment. Think of a self-driving car. The car needs to be able to survive on the streets without having or causing any accidents. This kind of machine learns how to drive first by going through a simulation of the streets. It starts taking actions and observes what happens and how the environment reacts. If an accident happens, the action that the car took is punished and discouraged in the future. On the other hand, if everything goes fine, that action is rewarded and encouraged in similar situations. By repeatedly interacting with the environment and reinforcing the rewarded actions, the agent finally learns how to survive in that environment. Although this type of problem is outside the scope of the book, everything that we'll learn here will be fundamental for doing reinforcement learning; this is true because, deep down, reinforcement learning shares the same mathematical foundation as the other two types, but is a little more complicated due to the existence of an external environment.

Figure 4.5: Reinforcement learning

Any machine learning problem we may encounter will generally fall into one of the three major types we defined: supervised, unsupervised, and reinforcement learning. And thanks to the foundational base we have established so far, we're basically going to deal with whatever comes our way with the same systematic process: make an assumption, hold some data out for testing, train, test, and once we have a good model that we trust, use all the data to train our production model.


Figure 4.6: Machine learning problem pipeline.

Now, we're all set to embark on our journey through the world of machine learning models. So
sharpen your pencils, stretch your fingers and let's dive in.


Prelude
Uniformly Continuous Targets
“Children need continuity as they grow and learn”

- Thomas Menino, Former Mayor of Boston.

“One of these things is not like the others,


One of these things doesn't belong,
Can you tell which thing is not like the others by the time I finish this song?”

Along with a set of four objects, this song plays during an episode of Sesame Street, the famous children's show. For example, an apple, a cup of ice cream, a burger sandwich, and a sock are shown, and the children watching the show would scream out: the sock! While it seems like a trivial answer, it's worth taking a deeper look into what could be going on inside a child's mind during the very short interval between the moment they hear the song and see the objects, and the moment they scream out the answer.


Figure P1-1: A child's brain's ability to capture similarity allows it to catch the out-of-group object

A direction in developmental psychology, which is the study of how human beings change over time and how they acquire knowledge from infancy to adulthood, would suggest that the children's ability to pick out the sock amongst the other objects stems from their brains' ability to capture similarity; in other words, their ability to assign similar objects to the same category despite the small differences between them 6. When the children see the apple, the ice cream and the burger, they realize that all of these belong to the same category because they are all edible. There is a difference in taste (the apple and the ice cream being sweet against the burger being salty) and a difference in eating method (a cup of ice cream is usually eaten with a spoon while both the apple and the burger are usually eaten without any tools), but these differences do not lead the child astray from recognizing that these objects are all from the same class: edible. The child knows that whatever makes something edible doesn't change with these small changes in taste and eating method.
Let's take a moment to think about that "thing" that makes something edible and understand what makes it robust to such small changes. First of all, we need to make that "thing" concrete in order to work with it, and we can do that with the help of the tools and concepts we developed in the last chapter when we formalized machine learning problems. Let's say that there's a variable x that encodes the features and properties of an object, and a target function that takes that variable and outputs some real number; if that number is

6 Check Neoconstructivism: The New Science of Cognitive Development by Scott Johnson, Oxford University Press, Chapter 14 for further discussion of how similarity contributes to children's ability to learn and acquire knowledge.


greater than 8 then that object is edible, otherwise it's not edible; this is our modeling
decision.

Figure P1-2: Two different candidates for edibility target function

In figure P1-2 we see two possible candidates for such a function: a relatively smooth f(x), and a g(x) with a sharp spike in the middle. Do you think that both functions are valid candidates, given that we know that similar objects with little changes in properties get the same verdict of being edible?
To answer this question, we need to put the statement about small changes not affecting the verdict into some quantifiable form. In our model, the similarity between two objects can be quantified by the difference between their x values, or the change between these values, and the verdict of being edible or not depends on the value of the function itself, so a change of the verdict corresponds to a change in the value of the target function. For the model to be consistent, a small change in x should result in a similar change in the function's value across all possible values of x. Think about it: if sweet potatoes are encoded with the value 25.78, regular potatoes with 24.35, and the 1.43 difference between them accounts for one being sweeter than the other, we need the difference in the value of the function due to that 1.43 difference in x to be the same as the one due to the same difference between cheese (encoded for example with 32.56) and ice cream (encoded with 33.99). If this is not the case, if the function is not consistent in that way, we may find out that cheese is edible while ice cream is not.
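Here's a small sketch that makes the consistency check concrete; the function f below is made up (any gently varying candidate would do), while the encodings 24.35, 25.78, 32.56, and 33.99 are the ones from the text.
import math

def f(x):
    # A made-up, smoothly varying candidate target function.
    return 0.5 * x + 0.1 * math.cos(x / 10.0)

# The same 1.43 change in x should produce roughly the same change in f(x)
# wherever it happens along the axis.
print(abs(f(25.78) - f(24.35)))   # potatoes vs. sweet potatoes
print(abs(f(33.99) - f(32.56)))   # cheese vs. ice cream; both come out close to 0.71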
Now the question about the validity of f(x) and g(x) as target functions of the edibility phenomenon boils down to the question of whether they satisfy that consistency condition. The answer can then be easily obtained when we see that one of them does satisfy it, and the other does not.


Figure P1-3: small x-change vs y-change for the two candidate functions

We can see that for f(x), the small changes in the values of x along all the possible values result in consistent changes in the value of f(x); that apparently doesn't hold for g(x). If g(x) were the true target function underlying the edibility phenomenon, then the children's strategy of exploiting similarity wouldn't have worked! Exploiting similarity works as well as it did only if the true target function exhibits that consistent behavior. In mathematics, in the field of real analysis, such behavior is called uniform continuity, making f(x) a uniformly continuous function.
You may recall the concept of continuity from your high-school math class, where a function is called continuous if there are no gaps or jumps in its graph. Being uniformly continuous puts an extra requirement on the function's graph by requiring it to have no steep sections like g(x) has 7; it's called uniform because changes in the function happen in a uniform and consistent manner, with no steep changes.
As we have seen, a target function like f(x) being uniformly continuous allows the usage of similarity between its input-output pairs to learn and recognize the pattern it generates. Assuming that a function is uniformly continuous is a very simple assumption, as it doesn't enforce any specific form on the function. If we assumed the target function in any machine learning problem we encounter to be uniformly continuous, then we would be

7 Technically speaking, a function with a steep section can still be uniformly continuous, provided that the section is a short one.


giving birth to the simplest family of machine learning methods, our baby machine learning methods in a sense: the similarity-based methods.
In the next chapter, we'll work with the k-nearest neighbors method, which exploits the
similarity between the input data points and the labeled data in our data sets to infer the
labels of the new points. For unsupervised learning, we'll look in chapter 6 at the K-means
method and how it uses the similarity between the data points in order to group them into
related clusters.


5
K-Nearest Neighbors Method

“You can be a good neighbor only if you have good neighbors”

- Howard E. Koch, playwright and screenwriter.

Having established our uniformly continuous modeling assumption in the prelude, we'll utilize the fact that uniformly continuous targets allow us to exploit the similarity between objects as an indicator of the similarity of their labels. With this approach we'll develop the k-nearest neighbors (or k-NN) method, which simply searches for the k most similar objects (aka, the nearest neighbors) to our input and uses their labels to predict a label for it. We'll motivate our discussion by applying the method to the problem of classifying whether a mushroom is edible or poisonous, and along the way we'll also get to learn about NumPy and what it takes to make machine learning code efficient.
This chapter is going to be a little long; we're going to take advantage of the simplicity of k-NN and try to build it from scratch in different stages. In each stage we're going to improve upon the previous one until we reach the usage of scikit-learn. This will allow us to see, on a manageable scale, what it takes to implement functional and efficient machine learning software that can work on large amounts of data, and grow our appreciation of scikit-learn, its supporting libraries, and the community behind them.

5.1 A Basic K-NN Classifier


You get a call from one of your relatives, who runs a community organization that plans camping trips in the wild for children and teenagers in order to get them to do more outdoor activities and rely less on digital entertainment. Your relative explains that she has observed that the kids tend to enjoy the experience more if their digital devices are somehow part of the experience, like when they search for info on a plant or an animal they see.


For their next series of trips, they expect to encounter a lot of mushrooms, some of which can be eaten safely while others can be deadly. For that reason, your relative was thinking about how their digital devices could be used to help in these situations, and was telling you about her idea of a "Can I eat that?" app, checking if you can help make it real.

5.1.1 The “Can I eat that?” App


"It's a very simple and straightforward idea. You just enter a visual description of the mushroom into the app and it should tell you if it's safe to eat or not," your relative says. She continues explaining that although mushrooms are very diverse, they all share a common anatomy through which they can be described, but they don't want the app to use complex anatomical features to detect whether the mushroom is edible or not; they want to keep it simple for the kids and just use a handful of easy-to-spot features.

Figure 5-1: Anatomy of a mushroom

"We just want the kid to look at the mushroom and describe its cap's shape, color, and texture. Along with these pieces of info, we want the kid to take note of the habitat of the mushrooms as well as their population." She explained that with only this info, the app should determine if the mushroom can be eaten safely or not, "as simple as that!" she concludes.
With traditional programming methods, the task is not as simple as she thinks. To do this with traditional programming, we'd need to code up some rules that determine whether the mushroom is edible or not, and we don't have any expertise in mushrooms to do that. Even if we had that kind of expertise or consulted a botanist, it's very likely that these rules would be too complicated to be maintainably hard-coded in an app's logic. These problems make this app quite a challenge.
However, by looking at this problem in the light of machine learning, we can find that it's a textbook classification problem. We have an object with a set of features (the mushroom), and based on these features we're trying to predict a label for it (being edible or poisonous); the only


thing that is missing is a dataset that we can use to train a prediction model! A quick search on the internet reveals the Mushroom Dataset.
This dataset was collected from The Audubon Society Field Guide to North American Mushrooms, which is a nearly thousand-page book designed specifically as a guide for identifying the mushroom types of North America for anyone taking a tour in the wild. The guide was prepared by the National Audubon Society, a non-profit environmental organization dedicated to the conservation of wildlife. This organization has issued a lot of field guides for surviving in and identifying the wild, one of which is the source of our dataset.
The set contains 8124 mushrooms, each described by 22 features and labeled either poisonous or edible. Let's take a closer look at it by opening the mushrooms.csv file with pandas and looking at a random sample from the data as it appears in figure 5-2.
pandas and look at a random sample from the data as it appears in figure 5-2.
data = pd.read_csv("../datasets/mushrooms.csv")
data.sample(10, random_state=42)

Figure 5-2: A random sample of 10 mushrooms from the dataset

The first column in the data, the one named E, is the label we're after; it takes two values: p for poisonous and e for edible. The 22 features are named from F0 to F21, and each takes discrete values encoded by letters. The webpage of the dataset gives us the details of each of these features and what each letter of their values means. From the description of the visual features needed in the app, we find ourselves interested only in F0, F1, F2, F20, and F21, which respectively correspond to: cap-shape, cap-surface (its texture), cap-color, population, and habitat. Moreover, the dataset webpage gives us the meaning of the letter codes for each of these features:

• The cap-shape can be: bell(b), conical(c), convex(x), flat(f), knobbed(k), and
sunken(s).
• The cap-surface is either fibrous(f), groovelike(g), scaly(y), or smooth(s).


Figure 5-3: Possible mushrooms caps

• The cap-color could be brown(n), buff(b), cinnamon(c), gray(g), green(r), pink(p),


purple(u), red(e), white(w), and yellow(y).
• The population the mushroom exists in can be abundant(a), clustered(c), numerous(n), scattered(s), several(v), or solitary(y).


Figure 5-4: Possible shapes of mushrooms' populations

• Finally, the habitat of the mushroom can be grasses(g), leaves(l), meadows(m),


paths(p), urban(u), waste(w), woods(d).

Our first step is to extract only the data we're interested in working with, and this can be done by selecting the list of columns we want from the data frame:
data_of_interest = data.loc[:, ["E", "F0", "F1", "F2", "F20", "F21"]]

The app can then be designed as a form with drop-down menus, one for each of these features, each containing the feature's possible values. The kid can fill in the form describing the mushroom they see and the app would output the label of that mushroom: edible or poisonous. The question then is, how can we do that?

5.1.2 The Intuition Behind k-NN


Nature usually demonstrates itself to be uniform and coherent; a tomato that has a slightly rough surface and some dents is still a tomato, as these small changes in appearance don't keep it from being considered a tomato. A banana that is slightly less white and a bit short remains a banana; that small deviation from the perfect long shape doesn't disqualify it as a banana. In general, from this demonstrative evidence, it sounds reasonable to assume that the underlying processes behind these phenomena are uniformly continuous; that is, small changes in the input result in small, consistent changes to the output. Mushrooms are no exception; it's reasonable to expect mushrooms that look like each other to be similarly edible or poisonous. So


to classify a mushroom, we simply search the dataset for the most similar mushroom to it; once we find it, we use its label as our classification.
This is basically the heart of the k-NN method: it searches the dataset for the nearest neighbors to our input (those that are the most similar to it) and then uses the labels of these nearest neighbors to infer the label of the input. But instead of just looking at the single most similar object, k-NN allows for a more refined judgment by grabbing the k most similar elements from the dataset and taking a vote among them: the most common label among that group of k objects is output as the label for the input.

Figure 5-5: How k-NN works on classifying mushrooms (k = 5)

k-NN is a very simple learning algorithm. It's so simple that you probably feel ready to start writing code that implements it right away; the only thing missing is how to measure the similarity between the input mushroom and the mushrooms in the dataset.

5.1.3 How to Measure Similarity?


We can find in the name of the method itself a clue about how we can measure the similarity between the data, specifically in the term "neighbor". A neighbor is someone who lives within a close distance to us; someone living a few meters away next door is more of a neighbor than someone who's a few kilometers away. This suggests that we can measure the


similarity between the data by measuring the distance between them, and the smaller the
distance is between two objects, the more neighboring they are, and hence the more similar.
One of the most intuitive distance measures that we can use is the Euclidean distance, which you may remember from your high-school math. In a 2D plane (like a piece of paper) you can define a point by two components: its horizontal distance from the center of the paper (which we usually call the origin), and its vertical distance from that point.

Figure 5-6: the Euclidean distance between two points in a plane

We can denote these two components by x1 and x2. If we have two points on that plane, p1(x1(1), x2(1)) and p2(x1(2), x2(2)), then the Euclidean distance between them can be calculated with:
d(p1, p2) = √((x1(1) - x1(2))² + (x2(1) - x2(2))²)
But what happens if the two points have more than two components? In the mushroom problem we have here, we treat each mushroom as a point in some higher-dimensional space and each feature it has as a component of that point. So with the five features we're going to work with, we'll be looking at mushrooms as points with five components. In such cases, it helps to look at the formula of the Euclidean distance as simply the square root of the sum of the squared differences between the two points along each component; this view allows us to generalize the formula to any N-dimensional points by saying that:
d(p1, p2) = √((x1(1) - x1(2))² + (x2(1) - x2(2))² + … + (xN(1) - xN(2))²)

Now for this metric 8 to work, the components of each data point need to be numeric. However, it's not a big deal to transform the categorical string codes we have in our data features into numeric values and make the formula work on them; we just need to consistently map each string code to an arbitrary integer. This can be easily done with the following function:
def numerically_encode(df):
    encoded_df = df.copy()
    encoders = {}

    for col in df.columns:
        unique_categories = df.loc[:, col].unique()
        unique_categories.sort()  # in-place sorting
        encoder = {category: num for num, category in enumerate(unique_categories)}
        encoders[col] = encoder
        encoded_df.loc[:, col] = df.loc[:, col].apply(lambda x: encoder[x])

    return encoded_df, encoders

The function we just wrote simply creates a copy of the DataFrame we want to encode and then iterates over its columns. For each column we determine the unique values it can have, sort them 9, and create a dictionary that maps each string code to its index in the sorted sequence, which serves as a numeric code for that string. After we prepare that encoder dictionary, we use the apply() method to map each element in the column to its numeric code. The result is then used to overwrite the column in the copied encoded_df. The method apply() works by taking a callable (in this case we use a lambda expression) and applying it to each element in the DataFrame. Our lambda expression uses the element as a key for the encoder dictionary and returns the corresponding numeric value, which finally results in all the data being consistently encoded.
We also keep track of and return the encoder dictionary for each column in the data, as looking at how the encoding is done can be important. For example, the label column's encoder tells us that the edible label is encoded with a 0, and the poisonous one is encoded with a 1. So when our classifier labels a mushroom with a 1, we stay away from it.
encoded_data, encoders = numerically_encode(data_of_interest)

8 Any distance measure is usually called a metric.
9 It's not necessary though to sort the unique elements, but we do that anyway to make the implementation consistent with how scikit-learn does it.


print(encoders["E"])

> {'e': 0, 'p': 1}
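As a side note, if you'd rather not hand-roll the encoder, scikit-learn ships a transformer that does the same job; a minimal sketch is below (by default it also sorts each column's categories, which is why we sorted ours above). We'll keep using our own function in this chapter, but it's good to know the off-the-shelf option exists.
from sklearn.preprocessing import OrdinalEncoder

feature_columns = data_of_interest.loc[:, "F0":]   # everything except the label E
sk_encoder = OrdinalEncoder()
encoded_features = sk_encoder.fit_transform(feature_columns)

print(sk_encoder.categories_[0])   # the sorted letter codes found for F0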

Now that our data is numerically encoded, we're ready to apply the Euclidean metric on it and
start implementing our k-NN classifier!

5.1.4 k-NN in Action


We start by implementing the Euclidean metric. If we have two N-dimensional points, we can implement the metric's calculation with a simple for loop:
import math

def d(p1, p2):
    squared_distance = 0
    for i in range(len(p1)):
        squared_distance += (p1[i] - p2[i]) ** 2
    return math.sqrt(squared_distance)

The next step is to implement the k-NN classifier itself. This can simply be done by calculating the distances between our input and all the data in the training set, sorting those in ascending order by distance, picking the first k, and then reporting the most common label among our chosen set of k data points.
from collections import Counter

def knn_classifier(new_x, k, X_train, y_train):
    neighbors = []  # a list of tuples (distance_to_new_x, neighbor_label)

    for x, y in zip(X_train, y_train):
        distance = d(x, new_x)
        neighbors.append((distance, y))

    sorted_neighbors = sorted(neighbors, key=lambda n: n[0])
    nearest_k_neighbors = sorted_neighbors[:k]

    labels_counter = Counter([label for _, label in nearest_k_neighbors])
    most_voted_label = max(labels_counter.items(), key=lambda i: i[1])

    return most_voted_label[0]

The method simply starts by looping through the training data passed in X_train (for the features) and y_train (for the labels) and calculates the distance between each data point and the input new_x that we want to classify. Each data point's distance is appended to a list along with its label, and that list is then sorted by the value of the distance; the k nearest neighbors are then the first k elements of that sorted list. To count the votes for each label in the k-nearest-neighbors set, we use the Counter object from python's collections package on the labels of the k nearest neighbors. The Counter object returns a dictionary whose


keys are the unique elements in the given list and whose values are their counts. We take the label with the maximum count and report it as the label for new_x.
To start seeing this classifier in action, we first need to split our data into training and testing sets, which we did with scikit-learn back in chapters one and three; but let's take the chance here and see what it takes to do that manually:
shuffled_data = encoded_data.sample(frac=1., random_state=42)
X, y = shuffled_data.loc[:, 'F0':], shuffled_data.loc[:, "E"]
X, y = X.as_matrix(), y.as_matrix()

X_train, y_train = X[:6125], y[:6125]


X_test, y_test = X[6125:], y[6125:]

This split starts by randomly shuffling the data. We make use of the sample() method we used at the beginning of the chapter to get a random sample from the DataFrame, this time to shuffle all the data. This is possible because of the argument frac, which specifies the fraction of the DataFrame we want in the sample. By setting this argument to 1, we tell it that we want the whole dataset in the sample, only in random order. After shuffling the data we separate out the features and the labels into two different variables X and y and then use the as_matrix() method to strip out all the pandas niceties and work with the data directly. We finally take the first 6125 elements to be our training data and the rest to be the test data.

Is Shuffling the Data Important?


Hidden biases could occur in the data collection process if, for example, the data entry team decided to put all the edible mushrooms at the end of the dataset. If we split the data as is (without shuffling), we would get a test set that is mostly edible mushrooms with few or no poisonous ones, which is not representative of the population we're trying to investigate; hence, shuffling the data before the split is crucial!

The last thing we need before running our classifier is to specify a loss function, a way to measure the performance of our model. A very simple loss measure that we'll be using often (especially in theoretical analysis) is what is called the 0/1-loss: if the classifier predicted the correct label then we're not losing anything (our loss is 0), otherwise we're losing everything (the loss is 1):
loss(ŷ, y) = 0 if ŷ = y, and 1 otherwise
Such a function is called a loss function because it measures how much the model loses by making wrong predictions. This terminology of loss is used all over ML.
With that defined, we're ready to let our classifier run and see its performance in action:
test_preds = [knn_classifier(x, 5, X_train, y_train) for x in X_test]


losses = [1. if y_pred != y else 0. for y_pred, y in zip(test_preds, y_test)]

test_error = (1. / len(test_preds)) * sum(losses)
accuracy = 1. - test_error

print("Test Error {:.3f}, Test Accuracy: {:.3f}".format(test_error, accuracy))

> Test Error 0.110, Test Accuracy: 0.890

The k-NN classifier seems to be working quite decently! It's easy to see that the complement of the loss represents the accuracy of the classifier over the test set, that is, how accurate it is in predicting the correct label, and our classifier is able to correctly predict the labels of 89% of the test set. That's pretty good!
Although this implementation is perfectly correct and is producing good results, it takes a bit too much time to run. We can use jupyter notebook's %timeit magic command to get an estimate of its execution time.
%timeit [knn_classifier(x, 5, X_train, y_train) for x in X_test]

> 1min 18s ± 3.75 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

One minute and eighteen seconds 10. Such an execution time, although kind of okay now, will be problematic in other cases when the size of the data gets larger than that. The problem stems from the fact that python is itself a relatively slow language. Python is a dynamically typed language, meaning that the types of the data being handled are determined during the run-time of the program, which puts an overhead on the execution time, as the types need to be figured out first to run the appropriate low-level operation on them. Moreover, python's containers (mainly lists) can hold elements of different types, and each type can take a different size in memory than the others. This makes python by default unable to store a list in contiguous memory cells (which would boost its access time); instead, the elements are scattered around the memory and the list only points to their locations, and hence python loses the performance advantage of storing the data close together that a language like C or C++ has. These features make python a lot easier to use than other languages, but unfortunately they come at the cost of degraded performance.
These reasons make the code we just wrote quite inefficient, especially when it's going to run so many times over a large amount of data like we have in our problem here. So is there a way we can still enjoy python's ease of use, avoid these problems, and get faster computations? Well, the answer is yes; this is exactly what numpy is doing!

10 You may get a different execution time on your machine, and you'll probably get different results across several runs on the same machine. This depends on the specs of your machine and the machine's load at running time. What's important is that you would get similar speedup ratios when you apply the enhancements we're going to introduce in the upcoming sections.


Why does storing in contiguous memory cells make things faster?


Down in the CPU, where our code gets executed, lies a small memory called the cache. Because of its proximity to the processing unit (they're basically on the same chip), the CPU can access data stored in the cache much faster than when it's stored in the RAM; if it takes 100 nanoseconds for the CPU to access the RAM, it only takes half a nanosecond to access the cache (almost 200x faster!).
To exploit the speed of the cache, when the code references an address in RAM, the contents of this address and its neighboring addresses are loaded from the RAM to the cache. If it happens that the code keeps using data from that portion loaded into the cache, it will be much more efficient and time-saving to get it directly from the cache rather than taking the long way to RAM. This is called exploiting the principle of spatial locality, that is, nearby memory locations are more likely to be accessed together.
It's easy now to see why the C behavior of storing a list in contiguous memory cells has an advantage over python's scattered behavior. In C, the list's elements are copied together to the cache due to their spatial locality, so the CPU can process them faster. In python, however, because the elements of the list are scattered around the memory and are not spatially local, there's a high chance that the cache will miss the next elements after processing the current one, forcing the CPU to take the long way to the RAM to retrieve them, which negatively impacts the performance.
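A quick, back-of-the-envelope experiment (the exact numbers depend entirely on your machine) shows the kind of gap this creates in practice: the same element-wise operation on a plain python list versus an ndarray.
import timeit

setup = """
import numpy as np
py_list = list(range(1_000_000))
np_array = np.arange(1_000_000)
"""

# Double every element, ten times each; the numpy version is typically far faster.
print(timeit.timeit("[x * 2 for x in py_list]", setup=setup, number=10))
print(timeit.timeit("np_array * 2", setup=setup, number=10))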

5.1.5 Boosting Performance with NumPy


According to their website, numpy is the fundamental package for scientific computation with python. The reason for this is that numpy is designed to overcome the shortcomings of python's performance while keeping the simplicity of the language intact. The whole system of numpy is built on top of an extremely efficient replacement for python's lists, the N-dimensional array data structure, or as it's called within the library: ndarray. What makes ndarrays that efficient is that they apply what was missing from python's lists in the first place: all the elements of an ndarray have the same type and they are stored in contiguous memory cells. Along with carefully designed code for operations on ndarrays written in C/C++ (and sometimes FORTRAN), this makes crunching numbers with numpy blazingly fast, which in turn makes numpy the backbone of any number-crunching package in python. This includes both pandas and scikit-learn.
Actually, if we look at the types of the variables X and y (the bare data we extracted from the data frame of our dataset using as_matrix()), we'll find that they're nothing but regular ndarrays.
print(type(X))
print(type(y))

> <class 'numpy.ndarray'>


<class 'numpy.ndarray'>


To use these data-structures in our code, we first need to understand one of the factors that
control how they behave, which is their shape.
print(X.shape)
print(y.shape)

> (8124, 5)
(8124, )

Each ndarray has a specific shape, which is a tuple whose length is the number of dimensions in that array. Each element of the tuple represents how many items that dimension holds. Our features data X has a shape of (8124, 5), which means that it has two dimensions, just like a table, with the first dimension representing the 8124 rows that contain our mushrooms, and the second the 5 columns that contain each mushroom's feature values. On the other hand, the shape of y is (8124, ). We notice that y's second dimension is missing, and that makes sense because each one of the 8124 mushrooms has only one label, so there are just 8124 entries after all. Another way to think about it is that y is a table of 8124 rows and just one column holding the mushrooms' labels, and indeed the shape (8124, ) is conceptually equivalent to the shape (8124, 1). We say they are conceptually equivalent to emphasize that they differ in other aspects, such as the programmatic aspect; a shape of (n, ) is programmatically different than a shape of (n, 1). If a has a shape of (n, ), then a[0] will return a numerical value. However, if a has a shape of (n, 1), a[0] will return an ndarray of shape (1, ).
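A quick sketch lets you see that programmatic difference for yourself:
import numpy as np

a = np.array([10, 20, 30])     # shape (3, )
b = a.reshape(3, 1)            # shape (3, 1)

print(a[0])                    # 10      -> a plain number
print(b[0])                    # [10]    -> an ndarray of shape (1, )
print(a.shape, b.shape)        # (3,) (3, 1)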
If we have two ndarrays with compatible shapes, then we can easily combine them in an
arithmetic operation. Arithmetic operations on ndarrays are performed element-wise;
meaning that each element from the first array is combined with its corresponding element in
the second array. So if we have an array of four ones and we divide it by the array [1, 2, 4,
8], the result would be [1, 0.5, 0.25, 0.125]
import numpy as np

four_ones = np.ones(shape=(4, ))
powers_of_two = np.array([1, 2, 4, 8])

print(four_ones / powers_of_two)

> [ 1. 0.5 0.25 0.125]

Here we created two arrays, one with the np.ones method, which generates an array of ones with the given shape, and the other with the np.array method, which creates an ndarray from an array-like object such as a list. This kind of element-wise behavior is what allowed us back in chapter two to clean our data by creating a Boolean mask over the data whose value was non-zero. But wait a second … we compared the arrays back then against a single value, a 0. How does that work even though the 0 doesn't have the same shape as the array?! The answer to that question lies in one amazing feature of NumPy, which is called broadcasting.


Remember that the condition for performing arithmetic operations was for the shapes to be
compatible, which is different from being identical. Two arrays can have different shapes and
still be compatible; in that case, the elements of the smaller one are broadcasted across the
larger one, and the result has the larger shape. For two arrays to have compatible shapes,
every pair of corresponding dimensions in the two arrays has to be either:

• equal, or
• one of them is either missing or its size is 1

Back in chapter two, in section 2.2.2 specifically, we calculated our Boolean mask to select
the non-zero BMI records:
bmi_zeros_mask = data_of_interest.loc[:, "BMI"] != 0

There, our BMI data had the shape of (768, ) and the 0 we compared the data against has no
shape at all, by virtue of not being an array. If we list the two shapes below each other,
we can see how the second compatibility condition applies:

768
-
------
768
Here - represents a missing dimension. We can see that one of the corresponding dimensions
(the one belonging to the 0) is missing. Hence, the two shapes are indeed compatible, and the
comparison operation against the 0 is broadcasted over all the elements of our data array, as
depicted in figure 5-7.

Figure 5-7: The broadcast of a scalar over a vector


The idea of broadcasting can provide us with a significant speed improvement in our k-NN
implementation, especially in the way we calculate the distance between our new_x and all
the points in X_train. Our X_train array contains 6125 data points, each with 5 features, so
it has a shape of (6125, 5); the new_x data point has 5 features, so it has a shape of (5, ).
Now, let's list the two shapes below each other like we did before and take note of where
the missing dimensions are:

6125, 5
- , 5
-------
6125, 5
We can easily see that the two shapes are compatible, and the smaller new_x can be
broadcasted across the larger array X_train. This would allow us to do the following:
(X_train - new_x) ** 2

And get an array of shape (6125, 5) where each row contains the squared differences between
the components of new_x and the components of each point in X_train. What we did with two
nested for loops in the previous implementation can be done in a one-liner with NumPy! What
broadcasting does under the hood is run the two nested for loops as well, but this time within
NumPy's optimized C/C++ code, which gives a huge boost over the python code we had.
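To convince ourselves that the broadcasted one-liner computes exactly what the nested loops
did, we can compare the two on a couple of tiny, made-up arrays (X_small and x_new below are
just illustrative stand-ins for X_train and new_x):
import numpy as np

X_small = np.array([[1., 2.], [3., 4.], [5., 6.]])   # shape (3, 2)
x_new = np.array([1., 1.])                           # shape (2, )

fast = (X_small - x_new) ** 2   # the broadcasted one-liner

slow = np.empty_like(X_small)   # the equivalent explicit nested loops
for i in range(X_small.shape[0]):
    for j in range(X_small.shape[1]):
        slow[i, j] = (X_small[i, j] - x_new[j]) ** 2

print(np.allclose(fast, slow))

> True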


Figure 5-8: the broadcast of the vector new_x across the array X_train

Nonetheless, we still need to take the summation of these squared differences for each data
point separately. For this case, NumPy's sum method provides the argument axis, which
specifies along which dimension (a.k.a. axis) the summation should be taken. By setting the
axis to 1, we tell the sum function that, for each data point, its 5 elements should be
squashed into a single number representing their summation. This results in a vector of
shape (6125, ) after the second dimension is squashed by the summation, as we can see in
figure 5-9.
np.sum((X_train - new_x) ** 2, axis=1)


Figure 5-9: NumPy's summation along a specific axis

Finally, we can use NumPy's sqrt method to take the square root of the squared distances. Both
np.sum and np.sqrt run within NumPy's optimized C/C++ code, which makes them orders of
magnitude faster than the equivalent python code.
neighbors_distances = np.sqrt(np.sum((X_train - new_x) ** 2, axis=1))

We can now get rid of the whole distance function d, replace the first few looping lines in
our knn_classifier function with that single line above, and have ourselves a dramatically
faster implementation of the k-NN method!
def faster_knn_classifier(new_x, k, X_train, y_train):
    # distances from new_x to every training point, computed in one broadcasted shot
    neighbors_distances = np.sqrt(np.sum((X_train - new_x) ** 2, axis=1))

    # indices of the k nearest neighbors and their labels
    sorted_neighbors_indices = np.argsort(neighbors_distances)
    nearest_k_neighbors_indices = sorted_neighbors_indices[:k]
    nearest_k_neighbors_labels = y_train[nearest_k_neighbors_indices]

    # count the votes each label got and report the most voted one
    labels, votes = np.unique(nearest_k_neighbors_labels, return_counts=True)
    most_voted_label_index = np.argmax(votes)

    return labels[most_voted_label_index]

We also made some changes in the sorting and voting portion of the implementation and used
NumPy alternatives to the python code we wrote before, to fully harness the power of
ndarrays! We use NumPy's argsort to sort the neighbors by distance and return an array of the
sorted neighbors' indices. By selecting the first k of these indices, we can get their labels
by directly querying the y_train array with them. Instead of using python's Counter, we use
NumPy's unique method, which takes an array and returns the sorted unique elements inside it.
It also has a boolean argument return_counts which, if true, makes it return the count of each
unique element as well. We then use argmax on the counts to get the index of the label with
the maximum count, and finally report the label at this index as our judgment.
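A toy run of that voting step makes the mechanics clear (the neighbor labels below are made up):
import numpy as np

nearest_labels = np.array(['p', 'e', 'e', 'p', 'e'])   # hypothetical neighbor labels

labels, votes = np.unique(nearest_labels, return_counts=True)
print(labels)                     # ['e' 'p'], the sorted unique labels
print(votes)                      # [3 2], how many votes each got
print(labels[np.argmax(votes)])   # 'e', the most voted label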


We can now test our faster_knn_classifier implementation and see how much faster it is
than our earlier one. Note that np.mean treats each True and False in faster_losses as a 1
and a 0, summing and dividing by their count inside NumPy's code instead of ours.
faster_test_preds = [
    faster_knn_classifier(x, 1, X_train, y_train) for x in X_test
]
faster_losses = (faster_test_preds != y_test)
faster_test_error = np.mean(faster_losses)

print("Test Error {:.4f}, Test Accuracy: {:.4f}".format(
    faster_test_error,
    1. - faster_test_error
))

> Test Error 0.1176, Test Accuracy: 0.8824

%timeit [faster_knn_classifier(x, 5, X_train, y_train) for x in X_test]

> 1.03 s ± 8.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The execution time dropped from about 1.3 minutes to just 1.03 seconds! That's about a 75x
speed-up over our earlier implementation! Now we have a basic implementation of a k-NN
classifier with decent accuracy and a faster execution time, but can we do any better?

5.2 A Better k-NN Classifier


So far we have built a very basic version of the k-NN classifier. Though we got it to work
much faster than what we had initially, it's still somewhat slow! We can feel its slowness in
that slight hang between running the code and getting the results! If we're planning to
deploy such a model to production and let an app used by a lot of people query it frequently,
such sluggishness can get our users frustrated.
The challenge is that we have already optimized the implementation as much as we can; the
only way left to make it faster is to change something in the algorithm itself. The best
candidate for such a change, and the thing that represents the bottleneck in the k-NN method,
is the fact that we need to loop through all the training points and calculate their distances to
our new test point. In terms of algorithmic complexity, we can say that the time complexity of
the prediction operation is O(dm): the time needed to predict a label grows as the product dm
grows, where d is the cost of calculating the distance and m is the size of the training set.

11
The results are not the same as in our earlier implementation due to how numpy treats floating point numbers compared to how native python does;
however, the two results are not too far apart.


This reveals an even bigger problem: as the training data gets larger (which is a very good
thing), prediction gets slower (a very bad thing). So can we do any better than that? Can we
avoid searching through all the training points to get the nearest neighbors?

5.2.1 Doing Faster Neighborhood Search Using K-d trees


Let's step back from our “Can I eat that?” app for a while and take a look at the following
hypothetical example. Imagine having a bunch of data points with only two features, so
that we can visualize them on a 2D plane as in the following figure. The task is to find the
nearest point to the test point marked in figure 5-10 with a hollow circle.

Figure 5-10: the 2D space we need to search for nearest neighbor

We need to solve this task without checking the distance between the test point and every
other point in the training data; we want to make as few distance checks as possible. We can
start approaching such a solution by stating an obvious fact: points that are in the same
region of the space are most likely closer to each other than to any other points in the rest
of the space. Now suppose that we divided the 2D plane we had previously into rectangular
regions, like in figure 5-11:


Figure 5-11: our 2D space after being divided into regions

Moreover, suppose that for any point we're given, we can cheaply (at a small time cost)
determine in which rectangular region it lies. So for our test point, we can cheaply determine
that it lies within the region at the bottom-right corner. Now we can apply that obvious fact
we just stated, calculate the distance only to the few other points in that same region, and
find that the closest one is the point numbered 11. Voila! We have solved the task without
exhaustively searching through all the points!

Figure 5-12: The local nearest point vs. the True nearest point

But wait a minute!! This can't be the correct solution! While point 11 is the closest to our test
point within their region, it's not “the” closest point among all the points in the space! From
figure 5-12, it's clear that point #13 in the next region to the left is much closer, and it is in
fact the solution to the task! So why did our method give us a wrong solution?!
The answer lies in the fact that the statement about points in the same region we used is
nothing but a heuristic, a way to approximate a solution but not necessarily the best
solution. Think about it: it's natural to think that the people in the room with you are those
with the closest physical distances to you, but you could be sitting next to the wall separating
your room from the next, and right next to the wall on the other side there's someone sitting.
In reality, that person across the wall is the one who's at the closest physical distance to you.
This is exactly the situation we're having here: our test point lies close to the border between
its region and the adjacent region, and point #13 across the border is the closest one.
This gives us an idea: we can check for possibly nearer points not by directly calculating the
distance to those points, but by calculating the distance to the border of the region
containing them. If the distance between the test point and some candidate region is greater
than the distance to the nearest local point, then that region is not even worth looking at and
we can skip all the points that lie within it (that is, we prune the whole region altogether).
Otherwise, the region could potentially contain a point that is closer than our local nearest
neighbor. This approach (which is an application of a more general algorithmic technique
called branch and bound) allows us to exclude a lot of points with a single distance calculation
to their region's border, and hence we end up with the correct solution after calculating the
distance to only a small subset of points.


Figure 5-13: the branch and bound search for the global nearest neighbor

We left out a few details though; we assumed that we have our space already partitioned into
these regions, and that we can cheaply determine in which region a point lies, but we didn't
specify how this partitioning is made or how that cheap process works. The answer to both
questions lies in the method we use to partition the space in the first place. The partitioning
process is actually simple: we start with the whole space, pick a value along one of the
dimensions, and split the space into two regions: before and after that value. This process is
then repeated on every region we get, going round-robin across the dimensions in each split.


Figure 5-14: space partitioning and how it corresponds to a tree structure

As we can see from figure 5-14, this partitioning process can be represented by a binary
tree structure. Each node represents a region of the space (the root node represents the
whole space), and its two children are the sub-regions resulting from splitting on one of the
dimensions. That's why this data structure is called a k-d tree, as in a tree used to represent
a k-dimensional space (note that this k is different from the k in k-NN: in k-d trees it's the
dimensionality of our space, while in k-NN it's the number of nearest neighbors we use for
inference). We use the same tree structure to determine in which region a point lies. Using
the values of the point's components in each dimension, we can descend the tree until we
reach the leaf node that represents the point's region, as shown in figure 5-15.


Figure 5-15: traversing a k-d tree determines in which region the point resides

Once we have reached that region, we can start the branch and bound process we described
earlier.
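To make those two pieces (the recursive partitioning and the branch-and-bound search) concrete,
here is a deliberately tiny, unoptimized sketch; the names KDNode, build_kdtree and nearest are
ours, not a library API, and the real thing (which we'll get from scikit-learn next) is
considerably more refined:
import numpy as np

class KDNode:
    def __init__(self, point, axis, left=None, right=None):
        self.point, self.axis = point, axis
        self.left, self.right = left, right

def build_kdtree(points, depth=0):
    # split the points at the median value of one dimension,
    # cycling round-robin through the dimensions as we go deeper
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    median = len(points) // 2
    return KDNode(
        point=points[median],
        axis=axis,
        left=build_kdtree(points[:median], depth + 1),
        right=build_kdtree(points[median + 1:], depth + 1),
    )

def nearest(node, query, best=None):
    # branch and bound: descend into the query's own region first, then visit
    # the other branch only if its border is closer than the best distance so far
    if node is None:
        return best
    dist = np.linalg.norm(node.point - query)
    if best is None or dist < best[1]:
        best = (node.point, dist)
    diff = query[node.axis] - node.point[node.axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best = nearest(near, query, best)
    if abs(diff) < best[1]:          # otherwise the far branch is pruned entirely
        best = nearest(far, query, best)
    return best

points = np.array([[2., 3.], [5., 4.], [9., 6.], [4., 7.], [8., 1.], [7., 2.]])
tree = build_kdtree(points)
print(nearest(tree, np.array([8., 3.])))   # the closest of the made-up points and its distance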

5.2.2 Using k-d Trees with scikit-learn


Although k-d trees seem simple, it can get a bit tricky to implement them well. For that reason,
and because our purpose is to focus on how they help machine learning and the k-NN method, we
won't build a production-quality implementation ourselves; instead we'll use the
KNeighborsClassifier of sklearn.neighbors, which includes an implementation of k-d trees.
from sklearn.neighbors import KNeighborsClassifier

classifier = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
classifier.fit(X_train, y_train)

sklearn_test_accuracy = classifier.score(X_test, y_test)

print("Test Error {:.4f}, Test Accuracy: {:.4f}".format(
    1. - sklearn_test_accuracy,
    sklearn_test_accuracy
))

> Test Error 0.1176, Test Accuracy: 0.8824

%timeit classifier.score(X_test, y_test)

> 18.8 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

The error and accuracy are exactly the same as with our NumPy implementation, but the
introduction of k-d trees dropped the execution time to 18.8 ms. That's around a 55x speed-up
over the NumPy implementation, and around a 4000x speed-up over our earlier implementation
using native python.


While k-d trees do an amazing job reducing the time complexity of searching for nearest
neighbors, they gradually lose their advantage as the dimensionality of the data gets higher
and higher. To understand why, we need to understand what happens to the points when we move
into higher and higher dimensional spaces.

Figure 5-16: points can get further and further away from each other when the dimensions increase

As figure 5-16 shows, the same points that were neatly packed along a line end up scattered
away from each other in 3D space; as more dimensions come into the picture, more room exists
for the points to get away from each other. Although we can't visualize more than three
dimensions, we can still get statistical evidence for that fact by randomly sampling multiple
pairs of points from a known distribution and calculating the average distance between them.
Figure 5-17 depicts a plot of such an experiment, showing the growth of the distance between
randomly sampled points as the dimensionality of their space grows.
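One possible version of that experiment (the uniform distribution on the unit hypercube and the
specific dimensions below are just one convenient choice) looks like this:
import numpy as np

rng = np.random.default_rng(0)

# average distance between 1,000 random pairs of points in d dimensions
for d in [1, 2, 3, 10, 100, 1000]:
    a = rng.random(size=(1000, d))
    b = rng.random(size=(1000, d))
    distances = np.sqrt(np.sum((a - b) ** 2, axis=1))
    print(d, distances.mean())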


Figure 5-17: The average distance between two random points increases as the dimensionality of their space
increases

In very high dimensional spaces, where the points tend to lie far away from each other, the
way k-d trees partition the space is no longer efficient. As we have seen in our 2D example,
k-d trees divide the space into rectangular regions; in high dimensional spaces these become
hyper-rectangles. Because of the vast distances between the points in high dimensional spaces,
these hyper-rectangles end up empty or with so few points that the pruning is no longer worth
the hassle. Moreover, if we choose to make fewer splits so that each hyper-rectangle contains a
minimum number of points, the vast distances in the high-dimensional space make these partition
boundaries (which are made on a single dimension's value) look closer than the nearest local
neighbor, and hence we'll probably end up visiting all the branches of the k-d tree. Figure 5-18
depicts a similar scenario in 2D space.


Figure 5-18: Regions holding a minimum number of points, with vast distances between the points, might end up
closer than the local nearest neighbor even though they contain no globally closer points. The k-d tree's branch
and bound will still run, but with the same time complexity as exhaustive search

In such cases, it would be cheaper to run the exhaustive search directly, or to try a variant of
k-d trees called ball trees, which is also supported by scikit-learn 12. At the end of the day,
we can let scikit-learn pick the most convenient search algorithm by omitting the algorithm
parameter and falling back on its default value 'auto', which allows scikit-learn to look at the
data we're using and determine the best algorithm. Such properties that need to be tuned
manually and cannot be learned from the data are called hyperparameters; they are “hyper”
because they differ from the regular parameters that can be learned from the data itself (like
the partition points in the tree). The search algorithm we've been discussing is one of those
hyperparameters. Another essential one is the value of k, or how many neighbors we're going to
examine, and it plays a vital role in the performance of the k-NN method!

5.2.3 Tuning the Value of k


One can think of a machine learning model as an electric oven: some frozen or raw food goes
in, and a defrosted or cooked version of the same food comes out. To get this electric oven
running, some knobs need to be adjusted first for the amount of time and heat needed; the oven
needs these knobs to be set before it can do its job. Hyperparameters are the knobs of a
machine learning model!

12
You can read more about ball trees in sckiit-learn's documentations: http://scikit-learn.org/stable/modules/neighbors.html#ball-tree


Figure 5-19: Hyperparameters are the knobs of a machine learning model

Hyperparameters differ from regular parameters in that the machine learning model needs to
know their values beforehand so that it can start training, which makes them un-learnable from
the data. Moreover, in most real-world cases, the values of the hyperparameters can be the
barrier between a model working or not working: one set of hyperparameter values can give the
model superb performance, while another set can make random guessing work better than the
sophisticated model. This makes it seem like we're in some trouble here. The choice of a
hyperparameter's value is essential to the performance of the model, yet we can't seem to learn
a good value for it from the data. So what can we do?
One possible way is to try out different values of the hyperparameter manually and check
the model's performance at each of these values. Let's work for now on the value of k, or the
value of n_neighbors in our KNeighborsClassifier. We can simply start from the value of 1 and
go up to a certain limit (say 20), training a new classifier at each value and measuring its
performance on the held-out test set; the value with the most accurate results on the test set
is then our optimal value for k. While this seems like an okay strategy to follow, it suffers
from a subtle problem: how can we now make sure that our trained model plus its tuned
hyperparameter generalizes well to unseen data? Our test set is not unseen anymore; we used it
to tune the value of the hyperparameter. The totality of the model (its trainable parameters
plus its tunable hyperparameters) has now seen all the data and is fitted and tuned to work
well on all of it, including the test set. For the performance on the test data to be
representative of the model's generalization capabilities, that data must be totally hidden
from the model, but in our case here the test data has been snooped and contaminated!
A simple solution to this situation is to have two held-out sets instead of just one: one
to tune the values of the hyperparameters on, and the other to test the totality of the model
for generalization. This is indeed the most basic solution to the hyperparameter tuning
situation, and the extra held-out set is usually called a validation set or a development set.


It's called a validation set because it's used to validate the model before finalizing it, and
a development set because it's used during the development of the model. To apply this to the
classifier in our “Can I Eat That?” app, instead of going all the way back and splitting the
whole dataset into three portions, we can achieve the same effect by splitting the training set
we have now into two sets: a smaller new training set, and the validation set.
new_X_train, X_valid = X_train[:5525], X_train[5525:]
new_y_train, y_valid = y_train[:5525], y_train[5525:]
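As a side note, scikit-learn's train_test_split helper from sklearn.model_selection could
produce an equivalent split while also shuffling the rows (usually a good idea if the data has
any ordering); the test_size of 600 below just mirrors the manual split above, and because of
the shuffling the exact numbers that follow would come out slightly different:
from sklearn.model_selection import train_test_split

new_X_train, X_valid, new_y_train, y_valid = train_test_split(
    X_train, y_train, test_size=600, random_state=42
)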

With this new validation set we can now run our manual trial strategy, but instead of actually
doing it by hand, we can just write a for loop that goes over all the values of k we want to
test, validates each one on the validation set, and remembers the best performing classifier.
We then finally test that best performing classifier on the totally unseen test set we kept
from before.
best_score, best_k, best_classifier = 0., None, None
for k in range(1, 21):
    classifier = KNeighborsClassifier(n_neighbors=k)
    classifier.fit(new_X_train, new_y_train)
    score = classifier.score(X_valid, y_valid)
    if score > best_score:
        best_score = score
        best_k = k
        best_classifier = classifier

print("Best k: {}, Best Validation Score: {:.4f}".format(best_k, best_score))
print("Test Accuracy: {:.4f}".format(best_classifier.score(X_test, y_test)))

> Best k: 11, Best Validation Score: 0.8950
Test Accuracy: 0.9005

This little piece of code that showed us that the best value of k here is 11 is, if you remember
your basic programming training, nothing but a linear search algorithm. We're just iterating
through a list of values, searching for the one that gets us the best performance. Although
linear search may not seem like the most intelligent strategy we could use for hyperparameter
tuning, and that's true to some extent, it works well on small scales like our case here. But
sometimes there are indeed better methods we can use, and we'll see one of them next.

5.2.4 Choosing the Metric


While some hyperparameters (like the value of k) are hard to guess and require an expensive,
exhaustive tuning method like linear search, other hyperparameters (depending on the problem)
lend themselves to expert judgment. One of these hyperparameters in our k-NN model is the
choice of the metric.


In order to see why this is the case, we need to think a little more deeply about the nature of
the data we have and how our initial usage of the Euclidean distance treats it. If you recall,
all of our data features are categorical and were encoded into numerical values in order to be
able to run the Euclidean distance on them (as it was the most intuitive notion of a distance).
But if we look closely at that combination, we can see that we're making a very bad assumption.
Let's say that we're standing at point A, whose coordinates on the floor are (2, 6), and we want
to get to point B, whose coordinates are (5, 2). The Euclidean distance between these two
points is 5, and it makes sense to say that if we took 5 unit steps along the line connecting A
and B we'd eventually be at point B.

Figure 5-20: The Euclidean distance between two points represents how many unit steps we need to take to get
from A to B

Now let's make a similar argument but with mushrooms. Let's, just for now, describe a
mushroom by only two components: its cap-shape and its cap-color. Assume we have a
mushroom A whose cap is green and sunken, so its coordinates are (x, r). We have another
mushroom B which has a gray, bell-shaped cap, hence having coordinates (b, g). If we look
at our encoding dictionaries, we can map these coordinates to their respective numeric values:
(4, 6) for A and (0, 3) for B. The Euclidean distance between these two mushrooms is also 5, but
does it make any sense that a green sunken mushroom cap will turn gray and bell-shaped by
taking 5 unit steps along the AB line?!


Figure 5-21: Euclidean distance between two mushrooms is unintuitive

This bizarre morphing behavior and unintuitive situation come from the fact that using the
Euclidean distance on encoded categorical data forgets that these are just dummy encodings
with no order whatsoever; it imposes some sort of ordering and continuity on rather discrete
data. The usage of the Euclidean distance slipped in, without us noticing, an assumption that
our data behave like continuous numbers. Once we realize that problem, we can make a better
metric choice that respects the categorical nature of the data. One such choice is the Hamming
distance, defined over two d-dimensional points x and x' as:

d_H(x, x') = (1/d) Σⱼ 1[xⱼ ≠ x'ⱼ]   (the sum runs over the d components)

where the notation 1[c] denotes the indicator function, a very simple function that returns 1
if the condition c is true, and 0 otherwise.
With this definition, we can see that the sum in the Hamming distance formula simply counts
the number of components that differ between the two points; hence, the Hamming distance is
nothing but the fraction of differing components. The Hamming distance gives us a more
intuitive notion of distance (or, conversely, similarity) when working with categorical data.
For example, for the pair of mushrooms A and B we were investigating earlier, the Hamming
distance between them is 1, which means they are completely different from each other. For
another two 2D mushrooms encoded with, say, (1, 3) and (1, 5), we have a Hamming distance of
0.5, indicating a 50% difference (or a 50% similarity).


This is a more convenient notion of similarity when we're talking about categorical data such
as ours, and hence it's a good choice to set the metric argument of KNeighborsClassifier to
hamming instead of the default euclidean 13.
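With NumPy, the Hamming distance is just the mean of an element-wise comparison; here it is on
the two hypothetical encoded mushrooms from above:
import numpy as np

mushroom_a = np.array([4, 6])   # green, sunken cap
mushroom_b = np.array([0, 3])   # gray, bell-shaped cap

print(np.mean(mushroom_a != mushroom_b))               # 1.0: completely different

print(np.mean(np.array([1, 3]) != np.array([1, 5])))   # 0.5: half the components differ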
This expert's judgment can even be verified by our linear search approach. In addition to
searching over a list of values for k, we also add a list of metrics to search through for the
best performing classifier.
best_score, best_k, best_metric, best_classifier = 0., -1, None, None
for k in range(1, 21):
    for metric in ['euclidean', 'manhattan', 'hamming', 'canberra']:
        classifier = KNeighborsClassifier(n_neighbors=k, metric=metric)
        classifier.fit(new_X_train, new_y_train)
        score = classifier.score(X_valid, y_valid)
        if score > best_score:
            best_score = score
            best_k = k
            best_metric = metric
            best_classifier = classifier

print("Best k: {}, Best Metric: {}, Best Validation Score: {:.4f}".format(


best_k,
best_metric,
best_score
))
print("Test Accuracy: {:.4f}".format(best_classifier.score(X_test, y_test)))

> Best k: 11, Best Metric: hamming, Best Validation Score: 0.8950
Test Accuracy: 0.9065

Our linear search now looks for the combination of k and metric that gives us the best
performing classifier on the validation set, and the best result comes from setting k to 11
(as we saw previously) and choosing the Hamming metric, just as our judgment suggested. It's
worth noting that the linear search here is also called a grid search, because we can imagine
all the possible combinations of k and metric as points on a grid (as shown in figure 5-22)
and the algorithm looks through them all.
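scikit-learn packages this same pattern in the GridSearchCV object of sklearn.model_selection.
One caveat: unlike our manual loop with a single validation set, GridSearchCV scores each
combination with cross-validation (here 5 folds), so its numbers won't exactly match the ones
above. A minimal sketch:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {
    "n_neighbors": list(range(1, 21)),
    "metric": ["euclidean", "manhattan", "hamming", "canberra"],
}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.best_score_)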

13
The default metric is actually called the Minkowski metric (pronounced, min-kawf-ski), but in combination with another parameter's default value (the
parameter p, which defaults to 2) it becomes equivalent to the Euclidean metric. Check the documentation to learn more! (Pro-tip: always check scikit-
learn's documentation, it's awesome!)


Figure 5-22: Combinations of parameter values form a grid of points, so the linear search across them is called a
grid search

After testing that best classifier on the test set, we find that it's indeed the best one we've
had so far, with almost 91% accuracy! We're now almost ready to deliver a working “Can I eat
that?” app; it's only a matter of wrapping our scikit-learn code in a RESTful API and creating
a front-end to call it!

More on “expert's judgment” and hyperparameter tuning


Any hyperparameter can be susceptible to the expert's judgment method as long as we have sufficient
experience and knowledge about the domain and the problem we're working on. For example, the metric
hyperparameter became susceptible to expert's judgment because we know something about the nature of our
data, namely that it's categorical. However, if the data were continuous and we knew nothing else about it, the
metric would no longer be susceptible to expert's judgment. We'd no longer be “experts” and we'd have to resort
to a method like grid search.
As we noted before, hyperparameter tuning is crucial to getting well-performing models, but it's expensive,
especially when we tune multiple hyperparameters with lots of values. This makes spotting hyperparameters on
which we can use our experience and knowledge priceless! By exploiting our knowledge and experience in
setting one or more hyperparameters, we free up resources to focus on tuning the hyperparameters we can't
easily crack. That being said, we need to keep in mind that sharpening our expertise in order to exploit such
situations takes time and practice. By working on more problems, observing how the values of the
hyperparameters affect the performance of the model, and analyzing that effect in light of the problem, our
expertise will get sharper and it'll be easier to say how a given choice of hyperparameter will affect the model!

5.3 Is k-NN Reliable?


We now have empirical evidence that the k-NN method performs well on the mushroom
edibility detection problem we encountered in the “Can I Eat That?” app. But a
question could be raised: is that evidence enough?!

One could argue that the dataset we have is a rather small one, and this in turn makes the test
set on which we evaluate the model's generalization abilities even smaller. All this can make
us uncertain about the reliability of our model.
Two things can raise our confidence though: obtaining more data to test the model on and get
more stable statistical evidence, or having some theoretical guarantee about k-NN's
generalization abilities regardless of the problem we're working on. The first option is not
accessible to us because we don't have more data; the only way to get more testing data is to
take from the training and validation data, which would damage the performance of the model.
On the other hand, with a tiny bit of math, we can get a theoretical guarantee that k-NN is
indeed reliable. In the upcoming analysis, we're going to see that the 1-NN method achieves a
generalization error that is at most twice the error of the best classifier we could ever think
of. Though we're limiting our analysis to the case where k = 1, similar arguments can be made
for k > 1; they just get unnecessarily messier. That's why we're sticking with k = 1 to keep
our math simpler and more manageable. Before starting our analysis, though, we need to
introduce some terms and concepts that we'll see in the upcoming analysis and that will stay
with us throughout the book.

HYPOTHESES AND RISK

Any model we try to build is eventually trying to approximate the true target function between
the features and the labels (we talked about what target functions are in chapter 4). In order
to do so, we hypothesize that the target function takes the form of our model and then try to
fit the model to the data we have in order to approximate the true relation. If we look at
what we have done with the “Can I Eat That?” app, we'll find that we started by hypothesizing
that the true relation between the mushrooms' features and their toxicity is uniformly
continuous, which led us to hypothesize that k-NN would be a suitable method to
approximate that relation. That makes our k-NN model what is referred to as a hypothesis. In
theoretical analysis, it's more common to refer to models as hypotheses, and to denote
them shortly with the letter h. That's something we're going to use a lot throughout the book.

Another concept that we’re going to see a lot in our theoretical analyses is the concept of
risk. Risk is simply the true generalization error of our model, or our hypothesis; that is the
average of error/loss across all the possible data points that can be sampled from the
distribution that generated our training data. We denote the risk of a hypothesis h as R(h) and
define it as the expected value of the hypothesis’ loss or error over all the possible data
points, or:

The concept of risk is closely tied to the test error we use to evaluate our trained machine
learning model. To be more precise, the test error is a sample estimate of the risk.
We define the risk R(h) to help us make general statements about how a model performs without
being tied to a specific dataset; that's why it comes up a lot in theoretical analysis. You may
be asking, “why not just call it error? why risk? why the extra terminology?” I guess the
reason behind that is to avoid ambiguities when we write our math: if we called it error and
denoted it with E, that E and the E of the expected value could cause a lot of confusion in one
equation. So, for the convenience of our math, we use a synonym of error, which is risk, and
denote it with R, clearing up the possible confusion.

The following sections are a bit mathy, but the math shouldn't be scary; it's just the usual stuff we learned
about in chapter three combined with some algebraic manipulations. So take a deep breath, keep your paper
and pencil close, and always remember to read what the equation is trying to say, not its symbols.

5.3.1 The Bayes Optimal Classifier


To prove the reliability of the 1-NN model, we need a few moments to understand just what
exactly that “best classifier we could ever think of” thing is. Imagine for one second that
we know the true conditional probability distribution that generates the class of a
mushroom given its visual features, that is, the distribution P(Y|X). Using that distribution
we can create a very simple and very powerful classifier that assigns a mushroom its most
probable label given its features:

h*(x) = argmax_y P(Y = y | X = x)
Because we're dealing with a binary classification problem where the label's random
variable can take only two values {0, 1}, we can simplify the notation a bit: we'll choose η(x)
to equal P(Y=1|X=x), and hence P(Y=0|X=x) will be equal to 1 – η(x) (this follows directly from
the fact that all probabilities must sum to 1) 14. Moreover, and again because we're dealing
with just two possible values for Y, the only case where η(x) is greater than or equal to
1 – η(x) happens when η(x) is greater than or equal to 0.5. With this observation and the
reduced notation, we may write our classifier in a more compact way:

h*(x) = 1 if η(x) ≥ 0.5, and h*(x) = 0 otherwise
14
The symbol η (pronounced, eta) was chosen arbitrarily, and what we did here was just a shorthand notation to make our math easier and more clear to
read.


The asterisk in that classifier's name is to say that this is the most optimal classifier we can
ever think of. Think about it for a few moments: we're doing the classification based on the
true underlying probability distribution that governs the mushroom toxicity phenomenon; what
in the world can be better than the thing itself! There exists a rigorous proof, but I strongly
believe the intuition we just established is enough for now. This optimal classifier is also
called the Bayes classifier, a name that comes from the fact that the conditional probability
P(Y|X) (which we do not directly observe) can be calculated from the probabilities P(X|Y) and
P(Y) (which are directly observed from the label of an instance of the phenomenon and its
associated features) using the Bayes rule:

P(Y|X) = P(X|Y) P(Y) / P(X)
Because it's almost impossible to accurately model the probabilities P(X|Y) and P(Y) from
arbitrary data, the Bayes classifier is not achievable in practice. However, because it achieves
the lowest possible generalization error, it remains a good theoretical tool to benchmark
other models against. But in order to do that, we must first know how large that lowest
possible error is, and the best way to find out is by looking at where it comes from.

Figure 5-23: the likelihood probabilities for X given both values of Y. The source of error stems from the
overlapping shaded area in the middle


In figure 5-23 we see the likelihood probabilities for a feature x given both the values of Y.
Let's look closely at the shaded overlap area in the middle:

• One point x1 can have a label of 1, as it still has a nonzero probability under the distribution
P(X|Y=1). However, its probability under the P(X|Y=0) distribution is higher, which
makes P(Y=0|X=x1) higher in return. This will result in the Bayes classifier saying that it
belongs to class 0, and that's an error! So a class 1 point will be erroneously classified
as class 0 if it lies in the part of the overlap region where P(Y=1|X) is smaller than P(Y=0|X)
(to the right of the vertical line in the figure), and such an error happens there
with the probability of the point being of class 1: P(Y=1|X), or simply η(x)
• Similarly, another point like x0 can have a label of 0, as its P(X|Y=0) probability is
nonzero even though P(X|Y=1) is higher in that region (to the left of the vertical line in
the graph). In that case, the probability of the Bayes classifier making an error is
the probability of the point being of class 0, which is 1 - η(x)

From this analysis, we can see that the probability of the Bayes classifier making an error at
a point is the probability of that point having a label different from the one its highest
conditional probability says, and this happens only in the shaded overlapping region, whether
η(x) is less than 1 - η(x) (to the left of the vertical line) or 1 - η(x) is less than η(x) (to
the right of the vertical line). So overall, the probability of the Bayes classifier making an
error at a point x is just the minimum of η(x) and 1 – η(x):

P(h*(x) ≠ Y | X = x) = min{η(x), 1 – η(x)}
This makes the generalization error of the Bayes classifier, or the risk R(h*) 15, just the
expected value of that error over all possible values of x:

R(h*) = E[ min{η(x), 1 – η(x)} ]

This is the lowest possible generalization error that any classifier can hope to achieve; no
classifier can achieve a generalization error lower than that! That's why it's also called the
irreducible error: no model whatsoever can reduce the generalization error beyond it.
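As a quick sanity check with made-up numbers: if η(x) = 0.9 for every x (the features almost
always, but not always, pin the label down), then min{η(x), 1 – η(x)} = 0.1 everywhere and
R(h*) = 0.1. Even the perfect Bayes classifier is wrong 10% of the time, simply because the
labels themselves are that noisy.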

5.3.2 Reliability of 1-NN


To see the relation between the 1-NN generalization error and the Bayes optimal error, we
need first to analyze the source of error in the 1-NN classifier. The 1-NN classifier can make an
error if the point x is of class 0 while its nearest neighbor xnn is of class 1, or vice-versa.

15
The risk of a hypothesis is the expected value of that hypothesis's error over all the possible data points. Another type of risk, which we'll see in later
chapters, is called the empirical risk, which is the expected value of the hypothesis's error over the training data points only.


Figure 5-24: How 1-NN can make an error

To translate this into probabilities, we consider each of these two cases as a joint
probability of two events. For the first case we want P(x is of class 1, xnn is of class 0),
and because the two events are independent of each other (the label of one point doesn't
affect the label of another), we can write that joint probability as the product of the two
events' probabilities:

P(Y = 1 | x) × P(Y = 0 | xnn) = η(x) (1 – η(xnn))
and similarly, the probability of the second case can be written as:

(1 – η(x)) η(xnn)
The total probability of 1-NN making an error is then the probability of either the first
case or the second case happening, which is just the sum of the probabilities of both cases:

η(x) (1 – η(xnn)) + (1 – η(x)) η(xnn)
Starting from that expression for the total 1-NN error, we make a harmless modification of
adding and subtracting η(x) inside any term that contains η(xnn), and then expand and group
similar terms together:

η(x) (1 – η(xnn)) + (1 – η(x)) η(xnn)
  = η(x) (1 – η(x) + η(x) – η(xnn)) + (1 – η(x)) (η(x) + η(xnn) – η(x))
  = 2 η(x) (1 – η(x)) + (2η(x) – 1) (η(x) – η(xnn))

Because η(x) is a probability, we know that 0 ≤ η(x) ≤ 1, which makes |2η(x) - 1| ≤ 1, and
hence we can say that:

η(x) (1 – η(xnn)) + (1 – η(x)) η(xnn) ≤ 2 η(x) (1 – η(x)) + |η(x) – η(xnn)|
By taking the expected value of both sides and noticing that the risk of the 1-NN classifier is
just the expected value of its error over all possible x, we can say that:

R(1-NN) ≤ 2 E[η(x) (1 – η(x))] + E[ |η(x) – η(xnn)| ]
We can notice that E[η(x)(1 – η(x))] ≤ E[min {η(x), 1 – η(x)}], which is the risk of the Bayes
optimal classifier, or the irreducible error. This holds because both η(x) and 1 – η(x) are
between 0 and 1, so their product is no larger than the smaller of the two (think about the
product of 0.5 and 0.25, for example). With this observation we can say that:

R(1-NN) ≤ 2 R(h*) + E[ |η(x) – η(xnn)| ]
And here's what we're looking for! We're just one step away from proving that 1-NN indeed has
a generalization error at most twice the irreducible error; we only need to get rid of the
expectation in the second term, and to do that we go back to the initial assumption about
similarity methods we started this part with: uniform continuity. Back in the prelude we
explained that a similarity-based learning method (such as k-NN) won't work unless the target
function we're trying to learn happens to be uniformly continuous, and because the target
function in the case we're studying here is inseparable from the probability η(x), it's safe to
say that our 1-NN won't work well unless η(x) is uniformly continuous as well. Actually, a
strong form of uniform continuity is required for the method to work well, a form where we can
say that, for some constant c and any two points x and x':

|η(x) – η(x')| ≤ c ‖x – x'‖
That form of strong uniform continuity is called Lipschitzness (pronounced, lip-chitz-ness).


That Lipschitz condition (the inequality above) ensures that the change in the value of the
function from point to point is bounded by the difference between the points multiplied by
some constant c; that is, the values of the function do not suddenly explode. This is required
to ensure that the target function goes smoothly from point to point without blowing up at
each step; if the function exploded, two neighboring points wouldn't have similar labels.


By applying the Lipschitz condition to our 1-NN's risk inequality, we can obtain the following:

R(1-NN) ≤ 2 R(h*) + c E[ ‖x – xnn‖ ]
Now, when the size of the training set becomes sufficiently large, we can expect that for every
point x there will exist an extremely close neighbor xnn such that the difference between them
is almost zero. In such circumstances, the second term in the above inequality approaches zero
and we finally get that:

R(1-NN) ≤ 2 R(h*)
While this constitutes a proof of the effectiveness of the 1-NN rule (and of k-NN in general,
with some more math required), we also learn something extremely important about the
k-NN model. We saw from our analysis that such effectiveness is only achieved when the size
of the training data is sufficiently large. Being sufficiently large is not just about the
number of training instances: if we take into account the discussion we had earlier about the
dimensionality of the data and the fact that distances increase as the number of dimensions
increases, we'll find that while 10K points may work well in 5 dimensions, we'll probably need
1M for 500 dimensions to be able to cover the vast distances in that higher-dimensional space.
This tells us that for k-NN to work well, the size of the training data needs to be large
enough to cover the dimensionality of the data. When we look at the usual case, where the size
of the high-dimensional data is not really sufficient to cover the vastness of the
high-dimensional space, we could say that k-NN is not generally recommended for
high-dimensional data. This is a manifestation of what is called the curse of dimensionality in
machine learning, where the high dimensionality of the data makes machine learning models
behave very poorly despite the fact that they perform very well on lower-dimensional data.

k-NN for Regression


Although we focused all of our discussion on using k-NN for classification, it can just as easily be used for
regression problems, where our labels are real numbers. The method works exactly the same as for
classification, except for the final step where, in the classification setting, we take a vote among the labels. In
the regression setting, we instead take the mean of the labels of the k nearest neighbors and report that as the
prediction of our model.
Scikit-learn provides the KNeighborsRegressor object that does just that. It can also be imported from
sklearn.neighbors, and it has the same parameters as the KNeighborsClassifier we worked with, as
well as the same API that is universal across the framework.
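A minimal sketch of that regressor on made-up one-dimensional data (the array values below are purely
illustrative) looks like this:
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# made-up training data: y is roughly 2 * x with a little noise
X_train_reg = np.array([[0.], [1.], [2.], [3.], [4.]])
y_train_reg = np.array([0.1, 2.2, 3.9, 6.1, 8.0])

regressor = KNeighborsRegressor(n_neighbors=2)
regressor.fit(X_train_reg, y_train_reg)

# the prediction for x = 2.5 is the mean of its 2 nearest neighbors' labels
print(regressor.predict([[2.5]]))   # mean of 3.9 and 6.1, i.e. 5.0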


6
K-Means Clustering

In this chapter we take a look at a clustering problem where we try to find distinct groups of a
wholesale distributor's customers that share a common behavior, so that we can devise targeted
marketing campaigns that would attract more customers and help the business flourish. This type
of clustering problem is usually called customer segmentation, and it's one of the most
common clustering problems one may encounter. Many approaches can be used to solve
such a problem, and here we're going to look at one of them, which stems from the similarity-
based approach we started this part of the book with. The method we're going to work with
here is called k-means. As the name suggests, k-means tries to find k mean points (central
points, or centroids) around which groups of the data points cluster.
We're going to learn how k-means works in detail, how we can use scikit-learn to
apply it to our problem, and how exactly we can figure out the right value of k. Down the
road, we're going to take a closer look at why k-means works, its limitations, and the hidden
assumptions that lie behind it. But first of all, and before we can do any of that, we'll start
by defining the problem we have, understanding the data associated with it, and doing the
necessary processing to get everything ready for k-means to work its magic.

6.1 A New Marketing Plan for a Wholesale Distributor


A friend of yours has co-owned a wholesale distributing business for a few years now, and
they have hundreds of customers doing business with them. Now they want to extend their
reach and establish a new warehouse and shipping fleet in a new city. To get a quick jump
start in that new city, they need to devise a marketing plan that would help them attract
customers quickly. In such cases, relying on past experience and employing a data-driven
approach is one of the key components of making good decisions.


In that light, your friend provides you with a dataset 16 of around 400 customers they have
here in the city, in which the amount of money spent by each customer on each of their product
types is recorded. They ask you to help their marketing team understand their customers' needs
and behavior better.
As we have learned so far, the first step in approaching any data-driven problem is to look
at the data itself. We start by taking a look at a random sample from the wholesale.csv file
as it appears in figure 6-1.
data = pd.read_csv("../datasets/wholesale.csv")
data.sample(10, random_state=42)

Figure 6-1: A random sample from the wholesale customers dataset

We can see from the sample that each customer is listed with 8 features; the first two are
categorical values called Channel and Region. Channel represents the type of the customer:
is it a service provider like a hotel, restaurant or cafe (encoded with the value 1), or is it a
retail store (encoded with the value 2)? The Region feature encodes the region where the
customer is located, which is not very important for our analysis here, so we're not going to
pay much attention to it. The remaining six features are continuous values that represent how
much money the customer spends annually on each type of product.

16 The dataset we're going to work with is the Wholesale customers dataset listed in UCI's machine learning repository

(https://archive.ics.uci.edu/ml/datasets/wholesale+customers)


For example, figure 6-1 shows that customer 82 spends $11,009 annually on groceries. If we want
to understand customers' purchase behavior in order to devise targeted campaigns and decisions,
then these six features will be our keys to achieving that. Together, they encode the
purchasing behavior of each customer, and any attempt to identify that behavior should start
from them.
A good marketing campaign is one that aligns with what customers need, and customers'
needs are reflected in their purchase behavior; so to make sound marketing decisions, we
need to look at the distinct purchasing behaviors common among our customers in order to
target them efficiently. This task is called customer segmentation or market
segmentation, where the potential market or population of customers is divided into groups,
each group sharing a similar purchasing behavior. This is one of the most straightforward
examples of an unsupervised clustering problem: we have a set of customers with no direct
clue about their purchasing behavior, nor any clue about what the possible purchasing
behaviors could be; however, we want to assign each of them to a group that shares a similar
behavior. Seems like a challenging task! However, building on the similarity-based work we
have been doing before, we can start from the phrase “similar behavior” and let it lead us down
the road to the solution.

6.1.1 The K-means Method


When we think about a group of entities, it's natural to think that this group is centered
around something, whether this thing is tangible or not. A group of guests sitting at a wedding
table are centered around that table, and a group of friends are centered around the values
and beliefs they share.
We can apply the same idea of a group's center to our customer segmentation problem.
Let's only consider two types of products for now, say Milk and Frozen, so that we can plot
them on a 2D grid with Milk on the horizontal axis and Frozen products on the vertical one, as
in figure 6-2. We'd expect customers who buy a lot of Milk products and few Frozen
products to be centered around some representative point at the bottom-right corner of the
grid, where Milk purchases are high and Frozen purchases are low. On the other hand,
customers who buy a lot of Frozen products and few Milk products are expected to group
around some point at the top-left corner of the grid.


Figure 6-2: Customers who share a behavior of buying a lot of milk products and few frozen products are
clustered around a bottom-right corner point, while customers who buy lots of frozen products and few milk
products are centered around a top-left corner point

This gives us a start on how to solve our customer segmentation problem: in order to find k
distinct groups within the data (which we'll denote Ci, with i being an index between
1 and k), we can find k distinct points around which these groups are clustered; let's call
these points centroids (and denote each of them μi). It makes sense to expect that in a good
clustering, the average distance between each point in a cluster and the cluster's centroid is
as low as possible. In other words, if we define a quantity I to be:

I = Σi (1/|Ci|) Σx∈Ci ‖x – μi‖²   (the outer sum runs over the k clusters)
Then the best clustering would be the one with the lowest possible value of I. The value of I
is often called the inertia of the clustering, and it's a direct mathematical translation of
the paragraph above: it is the average squared 17 distance between a cluster's points and its
centroid, summed over all k clusters.

17 Squared distance is used without taking the square root for mathematical convenience


That mathematical translation provides us with something we can put into an algorithm to check
whether a clustering is good or not; the question now is how to come up with such a good
clustering.
Actually, finding the absolute lowest value of I, and hence the best k clusters within the
data, is one of the hardest problems in the world! If we have m data points, then there are
almost kᵐ possible ways to group them into k clusters 18; searching through them to find the
clustering with the absolute lowest value of I is impossible. To give you a hint of how
impossible that is, it's enough to know that with a million computers checking configurations
at a rate of one billion configurations per second, it'd take around 8×10⁸⁸ billion years to find
the best two-clustering of the 400-something customers we have on our hands! This
infeasibility of finding the best solution doesn't mean we can't solve our customer segmentation
problem; we can still find a solution, maybe not the best, but a good one. That's what the
k-means method is here to do.
K-means proposes that instead of checking all the possible k-clustering configurations, we
start by picking some random centroids and move them around step by step, in a manner that
decreases the value of I at each step. Let's get concrete and see how that would work
on the small 2D customer data we saw back in figure 6-2. We start in figure 6-3(a) with two
randomly chosen centroids (the red and blue crosses). By calculating the distances between
the points and these centroids, we assign each point to the cluster whose centroid is the
closest, resulting in the clustering we see in 6-3(b). Assigning each point to its closest
centroid surely decreases the value of the inertia I, but there's still room to decrease it
even more. Look, for example, at the red cluster: the bulk of points at the bottom-right corner
are farther from their centroid than the points in the top-left corner. The distances to that
bulk of points down there increase the value of I, so maybe if we moved the red centroid
halfway between the points above and below, we'd get I to decrease even more. This can be done
by moving the centroid to the mean point of all the points in the red cluster; that is, we
reassign μi to be:

μi = (1/|Ci|) Σx∈Ci x
This reassigning of the centroids leads to the points shown in figure 6-3(c). Now that we have
new centroids, we repeat the same two processes again. We reassign the points to their
closest centroids and move the centroids to the means of the new points. This will lead us
eventually to the clustering in 6-3(f) that matches what we had in figure 6-2.

18 There are k possible ways to assign a cluster to the first point, then k possible ways to assign the second point, and so on and so forth. The total number of
configurations would be k×k×k … m times, or simply kᵐ.


Figure 6-3: the iterative process of k-means in clustering the data into two groups. (a) two random centroids
are chosen. (b) points are assigned to the closest centroid to decrease the inertia I. (c) centroids are moved to
the mean point among the cluster points to decrease I even more. The process is then repeated from (d) to (f)
until we reach the final clustering at (f), matching what we have in Figure 6-2.

We can outline this iterative process in the following four steps (a compact NumPy sketch of the loop follows below):

1. Initialize the centroids to random points in the features' space. This can be as simple as
   randomly choosing one of the training points as a centroid.
2. Assign each point to the cluster whose centroid is the closest.
3. For each cluster, move its centroid to the mean point among its assigned points.
4. Repeat steps (2) and (3) until there is no significant decrease in the value of the inertia
   I.
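
Here is the promised sketch: a minimal NumPy implementation of those four steps (an illustration, not scikit-learn's actual code). It stops when the centroids stop moving, which is equivalent to the inertia no longer decreasing significantly, and it assumes no cluster ends up empty.

import numpy as np

def simple_kmeans(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k of the training points as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # step 2: assign each point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # step 3: move each centroid to the mean of its assigned points
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # step 4: stop once the centroids (and hence the inertia) no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids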


This algorithm is the essence of the k-means method; some steps can be altered in a way that
could make the algorithm perform faster or more efficiently, but the essential idea remains no
different from what is outlined here. We can start applying that algorithm (along
with the performance modifications we mentioned) on our wholesale customers dataset using
scikit-learn's KMeans class. The sklearn.cluster module contains many different
clustering methods; one of those is our k-means method, implemented in the class called
KMeans. Like any other scikit-learn tool, it's just a matter of object instantiation and fitting to
get a running k-means model. But before we do that, we need to isolate our data of interest in
order to feed them into the fit method. As we stated before, we need to cluster our customers
based on their purchasing behavior, so all the features except Channel and Region are of
interest to us.
X = data.loc[:, "Fresh":"Delicassen"]

Here we used the slice operator ":" to select all the features from Fresh down to Delicassen,
the same way it's used with regular python lists, but this time with column labels instead of
indices 19. Now that we have our data of interest, it's a good idea to get their descriptive
statistics and get a better understanding of them, and maybe reveal some defects that would
affect our model's performance (like the missing values we found in the diabetes data back in
Chapter 2). Instead of running each descriptive statistic method on X, we can directly call the
describe method, which calculates most of them for us and displays them in a beautiful tabular
format, as it appears in figure 6-4.
X.describe()

The good thing is that the statistics show that there are no missing values in the data and
everything seems to have a valid value, but the statistics reveal another type of problem. If
we look at the standard deviation of each feature, we'll see that the features vary across
different ranges. Sometimes the ranges are vastly different, as we can see with the Fresh and
Detergents_Paper features. These huge inconsistencies between the variances (or the standard
deviations) of the features can be harmful to any model that uses Euclidean distance in its
inner workings. Because k-means is one of those models, we need to handle this problem
before we can start fitting our model.

19 Another difference is that the end label is included in the slice while the usual python behavior is to exclude the end index of the slice


Figure 6-4: summary of descriptive statistics for each feature in the customers data

6.1.2 All Features Shall be Equal


So what's the problem with having such different variances between the features? And why is it
problematic for models that use Euclidean distance specifically? Let's look at a concrete
example of such a problematic situation and see where the issue lies. Consider the three data
points from our dataset shown in figure 6-5. Suppose the last two were chosen at random as
centroids for two different clusters, and we now want to determine to which of the two clusters
the first point belongs.

Figure 6-5: Two randomly chosen centroids selected from the dataset; the task is to decide to which cluster the
first point belongs. Looking at the values of all features, one would think that it's closer to the first centroid, but
the calculations have a different opinion.

By comparing the values of our point to their corresponding values in the two centroids, it'd
seem intuitive that our point should belong to the first centroid, as we have four out of six
features (Milk, Grocery, Frozen, and Delicassen) that are closer in value (aka, more similar) to the
first centroid than they are to the second centroid. However, calculating the Euclidean distance
suggests otherwise.
import numpy as np


bad_X_df = X.loc[[110, 124, 240]]


bad_X = bad_X_df.to_numpy() # the underlying ndarray

def d(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

print("Distance to 1st centroid: {:.2f}".format(d(bad_X[0], bad_X[1])))


print("Distance to 2nd centroid: {:.2f}".format(d(bad_X[0], bad_X[2])))

> Distance to 1st centroid: 24396.94


Distance to 2nd centroid: 14915.14

The calculation clearly says that the second centroid is closer, so where did things go wrong?
A good idea is to look at the squared difference of each feature in the two calculations and see
what's working under the Euclidean distance hood. We can achieve that, with a nice display as
we can see in figure 6-6, by squaring the subtraction of our point from the whole
bad_X_df data frame.
(bad_X_df - bad_X_df.loc[110]) ** 2

Figure 6-6: the squared difference on each feature between the two centroids and our data point

As we can see in figure 6-6, the squared differences of the four features we talked about are,
for the first centroid, at least one order of magnitude smaller than their counterparts for the
second centroid. However, the squared difference of the Fresh component somehow makes up
for those smaller terms, and the dissimilarity in the Fresh component ends up outweighing the
similarity in the other components. Because the variance of the Fresh component is very high
(around 160×10^6), we'd expect it to have very different values among the different data
points; there's no problem in that. The problem is that when the Euclidean distance formula
starts working with such high-variance components, these components seem to be
undeservedly given a higher weight over the other components in the formula. The vast
differences in these components suppress any similarity in the other components that have
lower variance. The interaction between the high-variance components and the Euclidean
formula introduces some injustice between the features, while all the features should be
treated equally. The question now is how to enforce such equality? The key to answering that
question is what we call feature scaling.
Feature scaling is the process in which we change the values of our features so
that all of them fall on the same scale and have consistent ranges. This process is one of the


most common processing methods applied to data before running them through a
machine learning model. There is hardly any machine learning model that doesn't need some sort of
feature scaling before the data is fed through it. Any model that uses Euclidean distance or
any similar construct would necessarily require the features to be scaled before they can be
used; it's the only way we can ensure that different ranges or variances don't introduce
any unwanted assumptions about the features being unequal. One common scaling method
that is used a lot throughout the machine learning world is standardization. In standardization,
for each feature, we transform all of its values x to x' via the following formula:
$$ x' = \frac{x - \bar{x}}{s} $$
So to transform a feature value x, we subtract from it the sample mean of the feature and
then divide the result by the sample standard deviation of the feature. When applied to all
features, this results in values that are tightly packed around 0 with a standard deviation of 1,
hence bringing everything onto the same scale without changing the relative positions between
one point and another. This can be seen in figure 6-7, which shows our customers data points
plotted by their Fresh and Grocery components (the two highest variance components) in both
the original and standardized versions.

Figure 6-7: The original and standardized plots of the wholesale customers dataset on the two features Fresh
and Grocery. As we can see, standardization didn't change the relative positions between the points, only the
scale of the values.
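
As a quick sanity check (a sketch for illustration, not part of the book's pipeline), we can standardize a single column by hand with the formula above and confirm that the result indeed has a mean of 0 and a standard deviation of 1. Note the ddof=0 argument, which makes pandas use the same population standard deviation that scikit-learn's scaler (introduced next) uses.

fresh = X["Fresh"]
fresh_std = (fresh - fresh.mean()) / fresh.std(ddof=0)   # x' = (x - mean) / std
print(round(fresh_std.mean(), 6), round(fresh_std.std(ddof=0), 6))   # ~0.0 and 1.0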


To quickly apply standardization on our datasets, scikit-learn provides the StandardScaler
class under the sklearn.preprocessing module 20. It shares the same consistent API of fitting
the model, and here fitting means calculating the sample mean and standard deviation of
the features. However, instead of a predict method, preprocessing classes have a
transform method that, in this case, takes in data in the original feature space and applies
the standardization process using the previously fitted means and standard deviations 21.
Using StandardScaler we can apply the standardization technique to our data set and
see if the problem we had with the distances of the bad_X points would be solved or not.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
scaled_bad_X = scaler.transform(bad_X)

print("[Scaled] Distance to 1st centroid: {:.2f}".format(d(scaled_bad_X[0], scaled_bad_X[1])))


print("[Scaled] Distance to 2nd centroid: {:.2f}".format(d(scaled_bad_X[0], scaled_bad_X[2])))

> [Scaled] Distance to 1st centroid: 1.99


[Scaled] Distance to 2nd centroid: 2.17

The problem we had with these points is now solved, and the calculations align with the
intuition we had. Scaling brought justice to the features; all of them are being treated equally
now.

6.1.3 Applying K-means with scikit-learn


Now that we've fixed the scaling problem we had in our data, we're ready to apply the KMeans
class from sklearn.cluster to start clustering our customers data. We set the
value of k (or the parameter n_clusters) to be 5 for now.
from sklearn.cluster import KMeans

scaled_X = scaler.transform(X)

model = KMeans(n_clusters=5, n_init=20)


model.fit(scaled_X)

print("Minimum Inertia: {:.2f}".format(model.inertia_))

> Minimum Inertia: 1058.77

20 Other scaling methods exist under the same module, learn about them in http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing-
scaler
21 If we wish to fit a preprocessor on some data and transform that data as well, we can use the fit_transform method, which runs the fit method and then transform internally in one call.


K-means is sensitive to Initialization


You might be wondering what that n_init parameter is. This parameter specifies the number of times
the k-means algorithm is run before reporting a result. Note that this is not the number of iterations;
it's the number of full runs from random centroid initialization to termination. Your question now is, probably,
why would we need to run the algorithm multiple times in the first place?
The need to do so stems from the fact that the clusters we end up with depend on the locations of the
initial centroids: one initialization of the centroids may lead to a different clustering than another
initialization, and hence a different inertia. That's what we mean when we say that the k-means algorithm is
sensitive to initialization. To mitigate this sensitivity, and make sure that we reach the best possible
clustering every time, we run the algorithm multiple times and report the run with the lowest inertia. To see this
in action, I invite you to reduce n_init in the above code to 1, run it multiple times, and observe how different
the values of inertia are.
You might be worried now that this measure increases the computation time and will slow down the process,
especially when the data grows larger. This is true, but on the bright side, there exist some options in the
KMeans class that make each run go faster. One of these options is the fact that the parameter
init is set by default to "k-means++", which makes a smarter centroid initialization than mere random
initialization. With "k-means++", centroids are chosen so that they are as far apart as they can be. This has
the effect of speeding up the convergence on the final centroids. Another option is setting the algorithm
parameter to "elkan", which cleverly uses the triangle inequality to reduce the number of distance
calculations in the process. By default, scikit-learn chooses automatically between "elkan" and "full" (which
runs the traditional algorithm) given the type of the data.
The last option is to use the n_jobs parameter and utilize parallel hardware in order to run each of these
different runs in parallel, thus reducing the overall time needed to fully fit the k-means model. So scikit-learn
comes equipped with all the options that would ensure a seamless execution of the k-means algorithm while
ensuring its robustness against random initialization.
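
Here is a minimal sketch of the experiment suggested above (reusing the scaled_X array we already prepared): with a single initialization per run, the final inertia can vary from one random seed to another.

# with n_init=1, each run keeps whatever local optimum its random start leads to
for seed in range(5):
    single_run = KMeans(n_clusters=5, n_init=1, random_state=seed)
    single_run.fit(scaled_X)
    print("seed {} -> inertia: {:.2f}".format(seed, single_run.inertia_))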

Now we have our data clustered into 5 different groups. The inertia of the clustering we have
can be found in the inertia_ attribute of the model, as we can see above. The clustering
itself can be found in the labels_ attribute, which is an ndarray whose elements are
integers between 0 and k - 1 indicating to which of the k clusters the corresponding point in
the fitted data belongs, while the centroids can be accessed from the cluster_centers_ attribute.
However, when we look at these attributes, we're only going to see numbers, and don't get
me wrong, these numbers are very important, as they're what provides the basis for any further
analysis we're going to do on the data in order to come up with a better understanding of the
customers and devise an efficient marketing plan. But it would be nice if we could have some
visual representation of these results, wouldn't it? The problem is that our data points live in a
six-dimensional space, and that's beyond our human capabilities to visualize. We can,
however, reduce that dimensionality to something that we can easily visualize!
Back in chapter 4 when we were talking about the different types of machine learning
problems, we mentioned something called dimensionality reduction as a type of unsupervised


learning. In dimensionality reduction, we try to compress a large set of features into a smaller
representative set, and this perfectly matches our need here; as we need to compress six
purchasing features into a smaller number (say 2) that we can visualize while ensuring the
compressed version is representative. This makes visualization one of the most common
applications of dimensionality reduction methods.
To visualize our clustering, we're going to use a class from sklearn.decomposition
called PCA, which is short for Principal Component Analysis. With PCA, we try to map the large
number of features into a specified number of principal components (defined by the
parameter n_components), which in this case is going to be 2.
from sklearn.decomposition import PCA

reducer = PCA(n_components=2)
reduced_X = reducer.fit_transform(scaled_X)

print("Reduced Shape: {} vs. Original Shape: {}".format(reduced_X.shape, scaled_X.shape))

> Reduced Shape: (440, 2) vs. Original Shape: (440, 6)

We can see that PCA was indeed able to transform our data array from a shape of (440, 6) to
a shape of (440, 2), and we now can use these two compressed features (or principal
components) to visualize our data on a 2D grid. We won't go into details here about how PCA did
it; we'll cover that in a later chapter, and until we reach that chapter we're going to use PCA
as a black-box tool for visualization. Up until this point, we have been using pandas plotting
capabilities to visualize our data, but our reduced_X data is a plain ndarray and not a
pandas data structure, which makes using the pandas plotting methods not an option. We could
wrap it up into a pandas DataFrame so that we can use the DataFrame.plot object, but I'd
like to take this chance to introduce one of the pillars of python's data analysis environment,
which is the matplotlib package.
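
Before we move on to plotting, one quick optional check (a sketch; not required for the clustering itself) is to look at the fitted reducer's explained_variance_ratio_ attribute, which tells us what fraction of the original variance each of the two principal components retains.

print(reducer.explained_variance_ratio_)          # fraction of variance kept by each component
print(reducer.explained_variance_ratio_.sum())    # total fraction kept by the 2-D view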
We mentioned the matplotlib package in Appendix-A when we were talking about the
%matplotlib inline magic. We mentioned that matplotlib is the plotting library in python
and that any visualizations or plots we're going to see throughout the book will definitely be
made with matplotlib either directly or indirectly; even the plots we made back in chapters 2
and 3 using the pandas DataFrame.plot object were made under the hood with matplotlib.
Acquiring the skill of using this powerful tool will allow us to visualize any form of data
without the need to depend on pandas data structures. The main module we're gonna use is
matplotlib.pyplot, which contains all the methods we need to make our visualizations,
including the scatter method that we're going to use now to plot our reduced_X data
points on a 2D grid.
%matplotlib inline
import matplotlib.pyplot as plt

plt.scatter(reduced_X[:, 0], reduced_X[:, 1])

# these change the labels of the x and y axes


plt.xlabel("First Principal Component")


plt.ylabel("Second Principal Component")

Figure 6-8: The scatter plot of the customers data points reduced to only two components using PCA

This kind of plot is called a scatter plot because the points drawn are scattered around the
space, and it's fairly easy to understand how it works. The two main parameters that need to
be passed to the scatter method are the x and y coordinates of the points we need to draw,
which in our case are the first and the second columns of the reduced_X array. However,
these are not the only parameters that scatter can take. For example, in order to show the
actual clustering, we can pass the model.labels_ array to the method's c parameter in
order to give each point a different color based on which cluster it belongs to.
plt.scatter(reduced_X[:, 0], reduced_X[:, 1], c=model.labels_)
plt.xlabel("First Principal Component")
plt.ylabel("Second Principal Component")


Figure 6-9: The clustering of the data points where each different cluster has a different color

We can even scatter plot the centroids as well, using a different marker style 22 and
specifying edgecolors and s (for size) for enhanced visibility, as we can see in figure 6-
10.
reduced_centroids = reducer.transform(model.cluster_centers_)

plt.scatter(reduced_X[:, 0], reduced_X[:, 1], c=model.labels_)


plt.scatter(
reduced_centroids[:, 0], reduced_centroids[:, 1],
marker='X', s=60, c=np.arange(0, 5), edgecolors='black'
)
plt.xlabel("First Principal Component")
plt.ylabel("Second Principal Component")

Using PCA together with matplotlib's capabilities, we now have a pretty satisfying visualization
of the clustering we computed; now it's time to analyze it and see what it tells us about the
customers. But, wait a second, we set k to be 5 in our clustering, why is that? Isn't it
considered a hyperparameter? How can we tune it so that we do our analysis on the
correct number of clusters?

22 Check this list of all available markers to use: https://matplotlib.org/api/markers_api.html#module-matplotlib.markers


Figure 6-10: the clustering with the centroids visualized with X markers

One might think that we can use the value of inertia to choose the value of k, such that the
best value of k is the one with the lowest possible inertia; after all, since we don't have any
labels, we won't be able to do any training/validation/test split, as there's nothing to
validate or test against.
def find_best_k(max_k):
    lowest_inertia, best_k = np.inf, None
    for k in range(1, max_k + 1):
        model = KMeans(n_clusters=k)
        model.fit(scaled_X)
        if model.inertia_ < lowest_inertia:
            lowest_inertia = model.inertia_
            best_k = k

    return best_k

print("Best k is: {}".format(find_best_k(10)))

> Best k is: 10

Our experiment above shows that the inertia keeps decreasing as we increase the number of
clusters until we reach the maximum number of clusters we set in the call to find_best_k, which
is 10. Maybe if we increased that limit to, say, 30, the search would be able to find the lowest
inertia somewhere in between?
print("Best k is: {}".format(find_best_k(30)))

> Best k is: 30


Again, the inertia kept decreasing until we reached the limit; maybe increasing it again would
do? A closer look at the definition of inertia would suggest that the answer to this question is
no. Think about what happens when we increase the number of clusters: by adding more
centroids, we're also increasing the chances of a point finding a centroid close to it, which
decreases the inertia. This pattern continues up until the point where we set the number of
clusters to be the same as the number of points in the data. In that case, the centroid of each
cluster will be identical to the point that belongs to it, which brings the inertia down to 0,
the lowest possible value! We can verify that by actually running that scenario and plotting the
relation between the number of clusters and the inertia using matplotlib's plot method, which works exactly
like the scatter method except that it connects the plotted points with a line.
data_size, _ = scaled_X.shape

inertias = []
for k in range(1, data_size + 1):
    model = KMeans(n_clusters=k)
    model.fit(scaled_X)
    inertias.append(model.inertia_)

plt.plot(np.arange(1, data_size + 1), inertias)


plt.xlabel("Number of Clusters")
plt.ylabel("Inertia")

As shown in figure 6-11, the inertia indeed keeps decreasing until it reaches zero when the
number of clusters equals the number of data points. This is obviously the lowest inertia possible,
but putting each point in its own cluster is not informative at all; we need to group multiple
points into a few clusters in order to identify trends and patterns.
It's now obvious that the inertia is not a good metric for tuning the number-of-clusters
hyperparameter; we need a more clever approach.


Figure 6-11: Inertia decreasing as the number of clusters increases, until it reaches zero when the number of
clusters equals the number of data points.

6.2 Tuning the Value of k with Silhouette Score


Validation and hyperparameter tuning can be more challenging in unsupervised learning
compared to their supervised counterparts. In supervised learning, we already know the true
labels for our training set, so we can withhold a part of it to be used after training, test the
model's accuracy against the true labels, and determine whether the chosen hyperparameter is the
best one. In unsupervised learning, however, this is usually not feasible, and that's why we need
to get a little more clever when we validate unsupervised models. Let's think for a moment
about what we're trying to validate here. We can consider that the best value of k would result
in a clustering that represents the natural groups within the data as accurately as
possible, and it makes sense to say that a natural group would have its members be as
coherent as possible and have as few overlaps as possible with other groups. These
two criteria of cohesion and overlap could be the keys to an informative validation measure.
Let's start by measuring the cohesion of a cluster. One would think that applying the
inertia formula to a single cluster (that is, taking the mean squared distance of all its points to the
centroid) could serve as a measure of how coherent the cluster is (with high values indicating
less coherence and vice-versa). This line of thinking is not wrong, but we could argue that
there is a stronger measure of coherency than inertia. Let's first take a look at figure 6-12
which shows the points in two nearby clusters.


Figure 6-12: point P is close to its cluster's centroid, but it's more dissimilar to point Q (which is one of its cluster's points)
than it is to point R, which belongs to another cluster. It could be said that P is more coherent with R than it is
with Q, which belongs to its own cluster; this makes a point-based coherence measure more representative than
the regular inertia.

Because the centroid is calculated as the mean of all its assigned points, the inertia at this
cluster would be minimal. However, we can see that point P can be considered more coherent
with point R (which belongs to another cluster) than with some of its own cluster's points (like point
Q); the minimal inertia didn't make point P as coherent with its own siblings as it is with
another cluster's point. This would suggest that a point-based measure, where the similarity of
each point with its surroundings is taken into consideration, would be a stronger measure of
coherence than inertia. Such a coherence score, which we're gonna denote with a(i) for each
point x(i), can be calculated as:
$$ a^{(i)} = \frac{1}{|C_I| - 1} \sum_{x^{(j)} \in C_I,\; j \neq i} d\big(x^{(i)}, x^{(j)}\big) $$
Where d is a distance metric like the Euclidean distance, and C_I is the cluster to which x(i) belongs.
This formula simply reads as the average distance from point x(i) to all other points in its cluster,
excluding itself. As distance is a measure of dissimilarity, high values of a(i) indicate that x(i) is
not very coherent with its siblings, while lower values indicate higher coherence.
Overlap can be measured in a similar way: a point can overlap between two clusters if
that point is coherent with its neighboring cluster. A good measure of that would be to
measure its coherence (just like above) with each of the other clusters and choose the minimum
value among all of them. That minimum value represents the point's coherence with its closest
neighboring cluster, or its overlap with it. We denote such an overlap score with b(i), and it simply
can be translated into:
$$ b^{(i)} = \min_{J \neq I} \; \frac{1}{|C_J|} \sum_{x^{(j)} \in C_J} d\big(x^{(i)}, x^{(j)}\big) $$

High values of b(i) indicate less overlapping or, inversely, more separation from neighboring
clusters. On the other hand, low values of b(i) mean that the point overlaps with its
neighboring clusters, which would suggest that the clusters are not well separated.

Figure 6-13: An example of measuring the cohesion and overlap scores on a given point x(i)

Now that we calculated measures for cohesion and separation, we're left with only one
question: how can we interpret these values to determine the quality of clustering? This can
be done by combining both the measures into a single score that reflects the quality of
clustering by comparing both cohesion and overlaps. Let's say that we'll combine both a(i) and
b(i) into a single score s(i), that score would take the value of +1 if the point is perfectly
assigned to its natural group, and would would be -1 if the point is incorrectly assigned to
another group. Now let's look at the values of a(i) and b(i): naturally we would like to have a(i)
< b(i), which indicates that the point is more coherent with its own cluster than it is with the
neighboring one. Now in that case, we also care about how much lower a(i) is from b(i);
because the more the two values are apart, the more coherent the point is with its own cluster
and more separated from the neighboring cluster. So, starting form a perfect score of +1, we


should punish s(i) by the value of a(i)/b(i), as that ratio decreases when a(i) gets lower than b(i)
and increases as they get closer in value. So we can say that:
$$ s^{(i)} = 1 - \frac{a^{(i)}}{b^{(i)}}, \qquad \text{when } a^{(i)} < b^{(i)} $$
We can follow a similar reasoning in the case where b(i) < a(i). In that case, we'd prefer b(i)
to be as close to a(i) as possible to mitigate the overlapping problem we have. Hence, starting
from the worst score of -1, we should reward s(i) by the value of b(i)/a(i), as this ratio increases
as b(i) gets closer to a(i) and decreases as b(i) falls further below it. So we can state that:
$$ s^{(i)} = \frac{b^{(i)}}{a^{(i)}} - 1, \qquad \text{when } b^{(i)} < a^{(i)} $$
We can notice that in both cases, the value of s(i) shares a common numerator b(i) – a(i), only
the denominator changes depending on which is bigger between a(i) and b(i). From that
observation we can combine the two formulas of s(i) into a case-independent formula by
saying that:
$$ s^{(i)} = \frac{b^{(i)} - a^{(i)}}{\max\big(a^{(i)}, b^{(i)}\big)} $$
In the case where the point is the only point in its cluster, we set s(i) to be zero. This prevents
the problem we had where single point clusters have exactly zero inertia. Now the value of s(i)
(which is bounded between -1 and +1) can be used to determine the quality of the point's
assignment:

• When it's close to +1, the point is more coherent with its own cluster and well
separated from the neighboring cluster; this means that the point is likely to be
assigned to its natural group.
• When the value is 0, our point lies on the boundaries of its own cluster and the
neighboring one. This means that the point's assignment to its cluster is indecisive.
• When the value is closer to -1, the point is more coherent with the neighboring cluster
than its own. This means that the point is likely to be assigned to the wrong group.


Figure 6-14: the possible values of s(i) and what they mean about the point x(i). The closer the value is to +1,
the more coherent the point is with its own cluster and the better separated it is from the neighboring cluster.
On the other hand, the closer the value is to -1, the more likely it is that the point is assigned to the wrong
cluster. When the value is 0, the point lies on the boundaries of the two clusters, which makes the
assignment indecisive.

We call s(i) the silhouette coefficient of point x(i), and it represents how well the point is
assigned to its natural cluster. Taking the mean of the silhouette coefficients across all points
can then give us a measure of how good the clustering is overall. We call that mean the
silhouette score, and that's what we'll be using to tune the value of k and choose the
best one.
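
Before handing the computation over to scikit-learn in the next section, here is a minimal sketch (an illustration, not the book's code) that computes the silhouette coefficient of a single point straight from the a(i) and b(i) definitions above and checks it against scikit-learn's silhouette_samples function, reusing the scaled_X array and the 5-cluster model we fitted earlier.

from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples

labels = model.labels_
i = 0                                     # any point index works
own = scaled_X[labels == labels[i]]       # points in x(i)'s own cluster

# a(i): average distance to the other points of its own cluster
# (the point's distance to itself is 0, so we just divide by |C_I| - 1)
a_i = cdist(scaled_X[[i]], own).sum() / (len(own) - 1)

# b(i): the smallest average distance to the points of any other cluster
b_i = min(
    cdist(scaled_X[[i]], scaled_X[labels == c]).mean()
    for c in set(labels) if c != labels[i]
)

s_i = (b_i - a_i) / max(a_i, b_i)
print(np.isclose(s_i, silhouette_samples(scaled_X, labels)[i]))   # True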

6.2.1 Creating Marketing Plans Against the Detected Customer Segments


Calculating the silhouette score can be easily done with the silhouette_score function
located in the sklearn.metrics module. The function simply takes the data points as its first
parameter and the labels assigned by the clustering algorithm as its second one, with the
option to specify which distance metric to use when calculating the coefficients via the
metric parameter (which defaults to 'euclidean'). All we need to do is replace the call
to model.inertia_ in our searching code with a call to silhouette_score.
from sklearn.metrics import silhouette_score

silhouette_scores = []
for k in range(2, 11):
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(scaled_X)
    silhouette_scores.append(silhouette_score(scaled_X, model.labels_))

best_k = np.argmax(silhouette_scores) + 2
best_silhouette = np.max(silhouette_scores)
print("Best k is: {}, with Silhouette Score: {:.2f}".format(best_k, best_silhouette))

plt.plot(np.arange(2, 11), silhouette_scores)


plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")


> Best k is: 2, with Silhouette Score: 0.61

Notice that we started our search from the value 2 because the silhouette score assumes that
at least two clusters exist (the point's cluster and its neighboring one). Also, instead of
keeping a record of only the best silhouette score, we stored the results for all the different k values
in a list and made use of the np.argmax method, which returns the index of the maximum
element in the list. We then used the list to plot figure 6-15, which shows that the silhouette
score indeed goes down as the number of clusters increases, indicating that the clustering
becomes less coherent and less well separated as we increase the number of clusters.

Figure 6-15: the silhouette score decreasing as the number of clusters increases; the highest value is at k =
2, which means that there are likely two natural groups of customers within our data.

Our silhouette score analysis shows that the best clustering happens when the number of
clusters is two, which suggests that there are likely two natural groups of customers within
our data set. We can now start inspecting the characteristics of each cluster in order to devise
an appropriate marketing plan targeting each of them. We start by retrieving the two clusters'
points from the original data frame using their labels from our best k-means model.
best_model = KMeans(n_clusters=2, random_state=42)
best_model.fit(scaled_X)

first_cluster_mask = best_model.labels_ == 0
first_cluster = data[first_cluster_mask]
second_cluster = data[~first_cluster_mask]

Before we start analyzing the representative purchasing behavior in each cluster, it's reassuring
to see that if we plot the histograms of the Channel feature in each cluster like in figure 6-


16, we'll see that the first cluster has most of its members from the first channel, which is
restaurants, cafes, and hotels, while on the other hand, the other cluster has most of its
members from the second channel, which is retail stores. This reassures us about our results
and about the silhouette score as an evaluation metric, because we now see that our clustering
closely corresponds to two actual groups within the data: the group of retail stores and the
group of cafes, restaurants, and hotels.

Figure 6-16: distribution of customer channels in the first cluster (a) and in the second cluster (b). The first
cluster has most of its members from the hotels, cafes, and restaurants channel, while the other has most of its
members from the retail stores channel.
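
A possible way to produce the two Channel histograms of figure 6-16 side by side (a sketch; the exact plotting code isn't shown in the chapter) is:

fig, (ax_a, ax_b) = plt.subplots(1, 2, figsize=(10, 4))
first_cluster["Channel"].hist(ax=ax_a)
ax_a.set_title("(a) First cluster")
second_cluster["Channel"].hist(ax=ax_b)
ax_b.set_title("(b) Second cluster")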

We can now proceed confidently with our analysis, knowing that our results are most
probably reflecting a natural grouping within the data. Let's start by investigating the
representative behavior of the second cluster by checking its mean values.
second_cluster.mean()

> Channel 1.916667


Region 2.444444
Fresh 13556.194444
Milk 21922.972222
Grocery 30602.888889
Frozen 4968.694444
Detergents_Paper 14516.333333
Delicassen 4142.361111
dtype: float64

We can see that the second cluster, which represents retail stores, purchases most of the
product categories at more than 10k on average. From this, we can draw a possible marketing
strategy targeted at retail stores by offering a discount on a whole-package purchase that
includes all the types of products our distributor can offer. Such an offer can be tempting for
retail stores, because they tend to purchase all their supplies in large quantities and they
would appreciate a discount on the whole deal. On the other hand, would such an offer be
tempting for hotels, cafes and restaurants as well? We can answer that question by inspecting
the representative purchasing behavior of the first cluster.
first_cluster.mean()


> Channel 1.269802


Region 2.551980
Fresh 11861.653465
Milk 4359.232673
Grocery 5932.816832
Frozen 2902.913366
Detergents_Paper 1844.725248
Delicassen 1291.628713
dtype: float64

The first cluster, however, seems to spend its money mostly on Fresh products, which
means that a discount offer on the whole package is not likely to be tempting for them. So to
target this cluster of customers, we could instead offer restaurants, cafes, and hotels a discount on
their Fresh purchases. That way they could benefit from a discount on the product they
need the most. With these two recommendations, we're done with the task we
were assigned. We can now hand the data and our recommendations to the marketing team
and let them work out the details of their marketing plan.

Why is it called “silhouette” score?


Usually, the word “silhouette” refers to an image of a person's or object's outline, where the object or the person
is filled with black against a white background. A key characteristic of such regular silhouettes is that we
identify the object in the image by its outline; so silhouettes generally capture the outline of an object, and
that's what our silhouette score is doing!

Figure 6-17: a silhouette of a robot


Our silhouette score tells us whether the clusters are well separated from each other with obvious outlines, or
overlapping and tangled together such that no obvious outline can be drawn for each cluster. Our
score, in a sense, captures the outline of the clusters, much like the silhouette image captures the outline of its
object.


EXERCISE 6.1

To further appreciate the value of feature scaling, repeat the script where we tuned the value
of k using silhouette score but this time with the unscaled version of the data. Notice what
happens to the best silhouette score due to that change.

6.3 Limitations of K-means


As we have discussed earlier, each step in the k-means method contributes towards
decreasing the value of the inertia: the first step by assigning each point to its closest
centroid, and the second by re-positioning the centroid at the mean of all the points, so that it
has minimal distance to all its assigned points 23. With that, we can guarantee that the
inertia will always decrease, no matter what the data looks like or how it's distributed. But
does that guarantee that the resulting clustering will be correct, no matter what the data looks
like or how it's distributed?
Whether a clustering is correct or not is a tough question; we do not have any true labels
to compare against and see if the predicted results are correct or not. And a metric like the
silhouette score doesn't guarantee the correctness of the clustering; it just ensures that the
clustering we get is well behaved in terms of cohesion and separation, which makes it more
likely to be correct, but that's not guaranteed. We can still get the most well-behaved
clustering while it's still incorrect. Figure 6-17 is an example of such a case, where the
plot on the left shows the true clusters for some data we generated, against the right plot,
which shows the clusters recognized by the k-means model with the highest silhouette score.

23 There is a rigorous mathematical proof that the mean is the optimal point minimizing the distances to all the cluster's points, but it involves differential calculus, which we don't have at this point. We'll revisit that point later when we're introduced to derivatives, but for now our intuitive justification suffices.


Figure 6-17: (a) the true clusters in an artificially generated two dimensional data vs (b) the clusters recognized
on the same data by the k-means model with the highest silhouette score. This shows that a high silhouette
score doesn't guarantee correct clustering.

The plots in figure 6-17 were created using something we call a synthetic dataset. The
points we see scattered there are not from any real-world data set collected by humans;
rather, they were randomly generated from a controlled probability distribution that we know
beforehand, and that's why we call them synthetic (or generated) datasets. The benefits of
synthetic data lie in two facts:

• The first is that we know everything about them. We know how they're generated, we
know their labels and we know how the features and the labels are all connected to
each other.
• And the second is that we virtually have an unlimited supply of them. Unlike real-world
datasets, we can generate as much data as our RAM can fit from our synthetic sources.

These two facts combined allow us to generate clustering datasets whose true
labels we know beforehand, and in quantities that allow us to efficiently explore how our
algorithms behave, observe their strong and weak points, and ultimately start
gaining insights about the correctness of our clustering results.
Fortunately for us, scikit-learn's sklearn.datasets module provides a handful of
methods to generate synthetic datasets with different shapes and properties, which allows
us to cover a wide variety of scenarios and easily study how k-means behaves in each of them.
We utilize these methods in the following code, which generates figure 6-18.
from sklearn.datasets import make_blobs, make_moons, make_circles


fig, axes = plt.subplots(2, 4, figsize=(12, 6))

for i, data_shape in enumerate([
    'Blobs', 'Elongated', 'Moons', 'Concentric Circles'
]):
    original_axis = axes[0, i]
    kmeans_axis = axes[1, i]

    # label each row of the plot with what it represents
    if i == 0:
        original_axis.set_ylabel("True Clusters")
        kmeans_axis.set_ylabel("Kmeans Clusters")

    # generate the dataset using the sklearn.datasets methods given the shape
    if data_shape == 'Blobs':
        X, y = make_blobs(n_samples=1500, centers=2, random_state=170)
    elif data_shape == 'Elongated':
        X, y = make_blobs(n_samples=1500, centers=2, random_state=170)
        transformation = np.array(
            [[ 0.03412646, -1.76449409],
             [ 0.63937541,  2.45397584]]
        )
        # this transforms the circular blobs into ellipsoidal ones
        X = np.dot(X, transformation)
    elif data_shape == 'Moons':
        X, y = make_moons(n_samples=1500, random_state=170, noise=0.05)
    else:
        X, y = make_circles(
            n_samples=1500, random_state=170, factor=0.5, noise=0.05
        )

    # scatter plot the data with their true labels
    original_axis.scatter(X[:, 0], X[:, 1], c=y)
    original_axis.set_title(data_shape)

    # fit a kmeans model to the data
    kmeans_model = KMeans(n_clusters=2)
    kmeans_model.fit(X)
    centroids = kmeans_model.cluster_centers_
    predicted = kmeans_model.labels_

    # scatter plot the kmeans clustering along with the centroids
    kmeans_axis.scatter(X[:, 0], X[:, 1], c=predicted)
    kmeans_axis.scatter(
        centroids[:, 0], centroids[:, 1],
        c=[0, 1], marker='X', s=60, edgecolors='black'
    )

    # calculate and print the standard deviation ratios of both clusters
    # in both settings, the true labels and the fitted kmeans labels
    print(data_shape)
    print("=================")
    for clustering in ["True", "K-means"]:
        std_ratios = []
        for label in [0, 1]:
            labels = y if clustering == "True" else predicted
            cluster = X[(labels == label)]
            stds = np.std(cluster, axis=0)
            std_ratios.append(np.max(stds) / np.min(stds))

        print(" - {} clusters std ratio: ({:.2f}, {:.2f})".format(
            clustering, *std_ratios
        ))

> Blobs
=================
- True clusters std ratio: (1.08, 1.02)
- K-means clusters std ratio: (1.02, 1.08)
Elongated
=================
- True clusters std ratio: (4.89, 4.66)
- K-means clusters std ratio: (2.14, 2.30)
Moons
=================
- True clusters std ratio: (2.28, 2.27)
- K-means clusters std ratio: (1.37, 1.36)
Concentric Circles
=================
- True clusters std ratio: (1.00, 1.01)
- K-means clusters std ratio: (1.63, 1.63)

Figure 6-18: From left to right. (1) k-means clustering agrees with the true labels when the clusters are circular
blobs. (2) k-means clustering fails to capture the true groups when the clusters are more ellipsoidal than
circular. (3) k-means does a worse job when the clusters' shapes get more irregular and become interleaving
moons. (4) k-means' performance keeps degrading when the clusters become concentric circles. The main
observation is that k-means seeks to form circular clusters around the centroids, which fails when the true
cluster shapes are more complex than mere circles.


This may look like a big scary piece of code at first, but it actually consists of smaller,
simpler sections, some of which we've already worked with before. The code starts with:

1. Creating subplots in our matplotlib figure. Our goal is to generate multiple
   datasets with different shapes and compare the k-means clustering to the true
   labels, so we create 2 rows (one for the true labels, and the other for the k-means
   labels), and each of these rows contains 4 columns (one for each data shape we're going
   to generate). We receive these subplots (or axes) into the axes variable, which is a 2D
   ndarray.

After we initialize the subplots, we start looping through the different shapes we're going to
generate in order to actually plot them. At the beginning of the loop we pick the corresponding
axes we're gonna plot into, and then for each shape we:

2. Generate our synthetic data set depending on the shape we're currently at. We use the
method make_blobs to create the two circular blobs and apply a transformation that
stretches and rotates them in order to get the elongated clusters. For the interleaving
moons we use the make_moons method, and for the concentric circles we use
make_circles. All of these methods take a parameter called n_samples, which
determines how many points we want to generate, but only make_blobs takes a
centers argument, as it's the only one capable of generating any number of clusters;
the others can only generate two. The noise parameter used with
make_moons and make_circles is needed to introduce that fuzziness around the
shapes; otherwise they'd look like perfect moons or circles. Finally, the factor
parameter in the make_circles method determines the size ratio between the inner
and the outer circles.
3. After generating the synthetic dataset (both the features X and the true labels y), we
plot them and color them with their true labels in the designated subplot.
4. We then run a k-means model and fit it to the data in order to get its clustering and
centroids. Notice that we provided it with the correct number of clusters beforehand
and didn't go through the hassle of tuning it; that's because we're trying here to study
k-means' behavior regardless of the hyperparameter tuning problem of k.
5. After we fit the k-means model, we plot its results just like we did earlier in our
customer segments visualization.
6. At the end of the loop's iteration, we go through the two clusters in each setting (true
and k-means), calculate the standard deviation in each cluster along each of the two
features, divide the maximum of those by the minimum, and report it. To
understand why we're doing this, we need to understand what that standard
deviation ratio represents.

As we already know, the value of the standard deviation tells us how spread out the data is
around its mean, and when we calculate it for a specific direction (or feature) like we did in
our code, we're quantifying the spread of the data in that specific direction. Hence, by
comparing the two standard deviations of a cluster, we can understand the shape of that


cluster. As shown in figure 6-19, when the two standard deviations are equal, we expect the
cluster to be the most circular, and as the two deviations grow apart, the cluster starts looking
less and less circular. Given that understanding of the standard deviations ratio, we can start
noticing a common behavior in the k-means method.

Figure 6-19: the standard deviation along a direction specifies how spread out the data are along that direction;
hence the ratio between the stds along the two directions tells us about the shape of the whole data cloud, and
how circular it is. When the ratio is (a) greater than one, the shape is less circular, (b) closer to one, it's
more circular, (c) exactly one, it's perfectly circular.

With the first synthetic data, the blobs, the true clusters have standard deviation ratios that
are close to one, which is understandable as the clusters are pretty circular, as we can see. We
also find that the clusters found by k-means match these ratios, and the clustering is actually
correct. This starts to change as the data shape becomes less circular, like with the elongated
clusters and the moon clusters. With these two sets, the true standard deviation ratios are higher
than 1, but the resulting k-means clustering is always trying to push the ratio as close to one as it can,
and the clustering becomes incorrect. This might suggest that there is an implicit tendency
within the k-means method to seek out circular clusters, but the results from the concentric
circles set may seem to contradict that hypothesis. The deviation ratio in the original
setting is close to one, which we'd expect given that the set is made of circles. However, the
k-means clustering has ratios higher than one, which means that it's less circular than
the original setting. Does that rule out our hypothesis about k-means' tendency toward
circular clusters?
The answer to that question is no, and to understand it we need to take a moment and
think about what k-means tries to do. With k-means, we're trying to find k disjoint centroids
around which the data points cluster. Because the two clusters in the concentric datasets are
concentric (that is, they share the same center), there's no way that k-means can assign them


centroids that would result in disjoint clusters. That's why the resulting k-means clusters are
actually the most circular they can get given that the two clusters must be disjoint, and it's
even apparent in figure 6-18 that the resulting clusters are pretty circular around their centroids.

EXERCISE 6.2

Using what you learned from our last experiments code, recreate figure 6-17 and check that
this is indeed the clustering with the highest silhouette score. Use random_state=42 to get
consistent results.

6.3.1 The Math Beyond the Circular Tendency


Our little experiment up there showed that k-means tends to seek out circular clusters, but we
can't help but ask about the generality of that tendency: is it a general characteristic of the k-
means method, or did it just occur with these cases we tried? Given that the datasets we used
are kinda different from each other in terms of shape and distribution, it's unlikely that
it's a special behavior happening only in these cases, but that's not a definitive proof that
it's an inherent characteristic of the method itself. Here, we're gonna mathematically show that
the definition of the k-means objective (aka, the inertia) leads directly to such a tendency for
circular clusters, and hence prove that it's a general limitation of the k-means method.

The rest of the section is a bit mathy, so here's our usual reminder that the math shouldn't be scary; you need
nothing here but high school algebra and the definition of variance from Chapter 3. So take a deep breath,
keep your paper and pencil close, and always remember to read what the equation is trying to say.

Our proof starts by taking the formula of the inertia, and unpacking the squared Euclidean
distance across the features (which we'll limit to 2 here for simplicity) and then distributing
the sum over the samples in each cluster over the unpacked terms of the Euclidean distance:
$$ I = \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{x \in C_i} \Big[ (x_1 - \mu_{i,1})^2 + (x_2 - \mu_{i,2})^2 \Big] = \sum_{i=1}^{k} \Big[ \frac{1}{|C_i|} \sum_{x \in C_i} (x_1 - \mu_{i,1})^2 + \frac{1}{|C_i|} \sum_{x \in C_i} (x_2 - \mu_{i,2})^2 \Big] = \sum_{i=1}^{k} \big( \sigma_{i,1}^2 + \sigma_{i,2}^2 \big) $$

Where the final equation can be obtained from the previous one by recalling the definition of
the sample variance we saw back in chapter 3 (we just changed the symbol from s to sigma):
$$ \sigma^2 = \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})^2 $$
This essentially means that minimizing the inertia consists of minimizing the sum of the features'
variances within each cluster; so what we need to do now is see what happens when this
sum of features' variances is minimized and what that minimization implies. Using a basic
algebraic identity 24, one can say that:
$$ \sigma_{i,1}^2 + \sigma_{i,2}^2 \geq 2\,\sigma_{i,1}\,\sigma_{i,2} $$
This means that when the sum of variances approaches its lowest value, it'll be approaching
the value of twice the product of the standard deviations. So minimizing the inertia implies
that the sum of variances approaches twice the product of the standard deviations for each cluster in
the data, or:
$$ I \text{ minimized} \;\Rightarrow\; \sigma_{i,1}^2 + \sigma_{i,2}^2 \rightarrow 2\,\sigma_{i,1}\,\sigma_{i,2} \quad \forall\, C_i $$
24 We're talking about the fact that (a – b)² = a² – 2ab + b², and that (a – b)² ≥ 0 because the square of a number is always non-negative.


Where the double-lined arrow means “implies”, the single-lined arrow means “approaches”,
and the flipped A means “for every”. So that statement would be read as: “minimizing the inertia
implies that the sum of the cluster's variances approaches twice the product of its standard deviations for every
cluster Ci we have”. By rearranging the terms of that implication and using the algebraic fact
(a – b)² = a² – 2ab + b², we can say that:
$$ I \text{ minimized} \;\Rightarrow\; \sigma_{i,1}^2 - 2\,\sigma_{i,1}\,\sigma_{i,2} + \sigma_{i,2}^2 = (\sigma_{i,1} - \sigma_{i,2})^2 \rightarrow 0 \quad \forall\, C_i $$
And hence we can finally say that:
$$ I \text{ minimized} \;\Rightarrow\; \frac{\sigma_{i,1}}{\sigma_{i,2}} \rightarrow 1 \quad \forall\, C_i $$
So the inertia being minimal (or close to minimal) implies that the ratio of the features' standard
deviations in each cluster is close to one. This proves that k-means' tendency to cluster
the points in circles around the centroids is inherent in the method itself and not dependent on
the data being clustered. While the result we proved only applies to two-dimensional data, the
same can be proved for higher dimensional ones (with extra math of course), and in such
cases k-means will be seeking spherical or hyper-spherical clusters.
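
A quick numerical check (an illustration only, on randomly generated points rather than our customers data) of the key identity we used above, namely that the average squared distance of a cluster's points to their centroid equals the sum of the per-feature variances:

import numpy as np

rng = np.random.default_rng(0)
cluster = rng.normal(size=(100, 2))            # a stand-in 2-D cluster
centroid = cluster.mean(axis=0)

# the per-cluster inertia term: average squared Euclidean distance to the centroid
avg_sq_dist = np.mean(np.sum((cluster - centroid) ** 2, axis=1))

# sum of the per-feature variances (np.var uses the 1/n convention)
var_sum = np.var(cluster[:, 0]) + np.var(cluster[:, 1])

print(np.isclose(avg_sq_dist, var_sum))        # True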
So our analysis here shows us that the k-means method is kinda limited, as it tends to
generally search for circular clusters. So if the true clusters have non-circular or irregular
shapes, then the chances are higher that k-means will result in an incorrect clustering. This
puts a limit on the generalization capabilities of the k-means method, but it doesn't disqualify
it entirely from being a valuable clustering tool. There are still a lot of applications where k-
means will perform decently, like the customer segmentation problem we had throughout this
chapter, for which we have good reasons to believe that it yields a correct clustering.
Moreover, it's often used as a preliminary step for other clustering methods that can
overcome some of its problems. All in all, k-means remains a valuable machine learning
tool in our toolkit.

EXERCISE 6.3

Given what we discussed in chapter 5 about how the space in high-dimensional settings
becomes vast and the average distance between any two random points explodes, think


about how that would affect k-means and whether it would be suitable for working with high-
dimensional data or not.


7
Decision Trees

He that climbs the tree has won the right to the fruit.

- Walter Scott, an 18th century Scottish poet.

In the last part, we explored similarity-based methods, which were founded on the general
assumption of the target function’s uniform continuity; this assumption didn’t constrain our model
to a specific functional form. However, in this part we start exploring a specific form of
modeling assumption, which is the form of a tree. In this chapter, we solve a problem of used
car price prediction (which is our first regression problem to encounter so far) by assuming
that our target function takes the form of a binary decision tree. Moreover, we’re going to
learn:

• How to fill-in missing entries in a dataset using the rest of available entries.
• How to train and evaluate regression models using mean squared errors.
• How decision trees and random forests are built.
• How to use scikit-learn to train and use decision trees and random forests.
• What makes learning functions from data possible and what constraints it.
• What Bias-variance Trade-off is.

7.1 Predicting the Price of a Used Car


One of your colleagues at work is a co-founder of a used cars business back in her home country,
India. In that business, she and her partners buy used cars from their owners, have
their in-house mechanics apply any necessary maintenance or modifications, and then resell
these cars at a higher but affordable price for those who cannot afford a new car. Their
business serves both the car’s original owner and the potential new owner: the original owner
by not requiring them to maintain the car before selling it, and the potential new owner by

alleviating the risk and headache of buying a car from its owner directly. Their source of
revenue is the difference between the car’s new selling price and its original buying price.
Your colleague tells you that things have been going smoothly so far, but they want to
maximize their revenue without making their selling prices higher, because if they did make
them higher, they’re going to lose customers. So the only way to achieve such an increase in
revenue is to reduce costs, and one of the most promising areas to reduce costs is their
supply pipeline. Up till now, their supply pipeline has worked like this:

1. A web scraping script provides a daily sheet of used car ads across all of India from a
specified list of online marketplaces. This daily sheet includes multiple characteristics of
these cars as listed in the online ad, including the listed price. However, more often
than not, car owners refrain from adding the price to the ad and specify that the price will be
revealed in a telephone call to the owner's listed number.
2. The part of the sheet that doesn't have prices is outsourced to a call center service that
handles calling the car owners to inquire about the price.
3. At the end of the day, the sheet with the prices filled in is returned to the company's financial
team, which applies some filters on the prices and then gets the sales team to follow up on
the cars that survived the filtering.

"That second phase, where we outsource following up with owners to an external call
center service, is costing us a good amount of money," said your colleague. She and her partners
searched for possible solutions to automate that phase of the pipeline, like they did with the
first phase when they adopted an automatic web scraper, and it didn't take them long
to find out that machine learning is probably the key to achieving such automation. She asked
you, "What if we replaced the call center service with a machine learning service that takes in a
car's characteristics and automatically predicts its price?"
If this were achieved, they would kill two birds with one stone: they could cut the cost of the
call center service, and they would fill in the missing price entries instantly without having
to wait until the next day for the call center to send them the filled sheet. That would speed up
their process significantly, grow their supply of cars more quickly, and eventually
yield more revenue. Your colleague explained that such a system should be feasible, especially
since they have the data for it. Up until now, they have kept all the sheets that were filled in by the call
center service, so they have a dataset of car characteristics and their corresponding prices. All
they need now is a talented data scientist and machine learning engineer to translate this
data into a functional system. This is where you, my dear reader, come in.

7.1.1 Modeling the Problem with Decision Trees


For an experienced car trader like your colleague or her partners, when someone asks her if
she'd be interested in buying a car for 310,000 INR, she's probably going to ask the seller
some questions about the car to assess that price: what is the model of the car? What is the
engine's capacity? What about its power? And so on. Depending on the answers given by the
seller, she starts assessing them against her past experience with cars. Let's try to visualize
how this process gets carried out in her head. Let's first say that she knows the following from
her experience:

• Any car manufactured before 2008 should, on average, be worth 180,000 INR, no matter
its other specs.
• If the car was manufactured after 2008 and its engine capacity is less than 1500 CC,
then it shouldn't be worth more than 250,000 INR.
• If the car was manufactured after 2008, its engine capacity is more than 1500 CC, and
its engine power is less than 75 bhp, then it shouldn't be valued at more than 300,000
INR.
• If the car was manufactured after 2008, its engine capacity is more than 1500 CC, and its
power is more than 75 bhp, then it can be valued at around 450,000 INR.
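
Before we turn these rules into a tree, note that they could also be written directly as code. The tiny sketch below is illustrative only; the estimate_price function and its rounded return values are just the four rules above restated as nested yes-or-no questions:

def estimate_price(year, engine_cc, power_bhp):
    # the trader's four rules, written as nested yes-or-no questions (prices in INR)
    if year <= 2008:
        return 180000
    if engine_cc <= 1500:
        return 250000   # at most 250,000 INR
    if power_bhp <= 75:
        return 300000   # at most 300,000 INR
    return 450000       # around 450,000 INR

estimate_price(2012, 1600, 73)   # -> 300000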

Reading this set of rules can be a little confusing, and if we went on and added more rules,
it would be much harder to keep track of what goes where. Luckily, this set of rules can be
represented in a readable, easy-to-follow form that doesn't lose any of its power, and
that is the form of a tree. Figure 7-1 shows a tree representation of the same set of rules
above.

Figure 7-1: a tree representation of the set of rules listed above.


This representation makes it a lot easier to follow the rules and see how they apply to the
new car the seller is trying to pitch. For example, if the seller said that the car was
manufactured in 2012, and its engine has a capacity of 1600 CC and a power of 73 bhp, then by
traversing down the tree (as shown in Figure 7-2), your colleague can decide that this car is
reasonably priced.

Figure 7-2: traversing the tree with given car’s specs shows that it’s reasonably priced

Using this tree of knowledge and rules, your colleague was able to reach a decision about the
car being pitched to her. Hence, this form of structure is called a decision tree. Such a plausible
representation of a car trader's decision-making process is a good motivation for thinking
that a decision tree structure is a good modeling assumption for building the required ML
component that automatically predicts car prices.
With decision trees as our modeling assumption, our goal is to learn from the data
a set of yes-or-no questions, arranged in a tree structure, that represents the rules for
predicting a car's price. If we imagined that the tree in figure 7-2 was actually built from some
dataset (let's call it the 12-cars dataset), then it would look like the tree in figure 7-3. The tree in
figure 7-3 consists of a bunch of nodes; some of them are decision nodes and the others are leaf
nodes. Decision nodes are the nodes containing the yes-or-no questions that represent the
rules. From each decision node, two branches emerge: one when the answer to the question
is yes and the other when the answer is no, and each of these two branches is connected to
another node. All the branches from the decision nodes form paths to leaf nodes, which are
nodes that contain neither questions nor branches; instead, they contain some of the
data samples.
We can think of each path down the tree as a list of characteristics, and the leaf node at
the end of each path contains the samples that share those characteristics. So the bold path
in figure 7-3 represents cars that were made after 2008, have an engine capacity higher than
1500 CC, and have a power less than 75 bhp. At the end of that bold path, the leaf node contains
three cars that fit this profile, so any other car that fits the same profile is predicted to
have a price similar to those samples in the leaf node, and it's reasonable to assume that the
mean of the samples' prices is a good estimate of that predicted price. If the car being pitched
didn't have a known price, we could follow the bold path in figure 7-3 and predict its price to be
the mean of the samples' prices in the leaf node, which is 300K, and that's a pretty good
estimate!


Figure 7-3: predicting the value of the new car from a decision tree trained on 12-cars dataset

This is what learning a decision tree from data is all about: we try to come up with a set of
yes-or-no questions that form paths leading to groups of samples that share the
characteristics encoded by the path. To understand how we can do that, we first need to look
at the full 12-cars dataset, not just the prices.

7.1.2 How to Build a Decision Tree


Let's take a look at the full 12-cars dataset, which can be found in the file named 12-
cars.csv 25 and is shown in figure 7-4.

25
This dataset, like any other dataset, can be found in the datasets directory of the book’s accompanying GitHub repository


import pandas as pd

cars_12 = pd.read_csv('../datasets/12-cars.csv', index_col=0)


cars_12

Figure 7-4: the 12-cars dataset from which the tree in figure 7-3 can be built

To understand how decision trees can be built, we should look at the effect of the tree in figure
7-3 on the data samples from a different perspective. Let's begin by creating a scatter plot of
the samples in this dataset, which is shown in figure 7-5 (left) 26. After that, let's draw 2D
planes at the decision values of each decision node in our tree, which results in the
partitioning of the 3D space shown in figure 7-5 (right).

26
The details of how to create such plots are out of scope for now, but you can Google "3d plots with matplotlib" if you're interested in learning how to do them.


Figure 7-5: (Left) the 12-cars data points in 3D space, the x, y and z-axis correspond to year, capacity, and
power. (Right) the same scatter plot with 2D planes drawn at the decision values of the tree in figure 7-3.

This little visualization shows that decision trees also have the effect of partitioning
the whole feature space into sub-regions that contain similar data points; each of the four
sub-regions in figure 7-5 (right) corresponds to one of the leaf nodes in the tree in figure 7-3. This
partitioning effect of decision trees reminds us of the k-d trees from chapter 5, and that's the key
to understanding how decision trees are built.
Decision trees work in a similar way to the k-d trees we used back in chapter five; they
both partition the feature space into multiple sub-regions. So it shouldn't be strange to find out
that decision trees are built in a similar way to k-d trees: by recursively splitting regions of
space into two sub-regions. This splitting is done by choosing some feature x and a value a
(which together are called a pivot), so that one region contains all the points that have x
≤ a and the other contains the points that have x > a. To see this building process in action, we
need to assume two things:
need to assume two things:

• That there is a stopping criterion that determines how big a tree we're going to build;
we'll assume here that the criterion is for every leaf node to include 3 samples from
the data.
• That there is some method to which we can provide the training data points and it
will give us back the best pivot, that is, the pair of the best feature and its value to split
the space around. In order to build a meaningful decision tree, we must choose
meaningful pivots; we'll get to how we can choose meaningful pivots in a while, but for
now we'll just assume that such a method exists and that the pivots it produces are the best.


With these two assumptions, we can start building the decision tree in figure 7-3 step by
step:

1. At the beginning, the best pivot is on the Year feature at 2008. This results in a tree
with only two leaves, corresponding to two partitions of the feature space, as shown in
figure 7-6(a).
2. Leaf node 1 (LN1) has only three samples, so any further partitioning would violate our
stopping criterion, so we no longer partition the region with Year ≤ 2008. But we can
partition LN2 on the best pivot, which is Engine at 1500 CC; this partitions the region
where Year > 2008 into two more regions (figure 7-6(b)).
3. LN2 meets the stopping criterion, so we only partition LN3 at the best pivot at that step,
which is Power at 75 bhp, resulting in two more regions in the feature space (figure
7-6(c)).
4. LN3 and LN4 both meet the stopping criterion, so we stop here and do no more
partitions.


(a)

(b)


(c)
Figure 7-6: The process of building the decision tree on 12-cars dataset by recursively splitting the space into
two regions. Numbers on the points in the tree leaves refer to the points indices in the dataset

That's it, we have ourselves a fully grown tree! The process looks pretty easy, but keep in mind
that we omitted a very important part of it, which is how we determine the best
pivot. To finalize our understanding of how decision trees are built, we must figure out how to
choose the best splitting pivot at each step. By giving this some thought, it'll appear that
figuring this out is pretty easy as well.
To understand how we can choose the best split, we need to ask ourselves a trivial but
important question: what are decision trees trying to do? The answer to this question is
simple: decision trees try to build a model that can predict labels correctly. This raises another
question: how can we measure the tree's ability to predict correct labels? We know the answer
to this from past experience: through a loss function. Back in chapter five, we used the 0/1 loss
function to measure whether the k-NN classifier was predicting the mushroom's class correctly;
we assigned a value of 0 to the loss if the classifier predicted the class correctly, otherwise we
assigned it a value of 1. Such a loss function worked well because back then we had a set of
discrete labels, a set of classes from which the classifier must predict one. But in a regression
problem like the one we have here, such a loss function wouldn't be suitable, because
we're predicting a continuous value (an amount of money in this case), not a discrete one. In
such regression cases, another loss comes to the rescue, and that loss is called the squared
error loss, formulated as:

L(y, ŷ) = (y − ŷ)²

where y is the true label and ŷ is the predicted label.
This loss is perfect for continuous-valued labels: it's exactly zero when the predicted label is
the same as the true label, and it starts growing as the predicted label drifts away
from the true label, in either direction 27. The only downside of the squared error loss is that it
doesn't have a nice absolute interpretation like the 0/1 loss. When we average the 0/1 loss across all
samples, we get a nice absolute value between 0 and 1 that represents the accuracy of the
model, but the mean squared error (or MSE for short) has no such interpretation. This
makes the MSE useful only when comparing multiple models; the model with the lowest MSE
is the best, but a single model with a single MSE value doesn't tell us much about the quality
of that model.
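
As a quick sanity check with made-up numbers (three imaginary cars, priced in thousands of INR), here is how the squared error loss and its average, the MSE, would be computed:

import numpy as np

# made-up true and predicted prices (in thousands of INR)
y_true = np.array([300.0, 250.0, 450.0])
y_pred = np.array([310.0, 250.0, 420.0])

squared_errors = (y_true - y_pred) ** 2   # [100.0, 0.0, 900.0]
mse = squared_errors.mean()               # ~333.33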
With the MSE loss, we can formulate the decision tree model as a model that tries to
minimize the MSE over the training set. Hence, each step in building the tree should contribute
to that final goal. It follows that the best split pivot at each step is the pivot that's going to
minimize the MSE across the training data. If we assume that each pivot can be represented
by the pair (xj, a), where xj is the feature to split on and a is the value to split at, then the
best pivot is the one that minimizes the MSE in the two regions generated by the split. So if
we denote the region with xj ≤ a as Rxj ≤ a, the other region as Rxj > a, and take the predicted label in
each region to be the mean ȳ of all the points' labels in that region, then the best pivot would be the one that
minimizes:

(1/|Rxj ≤ a|) Σ(xi, yi) ∈ Rxj ≤ a (yi − ȳxj ≤ a)² + (1/|Rxj > a|) Σ(xi, yi) ∈ Rxj > a (yi − ȳxj > a)²

where |R| is the number of points in the region R.


So to choose the best pivot given some samples, we can, for example, list all the possible
values of each feature from the given samples, iterate over each feature-value pair, get
the regions resulting from splitting on that pair, and calculate the MSE stated above. The pair
with the lowest MSE would be the best pivot to go with. It's as simple as that! Actually, we can
write ourselves a little code that builds a functioning decision tree on the 12-cars dataset,
and it'll show us that the pivots we chose in the process in figure 7-6 are indeed the best
ones.

27
That’s because the difference between the true and predicted label is squared; if the predicted label went bigger or smaller than the true label, it will
always result in a positive loss.


7.1.3 Coding a Primitive Decision Tree


The first thing we need in order to implement our primitive decision tree is a data
structure that can hold the tree and allow traversing down through it with a given data
point. We implement such a data structure in the Node class below.
class Node:

    def __init__(self, is_leaf=False, pivot=None):
        self.is_leaf = is_leaf
        self.pivot = pivot

    def attach_children(self, left, right):
        self.left_child = left
        self.right_child = right

    def attach_leaf_value(self, mean_value):
        self.leaf_value = mean_value

    def traverse(self, x):
        if not self.is_leaf:
            feature_indx, value = self.pivot
            if x[feature_indx] <= value:
                return self.left_child.traverse(x)
            else:
                return self.right_child.traverse(x)
        else:
            return self.leaf_value

The Node class represents a node in the decision tree, with left and right child nodes attached
if the node is not a leaf node. With these properties, the root node object represents the
whole decision tree, as it has links to all the child nodes underneath it. The traverse method
allows us to travel down the tree with a given data point: at each decision node, a comparison
against the pivot is made to determine which child we should traverse to next, until we reach a
leaf node, where we return the attached mean value of the labels in that leaf node's region.
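
To get a quick feel for how Node and traverse work together, here's a tiny hand-built example; the leaf values are made up, and feature index 0 stands for Year, following the column order we'll use for the 12-cars data:

root = Node(pivot=(0, 2008))         # decision node: "Is Year <= 2008?"

old_cars = Node(is_leaf=True)
old_cars.attach_leaf_value(180.0)    # made-up mean price of the "old cars" leaf

newer_cars = Node(is_leaf=True)
newer_cars.attach_leaf_value(350.0)  # made-up mean price of the other leaf

root.attach_children(old_cars, newer_cars)

root.traverse([2012, 1600, 73])      # Year > 2008, so we get 350.0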
Next, we need to use the Node class to build the whole tree. We do so by defining
a class called PrimitiveDecisionTreeRegressor that will hold our tree-building code. The
two main methods in this class are the get_best_pivot and the split methods.
get_best_pivot enumerates all the possible values for each feature and calculates the
splitting MSE for each splitting option; the option with the smallest MSE is returned. The split
method uses the get_best_pivot method to recursively split regions into two sub-regions
until we hit the stopping criterion. The implementation of that class is listed below; it's a
little bit lengthy, but it's readable and it's augmented with comments here and there to clear
up any unclear pieces of code.
import numpy as np

def mse(y_true, y_pred):
    # Calculating the mean square error (MSE)
    return np.mean((y_true - y_pred) ** 2)


class PrimitiveDecisionTreeRegressor:

    def __init__(self, stop_at=3):
        self.stop_at = stop_at
        self.tree = None

    def get_best_pivot(self, region_X, region_y):
        samples_count, features_count = region_X.shape

        # record each feature index with its possible values in a dictionary
        features_split_points = {}
        for i in range(features_count):
            features_split_points[i] = np.sort(region_X[:, i])

        # minimal mse is set to infinity at start
        minimal_mse = np.inf
        best_pivot = (None, None)

        # loop over all pairs of (feature_index, value)
        for i in range(features_count):
            for value in features_split_points[i]:

                # use boolean masks to get the samples in each subregion
                left_region_mask = region_X[:, i] <= value
                right_region_mask = ~left_region_mask

                # check the number of samples in each subregion
                # if it's lower than the stopping criteria, we omit that split
                left_region_X = region_X[left_region_mask]
                right_region_X = region_X[right_region_mask]

                if left_region_X.shape[0] < self.stop_at or \
                        right_region_X.shape[0] < self.stop_at:
                    continue

                # get the y values in each region and calculate the split MSE
                left_region_y = region_y[left_region_mask]
                right_region_y = region_y[right_region_mask]

                left_mse = mse(left_region_y, np.mean(left_region_y))
                right_mse = mse(right_region_y, np.mean(right_region_y))
                total_mse = left_mse + right_mse

                # if the split MSE is lower than the previous best
                # we set the current split as the best one
                if total_mse < minimal_mse:
                    minimal_mse = total_mse
                    best_pivot = (i, value)

        return best_pivot

    def split(self, region_X, region_y, level=1):
        samples_count, _ = region_X.shape

        # if the samples count in the region to split meets
        # the stopping criteria, stop recursion and return a leaf node
        if samples_count <= self.stop_at:
            leaf_node = Node(is_leaf=True)
            leaf_node.attach_leaf_value(np.mean(region_y))
            return leaf_node

        # get the best pivot at this step and create a decision node
        split_feature, split_value = self.get_best_pivot(region_X, region_y)
        current_node = Node(pivot=(split_feature, split_value))

        print("{}Split on feature {} at {}".format(
            ''.join(['\t'] * level),
            split_feature,
            split_value)
        )

        left_region_mask = region_X[:, split_feature] <= split_value
        right_region_mask = ~left_region_mask

        # recursively split the left subregion and get its root node
        print("{}Left Region:".format(''.join(['\t'] * level)))
        left_region_X = region_X[left_region_mask]
        left_region_y = region_y[left_region_mask]
        left_child = self.split(left_region_X, left_region_y, level + 1)

        # recursively split the right subregion and get its root node
        print("{}Right Region:".format(''.join(['\t'] * level)))
        right_region_X = region_X[right_region_mask]
        right_region_y = region_y[right_region_mask]
        right_child = self.split(right_region_X, right_region_y, level + 1)

        # attach the left and right subregions to the decision node
        current_node.attach_children(left_child, right_child)

        return current_node

    def fit(self, X, y):
        self.tree = self.split(X, y)

    def predict(self, x):
        return self.tree.traverse(x)

With this implementation, we're ready to train a decision tree on the 12-cars data we have. In
the implementation, we followed the scikit-learn convention of fit/predict for consistency; we
also added print statements that lay out how the tree is built, so we can check that the splits made by
minimizing the MSE match those we made while visually building the tree in figure
7-6. Fitting the model shows us that the splits are indeed correct.
X_train = cars_12.loc[:, "Year":"Power"].to_numpy()
y_train = cars_12.loc[:, "Price"].to_numpy()

primitive_dt = PrimitiveDecisionTreeRegressor()
primitive_dt.fit(X_train, y_train)


> Split on feature 0 at 2008


Left Region:
Right Region:
Split on feature 1 at 1500
Left Region:
Right Region:
Split on feature 2 at 75
Left Region:
Right Region:

And if we ran our testing example, the one we started this section with, through the predict
method, we would get the prediction of 300K, as we stated before.
x_test = np.array([2012, 1600, 73])

primitive_dt.predict(x_test)

> 300.0

Decision Trees and Categorical Features


As we have seen above, decision trees work seamlessly with numerical features by picking a pivot value and
checking if the feature value is less than or equal to that pivot, but this approach doesn't work with categorical
features that take discrete values. Instead, the original design of decision trees suggested splitting on categorical
features by asking whether the feature takes a specific value or not. To be more concrete, suppose that we had
another feature in the 12-cars dataset called "Fuel" which specifies which type of fuel the car uses, and it
can be one of three values: "Petrol", "Diesel" or "Gas". With such a feature, a decision tree's split would be: "Is
fuel == Gas?" or "Is fuel == Petrol?".

While a lot of decision tree implementations handle categorical features this way out of the box, scikit-learn's
implementation does not, and there is a good reason for that. Decision trees in scikit-learn do not support
training on categorical features directly; instead, they require the engineer to encode these categorical features
into a numeric form and then use them. One way to do that is via OrdinalEncoder, which takes in a
categorical feature with k unique values and assigns a number from 0 to k-1 to each distinct value.
OrdinalEncoder works exactly like the encoder we wrote ourselves in chapter 5. Another approach, which
ends up mimicking the behavior described in the paragraph above, is using OneHotEncoder. Instead of
assigning a number to each of the k distinct values of the feature, OneHotEncoder expands that one
categorical feature into k binary features, each of which is an answer to the question "Is feature == the
kth value?". With our fuel example, a car that has fuel=Diesel would be represented with the three
features: fuel_is_petrol=0, fuel_is_diesel=1, fuel_is_gas=0. When we apply numerical
splitting on such features, the splitting question would be something like "fuel_is_petrol > 0.5", which is the same as
"Is fuel == Petrol?".

The downside of OneHotEncoder is the fact that the number of features can blow up this way, and that's not
favorable (remember the curse of dimensionality?). On the other hand, OrdinalEncoder adds an unwanted
assumption of ordering between the feature's values, which is usually not true. However, the effect of that
unwanted assumption can sometimes be tolerated in comparison to the feature blow-up effect of one-hot
encoding. That's why scikit-learn leaves the encoding choice to us, to determine which is better for our problem.
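
To make the comparison concrete, here's a minimal sketch (using the hypothetical Fuel feature from above, not an actual column in our dataset) of what each encoder produces:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# a hypothetical "Fuel" column, just to compare the two encoders
fuel = pd.DataFrame({'Fuel': ['Petrol', 'Diesel', 'Gas', 'Diesel']})

OrdinalEncoder().fit_transform(fuel)
# -> one column: [[2.], [0.], [1.], [0.]] (categories are ordered alphabetically)

OneHotEncoder().fit_transform(fuel).toarray()
# -> three binary columns (Diesel, Gas, Petrol), e.g. the first row is [0., 0., 1.]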

7.2 Training a Decision Tree with scikit-learn


Now that we have decision trees as a modeling assumption for our problem, we can start
building a solution. We were delivered two CSV files 28: one contains the training data and the
other is a held-out testing set, so we won't need to do any train/test splitting this time. With
these two files, we're also given an acceptance criterion for integrating our developed ML
system into the supply pipeline: the model must achieve an MSE of less than 15 on the held-
out testing set. We start by taking a look at the training data.
import pandas as pd

train_data = pd.read_csv("../datasets/used-cars-train.csv")
test_data = pd.read_csv("../datasets/used-cars-test.csv")

train_data.sample(10, random_state=42)

28
The original source of this dataset is Kaggle: https://www.kaggle.com/avikasliwal/used-cars-price-prediction


Figure 7-7: a random sample of 10 from the used cars dataset

In figure 7-7, we see a random sample of 10 entries from our dataset, which consists of 6,019
rows and 14 columns, with each column representing one of the characteristics of the sold car:

• Name: the car’s brand name and model.


• Location: the city in India where the car is listed for sale.
• Year: the edition year of the car.
• Kilometers_Driven: how many kilometers have been driven with this car.
• Fuel_Type: the type of fuel that the car uses, whether it’s: Petrol, Diesel, Liquefied
Petroleum Gas (LPG), Compressed Natural Gas (CNG), or Electric.
• Transmission: the type of the car’s transmission system, either automatic or manual.
• Owner_Type: whether the current owner is the First, Second, Third, Fourth or above.
• Mileage: How many kilometers can the car travel on a liter (or a kilogram, depending
on the fuel type) of fuel.
• Engine: the car’s engine capacity, measured in cubic centimeters (CC)


• Power: the car’s engine power, measured in horsepower (bhp)


• Seats: the number of seats in the car.
• New_Price: the price of a new car of the same model.
• Price: the price asked for by the owner to sell the car in the units of 100,000 Indian
Rupees. So if a car has a value of 5.75 for price, this means that it was priced at
575,000 Indian Rupees. This column is our prediction target.

A quick look through the sample in figure 7-7 shows two problems that need to be fixed
before we can try to build a predictive model out of it:

1. The Mileage, Engine, and Power features have their units attached to them. This makes
pandas interpret these values as strings and not floats, while these features should be
treated as numeric features, not categorical ones.
2. There are a lot of NaNs in the numeric columns, which indicate missing data. These
data were probably missing in the original car ads, but for our model to work properly,
we need to handle these missing values.

7.2.1 Preparing the Data


In the clean_dataframe function, we start by creating a deep copy of the given df data frame to
serve as the cleaned version of our data, and then we apply a simple function to each one of
the numerical columns to strip out the units and cast the remaining value into a float.
import numpy as np

def strip_units(str_value):
float_value = np.nan

if str_value is not np.nan:


number_str, units = str_value.split()
try:
float_value = float(number_str)
except Exception:
pass

return float_value

def clean_dataframe(df):

clean_data = df.copy(deep=True)

clean_data.loc[:, 'Mileage'] = df.loc[:, 'Mileage'].apply(strip_units)


clean_data.loc[:, 'Engine'] = df.loc[:, 'Engine'].apply(strip_units)
clean_data.loc[:, 'Power'] = df.loc[:, 'Power'].apply(strip_units)

return clean_data

train_data_clean = clean_dataframe(train_data)
test_data_clean = clean_dataframe(test_data)


The strip_units function simply checks that the given value is not an instance of NaN 29 and then
splits the string around white-space, which results in two parts: the number as a string and
the units. We then try to cast that number string into an actual floating point number. To
avoid any exceptions being raised in case the string doesn't contain an actual number, we do
the casting inside a try...except block. To use this function in cleaning our data, we call the
apply method on the desired column and pass it that function. All apply
does is pass each element of its calling object to the specified function and return a
copy of the calling object containing the values returned from the given function. We then take
this copy and assign it to the desired column in the clean_data data frame. A simple call to
train_data_clean.sample(10, random_state=42) will reveal that the units were stripped
from the numeric values.
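
A few quick calls (with illustrative values, not actual rows from the dataset) show what strip_units does in the different cases it can face:

strip_units('1582 CC')     # -> 1582.0
strip_units('126.2 bhp')   # -> 126.2
strip_units('null bhp')    # -> nan (the cast fails, so the NaN default is kept)
strip_units(np.nan)        # -> nan (missing values pass through untouched)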
Now remains the second problem of missing data. Before attempting any solution, we need
to assess how severe this problem is; this will help us decide whether we should just drop the rows
with the missing values or attempt to impute them. The first thing we need to do is
normalize the representation of missing values. In our dataset, missing values can manifest
themselves as either a NaN or a 0 value; to avoid running the code twice for both values,
we transform all the 0 values into NaN values. In this way, we have a single representation of
missing values to handle.
train_data_clean = train_data_clean.replace(0, np.nan)
test_data_clean = test_data_clean.replace(0, np.nan)

With NaN as the representation for the missing values, we can run the following statement to
get the percentage of missing values in each feature. In this statement, we first call the
isnull() method on the data frame; this method returns a Boolean version of the data frame
where cells with NaN values are replaced with True, and all other cells are replaced with
False. We then run the mean() method on that Boolean version to get the fraction of Trues
in each column, which is the same as the fraction of missing values. We finally multiply that
by 100 to get percentages.
train_data_clean.isnull().mean() * 100

> Unnamed: 0 0.022153


Name 0.000000
Location 0.000000
Year 0.000000
Kilometers_Driven 0.000000
Fuel_Type 0.000000
Transmission 0.000000
Owner_Type 0.000000
Mileage 0.996899
Engine 0.575986
Power 2.215330

Seats 0.686752
New_Price 86.287107
Price 0.000000
dtype: float64

29
NaN is short for Not a Number, in case you didn't already know that

This report shows that there is less than 3% of data missing in the Mileage, Engine, Power, and Seats
features, while the New_Price feature has about 86% of its data missing. With that huge amount
of data missing, it's reasonable to drop the New_Price feature altogether. We can do that
by calling the drop() method of the data frame and passing it the "New_Price" string (along
with the "Unnamed: 0" column, which doesn't serve any purpose). Note that drop()
works on both rows and columns, so we also need to specify the drop axis to be 1, which
points to columns.
train_data_clean = train_data_clean.drop(['New_Price', 'Unnamed: 0'], axis=1)
test_data_clean = test_data_clean.drop(['New_Price', 'Unnamed: 0'], axis=1)

For the other features with missing data, the small amount of missing data is encouraging; we
can use the existing data in each feature to impute its missing values. We're going to
stay simple and use the SimpleImputer from the sklearn.impute module. The SimpleImputer
can use multiple strategies for imputation: it can use the mean of all existing values as a filling
for the missing values, or it can use the median, the most frequent value, or a constant value.
Here, we'll use the mean strategy. But before we can start imputing the missing data, we
need to do two things: encode our categorical features, and separate the features from the target labels. We're
going to encode our data numerically like we did in chapter 5, but instead of writing our own
function to do that, we're going to make use of the OrdinalEncoder from the
sklearn.preprocessing module. This encoder does the same thing our numerically_encode
method from chapter 5 did: it assigns each distinct category in a feature a number from 0 to the
number of categories minus one.
To make the ordinal encoder work on the categorical columns only, we use a neat tool
from the sklearn.compose module called ColumnTransformer. With ColumnTransformer, we can
specify multiple transformers to run on the data, each on a specific set of columns. This is
done by initializing the ColumnTransformer object with an array of tuples, where each tuple contains three
values that represent a specific transformation on the data:

• The first of these values is the name of the transformation operation, and that's
something we specify.
• The second value is the transformer object which is going to apply the transformation.
• The third value is the list of columns on which the transformer is going to run.

So for our case here, our ColumnTransformer would be initialized as follows:


categorical_features = [
'Name','Location', 'Fuel_Type', 'Transmission', 'Owner_Type'
]

ColumnTransformer([
('categories_encoder', OrdinalEncoder(), categorical_features)


], remainder='passthrough')

The remainder parameter determines what to do with the remaining columns that were not
specified in the transformers list. By default, this parameter is assigned the value "drop",
which drops all the unspecified columns, but we want to keep those numerical columns; hence
we give it the value "passthrough", which concatenates the remaining columns after the
specified ones, unchanged and in their original order. So in our example above, the output
would have the encoded Name, Location, Fuel_Type, Transmission, Owner_Type columns
first and then the unchanged Year, Kilometers_Driven, Mileage, Engine, Power, Seats, Price
columns afterwards. Note that the encoder will be fit on both the training and testing data
combined, to be able to encode all the possible values. We use DataFrame.append to
concatenate the training and testing data together into one data frame for fitting the encoder.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split

categorical_features = [
'Name','Location', 'Fuel_Type', 'Transmission', 'Owner_Type'
]

encoder = ColumnTransformer([
('categories_encoder', OrdinalEncoder(), categorical_features)
], remainder='passthrough')

all_clean_data = train_data_clean.append(test_data_clean)
encoder.fit(all_clean_data)

train_data_encoded = encoder.transform(train_data_clean)
test_data_encoded = encoder.transform(test_data_clean)

X_train, y_train = train_data_encoded[:, :-1], train_data_encoded[:, -1]


X_test, y_test = test_data_encoded[:, :-1], test_data_encoded[:, -1]

Now that we've preprocessed our data, we can fit our simple imputer on the training data and
use it to fill in the missing values in the testing data. What's going to happen is that the imputer will
calculate the mean of the existing data along each feature from the training data, and these
calculated means are used to fill any missing values in both the training and testing data. This is
necessary to prevent information in the testing set from leaking into the training process.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train)

X_test_imputed = imputer.transform(X_test)

Now our data is ready to train the decision tree.


7.2.2 Training and Evaluating the Decision Tree


Like any scikit-learn model, training a decision tree is nothing but running the fit method of a
DecisionTreeRegressor object, which can be found in the sklearn.tree module.
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train_imputed, y_train)

By default, scikit-learn's implementation of decision trees randomly permutes the features
before each split; this can result in non-deterministic behavior in case two pivots result
in the same MSE. In such a tie, the pivot that comes first in the search will be chosen, and it will
probably have a different sub-tree than the other pivot in the tie. A few cases of such ties with
random permutation would result in a different tree each time we fit the model; that's why we
fix the random_state in the initialization of the model, to always force deterministic
behavior.
To evaluate our trained decision tree on the testing data, we use the mean_squared_error
function from sklearn.metrics. This function calculates the MSE value between the labels
predicted by the model and the true labels from the test set.
from sklearn.metrics import mean_squared_error

y_predicted = model.predict(X_test_imputed)
mse = mean_squared_error(y_true=y_test, y_pred=y_predicted)
print("MSE = {:.2f}".format(mse))

> MSE = 29.52

We can see that the MSE of our model is higher than 15; hence our decision tree cannot be
accepted. We need to start thinking about how we can improve its performance and move it
towards the acceptance criterion. We started our discussion of decision trees with one
interesting property, which was the fact that decision trees can be easily visualized. So in
order to understand what this decision tree is doing wrong and attempt to improve it, visualizing
our trained tree would be a good place to start.
Scikit-learn provides us with the ability to visualize trained decision trees with the help of
the well-known graph-drawing library graphviz. With the export_graphviz function from the
sklearn.tree module, we can export the trained tree in graphviz format (called dot format).
Using the exported dot format of the tree, we can use the Python bindings for graphviz 30 to
export the visualized tree to a PDF file where we can examine and inspect the tree.
Once the following code finishes (it could take a few minutes) and you open the PDF file, you'll
know why we exported it to a file and didn't display it directly in the notebook.

30
The python bindings for graphviz can be installed using Anaconda via: conda install -y python-graphviz


from sklearn.tree import export_graphviz


import graphviz

numerical_features = [
'Year', 'Kilometers_Driven', 'Mileage', 'Engine', 'Power', 'Seats'
]

dot_data = export_graphviz(
model,
# This allows the visualization to put feature names in decision nodes
feature_names=categorical_features + numerical_features,
# This makes the nodes with rounded corners
rounded=True,
# fills the nodes with colors indicating how high/low the decision value is
filled=True,
# uses special characters like ≤ instead of <
special_characters=True
)

visualized_tree = graphviz.Source(dot_data)
visualized_tree.render("visualized_tree")

When opening the visualized_tree.pdf file, we can see that the tree is huge! It requires multiple
zoom-ins to start seeing the nodes clearly. In figure 7-8, we can see a very small portion
of the tree, and it shows that each of the visible leaf nodes has only a single sample in it.
We can verify that this is the case for the whole tree by calling the get_n_leaves() method
on our model to get the exact number of leaves and comparing it to the number of training
points in X_train.
training_count, _ = X_train.shape
leaves_count = model.get_n_leaves()

print("{} data samples are put into {} leaf nodes".format(


training_count, leaves_count
))

> 4514 data samples are put into 4286 leaf nodes


Figure 7-8: a small portion of the visualized tree. Decision nodes are the nodes that have a rule in the first line
(like Year ≤ 2017.0). mse is the value of MSE at this node, samples is the number of data points under the
node, and value is the mean of samples prices under the node.

Instead of learning decision paths from multiple samples that are similar to each other, our
model has learned a unique decision path for almost every sample in the training data. This is
typical overfitting behavior. To understand why this is overfitting, take a look at the
highlighted leaf nodes in figure 7-8. These are two cars that are similar across many features,
but there's a 25K INR difference in their selling prices. This discrepancy in prices can be due to
some factor other than those we have as features. Maybe the one selling his car for 25K
less was in dire need of cash, so he lowered the price to sell it fast. Maybe the other, who's
selling the car for 25K more, has equipped the car with high-performance tires. These are all
part of the noise in the dataset, and the model is fitting decision paths to that noise; hence we say
that the model is overfitting.
A consequence of such overfitting is that if the training data changed, the model's
performance would probably change as well, because the model is most likely fitting a
different kind of noise with each different training set. That means that if we trained multiple
trees on different datasets drawn from the same distribution, we'd very likely get very
different trees on each set, such that if we combined all these trees into one single tree
representing their average, a lot of the trained trees would be very different from that average.
That is exactly the definition of high variance, and that's why an overfitting model is said to
have high variance: its shape (and consequently, its performance) can vary
significantly when trained on different random samples from the population.

Figure 7-9: Decision trees are high variance models; instances trained on different datasets drawn from the
same distribution tend to vary a lot from their average representation

But what's the problem with having a high variance model? What's so bad about it? To answer
this question we need to remember that our training dataset is itself a random sample from a
larger unseen distribution of cars and their prices. Fitting a high variance model to our training
data will most probably result in a model that goes beyond fitting the relation between the features
and labels and overfits the specific noise signals in our random training sample. Once we
deploy such a model to production and feed it other data from the population, data that has no
noise or different noise signals, the model will fail miserably. We need to constrain our model's variance to
ensure it performs well on new and unseen data.

7.3 Trim the Tree or Grow Yourself a Forest


Before we start exploring how we can lower the variance of our tree model, we first need to
get an estimate of how high our current model's variance is. Because the details of estimating
a model's variance are out of this chapter's scope, we created a special function called
estimate_tree_variance in the chapter's utils module that can do that for us. This function
takes the un-imputed version of the data and returns a float representing the estimated
variance of decision trees on this dataset.
from utils import estimate_tree_variance

tree_var = estimate_tree_variance(X_train, y_train)


print("Estimated Tree Variance = {:.2f}".format(tree_var))

> Estimated Tree Variance = 15.22

This estimated value will help us in comparing and showing how the constrained models we’re
going to build in this section will have less variance than our original model.

7.3.1 Pruning the Tree


The observation that led us to realize that our tree is suffering from high variance is the fact
that the leaf nodes of the tree have only one sample each, and there are almost as many leaf nodes
as there are training samples. This suggests that the tree will overfit less if we
increase the number of samples in the leaf nodes, or equivalently 31 reduce the number of
leaf nodes or reduce the depth of the tree. Luckily, scikit-learn provides numerous
parameters in the initialization of a DecisionTreeRegressor object that we can use to
achieve such pruning. These parameters include:

• max_depth: which controls the maximum depth that the tree can reach. Its value is
defaulted to None, so the tree is grown as deep as possible by default.
• min_samples_split: which controls the minimum number of samples required in a
node to split into two further nodes. Its default value is 2, which means that the tree
will keep splitting nodes till it isolates each single sample in a node.
• min_samples_leaf: which determines the minimum number of samples required to be
in a leaf node. This defaults to 1.
• max_leaf_nodes: which specifies the maximum number of leaf nodes that should exist
in a trained tree. This defaults to None, which drives the tree to have as many leaf
nodes as possible.
• min_impurity_decrease: impurity is another term for the loss function that the tree
is trying to optimize, which is the MSE here. This parameter says that a node will be
split only if the decrease in MSE due to the split is bigger than its value. Its default value is
0, so any split, no matter how small its MSE decrease, will be carried out.

31
The two are equivalent because if we grouped multiple samples from the training set into one leaf node, there wouldn't be enough samples to make as many
leaf nodes as before. This will also reflect on the depth of the tree, as there will not be as many samples to do more splits on and go deeper.
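
Any of these pruning parameters can simply be passed when the model is created; the following minimal sketch uses arbitrary placeholder values (not tuned choices) just to show the mechanics:

constrained_model = DecisionTreeRegressor(
    max_depth=10,          # never grow deeper than 10 levels
    min_samples_leaf=5,    # every leaf must hold at least 5 training samples
    random_state=42
)
constrained_model.fit(X_train_imputed, y_train)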

It's obvious that these parameters are hyperparameters, because there is no way we can
figure them out from training; they need to be specified before we even start training. Any of
these hyperparameters (or a combination of them) can be tuned to trim the tree while it's
being built, and the choice of which hyperparameter (or which combination of them) to tune
can differ from one problem to another. However, an empirical study 32 has found that
min_samples_leaf and min_samples_split are the hyperparameters that most affect the
performance of decision trees (specifically the type implemented by scikit-learn). Given that
information, and the fact that what caught our attention about the overfitting problem in our tree
was the number of samples in each node, we'll attempt to tune only min_samples_leaf.
But before we do so, it's time to take notice of two problems in the way we do
hyperparameter tuning using grid search with a train/validation split. We started doing the
training/validation split in order to test the hyperparameters against data unseen during
training without risking snooping data from the test set. However, when we do the
train/validation split on the data we have already reserved for training, we introduce two
problems:

1. We reduce the size of the actual training data, so our model is going to see fewer
examples to learn from, and that's probably going to make it learn less.
2. Due to the reduced size of both the new training and validation sets, each random
split will result in a very different validation set than the others, which in turn will
increase the variance in the validation error estimates. This variance in the data,
coupled with a high variance model like our decision trees, will make the situation even
worse.

EXERCISE 7.1

Verify the second problem above by writing a small script that does a train/validation split,
trains a decision tree on the new training data, then evaluates the trained tree against the
validation set using MSE. Loop this script 100 times, record the MSE each time, and plot a
histogram of the MSE values across the runs. Examine the resulting histogram and specify how
it verifies the 2nd problem above.

32
Mantovani, Rafael Gomes, et al. "An empirical study on hyperparameter tuning of decision trees." https://arxiv.org/pdf/1812.02207.pdf. Credits to
Mukesh Mithrakumar and his article How to Tune a Decision Tree? (https://towardsdatascience.com/how-to-tune-a-decision-tree-f03721801680) for
bringing that study to our attention.
The obvious solution to these two problems is to get more data, but that's something we
usually can't obtain; we have to live with the data we have, so we need to find a more
creative solution. A really creative solution to this problem is what's called k-fold cross-
validation (or k-fold CV). In k-fold CV, we start by randomly splitting our full training set into
k equal-sized subsets, which we call folds. To estimate the validation error of a model, we
iterate over the k generated folds, and for each one of them:

1. We hold out that fold to be used as a validation set, then


2. We train the model on the remaining k-1 folds combined together, then
3. We test the trained model against the held-out fold and record its score on that fold.

After we finish looping over all the k folds, we average the scores against each fold into a
single score, and that averaged score is our validation score estimate. Figure 7-10 shows that
exact process when k=3.


Figure 7-10: K-fold CV with k=3. The process starts by randomly splitting the training set into 3 equal-sized
subsets, then a loop iterates over the 3 generated folds. In each iteration, one fold is held-out and the remaining
two are combined and used for training the model, then the model is tested against the held-out fold. The final
validation score is the average of all the validation scores from each iteration.

What makes k-fold CV powerful is that if we think about each iteration's validation score as
representative of a model trained on k - 1 folds and tested against the remaining fold, then
the average of all these scores represents a model that has seen all the data in training
and was validated against all the data as well, with no data snooping whatsoever! This
effectively solves the two problems we had before: there's no reduction in the training set,
and no variance in the validation set. Employing this cross-validation method in
hyperparameter tuning allows us to obtain stronger estimates of each candidate's
performance and hence a stronger basis to select the best parameters.
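
Scikit-learn exposes this plain (no tuning yet) form of k-fold CV through cross_val_score; a minimal sketch on our data could look like the following. Note that, strictly speaking, the imputation should be re-fitted inside every fold rather than done up front, which is exactly what the Pipeline introduced below takes care of:

from sklearn.model_selection import cross_val_score

# 5-fold CV estimate of the tree's validation error; scikit-learn reports
# the score as negative MSE, so we flip the sign when averaging
scores = cross_val_score(
    DecisionTreeRegressor(random_state=42),
    X_train_imputed, y_train,
    scoring='neg_mean_squared_error',
    cv=5
)
estimated_validation_mse = -scores.mean()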
Extending k-fold CV to hyperparameter tuning is very simple: all we need to do is
repeat the k-fold CV process for each candidate hyperparameter value in order to estimate
the performance of that candidate. This process can be computationally expensive if we're
exploring a big grid of parameters, so it needs a very optimized implementation to carry it out
efficiently. Fortunately, we don't need to worry about such an implementation, as scikit-learn
provides us with GridSearchCV, which implements cross-validated hyperparameter tuning in
an efficient and fast way that can even utilize parallel multi-core processors.
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# ensures random generations are the same across all runs


np.random.seed(42)

pipeline = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('tree', DecisionTreeRegressor())
])

model_selector = GridSearchCV(pipeline, param_grid={


'tree__min_samples_leaf': np.arange(1, 20),
}, scoring='neg_mean_squared_error', cv=5)

model_selector.fit(X_train, y_train)

In the previous code snippet, we see a new object called Pipeline. Scikit-learn provides this
object to chain multiple transformation operations with an estimator 33 at the end, where the
steps are executed sequentially; each step is defined by a tuple of a string representing the
step's name and the transformer/estimator used in that step. The problem Pipeline is here
to solve is that we can't use the imputer we fitted before in the k-fold CV process, as it would
snoop data into each iteration of the CV. In each iteration of the k-fold CV, we have a different
training set, and an imputer should be fitted to this specific training set and nothing else; the
imputer we fitted earlier, on the other hand, has seen all the data, including the validation set
of each iteration. A Pipeline object allows us to chain both imputation and training into
a single interface. Each Pipeline object has a fit/predict interface like most scikit-learn
estimators, and calling fit on the pipeline fits all the transformers in sequence and then fits
the final model on the transformed data. With this grouping of the imputation and the tree model, we can
use k-fold CV with no fear of data snooping.
33
In scikit-learn lingo, estimator means an ML model. So an instance of DecisionTreeRegressor is an estimator, and an instance of
KNeighborsClassifier is an estimator. We'll be using the terms estimator and model interchangeably.

After constructing our pipeline, we create a GridSearchCV object and give it our pipeline to
tune its hyperparameters. Moreover, we specify the parameter grid we want to search by
providing a dictionary to the param_grid argument. Each key in that dictionary is the name of
a parameter we want to tune, and its value is the range of candidate values we want to
explore for that parameter. Parameter names are formatted as [<step name>__]<parameter
name>. So in our case here, we want to tune min_samples_leaf of the tree step in the pipeline;
hence our parameter name in the grid dictionary is tree__min_samples_leaf. If we provided
a model directly and not a pipeline, the [<step name>__] part of the format would be omitted. We
also provide two more important parameters to our GridSearchCV object:

• scoring, which defines how each iteration’s validation score is calculated. Here we
specify it to be 'neg_mean_squared_error' which is the negative MSE. We use the
negative because when selecting the best model, GridSearchCV picks the model with
the highest score by default. Because we want the lowest MSE, we tell the selector to
pick the highest negative MSE.
• cv , which is the number of folds to split the data into, a.k.a the value of k in k-fold
CV. We set it here to 5.

Once the GridSearchCV object is ready, commencing the search is done by simply calling
the fit method of the object and providing it with the original un-imputed data. When the
search is done, we can access the best parameters and the best model through the
best_params_ and best_estimator_ attributes.
print(model_selector.best_params_)
best_trimmed_tree = model_selector.best_estimator_

> {'tree__min_samples_leaf': 7}

While the ordinary train/validation split method determined that the best value for
min_samples_leaf is 3, k-fold CV determined that 7 is a better value. We can confirm that k-
fold CV's decision is the better one by calculating the test MSE of best_trimmed_tree.
One of the cool features of GridSearchCV is that the model available in
best_estimator_ is refitted on all the data after the best parameter values have been
determined, so we don't need to do that ourselves.
best_trimmed_preds = best_trimmed_tree.predict(X_test)
best_trimmed_mse = mean_squared_error(y_test, best_trimmed_preds)

# we access steps of the pipeline like we get values from a dict


best_trimmed_n_leaves = best_trimmed_tree['tree'].get_n_leaves()
best_trimmed_variance = estimate_tree_variance(
X_train, y_train, min_samples_leaf=7
)

print("[min_samples_leaf = 7] MSE = {:.2f}".format(best_trimmed_mse))


print("[min_smaples_leaf = 7] Leaves = {}".format(best_trimmed_n_leaves))
print("[min_smaples_leaf = 7] Variance = {:.2f}".format(best_trimmed_variance)

> [min_samples_leaf = 7] MSE = 20.86


[min_smaples_leaf = 7] Leaves = 504
[min_smaples_leaf = 7] Variance = 9.03

While the MSE was reduced by almost 9, the model’s error is still higher than the acceptance
criteria, even though our new trimmed tree is much smaller (504 leaves vs. 4286 leaves) and has
lower variance (9.03 vs. 15.22) compared to the un-pruned tree. It seems that we need to
explore more ideas to lower the variance further in order to meet our acceptance criteria.

Decision Trees for Classification


Decision trees can easily be applied to classification problems; they are not limited to regression. The
only thing that needs to change is the metric we minimize at each split. Scikit-learn’s
DecisionTreeClassifier supports two such metrics, gini and entropy, and there are great resources out there
explaining the two. The same is true for random forests, which we’re going to discuss next; the classification
variant is supported by scikit-learn in the RandomForestClassifier class.
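As a minimal sketch of the classification variant, here is a DecisionTreeClassifier fitted on scikit-learn's bundled iris dataset; the dataset and the hyperparameter values are chosen purely for illustration and are not part of the chapter's used-car problem.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# criterion switches the split metric between 'gini' (the default) and 'entropy'
clf = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=5)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # mean accuracy on the held-out split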

7.3.2 Random Forests


The idea of the wisdom of the crowds, where the judgment of a collective is usually better than
the judgment of an individual, has been known since the earliest times of history. Here’s the Greek
philosopher Aristotle capturing the idea by comparing a feast organized by many contributors
to a dinner provided by a single host:

“For the many, of whom each individual is but an ordinary person, when
they meet together may very likely be better than the few good, if regarded
not individually but collectively, just as a feast to which many contribute is
better than a dinner provided out of a single purse” 34.

This idea seems enticing to apply to our decision trees. The most straightforward way to
apply it is to train multiple decision trees and aggregate their predictions into a single one by
averaging them. Clearly, we can’t train all of the different trees on the same training set,
because that way all the trees would be the same tree and there would be no point in aggregating
their predictions. Getting more data would solve that problem altogether, but as we
established many times, that’s not a feasible solution; we need to get more creative. We can
draw inspiration from k-fold CV and say that we can train each tree in the collection on a
random sub-sample of the training set; that way each tree will be different from the others and
the aggregation will make sense. This is indeed the correct line of thinking, but we need to
pay attention to a small caveat, which relates to the type of random sampling we’re going to
use to generate the training sub-samples.
There are two ways to draw random samples from a population:

• without replacement, where each sample selected is not returned to the population
and not considered for the next sample, and
• with replacement, where each sample selected is returned to the population before we
select the next sample, hence the next sample could be the same as the previous one.

34
From Politics by Aristotle, Book III, Part XI. https://classicalwisdom.com/greek_books/politics-by-aristotle-book-iii



To further illustrate the difference between the two, let’s imagine a situation where we have a
box with 10 balls in it, each ball with a number printed on it from 0 to 9, and the goal is to
generate a random 10-digit number by repeatedly selecting a random ball from the box.

Figure 7-11: In sampling without replacement (top), each ball we randomly pick from the box is not returned
to it, which prevents the next digit in our number from having the same value as a previous digit. However, in
sampling with replacement (bottom), each randomly selected ball gets returned to the box, which allows the
next digit in our number to take any possible value from 0-9, even if it appeared before.

If we sample without replacement, then the value of the next digit in our number will always
depend on the outcomes of the previous samples; if we sampled a 9 before, we can’t possibly
get another 9 afterwards. However, when we sample with replacement, the value of the next
digit will not depend on the outcome of the previous samples, because even if we got a 9
before, we can still get it again since we put it back in the box.
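We can see the same distinction with numpy, which is how the sampling will be done in code; a tiny sketch, using the same ten digits:

import numpy as np

np.random.seed(42)
digits = np.arange(10)  # the ten balls, numbered 0 to 9

# without replacement: every digit appears exactly once (a permutation)
print(np.random.choice(digits, size=10, replace=False))

# with replacement: digits can repeat, and each draw ignores the previous ones
print(np.random.choice(digits, size=10, replace=True))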


From this small example we can see that random samples drawn without replacement are
dependent on each other. It follows that trees trained on random sub-samples drawn
without replacement from the training data will also be dependent on each other, and that’s a
bad thing to have if we plan to aggregate the trees in the end. To understand why this is bad,
let’s imagine a team of three people working in a software house: John, Jane, and Janet. Jane
is the boss of John and Janet, and she’s kind of an oppressive boss who doesn’t like her
subordinates to disagree with her and tends to punish them if they do. One day this team is
asked by a client for their collective opinion about some feature he wants to add to the system
the team is working on. Jane, the boss, says that the feature should be added, but John
and Janet think that the feature is not needed and will introduce unnecessary complexity to the
system. But because of how Jane treats her subordinates, John and Janet will have to say that
they like the feature too. This is clearly a failed application of the idea of the wisdom of the
crowds.
So to have a better aggregation of decision trees, we need to train each tree on a sub-
sample drawn with replacement from the training data, and that’s exactly what the model
called a random forest does. A random forest is a model where we train a collection of relatively
independent and uncorrelated decision trees by randomly sub-sampling the training set with
replacement to train each tree in the collection. To further de-correlate the trained trees from
each other, a random forest even trains each tree by forcing it to search for the best split at
each node within a different random subset of the features in the training data. This way, each tree
focuses on a different part of the data using a different set of features, and no single dominating
feature is going to force the trees to have a similar structure. That’s why this model is
called a random forest: it grows a forest of trees on randomly sampled training data
with randomly sampled features.
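To make the “sub-sample with replacement” part concrete, here is a tiny, self-contained sketch of how each tree’s bootstrap sample of row indices could be drawn; the toy size of 8 rows is purely for illustration.

import numpy as np

np.random.seed(42)
m = 8                      # a tiny training set, purely for illustration
row_indices = np.arange(m)

# one bootstrap sample per tree: m indices drawn *with* replacement, so some
# rows appear several times while others are left out entirely
for tree_id in range(3):
    bootstrap = np.random.choice(row_indices, size=m, replace=True)
    print("tree", tree_id, "trains on rows", bootstrap)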
We can train a random forest for our problem here using the RandomForestRegressor
found in the sklearn.ensemble module. The hyperparameters of RandomForestRegressor
include the same hyperparameters that we saw in the DecisionTreeRegressor model, and
those are there to control the individual trees grown in the forest. However, we do not tune
most of these hyperparameters when we train a random forest; it’s better to let the
underlying trees grow as much as they want, so that each becomes very good at the part of the
data it’s learning from. This indeed makes the variance of the individual trees very high, but as we’ll see,
the aggregation of all these high-variance trees will result in a very low variance forest.
The two most important hyperparameters we consider when training a forest are:

• n_estimators, which specifies the number of trees to grow inside the forest. We don’t
usually need to tune this parameter with grid search because it doesn’t affect the
performance of the forest much. So we simply pick a large number, say 100, and go
with it.
• max_features, which is actually a hyperparameter of the trees inside the forest. This
hyperparameter controls how many features should be randomly sampled when
searching for the best pivot when we split a node. This parameter is the one that needs
tuning with grid search.


We can train and tune a random forest the same way we trained and tuned decision trees in
the previous section, by using pipelines and grid search with k-fold cross validation. At the
beginning of the code, we add the statement np.random.seed(42) to set an initial seed for all
the random processes, so that the results will be consistent across different executions
without us having to fix all the random outcomes within the code.
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

forest_pipeline = Pipeline([
('imputer', SimpleImputer(strategy='mean')),
('forest', RandomForestRegressor(n_estimators=100))
])

forest_selector = GridSearchCV(
forest_pipeline,
param_grid = {
'forest__max_features': np.arange(5, 12)
},
scoring='neg_mean_squared_error',
cv=5
)

forest_selector.fit(X_train, y_train)

In that snippet, we train the forest while tuning max_features, with candidate values from 5
up to the maximum number of features, which is 11. We can access the best value for
max_features through forest_selector.best_params_ and the best forest trained on all the
data through forest_selector.best_estimator_ as before:
print(forest_selector.best_params_)
best_forest = forest_selector.best_estimator_

> {'forest__max_features': 5}

Now we can evaluate our forest against our held-out testing set and see that it outperformed the
best trimmed tree we had before:
from utils import estimate_forest_variance

best_forest_preds = best_forest.predict(X_test)
best_forest_mse = mean_squared_error(y_test, best_forest_preds)

best_forest_variance = estimate_forest_variance(
    X_train, y_train, max_features=5
)

print("[Random Forest] MSE = {:.2f}".format(best_forest_mse))


print("[Random Forest] Variance = {:.2f}".format(best_forest_variance))

> [Random Forest] MSE = 12.98


[Random Forest] Variance = 1.16


Random forests gave us a reduction of around 8 in the prediction error, making it around 13,
which is lower than the acceptance criteria. You did it! Now the model can be integrated into the
supply pipeline of your colleague’s used car business. Even the variance of the forest is much
lower than that of its pruned tree counterpart. Why is that? And how exactly does variance affect the error on
testing data? Does that happen for decision trees only, or is it a general thing? These are
questions to be answered by a little theoretical investigation that we can freely embark on
now that we have delivered our product.

Ensemble Learning
Random forests are instances of a more general learning technique called ensemble learning, in which multiple
ML models are grouped together in some way to produce a single better model. There are multiple ways to do
ensemble learning, including:

• Voting: where multiple models are trained on the same training set and the final prediction is the
most common class among all the models in the case of classification, or the mean of all their predictions in
the case of regression. Scikit-learn supports that type of ensemble through the VotingClassifier and
VotingRegressor objects.
• Bagging: short for bootstrap aggregation, where multiple instances of the same model are trained on
bootstrap samples from the training set (which are samples taken with replacement) and their judgments are
aggregated in the same way as voting. Random forests are an example of bagging ensembles applied to
decision trees. Scikit-learn supports bagging on any other model through its BaggingClassifier and
BaggingRegressor objects.
• Boosting: where multiple instances of the same model are trained in sequence, with each step’s model
focusing more on the examples that were incorrectly predicted by the previous step’s model. Scikit-learn
supports this kind of ensemble through the AdaBoostClassifier and AdaBoostRegressor
objects.

Ensemble learning methods usually perform better than any single model alone, which makes them a very valuable
tool for a data scientist or an ML engineer to have. We only scratched the surface of ensemble learning here
with random forests, so I highly recommend that you put learning more about ensembling next in your plan. A
minimal sketch of the three flavors in scikit-learn follows.
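Here is a minimal sketch of how the three flavors could be constructed in scikit-learn. The base models and settings are illustrative assumptions, not tuned choices; each object is then fitted and used like any other estimator, with fit and predict.

from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Voting: different model types trained on the same data, predictions averaged
voter = VotingRegressor([
    ('lr', LinearRegression()),
    ('knn', KNeighborsRegressor()),
    ('tree', DecisionTreeRegressor()),
])

# Bagging: many copies of one model, each fit on a bootstrap sample (a random
# forest is essentially this, plus per-split feature sub-sampling)
bagged_trees = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100)

# Boosting: models trained in sequence, each focusing on the previous ones' mistakes
boosted_trees = AdaBoostRegressor(DecisionTreeRegressor(max_depth=3), n_estimators=100)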

7.4 What Controls Generalization?


We have practically seen how decision trees with high variance can result in overfitting and
poor performance on unseen data. We also saw how gradually reducing the variance of trees
(either via pruning or via aggregation in random forests) helps reduce overfitting and boosts
the performance on unseen data. However, our findings remain tied to the specific
example we worked on, and while intuition suggests that these findings are true in general, we
can’t say for sure that they are; intuition can be deceiving sometimes. So in this section, like its
counterparts in previous chapters, we start looking from a more theoretical, and


hence more general, perspective. Our investigation here will center on the relation
between generalization and how the variance of the model affects it. To start this
investigation properly, we first need to discuss what exactly generalization entails.
In chapter 5, we compared the generalization capabilities of the 1-NN model, measured by
its risk R(h1-NN), to the risk of the most optimal classifier that we could possibly have: the
Bayes classifier. In this chapter, we’re going to take a different approach to studying the risk
of decision trees R(hDT), one that depends on the model’s error on the training data. This new approach
will allow us to see the actual relation between a tree’s variance, its size (represented by the
number of leaves), and its generalization abilities. Before we can start with this new approach, we
need to define a concept that’s very related to risk, called the empirical risk.
Empirical risk 35 is just the training error, dressed up in a fancy name; in other words, it’s the
average error (measured by the loss function) that the model makes across all the training
samples. We denote the empirical risk of a hypothesis by Remp(h), with the subscript emp
emphasizing that this is an empirical risk.
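Written out, for a hypothesis h, m training pairs (xi, yi), and a loss function L, that definition is simply:

$$R_{emp}(h) = \frac{1}{m}\sum_{i=1}^{m} L\big(h(x_i),\, y_i\big)$$

while the (true) risk R(h) is the expected value of the same loss over the underlying population, a point we'll use shortly.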

The rest of the section is a bit mathy, so here's our usual reminder that the math
shouldn't be scary. Take a deep breath, keep your paper and pencil close, and always
remember to read what the equation is trying to say.

7.4.1 Why do Machines Learn from Data?


Up to this point, we have seen 5 machine learning models that were actually able to learn
patterns and relations from the given data, and we intuitively know why these models were
able to learn anything from the data. It all boils down to what we said in chapter 3: with more
and more data unbiasedly sampled from the population, the sample will reveal the patterns
and relations existing in the population. While this intuition is true, a lot more can be
uncovered if we think about it in more depth. It turns out that this intuition has a neat
mathematical formulation called the law of large numbers.
The law of large numbers states that as the number of trials increases, the average
outcome of a random experiment gets closer and closer to the true expected value. Take
for example one of the simplest random experiments ever, tossing a fair coin; one would
expect that we’ll get heads 50% of the times we toss a fair coin. The law of large numbers says
that if we toss that coin a number of times, and after each toss we count the number of

35
The word empirical means practical. This means that the empirical risk is something we find in our practice, where only the training data are available.


times heads has appeared so far, we’ll find that this count approaches 50% of the trials as we toss
the coin more and more. We can simulate that programmatically using numpy.random.choice,
passing it an array that has two elements: ‘H’ for heads and ‘T’ for tails. This function will
simulate a random choice between ‘H’ and ‘T’ with equal probability, which is what we
expect from a fair coin.
np.random.seed(42)

trials = np.arange(10000)   # 10K coin tosses
heads_count = 0
averages = []
for i in trials:
    sample = np.random.choice(['H', 'T'])
    heads_count += 1 if sample == 'H' else 0
    averages.append(heads_count / (i + 1))  # fraction of heads seen so far

plt.plot(trials, averages)  # plots the empirical fraction of heads at each toss
plt.axhline(0.5, linestyle='--', color='black')  # plots the expected fraction of heads

Figure 7-12: The blue solid curve represents the empirical heads-count fraction at each of the 10K trials,
while the black dashed line represents the expected heads-count fraction. The plot shows the law of large
numbers in action: as the number of trials increases, the empirical heads-count fraction approaches the
expected fraction.

In our little experiment above, we simulate tossing a fair coin 10K times, and at each trial
we record the fraction of heads that we have gotten so far; we’ll call that the empirical fraction. We then
plot these fractions over time and compare them to the expected fraction, which is 50% (or
0.5). This comparison, as shown in figure 7-12, demonstrates that the empirical
fraction approaches the expected fraction as the number of trials increases. Or in other words,
as the number of trials increases, it becomes less likely that the difference between the
empirical fraction and the true fraction will be large. This last sentence is exactly what the
mathematical formula of the law of large numbers says.


In a more general setting, the law of large numbers says that as the number of samples m
increases, it becomes less likely that the difference between the empirical average (or the empirical
mean) and the expected value of a random variable will be large. To write that last sentence
mathematically for some random variable X, we define an arbitrarily small positive value that
we denote with the symbol epsilon (є), and then we talk about the probability of the absolute
difference between the empirical mean and the expected value E[X] being greater than or equal
to that small value є. So the law of large numbers is finally written as:

$$\lim_{m \to \infty} P\left(\left|\frac{1}{m}\sum_{i=1}^{m} X_i - E[X]\right| \geq \epsilon\right) = 0$$

This is simply read as: as the number of samples m becomes larger and approaches infinity,
the probability of the gap between the empirical mean and the expected value being greater than
some small value є goes to zero.
By noticing that the empirical risk Remp(h) that we introduced earlier is actually an empirical
average of losses across m samples, and that the risk R(h) is the expected value of that loss, we
can apply the law of large numbers to our risk values and say that, for some hypothesis h’:

$$\lim_{m \to \infty} P\left(\left|R(h') - R_{emp}(h')\right| \geq \epsilon\right) = 0$$

This is a mathematical justification for why machines can learn from data, and it agrees
with the intuition we had all along: the more data we have to train our model on, the smaller the gap
between the training error and the test error becomes; we’ll call that gap the
generalization gap from now on. So if we manage to get our training error down to a small
value, our test error will be small too, provided that we have sufficient data. While this is a
concrete and rigorous justification for why we can learn models from data, it still leaves open
several points, like how much data is considered sufficient, and how the model’s size and
variance play a role in that. This motivates the need to take a closer look at what the value of
the probability above looks like.

7.4.2 Generalization Bounds


Imagine that you’re lost, you ask for directions, and a nice man tells you, “as long as you’re
walking, you will reach your destination.” While it’s a warm and reassuring thing to say, it’s
not really helpful; you still don’t know which way you need to go. You need someone to tell
you which way to go or what ride to take, something like that. In our situation here, the law of
large numbers is somewhat like that nice man; it’s telling us that as long as we have large
amounts of data, we’re eventually going to close the gap between the test error and the training
error. We still need something to tell us how much data we need to achieve that, and this is
what concentration inequalities try to do.


Concentration inequalities are a set of inequalities that bound the probability of how
much a random variable differs from some value. The one that interests us
the most is Hoeffding’s inequality, which bounds the probability of the gap between the
empirical mean and the expected value being larger than a specified small value. For a
random variable X that takes a value between a and b, Hoeffding’s inequality says that:

$$P\left(\left|\frac{1}{m}\sum_{i=1}^{m} X_i - E[X]\right| \geq \epsilon\right) \leq 2\exp\left(\frac{-2m\epsilon^2}{(b-a)^2}\right)$$

This basically says that the probability of having a large gap between the empirical mean and
the expected value decays exponentially as m, the size of the sample, grows. It’s easy to
notice that we can apply that inequality to the gap between the empirical risk and the risk for
some hypothesis h’. We can assume, without loss of generality, that our loss takes a value
between 0 and 1 so that (b-a)² is just 1 36. That way our math becomes simpler. When we
apply that, we get:

$$P\left(\left|R(h') - R_{emp}(h')\right| \geq \epsilon\right) \leq 2\exp\left(-2m\epsilon^2\right)$$

So for a specific hypothesis h’, the probability of having a big generalization gap gets exponentially
smaller as the size of the training set, m, increases. However, that statement is only true for
a single specific hypothesis, h’. We can get many different hypotheses from a machine
learning algorithm by training on different training samples, so there’s no guarantee that we’re going to
get that specific h’ hypothesis, whatever it is. If we assume that any hypothesis we
can get from our learning algorithm comes from a set of possible hypotheses, denoted by H,
that we call the hypothesis space, is there a way to define a similar bound for any possible
hypothesis h from that space?
We can approach that question by thinking about what the probability of having a large
generalization gap for any hypothesis h entails. Any hypothesis h from the space H can have a
large generalization gap if the first hypothesis in the space h1 does, or if the second does,
or if the third does, or if h4 does, and so on. That’s true because that arbitrary hypothesis h is
eventually going to be one of the possible hypotheses in the space H, so it can have a large
generalization gap if any of the space’s hypotheses does. By denoting the absolute difference
between the risk and its empirical version for a hypothesis h as G(h), we can say that for any
hypothesis h:

$$P\left(G(h) \geq \epsilon\right) = P\left(G(h_1) \geq \epsilon \;\text{ or }\; G(h_2) \geq \epsilon \;\text{ or } \cdots \text{ or }\; G(h_{|H|}) \geq \epsilon\right)$$

36
When an assumption is said to be made “without loss of generality”, it means that this assumption does not make the result we arrive at any less general
than what we’d have gotten without making it. That’s true for the assumption here because any value that our loss may take can be mapped to a
unique value between 0 and 1 (in theory at least).


Where |H| is the size of the hypothesis space, or the number of possible hypotheses in it.
While there’s no way for us to calculate the value of the probability on the right-hand side
(RHS) directly, we can bound its value using another famous inequality in probability theory (the
union bound), which states that:

$$P\left(A_1 \text{ or } A_2 \text{ or } \cdots \text{ or } A_n\right) \leq \sum_{i=1}^{n} P(A_i)$$

Which implies that for any hypothesis h in the hypothesis space H:

$$P\left(G(h) \geq \epsilon\right) \leq \sum_{i=1}^{|H|} P\left(G(h_i) \geq \epsilon\right)$$

We know the value of the probability under the summation on the right-hand side, because it’s
bounded by the same exponential function 2exp(-2mє²) we got from Hoeffding’s inequality. By
using that Hoeffding bound and rewriting G(h) back in its original form, we can say that for
any h in H:

$$P\left(\left|R(h) - R_{emp}(h)\right| \geq \epsilon\right) \leq 2|H|\exp\left(-2m\epsilon^2\right)$$

The summation was replaced by a multiplication by |H| because the bounded value of P(G(hi)
≥ є) is the same for any hypothesis hi, so the summation actually sums the same value |H|
times, which is equivalent to multiplying the bound by |H|.
Now let’s denote the probability on the left hand side with the symbol delta δ. In that way
we can achieve two things:

1. Take the generalization gap out of the probability operation and say that with
probability δ we have |R(h) - Remp(h)| > є. Equivalently, we can say that with
probability 1 - δ we have |R(h) - Remp(h)| ≤ є 37, which can be rewritten as:

$$R(h) \leq R_{emp}(h) + \epsilon$$

2. Express the value of є in terms of the training size m and the hypothesis space size |H|.
This is done by rearranging the terms of the inequality δ ≤ 2|H|exp(-2mє²), which results
in:

$$\epsilon \leq \sqrt{\frac{\ln|H| + \ln\frac{2}{\delta}}{2m}}$$

Where ln is the natural logarithm function 38. If you remember your high school algebra, you
should be able to arrive at the above inequality starting from the inequality δ ≤ 2|H|exp(-2mє²).
By substituting the bound from the second point in place of є in the inequality from the first
point, we arrive at one of the most fundamental rules in the theory of machine learning:

$$R(h) \leq R_{emp}(h) + \sqrt{\frac{\ln|H| + \ln\frac{2}{\delta}}{2m}}$$

This type of inequality is called a generalization bound. It says that for any hypothesis h
we can get from a learning algorithm, the test error is bounded by the sum of the training
error and a term that depends on the training data size m and the size of the hypothesis
space |H| from which h is chosen, which is a representative of how rich or complex our model
is. From this equation we can get two things:

• A confirmation of our intuitive understanding that more training data means smaller
testing error.
• A new insight about the relation of the hypothesis space size to the testing error. It
appears that as the hypothesis space gets richer and more complex by having more
possible hypotheses to choose from, the bound on the testing error gets higher. So it seems
preferable to learn a hypothesis from a small hypothesis space. We’ll investigate this
relation more by looking at a concrete generalization bound for decision trees and
comparing it to our empirical observations from earlier. The small sketch after this list plugs
some numbers into the bound to make these two effects concrete.
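The following small sketch, which assumes a hypothetical value for ln|H| rather than any particular model, just evaluates the square-root term of the bound (with δ = 0.05) to show these two effects numerically:

import numpy as np

def complexity_term(log_H, m, delta=0.05):
    # the square-root term of the generalization bound, taking ln|H| directly
    return np.sqrt((log_H + np.log(2 / delta)) / (2 * m))

# more training data (larger m) shrinks the term...
for m in (100, 1000, 10000):
    print("m =", m, "->", round(complexity_term(log_H=50.0, m=m), 3))

# ...while a richer hypothesis space (larger ln|H|) inflates it
for log_H in (10.0, 50.0, 200.0):
    print("ln|H| =", log_H, "->", round(complexity_term(log_H, m=1000), 3))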

37
This is true because probabilities must sum to one. If the probability of the gap being greater than є is 0.4, then the probability of the gap being smaller than
or equal to є is 1 - 0.4 = 0.6.
38
The natural logarithm of a number a is the power to which the natural number e is raised in order to get a. In other words, if eb=a, then ln a = b.


7.4.3 The Bias-Variance Trade-off


To transform the generalization bound above into one that is specific to decision trees, the only
thing we need to do is determine the size of the decision tree hypothesis space |HDT|. There
are many ways to count the possible number of decision trees in HDT; one way to count them
would be based on the depth of the tree, another would be based on the number of leaf
nodes in the tree. In order to give a mathematical justification of the pruning technique we
used back in section 7.3.1, which aimed to reduce the number of leaf nodes, we’re going to
use the leaf node counting approach.
If we denote the number of features in the data by n, then the number of possible
decision trees that have k leaf nodes is bounded by the inequality below:

$$|H_{DT}| \leq n^{\,k-1}$$

We’re going to skip the details of how this bound can be obtained so that we keep our math
strictly related to exploring the relation between generalization and the number of leaves in a
decision tree. However, obtaining this bound is a neat exercise that involves some interesting
math. If you enjoy math proofs, I recommend that you give proving this bound a shot. Now,
we plug this bound on |HDT| in to get the generalization bound for decision trees 39:

$$R(h_{DT}) \leq R_{emp}(h_{DT}) + \sqrt{\frac{(k-1)\ln n + \ln\frac{2}{\delta}}{2m}}$$

The relation between the number of leaves in a decision tree and its generalization abilities
becomes concrete with this equation; the smaller k gets, the smaller the bound on the
test error becomes. That agrees with what we saw in practice when we pruned the tree
by setting min_samples_leaf=7, which effectively reduced the number of leaf nodes in the
trained tree from 4286 to 504 and brought the MSE down from 29.52 to 20.86. At face value,
the above equation suggests that we can keep reducing the number of leaf nodes and
expect the testing error to drop as we go, but this contradicts reality. If reducing the leaf
nodes further would generally result in a reduction in test error, our GridSearchCV would have
chosen a higher value than 7 for min_samples_leaf. Such an expectation would only be true
if we assumed that the training error doesn’t get affected by reducing the number of leaves
and that it would stay low no matter how small the leaf count becomes. This assumption,
however, is very far from the truth.
To see how this assumption fails, we can compare the training error and the testing
error of a decision tree across different values of the leaf count. In the chapter’s utils

39
To arrive at the shown generalization bound of decision trees, we make use of two identities of logarithms: the first states that ln(ab) = ln(a) + ln(b) and the
second says that ln(a^b) = b×ln(a)


module, we provide a function called draw_tree_learning_curves that does this
comparison for us. This function does the following:

1. It trains decision trees with different leaf node counts, from small to large, and
records each tree’s MSE on the training data and the testing data. Moreover, it records
an estimate of the variance for a tree of that size.
2. After that, the function draws two plots:
a) One plot showing the testing error vs. the training error over the different leaf counts.
b) Another plot showing the generalization gap (|testing error - training error|) vs. the
variance over the different leaf counts. This should help us understand the relation
between the generalization bound above and the variance of the model.

Using this function is pretty simple; we just need to send it the training and testing data, un-
imputed. The internals of the function are not complicated either; it’s all stuff we already
know, we just encapsulated it in a utils function for brevity, and you should be able to
understand it if you look at the code.
from utils import draw_tree_learning_curves

draw_tree_learning_curves(X_train, y_train, X_test, y_test)

Figure 7-13: (Left) Training error curve (green, solid) vs. the testing error curve (red, dashed) across different
numbers of leaf nodes. We can notice from the plot that there is a tradeoff between the number of leaves and the
model’s training and testing errors. Too many leaves bring the training error to a minimum but rapidly
increase the testing error, and too few leaf nodes increase the training error, indicating that the model
is not learning anything useful, which results in a high test error as well. (Right) Variance curve (black, solid) vs. the
generalization gap curve (blue, dashed) across different numbers of leaf nodes. This shows that a model’s
complexity and its variance are correlated.

The result of our little experiment is shown in figure 7-13. In the plot on the left we see that
the training and testing errors behave as if they’re in a tradeoff situation with respect to
the model’s complexity, represented by the number of leaves. When we increase the number
of leaf nodes, this drives the training error to its minimum, but the error on unseen data gets
higher in return. In such a case, we say that the model is overfitting, or suffering from high
variance. On the other hand, when the number of leaves is reduced, the training error
becomes larger, indicating that the model is not learning anything, which results in a high
testing error as well. This is expected, because as the number of leaves gets lower, the tree
becomes too simple to be able to capture the relation between the labels and the features.
Think of a decision tree that has only two leaves; it will depend on only one feature to split the
sample space into two regions. It’s highly unlikely that one feature would be able to explain
the relation between the data and the labels. That’s why trees with a small number of leaves
have higher training errors, and in such cases we say that the model is underfitting or
suffering from high bias 40.
That trade-off between overfitting and underfitting is a universal problem throughout the
whole realm of machine learning, and it goes by the name of the bias-variance tradeoff. It’s
usually the case with machine learning models that when the model becomes more
complex, it overfits the training data and performs badly on new unseen samples.
When the model’s complexity is reduced to avoid that situation, the model underfits the
training data, resulting in bad performance on unseen samples as well. The key to solving this
dilemma lies in the shape of the training vs. testing errors plot, and we’ve been using it for
some time now.
The plot on the left of figure 7-13 is a specific instance of a more general pattern that we see
with most machine learning models, which is shown in figure 7-14. In that pattern, the
training error follows a decreasing curve as the model gets more complex, while the testing
error follows a U-shaped curve. The right amount of complexity, the sweet spot that
balances both the bias and the variance, is at the lowest point of the testing error curve, right
before that U-shaped curve starts climbing up. This sweet spot is what we try to find when we
test a model with different hyperparameters against a validation set. A lot of the
hyperparameters we saw control the complexity of a model, like min_samples_leaf in
decision trees and n_neighbors in k-NN. When we tune these hyperparameters, we
try to find the value that would achieve the lowest testing error, estimated through the
validation set.

40
There is a rigorous mathematical definition of what bias is, and we’ll get to it later. For now, let’s understand the word with its common meaning, which
implies some kind of inclination or prejudice towards something. Our two-leaf tree is suffering from high bias towards the single feature
selected for the split.


Figure 7-14: Most machine learning models suffer from the bias-variance tradeoff, where their testing and
training error curves follow a similar pattern. The training error follows a declining curve as the model’s
complexity increases, while the testing error follows a U-shaped curve: it starts high, goes low, and then climbs
up again after a certain complexity threshold. That threshold is the optimal complexity the model
needs in order to have the best performance.

With the left plot of figure 7-13 introducing us to the bias-variance tradeoff, we haven’t talked
much about the right half of the same figure, the one with the variance curve and the
generalization gap curve. That right half is going to kick in now and help us understand why
random forests are so effective.

7.4.4 Why do Random Forests Work so Well?


Deriving a generalization bound for random forests like the one we have for decision trees is
possible, but the math involved is advanced and cumbersome, so we won’t do that
exercise here. Instead, we’ll take a different route to see how aggregating multiple trees with
poor performance and high testing errors results in a model with superior performance and a
much lower testing error, like we saw earlier. That route starts with the right half of figure 7-13.
From the right half of figure 7-13, we can see that the variance of decision trees and
their generalization gap move together. When the variance is high, the gap is also high,
and vice versa. From the generalization bound, it’s easy to see that the generalization gap is a
function of the model’s complexity, or the hypothesis space size |H|. So it follows that the
model’s variance and |H| also move together. In other words, we can say
that the model’s variance and its hypothesis space size |H| are positively correlated, which
means that when |H| is high, we should expect the model’s variance to be high too, and when
the model’s variance is low, |H| should be low as well. Now let’s look at what the variance of a
random forest looks like.


A random forest is the average of multiple, fully grown decision trees. So if we assume
that we have q fully grown decision trees hDT1, hDT2, ..., hDTq, then a random forest hRF built by
averaging these trees can be represented as:

$$h_{RF}(x) = \frac{1}{q}\sum_{i=1}^{q} h_{DT_i}(x)$$

From that representation, we can calculate the variance of the random forest as:

$$Var(h_{RF}) = Var\left(\frac{1}{q}\sum_{i=1}^{q} h_{DT_i}\right) = \frac{1}{q^2}\, Var\left(\sum_{i=1}^{q} h_{DT_i}\right)$$

We dropped the (x) for brevity and applied the fact that Var(aX) = a²Var(X), where a is a
constant, not a random variable. Proving that fact is left as an exercise, but it can easily be
done using the definition of variance we saw earlier. We now need to figure out how the
variance of a summation can be computed. Unlike expectations, the variance of a summation of
random variables is not simply the sum of their variances; it’s more elaborate. The variance of
a sum of random variables, like our fully grown trees, is given by:

$$Var\left(\sum_{i=1}^{q} h_{DT_i}\right) = \sum_{i=1}^{q} Var(h_{DT_i}) + \sum_{i \neq j} Cov(h_{DT_i}, h_{DT_j})$$

The Cov in the equation above stands for covariance, which measures how much
two random variables vary together. If two random variables are dependent, then they
vary together, and hence their covariance is not zero. The less dependent the variables become,
the smaller their covariance becomes, until it reaches zero when the two variables
are completely independent. So the equation above says that the variance of a summation of
random variables depends on both the variances of the individual variables and the covariance
between each pair of them. By noticing that the variance of each tree hDTi is the same (as they
are all pulled from the same distribution), we can write the variance of the random forest as:

$$Var(h_{RF}) = \frac{Var(h_{DT})}{q} + \frac{1}{q^2}\sum_{i \neq j} Cov(h_{DT_i}, h_{DT_j})$$

Now comes the randomness in random forests. As you recall from our discussion earlier, the
trees in a forest are trained on random sub-samples of the training data sampled with
replacement. Moreover, each tree is trained by considering a different random subset of the
available features at each split. These two training methods are used to minimize the
dependence between the trees in the forest as much as possible. By minimizing the
dependence between the trees, we end up minimizing the second term in the equation above.
From that, we can consider the variance of the random forest to be approximately the variance
of a decision tree divided by the number of trees in the forest, plus some small value, or simply:

$$Var(h_{RF}) \approx \frac{Var(h_{DT})}{q}$$

The symbol ≈ is read as “approximately equal”.


We can actually verify that last equation by running estimate_tree_variance and
estimate_forest_variance on our data and seeing that the estimated forest variance is only 1.08
above the variance of the trees divided by 100 (the number of trees in the forest).
np.random.seed(42)

tree_variance = estimate_tree_variance(X_train, y_train, max_features=5)
forest_variance = estimate_forest_variance(X_train, y_train, max_features=5)

diff = forest_variance - (tree_variance / 100)
print("Var(HRF) - (Var(HDT) / 100) = {:.2f}".format(diff))

> Var(HRF) - (Var(HDT) / 100) = 1.08

Through the wisdom of uncorrelated crowds, random forests are able to:

1. Maintain the power of their fully grown trees with a high number of leaves and achieve a
small training error, hence minimizing Remp(hRF).
2. Achieve a low variance, indicating that |HRF| is also low (according to our earlier
observation of the correlation between the two), which minimizes the complexity term
in their generalization bound.

These two effects eventually add up to reduce the risk R(hRF), which makes random forests
better with unseen data samples, as we practically saw earlier.


The Curse of Dimensionality: Decision Trees vs. Random Forests


A quick look at the generalization bound of decision trees reveals that the number of features is also involved in
the complexity term, through the natural logarithm ln n. Such a factor is usually negligible with a small number of
features, like the case we had in this chapter (ln 11 ≈ 2.4). But as the number of features grows, it can no longer be
ignored. To understand how this works, let’s focus on the first term under the square root in the generalization
bound, which is:

$$\frac{(k-1)\ln n}{2m}$$

In the above expression, you can think of the (ln n)/2 factor as a multiplier of the actual number of leaves (k-1) in
the tree, and the result of the multiplication as the effective number of leaves in the tree. For example, if the
actual number of leaves is 5 and ln n = 4, then the effective number of leaves is 10, which means that this tree
behaves as if it has 10 leaves, not 5. Suppose that we have training data of 4000 samples; if we have 11
features and we fit a tree with 900 leaf nodes, then the effective number of leaves is roughly 1080. The ratio of effective
leaves to training samples is low, which is favorable. Now imagine that the same data set has 20000 features.
As ln 20000 ≈ 10, our effective number of leaves will be around 4500. The ratio of effective leaves to training samples is
greater than one in that case, which is bad; it’s as if we’re trying to fit 4000 samples into a tree with 4500
leaves!

This little exercise shows that decision trees suffer from the curse of dimensionality if the number
of features is large and the number of training samples is not proportionate. Random forests, on the other hand,
do not suffer as much from the curse, even if the training sample size is not proportionate to the number of
features. The reason behind that is the fact that a random forest trains its trees on random subsets of the
features, not all of them. In the example with 20000 features, we can train a random forest with
max_features=50, which would make the effective number of leaves in each tree around 1760, and set
n_estimators=4000 to ensure good coverage of all the features across the 4000 trees. In that setting, we’re
still going to get some good results even though the number of training samples is not proportionate to the
number of features.

A small sketch of this effective-leaves arithmetic follows.
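Here is that arithmetic in code; the helper effective_leaves is a hypothetical name that simply evaluates the (k-1)(ln n)/2 expression from the bound.

import numpy as np

def effective_leaves(k, n):
    # hypothetical helper: the (k - 1) leaves scaled by the (ln n)/2 multiplier
    return (k - 1) * np.log(n) / 2

print(effective_leaves(k=900, n=11))      # ~1078: same order as the actual leaf count
print(effective_leaves(k=900, n=20000))   # ~4450: more effective leaves than the 4000 rows
print(effective_leaves(k=900, n=50))      # ~1760: what max_features=50 gives each tree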


Appendix A

In this short appendix, we're going to install the Anaconda distribution (which will serve as the
whole environment we're going to work with throughout the book) and cover the basics of working with
Jupyter notebooks: the interactive interface through which we're going to run our python code.
Following the instructions in this appendix (as well as any additional environment instructions
in the chapters) will guarantee you a smooth workflow without any missing dependencies.
However, this is not the only way to run the code we write in this book! If you know enough
about python, pip, and virtual environments, you'll be able to set up your own custom
environment by installing the required packages one by one when needed. This is of course
more efficient space-wise, but the Anaconda way provides an easy, plug-and-play start,
and that's why we're going with Anaconda in our work.

A.1 Installing the Anaconda Distribution


The Anaconda distribution is considered to be the easiest way one could start doing
machine learning and data analysis with python! With more than 6 million users and a
splendid open-source community behind it, Anaconda comes packaged with hundreds of
popular python packages used for doing data analysis and machine learning in python.
Moreover, it provides the conda package manager for easily installing any other required
package that is not included by default. All this works in a consistent way across the different
platforms: Linux-based environments (like Ubuntu), macOS, and Microsoft Windows. This is
what makes Anaconda the easiest way to develop and run your machine learning and data
science applications, and that's the reason why we're choosing it as our environment throughout
the book.
To install Anaconda, you can simply follow the instructions provided by the official
documentation, depending on your operating system:

• For Windows, visit the following link:


https://docs.anaconda.com/anaconda/install/windows

• For macOS, visit the following link:

https://docs.anaconda.com/anaconda/install/mac-os

• For Linux systems (like Ubuntu), visit the following link:

https://docs.anaconda.com/anaconda/install/linux

A.2 Working with Jupyter Notebooks


Now that we've installed the Anaconda distribution that contains everything we'll need on
our journey through the book, we'll now turn to the interface where everything will happen:
the Jupyter notebook.
Jupyter notebooks are basically web applications that provide an
interactive python environment where we can have code, visualizations, narrative text and
math all in the same place 41. The best way to get a feel for how awesome that is is to get our
hands dirty and see it in action, and to do that we'll need to download the book's code
repository from here. After downloading the repository, we move on to start the Jupyter
notebook application; this can be done in two main ways:

1. On Windows or macOS: Open the Anaconda Navigator, which is a graphical
navigation interface shipped with the Anaconda distribution to navigate through the
modules and programs it provides. You can locate the navigator's executable using
whichever way you usually use on your OS to find applications (like Windows' Start
Menu, or macOS's Launchpad). This is how the navigator should look after you start it.

41
The Jupyter Notebook project supports multiple programming languages other than python, but we're only interested in python here.


Figure A.1

In the navigator, right in the first row, we can see a tile with Jupyter Notebook on it.
We click Launch to start the Jupyter application and wait a little bit until a page in our
default browser opens with Jupyter's file tree at your Home directory. From there we can
navigate to the location where the code repository was downloaded.


Figure A.2

2. On macOS and Linux-based systems (like Ubuntu): The Jupyter notebook can be
started from the terminal. We first navigate to where the code repository was downloaded
and extracted. If you chose to disable the auto-activation of the base environment (you can
tell by the absence of (base) at the beginning of your terminal prompt), you'll need
to activate the Anaconda environment first using the following command 42:

$ conda activate

This command (if you don't already know it) temporarily brings all of Anaconda's
binaries and executables to the front (kind of like prepending to PATH, but only
while it's activated). If you already have (base) at the beginning of your
terminal prompt, then you're all set to run Jupyter notebooks 43. We can now simply
run the jupyter-notebook command to launch the application.

(base)$ jupyter-notebook

42
Provided that you initialized your Anaconda installation as instructed in the installation process
43
If you’d like to deactivate the Anaconda environment, you can simply run conda deactivate


We wait for a few moments and a page in our default web browser will open with
Jupyter's file tree at the directory containing the code repository.

Figure A.3

A.2.1 Exploring a Jupyter Notebook


Now that we have the Jupyter file tree open at the directory containing the code repository,
we can see a file called Notebooks Kickstarter.ipynb. This is nothing but a notebook file; the
extension ipynb stands for IPython Notebook, where IPython is the name of an earlier
project that established the whole interactive python notebook framework. This notebook was
designed to get us quickly familiar with the whole ecosystem of Jupyter notebooks in order to
start using them in our work through the book.
To launch the notebook, we simply click on it to have it opened in a new tab in the
browser. If we switch back to the tab containing the file tree, we'll see that the icon of
our notebook has turned green, a sign that it's now running. Now let's go back and focus on the
notebook itself.


Figure A.4

At the top of the notebook comes the header, which contains the project logo to the left of the
notebook's name (which can be edited by clicking on it). The text to the right of the name tells us
when the notebook was last saved. At the far right comes the python logo, indicating that this
is a python notebook, and next to it a logout button, which closes your session with the Jupyter
application. This logout button is useful when the Jupyter application is running on a remote
server, so we're not going to do much with it here.
Under the header come the menu bar and the tool bar. The menus in the menu bar
contain all the operations that we need to perform while interacting with the notebook, and
the tool bar holds shortcuts for the most important and frequently used operations
among them. We're not going to delve into the details of the menus and the tool bar; once
you understand how a notebook works and how it's organized, a quick glance over these
tools will be sufficient to get the hang of what they do. Instead, we're going to
delve directly into the content of the notebook itself.
The content of a notebook is organized into cells. When the notebook starts, the first cell
gets highlighted with a box around it, as we can see in the previous figure. Whatever content
goes into a notebook must be contained within a cell; if we scroll around the notebook
clicking on each distinct piece of content, we'll see that a highlighting box surrounds it; this box is the
cell containing it. Usually we work with two types of cells: markdown cells,
which can hold rich text, and code cells, which (obviously) hold our python code.

A.2.2 Markdown Cells


The first cell in our kick starter notebook is a markdown cell. Markdown is a markup language
created to write rich text easily with a simple plain-text syntax. The Markdown markup is
designed to be converted and rendered as HTML, so it gives us the ability to write very rich text
that contains font styles, lists, tables, images and hyperlinks with a very easy to write (and also
easy to read) syntax.
We can take a look at the Markdown syntax used to write the content of the first cell by
double clicking on it to view the raw markup.

Figure A.5

We can see, for example, that the heading levels are simply expressed with the # symbol: if
there's only one, then it's a level 1 heading; if there are two of them, then it's a level 2 heading,
and so on. We can also see that making text italic is only a matter of enclosing the text
between two underscores _. The Markdown syntax is very simple to grasp, and you can find in
the notebook a link to a small tutorial to get you started with it.


In addition to regular markdown syntax, Jupyter notebooks allow you to write
mathematical expressions inside markdown cells using the LaTeX syntax, which is a typesetting
system specifically designed for technical and scientific writing.
To get back to the rendered view of the markdown markup, we can click the play button in
the toolbar or simply press Shift+Enter. Both perform a Run cell operation, which
renders markdown cells to their natural view, and runs code cells, as we're going to see next.

A.2.3 Code Cells


The second type of cell we use in notebooks is the code cell. As the name reveals, these are
the cells where your python code goes. They work very simply: you write your code in them,
it gets executed, and any output to stdout or stderr is shown under the cell! You can directly
see that by running the code in the first two code cells in our kick starter notebook
(remember, by pressing Shift+Enter).

Figure A.6

Whatever code you write will be accessible from later cells, as the whole execution happens
in the same process and the same memory space; this can be seen in the third
cell of the notebook. Moreover, code cells provide auto-completion, which can be
activated by pressing Tab while the code is being written, and it works both for
python's and packages' names and for the code you wrote yourself. This feature allows for
more rapid coding and prototyping within the interactive framework.
Code is not the only use of code cells though; they can also run magics! Magics, or
magic commands, are a set of commands specified by Jupyter (to be more accurate, they
were defined in its predecessor, IPython). These commands can be considered aliases for other
tasks and configurations that would usually take several steps to complete; magics
allow you to do them on the fly!
There are two types of magics provided by the notebook: cell-oriented magics, which
begin with %%, and line-oriented ones, which use a single %. Line-oriented magics apply only
to the line they are on, while cell-oriented ones apply to the whole cell. For example, the
%%bash magic used in the kick starter notebook applies to the whole cell, making all the content of
the cell be treated as a bash script. On the other hand, the %timeit magic, which measures the
execution time of code via python's built-in timeit module, is a line-oriented magic,
meaning that only the code on the same line will have its time measured 44.
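For instance, in a notebook code cell (this only works inside Jupyter/IPython, not in a plain python script), a line magic could look like this:

# a single % marks a line magic: only this one statement gets timed;
# starting a cell with %%timeit instead would time the whole cell
%timeit sum(range(1000))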
There are a bunch more magics that allow us to do some really cool stuff on the fly, and
you'll find a reference linked in the kick starter notebook listing all the magics that IPython
(and hence Jupyter) provides. One of these magics, which we're going to see in almost all
(if not all) the notebooks in the code repository, is the %matplotlib inline magic. To
understand what this magic does, we first need to know that matplotlib is the plotting library
in python: any visualizations or plots we're going to see throughout the book will definitely be
made with matplotlib, either directly or indirectly (another library calling matplotlib
internally). Within a notebook, if a code cell invokes matplotlib to create a plot or a
visualization, these plots are not shown by default. Allowing these plots to show is the job
of %matplotlib inline, as it simply configures the notebook to show the plots inline within it!

44
There exists a cell-oriented version of it as well, %%timeit


Figure A.7

This union of rich text, code, and visualization, infused with the interactive nature, makes
Jupyter notebooks a great tool for cleanly writing down and communicating reproducible code,
and that's the main reason we chose Jupyter notebooks to be our working framework
throughout the book.

A.2.4 How does all this work?


At this point you might be curious about how all this works. If you're not, you should know
that knowing how the Jupyter framework works at a high level makes it a lot easier to work
with, so this might encourage you to read this small section, in which we take a high-level view
of how things get done within the Jupyter framework.


Figure A.8 Jupyter's high-level architecture (taken from Jupyter's documentation)

The figure above shows the high-level architecture of the Jupyter framework. We've already
seen the browser part, which is the part that we have been interacting with all along. But the
browser part is nothing but a front-end; behind it lie the two components where all the
magic is cooked:

• The first is the notebook server, which is a web server responsible for taking the cells
from the front-end and returning their output back to it. Also, when we save the
notebook, the server receives the whole notebook from the front-end (which is stored
in just a JSON object), saves it to disk, and updates the notebook file with the new
contents. When the notebook is launched again, the server reads it from disk and
sends its JSON content to the front-end to display.
• The other component is the kernel, which is the process running the python
interpreter used to make everything work! The server acts as a mediator between the
front-end and the kernel: it takes the code from the front-end and sends it to the
kernel for execution, then takes the result from the kernel and passes it back to the
front-end to display.

Now that you know the high-level architecture of the whole application, you can skim
through the menus and tools in the notebook and easily understand what is going on and what
does what. You can use the User Interface Tour under the Help menu to aid you in this if you
want.
Now you're up to speed and ready to continue where you left off and delve right into the
meaty parts of the book!
