Preparing Data For Machine Learning - Pluralsight PDF
Course Overview
Hi, my name is Janani Ravi and welcome to this course on Preparing Data for Machine
Learning. A little about myself: I have a Master's degree in Electrical Engineering from
Stanford and have worked at companies such as Microsoft, Google, and Flipkart. At
Google, I was one of the first engineers working on real-time collaborative editing in
Google Docs and I hold four patents for its underlying technologies. I currently work
on my own startup, Loonycorn, a studio for high-quality video content. As machine
how to prepare the data going into the model in a manner appropriate to the problem
we're trying to solve. In this course, you will gain the ability to explore, clean and
structure your data in ways that get the best out of your machine learning model. First,
you will learn why data cleaning and data preparation are so important and how missing
data, outliers, and other data-related problems can be solved. Next, you will discover
how models that read too much into data suffer from a problem called overfitting, in
which models perform well under test conditions but struggle in live deployments. You
will also understand how models that are trained with insufficient or unrepresentative
data suffer from a different set of problems and how these problems can be mitigated.
https://app.pluralsight.com/library/courses/preparing-data-machine-learning/transcript 1/74
05.04.2020 Preparing Data for Machine Learning | Pluralsight
Finally, you will round out your knowledge by applying different methods for feature
selection, dealing with missing data using imputation and building your models using the
most relevant features. When you're finished with this course, you will have the skills and
knowledge to identify the right data procedures for data cleaning and data
data to build and train machine learning models. As you make the transition from a
student of machine learning to applying machine learning models in practice, you'll need
to work with real-world data. Real-world data has many problems associated with it.
We'll discuss those problems and see the need for data preparation in machine
learning. It's often the case that you want to build a model for prediction, but you just
don't have enough data to train your model. Now this is a hard problem to solve.
Often, it involves just finding new data sources, but we'll discuss a few other mitigating
techniques as well. We'll then move on to discussing in detail the problem of too much
data when your data is excessive or overly complex. This is often referred to as the
curse of dimensionality and involves the use of feature selection and engineering
techniques. We'll then discuss non-representative data, missing data and outliers, and
techniques to work with each of these. The data that you're working with might be
biased, not representative of the real world. We'll see how you can use oversampling
and undersampling to get unbiased data. We'll also discuss what it means to overfit and
this course are written in Python so you need to be comfortable with basic Python
programming. We'll be running Python 3 using Jupyter notebooks. This course assumes
that you have a basic understanding of machine learning. You know the machine
learning workflow, you know the steps involved. And this course also assumes that
you've built and trained simple machine learning models typically using the scikit-learn
library. If you feel that you lack some of the prereqs, here are some other courses on
Pluralsight that you can watch first. If you want to get up to speed with Python, Python
Fundamentals is a great course. If you want to get started with machine learning,
Understanding Machine Learning and Building your First Scikit-learn Solution are both
courses you should watch before this one. Here is a broad outline of all of the topics
that we'll cover in this course. We'll start off by understanding the need for data
preparation and problems that exist in data in the real world. We'll then move on to
hands-on demos after we've understood the basic concepts. We'll perform data
cleaning and transformations on a real-world dataset. We'll then work with continuous
and categorical data, that is, numeric data and discrete data, and we'll study and apply
some of the data transformation techniques that you'll use with such data before you
feed this data into machine learning models. We'll then move on to see how you can
extract significant and relevant features from your data to train your ML models. We'll
understand the different feature selection techniques that you could use and we'll then
time and it's extremely important. In this clip, we'll talk about the importance of feeding
in good data to your ML models. So what is machine learning? These are algorithms
which have the ability to work with a huge maze of data and find patterns that exist in
that data. Machine learning models also formulate rules based on these patterns and
then use these rules in order to make intelligent decisions on instances that they haven't
seen before. Machine learning models can be divided into four broad categories or
types based on what they're used for. The list that we are going to discuss here is not
an exhaustive list. Machine learning is a field that is constantly evolving with new
techniques. The first is classification. Is an email spam or ham? Are whales fish or
Regression analysis involves the prediction of continuous values, such as what is the
price of this course, what is the price of a home given its location and other
characteristics. Clustering techniques involve finding logical groupings that exist in your
data, groupings that can then be used to train other models. And finally, you might
have a lot of data which you can use to train your machine learning models, however,
not all of that data might be relevant. Dimensionality reduction is what we'll use to
extract only the most significant features to train your model. Let's talk about machine
learning in terms of classification because this is intuitive to understand. Let's say you're
trying to build a classification model to determine whether whales are fish or mammals.
Now whales could be mammals because they are members of the infraorder Cetacea.
On the other hand, whales live with fish in water, swim like fish, and move with fish. They
could reasonably be fish as well. You'll then try and build an ML-based classifier that is
trained on a huge corpus of data where it'll learn to find patterns that are relevant to
fish and mammals. You'll then feed in this ML-based classifier some information about a
whale. You'll say the whale breathes like a mammal and gives birth like a mammal. This
input to your machine learning model is the feature vector and this feature vector is
what your classification model will use to make predictions, and the output of the
classifier is the label. Now let's say you were to feed in something different to this
machine learning model. You were to tell it that the whale moves like a fish and looks like a fish.
Now clearly this information is the wrong information for your model to make a
classification. It's likely to say that the whale is a fish. This input feature vector that
you've fed in is not the right set of features for a whale to make the right classification.
The input features that you've fed in about the whale misled your machine learning
model such that its prediction was fish. Here, the predicted label is not equal to the
actual label. We know that whales are mammals. What does this little example tell us?
Garbage in, garbage out. If the data fed into a machine learning model is of a poor
quality, the model itself will be of poor quality and this is data that you'll use to train
your model and data that you'll use for prediction as well. There is absolutely no way to
build a good machine learning model no matter what algorithm you'll use if your data is
not set up correctly. Now there are a huge number of problems that you'll encounter
when you work with data in the real world. You may just not have enough data to work
with, that is, insufficient data, or you may have too much data, not all of which is relevant. You
may be training your model on non-representative data and this is actually what we did
in our example where we fed in the wrong features of the whale to our machine
learning model. Also, data in the real world is raw data, it may not always be clean. You
might find that important fields are missing, important records are missing. You might
have duplicates in your data that you have to deal with. And finally, it's possible that
your data is riddled with errors, they may be in the form of outliers or missing data. So
clearly, data is important and there are problems with data. How do we tackle these?
Insufficient Data
We know that we have problems with data in the real world. Let's explore a few
techniques to tackle the problem of insufficient data. Models that are trained with
insufficient data (where you don't have many samples from which your model can learn)
perform poorly in prediction. If you have just a few records for your ML model, there
are two ways this could play out. It could lead to overfitting, where your model reads
too much into the little data that you have and simply memorizes the patterns that
exist in the data. It's also quite possible, and this might be
surprising, that with little data your model might underfit on the data. This is where your
model can end up being overly simplistic, which means it hasn't really understood the
patterns that exist in your data. Once you go from being a student of machine learning
to working in the real world, you'll find that this problem of insufficient data is a
common struggle across projects in the real world. You might find that relevant data is
just not available, or even if it is, the actual process of collecting the data is time
consuming and extremely difficult. And really, if you're struggling with the problem of
insufficient data to train your models, there is no magic bullet, there is no great solution
to deal with insufficient data, you simply need to find more data sources, wait for
longer until you have the relevant data collected. Now there are some things that you
can do to work around this problem of insufficient data, but the techniques that we'll
discuss in just a bit are not widely applicable across all use cases. So what do you do if
you don't have enough data, if you're dealing with a small dataset? Well, you could
choose to build a simpler model. Simpler models work better with less data. If you're
working with neural networks or deep learning techniques, you can use transfer learning
where you use a prebuilt model, which is then tweaked on the small dataset that you
have. You could try and increase the amount of data that you're working with using
data augmentation techniques. This is used with image data fairly often. You simply
tweak the existing images to get new images. And one last option could be, you
understand the kind of data that you need to build your model and you use the
statistical properties of that data to generate synthetic artificial data. Every machine
learning algorithm has its own set of parameters, for example, simple linear regression
versus decision tree regressors. Understand what kind of model you're working with
and how much data you have. If you have less data, choose a simpler model with
fewer model parameters. Simpler models are less susceptible to overfitting
on your data and memorizing patterns in your data. Overfitted models are those that
perform well on training data, but poorly in the real world because they haven't
learned from the data, they've simply memorized patterns. Some of the machine
learning models that are simpler with fewer parameters are Naive Bayes for
classification or logistic regression models. Decision trees have many more parameters
and are generally considered to be complex models. Another option is to train on your
small dataset using ensemble techniques. Ensemble techniques don't rely on a single
machine learning model, they train many individual learners under the hood and the
final prediction of the model is the combined aggregated predictions of the individual
some data, you don't have very much data to begin with, you might train a number of
different machine learning models on the same data: a logistic regression model, a
random forest model, a Naive Bayes classifier. You'll then combine the results of all of these
individual models to get your final result from the ensemble. If you're working with
neural networks and you don't have very much data to train your neural network
model, transfer learning is an option. Transfer learning, however, can only be applied to
those use cases which are very common. Transfer learning involves reusing a trained
neural network that solves a problem that is similar to yours. If you're performing image
classification, you need a model that does image classification. You'll reuse the
architecture, as well as the model parameters and simply perform a little bit of training
so that your model has been fit on your data. Here is a very simple description of how
transfer learning works. You have a pretrained model which has some knowledge or
information, it has been trained on some data set called dataset 1. That's not your data.
The model is part of the machine learning community. It's available to you for use. You
take the knowledge that is part of this model and then train this model on your dataset,
that is dataset 2. Your dataset may not be huge, but the model contains within it
knowledge that it has gleaned from the original dataset, which is typically very, very
large. Let's move onto the next technique that you could apply if you have insufficient
data, data augmentation where you take pre-existing samples and change them in
some way to create new samples. Data augmentation techniques allow you to increase
the number of training samples and it's typically used with image data. You take all of
the images that you have to work with and perturb and disturb those images in some
way to generate new images. You can perturb those images applying scaling, rotation,
affine transformations, you name it, and these image processing options are often used
as preprocessing techniques to make your image classification models built using CNNs,
or convolutional neural networks, more robust. They can also be used to generate
additional samples for you to work with. And let's move onto the last option here,
generating synthetic artificial data. Now synthetic data comes with its own set of
problems. Basically, you'll artificially generate samples which mimic the real world, so
you need to understand the characteristics of the data that you'll need to build good
models. You can oversample existing data points to get new data points, that's always
an option, or you can use other techniques to generate artificial data, but there are
major pitfalls involved. You might introduce bias that does not exist in the real world
into your dataset, then your model will not be a good one.
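As a rough illustration of the data augmentation idea from this clip, here is a minimal sketch using NumPy, with a tiny made-up array standing in for an image. The function name augment and the particular flips and rotations are assumptions chosen for illustration, not the course's own code. A real pipeline would typically use an image library, and whether a flipped or rotated image still deserves the same label depends on your problem.

```python
import numpy as np


def augment(image):
    """Return simple perturbed copies of one image (a 2-D array of pixels)."""
    return [
        np.fliplr(image),      # mirror left-to-right
        np.flipud(image),      # mirror top-to-bottom
        np.rot90(image, k=1),  # rotate by 90 degrees
        np.rot90(image, k=2),  # rotate by 180 degrees
    ]


# A stand-in "image": a 4x4 grid of pixel intensities (made up for illustration).
image = np.arange(16).reshape(4, 4)

augmented = augment(image)
dataset = [image] + augmented  # one original sample becomes five training samples

print(len(dataset))
```

Each perturbed copy keeps the original shape, so it can be fed to the same model as the original sample.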
data. It might seem strange that too much data is a problem, but what's the use of
data if it's not the right data. Now data that you work with might be excessive in two
ways. The first is the curse of dimensionality. You have samples which you'll use to train
your machine learning model and every sample might have too many columns, too
many features. In the simplest form, when you're dealing with the curse of
dimensionality, you might end up using irrelevant features which don't really help your
model improve. There is another way that you could end up with too much data:
organizations know that data is important, so they keep all of the data around even
after it has lost significance. You might be dealing with outdated historical data and
you have too many samples or rows, many of which might be irrelevant. Working with
historical data is kind of a double-edged sword. Historical data is important, but how
important? If you have too much historical data which is not really significant, you might
encounter something called concept drift. This is where the relationship between the
features and the labels, that is, the Y variables, changes over time. Machine learning
models fail to keep up because they're dealing with too much historical data, and
consequently, their performance suffers. Concept drift essentially means that your
machine learning model continues to look at a state of the world that is outdated,
that is no longer significant or relevant. Using this outdated view of the world, it's
forced to make predictions on data in the new world. So if you're working with
outdated historical data, remember to use it with care. If not eliminated, it might lead to
concept drift. Outdated historical data is a serious issue, specifically when you're
working on machine learning models that work with financial data, especially if you're
modeling the stock market. Historical data is important, but you need to use some kind
of judgment to figure out what rows are actually significant. It requires a human expert
to judge which records need to be left out. The curse of dimensionality is a huge topic
which has been studied in detail by data scientists. There are two specific problems that
arise when too much data is available. The first is you have to decide in some way
which data is actually relevant. This might involve feature selection using statistical
outdated historical data and concept drift is fairly hard. The curse of dimensionality is an
easier problem to solve. You might perform feature selection using statistical techniques
to figure out which part of your data is relevant. You can use feature engineering to
aggregate your low-level raw and granular features into useful features that are less
granular, but contain more information. You can also combine features together to
improve their predictive power or you could perform dimensionality reduction. This is
where you reduce the complexity in your data while losing as little information as possible. One way to
do this is to reorient your data along new axes to capture the maximum variance that
exists in your data. And when you perform this kind of feature engineering to reduce
the complexity in your data, you might come across the term concept hierarchy. This is
a mapping that combines very low-level features, such as lat/long information into
more general usable features such as zip codes. Concept hierarchy also involves
bucketing your data to get information in a less granular format.
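To make the idea of reorienting data along new axes a little more concrete, here is a minimal sketch of principal component analysis written directly with NumPy's SVD. The two features are randomly generated and strongly correlated by construction, which is an assumption made purely for illustration. In practice you would more likely reach for a library implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated, made-up features: the second is roughly twice the first.
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

# Center the data, then use SVD to find the new axes (the principal components).
X_centered = X - X.mean(axis=0)
_, singular_values, components = np.linalg.svd(X_centered, full_matrices=False)

# Share of the total variance captured along each new axis.
explained = singular_values**2 / np.sum(singular_values**2)

# Keep only the first axis: two columns are reduced to one.
X_reduced = X_centered @ components[0]

print(explained)        # the first entry should be close to 1
print(X_reduced.shape)
```

Because the two columns carry almost the same information, a single new axis captures nearly all of the variance, and the second axis can be dropped with very little loss.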
the wrong features into your model, but there are other manifestations as well. It's
possible that the data that you've collected has errors. It's inaccurate in some ways and
the errors are such that they can have a significant impact on your model. This is why
cleaning and processing time to get your data into good shape. Now it's also possible
that your data is non-representative because it's biased. Let's say you're collecting data
from five sensors located in five different countries and there is one particular sensor
that doesn't work all of the time. Your data is biased because you don't have
proportional data from one of the sensors. When you're working with biased data, that
leads to biased machine learning models and these models perform poorly in practice,
they don't have the full picture in mind. You can mitigate this using oversampling and
undersampling. So if you have less data from one of the sensors, you could oversample
the data that you do have so you have a representative sample. Oversampling or
undersampling can result in its own biases, so this is something that you need to be
careful about. Other problems that you might encounter are missing data and the
presence of outliers, which could be errors or genuine points. Before data is anywhere
close to being ready to feed into your machine learning model, there is often an
extensive data cleaning phase. Data cleaning procedures can help significantly mitigate
the effect of both missing data, as well as outliers. We'll discuss specific techniques that
you can use to cope with missing data and outliers in a later clip in this module. Let's
move on and talk about duplicate data. If you're collecting data, there might be
duplicates that are present and duplicates themselves can introduce bias or skewness in
your data. Now if the data can be flagged as a duplicate, the problem is very easy to
solve. You'll simply apply deduplication on your data before you feed it into a model,
but the world isn't that simple. Duplicates can be hard to identify in certain applications,
specifically, real-time streaming applications. You just have to live with it and account for
it.
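Here is a small sketch of two of the fixes mentioned in this clip, deduplication and oversampling, using pandas. The sensor names and readings are made up, and resampling each group up to the size of the largest group is just one simple, assumed way to oversample.

```python
import pandas as pd

# Made-up sensor readings: sensor "E" reports far less often, and one row is repeated.
readings = pd.DataFrame({
    "sensor": ["A", "A", "A", "B", "B", "B", "E", "A"],
    "value":  [1.0, 1.2, 0.9, 2.1, 2.0, 2.2, 5.0, 1.0],
})

# Deduplication: drop rows that are exact copies of an earlier row.
deduped = readings.drop_duplicates().reset_index(drop=True)

# Oversampling: resample each sensor's rows (with replacement) up to the
# size of the largest group, so every sensor is equally represented.
target = deduped["sensor"].value_counts().max()
balanced = pd.concat(
    [
        group.sample(n=target, replace=True, random_state=0)
        for _, group in deduped.groupby("sensor")
    ],
    ignore_index=True,
)

print(len(readings), len(deduped))
print(balanced["sensor"].value_counts().to_dict())
```

Note that the oversampled rows for sensor "E" are all copies of its single reading, which is exactly the kind of new bias that the clip warns oversampling can introduce.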
problems. Let's dive into a little more detail on some of these techniques starting with
missing values and outliers. When you're collecting and working with data, you might
find that you have missing data in the form of missing values for fields or you might find
that your data contains outliers which don't really make much sense. Now when you're
working with data that is missing from the records that you have to train your model,
there are two approaches that you could follow in order to deal with this data. The first
of these is deletion where you get rid of data which has missing fields, the other is
imputation where you fill in missing values using sound techniques. Let's talk about
deletion first. Deletion is also referred to as listwise deletion. This is where you delete an
entire record that corresponds to a row in your dataset if you have a missing value in a
feature that is a column of your dataset. This is a simple, hassle-free technique to get rid
of missing values, but it can lead to bias because you're getting rid of entire records,
even if an irrelevant field has a missing value. Listwise deletion is often the most
common method in practice because it's easy, but this can lead to several problems. It
can greatly reduce the sample size that you're working with. If you don't have very
many records to begin with, if you get rid of entire records due to a few missing fields,
you might get into a situation where you have insufficient data to train your machine
learning model. Now there are other nuances that you have to worry about as well. If
the field values are not missing at random in your dataset, say there is one sensor which
never has a value for a particular field, and you go ahead and drop all records from that
sensor, that can lead to a significant bias. So it's pretty clear that simply dropping entire
records which have a few fields missing is not a great option, which is why we move
on to imputation, where you fill in missing column values rather than deleting records.
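To see how quickly listwise deletion shrinks a dataset, here is a minimal pandas sketch. The column names and values are made up for illustration, and the "notes" column stands in for a field the model may not even need.

```python
import numpy as np
import pandas as pd

# Made-up records; NaN marks a missing field.
df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, 38],
    "income": [50_000, 62_000, np.nan, 58_000, 61_000],
    "notes":  ["ok", "ok", "ok", np.nan, "ok"],  # possibly irrelevant to the model
})

# Listwise deletion: drop every row that has at least one missing value.
complete_rows = df.dropna()

# A gentler variant: only require the columns the model actually uses.
needed_rows = df.dropna(subset=["age", "income"])

print(len(df), len(complete_rows), len(needed_rows))
```

Of the five records, only two survive full listwise deletion, while three survive when the irrelevant field is ignored.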
Missing values can be inferred from the data that is already available from known data.
Once you've decided that you want to use imputation to fill in missing values, there are
a number of methods that you can choose from, they range from the very simple to
very complex. The simplest possible method is to use the column average. You assume
that the missing value is essentially equal to the mean value in that column or for that
feature. Other very similar options are to use the median value of that column or the
mode for that column. Another way to impute missing values is to interpolate from
other nearby values. This technique is useful when your records are arranged in some
kind of inherent order. Imputation to fill in missing values can be arbitrarily complex, in
fact, you can build an entire machine learning model to predict missing values in your
data. Now you might want to perform imputation in a variety of ways. Univariate
imputation relies only on known values in the same feature or the same column.
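The simple univariate options described above can be sketched in a few lines of pandas. The series below is made up, and each fill uses only the known values in that one column.

```python
import numpy as np
import pandas as pd

# A made-up column with two missing readings.
s = pd.Series([10.0, np.nan, 14.0, 20.0, np.nan, 12.0])

mean_filled = s.fillna(s.mean())      # fill with the column average
median_filled = s.fillna(s.median())  # fill with the column median
interpolated = s.interpolate()        # linear interpolation between neighbours

print(mean_filled.tolist())
print(median_filled.tolist())
print(interpolated.tolist())
```

Interpolation only makes sense when the order of the records carries meaning, whereas the mean and median fills ignore row order entirely.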
Multivariate imputation, on the other hand, uses all known data that you have to infer
missing values in your data. For example, you might want to construct regression
models from other columns in your data to predict missing values in a particular column.
This is an example of multivariate imputation. You'll iteratively repeat this for all columns
that contain missing values. In the next module, you'll see a hands-on example of
apply to fill in missing values as well, such as hot-deck imputation. You'll sort all of the
records that you have based on a criterion that is important, and for each missing value,
you can use the immediately prior available value. This is referred to as last observation
carried forward. Fill in missing values using the previous value once your records have
been ordered. This is especially useful for time series data where progression in time has
meaning. When you're working with time series data, this last observation carried
forward is equivalent to assuming that there has been no change in this value since the
last period of observation. A common technique, and an example
of univariate imputation, is to substitute for each missing value the mean of all available
values. Mean substitution has the effect of weakening correlations between columns that
exist in your data. When you essentially say this is an average data point, there is
nothing special about it, you weaken correlations and this can be problematic when
you're performing bivariate analysis, analysis to determine the relationship between two
variables. If you want to be able to intelligently predict missing values in your data, you
might want to use machine learning: fit a model to predict missing column values based on
other column values. Applying this technique tends to strengthen correlations which
exist in your data because you'll essentially say that this column is dependent on the
other columns. Thus, regression and mean substitution to fill in missing values have
complementary strengths. You need to be aware of the nuances of applying these
your data set, a data point that differs significantly from other data points in the same
data set. It might be that the entire record is an outlier in some manner or there are
certain fields with outlier values. When dealing with outlier data, it's a two-step process.
The first step is to identify outliers that exist in your data, the second step is to use
techniques to cope with these outliers. Just like there are machine learning algorithms,
there are specific algorithms that have been built for outlier detection, but at the very
basic level, you can identify outliers by seeing the distance of that data point from the
mean of your data or the distance from a line that you fit on your data. Once you've
identified outliers, you can cope with outliers using three broad categories of
techniques. You can drop records with outlier data, you can cap and floor outliers, or
set outliers to the mean value of that feature. Let's start our discussion off by seeing
how we can identify outliers using the two techniques that we discussed. The mean or
the average of any feature in your data is basically a measure of central tendency. That
is the point around which the remaining points are clustered. If you have a data point
with a value far from the mean, that can be considered to be an outlier, or you can
perform some kind of regression analysis and find a line or a curve that follows the
patterns in your data, and if you have a point that is far from this fitted line, that is an
outlier. When you want to quickly summarize any set of data that you're working with,
the first measure that you'll indicate is the mean of that data. The mean value of any
data is the headline, it's the one number that represents all of the data points best. The
mean or the average of any set of data points is essentially the sum of all of the
numbers divided by the count of the numbers. Hopefully, you remember this from high
school. However, along with the mean, the variation that exists in your data is also
important. The variation is a measure of how much the numbers in your dataset jump
around. One important measure of the variation in your data is the range, which is
simply the maximum value minus the minimum value. However, the range completely
ignores the mean and is very swayed by outliers that are present in your data, which is
why often another measure of variation is used that is the variance. The variance of
your data is the second most important number to summarize any set of points that
you're working with. The formula for variance is shown on screen in the video: it is the average of the squared distances of your data points from the mean. You
don't really have to worry about the formula though. The variance is a measure of how
your data jumps around or varies. You need to have an intuitive understanding of what
mean and variance represent. They succinctly summarize a set of numbers, any set of
numbers. Now along with variance, another term that you might encounter is the
standard deviation. The standard deviation is nothing but the square root of variance
and is a measure of variation in your data. The standard deviation helps you express
how far a particular data point lies from the mean. Points that lie more than three
standard deviations from the mean are often considered outliers. The standard
deviation threshold for outliers is often based on your use case, so you'll have data that
is one standard deviation, two standard deviations, three, four, five standard deviations
from the mean, and you can determine the threshold for outliers. Another way to
identify outliers in your data is to measure their distance from a fitted line. So you have
a huge number of points and I'm going to represent this in two dimensions because
that is by far the easiest. You'll try and fit a line using some kind of regression analysis
on this data. So outliers are essentially data points that do not fit into the same
relationship as the rest of the data. And now based on these two principles, there are a
variety of algorithms that you can use to identify outliers, but once you identify these,
you need to figure out how you want to deal with them, how you want to cope with
them. Now here, there isn't really one technique which is the one that you should use.
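Before moving on to coping strategies, here is a minimal sketch of the second identification approach described above, distance from a fitted line, using NumPy. The data is made up so that one point clearly breaks the pattern, and the cut-off of three standard deviations of the residuals is an assumed threshold that you would tune for your use case.

```python
import numpy as np

# Made-up data that follows y = 2x + 1, except for one point that breaks the pattern.
x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[10] += 30.0

# Fit a straight line with least squares, then measure each point's distance from it.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Flag points whose residual is more than 3 standard deviations of the residuals.
threshold = 3.0  # an assumed cut-off; tune it for your use case
outlier_idx = np.flatnonzero(np.abs(residuals) > threshold * residuals.std())

print(outlier_idx)
```

Note that the outlier itself pulls the fitted line towards it, so with very extreme or very numerous outliers a more robust fitting method may be needed.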
You need to scrutinize and understand the outliers that exist in your data. Are these
outliers because of incorrect observations or errors that exist in your data? You might
want to get rid of that record entirely if all of the attributes of that record are
erroneous. Or, if you feel that just one attribute of a row or record has
been erroneously recorded, you might want to set that outlier value to the mean
and not drop the record as a whole. Now it's quite possible that your outlier data is not
an incorrect observation, it's a genuine, legitimate outlier. If your model is not distorted
by the presence of such outliers, leave them as-is. The outlier actually conveys
important information that your model might need to recognize. If your model is
distorted and you don't want to leave the outlier as-is, you can cap or floor it. This might
require that you standardize your data first, that is express all of your data points in
terms of standard deviations from the mean. Standardization involves subtracting the
mean from all of your values and dividing by the standard deviation, so that your
resulting scaled data has a mean of 0 and all values are expressed in terms of standard
deviations. Once you've done this, you can cap positive outliers at +3, that is just three
standard deviations above the mean, and you can floor negative outliers at -3, which is
three standard deviations below the mean.
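Here is a minimal sketch of standardizing and then capping and flooring at three standard deviations, using NumPy. The sample values are made up for illustration, and the use of np.clip is just one way to do the capping.

```python
import numpy as np

# Hypothetical data: 50 ordinary values between 40 and 60, plus one extreme value.
values = np.append(np.linspace(40.0, 60.0, 50), 500.0)

# Standardize: subtract the mean, then divide by the standard deviation.
z_scores = (values - values.mean()) / values.std()

# Cap positive outliers at +3 and floor negative outliers at -3.
z_capped = np.clip(z_scores, -3.0, 3.0)

print(z_scores.max(), z_capped.max())
```

If you need the capped values back in the original units, you can multiply by the standard deviation and add the mean again.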
statistical analysis, what you're working with is just a sample of the data. What you're
trying to infer is characteristics of the population that is all the data out there in the
universe from this sample. In order to correctly understand the population
characteristics, the sample of data that you're working with, which is a subset of the
population, needs to be a representative subset. When you apply your statistical and
machine learning techniques on a representative sample, you'll find that your model
performs well. Now it's possible that when you have drawn the sample of data from
your population, it is a biased sample. When you train your model on a biased sample,
the model itself will be biased. It's pretty simple in theory to say that you should work
with unbiased samples, but in practice, this is really, really hard. Let's talk about when
unbiased samples are hard to obtain. Let's say you have a study that you're
performing that seeks to measure the health effect of a certain chemical. Now let's
assume that this chemical is extremely rare and humans or animals exposed to this
chemical are even rarer, so exposure to the chemical is random and extremely rare. In
order to perform a meaningful test, you'll need an unbiased sample of people exposed
to the chemical and people who haven't and this sample would need to be huge. And
this large unbiased sample that you need to build a meaningful model may just not be
available in practice. So is it possible that we focus on the few exposed instances and
build a meaningful model from them? Let's consider another example. You're building
an image classification algorithm and you want this classification algorithm to be able to
identify photos of the Hawaiian Crow. Now this is one of the rarest birds on earth and
it looks a lot like the common crow, which as you might imagine is really common. You
can't do anything about this, so you build your image classifier with the training corpus
that has millions of images containing the common crow and only a few dozen images
which contain the Hawaiian crow. These images of the Hawaiian crow are just not
available in the real world. Is it possible to reuse images of the Hawaiian crow to get a
more balanced sample? In order to work around the situations that we just described,
you could use oversampling and undersampling. These are techniques that intentionally
add bias to the data so your data is more balanced and your machine learning models
are more robust. For the rare cases in the real world, your machine learning model
needs more data that it can work with to find patterns. Now there are two ways or
two approaches to balancing datasets. You'll oversample uncommon x or y values, that
is oversampling, or you'll undersample common x or y values. In both cases, you're
trying to make the proportion of samples for each category more or less even. In order
to build a meaningful model, you might be forced to balance your datasets, but this has
an impact on your model and you should understand what it is. Oversampling and
undersampling tend to reduce the overall accuracy of your classifier. However, both of
these techniques can improve other metrics of your classifier on the rare cases, such as
precision and recall.
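To make the trade-off concrete, here is a small sketch in plain Python that compares two made-up classifiers on a skewed dataset. The counts (10 rare photos out of 1,000, and the predictions of each classifier) are invented for this example.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, precision, recall) for the rare class, labeled 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # rare, and predicted rare
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # common, but predicted rare
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # rare, but missed
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# 10 rare instances followed by 990 common ones.
y_true = [1] * 10 + [0] * 990

# Classifier A always predicts the common class.
always_common = [0] * 1000

# Classifier B finds 8 of the 10 rare instances but wrongly flags 40 common ones.
rare_aware = [1] * 8 + [0] * 2 + [1] * 40 + [0] * 950

result_a = evaluate(y_true, always_common)
result_b = evaluate(y_true, rare_aware)
print(result_a, result_b)
```

Classifier A has the higher accuracy but never finds a rare instance, while classifier B gives up some accuracy in exchange for much better recall on the rare class.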
Precision and recall are measures used to evaluate how well your classifier deals with
skewed data or works with rare instances. If you can't balance your datasets using
oversampling and undersampling, you can use other related techniques to explore rare
cases, such as case studies where you'll study a rare instance in detail using human
judgment. Another alternative is to use stratified sampling for your data. This is where
you'll divide the data that you have based on some criteria or category and make sure
you have the right number of samples from each category using your judgment.
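The two balancing approaches described above can be sketched with NumPy as follows. This is a minimal illustration using random resampling, assuming a made-up dataset with 95 common and 5 rare labels; it is not the only way to balance a dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed dataset: 95 common instances (label 0), 5 rare ones (label 1).
labels = np.array([0] * 95 + [1] * 5)
features = rng.normal(size=(100, 2))

common_idx = np.flatnonzero(labels == 0)
rare_idx = np.flatnonzero(labels == 1)

# Oversampling: reuse rare rows (sampling with replacement) until the classes match.
extra_rare = rng.choice(rare_idx, size=len(common_idx) - len(rare_idx), replace=True)
over_idx = np.concatenate([common_idx, rare_idx, extra_rare])

# Undersampling: keep only as many common rows as there are rare rows.
kept_common = rng.choice(common_idx, size=len(rare_idx), replace=False)
under_idx = np.concatenate([kept_common, rare_idx])

over_features, over_labels = features[over_idx], labels[over_idx]
under_features, under_labels = features[under_idx], labels[under_idx]
print(np.bincount(over_labels), np.bincount(under_labels))
```

Oversampling keeps all of your data but repeats the rare rows, while undersampling throws away most of the common rows, so which one you use depends on how much data you can afford to lose.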
overfitted or underfitted on your data. Let's understand what both of these mean.
Let's take a very simple example that we visualize here. Here are dots present on a
two-dimensional plane. Here is a challenge for you. Find the best curve that passes
through these points and imagine that this is the curve that represents your machine
learning model. Now let's take an example here. Is this a good fit? A curve can be
considered to be a good fit on your points if the distances of the points from your
curve are small. This seems like a good logical way to determine what a good fit is, but
things can be taken a little far. We could draw a pretty complex curve that passes through
these points. You can even have your curve pass through every single point in your
data. Now let's say your machine learning model under the hood draws a curve over your
data points that looks like this. Would you call that a good machine learning model? Well, not
really because given a new set of data points, this curve might perform quite poorly
when used for prediction. What you really want is for your curve to represent your
data, find patterns, generalized patterns in your data, but not memorize the data
points. That's what this complex curve has done. Imagine that this new data
represented in blue is your test data and the original set of data points are the training
data points for your model. The curve that we fit has not extracted patterns from the
training data, it has simply memorized the training data. So this very complex curve
that we have here will have great performance in training, but poor performance in
real usage on your prediction instances. On the other hand, if you'd gone with
something simpler such as a straight line to represent your training data points, it might
perform worse in training, but work better with test data in the real world. And this
example here demonstrates what it means to overfit your model on the training data.
This is when the model has memorized the training data and it has very low training
error. When you simply test your model on the training data, it is amazing, it's fantastic,
but such a model does not work well in the real world. It has a very high test error.
That's where you'd say your model has overfit on the training data. Overfitting is a real
problem when you don't have very much training data and the model that you've
chosen is fairly complex, one with many parameters such as a decision tree. On the
other hand, the flipside of overfitting is underfitting your model on the training data.
This is when your model is overly simple and it's unable to capture relationships that
exist in data. Here, your model does not work well on the training data itself and
underfitting is also a significant problem when you don't have enough data for your
model to learn from. Underfitting results in a model that is far too simple to be useful.
This is a model with not much predictive power, that's why it's useless.
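Here is a minimal sketch of the curve-fitting example, assuming made-up noisy points that really follow a straight line. It fits a simple line and a degree-9 polynomial that can pass through all ten training points, then compares the errors. The data, the noise level, and the polynomial degrees are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

# Hypothetical training data: 10 noisy points around the line y = 3x + 1.
x_train = np.linspace(0.0, 1.0, 10)
y_train = 3.0 * x_train + 1.0 + rng.normal(scale=0.3, size=10)

# Hypothetical test data drawn from the same underlying relationship.
x_test = rng.uniform(0.0, 1.0, size=50)
y_test = 3.0 * x_test + 1.0 + rng.normal(scale=0.3, size=50)

# A simple model (straight line) and a complex one (degree 9, enough to
# pass through every one of the 10 training points).
simple_curve = Polynomial.fit(x_train, y_train, deg=1)
complex_curve = Polynomial.fit(x_train, y_train, deg=9)

def mean_squared_error(curve, x, y):
    """Average squared distance between the curve and the points."""
    return float(np.mean((curve(x) - y) ** 2))

simple_train = mean_squared_error(simple_curve, x_train, y_train)
complex_train = mean_squared_error(complex_curve, x_train, y_train)
simple_test = mean_squared_error(simple_curve, x_test, y_test)
complex_test = mean_squared_error(complex_curve, x_test, y_test)

print("training error:", simple_train, complex_train)
print("test error:    ", simple_test, complex_test)
```

The complex curve's training error is essentially zero because it has memorized the training points. On new data, a curve like this typically does worse than the straight line, which is the overfitting behavior described above.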
Module Summary
And with this, we come to the very end of this module where we discussed the need
for data preparation in machine learning. When you're working with data in the real
world, the data is often not in the format that you need to build and train models. Ask
any data scientist or data analyst who works in the real world and they'll tell you that an
inordinate amount of time is spent cleaning and preparing data for ML. We discussed the
problem of insufficient data where there simply isn't enough data to train a model and
we also discussed a few techniques that you can use to mitigate this. Though this is a
hard problem to solve, it typically involves access to additional data sources. We then
moved onto discussing excessive or overly complex data when you have too much
data to work with. In this context, we discussed the curse of dimensionality and how
feature engineering techniques are required to reduce the complexity of data that
you're working with. We then discussed other problems, such as non-representative
data, missing data, and outliers and techniques that you can use to cope with each of
Implementing Data Cleaning and Transformation
Module Overview
Hi, and welcome to this module on implementing data cleaning and transformation. In
the previous module, we spoke of the importance of cleaning and preparing your data.
In this module, we'll apply many of the concepts that we studied in practice. We'll work
hands on with real world data and perform exploratory data analysis to understand the
relationships that exist in our data. We'll then use that data to build a very simple
regression model. We know that data in the real world is riddled with missing values
and may contain outliers. We'll see how we can perform univariate imputation of
missing values. You'll find that the scikit-learn library has a built-in imputer estimator for
exactly this. We'll then work with an experimental estimator available in scikit-learn,
which allows us to perform multivariate imputation of missing values. Multivariate
imputation allows us to use other features that we have in our dataset to fill in missing
values. We'll then see how we can bring all of this together into the same pipeline. We'll
construct a scikit-learn pipeline, which performs feature imputation and then fits a
classification model. In this way, you have a single pipeline that deals with data
preparation and cleaning, as well as the training of your ML model.
have a folder named datasets. This is where I've stored all of the data sets that we'll use
for the demos of this course. This is a subfolder. Let's head back to our current working
directory, that is where we'll create Jupyter Notebooks in order to write code. Click on
this New drop-down here and create a new notebook running the Python 3 kernel. The
version of Python that we'll be using for all of the demos is the latest available at the
time of this recording, that is Python 3.7. This notebook is currently untitled. I'm going to
rename it to something meaningful, ReadCleanExploreDataset. We are now ready to
get started. Many of the data preparation techniques that we'll use in this course are
available as a part of the scikit-learn library. Make sure that you have the latest version
installed using pip install -U scikit-learn. We'll use a few other open source Python
libraries as well, but this is by far the most important and the first one that we'll work
with. Set up the import statement for the other modules that we'll need, and let's take a
look at the current version of the scikit-learn library. This is the version that I'm working
with, the latest version available at the time of this recording, 0.21.3. Let's take a look at
the NumPy version that I'm using, 1.17.1, and here is the version of pandas that we'll be
using. These version numbers are for your reference. If you're working with newer
versions and you find that something doesn't work exactly like how you'd expect it to,
one thing to do would be to check to see whether version mismatches are causing the
issue. Alright, let's go ahead and read in the csv file which contains our data. I have my
data located in the cars.csv file under the datasets folder. This is the automobile miles
per gallon dataset, the original source is here at Kaggle, but I have modified and
contaminated the data that's present here in this data set. I've added in missing values
for several fields, messed up the formatting of other fields so that we get some real
practice with data cleaning and preparation. If you take a look at some of the records in
this data set, you'll see that it's ideal for regression analysis. You have a bunch of
characteristics of automobiles and we'll use these characteristics to try and predict the
mileage for the individual cars, miles per gallon. Like I mentioned earlier, this data set is
contaminated or not really clean and we need to deal with that first before we can
explore this data. Observe that some of the fields have question marks indicating that
these values are missing. These are some of the issues that we need to take care of
before we can use this data set for regression analysis. Let's go ahead and explore this
data set further. You can see that there are about 400 records and a total of 12
columns. Pandas has a number of useful functions
that work with NaN values. So let's replace all of our missing fields which have
question marks in them with np.nan, using automobile_df.replace('?', np.nan). With this
done, if you see a sample of records in our data set, you'll see that all of the fields that
were previously question marks, these are missing values, have now been replaced by
NaN, or not a number. Pandas offers special functions allowing us to see
how many NaNs are present in our data and that's what we'll use next. Calling isna().sum() on the data frame will
sum up all of the missing values or NaN values in each column of our data set. This will
give you a quick overview of how many fields have missing values in each column. You
can see that there are 9 values missing in the MPG column, 2 values missing in the
cylinders column, 2 values missing in the compression-ratio column, and so on. We'll see
some examples of a few different techniques that we can use to deal with these missing
values. As far as the miles per gallon field is concerned, I'm going to replace all of the
missing values with the current mean of the data. We're essentially saying that for cars
for which we don't have mileage details, their miles per gallon is equal to the average of
the data set. Pandas offers this very useful function called fillna which we can use to fill
in missing values. Now you can fill in missing values using any constant. Here, we are
calculating the mean value of the mpg for all the cars for which data is available and
using that mean value to fill in the missing fields. With this done, if you run isna().sum()
once again, you'll see that the miles per gallon column no longer has missing
values. Another valid strategy that you can use to deal with missing values is to simply
drop all records with missing fields. If you feel that you have sufficient data for your
regression analysis and there are only a few records with fields missing, this is a
perfectly valid option and Pandas makes this easy by offering you the dropna function.
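The steps just described can be sketched on a tiny stand-in data frame. The rows and column names below are invented for illustration (the real cars.csv has about 400 records and 12 columns), so treat this as a sketch of the technique rather than the exact notebook code.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of cars.csv; values are made up.
automobile_df = pd.DataFrame({
    "mpg": [18.0, np.nan, 24.0, 30.0],
    "cylinders": ["8", "4", "?", "4"],
    "horsepower": ["130", "?", "95", "70"],
})

# Turn the '?' placeholders into NaN so pandas treats them as missing.
automobile_df = automobile_df.replace("?", np.nan)

# Count the missing values in each column.
missing_before = automobile_df.isna().sum()

# Fill missing mpg values with the mean of the mpg values that are present.
automobile_df["mpg"] = automobile_df["mpg"].fillna(automobile_df["mpg"].mean())

# Drop every record that still has a missing field.
cleaned_df = automobile_df.dropna()

print(missing_before)
print(cleaned_df)
```

Note that fillna only touches the column you apply it to, whereas dropna with no arguments removes any row that has a missing value in any column.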
Let's take a look at how many records we are left with and you can see we have 387
records as opposed to 394 or so records that we had earlier. So there were seven
records with missing fields that have been dropped from our data set. An alternative to
the isna function in Pandas is the isnull function, which is simply an alias for isna. It
checks for null values in your columns and null values are deemed to be missing. Let's
sum up all of the null values.
You can see that there are 0 null values present in our data set. Our data set no longer
has missing values. Now we examine the columns that exist in my data and for my
regression analysis, I feel that the model column is not really needed. It's not significant,
so I'm going to go ahead and drop the model column. Let's take a look at the columns
that we have left in our data set. There are three other columns here which I feel that
we can do without: bore, stroke, and compression-ratio, so I'm going to go ahead and
drop those columns as well. The remaining columns of our data we'll use to perform
regression. Here is what our data looks like right now. You can see that we still have
some cleaning left to do. It's not just missing values that we have to deal with.
Specifically, you might observe here that the year and the origin columns are not
really well formed. Let's see how we can clean this data next.
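Dropping the columns we don't need can be sketched like this. The column names come from the narration, but their exact spelling and casing in the real file are assumptions, and the rows are made up.

```python
import pandas as pd

# Hypothetical two-row stand-in; only the column names matter here.
automobile_df = pd.DataFrame({
    "mpg": [18.0, 24.0],
    "model": ["model a", "model b"],
    "bore": [3.5, 3.2],
    "stroke": [2.7, 3.4],
    "compression-ratio": [9.0, 8.5],
    "weight": [3504, 2372],
})

# Drop the model column first, then the three other columns we can do without.
automobile_df = automobile_df.drop(columns=["model"])
automobile_df = automobile_df.drop(columns=["bore", "stroke", "compression-ratio"])

remaining_columns = list(automobile_df.columns)
print(remaining_columns)
```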
Cleaning Data
Let's deal with the year column first. I'm going to access the values in the year column
as a string and see how many of the values in the year column are numeric in nature,
so str.isnumeric().value_counts() will give me counts of the numeric and non-numeric values. So
you can see that we have 351 numeric values for year and 36 non-numeric values. This
exploration is needed in order to understand how we can clean this year column. Let's
take a look at all of the non-numeric values in the year column where isnumeric is equal
to false. A quick explanation here, you can see that all of these values contain a numeric
year so those are the first four characters and sometimes they contain additional
characters in brackets and so on. And you can see that the data type for the year
column is of type object because not all fields are numeric. Given the nature of the
values in the year column, one technique that we could use to extract valid year
information is to run a regular expression in order to extract the first four characters
from all of the fields in that column. So I'm using the str.extract function in order to run
a regular expression pattern match and I'm going to look for the first four characters,
which are numeric. And a quick sampling of the results here shows me that this seems
to be a good way to extract the year information. So let's go ahead and check to see
whether the year field has any missing values. There are none as we know from earlier.
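Here is a minimal sketch of the year clean-up, assuming made-up year strings in the style described (a four-digit year, sometimes followed by extra characters). The regular expression shown is one reasonable choice, not necessarily the exact pattern used in the notebook.

```python
import pandas as pd

# Hypothetical year values: some clean, some with trailing characters.
year = pd.Series(["1973", "1974[1975]", "1980-1981", "1976"])

# Count how many values are purely numeric and how many are not.
numeric_flags = year.str.isnumeric()
numeric_counts = numeric_flags.value_counts()

# Look only at the values that are not purely numeric.
non_numeric_years = year[~numeric_flags]

# Extract the first four characters when they are all digits.
extracted = year.str.extract(r"^(\d{4})", expand=False)

# Convert the extracted strings to numbers.
year_clean = pd.to_numeric(extracted)

print(numeric_counts)
print(year_clean)
```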
And let's assign this cleaned data back to the year column. We call pd.to_numeric to
convert the first four characters that we extracted from the year column to numeric form
and assign this result to the year column once again. If you now quickly sample the data
set, you can see that all of the year values have been cleaned up. They are now in the
form of int64s representing a valid year. Instead of using values in the year column
directly, I feel that it's more appropriate to represent a car by how old it is. I feel that
there is some relationship between the age of the car and the mileage that it offers. So
I'm going to subtract the year specified in the year column from the current year. I've
used Python's datetime module to get the current year and have subtracted the year
the car was made. I'm going to go ahead and drop the original year column and leave
the new age column. The age of the car is one of the features that we'll use for
regression analysis. To see if this data set requires further cleaning, one way to explore
this data set is to look at the data types of the individual columns and you can see here
that there are several data types that are object. Now some of these columns you
might expect to be numeric in nature, such as displacement or cylinders or acceleration.
The fact that these are objects tells us that there is some more data cleaning and
preparation that we need to do. Let's start with the first of these columns here, the
cylinders column, which is of type object. There are no missing values for cylinders. We
know that already. Let's quickly see a count of how many numeric and non-numeric
values are present in this column. There are 378 numeric values, 9 of them are non-
numeric. Let's take a look at the non-numeric values, that is where isnumeric is equal to
false, to see what they look like and you can see that they are dashes. This is another
representation of missing values in this data set, so we need to deal with these missing
values as well. I'll first extract all of the numeric values from the cylinders column, that is
all of the values which are not equal to the dash which represents missing values. Once
I have all of the numeric values, I'm going to convert them to integers and calculate the
mean or the average which I store in the cmean variable. All of the missing values are
now replaced by the mean value. At the end of this operation, all of the missing fields in
the cylinders column will be replaced by the average number of cylinders. And if you
take a look at the column data types now, you can see that cylinders is of type int64.
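One way to sketch the cylinders clean-up is shown below. The values are made up, and the use of Series.where together with pd.to_numeric is an assumption about how to express the steps, not necessarily how the notebook does it.

```python
import pandas as pd

# Hypothetical cylinders column where '-' stands for a missing value.
cylinders = pd.Series(["8", "4", "-", "6", "-", "4"])

# Keep only the values that are not dashes and convert them to integers.
numeric_cylinders = cylinders.loc[cylinders != "-"].astype(int)

# Average number of cylinders among the valid values.
cmean = numeric_cylinders.mean()

# Blank out the dashes, convert to numbers, fill the gaps with the mean,
# and convert the whole column to integers (this truncates a fractional mean).
cylinders_clean = (
    pd.to_numeric(cylinders.where(cylinders != "-"))
    .fillna(cmean)
    .astype(int)
)

print(cmean)
print(cylinders_clean.tolist())
```

Because the column ends up as integers, a fractional mean such as 5.5 is truncated to 5. If that matters for your analysis, you could round the mean first or keep the column as a float.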
We perform this conversion as well. So we've dealt with cylinders, let's move onto
displacement. Displacement should be a numeric field and not an object, so I'm going
to use the pd.to_numeric function in order to coerce displacement to be a number.
Setting errors equal to coerce will convert invalid numeric values to NaNs, or not a number.
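As a quick illustration of this behavior, here is a small sketch with made-up displacement strings, one of which is not a valid number.

```python
import pandas as pd

# Hypothetical displacement values stored as strings; "unknown" is not a number.
displacement = pd.Series(["307.0", "350", "unknown", "318.0"])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an error.
displacement_numeric = pd.to_numeric(displacement, errors="coerce")

print(displacement_numeric)
```

Any NaNs created this way are new missing values, so it is worth checking isna().sum() again after the conversion.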
Let's go ahead and take a look at all of the data types. You can see the displacement is
now a float64. All of the values have been converted to numeric form successfully. I'm
now going to go ahead and perform the same pd.to_numeric operation on the values
in the weight column and the values in the acceleration column as well. Weight has
been converted to numeric form. Here, I convert acceleration to numeric form as well. If
you examine the data types for all of the column values, now you can see that it's only
origin which is of type object. Before we go ahead and clean up this column, let's
examine some of the data that is present here so we understand what we are working
with. You can see that in some cases the origin is just the code or name of the country,
and in other cases, there are specific cities within the country listed as well. Some of the
origin values also have formatting issues. If you want a quick sampling of all of the
different kinds of values present in the origin field, you can invoke the unique function
and this will give you all unique values that are present in the origin column. This shows
us that the origin of a particular car can be the US, Europe, or Japan. So let's go with
these three values and we'll get rid of the other extraneous information here, such as
the city in Japan that a car was made or a city in Europe. Now for all values of origin
which contain the string US within their value, I'm going to say that the car was made in
the US. I use the np.where function to see whether the origin string contains the term
US. If it does, I'm going to say that the car originated in the US. Otherwise, I leave the
origin string as-is and I don't change it. Let's take a look at the unique values for origin
now. You can see that all references to the US have been replaced by the single simple
string US. Let's do the same thing for Japanese origin cars as well. If the origin string
contains Japan, we replace it with just Japan, otherwise, we leave the origin as-is. You
can see that all origins which contain Japan in their original value have been replaced just
by Japan and the city names have been left out and we'll do exactly the same thing for
Europe origin cars as well. And at this point, the only values of origin are US, Japan, or
Europe. These are three specific values, we can deal with this in our regression analysis.
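The origin clean-up can be sketched with np.where and str.contains. The messy origin strings below are invented for illustration, and the loop is just a compact way of repeating the same step for each of the three values.

```python
import numpy as np
import pandas as pd

# Hypothetical origin values: some are just the country, some include a city.
origin = pd.Series([
    "US", "US; Detroit, Michigan", "Japan; Tokyo",
    "Europe-Berlin", "Japan", "Europe",
])

# If the string contains the country name, replace it with just that name;
# otherwise leave the value as it is.
for country in ["US", "Japan", "Europe"]:
    origin = pd.Series(np.where(origin.str.contains(country), country, origin))

unique_origins = sorted(origin.unique())
print(unique_origins)
```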
If you want to quickly explore the statistical information for numeric features in your
data, you can run the describe command on your Pandas data frame. This will give you
the count, mean, standard deviation, min/max, and quartile values for all numeric
features. Our features are looking good so far. They've been fairly cleaned up. Let's go
ahead and save this out as a csv file. This is our cleaned-up data set which I store in
cars_processed.csv. I'll now run a quick ls command in the datasets folder to ensure
that the file has been saved out, and yes indeed, it has.
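Going back to the age feature derived earlier in this clip, here is a minimal sketch of that step. The rows are made up, and the name of the new column, age, is taken from the narration.

```python
import datetime

import pandas as pd

# Hypothetical rows with an already-cleaned numeric year column.
automobile_df = pd.DataFrame({
    "year": [1973, 1980, 1976],
    "mpg": [18.0, 32.0, 24.0],
})

# Age of the car = current year minus the year the car was made.
current_year = datetime.datetime.now().year
automobile_df["age"] = current_year - automobile_df["year"]

# Keep a copy of the years for reference, then drop the original year column.
original_years = automobile_df["year"].tolist()
automobile_df = automobile_df.drop(columns=["year"])

print(automobile_df)
```

Since the age depends on the year in which you run the code, the values will differ from what was shown in the recording, though the ordering of the cars by age stays the same.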
Visualizing Relationships
At this point, we have a clean data set. We can now explore this data set for
relationships that exist between the variables before we move onto regression. I'm
going to use matplotlib here and plot a bar graph of the age of the automobile versus
the mileage of the automobiles. This will allow me to see whether there exists a
relationship between the average miles per gallon and how old a particular car is and
this bar graph shows that my intuition was right. You can see that for older cars, the
average miles per gallon is far lower. It makes sense that newer cars have more efficient
technology for better mileage so this is an important factor in our regression. Let's
move on and use a scatter plot to view another relationship, this one between the acceleration
of a car and the mileage of the car. You can see from the shape of the resulting scatter
plot that there exists a linear relationship. Cars with higher acceleration seem to have
better mileage. Exploring your data and variables in this manner gives you a quick feel
for the relationships that might exist. I'm now curious about whether the weight of a
car affects the mileage of a car. Intuition tells me that it should. So I'm going to plot a
scatter plot of weight on the X axis and miles per gallon on the Y axis and you can see
that it's a negative relationship. Cars that are heavy, that is have a higher weight, tend
to have lower miles per gallon. A scatter plot is typically used for visualizing bivariate
relationships. You can plot a third dimension on a scatter plot as well by specifying a
color for your data points. Here, I use the scatter plot functionality available as a part of
my Pandas data frame. Pandas uses matplotlib under the hood to allow you to
intuitively visualize the data in your data frame. On the X axis, I have the weight of the
car, on the y axis, I have acceleration, and on the third dimension represented by color,
I have horsepower. You can see that the data points in our scatter plot are colored and
you also have a color scale off to the right. You can see that there is a negative
relationship between acceleration and the weight of the car, that is cars with higher
weights have lower acceleration. Based on the color of the data points, it seems like
cars with higher weights and lower acceleration also tend to have higher horsepower.
Let's visualize another relationship that might exist between the number of cylinders in
your car and the mileage of the car. This bar graph that we see here on screen doesn't
really give us a clear pattern. It's hard to say what exactly the relationship is. I'm now
going to view the correlations that exist in our data, but before that, I'm going to drop
the cylinders and origin columns. Cylinders and origin have discrete values and not
continuous numeric values. Let's move on and view interrelationships in our data using
the correlation matrix. The corr function in Pandas quickly calculates the correlation
coefficients between every pair of numeric variables that exist in your data set. A
correlation coefficient measures the linear relationship that exists between two
variables. A variable is always perfectly, positively
correlated with itself. It has a correlation coefficient of 1. You can see the other
coefficients here. You can see that weight and mileage are negatively correlated, whereas
acceleration and mileage are positively correlated. A positive correlation indicates
that the variables move in the same direction, that is if one increases, the other does as
well, and if one falls, the other does as well. A negative correlation coefficient indicates
that the variables move in opposite directions. Let's visualize this correlation matrix in
the form of a heatmap, which is the best way to view this information. A heatmap
converts this correlation matrix into colored cells where the color of the cell
indicates whether the variables are positively or negatively correlated and how strong
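The correlation matrix discussed in this clip can be sketched on a few made-up rows. The numbers are invented so that heavier cars have lower mileage and quicker-accelerating cars have higher mileage, mirroring the relationships described. The heatmap itself is not shown here.

```python
import pandas as pd

# Hypothetical rows; values are made up to mirror the relationships described.
automobile_df = pd.DataFrame({
    "mpg": [15.0, 18.0, 24.0, 30.0, 36.0],
    "weight": [4300, 3900, 3000, 2400, 2000],
    "acceleration": [11.0, 12.5, 15.0, 16.0, 19.0],
})

# Pairwise correlation coefficients between all numeric columns.
automobile_corr = automobile_df.corr()

print(automobile_corr)
```

Each value on the diagonal is 1, since a variable is perfectly correlated with itself, and the matrix is symmetric around that diagonal.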
Now that we have a clean automobile miles per gallon dataset, let's use it for
regression analysis. Here we are on a new Jupyter notebook called
BaselineRegressionModel. Set up the imports for the libraries and packages that you'll
need. We'll use pandas and matplotlib as well. We already have our processed and
clean data stored in the cars_processed.csv file under the datasets folder. This is what
I'll read into the automobile_df data frame. This is the dataset which we cleaned up. It
has no missing values and the column values are all well-formed. Let's move on and
take a look at the shape of this data. We have 387 records in total. We'll first perform
simple regression using just a single feature. Our X variable or the single feature that
we'll use is the age column, and we'll perform regression to see if we can use age to
predict the mileage. When exploring our dataset, we saw that there was a relationship
between age and mileage. Let's plot a scatter plot and visualize this relationship once
again. You can see from the shape of the scatter plot that older cars tend to have
lower mileage so this is a good candidate for our simple regression. We'll use age to
predict mileage. We'll first perform regression using machine learning techniques. This is
a simple linear regression, and for that, we'll use scikit-learn's train_test_split function to
split our data set into training data that we'll use to train our machine learning model
and test data that we'll use to evaluate our ML model. Splitting our data in this fashion
is common practice in machine learning. We want to ensure that our machine learning
model is evaluated on data it has not seen during training, which is why it's important
that we have a test set. The scikit-learn
train_test_split function makes this very easy for us. We split our X and Y data with a
test size of 0.2, meaning we'll hold out 20% of our data to evaluate our model. Eighty
percent we'll use to train our model. Train_test_split will shuffle our X and Y data and
split it into a training set and a test set. In order to perform linear regression, we'll use a
machine learning estimator object from the scikit-learn library. This is the linear
regression estimator. This estimator object is a high-level API which makes it very easy
to perform linear regression. We simply instantiate the object and pass in normalize
equal to True. If you set normalize to True, our X features will be normalized before
regression is performed. Normalization here involves subtracting the mean and dividing
every feature by its L2 norm. This will serve to center all of your X features around a
mean of 0. Centering your data around a mean of 0 makes it easier for machine
learning models to work with your data, so this is often performed. After we've
normalized our data, we call fit on this estimator object and pass in x_train and y_train
to train our machine learning regression model. Once we have a fully trained regression
model, we can call linear_model.score and evaluate our model on our training data. This
gives us the r square score of our regression analysis. R square is a measure of how
well our regression model fits our underlying data. The R square score is expressed as a
fraction between 0 and 1 or as a percentage, and this score measures how much
of the variance in the underlying data was captured by our regression model. Here
you can see that the R square is 0.31, not really a great score. This was just simple
regression with a single feature, maybe this wasn't the best feature to choose. Let's go
ahead and use this linear model for prediction by calling predict on our test data. A
good measure of how robust our linear model is, is its r square score, or r2_score, on the
test data, and this we can calculate using the r2_score function available in scikit-learn.
To this function, we pass in the predicted values from our model, that is, our list of
predictions, and the actual mileage from our original data set, and you can see that the
r2_score on the test data is about 0.32. It hasn't changed much between the training
and the test data. It's not a great score, but it's fairly consistent. Our model has not
overfit on the training data. Let's move on and plot a scatter plot of the original data
points and our linear model's predictions. Our model's prediction will be in the form of a
straight line, it's a linear model after all. And here is a visualization of our linear model fit
on the original data. You can see that the fit is not great because there are many data
points that are far away from this line. Let's choose another feature in order to perform
simple linear regression. This time, I'm going to choose Horsepower and use
horsepower to predict the mileage of a car. Once again, I use the train_test_split
function to split my data set into training data and test data. Twenty percent of the
data I'll use to evaluate my model. I'll instantiate the LinearRegression estimator object
in exactly the same way and call fit on the training data and the corresponding y
values. I'll then print out the training score, that is the training r2_score for my model,
then use this model to predict on the test data and print out the test r2_score as well.
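The split, fit, score, and predict steps described here can be sketched roughly as follows. This is only an illustration under stated assumptions: the horsepower and mileage values are synthetic stand-ins rather than the cars_processed.csv data, and the normalize argument is left out because recent scikit-learn releases no longer accept it on LinearRegression. The last few lines compute R square by hand as a check on r2_score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cars data (an assumption, not the course dataset):
# mileage falls roughly linearly with horsepower, plus some noise.
rng = np.random.default_rng(0)
horsepower = rng.uniform(50, 200, size=300)
mpg = 40 - 0.12 * horsepower + rng.normal(0, 3, size=300)

X = horsepower.reshape(-1, 1)   # scikit-learn expects a 2-D feature array
y = mpg

# Hold out 20% of the rows for evaluation; the split shuffles by default.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

linear_model = LinearRegression().fit(x_train, y_train)

train_r2 = linear_model.score(x_train, y_train)   # R square on the training data
y_pred = linear_model.predict(x_test)
test_r2 = r2_score(y_test, y_pred)                # actual values first, then predictions

# R square by hand: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(f"training R square: {train_r2:.2f}, test R square: {test_r2:.2f}")
```

Note that r2_score takes the actual values as its first argument and the predictions as its second; swapping them generally gives a different number.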
And with horsepower as a feature, you'll see that our linear model is far better. It has a
training score of .56 and the test r2_score is even better, .61. It seems like the
horsepower feature has far more predictive power for the mileage as compared with
age. Let's take a look at the scatter plot of the original data points and a plot of the
straight line that represents our linear model and this time, using this visualization, it's
pretty obvious that our straight line fits the underlying data points much better. The
points are closer to the line. That was simple regression using just a single feature. Let's
now perform multiple regression. We'll use multiple features to predict the mileage of a
car. Observe that all of the features here are numeric, except for the origin column. The
origin column contains discrete values which we'll need to convert to numeric form
because machine learning models can only work with numeric data. A common way to
encode categorical columns, that is columns with discrete values, into numeric form is to
use one-hot encoding. And the pd.get_dummies function allows us to one-hot encode
our origin column. You can see here that one-hot encoding adds a new column for
each value of our categorical data, so there is a column for Europe, Japan, and US
origins. When there is a 1 present in the origin_US column, that car originated from the
US. A value of 1 in the Europe column will indicate that the car originated in Europe. Those
categorical values are converted to numeric form and can be fed into a machine
learning model. This time we'll train our LinearRegression model using all of the features,
except miles per gallon. Miles per gallon is obviously the target that we are trying to
predict. Split the data into training data and the corresponding test set. Fit a regression
model on the training data and calculate the training score. Use this model for
prediction on the test data and print out the r2_score for the test data. You can see
here that the training score, the training R square, is almost 80% and the test r2 is in the
same range. This is clearly a much better model than the earlier models that used just a
single feature.
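As a rough sketch of the one-hot encoding and multiple regression steps from this demo, the following uses a small made-up data frame in place of cars_processed.csv. The column names and the coefficients used to generate mpg are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Small synthetic frame standing in for the cars data (all values are made up).
rng = np.random.default_rng(1)
n = 200
cars = pd.DataFrame({
    "horsepower": rng.uniform(50, 200, n),
    "weight": rng.uniform(1500, 5000, n),
    "age": rng.integers(30, 50, n),
    "origin": rng.choice(["US", "Europe", "Japan"], n),
})
cars["mpg"] = (45 - 0.05 * cars["horsepower"] - 0.004 * cars["weight"]
               + rng.normal(0, 2, n))

# One-hot encode the categorical column: one new 0/1 column per category.
cars = pd.get_dummies(cars, columns=["origin"], dtype=int)

X = cars.drop("mpg", axis=1)   # every column except the target
y = cars["mpg"]

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(x_train, y_train)
print("training R square:", model.score(x_train, y_train))
print("test R square:", r2_score(y_test, model.predict(x_test)))
```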
feature imputation. Univariate feature imputation involves inferring missing values from
the existing values that are known about a particular feature. We'll write our code in a
brand-new notebook here, set up the import statements for the various libraries that
we'll use. Pandas and NumPy are our usual libraries. In order to understand and use
feature imputation, we'll work with a brand-new data set. This is the diabetes data set.
This is for classification models. This data set is freely available at Kaggle at this URL that
you see here on screen. The features of this data set give us health-related
characteristics of individuals, the number of pregnancies, glucose levels, blood pressure,
insulin, BMI, and so on, as well as the age of the individual, and the outcome here is
whether they've been diagnosed with diabetes or not. When we build a classification
model, we'll use these health features to predict the outcome. The shape of this data
frame shows us that there are about 770 records and 9 columns of data available. The
info function in a Pandas data frame gives you quick information about the various
columns of data. It'll give you whether there are null values present, the data types, and
so on. At a glance, it seems like this data set is fairly clean, there are no missing values.
Everything seems to be non-null, but let's dig a little deeper. I'm going to describe this
data to get a quick statistical overview of the numeric features in my data. This gives
me mean, standard deviation, min, max, and quartile values. Observe the minimum
values for the different numeric features. You can see that the minimum values for
many of these features are 0. In the case of pregnancies, this makes sense, but for
things like glucose and blood pressure, 0 clearly does not make sense here. This means
that missing values are represented using zeros in this data set. When you're working
with data in the real world, you should understand how your missing values are
represented so you can identify them. Now that we know this, let's replace all of the 0
values in our data set with NaNs. For each of these individual columns, I invoke the
replace function on our Pandas data frame and replace all 0 values with np.nan. With
this done, we can now use isnull().sum() in Pandas to quickly see which columns
have missing values. It's clear that the skin thickness and insulin columns have an
especially large number of missing values, and a few other columns have some as well. Let's
work on cleaning the data in the skin thickness column first. I'm going to reshape the
skin thickness values in the form of a two-dimensional array. So we have 768 rows and
a single column. In order to fill in missing values using inference from the current data
that we have available, scikit-learn offers this SimpleImputer estimator object. The
SimpleImputer supports several strategies for filling in missing
values. You can fill in using a constant, the mean of the existing data, median, mode,
etc. In the case of the skin thickness column the strategy that we'll use to impute
missing values is the most_frequent. This will use the mode or the most_frequent
observation for skin thickness that is in the data that we have available to fill in missing
values. Instantiate the SimpleImputer estimator object, specify that missing values are
represented using np.nan, and use the most_frequent strategy. The fit function will fit
on the existing values of skin thickness and transform will fill the missing values using the
most frequent observation. When you use imputation techniques to fill in missing values,
the statistical properties of your data will change and this is something that you need to
be aware of. You can see that the mean of the skin thickness has now risen to 29 and
the standard deviation has changed as well. Imputation to fill in missing values will
change the characteristics of your data and you have to be okay with this change
based on your data set and your use case. In this situation, we don't want to drop
records with missing values. We don't have very much data to work with, so this
imputation may make sense. Now let's move on and see what null values we have left
in our data set. You see that skin thickness now has no null values. We have imputed
using the mode. Let's move onto another column. This time, I'm going to fill in missing
values in the glucose column using strategy equal to median. The median represents
the data point at the fiftieth percentile of our data and this is the data that I'll use to fill
in missing values for glucose. Call imputer.fit on the glucose column and call transform
to fill in the missing values. Let's take a quick look at how the mean and standard
deviation of our data has changed thanks to our imputation. You can see that the
mean remains more or less the same and the standard deviation is also unchanged. At
this point, we've filled in missing values for the glucose field and the skin thickness field.
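The steps so far, marking zeros as missing and then imputing with the mode and the median, can be sketched as follows. The six rows are made-up values, not the Kaggle diabetes data, and fit_transform is used here as a shorthand for calling fit and then transform.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (made-up values) where 0 stands in for "missing".
df = pd.DataFrame({
    "Glucose":       [148, 85, 0, 89, 137, 116],
    "SkinThickness": [35, 29, 0, 23, 35, 0],
})

# Replace the 0 placeholders with NaN so pandas and scikit-learn treat them as missing.
for col in ["Glucose", "SkinThickness"]:
    df[col] = df[col].replace(0, np.nan)

print(df.isnull().sum())   # missing-value count per column

# Mode imputation for SkinThickness. The imputer expects 2-D input,
# so we select with a list of columns (reshape(-1, 1) also works).
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
df["SkinThickness"] = mode_imputer.fit_transform(df[["SkinThickness"]]).ravel()

# Median imputation for Glucose.
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
df["Glucose"] = median_imputer.fit_transform(df[["Glucose"]]).ravel()

print(df)
```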
These two columns have no more missing values. We can move on to working with another column,
and this time we'll work with the BloodPressure column. The imputation strategy that
we'll use here to fill in missing values is to calculate the average of the existing data.
Call fit on the BloodPressure column values and call transform and this will fill in the
missing values with the average. Let's take a look at how this has changed the statistical
properties of our data. You can see that the mean remains more or less the same. That
is to be expected. We imputed using the mean value after all. The standard deviation
has fallen a bit though since we have imputed missing values with the mean. So more
values are centered at the mean. Alright, let's move on and let's fill in missing values
using a constant. We'll do this for the BMI column. Strategy is equal to constant and
we'll fill values with 32. Thirty-two happens to be the mean value for BMI in any case.
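To see why mean imputation keeps the mean but lowers the standard deviation, here is a small sketch on a made-up array. The constant 75.0 is an arbitrary choice that happens to equal the mean of the observed values, mirroring the situation described for BMI.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up blood-pressure-like values with two gaps.
values = np.array([60.0, 70.0, np.nan, 80.0, 90.0, np.nan]).reshape(-1, 1)

# strategy="mean" fills every gap with the mean of the observed values.
mean_imputer = SimpleImputer(strategy="mean")
filled_mean = mean_imputer.fit_transform(values)

# strategy="constant" fills every gap with fill_value instead.
const_imputer = SimpleImputer(strategy="constant", fill_value=75.0)
filled_const = const_imputer.fit_transform(values)

# Compare statistics of the observed values with those of the filled array.
before_mean, before_std = np.nanmean(values), np.nanstd(values)
after_mean, after_std = filled_mean.mean(), filled_mean.std()

print(f"mean before {before_mean:.2f}, after {after_mean:.2f}")
print(f"std  before {before_std:.2f}, after {after_std:.2f}")
```

The filled-in values sit exactly at the mean, so they add nothing to the spread while increasing the number of records, which is why the standard deviation falls.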
Call imp.fit and then transform on the BMI values and let's take a look at how the
statistical properties of the data have changed. The mean remains more or less the
same. The standard deviation has changed a bit. Now at this point, if you do a missing
value check on our data, you can see that the only column with missing values is the
insulin column. We'll deal with this column soon using a different imputation technique,
but before that, let's save out our current diabetes data set into a separate file. I'm
imputation. Multivariate imputation algorithms use the entire set of available features to
estimate the missing values, not just values for that feature alone. We'll write our code
in a new notebook and set up the import statements for the commonly-used libraries. The imputer from scikit-learn that we'll use
for multivariate feature imputation is called the IterativeImputer. You can fit this
IterativeImputer on an entire data set where multiple columns have missing values. It'll
model each feature with missing values as a function of other features in an iterative
round robin fashion. What it essentially does is fit a regression model on all of the
other features to find values for the feature with missing values. We're essentially using
machine learning techniques to impute missing values in features. Now this
IterativeImputer, at the time of this recording, is experimental and
might change in backward incompatible ways. Because of this, in order to use the
IterativeImputer, you'll need to explicitly import enable_iterative_imputer from
sklearn.experimental. Now this might change later on. By the time you're
doing this course, it's quite possible that the IterativeImputer becomes an integral part
of scikit-learn. Let's instantiate the IterativeImputer estimator object. We'll run it for a
maximum of 100 iterations and I've initialized it with a random state of 0, that's just so
that you can replicate these results. Before we work with the real data set, let's see
how this IterativeImputer works with some handcrafted features. Here are our features, that
is, three features represented in three separate columns. Observe that I've manually set
up these features with a clear pattern. The value in the first column is twice the value in
the second column, and the value in the second column is twice the value in the third
column. Every row here can be thought of as a record and you can see that some
records have missing values indicated using np.nan. Here are records with just one
missing value and here is a record with two missing values. The fit function will run the
IterativeImputer which builds a regressor behind the scenes in order to impute missing
values. The regression models will analyze the patterns that exist in the data and let's
call imp.transform to see how the missing values have been filled in. You'll observe that
for records that have just a single field missing, our imputer did pretty well. Instead of 4,
it got 4.01, that's pretty good. Based on our manually generated pattern, the missing
value should have been 4 and it's 4.01. Here, the missing value should have been 4
once again and it's 4.01. And in this row, the missing value should have been 120, which
is the double of 60, but it is 119, so our imputer did pretty well. But here in this record
where we had two missing values, the imputer did not get it right. You can see that
with just three features and two values missing out of the three features, the imputer
doesn't perform well. Let's set up another test data set with the same three features in
the same pattern that we saw earlier and we'll use the imputer to transform this test
data. Call imp.transform on X_test and here is what the result looks like. Once again,
you can see that with records that have just one missing field, our imputer did well. The
missing field in the first highlighted row should have been 48 based on our pattern, 24
multiplied by 2, our imputer got 47.9. For the second highlighted record, the missing
field should have been 50 based on our pattern and our imputer got 50-point-something.
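A minimal sketch of this handcrafted experiment follows. The rows are illustrative values that follow the same pattern (the first column is twice the second, the second is twice the third); they are not the exact rows shown in the course, so the imputed numbers will differ from the ones quoted above.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (must precede the next import)
from sklearn.impute import IterativeImputer

# Hand-made features: column 0 is twice column 1, and column 1 is twice column 2.
features = np.array([
    [4.0,    2.0,    1.0],
    [24.0,   12.0,   6.0],
    [8.0,    np.nan, 2.0],
    [28.0,   14.0,   7.0],
    [32.0,   16.0,   np.nan],
    [np.nan, 60.0,   30.0],
    [200.0,  100.0,  50.0],
    [np.nan, np.nan, 3.0],   # two of the three values are missing
])

imp = IterativeImputer(max_iter=100, random_state=0)
imp.fit(features)
filled = imp.transform(features)
print(filled)

# New rows with the same pattern, imputed with the already-fitted imputer.
X_test = np.array([
    [np.nan, 24.0, 12.0],
    [100.0, np.nan, 25.0],
])
filled_test = imp.transform(X_test)
print(filled_test)
```

Observed values pass through unchanged; only the NaN cells are replaced with estimates from the regressors fitted behind the scenes.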
Now that we've understood how the IterativeImputer works, let's use it on our
diabetes data set. Read in our processed_incomplete.csv file. We've cleaned up almost
all of the missing values in the remaining columns. The only column that we have to deal
with is the insulin column which has several missing values. Calling isnull().sum() shows
us that there are 374 records in our data set where a value for the insulin field is
missing. This is what we are trying to impute using the multivariate feature imputer. In
order to impute the value for insulin, I'm not going to use the outcome column. So I'm
going to go ahead and drop the outcome and assign the resulting data frame to
diabetes_features. We'll impute values for the missing insulin feature by using all other
features, but not the outcome. Go ahead and instantiate an IterativeImputer estimator
object. We'll run it for a maximum of 10,000 iterations. In order to perform the
imputation, call fit on the diabetes_features data frame, and once we fit on the
features, we can transform the diabetes_features data frame and we'll get an array
with missing values all filled in. I'm going to convert this array to a data frame once
again. It has the same columns as the diabetes features data frame that we had earlier.
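The drop, impute, and rebuild steps can be sketched as below on a tiny made-up frame; the column subset and values are assumptions for illustration only. The sketch also puts the label back alongside the imputed features, and uses fit_transform as a shorthand for fit followed by transform.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Tiny made-up frame shaped like the diabetes data: only Insulin has gaps.
diabetes = pd.DataFrame({
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0, 116.0],
    "BMI":     [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Insulin": [np.nan, 94.0, np.nan, 94.0, 168.0, 88.0],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Keep the label out of the imputation.
diabetes_features = diabetes.drop("Outcome", axis=1)
diabetes_label = diabetes[["Outcome"]]

imp = IterativeImputer(max_iter=10000, random_state=0)
imputed_array = imp.fit_transform(diabetes_features)   # returns a NumPy array

# Rebuild a data frame with the original column names and index.
diabetes_features = pd.DataFrame(
    imputed_array,
    columns=diabetes_features.columns,
    index=diabetes_features.index,
)

# Put the features and the label back together and check for leftovers.
diabetes_processed = pd.concat([diabetes_features, diabetes_label], axis=1)
print(diabetes_processed.isnull().sum())
```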
And a quick look at the new diabetes features data frame with the missing values filled
in shows you that we have all values for the insulin field, they have been imputed using
the IterativeImputer. Now that I've completed processing and imputing missing values
for this data set, I'm going to concatenate the features and the labels to make one data
frame which I'll now write out to a csv file. Before this, let's quickly satisfy ourselves that
there are no missing values here. isnull().sum() gives us 0 for all columns. Insulin missing
values have been filled in. So write out to diabetes_processed.csv and here is our file
written out to our local machine. This is the file that we'll work with containing our
transform where this is useful. In order to transform a data set into the corresponding
binary matrix which indicates the presence of missing values in the data. When you're
using univariate or multivariate imputers to impute missing values, it's useful to have an
indication of where the missing values existed earlier and this is information that you can
get using the MissingIndicator available in the sklearn.impute namespace. Here are the
features that we'll work with, handcrafted features so we can see exactly what's going
on. Observe that instead of using NaNs to represent missing values, I've used -1. So all
the negative ones here can be considered to be missing values and you can instantiate
your MissingIndicator and specify what you've used to indicate missing values. Here,
missing values are represented using -1. You can mark missing values by calling
fit_transform on your features and the result is a binary matrix. Observe that wherever
there are missing values in a particular feature, that is indicated using True. By default,
the MissingIndicator only represents those features which have
missing values. In our original matrix, only the first and the third column had missing
values, which is why only those features are represented in the missing values matrix. If
you want to know what features have missing values in your data, access the features_
attribute of the indicator. Here, you can see that features 0 and 2 have missing values,
the feature at index 1 does not. If you want the MissingIndicator to give you a matrix
representation of all features, including those with no missing values, specify features
equal to all when you're instantiating the MissingIndicator. Once again, we can get a
binary representation of missing values in our data by calling fit_transform on the
features. Here is our binary matrix and this matrix has three columns corresponding to
the three features in our input data. The feature at index 1 which has no missing values
is also represented here. indicator.features_ shows us that the features represented are
0, 1, and 2.
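Here is a small sketch of both modes of the MissingIndicator. The matrix is made up for illustration: -1 marks a missing value, and only the columns at index 0 and 2 contain any.

```python
import numpy as np
from sklearn.impute import MissingIndicator

# Made-up features where -1 marks a missing value.
features = np.array([
    [-1, 1,  3],
    [ 4, 1, -1],
    [ 8, 2,  6],
    [-1, 5,  7],
])

# Default behaviour: only features that actually contain missing values are represented.
indicator = MissingIndicator(missing_values=-1)
mask_missing_only = indicator.fit_transform(features)
print(mask_missing_only)      # boolean matrix with two columns
print(indicator.features_)    # indices of the features represented in the mask

# features="all": one mask column per input feature, with or without missing values.
indicator_all = MissingIndicator(missing_values=-1, features="all")
mask_all = indicator_all.fit_transform(features)
print(mask_all)
print(indicator_all.features_)
```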
data before we fit a classification model. We'll build this data transformation pipeline
using the scikit-learn pipeline object. I'm going to deliberately contaminate our data with
missing values and we'll perform feature imputation using the simple imputer in our
pipeline. In order to fit our classification model, I'll work with the diabetes_processed.csv file.
This is the processed dataset, it contains no missing values, so we are going to have to
introduce a few missing values in here. I'm going to split our data into features and the
corresponding label in order to perform classification. Diabetes features are all features,
except the outcome, and the label is the outcome. In order to randomly introduce
missing values in this dataset, I'm going to use the np.random.randint function. This
mask array will have the same shape as our features data frame and it'll contain
random numbers between 0 and 100, but I want this to be a Boolean mask. When you
call astype with np.bool, all non-zero values will be true and 0 values will be marked as
false. I'll then perform a logical_not on this mask to get the inverse so that 0 values will
be true and non-zero values will be false. This will give me a Boolean mask with random
true/false values. So 1 in 100 values roughly will be true. I'll now use this mask to
contaminate my diabetes feature data. All values, all cell values where the mask is true,
I'm going to set to np.nan. Let's sample our existing features and you can see that
we've introduced a number of NaN values here. This is hypothetical missing data
which we'll process using the simple imputer in a scikit-learn pipeline. Let's set up the
imports for the libraries and functions that we need. From sklearn pipeline, we'll import
the make_pipeline function, this is what we'll use to create a pipeline. The
processed our data by imputing missing values, we'll fit a decision tree classifier on our
diabetes data set. Before we train our machine learning model, let's use train_test_split
to split our data into training data that is 80% of the data we'll use for training and 20%
to evaluate our resulting model. The first processing that we'll do in our pipeline is to
transform our data to impute missing values, and for this, we'll instantiate a
ColumnTransformer. The ColumnTransformer allows you to specify a sequence of
transformations that you can apply on your data. Here I have just one transformation in
my sequence. The only transformation here is to use the simple imputer to fill in missing
values and I'll go with the simplest strategy here. I'll replace every missing value with the
mean value of that feature. The transformer also needs to know what columns you
want to apply this transformation to. I want to apply this missing value imputation to all
of my existing features starting from column 0 and up to and including column 7. I'll
now create a scikit-learn pipeline to transform our data to impute missing values and
then fit a DecisionTreeClassifier on this transformed data. Call make_pipeline and pass in the
transformer and your classifier. Call clf.fit on the training data in order to transform and
train our classification model. Once the model has been trained, call clf.score to calculate
the accuracy of this model and you can see it's about 79%. Seventy-nine percent of this
model's predictions were correct. Let's move on and let's see how this model performs
on the test data. Call predict on x_test and let's calculate the accuracy_score on the
test data as well. And the accuracy of this pipeline and model on our test data is similar,
around 79%.
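The whole demo, contaminating the data with a random mask and then imputing and classifying inside a pipeline, can be sketched as follows. The features and label are synthetic stand-ins for diabetes_processed.csv, the Python built-in bool is used for the mask type, and the max_depth setting on the tree is an assumption added here rather than something stated in the transcript, so the accuracies will not match the 79% quoted above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the processed diabetes data: 8 numeric features, binary label.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.normal(size=(500, 8)),
                        columns=[f"f{i}" for i in range(8)])
label = (features["f0"] + features["f1"] > 0).astype(int)

# Random integers in [0, 100): non-zero becomes True, then logical_not inverts it,
# so only the cells that drew a 0 (roughly 1 in 100) end up True.
np.random.seed(1)
mask = np.random.randint(0, 100, size=features.shape).astype(bool)
mask = np.logical_not(mask)
features = features.mask(mask)   # cells where the mask is True become NaN

x_train, x_test, y_train, y_test = train_test_split(
    features, label, test_size=0.2, random_state=0
)

# A single transformation: mean-impute the columns at positions 0 through 7.
transformer = ColumnTransformer(
    transformers=[("features", SimpleImputer(strategy="mean"), list(range(8)))]
)

# max_depth is an assumption added to keep the tree from memorising the training set.
clf = make_pipeline(transformer, DecisionTreeClassifier(max_depth=4, random_state=0))
clf.fit(x_train, y_train)

print("training accuracy:", clf.score(x_train, y_train))
print("test accuracy:", accuracy_score(y_test, clf.predict(x_test)))
```

Because the imputer lives inside the pipeline, the column means are learned from the training rows only and then reused when predicting on the test rows.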
Module Summary
And with this demo, we come to the very end of this module where we got hands on
with cleaning and transformation techniques applied to our data. We started off by
reading in our data using a Pandas data frame and exploring it using different
visualization techniques. This allowed us to get a feel for our data and understand the
relationships that exist. We then used this data to build a very simple regression model.
We've discussed in detail in an earlier module that data in the real world is not really
clean and ready for us to use. We then saw how we can deal with missing values in our
We also saw how we could perform multivariate imputation of missing values using an
certain column. We then put all of our data transformation and model training steps
together into a single machine learning pipeline. Our pipeline performed feature
imputation to fill in missing values and then fit a classification model. In the next module,
we'll talk of another step in data preparation, feature selection, to reduce the
complexity in your data.
Transforming Continuous and Categorical Data
Module Overview
Hi, and welcome to this module on transforming continuous and categorical data. The
data that you'll work with when you're building and training your machine learning
models can be divided into two broad categories, categorical data versus continuous
data. We'll first study and understand the differences between the two. Categorical
data, which is made up of discrete values or categories can be further divided into two
subcategories, nominal data and ordinal data. We'll understand the differences
between these two as well. For numeric continuous data, we'll discuss how
machine learning models don't really work well
when features are at different scales. We'll see the different kinds of techniques that you can
use to scale numeric features for data analysis. We'll then talk about how categorical
data needs to be numerically encoded before it can be used in ML models. And we'll
talk about two specific techniques here, label encoding and one-hot encoding. And
finally, we'll see how you can use discretization to convert continuous data to
categorical values. When you're building machine learning models, you'll encounter two
types of data, categorical data, which is made up of categories, male, female, month of
the year, day of the week, all of these are examples of categorical data. Or if your data
can take on continuous numeric values that is referred to as numerical continuous data.
Any other data that you'll work with, text data, image data, must be converted into one of these
forms before machine learning models can work with them. Before we dive into more
detail, let's quickly compare and contrast numeric continuous data versus categorical
data. Numeric data can take on any value from an infinite range. Categorical data can
only pick values from a discrete set. Categories are a finite set of permissible values.
When you're working with numeric data, machine learning models that can be used to
predict continuous values are regression models. When you're predicting categorical
values of the output, those are classification models. Numeric values can always be
sorted based on magnitude, they have an inherent ordering or ranking. Categories may
or may not be sortable.
Numeric Data
In this clip, we'll discuss some of the data transformation techniques that we would use
when we're working with numeric data in machine learning. We've already discussed
the two broad categories of data that you'll encounter when you're working in ML,
numeric data and categorical data. Numeric data can be further divided into two
categories, ratio scale data and interval scale data. And categorical data also has two
subcategories, ordinal data and nominal data. Before we discuss ratio scale and interval
scale data further, let's talk about two other ways to think of numeric data. Numeric
data can be discrete in that you cannot measure it, but you can count it. Or you can
have numeric data be continuous and this is the most common example of numeric
data, data that cannot be counted, but can be measured. Examples of discrete
numeric data that can be counted, but not measured, are the number of visitors to a
website in an hour and the number of heads that show up when a coin is flipped 100 times.
You can count these numbers and express them and compare them, but the numbers
are discrete. On the other hand, continuous data is the more commonly used form of
numeric data, the height of an individual, home prices, stock prices, all of this data is
continuous. When you're working with numeric data, whether it's discrete data,
continuous data, ratio scale, or interval scale, the data always has an intrinsic order.
can be expressed as a ratio with respect to one. The number seven, for example, is
seven to the ratio one. When you're working with continuous data, all arithmetic
operations apply. You can perform addition, subtraction, multiplication, division, all of
these operations are meaningful on ratio scale numeric data. When you have two bits
of data expressed in the ratio scale, then magnitudes can be compared using arithmetic
chart that has a meaningful value for 0. A weight of 0 lbs is equal to no weight at all.
What we referred to as discrete numeric data where you can count, but not measure,
that is often interval scale data, ordered units that have the same difference or interval
per hour, per day, per number of coin flips. As you can see, data is still numeric here for
interval scale, but operations such as multiplication and division no longer make sense,
and also, there is no meaningful 0 point. Let's take the example of temperature
hour makes sense, but 0 Fahrenheit does not. So you can't always rely on interval scale
data to have a valid value for 0, but what's common to all kinds of numeric data that
you work with in machine learning is that they draw from an unrestricted range of
continuous values. And when you're working with numeric data, you can use some of
these statistics to describe that data: mean, mode, standard deviation, correlation,
covariance, all of these summary statistics work. Now machine learning algorithms
typically do not work well with data that is expressed at different scales. This is
especially true when you're working with neural networks. So if you have one feature
that is in a range 10,000 to 100,000 and another feature in the range 2 to 3, that's a
problem for your ML model. When you want to feed numeric data to your machine
learning models, it's common practice to apply feature scaling techniques to express all
of your features so that they are in the same scale and there are two approaches to
feature scaling, one is just scaling where you specify a min/max value and all of your
data is expressed within those values. Another commonly used technique for feature
scaling is to standardize your data. Standardization involves centering your data around
the mean and expressing all of your values using standard deviations. All of the records
have a number of features. These are the columns in your data. You will shift and
rescale all of the numeric values so all features have the same scale, that is all feature
values lie within the same minimum and maximum. That is feature scaling.
Standardization centers your data around the mean and divides each value by the
standard deviation so all of your features have 0 mean and unit variance or unit
standard deviation. This involves subtracting the average value of a column from each
value in that column and dividing the result by the standard deviation. The resulting
values are expressed in terms of number of standard deviations from the mean or z
scores as they are called.
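To make the two formulas above concrete, here is a minimal NumPy sketch. The feature values are made up for illustration and are not taken from the course data; the calculation itself is just min/max rescaling and the z-score as described above.

```python
import numpy as np

# Hypothetical feature matrix: rows are records, columns are features
# on very different scales (values invented for illustration).
X = np.array([
    [10_000.0, 2.0],
    [55_000.0, 2.5],
    [100_000.0, 3.0],
])

# Min/max rescaling: shift by the column minimum, divide by the column range,
# so every column ends up in the range 0 to 1.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_rescaled = (X - col_min) / (col_max - col_min)

# Standardization: subtract the column mean, then divide the result by the
# column standard deviation, giving z-scores with mean 0 and unit variance.
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)  # population standard deviation (ddof=0)
X_standardized = (X - col_mean) / col_std

print(X_rescaled)
print(X_standardized)
```

Both features end up comparable after either transformation, even though the raw columns differ by several orders of magnitude.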
learning, and in this demo, we'll see a few different techniques that you can use for
feature scaling and transformation. We'll start writing our code in a brand-new
notebook. Set up the import statements for the data science libraries that you'll use. I'm
going to call np.set_printoptions with precision 3. This will display floating point values
in NumPy arrays rounded off to three places after the decimal. The data set that we'll
use to perform scaling and transformation is the diabetes_processed.csv file. This is the
data that we cleaned up earlier. This is the data set that we'll use to fit a classification
model. The features for our classifier will be all columns except outcome. Set up
the features and the target like you see here on screen. If you take a look at the shape
of our data, you'll see that we have 8 feature values that we'll work with. Let's invoke
the describe function in order to get a statistical overview of our numeric data. This
allows us to see that the different individual features in our data have very different
ranges. The minimum values are very different, some are positive, some are negative,
some are 0, and the maximum values are also different. When you're building and
training machine learning models, they work better and are more robust if all of the
features are in the same range and scale. One way to scale and transform your data is
Here I'm going to scale all of my numeric features to be in the range 0 to 1. In order to
rescale all of your numeric features, you can call fit_transform on your features data
frame and you'll get the rescaled features as a result. The rescaled features are in the form
of a NumPy array and let's sample this NumPy array to see what the resulting data
looks like. You can see that all of our original feature values are now just decimals
between 0 and 1. I'm going to put these rescaled features into a Pandas
data frame. The columns of this data frame are the same columns that existed in the
original data. If you run the describe function on these rescaled features, this will allow
you to see that each numeric feature has been rescaled to be in the range 0 to 1,
minimum value of 0, maximum value of 1. You can visualize these rescaled features in
the form of a boxplot as well, and this boxplot shows you that the minimum and
maximum values are 0 and 1 respectively. Rescaling features using the min/max scaler is
very sensitive to the presence of outliers in your data set so that is something that you
need to watch out for. Another transformation technique that is commonly applied to
and divide by the standard deviation. Standardization centers all of your numeric
features around a mean of 0 and expresses each value in terms of the standard
deviations, multiples of the standard deviation. After having standardized our data, let's
take a look at a sample of the standardized features. You can see that there are
positive, as well as negative values. Negative values are values that were below the
mean, positive are values above the mean. Let's convert these standardized features in
the form of a data frame called standardized_features_df and let's run the describe
function. Observe that every numeric feature now has a mean of 0 and a standard
deviation of 1. Standardization is one of the most commonly used techniques to
prepare data for machine learning. This allows you to build robust models with features
that are comparable to one another with more or less the same scale. All values are
expressed in terms of standard deviations. Let's view the standardized features in the
form of a boxplot. You can see that the mean of every feature is around 0. The center
line of the box in the boxplot represents the median value and you can see that the
medians are very close to 0 as well. Medians tend to be close to the mean, unless your
numeric data is normalization, and for this, we have scikit-learn's normalizer estimator.
Normalization involves converting vectors to their unit norm representations and there
are different kinds of unit norms, such as the l1 norm. If you want to normalize your
feature vectors using the l1 norm, instantiate the normalizer by specifying norm is equal
to l1 and call fit_transform on your features. Every record or row in your data set is a
feature vector and normalization is a technique that converts this row or record to
have unit magnitude where there are different kinds of magnitude. Here, we have
normalized our feature vectors to have unit l1 magnitude. I'm going to convert our
normalized features to a data frame format and let's view one record in this data frame,
the one at index 0. Here are the normalized_features, and if you sum up the
absolute values of these normalized features, you'll find that they'll be equal to one.
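As a rough sketch of the l1 normalization step just described, the snippet below uses scikit-learn's Normalizer on a small made-up array rather than the diabetes data. The variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical feature vectors: each row is one record.
X = np.array([
    [1.0, -2.0, 2.0, 5.0],
    [4.0, 0.0, 4.0, 8.0],
])

# Normalize every row so that the sum of its absolute values is 1.
l1_normalizer = Normalizer(norm="l1")
X_l1 = l1_normalizer.fit_transform(X)

# Check the property on each row.
row_abs_sums = np.abs(X_l1).sum(axis=1)
print(X_l1)
print(row_abs_sums)
```

Note that normalization works row by row, unlike min/max rescaling and standardization, which work column by column. The same estimator also accepts norm="l2" and norm="max".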
When you normalize your feature vectors using the l1 norm, the sum of the absolute
values of your features for a specific feature vector will be equal to one. Using the
normalizer estimator object in scikit-learn, you can also normalize your features using
the l2 norm. Simply specify norm is equal to l2 when you instantiate the estimator, call
fit_transform on your features, and let's convert our feature vector representations to a
data frame. Every feature vector or record in your data frame will have unit l2 norm,
that is the sum of the squares of the individual features will be equal to 1. So let's
calculate the squares of the individual features, that is the result here, and if you
calculate the sum of the squares of the features, simply invoke the sum function at the
end, you can see that that is equal to one as well for l2 normalize feature vectors. The
third normalization technique that you can apply to your feature vector is the max
norm. Here the maximum value in a particular vector is represented as one and other
values or other features are represented in terms of this maximum. Call fit_transform on
your data to apply max normalization to your input features and let's represent our
features in the form of a data frame. If you take a look at the max-normalized features,
you can see that one feature in every record will be equal to one, that is the maximum
feature value. The remaining values are expressed in terms of this max. Now it's also
possible that you want to discretize your continuous numeric features to be in the
categorical form and one way to do this is by using the binarizer. The scikit-learn
binarizer uses a threshold value to convert features to 1, 0 values or binary form. In this
example, we apply the binarizer to the pregnancy column to specify the number of
pregnancies above the mean and below the mean. The threshold here is the mean
value for pregnancy. Values above the average will be represented by 1, values at or
below the average by 0. Call fit_transform on the values in the pregnancies column to get the
binarized result. Let's take a look at a sample of these binarized features and you can
see that the values are 1 or 0. A value of 1 indicates that that individual has had
pregnancies above the average in the data set, a value of 0 is below the average. I'm
now going to use a for loop to apply this binarizer to every numeric feature that exists
in my data, that is all of the input features. Just to keep things simple here, I'm going to
use the average value for each feature as the threshold for the binarizer. Instantiate a
binarizer estimator with the average as the threshold, call fit and then transform on the
specific feature, and append this feature to the binarized_features array. Every feature
in our input data set is now represented using binary feature values. Values above the
mean are represented using one, values below the mean for a feature are represented
using zeroes. Now that we've transformed our data using a number of different
techniques, let's build a classification model using these different kinds of transformed
features. We'll build a simple logistic regression model for classification and calculate the
accuracy score for each model using this build_model helper function. It takes in X
values, that is our transformed features, the corresponding Y labels, and test_frac, the
fraction of the data that we want to use to evaluate our model. We split the X and Y
values that are passed into this helper function into training data and test data, fit a
LogisticRegression model on our training data, and use this model for prediction. And
we'll evaluate this model by calculating the accuracy score on the test data. The first
model that we'll build and train will use our features scaled using the min/max scaler.
This is where we rescale all of our input features to be in the range 0 to 1. And the
accuracy of this model was 75%. This is the model on rescaled_features. Let's build and
train a model now using standardized features, features that are centered around the
mean and expressed in terms of standard deviations from the mean. And on our data,
standardized_features produces a model with an accuracy of 79%, an improvement
over the previous model. Let's now build a classification model using
normalized_features and the accuracy of this model is just around 70%.
can try building a model using other kinds of normalized features as well, such as l2 and
max and see what the result is. And finally, we'll build our last model using
binarized_features. Binarized_features are just 0, 1 values for values above the mean
and below the mean and this model also does pretty well with an accuracy of 72%.
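The comparison walked through in this demo can be sketched roughly as follows. This is not the notebook from the course: the data is a synthetic stand-in generated with make_classification, the column names are invented, and the body of build_model is an assumption based only on the description above. The accuracies it prints will not match the figures quoted in the demo.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import Binarizer, MinMaxScaler, Normalizer, StandardScaler


def build_model(X, Y, test_frac, random_state=0):
    """Fit logistic regression on a train split, return test accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        X, Y, test_size=test_frac, random_state=random_state
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    return accuracy_score(y_test, y_pred)


# Synthetic stand-in for the 8-feature diabetes data (assumption).
X_raw, Y = make_classification(n_samples=500, n_features=8, random_state=0)
features = pd.DataFrame(X_raw, columns=[f"feature_{i}" for i in range(8)])

# Binarize each feature against its own mean, one column at a time.
binarized_columns = []
for name in features.columns:
    binarizer = Binarizer(threshold=float(features[name].mean()))
    binarized_columns.append(binarizer.fit_transform(features[[name]]))
binarized_features = np.hstack(binarized_columns)

transformed = {
    "rescaled": MinMaxScaler(feature_range=(0, 1)).fit_transform(features),
    "standardized": StandardScaler().fit_transform(features),
    "l1_normalized": Normalizer(norm="l1").fit_transform(features),
    "binarized": binarized_features,
}

# One logistic regression model per transformation, same split each time.
scores = {name: build_model(X, Y, test_frac=0.2) for name, X in transformed.items()}
for name, score in scores.items():
    print(f"{name}: accuracy = {score:.3f}")
```

Like the demo, this sketch transforms the full data set before splitting it. Fitting the scaler on the training split only and then calling transform on the test split keeps information from the test data out of training.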
Categorical Data
When you feed data to train your machine learning models, another kind of data that
you'll encounter is categorical data. So we've spoken of data being either numeric or
categorical. We've understood numeric data. Let's focus our attention in this clip on
categorical data. Categorical data can either be ordinal or nominal. Categorical data is
discrete in nature. It can only draw from a specific, very restricted set of values. With
categorical data, it can only hold certain values. Values are not drawn from an infinite
set. Categorical data is typically used to express categories or classes, so it's not
meaningful to calculate mean, standard deviation, correlation. Categorical data can be
true or false, boy or girl, spam or ham, the letters A, B, C all the way through to Z. The
kinds of analysis that you'll perform are tabulation and counting frequencies, such as
with histograms, or percentage representations, such as with pie charts. Now within
categorical discrete data, your data might be ordinal or nominal. Ordinal data is
categorical, but has some kind of inherent order. There is a ranking or ordering to your
data. Examples of ordinal categorical data are months of the year, ratings on a scale of
1 to 5. You know that June comes after March. You know that December comes after
January. You know that a five-star rating is better than a three-star rating. There is an
inherent ordering or a ranking to the categories, but the differences between the
categories are not necessarily meaningful. For example, if you have height categories,
the difference between extremely tall and tall may not be the same as the difference
between medium and short. Or if you consider Michelin star ratings for restaurants, the
differences in quality between three, two, one and no Michelin stars for a restaurant are
not uniform. A restaurant which has one Michelin star may be any number
of times better than a restaurant with no Michelin stars. If your data is in the form of
categories, but the categories have no inherent rank or order, that is nominal data.
Nominal data is as far away from numeric data as possible. Categories, such as true or
false, boy or girl, spam or ham, cat, dog, bird, fish, there is no ordering to these.
Nominal data cannot be ordered. How do you remember this? Ordinal data can at
least be ordered, nominal data are simply names. I gave you a bunch of examples for
nominal data. Another one is here, brand names of cars, Ford and Honda, there is no
inherent ordering between these. Your machine learning models can glean a lot of
information from categorical data, but machine learning algorithms understand just
numbers, which means that your categorical data has to be numerically encoded
categorical data in numeric form. Let's say you're using categorical data to represent
city information and the cities are New York, London, Paris, and Bangalore. Categorical
data often involve classes or categories represented in the string format. If your data
first of these is label encoding where you'll have a unique numeric id for each category.
This involves transforming your original categorical column to a single new column. A
single column is sufficient to represent your data using label encoding. Another
categorical value and a value of 1 or 0 will indicate the presence or absence of that
category. Let's take a look at how to label encode your categories, where we'll use
a unique numeric id for each categorical value. Let's say you have the city New York,
you'll encode it using a special numeric identifier, X0. The category is W0 and the
numeric encoding is X0 and these numeric identifiers can be generated at random if
you want to. For example, the number 32 could represent New York and you'll have
another number for each city. Fifty-five could represent London, you'd have another
number, let's say it's 1056 to represent Bangalore. Label encoding involves using just
one new column and just converting your categories to unique numbers. Let's move on
and discuss one-hot encoding because this is by far the most popular way to
numerically encode nominal data. Nominal data is categorical data with no inherent
ordering. One-hot encoding involves converting all of your categorical values to
columns. So we have four categories here. We'll create four new columns, one
corresponding to each city. We'll then use a value of 0 or 1 to indicate the presence or
absence of a category. When you one-hot encode your categorical feature, every
unique categorical value has its own separate column, so you need as many columns as
there are categories in your data. Let's say the categories that we are working with
correspond to the four cities that we saw earlier. For each categorical value in that
feature, I set up a new column. I have four columns corresponding to four cities. I need
to use all four columns to represent each city in my categorical value. Let's say I want
to represent New York, it'll have a value of 1 corresponding to the column for New York.
All other columns will have the value 0. If you want to represent London, you'll have a 1
corresponding to the London column, other columns will have value 0. If you want to
represent the category Paris, you'll have a 1 corresponding to the Paris column, other
columns will have value 0. And finally, Bangalore will have 1 corresponding to the
Bangalore column. Other columns have zeroes as you see here on screen. Now that
we've understood how label encoding and one-hot encoding work, let's compare and
contrast both of these. Label encoding requires just a single column to represent
categories. With one-hot encoding, you need as many columns as there are categories
in the data. With label encoding, every category is represented using a unique numeric
identifier. With one-hot encoding, each category is a row with a single 1, the remaining
columns are zeros. It's pretty clear from our simple example here that label encoding is
far more concise than one-hot encoding. One-hot encoding is verbose. If you were to
add a new category, that would involve adding an entire new column. However, one-
hot encoding is often preferred over label encoding, especially for nominal data. The
presence of numeric identifiers gives people the illusion of sortability. When you feed
label encoded values to machine learning models, they might pick up on the pattern
that 10 is better than 1 in some way when that may not be true. One-hot encoded
vectors are clearly not sortable. When you're working with categorical data, it's good
practice to use label encoding for ordinal categorical data where there is an inherent
ranking or ordering in your categories. One-hot encoding can be used for both nominal,
as well as ordinal categorical data.
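The contrast between the two encodings can be sketched with plain pandas on the four cities from this clip. The extra repeated row and the variable names are illustrative assumptions. The numeric ids are assigned in order of first appearance, which is arbitrary, just as the clip says the ids can be.

```python
import pandas as pd

# The four cities from the example, with London repeated once.
cities = pd.Series(["New York", "London", "Paris", "Bangalore", "London"], name="city")

# Label encoding: one arbitrary integer id per category, in a single column.
city_to_id = {city: idx for idx, city in enumerate(cities.unique())}
label_encoded = cities.map(city_to_id)

# One-hot encoding: one 0/1 column per category, exactly one 1 in every row.
one_hot = pd.DataFrame({city: (cities == city).astype(int) for city in city_to_id})

print(label_encoded.tolist())
print(one_hot)
```

The label-encoded column invites comparisons such as "Bangalore is greater than New York", which mean nothing for nominal data. The one-hot columns carry no such ordering.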
and we'll use a slightly different data set here, one which has many categorical values.
This is the tent purchases dataset and the original source of this data is at this URL that
you see here. If you take a look at a sample of this data, you can see that it has a
number of categorical features. Age is the only numeric feature here. Gender,
marital_status, and profession are all three categorical features. If you're using this data
set to build a classification model, you'll use this customer information to predict
whether the customer purchases a tent or not. Now this is a rather large data set. If
you take a look at the shape of the data frame, you can see that it has around 60,000
records. Let's run describe on this data frame. The only numeric value here is age. The
average age of the customer here is around 34 years and the standard deviation is 10
years. Let's move on and let's visualize this data before we proceed further. The first
thing that we'll view here is a bar graph of whether a customer purchased a tent or
not. You can see from this bar graph that over 50,000 customers did not purchase
tents and a few, under 10,000, did. Let's explore this data further. Let's view the
distribution of our customers based on their marital_status using a bar graph once
again. You can see that most of our customers are married, many are single, and a few
haven't specified their marital status. Another point that might be important is how our
customers are distributed based on their gender specification. Our bar graph here
shows us that the customers are evenly distributed between males and females, over
30,000 males and a little under 30,000 females. One last categorization we can use
to understand our customers is to split them based on their profession. You can see
that most of our customers have categorized their profession as other. We don't have
additional information. Some are in sales, a few are executives, few are retired.
Categorical values in the form of strings are not understood by machine learning
models, which is why we need to encode them in numeric form. We'll now use the label
encoder to encode the gender column, which contains two unique values, male and female,
represented using M and F. The label encoder estimator object in scikit-learn will
assign a unique integer id to every category present in our data. Here, we
fit the label encoder on the gender data represented by M and F and this will generate
a unique id for males and a unique identifier for females. The label encoder generates
numeric identifiers starting with 0 and we'll use this label encoder to transform the
gender column in our data set. After we fit the label encoder on our data, the classes_
property of the label encoder will show you the categories that have been encoded.
F will be represented using the numeric integer 0 and M will be represented using 1.
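A minimal sketch of this label encoding step, using a short made-up list of gender values in place of the real column:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the gender column.
gender = ["M", "F", "F", "M", "F"]

label_encoder = LabelEncoder()
label_encoder.fit(gender)

# classes_ lists the categories in sorted order; the position is the id.
print(label_encoder.classes_)  # ['F' 'M']

encoded = label_encoder.transform(gender)
print(encoded)  # [1 0 0 1 0]

# inverse_transform maps the ids back to the original categories.
decoded = label_encoder.inverse_transform(encoded)
```

Because the ids follow the sorted order of the categories, F gets 0 and M gets 1.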
We've label encoded our gender columns and you can see this by sampling our data
set. You can see that gender now has numeric values, 0 for females, 1 for males. When
a feature that you're using in your ML model has more than two categories and you
label encode your features, scikit-learn estimators will often assume that the numbers
that you assigned as labels have some implicit ordering so you need to be careful when
you're using the label encoder with scikit-learn estimators. This is generally not a
problem with binary categories represented using just 0 and 1, but let's say you had
categories from 0 through 9. Scikit-learn estimators will assume that 9 is larger than 0.
Another categorical feature that we have in our data set is the MARITAL_STATUS. If
you take a look at a sample, you can see that the marital status can be single or
married. There was also unspecified if you remember earlier. When there is no implicit
ordering in the categories for your data, for example, married is not better than single
or vice versa, you might want to represent your categories using one-hot encoding,
and for this, you can use the scikit-learn one-hot encoder estimator object. Instead of
using a separate array to represent your categories, you can fit on the marital status
column itself. Once you fit on the marital status, you can take a look at the categories
property to see what categories have been encoded. Married, single, and unspecified
are the three categories for marital status. Let's get MARITAL_STATUS in one-hot
encoded form by calling transform on the MARITAL_STATUS column. The one-hot
encoder expects the input features to be in two dimensions, which is why we need to
reshape the values in that single column to be two dimensional. Let's take a look at the
labels here and you can see that each category is represented using one-hot encoding.
The categories property showed us that the categories were in the order married,
single, and unspecified, which is why the first column represents the married status, the
second column, the single status, and the third column, the unspecified status. The one-
hot encoded values are in the form of a NumPy array and I'm now going to assign
these as columns in our data frame. I'll have three columns here for status married,
status single, and status unspecified. And from our one_hot_labels array, I extract the
columns at index 0, 1 and 2, which represent these marital statuses. Let's go ahead and
take a look at the resulting data frame. If a customer is married, it's represented using a
value of 1 under MARITAL_STATUS_Married, and all other columns will have the value 0.
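The one-hot encoding steps above might look roughly like this. The four-row data frame is a made-up stand-in for the tent purchases data, and the names of the second and third new columns are assumptions modeled on MARITAL_STATUS_Married.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the marital status column.
df = pd.DataFrame({"MARITAL_STATUS": ["Married", "Single", "Unspecified", "Married"]})

# The encoder expects 2D input, so reshape the single column to (n_rows, 1).
status_2d = df["MARITAL_STATUS"].values.reshape(-1, 1)

one_hot_encoder = OneHotEncoder()
one_hot_encoder.fit(status_2d)
print(one_hot_encoder.categories_)  # encoded categories, in sorted order

# transform returns a sparse matrix by default; convert it to a NumPy array.
one_hot_labels = one_hot_encoder.transform(status_2d).toarray()

# Column order follows categories_: Married, Single, Unspecified.
df["MARITAL_STATUS_Married"] = one_hot_labels[:, 0]
df["MARITAL_STATUS_Single"] = one_hot_labels[:, 1]
df["MARITAL_STATUS_Unspecified"] = one_hot_labels[:, 2]

# The original string column is no longer needed.
df = df.drop("MARITAL_STATUS", axis=1)
print(df)
```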
We are now ready to drop the original categorical column for marital status from our
original data frame and use this one-hot encoded column. Let's take a look at the
resulting data frame, which has marital status, as well as the gender columns in its
numeric encoded format. With one-hot encoding, the number of columns that you
need to represent a particular feature is equal to the number of categories, unique
categories in that feature. We had three unique categories for marital status, which is
why we need three columns for one-hot encoding. If you're working with your data in
the form of a pandas data frame, an easier way to one-hot encode your categorical
values is to use the pd.get_dummies function. You simply pass in the original data
frame and specify the column that you want one-hot encoded. Here, we one-hot
encoded the profession column. pd.get_dummies will very conveniently get rid of the
original profession column and replace it with its one-hot encoded columns. When you
have your data in a data frame object, use pd.get_dummies. Let's see another example
of how easy it is to use. I'm going to reread the original data with all of the original
categories into a new data frame gosales. Here is the original data that we started off
with. I'm going to one-hot encode all of the categorical values in this data set. I simply
call pd.get_dummies and pass in the entire data frame without specifying any columns,
and you'll find from the result that all of the values, which are categorical in nature, have
been converted to their one-hot encoded form. Observe that Pandas has also
removed the original categorical column and replaced it with the one-hot encoded
columns.
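Both uses of pd.get_dummies described above can be sketched on a tiny made-up data frame. The column names and values here are assumptions, not the real gosales data.

```python
import pandas as pd

# Hypothetical three-row stand-in for the gosales data frame.
gosales = pd.DataFrame({
    "AGE": [34, 27, 45],
    "GENDER": ["M", "F", "F"],
    "PROFESSION": ["Sales", "Other", "Retired"],
})

# One-hot encode a single named column; the original column is replaced.
profession_encoded = pd.get_dummies(gosales, columns=["PROFESSION"])
print(profession_encoded.columns.tolist())

# With no columns specified, every string-valued column is one-hot encoded
# and the numeric AGE column is left as it is.
all_encoded = pd.get_dummies(gosales)
print(all_encoded.columns.tolist())
```

The new column names are built from the original column name and the category value, so they stay readable without any extra bookkeeping.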
process is called discretization and that's exactly what we'll do here in this demo.
Discretization or bucketization of your data is often performed in order to simplify the
data that you're working with. Set up the import statements as before and let's
generate an array here using np.array. This array contains continuous numeric integers
and I'm going to categorize or discretize this array into four categories using pd.cut.
This is the Pandas cut function that allows you to bucketize or bin your data. Here are
the categories that have been generated from this simple array here. The cut function
has examined our numeric continuous values and generated four different intervals and
each of these intervals represent a bucket or a category, for example, the number -11
belongs to this category here, the first in our list. Observe the parentheses on the left
and right side of the interval, on the left, we have the regular parentheses, and on the
right, we have the square bracket. What does this indicate? Let's take a look at the
categories attribute of the categories that we generated. Every interval that
pandas.cut generates for a category is closed on the right, as indicated by the
square bracket. So for the first category, which covers the range -11.025 to -4.75,
we'll include all points up to and including -4.75 on the right side. A closed interval
includes the points at its limit and an open interval does not. The categories.codes property
will give you numeric identifiers for each bucket. You can see that the first value -7
belongs to the bucket 0. The last value 8 belongs to the bucket 3 specified by the
interval 7.75 to 14. There is an inherent numeric ordering to our categories so
categories.ordered is equal to True. If you want the pandas.cut function to return bin
information, that is, the bin intervals used for your discretization, simply pass in retbins equal
to True. This will return an additional bit of information, an array with the bin edges. The
first bin edge is at -11.025, then at -4.75, then at 1.5, and so on. Let's instantiate
another simple array of continuous values. These are the marks scored by students and you can
see that the marks range from 0 to about 100. I want to discretize these marks using
pandas.cut and I want to assign specific categories to the resulting bins. I'll discretize
these marks into four bins with retbins set to True, and here are the labels that I want assigned to
these four bins. I want to categorize these marks as poor, average, good, and excellent.
Let's take a look at the resulting categories and you'll find that pandas has very
helpfully assigned the right labels to the right marks. Twenty and 30 have been
categorized in the poor bin, whereas 99 and 80 have been categorized as excellent.
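Here is a small sketch of pandas.cut with labels and retbins. The marks 20, 30, 16, 99 and 80 are mentioned in this demo; the remaining values in the list are invented to fill it out, so the actual array used in the course may differ.

```python
import pandas as pd

# Marks scored by students (partly hypothetical, see the note above).
marks = [20, 30, 16, 99, 80, 55, 72, 45]

# Four equal-width bins, each given a meaningful label.
# retbins=True also returns the array of bin edges.
categories, bin_edges = pd.cut(
    marks,
    bins=4,
    labels=["poor", "average", "good", "excellent"],
    retbins=True,
)

print(list(categories))    # the label assigned to each mark
print(categories.codes)    # numeric id of each bin, starting at 0
print(categories.ordered)  # the labels keep the order of the bins
print(bin_edges)
```

The bins are equal-width slices of the observed range of the marks, not of the 0 to 100 scale, so adding or removing an extreme mark shifts every edge.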
Thus using labels along with pandas.cut, you can get meaningful categories for your
integer values, which is why encode is set to ordinal here. This estimator supports other
encoding techniques, such as one-hot encoding for your bin values as well. The
KBinsDiscretizer allows you to specify different strategies to discretize your numeric
values. When you say strategy is equal to uniform, all bins in each feature will have
identical widths, their intervals will be the same. In order to encode our marks data, we
call fit on our 2D marks list and after fit, we can call transform on the same marks list
once again to get the discretized or categorized values. You can see that each bin is
represented using numeric identifiers. Let's compare these numeric ids with the original
marks array. You can see that 20, 30, and 16 have been categorized in the bucket
represented by 0. If you want to know the intervals that generated these bins, you can
access the bin_edges property. This will give you the edges of each bin and you can
see that the interval range for each bin is identical. The interval for each bin here is
20.75. You can calculate the difference between corresponding bin edges and get this
information. Let's consider another manually generated data set. This data has three
columns corresponding to three different features. We have just four records in this
data. I'm going to use the KBinsDiscretizer to bucket all three features into four
separate bins. I use the ordinal encoding as before. The bucketing strategy that I want
to use though is quantile. Here, all bins in each feature will have the same number of
points. The bin intervals will be different to accommodate the same number of points in
each feature. Let's use this KBinsDiscretizer and call fit on our X data with three
features, and once we fit the encoder, let's transform our X data. Here is the transformed
data encoded into buckets. Buckets are different for each feature. Because we used
ordinal encoding, every bucket is represented using a numeric identifier. Highlighted
here are the bucketed forms of feature 1 and here are the buckets for the third feature.
We have four bins and four records and you can see that for each feature, there's
exactly one data point in each bin. If you take a look at the bin_edges generated using
quantile discretization, you can see that the bin intervals are different. The bin intervals
change to accommodate exactly the same number of data points in each bin for each
mean or the average of the bin edges. The original point was -21, which was bucketed
in the interval -21 to -15. The average of -21 and -15 is -18. So the inverse transform gave
us -18 as the data point and you can see that this holds true for the other inverse
transform data points as well. The average of the bin_edges gives us the data point.
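The quantile discretization and its inverse transform can be sketched as below. The array is a made-up example with three features and four records, so it is not necessarily the data shown in the demo. The exact bin edges depend on the quantile method used by the installed scikit-learn version, so the sketch only checks the general property that inverse_transform returns the midpoint of each bin.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical data: four records, three features.
X = np.array([
    [-21.0, 1.0, -4.0],
    [-13.0, 2.0, -3.0],
    [0.0, 3.0, -2.0],
    [10.0, 4.0, -1.0],
])

# Four bins per feature, ordinal ids, edges chosen so that bins hold
# (roughly) the same number of points.
discretizer = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
X_binned = discretizer.fit_transform(X)
print(X_binned)

# bin_edges_ holds one array of edges per feature.
first_feature_edges = discretizer.bin_edges_[0]
print(first_feature_edges)

# inverse_transform cannot recover the original values; it returns the
# center of the bin that each value fell into.
X_restored = discretizer.inverse_transform(X_binned)
bin_centers = (first_feature_edges[:-1] + first_feature_edges[1:]) / 2
print(X_restored[:, 0])
print(bin_centers)
```

Discretization loses information: every value in a bin maps back to the same midpoint.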
work with the same automobile mileage data set that we've seen earlier in this course.
Here are the columns of data present in our data set. I'm going to go ahead and split
them into X variables and Y labels. I'm going to perform just simple regression with a
single feature and the feature that I've selected here is the horsepower of the
automobiles. As you can see here, the horsepower data is in the form of continuous
numeric values. You can visualize this data using matplotlib and see what its relationship
is with the mileage of a car. Horsepower versus mileage shows a negative relationship.
It seems from here that cars with a higher horsepower tend to have lower mileage, but
really is there a difference in horsepower between 95 and say 100? Probably not. The
mileage for both of these cars might be around in the same range. Before we discretize
the horsepower data, let's fit a simple linear model on the original continuous data. I've
used the train_test_split here to split our data into a training subset and a test subset,
instantiate the LinearRegression estimator object, call fit on our training data, and let's
use this model for prediction. Here are the predicted values for our test data from a
simple linear regression model. We had just one feature here, the horsepower. Let's
calculate the r2_score on the test data and it's 0.57. That's a pretty decent R-squared
score, and if you visualize the scatterplot of the original data and our fitted line, you can
see that the line models the underlying data quite well. Given that we have just one
feature, horsepower, this model has pretty good predictive power. Now let's use the
KBinsDiscretizer to discretize our horsepower values into 20 separate bins. What we're
essentially saying here is that horsepower matters, but there is no meaningful difference between a
car with a horsepower of 95 and 96 as far as mileage is concerned. If the horsepower
differences are huge, that's when it matters. So let's get our x_binned, this is our
horsepower information represented in the form of bins. We've ordinal encoded our
data, which means each of these numeric identifiers represents a different bin. If the
training data that we are going to use to build our model is discretized, we need to
discretize our test data as well. Call transform on x_test to get the binned
representation of our test data, that is, x_test_binned. Let's instantiate the
LinearRegression estimator object and call fit. Observe that the X value that we've
passed in here is the bin representation of our horsepower data. Now that we have a
trained regression model, let's use this model for prediction on the test data. We have
to work with the binned test data once again. Here are the predicted values from our
model. Let's calculate the r2_score and see how this model does and you can see that
the R-squared here is even higher, 0.70. Clearly, discretizing the horsepower values has
improved the predictive power of our model. Let's view a scatter plot representation of
the original data and let's view the bin edges on this same visualization. We have our
original training data represented in the greenish yellow color, the test data
represented in the black color, and we have the bins plotted in the form of vertical lines
as well.
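The comparison in this demo can be sketched as below. This is only an illustration under stated assumptions: the horsepower and mileage values are synthetic stand-ins generated with NumPy, not the course's automobile dataset, and the variable names are hypothetical. The scores it prints will therefore differ from the 0.57 and 0.70 quoted above, and binning is not guaranteed to help on every dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic stand-in for the automobile data (assumption: not the course dataset).
rng = np.random.default_rng(0)
horsepower = rng.uniform(45, 230, size=400)
mpg = 12 + 2500 / horsepower + rng.normal(0, 2.5, size=400)

X = horsepower.reshape(-1, 1)
x_train, x_test, y_train, y_test = train_test_split(
    X, mpg, test_size=0.2, random_state=0
)

# Baseline: linear regression on the continuous feature.
raw_model = LinearRegression().fit(x_train, y_train)
raw_r2 = r2_score(y_test, raw_model.predict(x_test))

# Fit the discretizer on the training data only, then reuse it on the test data.
binner = KBinsDiscretizer(n_bins=20, encode="ordinal", strategy="quantile")
x_train_binned = binner.fit_transform(x_train)
x_test_binned = binner.transform(x_test)

binned_model = LinearRegression().fit(x_train_binned, y_train)
binned_r2 = r2_score(y_test, binned_model.predict(x_test_binned))

print(f"R^2 on raw horsepower:    {raw_r2:.3f}")
print(f"R^2 on binned horsepower: {binned_r2:.3f}")
```

The key design point is that the same fitted discretizer transforms both subsets, so a given horsepower value always lands in the same bin at training and prediction time.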
Module Summary
And this brings us to the very end of this module where we discuss how we can
prepare continuous and categorical data to feed into machine learning models. We
started this module off with a discussion of what exactly categorical data means and
how it's different from continuous data. And within categorical data,
we discussed nominal data which cannot be ordered versus ordinal data, which has an
inherent ordering. We also discussed how machine learning models don't really work
well with data at different scales. And we saw a number of different techniques that we
could use to scale numeric features for data analysis. We studied and applied scaling, as
well as standardization and understood the differences between the two. We then saw
how categorical data needs to be numerically encoded before we can use it in ML
models. And we saw how we could use label encoding, as well as one-hot encoding to
encode categorical data. And finally, we rounded off this module by studying and
applying discretization to convert continuous data to categorical values. In the next
module, we'll focus our attention on working with too much data. We'll see how we can
apply feature selection techniques to reduce the data that we are working with and
Understanding Feature Selection
Module Overview
Hi, and welcome to this module on understanding feature selection. We'll see how you
can select and use only those features to train your model that are the most relevant
or significant. Now when you're working with data, too much data can be a curse. It's
often referred to as the curse of dimensionality when all of your records have many,
many features. We'll discuss the problems that we encounter when we work with high
dimensional data and we'll then discuss techniques that we can use to reduce the
complexity of data, feature selection techniques, as well as dimensionality reduction
techniques. In this module, we'll focus our attention on feature selection. We'll
understand the different techniques that you can use to select the most relevant
features. We'll discuss three broad categories of techniques starting with filter methods
where you use statistical tests to determine which features are relevant. We'll also
discuss embedded methods, which involve building machine learning models
that assign an importance measure to individual features. We can then select those
features which are the most important. And finally, we'll also discuss wrapper methods
which lie somewhere in between filter and embedded methods. Wrapper methods
involve building many candidate models on different subsets of features and choosing
that subset which gets us the best model.
predict the height. This is a simple regression model. Now you might increase the
number of features and have two X variables. You feed in the weight of an individual
and the height of that individual's parents and predict the height of that individual. In
today's day and age though, machine learning models are used for far more complex
problems. You might have a classification model where you feed in a video clip and you
perform classification to identify the faces of all individuals in that clip. This leads to
dimensionality explosion. The video clip contains a huge amount of data. This explosion
in data has to be dealt with the right way to make your models useful. This is the curse
of dimensionality. As the number of x variables grows, several problems arise when
you're working with this data. When you're working with records that have many, many
characteristics or your rows have many features, you have problems visualizing your
data. You'll then have problems feeding in such data to your machine learning models.
You'll have problems during training. And finally, when you use a model built with such
data in prediction, you'll encounter problems there as well. Let's talk about the first
because it's the most intuitive. As someone who is working with data, you know that an
essential first step is to understand your data and you do this through something
known as exploratory data analysis. This is an essential precursor to model building.
Without exploring your data, you won't understand it, you won't be able to clean and
prepare it for ML. EDA is essential for identifying outliers that might exist in your data,
detecting anomalies and dealing with those anomalies, and also for choosing the
functional form of your relationships. What kind of data do you have, what kind of
model can you build with this data? When you're exploring your data, you'll use
common kinds of visualizations, box plots, scatter plots, etc., to explore the relationships
that exist. Two-dimensional visualizations are powerful aids in EDA. If you add even a
single additional dimension, that is you go from two-dimensional to three-dimensional
data, that makes it hard to meaningfully visualize what's going on. Extrapolate these 2
Training is the phase where your machine learning algorithms learn from data. It's the
process of finding the best model parameters. Now complex models, especially neural
networks, have thousands of parameter values. During the training process, all of these
thousands of parameter values have to converge to the right model parameter
values and this convergence takes time. Training for too little time leads to bad models,
imperfectly trained models. As your data gets more complex, the complexity of your
model grows as well. The number of parameters that need to be found grows rapidly
with dimensionality and this makes the training process extremely time consuming and
time is money, especially when you refer to compute resources. Typically, machine
learning models are trained at scale on a cloud platform. When you're training on the
cloud, using resources for a long period can prove very expensive. And finally,
when you're working with data with a very large number of dimensions, you'll find
problems in prediction as well. What is prediction? Once you have a fully trained model,
prediction in its simplest form is basically looking at a test instance and finding what
training instances this particular instance is similar to. Now as dimensionality grows, the
size of the search space that your model has to look within explodes. When your
instances have a large number of dimensions, every instance is very far away from
other instances. Your data set then becomes sparse and it's harder for machine
learning models to find patterns in sparse data. The higher the number of X variables,
the higher the risk of overfitting on your training data. The curse of dimensionality is real,
but it's an easier problem to solve than the problem of insufficient data. You use feature
selection, feature engineering, and dimensionality reduction. This is something that
we've discussed earlier.
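The claim that instances drift far apart as dimensionality grows can be illustrated with a small experiment. This is a sketch under simple assumptions: points are drawn uniformly at random from a unit hypercube, and the helper name nearest_to_farthest_ratio is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)


def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Distance to the nearest point divided by distance to the farthest point."""
    points = rng.uniform(0.0, 1.0, size=(n_points, n_dims))
    query = rng.uniform(0.0, 1.0, size=n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return dists.min() / dists.max()


ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 10, 100, 1000)}
for d, ratio in ratios.items():
    print(f"{d:>5} dims: nearest/farthest = {ratio:.3f}")
```

In two dimensions the nearest neighbor is much closer than the farthest point, so the ratio is small. As dimensions are added the ratio creeps toward 1, meaning the "nearest" training instance is almost as far away as every other instance, which is what makes similarity-based prediction harder.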
at some of the solutions that we can use to reduce complexity in our data. You can
reduce complexity using two broad categories of techniques, feature selection and
dimensionality reduction. And within feature selection, you can use filter methods to
select relevant features, wrapper methods, or embedded methods. If you want to use
dimensionality reduction to reduce complexity, you have choices there as well. You can
use projection techniques, manifold learning, or auto-encoding. When all your records
have a large number of features, you'll use feature selection to choose just a subset of
the original X variables. You'll try and use these techniques to select the most relevant
features from your data. Dimensionality reduction, on the other hand, involves changing
the features that you're working with in some way. You'll transform the original X
variables and project them onto new dimensions. We won't work with dimensionality
reduction techniques in this course. Let's do a quick overview. When you use projection,
you'll try and find new better axes and reorient your current data to be expressed
along this projection. Examples of dimensionality reduction techniques include principal
twists and turns in higher dimensions are smoothed out when the data is expressed in
lower dimensionality. Manifold learning works best when your data lies along a rolled-up
surface, such as a Swiss roll or an S-curve. The data has a simpler form in lower
dimensionality; it's curved into a more complex form in higher dimensionality.
Examples of manifold learning algorithms are multidimensional scaling, Isomap, locally
linear embedding, Kernel PCA, etc. And finally, autoencoding is a dimensionality
reduction technique that works with neural networks. You build neural networks to
simplify the data. These neural networks try and find latent features or significant
features in your data and extract efficient representations of complex data, efficient in
that the resulting data has lower dimensions. When you reduce complexity in your data,
you're making a tradeoff. There are certain drawbacks to reducing complexity, such as
loss of information. When you use just a subset of features to train your model, there is
definitely information loss. You're losing the information in the features that you've
dropped. There is also performance degradation. It's possible that your machine
learning model will not perform as well as a model that has been trained on the entire
data. Reducing complexity is an important preprocessing step that you need to apply
to your data and these transformations can be fairly complex, which means they can be
computationally intensive. The data preprocessing steps become part of your machine
learning pipeline and your pipeline can get to be arbitrarily complex when you include
preprocessing. And finally, it's possible that when you use dimensionality reduction, the
transformed features become hard to interpret, so debugging and understanding why your
model works in a certain manner can be hard.
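The difference between selecting features and projecting them, and the interpretability tradeoff just mentioned, can be seen in a short sketch. It assumes scikit-learn's bundled iris dataset purely for convenience (it is not a dataset from this course) and uses PCA as one example of a projection technique.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Feature selection: keep 2 of the 4 original columns, values unchanged.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

# Projection: build 2 new axes, each a weighted mix of all 4 original columns.
pca = PCA(n_components=2)
X_projected = pca.fit_transform(X)

print("Kept original columns:", kept)
print("PCA component weights:\n", pca.components_)
print("Variance explained by 2 components:", pca.explained_variance_ratio_.sum())
```

The selected columns are still the original measurements and keep their meaning. Each projected column blends every original feature, which is why the transformed features are harder to interpret.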
that you're working with data that exhibits these characteristics. You have many X
variables, that is many features and most of those features are irrelevant or contain
very little predictive power. You have a few features which are very meaningful. You
definitely want to use those and these meaningful variables are independent of each
other. That's when you choose feature selection. Here is the tree that we saw earlier
showing us the variety of options that we have to reduce complexity in our data. We'll
focus on the left branch of this tree, feature selection using filter methods, wrapper
methods, and embedded methods. Let's start off with the first of these, filter methods,
to select relevant or significant features. These are techniques where the features or
columns in your data are selected independently of the choice of machine learning
model. It doesn't matter what algorithm you're going to train on your data. The model
that you choose has no bearing on the features that you select. When you use filter
methods for feature selection, you're relying on the statistical properties of the features.
You'll use these statistical properties to see the correlations that exist between the
features and your target. And when you're selecting features, you can perform
univariate statistical analysis or multivariate statistical analysis. You can consider your
features or X variables individually or jointly. When you use embedded methods for
feature selection, your features or columns are selected during the actual model training
process. They are referred to as embedded techniques because the process of
selecting relevant or significant features is embedded within the actual modeling of
your data. So how does this work? There are certain machine learning models, such as
decision trees and lasso regression, which when trained on your data assign feature
importance or significance to your features. You can then use this feature importance
information to select features to train your final model. Now not all models perform
feature selection, only specific types of models do this. A third kind of technique you
can use to perform feature selection is wrapper methods. This is a technique that lies
different subset of features. Of all of these candidate models, you'll find the model that
works best on the test data and you'll basically say the features used to train this model
are the features that we need to use for the final model. Examples of wrapper methods
to train models are forward and backward stepwise regression. In the demos that
follow in the next module, you'll see some more examples. When you use wrapper
methods for feature selection, this can get computationally intensive because you're
building and training many different candidate models and each candidate model is trained
on a different subset of features. However, all candidate models are similar in structure.
For example, if you're using decision trees, all of your candidate models will be built
using decision trees. Across wrapper methods, the way features are selected can be
different. Features may be added one at a time to the model or dropped one at a time
to see whether the model improves.
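A wrapper method that adds or drops one feature at a time can be sketched with scikit-learn's SequentialFeatureSelector. Several things here are assumptions for illustration: the bundled diabetes regression dataset stands in for course data, the helper name select_features is hypothetical, and candidates are compared by cross-validation rather than on a single test split.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X, y = data.data, data.target
names = list(data.feature_names)


def select_features(direction, n_features=3):
    """Greedy wrapper search: every candidate model is a LinearRegression."""
    sfs = SequentialFeatureSelector(
        LinearRegression(),
        n_features_to_select=n_features,
        direction=direction,  # "forward" adds features, "backward" drops them
        cv=5,
    )
    sfs.fit(X, y)
    return [name for name, keep in zip(names, sfs.get_support()) if keep]


forward = select_features("forward")
backward = select_features("backward")
print("Forward selection kept: ", forward)
print("Backward selection kept:", backward)
```

Every step trains one candidate model per remaining feature, which is where the computational cost of wrapper methods comes from. Forward and backward searches need not agree on the final subset.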
Filter Methods
Let's discuss in a little more detail filter methods for feature selection. Filter methods are
feature selection techniques where you select the features that you'll use to train your
model independently of the choice of model. This typically involves the use of statistical
methods for hypothesis testing to figure out which features are relevant or significant.
In the world of statistical techniques, there are a number of different approaches that
you can use to filter which features work best for your model. For example, variance
thresholding where you'll choose only those features which have a high variance. What
you're basically saying here is that features that have a higher variance contain more
information. Another statistical technique that you could apply is the Chi-square test.
The Chi-square test is a technique that allows you to test whether two variables X and
Y are independent of one another. If you have a variable X that is independent of Y, X
obviously has no predictive power to predict Y. Another statistical technique is the
ANOVA, or the analysis of variance, which is a very popular hypothesis testing
technique. Or you could choose to use mutual information between variables. The
mutual information is the amount of information that you can glean from one variable
by observing another random variable. Let's discuss each of these in a little more detail
starting with variance thresholding. This is based on the principle that if all points have
the same value for an X variable, then that variable adds no information. Extend this
idea and then you go ahead and drop columns which have a variance below a certain
threshold. That is, you build your model with only those features that have a high
variance above your minimum threshold. The chi-square test is a very commonly used
test for feature selection. For each X variable or feature, use the Chi-square test to
evaluate whether that variable and the target Y are independent. If they're
independent, drop that feature. The chi-square test is typically used in classification
models for categorical X and Y. When you perform chi-square analysis on your data,
what the test tries to do is check whether the observed data deviates from the expected
in a particular analysis. If there is a deviation from the expected based on the value of a
certain variable, that variable is significant or relevant. It tests the effect of one variable
on the outcome, so it's univariate analysis, analysis using just one variable. Under the
hood, the chi-square test calculates the sum of the squared differences between the
observed and the expected counts in all categories, each divided by the expected count, and checks to see how different they
are and this difference is captured in terms of a chi-square statistic and a p value that
gives us the significance of that statistic. Another common statistical technique used to
select relevant features is the ANOVA, or the analysis of variance. The ANOVA looks
across multiple groups of populations and compares the means or the averages of
these populations to produce one score and one significance value indicating how
different these populations are. The ANOVA statistical test allows you to compare
multiple groups of populations, not just two groups, and this is significant. When you
use ANOVA for feature selection, under the hood, the library will use the ANOVA F-test
to check whether the mean of the Y category varies for each distinct value of X. If the
average Y value for each X category is not significantly different, it means that X does
not influence Y. In that case, we can simply drop the X variable. And finally, we can use
mutual information between two variables to select features as well. This measures the
amount of information obtained on a random variable by observing another random
variable. So in some ways, this is a measure of the strength of the relationship between
two random variables. Now mutual information is conceptually similar to using the
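The four filter techniques discussed in this clip each have a counterpart in scikit-learn's feature_selection module. The sketch below is an illustration under assumptions: it uses the bundled iris dataset rather than course data, and it appends an artificial constant column so that variance thresholding has something to remove. Note that scikit-learn's chi2 expects non-negative feature values such as counts, so applying it to these continuous measurements is for demonstration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    SelectKBest,
    VarianceThreshold,
    chi2,
    f_classif,
    mutual_info_classif,
)

X, y = load_iris(return_X_y=True)

# Append a constant column: every point has the same value, so it adds no information.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# 1. Variance thresholding: drop columns whose variance is not above the threshold.
X_var = VarianceThreshold(threshold=0.0).fit_transform(X_aug)

# 2. Chi-square test of each feature against the target.
chi2_scores, chi2_p = chi2(X, y)

# 3. ANOVA F-test of each feature against the target classes.
f_scores, f_p = f_classif(X, y)

# 4. Mutual information between each feature and the target.
mi_scores = mutual_info_classif(X, y, random_state=0)

# Keep the 2 highest-scoring features under the ANOVA F-test.
top2 = SelectKBest(f_classif, k=2).fit(X, y).get_support(indices=True)

print("Columns after variance threshold:", X_var.shape[1])
print("Chi-square p-values:", chi2_p)
print("ANOVA F scores:", f_scores)
print("Mutual information:", mi_scores)
print("Top 2 feature indices (ANOVA):", top2)
```

None of these calls trains a predictive model, which is the defining property of filter methods: the scores depend only on the statistical properties of the features and the target.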
Embedded Methods
This clip will dig a little deeper into embedded methods for feature selection. As we
discussed earlier, embedded methods involve the use of machine learning models to
select the best features. The features are selected during the process of training your
model. While the model is trained, the model assigns importance or relevance to each
feature and you can then select the most relevant features for your final model. There
are a few machine learning algorithms that automatically perform feature selection.
Decision trees are one amongst these and the other is lasso regression. Let's briefly talk
about how both of these work starting with decision trees. Let's say you're building a
classifier to determine whether a sports person is a jockey or a basketball player. Now
there is some stuff that you know intuitively. Jockeys tend to be light to
meet horse carrying limits. On the other hand, basketball players tend to be tall, strong,
and heavy. Now your training data will contain a number of different sports persons
and their physical characteristics and your decision tree model will set up a tree
structure on this training data which will allow it to make decisions based on rules that it
has gleaned from your data. Decision trees typically consider each feature in turn. Each
feature is associated with a threshold for a decision and each feature becomes a node
in the decision tree. More important or significant features will be closer to the root of
the tree. So if weight is more significant than say the height of the individual, the weight
will be the root node, it has a threshold. Here the threshold is 150 lbs. If the sports
person is above 150 lbs, he or she is more likely to be a basketball player. Under 150 lbs,
let's make a decision based on height: greater than 6 feet, well that's a basketball
player; under 6 feet, more likely to be a jockey. That is, when decision tree machine
learning models are trained, what the decision tree tries to do is to fit knowledge into
rules and these rules are used to construct the tree and every rule involves a threshold.
How well your decision tree works depends on the order of the decision variables. The
rules and order are found using machine learning techniques during the training
process. The order of the decision variables determines the importance of the features.
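The jockey versus basketball player example can be sketched with a small decision tree. The athlete data below is entirely synthetic and hypothetical, generated only to mimic the intuition described above. One caveat beyond what the clip states: scikit-learn reports importance through feature_importances_, which is computed from how much each feature's splits reduce impurity rather than read directly off a node's depth.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 200

# Hypothetical athletes: label 0 = jockey, label 1 = basketball player.
weight_lbs = np.concatenate([rng.normal(115, 8, n), rng.normal(210, 20, n)])
height_ft = np.concatenate([rng.normal(5.3, 0.2, n), rng.normal(6.6, 0.25, n)])
X = np.column_stack([weight_lbs, height_ft])
y = np.array([0] * n + [1] * n)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned rules (each rule is a feature plus a threshold).
feature_names = ["weight_lbs", "height_ft"]
print(export_text(tree, feature_names=feature_names))

# Importance scores are normalized so that they sum to 1.
for name, importance in zip(feature_names, tree.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The printed rules show which feature and threshold the tree chose at the root, and the importance scores can then be used to decide which features to keep for a final model.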
Decision variables, that is, X variables, that are closer to the root are more important and
are considered first while making a classification. Now that we've understood how
decision trees assign feature importance values to your individual X variables, let's move
onto understanding lasso regression. We know that the linear regression model involves
finding the best fit line that passes through this data. This is the optimization problem
that linear regression tries to solve, it tries to find the best fit line, one that minimizes the
mean square error. The best fit line is one where the sum of the squares of the lengths
of these dotted lines is minimum. The formula for a line is of the form, y is equal to A +
Bx where A is the intercept and B is the coefficient for the X variable. So when you
perform ordinary mean square error regression, what you're trying to minimize is the
tries to minimize is referred to as an objective function. When you use lasso regression
instead of ordinary least squares regression, the objective function changes somewhat. A
new term is added to the objective function. This is referred to as a regularization term
or a penalty. In the case of lasso regression, this penalty term is the L1 norm of the
regression coefficients, the sum of the absolute values of all of the regression
coefficients multiplied by an alpha, which determines the strength of this penalty. Alpha
is something that you control when you fit this regression model. The use of this penalty
forces the regression model to keep things simple and keeping things simple allows your
regression model to mitigate overfitting on the training data, which is why lasso
regression is often used to mitigate overfitting. When alpha is equal to 0, the objective
function reduces to regular mean square error regression. As alpha grows toward
infinity, more and more of the small coefficients are forced to be exactly 0 and this forcing of small
coefficients to 0 is what allows lasso regression to select relevant features. Tweaking
the value of alpha allows you to perform model selection by selecting only relevant
features with large coefficients. This results in the elimination of unimportant features.
You'll only choose those features that have large coefficients, those are the significant
features.
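The effect of alpha on which features survive can be sketched as follows. This is an illustration under assumptions: it uses scikit-learn's bundled diabetes regression dataset rather than course data, standardizes the features so the coefficients are comparable, and picks the alpha values arbitrarily.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

data = load_diabetes()
X = StandardScaler().fit_transform(data.data)
y = data.target

# A stronger penalty (larger alpha) pushes more coefficients to exactly zero.
kept_by_alpha = {}
for alpha in (0.01, 1.0, 10.0, 1000.0):
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    kept = [
        name
        for name, coef in zip(data.feature_names, model.coef_)
        if coef != 0.0
    ]
    kept_by_alpha[alpha] = kept
    print(f"alpha={alpha:>7}: {len(kept)} features kept -> {kept}")
```

Features whose coefficients remain non-zero at a moderate alpha are the ones lasso treats as significant. At a very large alpha every coefficient is zero and the model predicts only the mean, so alpha has to be tuned rather than simply maximized.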
Module Summary
And this discussion of feature selection techniques brings us to the very end of this
module. We started this module off by discussing the curse of dimensionality and the
problems that it can cause us in machine learning, problems in visualization, training, as
well as prediction. When you have too much data to work with, you need techniques to
reduce the complexity of data. You can use feature selection techniques or
simply select those features that are the most relevant or significant and there are
three broad categories of feature selection techniques: filter methods, which involve
the use of statistical analysis to determine which features are the best. Filter methods
for feature selection are completely independent of the choice of machine learning
algorithm. We then discussed embedded methods, which involve actually building and
training machine learning models that assign importance to features. In this context, we
discussed decision trees and lasso regression, both of which have ways in which they
indicate the importance of specific features. We can then select those important
features for our final model. And finally, we briefly discussed wrapper methods that lie
midway between filter and embedded methods. Here, we build and train candidate
models on different subsets of features and we select those features that correspond
to the best model. In the next module, we'll apply all of these feature selection
techniques in practice.
Implementing Feature Selection
Module Overview
Hi, and welcome to this module on implementing feature selection. In the previous
module, we got a conceptual understanding of how feature selection works using filter
methods, embedded methods, and wrapper methods. In this module, we'll apply all of
these concepts in practice. We'll start off by calculating and visualizing the relationships
that exist in our data. We'll calculate feature correlations to measure the strength of
relationships between variables. We'll then see how we can detect and handle
multicollinearity in data. When you use X variables that are correlated to train your
machine learning model, that results in a less robust model. We'll then apply in a hands-
on manner all of the feature selection techniques that we discussed in the previous
module. We'll see how we can perform feature selection using the missing value ratio
and the variance threshold. You might have data with missing values and certain
features might be more prone to missing values. You'll simply eliminate those features
based on a threshold that you specified. That is feature selection by specifying a missing
value ratio, the variance threshold we discussed earlier. We'll then move onto filter,
Feature Correlations
If the features that you use to train your machine learning model are highly correlated,
the model will not be a robust one. In this demo, we'll see how you can visualize and
view feature correlations that exist in your data. We'll write our code in a brand-new
notebook called feature correlations, and in this demo, we'll use a Python library that
we haven't used earlier. This is the Yellowbrick library. This is an open source Python
project, which extends the scikit-learn API with visual analysis and diagnostic tools. You
can use the analysis tools that Yellowbrick offers in order to select the right features to
train your model. At the time of this recording, the latest version of Yellowbrick that's
available is 0.9.1 and that's what we're using here in this demo. We'll view the feature
correlations that exist in our diabetes dataset. We'll read this in from our CSV file into a
Pandas data frame. We'll use the health characteristics of these individuals to build a
classification model to predict whether they have diabetes or not. If you want to view
the correlations that exist between the different features in this data set, the easiest
way is to use the corr function available on a Pandas data frame. The correlation matrix
shows you the correlation coefficients that exist between the different pairs of variables.
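A minimal sketch of this step is shown below. Because the course's CSV file is not available here, the data frame is a synthetic, hypothetical stand-in that reuses a few of the column names; the numbers carry no medical meaning.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
age = rng.integers(21, 70, size=n)

# Synthetic stand-in for the diabetes CSV (assumption: values are made up).
df = pd.DataFrame({
    "Age": age,
    "Pregnancies": np.clip((age - 20) / 10 + rng.normal(0, 1.5, n), 0, None).round(),
    "Glucose": rng.normal(120, 30, n),
    "BMI": rng.normal(32, 6, n),
})

# corr() returns the pairwise Pearson correlation matrix by default.
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

The matrix is symmetric with ones on the diagonal, since every variable is perfectly correlated with itself.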
As we've discussed earlier, the correlation coefficient is a number between -1 and 1
where the value of -1 indicates that 2 variables are perfectly negatively correlated and
the value of +1 indicates that 2 variables are perfectly positively correlated. A correlation
value of 0 indicates no correlation. For example, in this data set, you can see that the
number of pregnancies and the glucose levels of an individual are positively correlated
with the outcome whether the person has diabetes or not. You can see the correlation
coefficient between age and the number of pregnancies is also positive indicating that
an older person is likely to have had more pregnancies. As we've discussed earlier in
this course, a very useful visualization tool for the correlation matrix is the heatmap
available in Seaborn. A heatmap is essentially a colored grid or matrix where the
different colors indicate whether two variables are positively or negatively correlated
and also the strength of their correlation. Let's see another technique that we can use
to visualize feature correlations using the Yellowbrick library. I want to view the
correlations between these features, Insulin, BMI, BloodPressure, and
DiabetesPedigreeFunction, and the age of an individual. I want to see whether these
features are positively or negatively correlated with age. The columns in my X data
frame will give me the feature names. The features that I've specified here are the four
that we've seen earlier. The Yellowbrick Python library offers a FeatureCorrelation
object allowing you to view the correlation that exists between pairs of variables. The
FeatureCorrelation object, by default, calculates the Pearson correlation between two
variables. The Pearson coefficient is a measure of the linear correlation between two
variables where one indicates total positive correlation and -1 total negative correlation.
A value of 0 indicates no correlation at all. When you're working with correlations, you
should remember that correlation does not mean causation. It's not possible to derive a
causal relationship purely based on the correlation data. Correlations are also very
sensitive to outliers. A single unusual observation may have a huge impact on a
correlation. Visualizer.fit calculates Pearson's correlation coefficient between all of the
X variables that we have specified and our Y value, that is the age. Visualizer.poof will
actually display the visualization. And here is a nice representation of the correlation
data. You can see that insulin levels are negatively correlated with age. Older people
tend to have lower insulin levels, whereas, the other three features, BloodPressure, BMI,
and DiabetesPedigreeFunction are positively correlated with age. If you want to view
the individual correlation scores, they're available in the scores_ property on the
visualizer, and here are the correlation scores for the X features that we passed in.
What are those features? Well, that's available in the features_ property. As we've seen
in the visual, insulin is negatively correlated, the other features are positively correlated.
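The feature-versus-target correlations described above can also be reproduced without a visualizer. The following is a minimal sketch, assuming synthetic stand-in data (the values and the way they are generated are illustrative, not the course's diabetes file) and using pandas' corrwith, which computes Pearson's coefficient by default:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): four features and an Age target.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(21, 80, size=n)
X = pd.DataFrame({
    "Insulin": 200 - 1.5 * age + rng.normal(0, 20, size=n),     # built to fall with age
    "BMI": 25 + 0.1 * age + rng.normal(0, 4, size=n),            # built to rise with age
    "BloodPressure": 60 + 0.3 * age + rng.normal(0, 8, size=n),  # built to rise with age
    "DiabetesPedigreeFunction": rng.uniform(0.1, 2.0, size=n),   # unrelated to age
})
y = pd.Series(age, name="Age")

# Pearson's correlation of every feature column against the target.
scores = X.corrwith(y)

# Side-by-side view: feature names in one column, coefficients in the other.
score_table = pd.DataFrame({"Feature": scores.index, "Correlation": scores.values})
print(score_table)
```

The sign of each coefficient gives the direction of the relationship and the magnitude gives its strength, which is the same information the bar chart conveys.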
If you want a side by side representation of the features and the corresponding
correlations, you can set up a data frame where the first column has the feature names
and the second column has the correlation coefficients. Let's now view the correlations
between all of the features in our input dataset versus the outcome, whether the
person was diagnosed with diabetes or not. I'm going to use the columns property in
my X data frame to extract all of the feature names that I'm going to use to calculate
correlations. Here are all of the feature names. Let's use the FeatureCorrelation object
from our Yellowbrick library. We'll use this to calculate Pearson's correlation
coefficient once again. Call fit on X and Y to calculate the correlations of all X values
versus Y, that is the outcome. It's important here to realize that Pearson's correlations
are meaningful only for metric variables, typically continuous numeric variables. They
can also be used with dichotomous variables, that is variables that have two discrete
values, such as Outcome. Correlations between variables can be calculated using other
techniques as well, such as mutual information. Here are the feature names that we are
working with. You can think of pregnancies as a discrete variable. If you really think
about it, pregnancies are discrete numeric values. When you're calculating correlations
using mutual information, you need to specify what values are discrete. In order to
calculate correlations using mutual information, I'm going to set up a Boolean vector
with true/false values representing whether a feature is a discrete feature or not. I'll
initialize this list to false first assuming all features are continuous. I'm going to set
feature 0 to True. That is Pregnancies, which is a discrete feature. Next, I instantiate a
FeatureCorrelation object to calculate correlations using mutual info for classification
problems. First, let's understand what mutual information means and represents. This is
a measure of the dependency that exists between two variables. Mutual information is
equal to 0 when 2 variables are independent. It's non-zero otherwise. Mutual
information tries to quantify the amount of information obtained about one random
variable through observing the other random variable. When you use this mutual
information technique to calculate feature correlations for classification, that is when the
output is a discrete variable, you need to specify which of the input features are
discrete. That's because the calculation for correlation for discrete features is different
from that of continuous numeric features. You can see that for correlations
calculated using mutual information, all of the features show a positive score with the
outcome. Mutual information is never negative, so it measures the strength of a
dependency rather than its direction. It's possible for you to calculate the correlations for all features with the
outcome, but visualize only a few of these features. Set up the features that you want
to visualize and pass this into your feature correlation object. The feature names input
argument will take in only the features that you want to plot. I also want them sorted,
so I've passed in sort is equal to True. Call visualizer.fit to calculate the correlations
between X and Y variables and here are our sorted correlated features. Glucose is the
most highly correlated feature, then comes BMI and the last is BloodPressure.
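The mutual information calculation with a discrete-feature mask can be sketched directly with scikit-learn's mutual_info_classif. This is an illustration under stated assumptions: the data is synthetic, only three columns are used, and Glucose is deliberately constructed to depend on the outcome.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in data (assumption): a binary outcome and three features.
rng = np.random.default_rng(1)
n = 500
outcome = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "Pregnancies": rng.integers(0, 10, size=n),                  # discrete counts
    "Glucose": 100 + 30 * outcome + rng.normal(0, 15, size=n),   # tied to the outcome
    "BloodPressure": rng.normal(70, 10, size=n),                 # independent noise
})

# Boolean mask: start with all features continuous, then flag feature 0 as discrete.
discrete_features = [False] * X.shape[1]
discrete_features[0] = True

# Mutual information between each feature and the discrete target.
mi = mutual_info_classif(
    X, outcome, discrete_features=discrete_features, random_state=0
)

# Sorted table of features and their scores, highest dependency first.
mi_table = (
    pd.DataFrame({"Feature": X.columns, "MutualInfo": mi})
    .sort_values("MutualInfo", ascending=False)
    .reset_index(drop=True)
)
print(mi_table)
```

Because the scores are never negative, sorting them ranks features by how much they tell you about the outcome, not by whether the relationship is increasing or decreasing.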
features where your model is not going to be very robust. So how do you detect and
handle multicollinearity in your data? That's what we'll do now. We'll work with the
automobile data set that we are familiar with and that we have seen before. Here are
the columns in our automobile data set and we'll use the information that we have on
the different cars to predict its mileage. That is our regression analysis. If you invoke the
describe function on the Pandas data frame, you'll be able to see that our features are
very different. You can see that the mean values and the standard deviations of the
different features are very different, they're all at different scales. Let's standardize all of
our input features before we use these features to train our machine learning model.
We'll standardize our features by using the scale function available in scikit-learn. The
scale function will express each value by subtracting the mean and dividing by the
standard deviation so all feature values are centered around the mean of 0. If you run
the describe function on the Pandas data frame now, you'll see that all features now
have a mean of 0 and a standard deviation of 1. We are familiar with this data. We have
387 records in our data set, which we'll use for regression analysis. I'll now set up the X
variables and the Y values to build and train our regression model. For the X variables,
I'm going to use all of the columns, except miles per gallon, that is the label, and the
origin column. The origin column contains categorical values. The only reason I'm
leaving that out is for convenience. You can one-hot encode these categories and use
the origin column as well if you want to. Next, I split our data into training and test
subsets. I'll use 20% of the data to evaluate the model that we'll build. We'll use simple
linear regression and we'll fit this estimator object on our training data. Once our
regression model has been trained, let's calculate the r2_score on the training data and
it's 77% so that's pretty decent. Let's use this model for prediction on the test data.
And once we have the predicted values from our model, let's calculate the r2_score of
the test data and it's even higher, it's 84.4%. Our model here is a pretty robust one.
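The standardize, split, fit, and score steps from this demo can be sketched as follows. The data here is synthetic (an assumption, since the cars file is not reproduced in the transcript), so the scores it prints will not match the 77% and 84.4% quoted above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

# Synthetic stand-in for the automobile data (assumption).
rng = np.random.default_rng(2)
n = 387
horsepower = rng.uniform(50, 220, size=n)
weight = 1500 + 12 * horsepower + rng.normal(0, 300, size=n)
age = rng.integers(30, 45, size=n).astype(float)
mpg = 45 - 0.05 * horsepower - 0.004 * weight + 0.3 * age + rng.normal(0, 2, size=n)
cars = pd.DataFrame(
    {"horsepower": horsepower, "weight": weight, "age": age, "mpg": mpg}
)

# Standardize the input features: subtract the mean, divide by the standard deviation.
features = cars.drop(columns="mpg")
X = pd.DataFrame(scale(features), columns=features.columns)
y = cars["mpg"]

# Hold out 20% of the records to evaluate the model.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LinearRegression().fit(x_train, y_train)
train_r2 = r2_score(y_train, model.predict(x_train))
test_r2 = r2_score(y_test, model.predict(x_test))
print(f"train r2: {train_r2:.3f}, test r2: {test_r2:.3f}")
```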
When you have multiple predictors or multiple features in your regression model, a
better measure of model quality is the adjusted r2_score, and that's exactly
what we're calculating here. The adjusted r-squared is calculated from the r2_score and
is a corrected goodness of fit measure for linear models. This is an r-squared that has
been adjusted for the number of predictors that you've used for your regression
analysis. The Adjusted_r2 increases only if a new predictor that you've added to train
your model improves your model more than the improvement that can be expected
purely due to chance. Here is the mathematical calculation for the Adjusted_r2 given
the r_square and the number of predictors in our training data. We know the r_square
of our linear model on the test data, let's calculate the Adjusted_r2 as well and you can
see that the Adjusted_r2 is a little lower, 83% as opposed to 84. When we explored
and visualized this data earlier, we saw that there were many features that were highly
correlated. If you look at the correlation matrix, you can see that cylinders, horsepower,
and weight are all three highly correlated with the displacement of a car. This high
correlation coefficient, at about 0.9 and above, indicates that these variables are likely to
be collinear. Another way to look at this is that all of these variables, cylinders,
horsepower, weight, and displacement give you the same information so you don't
really need to use all of them in your regression analysis. Using this correlation matrix,
let's say you want to identify all of the features that are highly correlated, with a
correlation coefficient greater than 0.8. And here in the resulting Boolean data
frame, you can see that the four features that we mentioned earlier are highly
correlated features. In order to avoid the multicollinearity that exists in our feature
variables, I'm going to drop the features for cylinders, displacement, and weight and
leave only horsepower in there. Here is the correlation matrix representation of our
trimmed_features, and you can see that for our X variables, the only X variables left
here are horsepower, acceleration, and age. The correlation values are fairly low. If you
go ahead and check the correlation coefficients amongst our trimmed_features, you
statsmodels library. If you don't have this package installed on your machine, you can
get it with a simple pip install statsmodels. The variance inflation factor for a particular
feature i is calculated by regressing that feature i on all other features, that is all features
other than i. It equals 1 / (1 - R²), where R² is the r-squared of that regression. I'm going to calculate this variance inflation factor for all
features here and store it in a Pandas data frame. You have to calculate this factor for
each feature separately. So for a particular feature i, use all of the other values to
calculate this VIF Factor. I'll now set up an additional column in this data frame called
features, which will hold the names of the different features, the VIF factor is the other
column, and I'm going to round off all of the values to two decimal places. And here
are the scores for the individual features in this data set. We have the VIF factor as the
first column and the feature name is the second column. A VIF of one indicates that a
feature is not correlated with the other features. A value between one and five
indicates that a feature is moderately correlated with other features, and a value
greater than five indicates that a feature is highly correlated with the other features in
our data set. Displacement and weight, as you see here, have the highest VIF values,
the variance inflation factor. So I'm going to go ahead and drop displacement and
weight from my X data. I'll now calculate the VIF values for each of the features that
remain in exactly the same way as before. So I'll have a features column and a VIF
Factor column, which I'll round off to two decimal places. Having dropped the weight
and displacement features, you can see that the remaining variance inflation factor
scores are below five. So based on the variance inflation factor scores, these are the
features that I'm going to use for my regression analysis. Now let's set up our X values
and our Y values. For the X data frame, we drop MPG, which is the label. We'll drop
calculate the r2_score on the training data and the score here is 0.72, it's a fairly high
score. Let's calculate the r2_score for the test data as well, but before that, we need
to call predict on the x_test data to get our predictions. Let's calculate the r2_score. It's
again around 72%. Let's calculate the adjusted r2_score on this test data and it's almost
72% once again. When you don't have highly correlated features, you'll find that your
Adjusted_r2_score is very close to your original r2_score.
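The two calculations used in this demo, the variance inflation factor and the adjusted r-squared, can be sketched from their standard definitions. This is not the statsmodels call used in the course; it is a hand-rolled equivalent built on scikit-learn, and the helper names vif_table and adjusted_r2 as well as the synthetic data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def vif_table(X):
    """VIF per feature: regress it on all other features, then 1 / (1 - R^2)."""
    rows = []
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        rows.append({"Feature": col, "VIF": round(1.0 / (1.0 - r2), 2)})
    return pd.DataFrame(rows)


def adjusted_r2(r2, n_samples, n_predictors):
    """Standard adjusted R^2: penalizes R^2 for the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Synthetic, already-standardized stand-in features (assumption):
# weight is built to be nearly collinear with displacement.
rng = np.random.default_rng(8)
n = 300
displacement = rng.normal(size=n)
X = pd.DataFrame({
    "displacement": displacement,
    "weight": 0.95 * displacement + 0.2 * rng.normal(size=n),
    "horsepower": 0.6 * displacement + 0.8 * rng.normal(size=n),
    "acceleration": rng.normal(size=n),
})

vif_all = vif_table(X)
vif_trimmed = vif_table(X.drop(columns=["displacement", "weight"]))
print(vif_all)
print(vif_trimmed)
```

With few predictors relative to the number of records, the correction factor (n - 1) / (n - p - 1) stays close to 1, which is why the adjusted score ends up close to the plain r2_score.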
code for this demo in a brand-new notebook. Go ahead and set up the import
statements for the libraries that we'll use. We'll work with the diabetes_data that we've
seen before. We'll use the health characteristics specified in these records to predict
whether an individual has diabetes or not. Observe that I've read in the diabetes.csv file
and a number of fields have values of 0. We know that 0 represents missing values
here in this data set. I'm going to replace every occurrence of 0 with np.nan. We
can now invoke isnull().sum() to see the number of missing values in each of
our columns. As you can see, there are a large number of missing values here. For each
column in our data which contains missing values, let's see what percent of the values
are missing. So I'll calculate isnull().sum(), that is the numerator, divide by the length of my
data, and multiply by 100 to get a value in the form of a percentage. For Glucose, this
value is 0.65, which means that very few values, less than 1% of our data, are missing. Let's
perform this same calculation for the BloodPressure column and you can see that
roughly 4.5% of the data is missing here. If you were to perform this
calculation for the SkinThickness column, you can see that there are a large number of
values missing, 30% of our data is missing for this particular column. For insulin, this
goes up to 48, almost 49%. Now you might say that our data set is fairly small and skin
thickness and insulin have so many values missing, should we really use these features?
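The missing-value percentages discussed above follow from a short pandas calculation. The sketch below assumes a tiny hand-made frame (ten illustrative rows, not the real diabetes.csv) and, as an extra assumption, restricts the zero-to-NaN replacement to measurement columns where a 0 cannot be a real reading.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (assumption): 0 stands for a missing measurement.
diabetes_data = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116, 78, 115, 197, 125],
    "BloodPressure": [72, 66, 64, 0, 40, 74, 50, 0, 70, 96],
    "Insulin": [0, 0, 0, 94, 168, 0, 88, 0, 543, 0],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
})

# Replace the 0 placeholders with NaN so pandas treats them as missing.
measurement_cols = ["Glucose", "BloodPressure", "Insulin", "BMI"]
diabetes_data[measurement_cols] = diabetes_data[measurement_cols].replace(0, np.nan)

# Count of missing values per column, then the same as a percentage.
missing_counts = diabetes_data.isnull().sum()
missing_percent = missing_counts / len(diabetes_data) * 100
print(missing_percent)
```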
Let's take a look at another column here for BMI, only 1.4% of the data is missing. Here
are the original columns in our data set. We've explored all of the features with missing
values and we have an idea of the percentage of missing values in each column. The
dropna function in a Pandas data frame allows us to drop records that have missing
values. This dropna function can also be used to drop columns with missing values
beyond a certain threshold. When you specify axis equal to 1, this threshold refers to
columns and not records. The threshold that I've specified here basically says that I
want to preserve only those features where 90% or more of the values are present,
that is not missing. I'll only preserve those columns which have less than 10% missing
values. Let's go ahead and take a look at the trimmed columns here. You can see that
the columns for insulin, as well as SkinThickness have been dropped. They contained
more than 10% missing values. We've gotten rid of those features. Now with these
existing features, you can build a classification model if you want to. You know how to
do that. I won't really go into that here. I'm going to go ahead and read in the diabetes
dataset once again, this time from the diabetes_processed.csv file. This is the file where
we imputed several missing values and cleaned up our data. I'm going to set up two
data frames here, one with the X values or features that we use to build our
classification model, that is all columns, except outcome and one with the Y labels. That
is the values from the outcome column. Pandas data frames have this very useful var
function that allows you to calculate variance in a column-wise manner. X.var will give
you the variance of the different features that exist in our data. Now a
feature whose value does not vary much does not contain very much predictive
information. Our variance calculation here shows that the DiabetesPedigreeFunction
has a variance that is very, very small, just 0.1. The other features have larger
variances, so you could say that they contain more
information. Now the scales of the different features here in this diabetes dataset are
very different. You might want to scale them to be in the same range before you
calculate the variance. I'm going to do exactly this right now using the MinMaxScaler
from scikit-learn. I'm going to scale all of the features to be in the range 0 to 10.
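To see why scaling matters before comparing variances, here is a minimal sketch using scikit-learn's MinMaxScaler with a 0 to 10 range. The two columns are synthetic (an assumption); they are chosen so that one has large raw values and the other has small raw values.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in columns (assumption): very different raw scales.
rng = np.random.default_rng(3)
X = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=300),
    "DiabetesPedigreeFunction": rng.uniform(0.1, 2.4, size=300),
})

# Column-wise variance on the raw scale.
raw_variance = X.var()

# Rescale every feature to the range 0 to 10, then recompute the variance.
scaler = MinMaxScaler(feature_range=(0, 10))
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
scaled_variance = X_scaled.var()

print(raw_variance)
print(scaled_variance)
```

On the raw scale the small-valued column looks almost constant; once both columns share a range, the comparison reflects how spread out each feature is within its own range.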
X_scaled is the data frame that contains the scaled features. I'm going to invoke the var
function on this data frame to see the variance of these scaled features. Now that all of
the features are in the same scale, you can see that the DiabetesPedigreeFunction has
a fairly high variance. Now if you go by the principle that features with a high variance
have more predictive power and contain more information, then you can use the
VarianceThreshold estimator object in scikit-learn in order to select the features that you want to
work with. A threshold of one here indicates that I want to work with only those
features which have a variance greater than or equal to one. Call fit_transform on
your scaled features and here are the selected features from our data frame. We
originally had eight features, one of them has been dropped, and we are left with
seven. And the feature that was dropped, you can see by looking at the variances of
the individual features is the SkinThickness. SkinThickness is the only one which has a
variance of less than one, so that is the feature that was dropped using our variance
threshold selector. Our selected features, the seven features that were selected, are
stored in the X_new array. I'm going to create a data frame from this array and store it
in the X_new variable once again. As you can see, this data frame here doesn't really
have any column names so we need to figure out what columns this data corresponds
to and we'll do that using this little nested for loop here. I have two for loops here, one
which iterates over the columns in our selected features, selected using the variance
threshold of one, and the second for loop iterates over the original columns. Each time
we find a match in the values in a particular column, that is, a feature that is present
among our selected features, I'm going to append the name of that feature to the
selected_features list. And here are the features that were selected using a variance
threshold of one, all features, except skin thickness.
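The variance-threshold selection can be sketched as below. Note one deliberate swap: instead of the nested for loops described above, this sketch recovers the column names with the selector's get_support mask, which is a different technique that gives the same names. The data is synthetic and assumed to be already scaled to 0 to 10.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Synthetic, already-scaled stand-in features (assumption).
rng = np.random.default_rng(4)
X_scaled = pd.DataFrame({
    "Glucose": rng.uniform(0, 10, size=300),        # variance around 8
    "BMI": rng.uniform(0, 10, size=300),            # variance around 8
    "SkinThickness": rng.normal(3, 0.5, size=300),  # variance around 0.25
})

# Keep only the features whose variance clears the threshold of one.
selector = VarianceThreshold(threshold=1.0)
X_new = selector.fit_transform(X_scaled)

# get_support returns a Boolean mask over the original columns.
selected_features = list(X_scaled.columns[selector.get_support()])
X_new = pd.DataFrame(X_new, columns=selected_features)
print(selected_features)
```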
classification model after we've selected relevant features, which means we'll work with
the diabetes dataset. This is a dataset that we are familiar with, it needs no
introduction. We'll perform feature selection using scikit-learn estimators and I'm going
to set up a little helper function here to figure out the names of the features that were
selected. This helper function is called get_selected_features and it takes as its input
arguments the original X features and another data frame X_new, which contains the
features that were selected using various statistical techniques. I'm going to identify
which features were selected by comparing the values in the individual columns by
running two nested for loops. Each time we see that a particular column in the original
data frame and the new data frame has a match, we select that particular feature and
append it to the selected features list. Now this is a rather crude technique to identify
the selected features, but it works well for our prototype model. In the real world, you
won't really care what features were selected. You just want the relevant features to
build and train your model. The first statistical technique that we'll use to perform
univariate feature selection is chi-square. The chi-square analysis between every feature
and the target variable calculates a measure of dependency between these two
variables. We'll use this relationship to determine the most relevant features for our
classification model. The chi-square test gives us a goodness of fit measure because it
measures how well the observed distribution of a particular variable fits with the
distribution that is expected if two variables are independent. In order to select relevant
features, scikit-learn offers this very useful estimator object called SelectKBest. You
simply specify the value for K that you want and SelectKBest will select the K best
features based on the statistical analysis you want to perform. I'm going to extract all
of our classification features into a data frame called X and Y will store the outcome
values. This contains the target labels for our classification model. The shape of X shows
us that we have 8 columns of data, that is 8 features to start off with. All of the features
in this data set are numeric. I'm going to convert all of the numbers to be in the same
np.float64 format. Calculating the most relevant features using the chi-square technique
is very straightforward. Simply instantiate a SelectKBest estimator object and specify
that the score_func that you want to use is the chi2 function. You also need
to specify a value for K. I've specified K equal to 4 here. Now that I have the
SelectKBest estimator, I'm going to call fit on my X data along with the Y values.
This will score each X feature by its chi-square measure with the
corresponding Y value. Let's see fit.scores_, this will give us the chi-square score for
each of our individual features. In order to see what feature each of these scores
corresponds to, I'm going to set up a little data frame with the feature name and the
corresponding chi-square score and I'm going to concatenate all of this to get a single
feature_score data frame. Here you can see the features and the corresponding chi-
square scores. In order to actually select the K, in this case, K is four, most relevant
features, you need to call transform on your X data and this will give you X_new, which
I'll now convert to a data frame and you can see that here are the four most relevant
features. We don't know what exactly these features are. The feature names have been
lost, but we can get these selected features using the get_selected_features helper
function that we set up earlier. The selected features selected using the chi-square
technique are Glucose, Insulin, BMI, and Age. These are the most relevant features. I'm
going to save the features selected using chi-square analysis in a new data frame called
chi2_best_features. We'll use these to build and train a classification model in just a bit.
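The chi-square selection steps can be sketched as follows, assuming synthetic non-negative features in which Glucose and Insulin are constructed to depend on the outcome. The value of k is set to 2 here rather than 4 only because the toy data has two informative columns; the column names are borrowed from the dataset for readability.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in data (assumption); chi2 requires non-negative features.
rng = np.random.default_rng(5)
n = 400
Y = pd.Series(rng.integers(0, 2, size=n))
X = pd.DataFrame({
    "Pregnancies": rng.integers(0, 10, size=n),
    "Glucose": 100 + 40 * Y + rng.integers(0, 30, size=n),   # depends on Y
    "BloodPressure": rng.integers(60, 90, size=n),
    "Insulin": 80 + 60 * Y + rng.integers(0, 50, size=n),    # depends on Y
    "BMI": rng.uniform(20, 40, size=n),
    "Age": rng.integers(21, 70, size=n),
}).astype(np.float64)

# Score every feature against the target with the chi-square statistic.
selector = SelectKBest(score_func=chi2, k=2)
fit = selector.fit(X, Y)

# Feature names next to their chi-square scores.
feature_scores = pd.DataFrame({"Feature": X.columns, "Score": fit.scores_})

# transform keeps only the k best columns; get_support restores their names.
X_new = pd.DataFrame(fit.transform(X), columns=X.columns[fit.get_support()])
print(feature_scores.sort_values("Score", ascending=False))
print(list(X_new.columns))
```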
But before that, let's see another feature_selection technique. The f_classif function in
the scikit-learn library uses the ANOVA F-value as a measure of dependency between
variables and it uses this relationship to determine the most relevant features for
classification. If you want to build a regression model and select features using ANOVA,
you have to use the corresponding f_regression function. Here, you don't really need
to know the exact details of the ANOVA statistical analysis, just that this is a measure of
the strength of the relationship between variables. Instead of the SelectKBest estimator
that we saw earlier, we'll use another technique to select relevant features, the
SelectPercentile estimator. The SelectPercentile estimator that we've instantiated here
will select those features that are in the top 80 percentile. It'll drop 20% of the features
at the bottom, the ones that are the least relevant. Call fit on your X variables and Y
values and the ANOVA F-value scores for your individual features are available here in
the scores_ property. Let's now select the top 80 percentile most relevant features by
calling transform on our X values and store the result in X_new, which I'll now convert
to a data frame. This data frame does not have the feature names. We'll extract the
feature names of the selected_features using our get_selected_features helper function.
Here are the selected features using ANOVA analysis. There are six features here
because we had specified percentile as 80. Let's get all of the selected features using
ANOVA analysis into a new data frame called f_classif_best_features. Now that we
have relevant features selected using two different statistical techniques, let's go ahead
and train a LogisticRegression classifier model. We'll use the build_model function here
to train different models. We'll split our data into training and test data, build a logistic
regression model, and use this model for prediction. And once we have predicted
values from the model, we'll print out the accuracy of this model on the test data. Let's
invoke the build_model function on our entire data set. Here, we'll use all of the original
eight features to build and train our model, and the accuracy of this model is 74.6%. I'll
now build a model using the four most significant features that we've got using
chi-square analysis. And this model did even better. The accuracy here is almost 82%. Let's
build and train a model using the ANOVA selected features and this model also does
pretty well, an accuracy of 76.6%.
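The ANOVA-based selection with SelectPercentile can be sketched like this. The data is synthetic (an assumption), with eight generically named features of which two are shifted by the class label, so that keeping the top 80 percentile leaves six features as in the demo.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in data (assumption): eight features, two of them informative.
rng = np.random.default_rng(6)
n = 400
Y = rng.integers(0, 2, size=n)
X = pd.DataFrame(
    rng.normal(size=(n, 8)), columns=[f"feature_{i}" for i in range(8)]
)
X["feature_0"] += 2.0 * Y   # strongly related to the class
X["feature_1"] += 1.0 * Y   # moderately related to the class

# Keep the features in the top 80 percentile of ANOVA F-values.
selector = SelectPercentile(score_func=f_classif, percentile=80)
fit = selector.fit(X, Y)

# F-value per feature, and the names of the features that survive.
f_scores = pd.DataFrame({"Feature": X.columns, "FValue": fit.scores_})
selected = list(X.columns[fit.get_support()])
X_new = pd.DataFrame(fit.transform(X), columns=selected)
print(f_scores.sort_values("FValue", ascending=False))
print(selected)
```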
feature elimination, forward selection, and backward selection. All of these are wrapper
methods for feature selection where you generate models with subsets of features and
find the best subset to work with based on the model's performance. In order to
perform feature selection using forward and backward elimination, we'll use the
mlxtend library available in Python. This is the machine learning extensions library, an
open source Python library with different useful data science tools. Once this Python
library has been installed on your local machine, you can go ahead and use it as we'll
do shortly. Once again, we'll work with the diabetes dataset and I'm going to read this
data set into the diabetes_data data frame. Before we apply our feature selection
techniques, I'm going to select all of the features available and place it in an X data
frame. The Y data frame contains our target labels. The first technique that we'll use to
perform feature selection is the recursive feature elimination available in the form of an
estimator object in the scikit-learn library. Recursive feature elimination will train a
variety of models with selected features. It'll first start off with an entire set of features,
and at every step, it'll prune the least important feature until it finds the best possible
subset. Pruning the least important feature at each step is done recursively and that's
what gives this algorithm its name. In order to use the recursive feature elimination
estimator, you'll need to train a model on your data. Now the model that you choose is
up to you. I'm going to build a classification model using logistic regression. Instantiate
an RFE estimator object and pass in our logistic regression classifier as an input
argument. The number of features that we want to select is equal to 4. Invoking the
rfe.fit function will train our logistic regression classification model on different subsets of
features. Iteratively at every step, the worst performing feature will be eliminated. Then
we're finally left with the four best features to train our model. Once the recursive
feature elimination algorithm is complete, let's print out the final number of features, the
support, and ranking from our model. You can see that a total of four features were
finally selected. That's what we had specified. The selected features are specified by
true values in this 1D vector and the feature rankings are also given. I'm going to collate
all of this information together into a single data frame so that it's easier for us to view
the names of the columns, the ranking of each column, and whether the feature was
selected or not. Here is a data frame with all of the feature ranks and you can see that
there are four features with rank 1 and those features are Pregnancies, Glucose, BMI,
and DiabetesPedigreeFunction. Recursive feature elimination deemed that these
features are the most significant. I'm going to get the names of all of the selected
features into a single list by figuring out where feature rank selected is equal to true and
the selected features are the ones that you see on screen here. Let's extract all of these
features into a single data frame. I want the four features selected using recursive
feature elimination to be in a data frame named recursive_features. We'll use these to
train a classification model in just a bit. Let's move onto another feature selection
removes one variable at a time based on the performance of our classifier model until
we get to the specified number of variables. I'm going to go ahead and instantiate this
SequentialFeatureSelector and the model that I want to use, the classification model
predictions from the individual decision trees it has built under the hood. I've used the
SequentialFeatureSelector to perform forward selection. As indicated by our input
argument, forward is equal to true. I want to select four features by adding one feature
at a time to improve our model and the scoring that we'll use to evaluate whether the
added feature was good or not is the accuracy of this RandomForestClassifier. Call
feature_selector.fit on our X variables and Y values. And once we've got the selected
features, the selected features are represented using their indices in the k_feature_idx_
property of the feature selector and we'll use that information to get the selected feature
names in this forward_elimination_feature_names list. Let's set up a data frame with
these features called forward_elimination_features. We'll use these features to train a
classification model in just a bit, but before that, let's select features using backward
elimination or backward selection. We'll use the SequentialFeatureSelector from the
mlxtend library. We'll use a RandomForestClassifier to select the best features.
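Forward and backward sequential selection can be sketched as below. This sketch swaps in scikit-learn's own SequentialFeatureSelector (with its direction argument) rather than the mlxtend class used in the course, so the parameter and attribute names differ from the demo. The data is synthetic, and the small forest size and three-fold cross-validation are assumptions chosen to keep the run short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic stand-in data (assumption): six features, one strongly informative.
rng = np.random.default_rng(7)
n = 200
Y = rng.integers(0, 2, size=n)
X = pd.DataFrame(
    rng.normal(size=(n, 6)), columns=[f"feature_{i}" for i in range(6)]
)
X["feature_0"] += 3.0 * Y

clf = RandomForestClassifier(n_estimators=10, random_state=0)

# Forward selection: start empty, add the feature that most improves accuracy.
forward = SequentialFeatureSelector(
    clf, n_features_to_select=2, direction="forward", scoring="accuracy", cv=3
).fit(X, Y)
forward_names = list(X.columns[forward.get_support()])

# Backward selection: start with everything, remove one feature at a time.
backward = SequentialFeatureSelector(
    clf, n_features_to_select=2, direction="backward", scoring="accuracy", cv=3
).fit(X, Y)
backward_names = list(X.columns[backward.get_support()])

print(forward_names, backward_names)
```

The two directions are not guaranteed to agree on every feature, since each makes greedy choices from a different starting point.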
Y data to get the selected features using backward elimination. Here are the backward
elimination feature names. Let's extract the data frame with exactly these features and
we'll use this data frame, backward_elimination_features, to train a classification model. The
final model, the final classifier that we'll build and train is the LogisticRegression classifier.
Let's use the build_model helper function to train this classifier. We're familiar with this
helper function, it needs no explanation. For each classifier that we'll build and train,
we'll print out the accuracy score of the model on our test data. Let's build a model
using all of the features from our input data set. This has an accuracy of 76.6%. Let's
now build a model using only four features that we selected using recursive feature
elimination. This model performs better than our earlier model where we used all of the
features. It has an accuracy of 78.5%. Let's build a model using features selected using
forward elimination and this model here has an accuracy of 74% so not that good. Let's
build a model using backward_elimination_features and this model also has a similar
accuracy, 74%.
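The comparison above can be sketched roughly as follows. This is an assumption-laden illustration, not the course's code: the helper's name and signature (build_model, test_frac, seed) are guesses, the data is synthetic from make_classification, and the column subsets are arbitrary index lists rather than the features chosen in the demo.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def build_model(X, y, test_frac=0.2, seed=0):
    """Train LogisticRegression on X and return its accuracy on a held-out split."""
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=test_frac, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    return accuracy_score(y_test, model.predict(x_test))


# Synthetic stand-in data: 8 columns, 4 of which are informative.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)

# Hypothetical feature subsets, given as column indices.
feature_sets = {
    "all features": list(range(8)),
    "first four columns": [0, 1, 2, 3],
    "last four columns": [4, 5, 6, 7],
}

# Same classifier and same split for every subset, so only the features differ.
scores = {name: build_model(X[:, cols], y) for name, cols in feature_sets.items()}
for name, acc in scores.items():
    print(f"{name}: accuracy = {acc:.3f}")
```

Each score here comes from a single train/test split, so a difference of a few percentage points between subsets may be noise rather than a real difference.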
techniques. We'll use lasso regression, as well as a decision tree regressor. And for this,
we'll read in our regression analysis data set, the cars_processed.csv file. We are familiar
with this data set, so it needs no introduction. In order to keep things simple, we'll work
with all of the features available in this data set except for origin. The Y value, that
is, the value that we are trying to predict, is of course the miles per gallon. Lasso
regression is a regularized regression model in which a penalty is applied
to large coefficients. Import and instantiate the lasso regression object and call
fit on our data set. Here, we are not really evaluating this regression model, only seeing
which features are significant, so I'm not going to split this data into training and
test subsets. The alpha value here is the regularization parameter and determines
the strength of the regularization of our model. The penalty term in a lasso
regression model is the sum of the absolute values of the coefficients, and this penalty
term is multiplied by the alpha value that we have specified here. Once we fit this
lasso regression model on our data, there is a property called coef_ on every fitted
regression model which gives you the values of the coefficients for the individual
predictors. Here, our predictors are all of the columns in our X data frame. Let's set up
a pandas Series object with the predictors and the corresponding coefficients and sort
the predictors by the value of the coefficients. The regularization in lasso
regression shrinks the coefficients of unimportant predictors toward 0, and can set them
to exactly 0. You can see here that the
most important predictors are age and the weight of a car, and these are the features
that we'll select in a list called lasso_features. We'll use these lasso features to train a different
regression model in just a bit. Now let's calculate which features are important using a
decision tree regressor. Here we instantiate and train a decision_tree model. A
decision tree is constructed such that the more
important features are higher up in the tree structure, closer to the root. Instantiate
your DecisionTreeRegressor, which I've constrained to have a maximum depth of
four, and call fit on our X and Y data. The predictors that we've used to fit this decision
tree are all of the columns in our X data frame. Every decision tree model that you
build has a property called feature_importances_ which assigns a numeric value to every
predictor. I'm going to sort these predictors based on their importance and here is the
result. The most important features based on this decision tree model are displacement
and horsepower. I'm going to go ahead and set up a list using these features,
displacement and horsepower, and these are the features that we'll use to train a
regression model in just a bit. We'll build a simple linear regression model using features
selected using our embedded methods. And in order to build and train this regression
model, we'll use a helper method as we've done earlier, the build model method. And
at the very end, after training our regression model, we'll print out the r2_score of this
model on the test data. Alright. If we use the features selected using lasso regression,
we get a model with an r2_score of 0.76, which is pretty good. If we build and train a
model using the features from our decision tree, we get a model with an r2_score of 0.73.
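The two embedded methods can be sketched side by side as below. Treat this as a rough illustration under stated assumptions: the data is synthetic (make_regression stands in for cars_processed.csv), the column names x0 to x5 are made up, alpha=1.0 is an arbitrary choice, and ranking lasso coefficients by absolute value is a choice made here so that strong negative coefficients also rank highly.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the cars data: 6 predictors, only 2 carry signal.
X_arr, y = make_regression(
    n_samples=300, n_features=6, n_informative=2, noise=5.0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(6)])

# Embedded method 1: lasso. alpha multiplies the L1 penalty, that is,
# the sum of the absolute values of the coefficients.
lasso = Lasso(alpha=1.0).fit(X, y)
lasso_coef = pd.Series(lasso.coef_, index=X.columns)
# Rank predictors by coefficient magnitude and keep the top two.
lasso_ranked = lasso_coef.abs().sort_values(ascending=False)
lasso_features = list(lasso_ranked.index[:2])

# Embedded method 2: a shallow decision tree, as in the demo (max depth 4).
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_importance = pd.Series(
    tree.feature_importances_, index=X.columns
).sort_values(ascending=False)
tree_features = list(tree_importance.index[:2])

print("lasso coefficients:")
print(lasso_coef)
print("lasso_features:", lasso_features)
print("tree importances:")
print(tree_importance)
print("tree_features:", tree_features)
```

Lasso coefficient sizes are only comparable across predictors when the predictors are on similar scales, so on real data it is worth checking whether the features were standardized before reading the ranking.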
Module Summary
And with this demo, we come to the very end of this hands-on module on
and visualizing the different relationships that exist in our data. Specifically, we use
different visualization techniques to identify the correlations between our features.
When your feature variables are highly correlated, your models tend to be less robust.
We then saw how we could detect and handle multicollinearity in data. We then moved
on to applying feature selection techniques that we had discussed conceptually earlier.
We implemented techniques such as missing value ratio and variance threshold to select
specific features. We also selected features using filter methods, wrapper methods, and
embedded methods. And this brings us to the very end of this course on preparing
data for machine learning. Now if you're a student of machine learning and you want to
go beyond the basic models that you've built, here are some other courses on
Pluralsight that you could watch. Employing Ensemble Methods with scikit-learn will
introduce you to ensemble techniques for ML. If you want to venture beyond traditional
machine learning models into deep learning, Building Your First PyTorch Solution is a
course that'll get you started with the PyTorch deep learning framework. And it's time
for me to say goodbye. That's it for me here today. Thank you for listening.
Course author
Janani Ravi
Janani has a Master's degree from Stanford and worked for 7+ years at Google. She was one of the
original engineers on Google Docs and holds 4 patents for its real-time collaborative editing...
Course info
Level Beginner
Duration 3h 24m