Cleaning Dirty Data With Pandas & Python

3/5/2020 Cleaning Dirty Data with Pandas & Python | DevelopIntelligence
About the Author: Al Nelson
Al is a geek about all things tech. He's a professional technical
writer and software developer who loves writing for tech
businesses and cultivating happy users. You can find him on
the web at http://www.alnelsonwrites.com or on Twitter as @musegarden.
https://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/
Pandas [http://pandas.pydata.org/] is a popular Python library used for data
science and analysis. Used in conjunction with other data science toolsets
like SciPy [https://www.scipy.org/], NumPy [http://www.numpy.org/], and
Matplotlib [https://matplotlib.org/], a modeler can create end-to-end analytic
workflows to solve business problems.
While you can do a lot of really powerful things with Python and data
analysis, your analysis is only ever as good as your dataset. And many
datasets have missing, malformed, or erroneous data. It’s often
unavoidable: anything from incomplete reporting to technical glitches can
cause “dirty” data [https://en.wikipedia.org/wiki/Dirty_data].
Thankfully, Pandas provides a robust library of functions to help you clean
up, sort through, and make sense of your datasets, no matter what state
they’re in. For our example, we’re going to use a dataset of 5,000 movies
scraped from IMDB. It contains information on the actors, directors, budget,
and gross, as well as the IMDB rating and release year. In practice, you’ll be
using much larger datasets consisting of potentially millions of rows, but this
is a good sample dataset to start with.
Unfortunately, some of the fields in this dataset aren’t filled in and some of
them have default values such as 0 or NaN (Not a Number).
No good. Let’s go through some Pandas hacks you can use to clean up your
dirty data.
[] Getting started
To get started with Pandas, first you will need to have it installed. You can do
so by running:
$ pip install pandas
Then we need to load the data we downloaded into Pandas. You can do this
with a few Python commands:
import pandas as pd

data = pd.read_csv('movie_metadata.csv')
Make sure your movie dataset is in the same folder you run the Python script
from. If you have it stored elsewhere, you’ll need to change the
read_csv parameter to point to the file’s location.
[] Look at your data
To check out the basic structure of the data we just read in, you can use the
head() command to print out the first five rows. That should give you a
general idea of the structure of the dataset.
data.head()
When we look at the dataset either in Pandas or in a more traditional
program like Excel, we can start to note down the problems, and then we’ll
come up with solutions to fix those problems.
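One way to start that list of problems is to count the missing values in each column. The sketch below is only an illustration: it uses a tiny made-up DataFrame named sample as a stand-in for the real movie file, with a few of the same column names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movie dataset; the values are made up.
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "duration": [120.0, np.nan, 95.0, np.nan],
    "country": ["USA", None, "UK", "USA"],
})

# Number of missing (NaN/None) entries in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Column types and non-null counts in a single summary.
sample.info()
```

On the real dataset you would call the same methods on data instead of sample.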
Pandas has some selection methods which you can use to slice and dice the
dataset based on your queries. Let’s go through some quick examples
before moving on:

Look at some basic stats for the ‘imdb_score’ column:
data.imdb_score.describe()
Select a column: data['movie_title']
Select the first 10 rows of a column: data['duration'][:10]
Select multiple columns: data[['budget', 'gross']]
Select all movies over two hours long: data[data['duration'] > 120]
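To make the selections above concrete, here is a small runnable sketch. The DataFrame sample and its values are invented for illustration; only the column names come from the dataset described in this post.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has many more columns.
sample = pd.DataFrame({
    "movie_title": ["Short Film", "Long Epic", "Mid Feature"],
    "duration": [85, 170, 121],
    "budget": [1_000_000, 200_000_000, 30_000_000],
    "gross": [3_000_000, 650_000_000, 25_000_000],
})

titles = sample["movie_title"]                  # one column (a Series)
money = sample[["budget", "gross"]]             # several columns (a DataFrame)
long_movies = sample[sample["duration"] > 120]  # boolean mask keeps matching rows

print(long_movies["movie_title"].tolist())
```

The last selection works because sample["duration"] > 120 produces a Series of True/False values, and indexing with it keeps only the rows marked True.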
[] Deal with missing data
One of the most common problems is missing data. This could be because it
was never filled out properly, the data wasn’t available, or there was a
computing error. Whatever the reason, if we leave the blank values in there,
they will cause errors in analysis later on. There are a few ways to deal
with missing data:
Add in a default value for the missing data
Get rid of (delete) the rows that have missing data
Get rid of (delete) the columns that have a high incidence of missing data
We’ll go through each of those in turn.
[] Add default values
First of all, we should probably get rid of all those nasty NaN values. But
what to put in their place? Well, this is where you’re going to have to eyeball
the data a little bit. For our example, let’s look at the ‘country’ column. It’s
straightforward enough, but some of the movies don’t have a country
provided so the data shows up as NaN. In this case, we probably don’t want
to assume the country, so we can replace it with an empty string or some
other default value.
data.country = data.country.fillna('')
This replaces the NaN entries in the ‘country’ column with the empty string,
but we could just as easily tell it to replace with a default name such as
“None Given”. You can find more information on fillna() in the Pandas
documentation
[https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html].
With numerical data like the duration of the movie, a calculation like taking
the mean duration can help us even the dataset out. It’s not a great
measure, but it’s an estimate of what the duration could be based on the
other data. That way we don’t have NaN values throwing off our analysis.
(Note that fillna() only replaces NaN; a placeholder such as 0 has to be
converted to NaN first.)
data.duration = data.duration.fillna(data.duration.mean())
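The two kinds of default value can be tried out on a small example. This is a sketch on invented data, not the real file. The replace(0, np.nan) step is an assumption added here to show one way of treating a placeholder 0 as missing before filling.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with a missing country, a missing duration and a 0 placeholder.
sample = pd.DataFrame({
    "country": ["USA", np.nan, "UK", "USA"],
    "duration": [100.0, np.nan, 0.0, 140.0],
})

# Text column: use an explicit label instead of guessing a country.
sample["country"] = sample["country"].fillna("None Given")

# Numeric column: turn the 0 placeholder into NaN, then fill with the mean.
# mean() skips NaN by default, so the mean here is (100 + 140) / 2 = 120.
sample["duration"] = sample["duration"].replace(0, np.nan)
sample["duration"] = sample["duration"].fillna(sample["duration"].mean())

print(sample)
```

Whether a 0 really means "missing" depends on the column, so that conversion is a judgment call to make per dataset.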
[] Remove incomplete rows

Let’s say we want to get rid of any rows that have a missing value. It’s a
pretty aggressive technique, but there may be a use case where that’s
exactly what you want to do.
Dropping all rows with any NA values is easy:
data.dropna()
Of course, we can also drop rows that have all NA values:
data.dropna(how='all')

We can also put a limitation on how many non-null values need to be in a
row in order to keep it (in this example, the data needs to have at least 5
non-null values):
data.dropna(thresh=5)
Let’s say for instance that we don’t want to include any movie that doesn’t
have information on when the movie came out:
data.dropna(subset=['title_year'])
The subset parameter lets you choose which columns to check for missing
values. It takes a list, so you can pass several column names at once.
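The row-dropping options above behave differently on the same data, which is easiest to see side by side. The sketch below uses a made-up four-row DataFrame. Note that dropna() returns a new DataFrame and leaves the original untouched, so each result is assigned to its own variable.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with 3, 2, 0 and 1 non-null values respectively.
sample = pd.DataFrame({
    "movie_title": ["A", "B", np.nan, np.nan],
    "title_year": [2009.0, np.nan, np.nan, 2012.0],
    "gross": [1.0, 2.0, np.nan, np.nan],
})

any_missing = sample.dropna()                    # keeps only fully populated rows
all_missing = sample.dropna(how="all")           # drops only the completely empty row
at_least_two = sample.dropna(thresh=2)           # keeps rows with >= 2 non-null values
has_year = sample.dropna(subset=["title_year"])  # keeps rows where title_year is present

for name, frame in [("any", any_missing), ("all", all_missing),
                    ("thresh=2", at_least_two), ("subset", has_year)]:
    print(name, frame.index.tolist())
```

To keep the result on the real dataset, assign it back, for example data = data.dropna(subset=['title_year']).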
[] Deal with error-prone columns
We can apply the same kind of criteria to our columns. We just need to use
the parameter axis=1 in our code. That means to operate on columns, not
rows. (We could have used axis=0 in our row examples, but it is 0 by default
if you don’t enter anything.)
Drop the columns that are all NA values:
data.dropna(axis=1, how='all')
Drop all columns with any NA values:
data.dropna(axis=1, how='any')
The same threshold and subset parameters from above apply as well. For
more information and examples, visit the Pandas documentation
[https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html].
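The column-wise variants can be checked the same way. In this sketch the DataFrame and its values are invented; one column is entirely empty and another has a single gap.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'gross' has one gap, 'aspect_ratio' is completely empty.
sample = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "gross": [1.0, np.nan, 3.0],
    "aspect_ratio": [np.nan, np.nan, np.nan],
})

# axis=1 makes dropna() look at columns instead of rows.
no_empty_columns = sample.dropna(axis=1, how="all")       # drops aspect_ratio only
only_complete_columns = sample.dropna(axis=1, how="any")  # drops gross as well

print(list(no_empty_columns.columns))
print(list(only_complete_columns.columns))
```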
[] Normalize data types

Sometimes, especially when you’re reading in a CSV with a bunch of
numbers, some of the numbers will read in as strings instead of numeric
values, or vice versa. Here’s a way you can fix that and normalize your data
types:
data = pd.read_csv('movie_metadata.csv', dtype={'duration': int})
This tells Pandas that the column ‘duration’ needs to be an integer value.
Note that a plain integer column cannot hold NaN, so this call raises an
error if the column still contains missing values.
Similarly, if we want the release year to be a string and not a number, we can
do the same kind of thing:
data = pd.read_csv('movie_metadata.csv', dtype={'title_year': str})
Keep in mind that this reads the CSV from disk again, so make sure you
either normalize your data types first or dump your intermediary results to a
file before doing so.
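If rereading the file is inconvenient, a column can also be converted in memory with astype(). That is an alternative to the read_csv approach above, not something the original example uses. The sketch below reads a small inline CSV (made-up data, loaded through io.StringIO so no file is needed), keeps the year as text while reading, and converts the duration to integers only after its gap has been filled.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for movie_metadata.csv.
csv_text = (
    "movie_title,duration,title_year\n"
    "A,120,2009\n"
    "B,,2010\n"
    "C,100,\n"
)

# Read the year as text; the blank year still comes back as a missing value.
typed = pd.read_csv(io.StringIO(csv_text), dtype={"title_year": str})

# 'duration' has a blank, so it is read as float. Fill the gap first,
# then convert: the mean of 120 and 100 is 110.
typed["duration"] = typed["duration"].fillna(typed["duration"].mean()).astype(int)

print(typed.dtypes)
```

Filling before converting matters: calling astype(int) on a column that still contains NaN raises an error.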
[] Change casing
Columns with user-provided data are ripe for corruption. People make typos,
leave their caps lock on (or off), and add extra spaces where they shouldn’t.
To change all our movie titles to uppercase:
data['movie_title'].str.upper()
Similarly, to get rid of leading and trailing whitespace:
data['movie_title'].str.strip()
We won’t be able to cover correcting spelling mistakes in this tutorial, but
you can read up on fuzzy matching [https://github.com/seatgeek/fuzzywuzzy]
for more information.
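Both string methods return a new Series rather than changing the column, so the result has to be assigned back to take effect. A minimal sketch on invented titles, chaining the two calls:

```python
import pandas as pd

# Hypothetical titles with stray spaces and inconsistent casing.
sample = pd.DataFrame({"movie_title": ["  avatar ", "Spectre  ", "john carter"]})

# Strip surrounding whitespace, uppercase, and store the result in the column.
sample["movie_title"] = sample["movie_title"].str.strip().str.upper()

print(sample["movie_title"].tolist())
```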
[] Rename columns
Finally, if your data was generated by a computer program, it probably has
some computer-generated column names, too. Those can be hard to read
and understand while working, so if you want to rename a column to
something more user-friendly, you can do it like this:
data.rename(columns={'title_year': 'release_date',
                     'movie_facebook_likes': 'facebook_likes'})
Here we’ve renamed ‘title_year’ to ‘release_date’ and
‘movie_facebook_likes’ to simply ‘facebook_likes’. Since this is not an
in-place operation, you’ll need to save the DataFrame by assigning it to a
variable.
data = data.rename(columns={'title_year': 'release_date',
                            'movie_facebook_likes': 'facebook_likes'})
[] Save your results
When you’re done cleaning your data, you may want to export it back into
CSV format for further processing in another program. This is easy to do in
Pandas:
data.to_csv('cleanfile.csv', encoding='utf-8')
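By default to_csv() also writes the DataFrame’s index as an extra first column. If the other program doesn’t need it, passing index=False leaves it out. The sketch below is an illustration on made-up data: it writes to a temporary folder (an assumption made here so the example doesn’t touch your working directory) and reads the file back to check the columns.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data.
sample = pd.DataFrame({"movie_title": ["A", "B"], "duration": [120, 95]})

# Write to a throwaway folder; index=False omits the row index column.
out_path = os.path.join(tempfile.mkdtemp(), "cleanfile.csv")
sample.to_csv(out_path, encoding="utf-8", index=False)

# Read the file back to confirm what was written.
round_trip = pd.read_csv(out_path)
print(list(round_trip.columns))
```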
[] More resources
Of course, this is only the tip of the iceberg. With variations in user
environments, languages, and user input, there are many ways that a
potential dataset may be dirty or corrupted. At this point you should have
learned some of the most common ways to clean your dataset with Pandas
and Python.
For more on Pandas and data cleaning, see these additional resources:
Pandas documentation [https://pandas.pydata.org/pandas-docs/stable/]
Messy Data Tutorial
[http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%207%20-%20Cleaning%20up%20messy%20data.ipynb]
Kaggle Datasets [https://www.kaggle.com/datasets]
Python for Data Analysis (“The Pandas Book”)
[http://shop.oreilly.com/product/0636920023784.do]
© 2013 - 2019 DevelopIntelligence LLC - Privacy Policy

