
3/5/2020 Cleaning Dirty Data with Pandas & Python | DevelopIntelligence


Cleaning Dirty Data with Pandas & Python
About the Author: Al Nelson

Al is a geek about all things tech. He’s a professional technical writer and
software developer who loves writing for tech businesses and cultivating
happy users. You can find him on the web at http://www.alnelsonwrites.com
or on Twitter as @musegarden.

https://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/

Pandas [http://pandas.pydata.org/] is a popular Python library used for data
science and analysis. Used in conjunction with other data science toolsets
like SciPy [https://www.scipy.org/], NumPy [http://www.numpy.org/], and
Matplotlib [https://matplotlib.org/], a modeler can create end-to-end analytic
workflows to solve business problems.

While you can do a lot of really powerful things with Python and data
analysis, your analysis is only ever as good as your dataset. And many
datasets have missing, malformed, or erroneous data. It’s often
unavoidable: anything from incomplete reporting to technical glitches can
cause “dirty” data [https://en.wikipedia.org/wiki/Dirty_data].

Thankfully, Pandas provides a robust library of functions to help you clean
up, sort through, and make sense of your datasets, no matter what state
they’re in. For our example, we’re going to use a dataset of 5,000 movies
scraped from IMDB. It contains information on the actors, directors, budget,
and gross, as well as the IMDB rating and release year. In practice, you’ll be
using much larger datasets consisting of potentially millions of rows, but this
is a good sample dataset to start with.

Unfortunately, some of the fields in this dataset aren’t filled in and some of
them have default values such as 0 or NaN (Not a Number).

No good. Let’s go through some Pandas hacks you can use to clean up your
dirty data.

[] Getting started
To get started with Pandas, first you will need to have it installed. You can do
so by running:

$ pip install pandas


Then we need to load the data we downloaded into Pandas. You can do this
with a few Python commands:

import pandas as pd

data = pd.read_csv('movie_metadata.csv')

Make sure the movie dataset is in the folder you run the Python script
from. If you have it stored elsewhere, you’ll need to change the
read_csv parameter to point to the file’s location.

[] Look at your data
To check out the basic structure of the data we just read in, you can use the
head() command to print out the first five rows. That should give you a
general idea of the structure of the dataset.

data.head()

When we look at the dataset either in Pandas or in a more traditional
program like Excel, we can start to note down the problems, and then we’ll
come up with solutions to fix those problems.
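
One quick way to note down the problems is to count them per column. The sketch below is only an illustration: it builds a tiny made-up DataFrame that borrows a few column names from the movie dataset (the values are hypothetical), then counts missing values and zero placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for movie_metadata.csv; the values are invented.
data = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "duration": [178.0, np.nan, 0.0, 95.0],
    "country": ["USA", None, "UK", None],
})

# Count missing (NaN/None) values in each column.
missing_per_column = data.isna().sum()
print(missing_per_column)

# Count zeros in a numeric column, since 0 may be a placeholder
# rather than a real measurement.
zero_durations = int((data["duration"] == 0).sum())
print(zero_durations)
```

On the real dataset, the same two calls would run directly on the DataFrame returned by read_csv.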

Pandas has some selection methods which you can use to slice and dice the
dataset based on your queries. Let’s go through some quick examples
before moving on:

Look at some basic stats for the ‘imdb_score’ column:
data.imdb_score.describe()

Select a column: data['movie_title']

Select the first 10 rows of a column: data['duration'][:10]

Select multiple columns: data[['budget', 'gross']]

Select all movies over two hours long: data[data['duration'] > 120]

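The selection patterns above can be tried on a small in-memory table. This is a minimal sketch with invented values; only the column names come from the tutorial, and the combined filter at the end is an extra illustration that is not part of the original examples.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset has many more columns.
data = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C"],
    "duration": [178, 95, 121],
    "budget": [1000, 2000, 3000],
    "gross": [5000, 1500, 3000],
})

titles = data["movie_title"]                 # one column, returned as a Series
first_two = data["duration"][:2]             # first two rows of a column
money = data[["budget", "gross"]]            # several columns, returned as a DataFrame
long_movies = data[data["duration"] > 120]   # boolean mask keeps matching rows

# Conditions combine with & (and) or | (or); each one needs parentheses.
long_and_profitable = data[(data["duration"] > 120) & (data["gross"] > data["budget"])]
print(long_and_profitable)
```
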
[] Deal with missing data
One of the most common problems is missing data. This could be because it
was never filled out properly, the data wasn’t available, or there was a
computing error. Whatever the reason, if we leave the blank values in there,
they can cause errors in analysis later on. There are a few ways to deal
with missing data:

Add in a default value for the missing data


Get rid of (delete) the rows that have missing data

Get rid of (delete) the columns that have a high incidence of missing data

We’ll go through each of those in turn.

[] Add default values

First of all, we should probably get rid of all those nasty NaN values. But
what to put in their place? Well, this is where you’re going to have to eyeball
the data a little bit. For our example, let’s look at the ‘country’ column. It’s
straightforward enough, but some of the movies don’t have a country
provided so the data shows up as NaN. In this case, we probably don’t want
to assume the country, so we can replace it with an empty string or some
other default value.

data.country = data.country.fillna('')

This replaces the NaN entries in the ‘country’ column with the empty string,
but we could just as easily tell it to replace with a default name such as
“None Given”. You can find more information on fillna() in the Pandas
documentation
[https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html].

With numerical data like the duration of the movie, a calculation like taking
the mean duration can help us even the dataset out. It’s not a great
measure, but it’s an estimate of what the duration could be based on the
other data. That way we don’t have crazy numbers like 0 or NaN throwing
off our analysis. Note that fillna() only replaces NaN; zeros used as
placeholders have to be converted to NaN first.

data.duration = data.duration.fillna(data.duration.mean())
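
To see both kinds of default value in action, here is a small sketch on invented data. Treating a duration of 0 as "missing" is an assumption made for this illustration; whether zero is a placeholder depends on the column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows using two column names from the tutorial.
data = pd.DataFrame({
    "country": ["USA", None, "UK", None],
    "duration": [100.0, np.nan, 0.0, 140.0],
})

# Text column: use an explicit label instead of guessing a country.
data["country"] = data["country"].fillna("None Given")

# Numeric column: turn the 0 placeholder into NaN so it is neither
# counted in the mean nor left behind by fillna().
data["duration"] = data["duration"].replace(0, np.nan)

# mean() skips NaN, so this is the average of the known durations.
data["duration"] = data["duration"].fillna(data["duration"].mean())
print(data)
```

The median (data["duration"].median()) is a common alternative when a few extreme values would pull the mean.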

[] Remove incomplete rows


Let’s say we want to get rid of any rows that have a missing value. It’s a
pretty aggressive technique, but there may be a use case where that’s
exactly what you want to do.
Dropping all rows with any NA values is easy:

data.dropna()

Of course, we can also drop only the rows where every value is NA:

data.dropna(how='all')

We can also put a limitation on how many non-null values need to be in a
row in order to keep it (in this example, the data needs to have at least 5
non-null values):

data.dropna(thresh=5)

Let’s say for instance that we don’t want to include any movie that doesn’t
have information on when the movie came out:

data.dropna(subset=['title_year'])

The subset parameter lets you choose which columns to check for missing
values. It takes a list, so you can name several columns at once. Like most
Pandas methods, dropna() returns a new DataFrame and leaves data
unchanged, so assign the result if you want to keep it.

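The four row-dropping variants behave differently on the same input, which is easiest to see side by side. The following sketch uses a made-up four-row table (column names from the tutorial, values invented) and a threshold of 2 instead of 5 because the table only has three columns.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: one complete, two partially filled, one empty.
data = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C", None],
    "title_year": [2009.0, np.nan, 2012.0, np.nan],
    "duration": [178.0, 95.0, np.nan, np.nan],
})

any_missing_dropped = data.dropna()             # keep only fully populated rows
all_missing_dropped = data.dropna(how="all")    # drop rows where every value is missing
at_least_two = data.dropna(thresh=2)            # keep rows with 2 or more non-null values
has_year = data.dropna(subset=["title_year"])   # drop rows that lack a title_year

print(len(any_missing_dropped), len(all_missing_dropped),
      len(at_least_two), len(has_year))

# dropna() returned new DataFrames each time; the original is untouched.
print(len(data))
```
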
[] Deal with error-prone columns
We can apply the same kind of criteria to our columns. We just need to use
the parameter axis=1 in our code. That means to operate on columns, not
rows. (We could have used axis=0 in our row examples, but it is 0 by default
if you don’t enter anything.)

Drop the columns that are all NA values:

data.dropna(axis=1, how='all')

Drop all columns with any NA values:

data.dropna(axis=1, how='any')

The same thresh and subset parameters from above apply as well, though
with axis=1 subset takes row labels rather than column names. For
more information and examples, visit the Pandas documentation
[https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html].
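
The column-wise versions can be compared the same way. This is a minimal sketch on invented data; 'aspect_ratio' is used here only as an example of a column that happens to be entirely empty in the sample.

```python
import numpy as np
import pandas as pd

# Hypothetical table: one full column, one partly filled, one empty.
data = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C"],
    "gross": [np.nan, 1500.0, 3000.0],
    "aspect_ratio": [np.nan, np.nan, np.nan],
})

no_empty_columns = data.dropna(axis=1, how="all")   # drops only the all-NaN column
complete_columns = data.dropna(axis=1, how="any")   # keeps only columns with no NaN
mostly_filled = data.dropna(axis=1, thresh=2)       # keeps columns with 2+ non-null values

# Share of missing values per column, useful for choosing a threshold.
missing_share = data.isna().mean()
print(missing_share)
```
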

[] Normalize data types


Sometimes, especially when you’re reading in a CSV with a bunch of
numbers, some of the numbers will read in as strings instead of numeric
values, or vice versa. Here’s a way you can fix that and normalize your data
types:

data = pd.read_csv('movie_metadata.csv', dtype={'duration': int})

This tells Pandas that the column ‘duration’ needs to be an integer value.
Note that a plain integer column can’t hold NaN, so Pandas raises an error
here if ‘duration’ has missing values. Similarly, if we want the release year
to be a string and not a number, we can do the same kind of thing:

data = pd.read_csv('movie_metadata.csv', dtype={'title_year': str})

Keep in mind that this reads the CSV from disk again, so make sure you
either normalize your data types first or dump your intermediary results to a
file before doing so.
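
As a rough illustration of how types are inferred and overridden, the sketch below reads a tiny CSV from an in-memory string instead of from disk. The CSV text is invented, and using the nullable "Int64" dtype for a column with gaps is a suggestion of this example, not something the tutorial itself prescribes.

```python
import io

import pandas as pd

# Hypothetical CSV content; the second row has no duration.
csv_text = "movie_title,duration,title_year\nMovie A,178,2009\nMovie B,,2012\n"

# Default inference: the gap forces duration to float, title_year becomes int.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# Ask for title_year as text at read time.
data = pd.read_csv(io.StringIO(csv_text), dtype={"title_year": str})

# Convert duration to a nullable integer type, which can hold missing values.
data["duration"] = data["duration"].astype("Int64")
print(data.dtypes)
```

Converting with astype() after loading avoids rereading the file, which matters once earlier cleaning steps have already been applied in memory.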

[] Change casing
Columns with user-provided data are ripe for corruption. People make typos,
leave their caps lock on (or off), and add extra spaces where they shouldn’t.

To change all our movie titles to uppercase:

data['movie_title'].str.upper()

Similarly, to get rid of leading and trailing whitespace:

data['movie_title'].str.strip()

Both calls return a new Series, so assign the result back to the column to
keep it.

We won’t be able to cover correcting spelling mistakes in this tutorial, but
you can read up on fuzzy matching [https://github.com/seatgeek/fuzzywuzzy]
for more information.

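Casing and whitespace fixes matter most when the same value has been typed several ways. In this sketch (invented titles, same column name as the tutorial), three spellings of one title collapse into a single value once both string methods are chained and assigned back.

```python
import pandas as pd

# Hypothetical titles with inconsistent casing and stray spaces.
data = pd.DataFrame({"movie_title": ["  avatar ", "Avatar", "SPECTRE  "]})

unique_before = data["movie_title"].nunique()

# .str methods return a new Series, so the result is assigned back.
data["movie_title"] = data["movie_title"].str.strip().str.upper()

unique_after = data["movie_title"].nunique()
print(unique_before, unique_after)
print(data["movie_title"].tolist())
```
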
[] Rename columns
Finally, if your data was generated by a computer program, it probably has
some computer-generated column names, too. Those can be hard to read
and understand while working, so if you want to rename a column to
something more user-friendly, you can do it like this:

data.rename(columns={'title_year': 'release_date',
'movie_facebook_likes': 'facebook_likes'})

Here we’ve renamed ‘title_year’ to ‘release_date’ and
‘movie_facebook_likes’ to simply ‘facebook_likes’. Since this is not an
in-place operation, you’ll need to save the DataFrame by assigning it to a
variable.

data = data.rename(columns={'title_year': 'release_date',
'movie_facebook_likes': 'facebook_likes'})

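The point about rename() not being in-place can be checked directly. A minimal sketch with a one-row, made-up DataFrame:

```python
import pandas as pd

# Hypothetical single-row table with the two columns being renamed.
data = pd.DataFrame({"title_year": [2009], "movie_facebook_likes": [33000]})

# rename() returns a new DataFrame; columns not in the mapping are kept as-is.
renamed = data.rename(columns={
    "title_year": "release_date",
    "movie_facebook_likes": "facebook_likes",
})

print(list(data.columns))      # original names are still in place
print(list(renamed.columns))   # new names appear only on the returned copy
```
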

[] Save your results


When you’re done cleaning your data, you may want to export it back into
CSV format for further processing in another program. This is easy to do in
Pandas:

data.to_csv('cleanfile.csv', encoding='utf-8')
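
A quick round trip is one way to confirm the export worked. The sketch below writes to a temporary directory so it leaves nothing behind; the sample rows are invented, and passing index=False is an optional choice made here so the row index is not written as an extra column.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned data.
data = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B"],
    "duration": [178, 95],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "cleanfile.csv")

    # index=False leaves the row index out of the file.
    data.to_csv(out_path, index=False, encoding="utf-8")

    # Read the file back to check that the columns and values survived.
    reloaded = pd.read_csv(out_path)

print(reloaded)
```
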

[] More resources
Of course, this is only the tip of the iceberg. With variations in user
environments, languages, and user input, there are many ways that a
potential dataset may be dirty or corrupted. At this point you should have
learned some of the most common ways to clean your dataset with Pandas
and Python.

For more on Pandas and data cleaning, see these additional resources:

Pandas documentation [https://pandas.pydata.org/pandas-docs/stable/]
Messy Data Tutorial [http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%207%20-%20Cleaning%20up%20messy%20data.ipynb]
Kaggle Datasets [https://www.kaggle.com/datasets]
Python for Data Analysis (“The Pandas Book”) [http://shop.oreilly.com/product/0636920023784.do]

© 2013 - 2019 DevelopIntelligence LLC - Privacy Policy
