Cleaning Dirty Data With Pandas & Python
Unfortunately, some of the fields in this dataset aren’t filled in and some of
them have default values such as 0 or NaN (Not a Number). No good. Let’s go
through some Pandas hacks you can use to clean up your dirty data.
Then we need to load the data we downloaded into Pandas. You can do this
with a few Python commands:
data.head()
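A minimal sketch of the loading step, assuming the dataset is a CSV file. The file name movie_metadata.csv and the sample rows are hypothetical; an in-memory stand-in is used so the example runs on its own.

```python
import io
import pandas as pd

# Stand-in for the downloaded file so the sketch is self-contained.
# With a real file you would call pd.read_csv("movie_metadata.csv")
# instead (hypothetical file name).
csv_text = """movie_title,country,duration,title_year
Avatar,USA,178,2009
Spectre,UK,148,2015
Untitled,,,
"""

data = pd.read_csv(io.StringIO(csv_text))

# Preview the first rows, then count the missing values in each column.
print(data.head())
print(data.isna().sum())
```

Empty fields in the CSV are read as NaN, which is what the cleaning steps in the rest of the article work on.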
Pandas has some selection methods which you can use to slice and dice the
dataset based on your queries. Let’s go through some quick examples
before moving on:
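A few common selection patterns might look like the sketch below. The small DataFrame is made-up sample data; only the column names follow the ones used in this article.

```python
import pandas as pd

# Made-up sample rows for illustration.
data = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", "Tangled"],
    "duration": [178.0, 148.0, 100.0],
    "title_year": [2009, 2015, 2010],
})

# Select a single column (returns a Series).
titles = data["movie_title"]

# Select rows by label and a list of columns; .loc includes the end label.
first_two = data.loc[0:1, ["movie_title", "duration"]]

# Select a single row by integer position.
first_row = data.iloc[0]

# Filter rows with a boolean condition.
long_movies = data[data["duration"] > 120]
```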
https://www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/ 3/11
3/5/2020 Cleaning Dirty Data with Pandas & Python | DevelopIntelligence
[] Deal with missing data
One of the most common problems is missing data. This could be because it
was never filled out properly, the data wasn’t available, or there was a
computing error. Whatever the reason, if we leave the blank values in there,
it will cause errors in analysis later on. There are a couple of ways to deal
with missing data:
Get rid of (delete) the columns that have a high incidence of missing data
First of all, we should probably get rid of all those nasty NaN values. But
what to put in their place? Well, this is where you’re going to have to eyeball
the data a little bit. For our example, let’s look at the ‘country’ column. It’s
straightforward enough, but some of the movies don’t have a country
provided so the data shows up as NaN. In this case, we probably don’t want
to assume the country, so we can replace it with an empty string or some
other default value.
data.country = data.country.fillna('')
This replaces the NaN entries in the ‘country’ column with the empty string,
but we could just as easily tell it to replace with a default name such as
“None Given”. You can find more information on fillna() in the Pandas
documentation [https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html].
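As a rough sketch of the default-value variant, the example below fills one column with “None Given” and then uses a dict to give another column its own default. The ‘language’ column and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample data; 'language' is a hypothetical second column.
data = pd.DataFrame({
    "country": ["USA", np.nan, "UK"],
    "language": [np.nan, "English", "English"],
})

# Replace NaN in a single column with a readable placeholder.
data["country"] = data["country"].fillna("None Given")

# A dict maps column names to per-column fill values.
data = data.fillna({"language": "Unknown"})
```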
With numerical data like the duration of the movie, a calculation like taking
the mean duration can help us even the dataset out. It’s not a great
measure, but it’s an estimate of what the duration could be based on the
other data. That way we don’t have crazy numbers like 0 or NaN throwing
off our analysis.
data.duration = data.duration.fillna(data.duration.mean())
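Note that fillna() only replaces NaN, so a placeholder 0 would survive the line above and also drag the mean down. One possible approach, assuming a duration of 0 always means "unknown" in your data, is to convert zeros to NaN first. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up durations: one placeholder zero and one missing value.
data = pd.DataFrame({"duration": [100.0, 0.0, np.nan, 140.0]})

# Treat 0 as missing (an assumption about this dataset, not a general rule).
data["duration"] = data["duration"].replace(0, np.nan)

# mean() skips NaN by default, so only the known durations are averaged.
data["duration"] = data["duration"].fillna(data["duration"].mean())
```

Here the two unknown durations are filled with the mean of the two known ones.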
data.dropna(thresh=5)
Let’s say for instance that we don’t want to include any movie that doesn’t
have information on when the movie came out:
data.dropna(subset=['title_year'])
The subset parameter allows you to choose which columns you want to look
at. You can also pass it a list of column names here.
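The row-dropping options can be compared side by side in a small sketch. The sample rows are made up, and thresh=2 is used instead of 5 only because the toy table has three columns.

```python
import numpy as np
import pandas as pd

# Made-up sample rows with different amounts of missing data.
data = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", np.nan, "Tangled"],
    "country": ["USA", np.nan, np.nan, "USA"],
    "title_year": [2009, 2015, np.nan, np.nan],
})

# Drop every row that has at least one NaN.
complete_rows = data.dropna()

# Keep only rows with at least 2 non-NaN values.
at_least_two = data.dropna(thresh=2)

# Drop rows only when 'title_year' is missing.
has_year = data.dropna(subset=["title_year"])
```

dropna() returns a new DataFrame each time, so data itself still has all four rows afterwards.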
[] Deal with error-prone columns
We can apply the same kind of criteria to our columns. We just need to use
the parameter axis=1 in our code. That means to operate on columns, not
rows. (We could have used axis=0 in our row examples, but it is 0 by default
if you don’t enter anything.)
data.dropna(axis=1, how='all')
The same threshold and subset parameters from above apply as well. For
more information and examples, visit the Pandas documentation
[https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html].
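A short sketch of the column-wise variants, using made-up data in which the ‘notes’ column is a hypothetical, entirely empty column:

```python
import numpy as np
import pandas as pd

# Made-up sample data; 'notes' is a hypothetical all-empty column.
data = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre", "Tangled"],
    "country": ["USA", np.nan, "USA"],
    "notes": [np.nan, np.nan, np.nan],
})

# Drop columns in which every value is NaN.
no_empty_cols = data.dropna(axis=1, how="all")

# Drop columns that contain any NaN at all.
complete_cols = data.dropna(axis=1, how="any")

# Keep columns with at least 2 non-NaN values.
mostly_full = data.dropna(axis=1, thresh=2)
```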
Keep in mind that this data reads the CSV from disk again, so make sure you
either normalize your data types first or dump your intermediary results to a
file before doing so.
[] Change casing
Columns with user-provided data are ripe for corruption. People make typos,
leave their caps lock on (or off), and add extra spaces where they shouldn’t.
data['movie_title'].str.upper()
Similarly, to get rid of leading and trailing whitespace:
data['movie_title'].str.strip()
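Both string methods return a new Series rather than changing the column in place, so the result has to be assigned back to take effect. A minimal sketch with made-up titles:

```python
import pandas as pd

# Made-up titles with inconsistent casing and stray spaces.
data = pd.DataFrame({"movie_title": ["  avatar ", "SPECTRE", "Tangled  "]})

# Chain the string methods, then assign the result back to the column.
data["movie_title"] = data["movie_title"].str.strip().str.upper()
```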
[] Rename columns
Finally, if your data was generated by a computer program, it probably has
some computer-generated column names, too. Those can be hard to read
and understand while working, so if you want to rename a column to
something more user-friendly, you can do it like this:
data.rename(columns={'title_year': 'release_date',
                     'movie_facebook_likes': 'facebook_likes'})
# rename() returns a new DataFrame; assign the result to keep the change.
data = data.rename(columns={'title_year': 'release_date',
                            'movie_facebook_likes': 'facebook_likes'})
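The difference between the two calls can be checked with a small sketch. The one-row DataFrame is made-up sample data.

```python
import pandas as pd

# Made-up one-row table using the original column names.
data = pd.DataFrame({"title_year": [2009], "movie_facebook_likes": [33000]})

# rename() leaves 'data' untouched and returns a renamed copy.
renamed = data.rename(columns={"title_year": "release_date",
                               "movie_facebook_likes": "facebook_likes"})

print(list(data.columns))     # original names are still there
print(list(renamed.columns))  # new names live on the returned copy
```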
[] More resources
Of course, this is only the tip of the iceberg. With variations in user
environments, languages, and user input, there are many ways that a
potential dataset may be dirty or corrupted. At this point you should have
learned some of the most common ways to clean your dataset with Pandas
and Python.
For more on Pandas and data cleaning, see these additional resources:
Pandas documentation [https://pandas.pydata.org/pandas-docs/stable/]
Messy Data Tutorial [http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/v0.1/cookbook/Chapter%207%20-%20Cleaning%20up%20messy%20data.ipynb]
Kaggle Datasets [https://www.kaggle.com/datasets]
Python for Data Analysis (“The Pandas Book”)
[http://shop.oreilly.com/product/0636920023784.do]