Pandas Profiling and Exploratory Data Analysis with One Line of Code!


Learn how to install and use pandas profiling for automatic EDA

Magdalena Konkiewicz


Jun 10 · 6 min read

Image by Colin Behrens from Pixabay

Introduction
If you are already familiar with the pandas profiling package, you will not learn
anything new from this article, so you can skip it.

However, if you have never heard of it, this may be one of the best data analysis
productivity tips you have been given so far, so hang on.

. . .

Pandas profiling

Pandas profiling is a package that lets you create an exploratory data analysis
report with minimal effort: one line of code.

Therefore, if you are a Data Scientist or Analyst who has been doing exploratory
data analysis manually, pandas profiling will save you a lot of time, effort,
and typing. Do you remember all the repetitive code you use when doing exploratory
data analysis, such as:

info(),

describe(),

isnull(),

corr(),

etc.
You will not have to write it anymore. The pandas profiling package will do it for
you and create a full summary report of your data.
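For comparison, here is a minimal sketch of the manual calls listed above. The tiny
data frame is made up purely for illustration (it is not the data set used later),
and the variable names are my own.

```python
import pandas as pd

# Tiny made-up data frame, used only for illustration
df = pd.DataFrame({
    "age": [63, 37, 41, None, 57],
    "chol": [233, 250, 204, 236, 354],
    "sex": ["M", "M", "F", "M", "F"],
})

df.info()                      # column names, dtypes, and non-null counts
summary = df.describe()        # count, mean, std, and quartiles of numeric columns
missing = df.isnull().sum()    # number of missing values per column

# Pairwise Pearson correlation, restricted to the numeric columns
correlations = df.select_dtypes(include="number").corr()

print(summary)
print(missing)
print(correlations)
```

Every one of these calls has to be typed, run, and read separately, which is the
repetition the profiling report is meant to remove.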

So let’s get started!


. . .

How to install the pandas profiling package

The installation of pandas profiling is very easy. You can use the standard pip
command.

pip install pandas-profiling

It should take a minute or two to install the package, and you should then be ready
to use pandas profiling within Python.
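If you want to check that the installation worked, a small sketch like the one below
can help. It only looks the module up and does not import it. Note that later
releases of the project are published under the name ydata-profiling (import name
ydata_profiling), so the sketch checks for both names. The helper function name is
my own.

```python
import importlib.util


def installed_profiling_module():
    """Return the import name of the first profiling package found, or None."""
    # "ydata_profiling" is the newer import name, "pandas_profiling" the older one
    for name in ("ydata_profiling", "pandas_profiling"):
        if importlib.util.find_spec(name) is not None:
            return name
    return None


print(installed_profiling_module())
```

If this prints None, the package is not visible to the Python interpreter you are
running, which usually means pip installed it into a different environment.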

. . .

How to create a profiling report

To create a report, load a data set with the standard read_csv() function, which
stores the data in a pandas data frame.

Then call the ProfileReport initializer and pass it the data frame you have just
created.

You can use the to_file() method to export the report so you can inspect it.

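import pandas as pd
from pandas_profiling import ProfileReport

# data_file_name is a placeholder for the path to your own CSV file
df = pd.read_csv(data_file_name)

# build the report from the data frame
report = ProfileReport(df)

# export the report as an HTML file
report.to_file(output_file='output.html')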

. . .

Let’s see a real example.


We are going to create a report for a real data set. We have chosen a data set on
heart disease that can be downloaded from here. This data set is 33 KB in size, has
14 columns, and 303 observations.

Let’s create a report for this data set using pandas profiling.

import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv('heart.csv')

report = ProfileReport(df)

report.to_file(output_file='output.html')

Once you run this code, you should see progress bars for the report generation,
and within a few seconds you should be able to view the full report by opening
the output.html file in your browser.

Yes, it was that easy! The report is ready and you can view it!

Note: the report is saved in the current working directory, which in this example
is also the folder the data was read from.
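If you are not sure where the report ended up, remember that 'output.html' is a
relative path, so it is interpreted relative to the current working directory.
A quick sketch using only the standard library shows the full path:

```python
from pathlib import Path

# Same relative file name as in the example above
output_file = Path("output.html")

# A relative path is interpreted relative to the current working directory
resolved = output_file.absolute()
print(resolved)
```

In the example the data file is also read with a relative path ('heart.csv'), which
is why the report and the data end up side by side.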

. . .

Report structure

Let’s see what is contained in the pandas profiling report.

Overview

In the overview section, we should see three tabs: Overview, Reproduction, and
Warnings.

The Overview tab gives basic information about the data, such as the number of
columns and rows, data size, percentage of missing values, data types, etc.

The Reproduction tab contains information about report creation.

And the Warnings tab includes warnings that have been triggered while producing
the report.
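To get a feeling for what the Overview tab summarizes, here is a rough sketch of
how similar numbers could be computed by hand with pandas. The data frame is made up
for illustration, and the report may calculate some of these values slightly
differently.

```python
import pandas as pd

# Made-up data with two missing cells
df = pd.DataFrame({
    "age": [63, 37, None, 56],
    "sex": ["M", "F", "F", None],
    "chol": [233, 250, 204, 236],
})

n_rows, n_columns = df.shape                          # number of rows and columns
missing_cells = int(df.isnull().sum().sum())          # total number of missing cells
missing_pct = 100 * missing_cells / df.size           # share of missing cells in percent
memory_bytes = int(df.memory_usage(deep=True).sum())  # data size in memory
dtype_counts = df.dtypes.astype(str).value_counts()   # how many columns per data type

print(n_rows, n_columns, missing_cells, round(missing_pct, 1), memory_bytes)
print(dtype_counts)
```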

Variables

This section focuses on a detailed analysis of each variable.

If the variable is continuous, it will display a histogram, and if it is categorical,
it will show a bar chart with the value distribution.

You can see the percentage of missing values for each variable as well.

The picture below shows the analysis for age and sex variables from heart disease
data set.
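The numbers behind those two chart types can be sketched with plain pandas. The
values and the bin edges below are made up for illustration. The report chooses its
own bins.

```python
import pandas as pd

# Made-up values for one continuous and one categorical variable
df = pd.DataFrame({
    "age": [29, 35, 41, 44, 52, 58, 63, 77],
    "sex": ["M", "F", "M", "M", "F", "M", "M", "F"],
})

# Continuous variable: count values per bin, which is what a histogram plots
age_bins = pd.cut(df["age"], bins=[20, 40, 60, 80]).value_counts().sort_index()

# Categorical variable: count each category, which is what a bar chart plots
sex_counts = df["sex"].value_counts()

# Percentage of missing values for each variable
missing_pct = df.isnull().mean() * 100

print(age_bins)
print(sex_counts)
print(missing_pct)
```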

Interactions

The Interactions section focuses on bivariate relationships between numerical
variables.

You can use the tabs to choose the pairs of variables you want to examine. The
picture below shows the relationship between age and cholesterol.

Correlations

This section shows the different types of correlations. You can see the report for
Pearson’s, Spearman’s, Kendall’s, and Phik correlation for numerical variables and
Cramer’s V correlation for categorical variables.
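Here is a hedged sketch of how some of these coefficients could be computed by hand.
Pearson and Spearman come straight from pandas. As far as I know, method="kendall"
is also available there but needs SciPy installed, so I leave it out. The cramers_v
helper is my own and uses the plain textbook formula, so its values may differ from
the report if the package applies a correction. The data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [1, 4, 9, 16, 25],                  # y = x**2: monotonic but not linear
    "group": ["a", "a", "a", "b", "b"],
    "label": ["low", "low", "low", "high", "high"],
})

numeric = df[["x", "y"]]
pearson = numeric.corr(method="pearson").loc["x", "y"]    # linear relationship
spearman = numeric.corr(method="spearman").loc["x", "y"]  # rank-based relationship


def cramers_v(a, b):
    """Uncorrected Cramér's V between two categorical Series."""
    observed = pd.crosstab(a, b).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts if the two variables were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    return float(np.sqrt(chi2 / (n * (min(observed.shape) - 1))))


v = cramers_v(df["group"], df["label"])
print(pearson, spearman, v)
```

Because y grows monotonically with x, Spearman's coefficient is 1 while Pearson's
stays a little below 1, which is a good reminder of why the report shows more than
one correlation type.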

Missing values

This section shows the missing values in the data set, broken down by column.

We can see that our data set has no missing values in any of the columns.

Sample

This section replaces the head() and tail() functions from manual data analysis.
You can see the first and last 10 rows of the data set.
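The manual equivalent is short. In this sketch the data frame is a made-up
placeholder with 30 rows:

```python
import pandas as pd

df = pd.DataFrame({"row_id": range(1, 31)})  # 30 made-up rows

first_rows = df.head(10)   # first 10 rows
last_rows = df.tail(10)    # last 10 rows

print(first_rows)
print(last_rows)
```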

Duplicate rows

This section shows you if there are duplicate rows in the data set. There is actually
one duplicate entry in the heart disease data set and its details are shown in the
screenshot below.
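If you want to find duplicates yourself, pandas has duplicated() and
drop_duplicates(). A small sketch on made-up data (not the heart disease data set):

```python
import pandas as pd

# Made-up data in which the third row repeats the first one
df = pd.DataFrame({
    "age": [38, 45, 38, 52],
    "chol": [175, 210, 175, 199],
})

n_duplicates = int(df.duplicated().sum())        # rows that repeat an earlier row
duplicate_rows = df[df.duplicated(keep=False)]   # every row involved in a duplicate
deduplicated = df.drop_duplicates()              # keeps the first occurrence only

print(n_duplicates)
print(duplicate_rows)
```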

. . .

Disadvantages

In this article, we have talked a lot about the advantages of the pandas profiling
package, but are there any disadvantages? Yes, let’s mention some.

If your data set is very big, it takes a very long time to create a report (it could
be hours in extreme cases).
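One possible workaround, sketched below under my own assumptions, is to profile a
random sample instead of the full data set. The helper names, the default of 10,000
rows, and the output file name are all hypothetical choices, and a sample can of
course hide rare values.

```python
import pandas as pd


def downsample(df, max_rows=10_000, seed=0):
    """Return df unchanged if it is small enough, else a random sample of max_rows rows."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=seed)


def profile_sample(df, output_file="sample_report.html", max_rows=10_000):
    """Create a report for a random sample of df (needs pandas-profiling installed)."""
    # Imported here so that downsample() stays usable without the package
    from pandas_profiling import ProfileReport

    report = ProfileReport(downsample(df, max_rows))
    report.to_file(output_file=output_file)


# Made-up large data frame to show the sampling step
big = pd.DataFrame({"value": range(50_000)})
small = downsample(big, max_rows=1_000)
print(len(small))
```

As far as I know, the package documentation also describes a minimal=True option
for ProfileReport that switches off the most expensive computations, so it is worth
checking the documentation for your installed version.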

We get some basic EDA from the profiling package, and it is a good start for data
analysis, but it is definitely not a complete exploration. Normally we would want to
see more graph types such as boxplots, more detailed bar charts, and other
visualizations and exploration techniques that would reveal quirks of the
particular data set.

Additionally, if you are just starting your data science journey it may be worth
learning how to gather the information included in the report using pandas itself.
This is so you can practice coding and manipulating data!

Otherwise, I think it is a great and very useful package!

. . .

Summary

In this article, we have shown you how to install and use pandas profiling. We even
showed you a quick interpretation of the results.

Download the heart disease data set and try it yourself.

. . .

PS: I am writing articles that explain basic Data Science concepts in a simple
and comprehensible manner. If you liked this article there are some other ones
you may enjoy:

9 pandas visualizations techniques for effective data analysis
Learn how to use line graphs, scatter plots, histograms, boxplots, and a few
other visualization techniques using…
towardsdatascience.com

What are lambda functions in python and why you should start using them right now
A quick guide for beginners to start using lambda functions in python and
pandas.
towardsdatascience.com

Jupyter notebook autocompletion
The best productivity tool for Data Scientist you should be using if you are
not doing it yet…
towardsdatascience.com
