Pandas Profiling and Exploratory Data Analysis With One Line of Code!

Learn how to install and use pandas profiling for automatic EDA
Magdalena Konkiewicz

Jun 10 · 6 min read
Image by Colin Behrens from Pixabay
Introduction
If you are already familiar with the pandas profiling package you will not learn
anything new from this article so you can just skip it now.
However, if you have never heard of it, this may be one of the best productivity tips
regarding data analysis you have been given so far, so hang on.
. . .
Pandas profiling
Pandas profiling is a package that allows you to create an exploratory data analysis
report with minimal effort: a single line of code.
Therefore, if you are a Data Scientist or Analyst who has been doing exploratory
data analysis manually, using pandas profiling will save you a lot of time, effort,
and typing. Do you remember all the repetitive code you use when doing exploratory
data analysis, such as:
info(),
describe(),
isnull(),
corr(),
etc.
You will not have to do it anymore. The pandas profiling package will do it for you
and will create a full summary report of your data.
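For comparison, the manual calls listed above look roughly like this. This is only a sketch on a small made-up data frame; the column names and values are illustrative assumptions, not the article's data.

```python
import pandas as pd

# Hypothetical toy data; the columns and values are made up for illustration.
df = pd.DataFrame(
    {
        "age": [63, 37, 41, 56, 57],
        "chol": [233.0, 250.0, None, 236.0, 354.0],
    }
)

df.info()                                # prints dtypes and non-null counts
summary = df.describe()                  # count, mean, std, min, quartiles, max
missing_per_column = df.isnull().sum()   # number of missing values per column
correlations = df.corr()                 # pairwise Pearson correlation matrix

print(summary)
print(missing_per_column)
print(correlations)
```

Each of these calls answers one question at a time, which is exactly the repetition the report is meant to remove.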
So let’s get started!

. . .
How to install pandas profiling package
The installation of pandas profiling is very easy. You can use the standard pip
command.
pip install pandas-profiling
It should take a minute or two to install the package, after which you are ready to
use pandas profiling from Python.
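If you want to confirm that the installation worked before going further, a quick check along these lines may help. It is a minimal sketch using only the standard library and the module name `pandas_profiling` from this article; newer releases of the project are distributed as `ydata-profiling`, so you may need to adjust the name for your installed version.

```python
import importlib.util

# find_spec returns None when the module cannot be found on the import path.
spec = importlib.util.find_spec("pandas_profiling")
is_installed = spec is not None

print("pandas_profiling installed:", is_installed)
```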
. . .
How to create a profiling report
To create a report, first load a data set with the standard read_csv() function,
which stores the data in a pandas data frame.
Then call the ProfileReport initializer and pass it the data frame you have just
created as a parameter.
Finally, use the to_file() function to export the report so you can inspect it.
import pandas as pd
from pandas_profiling import ProfileReport

# data_file_name is a placeholder for the path to your CSV file
df = pd.read_csv(data_file_name)
report = ProfileReport(df)
report.to_file(output_file='output.html')
. . .
Let’s see a real example.

We are going to create a report for a real data set. We have chosen a data set on
heart disease that can be downloaded from here. This data set is 33 KB in size, has
14 columns, and 303 observations.
Let’s create a report for this data set using pandas profiling.
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.read_csv('heart.csv')
report = ProfileReport(df)
report.to_file(output_file='output.html')
Once you run this code you should see progress bars for the report generation,
and within a few seconds you should be able to view the full report by opening the
output.html file in your browser.
Yes, it was that easy! The report is ready and you can view it!
*** Because output.html is a relative path, the report is saved in the current working
directory, which in this example is also the folder the data was read from.
. . .
Report structure
Let’s see what is contained in the pandas profiling report.
Overview
In the Overview section, we should see three tabs: Overview, Reproduction, and
Warnings.
The Overview tab gives basic information about the data, such as the number of columns
and rows, data size, percentage of missing values, data types, etc.
The Reproduction tab contains information about report creation.
And the Warnings tab includes warnings that were triggered while producing
the report.
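To get a feel for where the Overview numbers come from, here is a rough pandas-only sketch of the same quantities. The data frame is a made-up example, and the report's own calculations may differ in detail.

```python
import pandas as pd

# Hypothetical toy data with one missing value.
df = pd.DataFrame(
    {
        "age": [63, 37, 41, 56],
        "sex": ["M", "F", None, "M"],
    }
)

n_rows, n_columns = df.shape                           # rows and columns
memory_bytes = int(df.memory_usage(deep=True).sum())   # in-memory size in bytes
missing_percent = df.isnull().sum().sum() / df.size * 100  # share of missing cells
dtypes = df.dtypes                                     # data type of each column

print(n_rows, n_columns, memory_bytes, missing_percent)
print(dtypes)
```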
Variables
This section focuses on a detailed analysis of each variable.
If the variable is continuous it will display a histogram, and if it is categorical it
will show a bar chart with the value distribution.
You can see the percentage of missing values for each variable as well.
The picture below shows the analysis for the age and sex variables from the heart
disease data set.
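The numbers behind those two chart types can be approximated by hand. The sketch below uses made-up age and sex values (not the heart disease data) and an arbitrary choice of four bins.

```python
import numpy as np
import pandas as pd

# Hypothetical values; not taken from the heart disease data set.
df = pd.DataFrame(
    {
        "age": [29, 35, 41, 48, 52, 57, 63, 77],
        "sex": ["M", "F", "M", "M", "F", "M", "M", "F"],
    }
)

# Continuous variable: bin the values, as a histogram would.
counts, bin_edges = np.histogram(df["age"], bins=4)

# Categorical variable: count each distinct value, as a bar chart would.
category_counts = df["sex"].value_counts()

# Share of missing values per variable, in percent.
missing_percent = df.isnull().mean() * 100

print(counts, bin_edges)
print(category_counts)
print(missing_percent)
```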
Interactions
The Interactions section focuses on bivariate relationships between numerical variables.
You can use the tabs to choose the pairs of variables you want to examine. The picture
below shows the relationship between age and cholesterol.
Correlations
This section shows different types of correlations. You can see the report for
Pearson's, Spearman's, Kendall's, and Phik correlations for numerical variables and
Cramér's V correlation for categorical variables.
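Some of these coefficients can also be computed directly in pandas. The sketch below covers only Pearson and Spearman on made-up data; `method="kendall"` is accepted by the same call but may require SciPy, and Phik and Cramér's V are not part of pandas itself.

```python
import pandas as pd

# Hypothetical data: y grows with x, but not linearly (y = x squared).
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 4, 9, 16, 25]})

pearson = df.corr(method="pearson")    # measures linear association
spearman = df.corr(method="spearman")  # measures monotonic (rank) association

print(pearson)
print(spearman)
```

Because y always increases with x, the Spearman coefficient is 1, while the Pearson coefficient is slightly lower since the relationship is not a straight line. This is why it is useful that the report shows several correlation types side by side.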
Missing values
This section shows missing values in the data set, broken down by column.
We can see that our data set has no missing values in any of the columns.
Sample
This section replaces the head() and tail() functions from manual data analysis.
You can see the first and last 10 rows of the data set.
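The manual equivalent is short. Here is a sketch on a made-up 30-row data frame; `row_id` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical data frame with 30 rows.
df = pd.DataFrame({"row_id": range(1, 31)})

first_rows = df.head(10)  # first 10 rows
last_rows = df.tail(10)   # last 10 rows

print(first_rows)
print(last_rows)
```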
Duplicate rows
This section shows you if there are duplicate rows in the data set. There is actually
one duplicate entry in the heart disease data set and its details are shown in the
screenshot below.
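If you want to run the same check by hand, pandas can flag duplicates directly. The rows below are made up for illustration and are not the actual duplicate from the heart disease data.

```python
import pandas as pd

# Hypothetical rows; the first and third are identical.
df = pd.DataFrame(
    {
        "age": [38, 45, 38, 60],
        "chol": [175, 230, 175, 210],
    }
)

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(df.duplicated().sum())

# keep=False marks all copies, so you can inspect them side by side.
duplicate_rows = df[df.duplicated(keep=False)]

print(n_duplicates)
print(duplicate_rows)
```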
. . .
Disadvantages
In this article, we have talked a lot about the advantages of the pandas profiling
package, but are there any disadvantages? Yes, let's mention some.
If your data set is very big, it takes a very long time to create a report (this could
be hours in extreme cases).
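One possible workaround, which the original article does not cover, is to profile a random sample instead of the full data. The sketch below shows only the sampling step in pandas; the sizes and the seed are arbitrary assumptions.

```python
import pandas as pd

# Hypothetical large data frame standing in for a big data set.
df = pd.DataFrame({"value": range(100_000)})

# Draw a reproducible random sample of 1,000 rows.
sample = df.sample(n=1_000, random_state=42)

# The sample can then be passed to ProfileReport in place of the full data frame.
print(len(df), len(sample))
```

The library's documentation also describes a minimal mode, `ProfileReport(df, minimal=True)`, that skips the most expensive computations. Check that it is available in your installed version before relying on it. Keep in mind that a sample can miss rare values and duplicates, so treat the resulting report as a first look.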
The profiling package gives us some basic EDA, which is a good start for data
analysis but definitely not a complete exploration. Normally we would look at more
graph types, such as boxplots and more detailed bar charts, along with other
visualization and exploration techniques that reveal the quirks of the
particular data set.
Additionally, if you are just starting your data science journey, it may be worth
learning how to gather the information included in the report using pandas itself,
so that you can practice coding and manipulating data!
Otherwise, I think it is a great and very useful package!
. . .
Summary
In this article, we have shown you how to install and use pandas profiling. We even
showed you a quick interpretation of the results.
Download the heart disease data set and try it yourself.
. . .
PS: I am writing articles that explain basic Data Science concepts in a simple
and comprehensible manner. If you liked this article there are some other ones
you may enjoy:
9 pandas visualizations techniques for effective data analysis
Learn how to use line graphs, scatter plots, histograms, boxplots, and a few
other visualization techniques using…

What are lambda functions in python and why you should start using them right now
A quick guide for beginners to start using lambda functions in python and
pandas.

Jupyter notebook autocompletion
The best productivity tool for Data Scientist you should be using if you are
not doing it yet…
Written by Magdalena Konkiewicz
Data Scientist, NLP and ML enthusiast and educator.
Published in Towards Data Science, a Medium publication sharing concepts, ideas, and codes.