
A Basic Guide to Initial and Exploratory Data Analysis

With a few examples in Python

Introduction
A data analyst is defined differently in different work setups. In a real scenario, a data
analyst might be contributing to all kinds of work, including MIS, reporting, data
engineering, database management, and so on. That is not necessarily bad. But here, what
we'll be talking about is the actual job of a data analyst, not what companies make them do.
A data analyst literally means a person who analyses data: not someone who engineers
pipelines to get it flowing, not someone who maintains databases. Just the analysis. Find
out what the data means. Identify trends. Support the business teams in making better
decisions and improving overall efficiency. It's a role that leans more towards hardcore
statistics than hardcore computer science.

With that in mind, it is useful for a data analyst to go through the following two steps
before setting out to find what the data means:

• Finding out useful information just by looking at different data points and their
availability and distribution in the given dataset.
• Finding out how different variables correlate with each other on the basis of their
availability. One can also examine the quality of the data, which is where our first step
comes in.

This is not a comprehensive writeup about all the methodologies that can
be used for performing these two steps. It’s just a basic overview of some
of them.
Photo by Isaac Smith on Unsplash. Hopefully, this is not the analysis you'll be doing as a data analyst. By
the way, isn't this a good graph? :)

Initial Data Analysis


Visualise the Completeness of a Data Set
Using missingno. Data incompleteness is a major problem in both machine-based and
human-based data collection; both methods are prone to error. Incomplete data is less of a
problem if it is identified before analysing a given dataset, but it can be quite a disaster
if decisions are made on incomplete data. Food for thought: think about cases where no data
is better than incomplete data for arriving at a reasonably sane conclusion or inference.

The idea is to check for the incompleteness of data

In the following plot, it is quite visible that data for db and tp is scarcely available in
the dataset, while data for tb, sf and gen is abundant.
Intentionally Distorted Liver Patient Data taken
from https://archive.ics.uci.edu/ml/datasets/ILPD+(Indian+Liver+Patient+Dataset)

Get the above plot using:


import pandas as pd
import missingno as mn
from IPython import get_ipython

# Enable inline plotting inside a Jupyter notebook
get_ipython().run_line_magic('matplotlib', 'inline')

# Plot the missing-value matrix for a random sample of 500 rows
df = pd.read_csv('/Users/mac/Downloads/missingno-ex.csv')
mn.matrix(df.sample(500), figsize=(10, 6))

The idea is to identify correlation between the availability of
data for two or more fields

For large datasets, take a random sample from the dataset and plot it to identify how
frequently data is missing. Also, plot a dendrogram to identify the correlation between the
availability of data for two or more fields, using the following line of code:
mn.dendrogram(df, figsize=(10,6))
To identify pair dissimilarity between two variables, plot a Dendrogram. To know more, please read
— http://www.nonlinear.com/support/progenesis/comet/faq/v2.0/dendrogram.aspx

And moving towards simplicity, visualise the data using a bar chart, with the absolute
number of records containing data displayed for each variable.
mn.bar(df, figsize=(10,6))

Absolute Numbers

Now that we have some idea about the distribution and availability of the data, we can go
on and explore it.
The first thing one might wonder is what the dataset actually looks like. To get a hint,
one might use head(n) to print out the first n records in the dataset. NaN means that the
data is not available. We'll come back to the NaN values at a later point.

Using pandas data frame to explore the data
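
For instance, a minimal sketch of that first look (assuming the same CSV path used earlier
in this article):

import pandas as pd

# Load the dataset used throughout this article
df = pd.read_csv('/Users/mac/Downloads/missingno-ex.csv')

# Print the first 10 records; missing values appear as NaN
print(df.head(10))

# A quick count of missing values per column
print(df.isnull().sum())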

Exploratory Data Analysis


And now that we have some idea of what our data looks like, it's time to explore it using
Seaborn.
The idea is to see how two given variables in a
dataset are related to each other

After looking at the data, the first analytical question that comes to mind is: how are two
variables in this dataset related to each other? Or, how does one variable change its value
when another variable changes? Or, does a relationship exist between these two variables
at all?

Even if finding a linear regression fit line is not the motive, one might still want to explore the data
for the value it carries. It's important to know whether a relationship exists between two variables
(correlation) and, if it does, how the value changes (regression).
Get the above chart using this simple code:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_context("notebook", font_scale=1.1)
sns.set_style("ticks")

# Scatter plot of alba against tp with a fitted regression line
sns.lmplot(x='alba', y='tp',
           data=df,
           fit_reg=True,
           scatter_kws={"marker": "D", "s": 10})
plt.show()

Along with the contents of the previous chart, to visualise the distribution of values of
the variables plotted in it, one can use sns.jointplot instead of sns.lmplot, with the
following line of code:
sns.jointplot(x='alba', y='tp', data=df, kind='reg')

Read more about pearsonr — https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.pearsonr.html
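
As a minimal sketch, the same coefficient can also be computed directly with scipy
(assuming the df and the 'alba' and 'tp' columns used above; rows with missing values are
dropped first because pearsonr does not accept NaN):

from scipy.stats import pearsonr

# Keep only rows where both columns have values
pair = df[['alba', 'tp']].dropna()

# Returns the correlation coefficient and the two-tailed p-value
r, p_value = pearsonr(pair['alba'], pair['tp'])
print('Pearson r = %.3f, p-value = %.4f' % (r, p_value))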

Let us look into one of the most useful EDA methods for visualising data
— Box Plots.
The idea is to study the distribution of a variable in a
dataset

Box plots are essentially used for studying the distribution of a variable in a dataset.
Histograms can also be used for that, but box plots provide a better summary of the data in
some cases: a box plot tells us about the shape, variability and centre of the data. When we
only want to know which side the data is skewed towards, though, a histogram can serve our
purpose and spare us the other details.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/Users/mac/Downloads/missingno-ex.csv')
sns.factorplot(x='alba', y='ab/gr', data=df, kind='box', aspect=1)
plt.show()
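
If only the skew is of interest, a plain histogram of one of these variables is enough.
A minimal sketch, assuming the same df and the 'ab/gr' column:

import matplotlib.pyplot as plt

# Drop missing values so the histogram reflects only observed data
plt.hist(df['ab/gr'].dropna(), bins=20)
plt.xlabel('ab/gr')
plt.ylabel('count')
plt.show()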

Know more about Box Plots — 1. https://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/ and 2. https://www.nature.com/nmeth/journal/v11/n2/pdf/nmeth.2813.pdf
The same data that has been visualised above can also be visualised as a factor-point plot
in Seaborn.
sns.factorplot(x='alba', y='ab/gr', data=df, kind='point', aspect=1)

Cleaning the data. Obviously, this needs to be done before we start our analysis and is part
of prepping the data for use; hence, it belongs to the Initial Data Analysis phase rather
than EDA.

While we used the missingno library to visualise where data was missing, we did not clean
the data for usage. The first part is removing nulls from the dataset, and then removing
the duplicates.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/Users/mac/Downloads/missingno-ex.csv')

# Cleaning the data now: drop rows with missing values, then drop duplicates
df = df.dropna().drop_duplicates()

sns.factorplot(x='alba', y='ab/gr', data=df, kind='box', aspect=1)
plt.show()

This is what the data looks like after it has been cleaned with dropna() and drop_duplicates(). Notice the
missing indices on the leftmost column — which is, by the way, not a part of the dataset. It just represents
the row number.
After cleaning up, this is the data that remains for use.
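
It is also worth checking how much data the cleaning step throws away, so that the loss is a
conscious choice rather than a silent one. A small sketch, run before the
dropna()/drop_duplicates() line above, assuming the same df:

# Count missing values per column and duplicate rows before cleaning
print(df.isnull().sum())
print('duplicate rows:', df.duplicated().sum())

rows_before = len(df)
df = df.dropna().drop_duplicates()
print('rows dropped:', rows_before - len(df))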

This was a basic introduction to a couple of examples that can be used in the first two
steps of the process of analysing data, namely:

1. Initial Data Analysis — initial explorations about the nature of data and how it has
been collected. This part involves cleaning and munging of data so that it's useful
for analysis.
2. Exploratory Data Analysis — involves the full exploration, mostly
by visual methods some of which are mentioned above.
3. Modelling — creating a model for the given data and establishing relationships
between different variables, with a training data set (a minimal sketch of steps 3 to 5
follows this list).
4. Validation — checking if the model works for the data set that was not used for
training, i.e., the test dataset. If the model is valid, you go on to the next step;
otherwise, you go back and improve the model.
5. Prediction — if the model is validated, it means you can say with a certain
confidence and a good probability how one variable in the dataset will change with
respect to another. This is the endgame.
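
As referenced in the list above, here is a minimal sketch of steps 3 to 5. It assumes
scikit-learn (which this article does not cover) and reuses the cleaned df with the 'alba'
and 'tp' columns purely for illustration:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Keep only complete pairs of the two columns used earlier
data = df[['alba', 'tp']].dropna()
X = data[['alba']]
y = data['tp']

# Step 3 (Modelling): fit a model on a training split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)

# Step 4 (Validation): score the model on the held-out test split
print('R^2 on test data:', model.score(X_test, y_test))

# Step 5 (Prediction): predict tp for unseen values of alba
print(model.predict(X_test[:5]))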
Conclusion
The first two steps are important, especially if there's a lack of trust in the data. This
lack of trust can arise for various reasons: how the data was collected, what the source of
the data is, whether the data is biased, whether it has been intentionally tampered with,
and so on. By performing the first two steps, i.e., initial data analysis and exploratory
data analysis, it becomes easy to identify issues with the dataset and to avoid spending
time making false and erroneous models and predictions.

Happy Learning…..

Sunil Arava
