Exploratory Data Analysis-1
Introduction
A data analyst is defined differently in different work setups. In a real
scenario, a data analyst might contribute to all kinds of work, including
MIS, reporting, data engineering and database management.
That’s not necessarily bad. But here, what we’ll be talking about is the actual
job of a data analyst, not what companies make them do. A data analyst
literally means a person who analyses data: not someone who engineers
pipelines to get it flowing, and not someone who maintains databases. Just
the analysis. Find out what the data means. Identify trends. Support the
business teams in making better decisions and improving overall efficiency.
It’s a role that leans more towards hardcore statistics than hardcore
computer science.
With that in mind, it is useful for a data analyst to go through two steps
before setting out to find what the data means.
This is not a comprehensive writeup about all the methodologies that can
be used for performing these two steps. It’s just a basic overview of some
of them.
Photo by Isaac Smith on Unsplash. Hopefully, this is not the analysis you’ll be doing as a data analyst. By
the way, isn’t this a good graph! :)
The idea is to identify correlation between the availability of data for two or more fields
For large datasets, take a random sample from the dataset and plot it to
see how frequently the data is missing. Also, plot a dendrogram to identify
the correlation between the availability of data for two or more fields,
using the following code:
import missingno as mn  # the missingno library, referred to as mn throughout
mn.dendrogram(df, figsize=(10,6))
To identify pairwise dissimilarity between variables, plot a dendrogram. To know more, please read
http://www.nonlinear.com/support/progenesis/comet/faq/v2.0/dendrogram.aspx
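The availability relationship that the dendrogram shows can also be checked numerically. The following is a minimal sketch using only pandas, on a small made-up DataFrame (the columns a, b and c are hypothetical). It takes a random sample, computes the share of missing values per column, and correlates the missingness of the columns. It is a related check, not a reproduction of what the dendrogram computes.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'b' is missing exactly where 'a' is missing, 'c' is complete.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [10.0, np.nan, 30.0, np.nan, 50.0, 60.0],
    "c": [1, 2, 3, 4, 5, 6],
})

# For a large dataset, work on a reproducible random sample instead of every row.
sample = df.sample(n=min(len(df), 1000), random_state=0)

# Fraction of missing values per column.
missing_share = sample.isna().mean()

# Correlate the missingness masks of the columns that have any gaps;
# a value near 1 means two columns tend to be missing together.
mask = sample.isna()
nullity_corr = mask.loc[:, mask.any()].astype(int).corr()

print(missing_share)
print(nullity_corr)
```

Columns with no missing values are left out of the correlation because their mask is constant and a correlation with it is undefined.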
Moving towards something simpler, visualise the data using a bar chart that
displays, for each variable, the absolute number of records that contain
data.
mn.bar(df, figsize=(10,6))
Absolute Numbers
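If only the numbers are needed, the per-variable counts can be read straight from pandas. A minimal sketch on made-up data (the columns x, y and z are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0],
    "y": [np.nan, np.nan, 3.0, 4.0],
    "z": ["a", "b", "c", "d"],
})

# Number of non-missing values per column.
available = df.count()

# The same information as a share of all rows.
available_share = available / len(df)

print(available)
print(available_share)
```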
Now that we have some idea about the distribution and availability of
data, we can go on and explore the data.
The first thing one might wonder is what the dataset actually looks
like. To get a hint, one can use head(n) to print out the
first n records in the dataset. NaN means that the data is not
available. We’ll come to the NaN values at a later point.
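As a quick illustration of head(n) and of how NaN shows up in the preview, here is a minimal sketch on a made-up DataFrame (the columns id and score are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; np.nan marks values that are not available.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "score": [0.5, np.nan, 0.7, 0.9, np.nan, 0.2],
})

# First three records; head() without an argument returns the first five.
first_rows = df.head(3)
print(first_rows)

# Rows in the preview that contain at least one NaN.
rows_with_nan = first_rows[first_rows.isna().any(axis=1)]
print(rows_with_nan)
```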
After looking at the data, the first analytical question that comes to mind
is: how are two variables in this dataset related to each other? Or, how
does one variable change its value when some other variable is changed?
Or, does a relationship exist between these two variables at all?
Even if fitting a linear regression line is not the goal, one might still want to explore the data
for the value it carries. It’s important to know whether a relationship exists between two variables
(correlation) and, if it does, how the value changes (regression).
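Get the above chart using this simple code:
import matplotlib.pyplot as plt
import seaborn as sns

# df is the DataFrame that holds the dataset
sns.set_context("notebook", font_scale=1.1)
sns.set_style("ticks")
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.lmplot(x='alba', y='tp',
           data=df,
           fit_reg=True,
           markers="D",  # diamond markers; a marker set in scatter_kws is overridden
           scatter_kws={"s": 10})
plt.show()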
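The chart answers the correlation and regression questions visually. As a numeric counterpart, here is a minimal sketch using pandas and NumPy on made-up data (the columns x and y are hypothetical and stand in for any two numeric variables in the dataset). It uses the Pearson coefficient and an ordinary least-squares line, which is one common choice rather than the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data lying exactly on the line y = 2x + 1.
df = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0, 4.0]})
df["y"] = 2.0 * df["x"] + 1.0

# Correlation: does a linear relationship exist, and how strong is it?
r = df["x"].corr(df["y"])  # Pearson correlation coefficient

# Regression: how does y change when x changes?
slope, intercept = np.polyfit(df["x"], df["y"], deg=1)

print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

With real data, r will rarely be exactly 1 or -1, and a value near 0 only rules out a linear relationship, not every kind of relationship.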
To also visualise the distribution of each of the two variables alongside
the contents of the previous chart, one can use sns.jointplot rather
than sns.lmplot. The following code assumes tips is the example dataset of
that name provided by seaborn:
tips = sns.load_dataset("tips")  # fetched online on first use
sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg")
Let us look into one of the most useful EDA methods for visualising data
— Box Plots.
The idea is to study the distribution of a variable in a dataset
Box plots are essentially used for studying the distribution of a variable
in a dataset. Histograms can also be used for that, but box plots provide
a better summary of the data in some cases. A box plot tells us about the
shape, variability and centre of the data. However, when we only want to
know which side the data is skewed towards, a histogram can serve our
purpose and spare us the other details.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/Users/mac/Downloads/missingno-ex.csv')
# factorplot was renamed to catplot in seaborn 0.9 and is no longer available in recent versions
sns.catplot(x='alba', y='ab/gr', data=df, kind='box', aspect=1)
plt.show()
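The quantities a box plot draws can also be computed directly. A minimal sketch on a made-up series of values follows. The 1.5 × IQR rule for flagging outliers is the common convention and is assumed here, and pandas interpolates quartiles linearly by default, so other tools may give slightly different quartile values.

```python
import pandas as pd

# Hypothetical right-skewed values with one extreme point.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])

# Box edges and centre line: the quartiles.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # variability of the middle half of the data

# Common convention: points beyond 1.5 * IQR from the box are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Positive skewness indicates a longer right tail.
skewness = values.skew()

print(q1, median, q3, iqr)
print(outliers.tolist(), skewness)
```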
Cleaning the data obviously needs to be done before we start our
analysis and is part of prepping the data for use. Hence, it is a part of
the Initial Data Analysis phase rather than EDA.
While you used the missingno library to visualise where the data was
missing, you did not clean the data for use. The first part is
removing nulls from the dataset, and then removing the duplicates.
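import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('/Users/mac/Downloads/missingno-ex.csv')
# Cleaning the data now: drop rows with nulls, then drop duplicate rows
df = df.dropna().drop_duplicates()
# factorplot was renamed to catplot in seaborn 0.9 and is no longer available in recent versions
sns.catplot(x='alba', y='ab/gr', data=df, kind='box', aspect=1)
plt.show()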
This is what the data looks like after it has been cleaned with dropna() and drop_duplicates(). Notice the
missing indices in the leftmost column, which, by the way, is not a part of the dataset. It just represents
the row number.
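To see where the gaps in the row numbers come from, here is a minimal sketch on a made-up DataFrame (the columns a and b are hypothetical). The reset_index step at the end is optional and is not part of the original cleaning code.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 has a null, row 3 duplicates row 2.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 3.0, 5.0],
    "b": ["x", "y", "z", "z", "w"],
})

# Same cleaning chain as above: drop nulls first, then duplicates.
cleaned = df.dropna().drop_duplicates()

# The surviving rows keep their original labels, so the index has gaps.
print(cleaned.index.tolist())

# Optional: renumber the rows from 0 once the gaps are no longer informative.
renumbered = cleaned.reset_index(drop=True)
print(renumbered)
```

Keeping the original labels can be useful for tracing a cleaned row back to the raw file, which is a reason to renumber only when that link is no longer needed.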
After cleaning up, this is the data that remains for use.
Happy Learning…..
Sunil Arava