EDA Cheatsheet - Class Note

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
This file is meant for personal use by shinde.mayur17@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action.
This cheatsheet contains the syntax of the code used in the EDA content.
Assume we have a dataframe named df whose variables (columns) are named
variable_name1, variable_name2 and so on.
DESCRIBING DATA AND DATA PRE-PROCESSING

To check the version of a library

print(libraryname.__version__)
For example: let's check the numpy and pandas versions
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
print(np.__version__)
print(pd.__version__)
To import the file
df = pd.read_csv('filename.csv')
## here the file is of type csv
To get the top 5 and bottom 5 records

df.head()
df.tail()
To check the number of rows and the number of columns

df.shape[0]  ## number of rows
df.shape[1]  ## number of columns
To check the data types, column names and
count of non-null values (information about the data)

df.info()
To understand the descriptive statistics of the data,
such as count, mean, std, min, 25%, 50%, 75% and max
df.describe()
To check the statistics of one particular column

df['variable_name1'].describe()
To check the unique values in any variable

df['variable_name1'].unique()
## use df['variable_name1'].nunique() to get the count of unique values
To replace special characters (here, say, '?') in the data if they
result in noise or incorrect information

df['variable_name1'] = df['variable_name1'].replace('?', np.nan)
To change the datatype to float

df['variable_name1'] = df['variable_name1'].astype('float64')
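The two steps above (replacing a placeholder and then changing the datatype) can be tried on a small made-up column. This is only a sketch: the toy values are assumptions for illustration and are not from the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a numeric column that was read as text
# because missing entries were recorded as '?'
df = pd.DataFrame({"variable_name1": ["10.5", "?", "7", "?", "3.25"]})

# Step 1: replace the placeholder with NaN
df["variable_name1"] = df["variable_name1"].replace("?", np.nan)

# Step 2: cast the cleaned column to float
df["variable_name1"] = df["variable_name1"].astype("float64")

print(df["variable_name1"].dtype)           # datatype after conversion
print(df["variable_name1"].isnull().sum())  # number of missing values created
```

Doing the replacement first matters: casting a column that still contains '?' to float would raise an error.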
HOW TO TREAT ANOMALIES IN THE DATA

To replace a value (here, say, -1) with a value
taken from a particular column:
New_variable_name = df[df['variable_name1'] < condition]['variable_name2'].values[0]
## here the variable_name1 value is to be replaced
## with the variable_name2 value
df['variable_name1'] = df['variable_name1'].replace(to_replace=-1, value=New_variable_name)
To replace with the mean()
df['variable_name1'] = df['variable_name1'].replace(-1, df['variable_name1'].mean())
To replace a value in a categorical variable, use the mode()

df['variable_name1'] = df['variable_name1'].replace('?', df['variable_name1'].mode()[0])
To check the number of occurrences of each value in a column

df['variable_name1'].value_counts()
To check the complete row at a given location

df.iloc[location_number]
To check missing values

df.isnull().sum()
Imputation using the mean() in case no
outliers are present
df.variable_name1 = df.variable_name1.fillna(df.variable_name1.mean())
How to create dataframes by data type from the given data
df_num = df.select_dtypes(['float64', 'int64'])
df_cat = df.select_dtypes(['object'])
IMPUTATION
Libraries that can be used for imputation:

sklearn.impute
Now, let's impute the median() in the numeric columns
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imr = imputer.fit(df_num)
df_num = pd.DataFrame(imr.transform(df_num), columns=df_num.columns)
The same can be done for df_cat, using strategy='most_frequent'
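For comparison, median imputation can also be sketched with pandas alone, using fillna instead of SimpleImputer. This is a different route to the same kind of result, not the cheatsheet's method, and the column names and values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric dataframe with one missing value per column
df_num = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [50.0, 60.0, np.nan, 1000.0],
})

# df_num.median() gives one median per column (NaN is skipped);
# fillna then fills each column with its own median
df_num_imputed = df_num.fillna(df_num.median())

print(df_num_imputed)
```

The income column shows why the median is often preferred when outliers are present: the value 1000 would pull a mean-based fill far above the typical values.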
How to concatenate two dataframes (with a join)

df_new = pd.concat([df_num, df_cat], axis=1, join='inner')
How to drop null values
df.dropna()
To check and drop duplicates

dups=df.duplicated( )
df.drop_duplicates(inplace=True)
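A minimal sketch of the duplicate check on a made-up dataframe (the columns id and city are assumptions for illustration). Summing the boolean result of duplicated() gives the number of repeated rows.

```python
import pandas as pd

# Hypothetical data in which the row (2, "Delhi") appears twice
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "city": ["Pune", "Delhi", "Delhi", "Pune"],
})

dups = df.duplicated()   # True for every repeat of an earlier row
n_dups = int(dups.sum()) # count of duplicate rows

# Drop the repeats and renumber the index
df_unique = df.drop_duplicates().reset_index(drop=True)

print(n_dups)
print(df_unique)
```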
DATA VISUALIZATION

Univariate Analysis
Pick any number of fields to perform univariate analysis
df[['variable_name1', 'variable_name2']].describe()
Now, to create any plot, it is suggested to first prepare the drawing board
with an appropriate figsize so that the plot is clearly visible and understandable,
and then create subplots, which let us choose how many plots are shown in
rows/columns
fig, axes = plt.subplots(nrows=2, ncols=2)
## any number as per choice can be given in brackets
fig.set_size_inches(10, 8)
## any number as per choice can be given in brackets
plt.show()  ## to see the plots
Plot the columns:

To create a histplot or boxplot we use the seaborn library.
sns.histplot(df['variable_name1'], kde=True, ax=axes[0][0])
sns.boxplot(x='variable_name1', data=df, ax=axes[0][1])
## kde is used to see the shape of the distribution
## ax=axes[ ][ ] defines the row and column of the drawing
## board where the plot will be displayed
## sns.displot can be used instead of histplot, but it creates its own figure and does not accept ax
• In case of categorical variables, it is important to
understand the frequency of the different levels
• Now, to view the count/frequency of a categorical field
as a proportion (multiply by 100 for a percentage), use normalize:
df['variable_name3'].value_counts(normalize=True)
To check the count of a categorical field in the form of a plot, use:

sns.countplot(x='variable_name3', data=df, palette='pastel')
## where palette defines the colors used for the bars
Bivariate Analysis
Here, let's analyze using 2 variables:
Both are numeric variables
Use a scatterplot
plt.scatter(df['variable_name1'], df['variable_name2'])
Both are categorical variables

In this case we use countplot
sns.countplot(x='variable_name3', hue='variable_name4', data=df)
The crosstab function can be used to see information
about two categorical variables
pd.crosstab(df['variable_name3'], df['variable_name4'], margins=True, normalize=True)
## margins adds the row and column totals; normalize=True shows proportions of the grand total
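To make the effect of margins and normalize concrete, here is a small sketch on made-up data (four rows, two categorical columns; the values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical data: two categorical variables, four observations
df = pd.DataFrame({
    "variable_name3": ["a", "a", "b", "b"],
    "variable_name4": ["x", "y", "x", "x"],
})

# Each cell is a share of all observations;
# the "All" row and column hold the totals
ct = pd.crosstab(df["variable_name3"], df["variable_name4"],
                 margins=True, normalize=True)

print(ct)
```

Here two of the four rows are (b, x), so that cell is 0.5, and the bottom-right "All" cell is 1.0 because the proportions cover the whole dataset.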
One is numeric and the other is categorical

We can use a boxplot
sns.boxplot(x='categorical_variable', y='numeric_variable', data=df)
To capture all numeric fields and check their combinations (put them in pairs)
sns.pairplot(df)
Another plot to check a numeric-to-numeric relationship is the heatmap
(correlation)
plt.figure(figsize=(15, 15))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f')
## numeric_only=True (available in recent pandas versions) restricts the correlation to numeric columns
where .2f shows only 2 decimal places; annot writes the correlation values
inside the heatmap cells
Lastly, with the default colormap, lighter cells show high positive correlation and the
darkest cells show the lowest (most negative) values. A value near 0 means little or no
linear relationship, while values near +1 or -1 mean a strong relationship.
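The numbers behind a correlation heatmap come from df.corr(). The sketch below uses made-up columns (the names up, down and flat are assumptions for illustration) to show the three typical cases: close to +1, close to -1 and close to 0.

```python
import pandas as pd

# Hypothetical data built so that the relationships are known in advance
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
df["up"] = 2 * df["x"] + 1                 # rises with x: correlation +1
df["down"] = -3 * df["x"]                  # falls as x rises: correlation -1
df["flat"] = [1.0, -2.0, 2.0, -2.0, 1.0]   # no linear relation with x

corr = df.corr()  # Pearson correlation between every pair of columns
print(corr.round(2))
```

A strong negative value such as -1 is still a strong relationship; it is the values near 0 that indicate no linear relationship.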
Multivariate analysis is used where more than 2 variables are involved.
We can put one categorical variable on the x-axis and one continuous
variable on the y-axis. The 3rd variable is
represented by colors.
sns.boxplot(x='categorical_variable1', y='numeric_variable', hue='categorical_variable2', data=df)
In case we have many categorical levels we can use FacetGrid
a = sns.FacetGrid(df, col='categorical_variable1', hue='categorical_variable2', col_wrap=3, height=3)
where col_wrap=3 will show three plots in a row
To plot the data for any particular pair of numeric variables

a.map(plt.scatter, 'numeric_variable_name1', 'numeric_variable_name2')
a.add_legend()
DATA PREPARATION

SCALING: Scaling helps us give the same weightage to all numeric
variables irrespective of the values they contain.
To collect all columns of numeric data type together and all of categorical data type
together, we can create lists and use a for loop:
cat = []  # categorical list
num = []  # numeric list
# the for loop below searches for all columns with datatype object and
# appends them to the cat list; every other datatype is treated as numeric
# and appended to the num list
for i in df.columns:
    if df[i].dtype == 'object':
        cat.append(i)
    else:
        num.append(i)
# to see the lists, print them
print(cat)
print(num)
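As an alternative to the loop, the same two lists can be sketched with select_dtypes. Note the assumption: this version treats only true numeric dtypes as numeric, so columns such as booleans or dates would land in the cat list, unlike the loop above. The toy columns are made up for illustration.

```python
import pandas as pd

# Hypothetical dataframe with one text column and two numeric columns
df = pd.DataFrame({
    "city": ["Pune", "Delhi"],
    "age": [25, 40],
    "income": [50.5, 60.0],
})

# Numeric columns (int and float dtypes)
num = df.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical
cat = [col for col in df.columns if col not in num]

print(cat)
print(num)
```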
METHODS / TECHNIQUES FOR SCALING

Z-SCORE: How to apply zscore to numeric data
for scaling or standardizing the data
We import zscore from the scipy.stats library
from scipy.stats import zscore
data_scaled = df[num].apply(zscore)
# here data_scaled is created as a new dataframe so that a comparison can be
# made with the original dataframe. The zscore function is applied to
# all numeric fields in the dataframe
Note: when we apply zscore it centers the data: the mean is close to 0 and
the standard deviation is 1. Also, the scale is changed (i.e. the values change)
to a range which is comparable across variables.
We can plot a histplot on both dataframes, one before scaling and
the other after scaling, using the histplot syntax shown under Univariate Analysis.
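What zscore computes can be written out by hand with pandas, which makes the note about mean 0 and standard deviation 1 easy to verify. This is a sketch on made-up columns. One detail worth knowing: scipy.stats.zscore uses the population standard deviation (ddof=0) by default, while pandas .std() defaults to ddof=1, so ddof=0 is passed explicitly here.

```python
import pandas as pd

# Hypothetical numeric data on two very different scales
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})
num = ["a", "b"]

# z = (x - mean) / standard deviation, column by column
data_scaled = (df[num] - df[num].mean()) / df[num].std(ddof=0)

print(data_scaled.mean().round(6))       # close to 0 for every column
print(data_scaled.std(ddof=0).round(6))  # 1 for every column
```

After scaling, columns a and b hold identical values even though b was ten times larger, which is the sense in which scaling gives every variable the same weightage.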
Scaling can be done using the StandardScaler class as well. Import it
from the library, create an object, fit it on the numeric dataframe and
transform the data. Once done, the data will be in the form of an array, which needs
to be converted back to a dataframe.
Below is the code to do this:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(df[num])
data_standard = scaler.transform(df[num])
data_standard = pd.DataFrame(data_standard, columns=df[num].columns)
data_standard.describe()
Another method is MinMax scaling (where the minimum value becomes
0, the maximum becomes 1 and all other values fall between 0 and 1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(df[num])
# created the object and fit the numeric fields
data_minmax = scaler.transform(df[num])
# transformed it; the scaled data is now available
data_minmax = pd.DataFrame(data_minmax, columns=df[num].columns)
# create a dataframe to hold the scaled data
data_minmax.describe()
Note: the scaling technique is chosen based on the needs/requirements of the algorithm.
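The MinMax idea reduces to one formula, (x - min) / (max - min), which can be sketched directly in pandas. The columns height and income and their values are made up for illustration.

```python
import pandas as pd

# Hypothetical numeric data on two different scales
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 190.0],
    "income": [20000.0, 30000.0, 100000.0, 60000.0],
})
num = ["height", "income"]

# Subtract each column's minimum, then divide by its range
col_min = df[num].min()
col_range = df[num].max() - col_min
data_minmax = (df[num] - col_min) / col_range

print(data_minmax)
```

Every column now runs from 0 to 1, but unlike the z-score the shape of each distribution relative to its own minimum and maximum is kept, so a single extreme value can squeeze the rest of the column toward 0.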
TRANSFORMATION

Check for skewness/kurtosis in the data and
apply a transformation accordingly.
To check skewness / kurtosis:
df['variable_name'].skew()
# helps us understand how symmetric the distribution is
df['variable_name'].kurtosis()
# helps us understand the sharpness/peak of the distribution
# (pandas reports excess kurtosis, which is 0 for a normal distribution)
As a rule of thumb, for an approximately normal distribution the skewness and kurtosis should be between -1 and +1.

Several transformation techniques can improve the skewness/kurtosis
Log transformation
print(np.log(df['variable_name']).skew())
print(np.log(df['variable_name']).kurtosis())
To see the transformation on a plot:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
sns.histplot(np.log(df['variable_name']), kde=True, ax=axs[0])
sns.boxplot(x=np.log(df['variable_name']), ax=axs[1])
Sqrt transformation
print(np.sqrt(df['variable_name']).skew())
print(np.sqrt(df['variable_name']).kurtosis())
To see the transformation on a plot:

sns.histplot(np.sqrt(df['variable_name']), kde=True, ax=axs[0])
sns.boxplot(x=np.sqrt(df['variable_name']), ax=axs[1])
Distribution means more spread
Using the 10th-root transformation
print((df['variable_name']**0.1).skew())
print((df['variable_name']**0.1).kurtosis())
fig_dims = (10, 5)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=fig_dims)
sns.histplot(df['variable_name']**0.1, kde=True, ax=axs[0])
sns.boxplot(x=df['variable_name']**0.1, ax=axs[1])
Check and compare the skewness/kurtosis from the different transformation
techniques and choose the technique which gives the better result
Note: the transformation must also be reversed to get results back on the original scale.
We need to apply a square where a square root was applied, or an exponential in case of log.
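Reversing a transformation can be checked with a quick round trip. The sketch below applies each transformation to a made-up series and then undoes it; the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical positive values (log and roots need positive input)
s = pd.Series([1.0, 10.0, 100.0, 1000.0])

# log transformation, reversed with the exponential
back_from_log = np.exp(np.log(s))

# square root transformation, reversed by squaring
back_from_sqrt = np.sqrt(s) ** 2

# 10th-root transformation, reversed by raising to the power 10
back_from_root10 = (s ** 0.1) ** 10

print(np.allclose(back_from_log, s))
print(np.allclose(back_from_sqrt, s))
print(np.allclose(back_from_root10, s))
```

np.allclose is used instead of exact equality because floating point arithmetic can leave tiny rounding differences after the round trip.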
Outlier treatment
Sometimes we might not need to apply outlier treatment
if treating the outliers can hamper the data.
Z score: Normalization
Using the z score technique, treat a value as an outlier if its zscore is either less than -3 or
greater than +3:
z = (x - μ) / σ
where μ is the mean of variable x and σ is its standard deviation
df['new_variable_zscore'] = (df.variable_name - df.variable_name.mean()) / df.variable_name.std()
# df['new_variable_zscore'] is a new column which holds the zscore values
Check if the zscore column has any value below -3 or greater than 3
Calculate the threshold value which can be imputed to the
outliers: i.e. x = (z * σ) + μ
Here, z is 3, so the threshold is
3*σ + μ
impute_value = (3 * df.variable_name.std()) + df.variable_name.mean()
round(impute_value, 2)
Now, imputing this value caps the outliers at the threshold and brings the data
closer to a normal distribution.
Note: we can use np.where to replace the value.
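One possible way to use np.where for this replacement is sketched below on made-up data with a single extreme value. The column name variable_capped and the lower threshold (μ - 3σ) are assumptions added for illustration; the text above only spells out the upper threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 40 ordinary values and one extreme value
df = pd.DataFrame({"variable_name": [10.0] * 20 + [12.0] * 20 + [500.0]})

mu = df["variable_name"].mean()
sigma = df["variable_name"].std()

# z = (x - mean) / standard deviation
df["new_variable_zscore"] = (df["variable_name"] - mu) / sigma

upper = mu + 3 * sigma  # value imputed where z > 3
lower = mu - 3 * sigma  # value imputed where z < -3 (assumed symmetric rule)

# Keep the original value unless its zscore is beyond +/-3
df["variable_capped"] = np.where(
    df["new_variable_zscore"] > 3, upper,
    np.where(df["new_variable_zscore"] < -3, lower, df["variable_name"]),
)

n_outliers = int((df["new_variable_zscore"].abs() > 3).sum())
print(n_outliers, round(upper, 2))
```

Writing the result to a new column keeps the original values available for comparison before and after treatment.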
Use a boxplot to detect outliers:
Points above or below the whisker values are outliers
def detect_outlier(col):
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range
# IQR: inter quartile range, lower_range: lower whisker value,
# upper_range: upper whisker value
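A short usage sketch for the whisker function: it is called on a made-up series and the returned limits are used to filter the outliers. The series values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def detect_outlier(col):
    # Whisker limits of a boxplot: 1.5 * IQR beyond the quartiles
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

# Hypothetical data with one value far from the rest
col = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

lower, upper = detect_outlier(col)

# Values outside the whisker limits are flagged as outliers
outliers = col[(col < lower) | (col > upper)]

print(lower, upper)
print(outliers.tolist())
```

For this series the quartiles are 3.25 and 7.75, so the limits are -3.5 and 14.5 and only the value 100 is flagged.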
One way:
Capping the value at the 99th percentile value
q99percent = df.variable_name.quantile(q=0.99)
(calculates the value and stores it in q99percent)
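Once the 99th percentile is stored, the capping itself can be sketched with clip, which replaces everything above the given limit with that limit. Using clip is an assumption about how the capping is applied; the data and the new column name are made up for illustration.

```python
import pandas as pd

# Hypothetical data: the values 1.0 to 100.0
df = pd.DataFrame({"variable_name": [float(v) for v in range(1, 101)]})

# 99th percentile of the column
q99percent = df.variable_name.quantile(q=0.99)

# Values above the 99th percentile are replaced by it; others are unchanged
df["variable_name_capped"] = df["variable_name"].clip(upper=q99percent)

print(q99percent)
print(df["variable_name_capped"].max())
```

Only the top value changes here (100.0 becomes 99.01), which is typical of percentile capping: it trims the extreme tail and leaves the bulk of the data alone.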
It is very important to have business knowledge before doing any
imputation.
How to reset index

df.reset_index(drop=True,inplace=True)
ENCODING (2 OPTIONS)

LABEL ENCODING: the result should be numeric
It can be applied to ordinal variables: we change the categorical
variable to numeric values (it assigns numbers
based on the alphabetical order of the category names in that field).
The first step is to change the datatype to category
Below is the syntax:
df['categorical_variable_name'] = df['categorical_variable_name'].astype('category')
df['categorical_variable_name'] = df['categorical_variable_name'].cat.codes
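Because the codes follow alphabetical order, they may not match the real order of an ordinal variable. The sketch below shows the default behaviour and, as an assumed alternative, an explicit category order set with pd.Categorical. The column size and its levels are made up for illustration.

```python
import pandas as pd

# Hypothetical ordinal variable
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Default: codes follow alphabetical order (large=0, medium=1, small=2)
df["size_code"] = df["size"].astype("category").cat.codes

# Alternative: state the order explicitly (small=0, medium=1, large=2)
ordered = pd.Categorical(df["size"],
                         categories=["small", "medium", "large"],
                         ordered=True)
df["size_code_ordered"] = ordered.codes

print(df)
```

With the default codes "large" gets the smallest number, so for ordinal data it is worth checking that the numbers run in the intended direction.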
ONE HOT ENCODING:

One hot encoding is not to be done on label fields.
Preferred: if the number of levels in the categorical variable is less than 20
Syntax to perform one hot encoding
cat.remove('categorical_variable')  ## remove the field that should not be encoded from the cat list
df_new = pd.get_dummies(df, columns=cat, drop_first=True)
Always try to drop 1 dummy variable per field, as there is a high possibility of
correlation between these dummy variables
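A small sketch of the one hot encoding steps on made-up data. The column names city, target and age are assumptions for illustration, with target standing in for the label field that is left unencoded.

```python
import pandas as pd

# Hypothetical data: one categorical feature, one label field, one numeric field
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Mumbai"],
    "target": ["yes", "no", "yes"],
    "age": [25, 40, 31],
})

cat = ["city", "target"]
cat.remove("target")  # keep the label field out of the encoding

# drop_first=True drops one dummy per field (here the first level, Delhi)
df_new = pd.get_dummies(df, columns=cat, drop_first=True)

print(df_new.columns.tolist())
```

The three city levels become two dummy columns. A row with 0 (False) in both is a Delhi row, so no information is lost by dropping the first dummy.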