Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited. 1
This file is meant for personal use by shinde.mayur17@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action.
This cheatsheet contains the syntax of the code used in the EDA content.
Assume we have a dataframe named df whose variables (column names) are
variable_name1, variable_name2 and so on.

DESCRIBING DATA AND DATA PRE-PROCESSING

To check the version of a library
print(libraryname.__version__)

For example, let's check the numpy and pandas versions
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print(np.__version__)
print(pd.__version__)

To import the file
df = pd.read_csv('filename.csv')
## here the file is of type csv

To get the top 5 and bottom 5 records
df.head()
df.tail()

To check the number of rows and the number of columns
df.shape[0]   ## number of rows
df.shape[1]   ## number of columns

To check data types, column names and the count of non-null values (information about the data)
df.info()

To understand the statistics/description of the data,
such as count, mean, std, min, 25%, 50%, 75% and max
df.describe()

To check the description of an individual column
df['variable_name1'].describe()

To check the unique values in any variable
df['variable_name1'].unique()
## use df['variable_name1'].nunique() to get the count of unique values

To replace special characters (here '?') in the data if they
result in noise or incorrect information
df['variable_name1'] = df['variable_name1'].replace('?', np.nan)

To change the datatype to float
df['variable_name1'] = df['variable_name1'].astype('float64')
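The two steps above can be sketched on a small made-up column. The dataframe contents and the column name price are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a numeric column that was read in as text because of '?' entries
df = pd.DataFrame({"price": ["10.5", "?", "7.25", "?"]})

# Replace the '?' placeholder with NaN, then convert the column to float
df["price"] = df["price"].replace("?", np.nan).astype("float64")

n_missing = df["price"].isnull().sum()  # number of values that were '?'
print(df["price"].dtype, n_missing)
```

Doing the replacement first matters: astype('float64') would fail while the '?' strings are still in the column.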

HOW TO TREAT ANOMALIES IN THE DATA

If we want to replace a value (here, let's say -1)
with a particular column value:
New_variable_name = df[df['variable_name1']<condition]['variable_name2'].values[0]
## here the variable_name1 value is to be replaced with the variable_name2 value
df['variable_name1'] = df['variable_name1'].replace(to_replace=-1, value=New_variable_name)

If we want to replace it with the mean()
df['variable_name1'] = df['variable_name1'].replace(-1, df['variable_name1'].mean())

If the variable is categorical, the replacement is done with the mode()
df['variable_name1'] = df['variable_name1'].replace('?', df['variable_name1'].mode()[0])

To check the number of occurrences of each value in a column
df['variable_name1'].value_counts()

To check the complete row as per location
df.iloc[location_number]

To check missing values
df.isnull().sum()

IMPUTATION

Imputation using mean() in case no outliers are present
df.variable_name1 = df.variable_name1.fillna(df.variable_name1.mean())

How to create dataframes as per data types from the given data
df_num = df.select_dtypes(['float64', 'int64'])
df_cat = df.select_dtypes(['object'])

Libraries that can be used for imputation:
sklearn.impute

Now, let's impute the median() in the numeric datatype
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
imr = imputer.fit(df_num)
## keep the original index so that the rows still line up with df_cat later
df_num = pd.DataFrame(imr.transform(df_num), columns=df_num.columns, index=df_num.index)

The same can be done for df_cat (for categorical data, for example, with strategy='most_frequent').
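As a rough sketch of the categorical case, the block below fills missing values with each column's most frequent value using plain pandas rather than SimpleImputer. The dataframe contents and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical dataframe with one missing value per column
df_cat = pd.DataFrame({
    "city": ["Pune", "Pune", np.nan, "Delhi"],
    "grade": ["A", np.nan, "B", "B"],
})

# mode() returns the most frequent value(s) per column; row 0 holds the first mode
most_frequent = df_cat.mode().iloc[0]

# fillna with a Series fills each column with its own most frequent value
df_cat = df_cat.fillna(most_frequent)
print(df_cat)
```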

How to concatenate two dataframes (with a join)
df_new = pd.concat([df_num, df_cat], axis=1, join='inner')

How to drop null values
df.dropna()

To check and drop duplicates
dups = df.duplicated()
df.drop_duplicates(inplace=True)

DATA VISUALIZATION

Univariate Analysis

Pick any number of fields to perform univariate analysis
df[['variable_name1', 'variable_name2']].describe()

Now, to create any plot, it is suggested to prepare the drawing board
with an appropriate figsize so that the plot is clearly visible and understandable.
Then use subplots, which lets us choose how many plots are shown in
rows/columns
fig, axes = plt.subplots(nrows=2, ncols=2)
## any number as per choice can be given in brackets
fig.set_size_inches(10, 8)
## any number as per choice can be given in brackets

Plot the columns:
To create a histplot or boxplot we use the seaborn library.
sns.histplot(df['variable_name1'], kde=True, ax=axes[0][0])  ## **
sns.boxplot(x='variable_name1', data=df, ax=axes[0][1])
plt.show() ## to see the plots

## kde is used to see the shape of the distribution
## ax=axes[ ][ ] defines the row and column of the drawing
## board where the plot will be displayed.
## either histplot or displot can be used (displot is figure-level and does not take ax)

• In case of categorical variables, it is important to
understand the frequency of the different levels

• Now, to view the count/frequency of a categorical field
as a proportion (multiply by 100 for a percentage), use normalize:
df['variable_name3'].value_counts(normalize=True)

To check the count of a categorical field in the form of a plot, use:
sns.countplot(x='variable_name3', data=df, palette='pastel')
## where palette defines the colors shown in the bars.
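A small illustration of normalize, on a made-up column named segment (an assumption for this sketch): the result is a fraction per level, which can be scaled to a percentage.

```python
import pandas as pd

# Hypothetical categorical column with two levels
df = pd.DataFrame({"segment": ["A", "B", "A", "A"]})

# normalize=True returns fractions that sum to 1 instead of raw counts
proportions = df["segment"].value_counts(normalize=True)

# Multiply by 100 to read the same information as percentages
percentages = (proportions * 100).round(1)
print(percentages)
```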

Bivariate Analysis

Here, let's analyze using 2 variables:

Both are numeric variables
Use a scatterplot
plt.scatter(df['variable_name1'], df['variable_name2'])

Both are categorical variables
In this case we use countplot
sns.countplot(x='variable_name_3', hue='Variable_name4', data=df)

The crosstab function can be used to see information
about two categorical variables
pd.crosstab(df['variable_name_3'], df['Variable_name4'],
            margins=True, normalize=True)

## margins gives the row and column totals (along the x and y axes)
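To make the effect of margins and normalize concrete, here is a small sketch on made-up data. The column names gender and smoker are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with two categorical columns
df = pd.DataFrame({
    "gender": ["F", "F", "M", "M"],
    "smoker": ["yes", "no", "no", "no"],
})

# Raw counts, with row and column totals in the extra "All" row/column
counts = pd.crosstab(df["gender"], df["smoker"], margins=True)

# normalize=True divides every cell by the grand total
shares = pd.crosstab(df["gender"], df["smoker"], margins=True, normalize=True)

print(counts)
print(shares)
```

With normalize=True the cells are shares of all rows, so the "All"/"All" cell is 1.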

One is numeric and the other is categorical
We can use a boxplot
sns.boxplot(x='categorical variable', y='numeric variable',
            data=df)

To capture all numeric fields and check their combinations (put them in pairs)
sns.pairplot(df)

Another plot to check numeric-to-numeric relationships is the heatmap
(correlation)
plt.figure(figsize=(15,15))
## numeric_only=True keeps non-numeric columns out of the correlation
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f");
where .2f shows only 2 decimal places; annot shows the correlation numbers
inside the heatmap cells
Lastly, with the default color map, lighter cells show high positive correlation,
whereas the darkest cells show the lowest (most negative) values. Values near 0
mean nearly no linear relationship, and the colors in between show intermediate
correlation.
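The numbers behind such a heatmap come from the correlation matrix. The sketch below builds one on made-up columns (all names are assumptions) to show what values of +1 and -1 look like, and that a text column is left out when numeric_only=True is passed.

```python
import pandas as pd

# Hypothetical data: y rises with x, z falls with x, label is text
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [8, 6, 4, 2],
    "label": ["a", "b", "a", "b"],
})

# Pearson correlation between every pair of numeric columns
corr = df.corr(numeric_only=True)
print(corr.round(2))
```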

Multivariate analysis is where more than 2 variables are involved.
So, we can put one categorical variable on the x-axis and one continuous
variable on the y-axis. Lastly, we put the 3rd variable, which is
represented by colors.
sns.boxplot(x='categorical_variable1', y='numeric_variable',
            hue='categorical_variable2', data=df)

In case we have many categorical levels we can use FacetGrid
a = sns.FacetGrid(df, col="categorical_variable1",
                  hue='categorical_variable2', col_wrap=3, height=3)
where col_wrap=3 will show three plots in a row

To show the data with any particular numeric variables
a.map(plt.scatter, "numeric_variable_name1",
      'numeric_variable_name2')
a.add_legend()

DATA PREPARATION

SCALING: Scaling helps us give the same weightage to all numeric
variables irrespective of the values they contain.
To get all columns of numeric data type together and all columns of categorical
data type together, we can create lists and use a for loop:
cat = []  # categorical list
num = []  # numeric list
# here we have created a for loop that searches for all columns with datatype
# object and adds them to the cat list, and adds the other datatypes (numeric)
# to the num list using append
for i in df.columns:
    if df[i].dtype == "object":
        cat.append(i)
    else:
        num.append(i)
# to see the lists, print them
print(cat)
print(num)
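The same two lists can also be built without a loop. This is an alternative sketch using select_dtypes, not the method shown above, and the dataframe is made up for illustration.

```python
import pandas as pd

# Hypothetical dataframe with two numeric columns and one text column
df = pd.DataFrame({
    "age": [25, 32],
    "income": [50000.0, 64000.0],
    "city": ["Pune", "Delhi"],
})

# Columns with a numeric dtype go to num, everything else goes to cat
num = df.select_dtypes(include="number").columns.tolist()
cat = df.select_dtypes(exclude="number").columns.tolist()

print(cat)
print(num)
```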

METHODS / TECHNIQUES FOR SCALING

Z-SCORE: How to apply zscore to numeric data
for scaling or standardizing the data
We import zscore from the scipy.stats library
from scipy.stats import zscore
data_scaled = df[num].apply(zscore)
# here a new dataframe, data_scaled, is created so that a comparison can be
# done with the original dataframe. Also, we have applied the zscore function to
# all numeric fields in the dataframe

Note: when we apply zscore it centers the data: the mean is close to 0 and the
standard deviation is 1. Also, the scale is changed (i.e. the values change)
to a range that is comparable across variables.

We can plot a histplot for both dataframes, one before scaling and the
other after scaling, as explained previously **
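To see where the "mean near 0, standard deviation 1" property comes from, the z-score can be computed by hand. This sketch uses plain pandas on made-up columns; ddof=0 is chosen because it is the default of scipy.stats.zscore.

```python
import pandas as pd

# Hypothetical numeric dataframe on two very different scales
df_num = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "income": [100.0, 200.0, 300.0, 800.0],
})

# z = (x - mean) / std, applied column by column
data_scaled = (df_num - df_num.mean()) / df_num.std(ddof=0)

print(data_scaled.mean().round(6))       # close to 0 for every column
print(data_scaled.std(ddof=0).round(6))  # 1 for every column
```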

Scaling can be done using the StandardScaler class as well. Import it
from the library, create an object, fit the numeric dataframe and
transform it. Once done, the data will be in the form of an array, which needs
to be converted back to a dataframe.

Below is the code to do this:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(df[num])
data_standard = scaler.transform(df[num])
data_standard = pd.DataFrame(data_standard,
                             columns=df[num].columns)
data_standard.describe()

Another method is MinMax (where the minimum value will be
0, the maximum will be 1 and the other values will range between 0 and 1)
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler().fit(df[num])
# created the object & fit the numeric fields
data_minmax = scaler.transform(df[num])
# transformed it; the scaled data is now available
data_minmax = pd.DataFrame(data_minmax, columns=df[num].columns)
# create a dataframe to hold the scaled data
data_minmax.describe()

Note: scaling techniques are applied based upon the need/requirement of the algorithm.
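The MinMax idea can be written out by hand as (x - min) / (max - min). The sketch below does this with plain pandas on made-up columns, as an illustration of the formula rather than a replacement for MinMaxScaler.

```python
import pandas as pd

# Hypothetical numeric dataframe
df_num = pd.DataFrame({
    "age": [20, 30, 40, 60],
    "income": [100.0, 200.0, 300.0, 500.0],
})

# Each column is shifted by its minimum and divided by its range
data_minmax = (df_num - df_num.min()) / (df_num.max() - df_num.min())
print(data_minmax)
```

Every column now has a minimum of 0 and a maximum of 1.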

TRANSFORMATION

Check for skewness/kurtosis in the data and
transform accordingly.
To check skewness / kurtosis:
df['variable_name'].skew()
# helps us understand how symmetric the distribution is
df['variable_name'].kurtosis()
# helps us understand the sharpness/peak of the distribution

As a rule of thumb, for an approximately normal distribution the skewness and
kurtosis (pandas reports excess kurtosis, which is 0 for a normal distribution)
should be between -1 and +1.

Several transformation techniques can improve the skewness/kurtosis

Log transformation
## np.log needs strictly positive values
print(np.log(df['variable_name']).skew())
print(np.log(df['variable_name']).kurtosis())
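As a small illustration of how a log transformation can reduce skewness, the sketch below uses a made-up series that doubles at every step, so its logarithm is evenly spaced.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed data: each value is twice the previous one
s = pd.Series([1, 2, 4, 8, 16, 32, 64, 128], dtype="float64")

raw_skew = s.skew()          # clearly positive: long right tail
log_skew = np.log(s).skew()  # about 0: the logged values are evenly spaced

print(round(raw_skew, 2), round(log_skew, 2))
```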

To see the transformation on a plot:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
sns.histplot(np.log(df['variable_name']),
             kde=True, ax=axs[0])
sns.boxplot(x=np.log(df['variable_name']), ax=axs[1])

Sqrt transformation
print(np.sqrt(df['variable_name']).skew())
print(np.sqrt(df['variable_name']).kurtosis())

To see the transformation on a plot:
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
sns.histplot(np.sqrt(df['variable_name']), kde=True, ax=axs[0])
sns.boxplot(x=np.sqrt(df['variable_name']), ax=axs[1])

Distribution means more spread
Using the 10th-root transformation
print((df['variable_name']**0.1).skew())
print((df['variable_name']**0.1).kurtosis())

fig_dims = (10, 5)
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=fig_dims)
sns.histplot((df['variable_name']**0.1), kde=True, ax=axs[0])
sns.boxplot(x=(df['variable_name']**0.1), ax=axs[1])

Check and compare the skewness/kurtosis from the different transformation
techniques and choose the technique that gives the better results

Note: The reverse of the transformation is also required to get the correct results.
We need to apply a square where a square root was applied, or an exponential in case of log.
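The reverse step can be sketched as a round trip: transform, then apply the inverse, and check that the original values come back. The series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical positive-valued series
s = pd.Series([1.0, 4.0, 9.0, 100.0])

back_from_log = np.exp(np.log(s))    # exponential reverses log
back_from_sqrt = np.sqrt(s) ** 2     # square reverses square root
back_from_root10 = (s ** 0.1) ** 10  # 10th power reverses the 10th root

print(back_from_log.round(6).tolist())
```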

Outlier treatment
Sometimes we might not need to apply outlier treatment
if treating can hamper the data.

Z score: Normalization
Treat a value as an outlier if its zscore is either less than -3 or
greater than +3. Using the z score technique:
z = (x - μ) / σ
where μ is the mean of variable x and σ is the standard deviation

df['new_variable_zscore'] = (df.variable_name - df.variable_name.mean()) / df.variable_name.std()
# df['new_variable_zscore'] is a new column which holds the zscore values
Check if the zscore column has any value below -3 or greater than 3

Calculate the threshold value that can be imputed to the
outliers so as to bring the data closer to a normal distribution: i.e. x = (z * σ) + μ
Here, z is 3, so
x = 3*σ + μ
impute_value = (3 * df.variable_name.std()) + df.variable_name.mean()
round(impute_value, 2)

Now, imputing this value will cap the outliers and bring the data
closer to a normal distribution.

Note: we can use np.where to replace the values.
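A minimal sketch of the np.where replacement, assuming a made-up column named amount with one extreme value. The lower threshold (mean minus 3 standard deviations) is added here by symmetry and is not spelled out in the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 19 ordinary values and one extreme value
df = pd.DataFrame({"amount": [10] * 10 + [12] * 9 + [100]})

mean, std = df["amount"].mean(), df["amount"].std()
upper = mean + 3 * std  # value imputed to points with z > 3
lower = mean - 3 * std  # value imputed to points with z < -3

# Replace values beyond the thresholds, keep everything else unchanged
df["amount_capped"] = np.where(
    df["amount"] > upper, upper,
    np.where(df["amount"] < lower, lower, df["amount"]),
)

n_capped = int((df["amount"] != df["amount_capped"]).sum())
print(round(upper, 2), n_capped)
```

Only the single extreme value is pulled in to the upper threshold; the other rows are left as they were.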

Use a boxplot to detect outliers:
Points above or below the whisker values are outliers

def detect_outlier(col):
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    lower_range = Q1 - (1.5 * IQR)
    upper_range = Q3 + (1.5 * IQR)
    return lower_range, upper_range

# IQR: inter quartile range; lower_range: lower whisker value;
# upper_range: upper whisker value
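One possible way to use the two returned values is to flag the points outside the range and then cap them. The series below is made up, and capping with clip is an assumption of this sketch rather than something stated above.

```python
import numpy as np
import pandas as pd

def detect_outlier(col):
    # Same logic as above: whisker limits from the inter quartile range
    Q1, Q3 = np.percentile(col, [25, 75])
    IQR = Q3 - Q1
    return Q1 - (1.5 * IQR), Q3 + (1.5 * IQR)

# Hypothetical data with one extreme value
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype="float64")

lower_range, upper_range = detect_outlier(s)

# Points outside the whisker limits are treated as outliers
outliers = s[(s < lower_range) | (s > upper_range)]

# Cap them at the whisker limits
s_capped = s.clip(lower=lower_range, upper=upper_range)
print(lower_range, upper_range, outliers.tolist())
```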

One way:
Capping the values at the 99th percentile value

q99percent = df.variable_name.quantile(q=0.99)
(calculated the value & stored it in q99percent)
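The stored value can then be applied as a cap. A minimal sketch, assuming a made-up column named amount and using clip for the capping step:

```python
import pandas as pd

# Hypothetical column holding the values 1 to 100
df = pd.DataFrame({"amount": pd.Series(range(1, 101), dtype="float64")})

# 99th percentile of the column
q99percent = df["amount"].quantile(q=0.99)

# Values above the 99th percentile are set to the 99th percentile
df["amount_capped"] = df["amount"].clip(upper=q99percent)

n_above = int((df["amount"] > q99percent).sum())
print(q99percent, n_above)
```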

It is very important to have business knowledge before doing any imputation.

How to reset the index
df.reset_index(drop=True, inplace=True)

ENCODING (2 OPTIONS)

LABEL ENCODING: the result should be numeric
Can be applied to ordinal variables, so we can change the categorical
variable to numeric values and do the encoding (it assigns values/numbers
based upon the alphabetical order of the categorical names in that field).
The first step is to change the datatype to category
Below is the syntax:
df['categorical_variable_name'] = df['categorical_variable_name'].astype('category')
df['categorical_variable_name'] = df['categorical_variable_name'].cat.codes
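Because the default codes follow alphabetical order, they may not match the real order of an ordinal variable. The sketch below compares the default with an explicitly ordered category type; the column name size and its levels are made up for illustration.

```python
import pandas as pd

# Hypothetical ordinal column
df = pd.DataFrame({"size": ["low", "high", "medium", "low"]})

# Default: categories are sorted alphabetically (high, low, medium)
alphabetical = df["size"].astype("category").cat.codes

# Explicit order: low < medium < high
ordered_type = pd.CategoricalDtype(categories=["low", "medium", "high"], ordered=True)
ordinal = df["size"].astype(ordered_type).cat.codes

print(alphabetical.tolist())
print(ordinal.tolist())
```

With the explicit order, low maps to 0, medium to 1 and high to 2, which keeps the ranking of the levels.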

ONE HOT ENCODING:
One Hot Encoding is not to be done on Label fields.

Preferred: if the number of levels in the categorical variable is less than 20
Syntax to perform one hot encoding
cat.remove('categorical_variable')
## removes from the cat list a variable that should not be one hot encoded
df_new = pd.get_dummies(df, columns=cat, drop_first=True)

Always try to drop 1 dummy variable, as there is a high possibility of
correlation between these dummy variables
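A small sketch of what drop_first does, on a made-up dataframe with one categorical column named color (an assumption for illustration):

```python
import pandas as pd

# Hypothetical data: one numeric column and one categorical column with 3 levels
df = pd.DataFrame({
    "size": [1, 2, 3, 2],
    "color": ["red", "green", "blue", "green"],
})
cat = ["color"]

# Three levels produce only two dummy columns; the first level (blue) is dropped
df_new = pd.get_dummies(df, columns=cat, drop_first=True)
print(df_new.columns.tolist())
```

A row where both dummy columns are 0 is the dropped level, so no information is lost.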
