
Learn Data Analysis with Python: Practical Code for Data Cleaning

Introduction

To apply for a data analyst or data scientist role, it is necessary to know one of the programming languages used for such work, such as R, Python, or Scala. Here I have selected Python for data analysis.

The data preparation step is the most important part of winning the battle of a data analysis project. This document shows how to clean data (missing values, outliers, duplicates) with Python.

Raw data is full of impurities such as outliers, missing values, and duplicates. Cleaning the data means making it logical, significant, and consistent.

Missing Data

Missing data is one of the most common issues, and there are many methods for handling it. Let us look at some practical code.


Image by Author (graderecord.csv)

This is a CSV file with two missing values in the Grade column. Let us now walk through code for processing such missing information.

import pandas as pd

df = pd.read_csv("graderecord.csv")
df.head(10)

Code: Drop Rows with Missing values

df_no_missing = df.dropna()

df_no_missing

Code: Replace Missing Values with 0

df.fillna(0)
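Note that fillna returns a new DataFrame rather than modifying the original. A minimal sketch, using a small hypothetical frame (named demo, to stand in for graderecord.csv, which is not reproduced here), also shows that passing a dict fills each column with its own default:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for graderecord.csv
demo = pd.DataFrame({"Name": ["Ann", "Bob", None], "Grade": [90.0, np.nan, 75.0]})

# fillna returns a new DataFrame; demo itself is unchanged
filled = demo.fillna(0)
print(filled["Grade"].tolist())  # [90.0, 0.0, 75.0]

# A dict fills each column with its own default value
filled2 = demo.fillna({"Name": "Unknown", "Grade": 0})
print(filled2["Name"].tolist())  # ['Ann', 'Bob', 'Unknown']
```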

Code: Replace Missing Values with the Mean, Median, or Mode of the Column

# Apply only one of these: after the first fillna, no missing values remain
df["Grade"] = df["Grade"].fillna(df["Grade"].mean())

df["Grade"] = df["Grade"].fillna(df["Grade"].median())

df["Grade"] = df["Grade"].fillna(df["Grade"].mode()[0])  # mode() returns a Series, so take its first value
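To see the mean imputation at work, here is a minimal sketch on hypothetical grades (graderecord.csv is not reproduced here, so a small frame named demo stands in for it):

```python
import numpy as np
import pandas as pd

# Hypothetical grades standing in for graderecord.csv
demo = pd.DataFrame({"Grade": [80.0, 90.0, np.nan, 100.0]})

mean_value = demo["Grade"].mean()  # NaN is skipped: (80 + 90 + 100) / 3 = 90.0
demo["Grade"] = demo["Grade"].fillna(mean_value)

print(demo["Grade"].tolist())  # [80.0, 90.0, 90.0, 100.0]
```

Assigning the result back, as above, avoids the chained-assignment pitfalls that inplace=True on a column can trigger in recent pandas versions.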

Code: Selecting Rows with No Missing Values

df[df["Grade"].notnull()]

Code: If a Column Is Entirely Empty, Drop It

# Add a column with empty values
import numpy as np

df["newcol"] = np.nan
df.head()

# Drop completely empty columns
df.dropna(axis=1, how="all")

Outlier Treatment

In this blog, I consider two ways of treating outliers.

Method 1: If the data is normally distributed, we follow the standardization (z-score) method. With a 95% confidence interval, the z-score is 1.96 for a 5% alpha value; in other words, 95% of the data lies within 1.96 standard deviations of the mean. So we can drop values below or above this range.


Image by Author (gradedata.csv)

import pandas as pd

df = pd.read_csv("gradedata.csv")
df

meangrade = df["grade"].mean()
stdgrade = df["grade"].std()
higherrange = meangrade + stdgrade * 1.96
lowerrange = meangrade - stdgrade * 1.96

df = df.drop(df[df["grade"] > higherrange].index)
df = df.drop(df[df["grade"] < lowerrange].index)
df
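Since gradedata.csv is not reproduced here, the same filtering can be verified on synthetic data with one planted extreme value:

```python
import numpy as np
import pandas as pd

# Synthetic grades plus one planted outlier (stand-in for gradedata.csv)
rng = np.random.default_rng(0)
grades = np.append(rng.normal(loc=80, scale=5, size=99), 300.0)
demo = pd.DataFrame({"grade": grades})

mean, std = demo["grade"].mean(), demo["grade"].std()
upper = mean + 1.96 * std
lower = mean - 1.96 * std

# Keep only rows within 1.96 standard deviations of the mean
trimmed = demo[(demo["grade"] >= lower) & (demo["grade"] <= upper)]
print(len(demo), len(trimmed))  # the planted 300.0 is dropped
```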

Method 2: This method uses the interquartile range (IQR). The IQR is the difference between the 25% quantile (Q1) and the 75% quantile (Q3). Any value lower than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR is considered an outlier.


q1 = df["grade"].quantile(.25)
q3 = df["grade"].quantile(.75)
iqr = q3 - q1
highrange = q3 + iqr * 1.5
lowrange = q1 - iqr * 1.5

df = df.drop(df[df["grade"] > highrange].index)
df = df.drop(df[df["grade"] < lowrange].index)
df
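The IQR bounds are easy to check by hand on a small made-up series, where one value is clearly an outlier:

```python
import pandas as pd

# A small hand-made series; 100 is clearly an outlier
s = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 12.0 and 13.75
iqr = q3 - q1                                # 1.75
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 9.375 and 16.375

outliers = s[(s < low) | (s > high)]
print(outliers.tolist())  # [100]
```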

Finding Duplicates

Using Python libraries, we can identify duplicate rows as well as unique rows of a data set.

# Creating a Dataset with Duplicates
import pandas as pd

Emp = ['Jane', 'Johny', 'Boby', 'Jane', 'Mary', 'Jony', 'Melica', 'Melica']
Salary = [9500, 7800, 7600, 9500, 7700, 7800, 9900, 10000]
SalaryList = list(zip(Emp, Salary))
df = pd.DataFrame(data=SalaryList, columns=['Emp', 'Salary'])
df

# Flagging Duplicate Rows (returns a Boolean Series; True marks a repeat)

df.duplicated()

# Displaying Dataset without Duplicates

df.drop_duplicates()

# Drop Rows with Duplicate Emp, Keeping the Last Observation

df.drop_duplicates(['Emp'], keep='last')
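Since duplicated() only returns Booleans, it takes a filter to see the duplicate rows themselves; the keep=False option goes further and marks every member of a duplicate group. A self-contained sketch using the same employee data:

```python
import pandas as pd

demo = pd.DataFrame({
    "Emp": ["Jane", "Johny", "Boby", "Jane", "Mary", "Jony", "Melica", "Melica"],
    "Salary": [9500, 7800, 7600, 9500, 7700, 7800, 9900, 10000],
})

# duplicated() only flags rows; wrap it in a filter to see the rows themselves
repeats = demo[demo.duplicated()]
print(repeats["Emp"].tolist())  # ['Jane'] - the only fully duplicated row

# keep=False marks every member of a duplicate group, not just the repeats
all_dupes = demo[demo.duplicated(["Emp"], keep=False)]
print(sorted(all_dupes["Emp"].tolist()))  # ['Jane', 'Jane', 'Melica', 'Melica']
```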
