
Learn Data Analysis with Python: Practical Code for Data Cleaning

Introduction

To apply for a data analyst or data scientist role, it is necessary to know one of the programming languages used for such work, such as R, Python, or Scala. Here I have selected Python for data analysis.

The data preparation step is the most important part of winning the battle of a data analysis project. This document shows how to clean data (missing values, outliers, duplicates) with Python.

Raw data is full of impurities such as outliers, missing values, and duplicates. Cleaning the data means making it logical, significant, and consistent.

Missing Data

Missing data is one of the most common issues, and there are many methods for handling it. Let us look at some practical code.


Image by Author (graderecord.csv)

This is a CSV file with two missing values in the Grade column. Let us now walk through code for processing such missing information.

import pandas as pd

df = pd.read_csv("graderecord.csv")
df.head(10)

Code: Drop Rows with Missing values

df_no_missing = df.dropna()

df_no_missing

Code: Replace Missing Values with 0

df.fillna(0)
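Note that fillna returns a new DataFrame rather than modifying the original. A minimal sketch, using a small hypothetical frame (named demo, to stand in for graderecord.csv, which is not reproduced here), also shows that passing a dict fills each column with its own default:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for graderecord.csv
demo = pd.DataFrame({"Name": ["Ann", "Bob", None], "Grade": [90.0, np.nan, 75.0]})

# fillna returns a new DataFrame; demo itself is unchanged
filled = demo.fillna(0)
print(filled["Grade"].tolist())  # [90.0, 0.0, 75.0]

# A dict fills each column with its own default value
filled2 = demo.fillna({"Name": "Unknown", "Grade": 0})
print(filled2["Name"].tolist())  # ['Ann', 'Bob', 'Unknown']
```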

Code: Replace Missing Values with the Mean, Median, or Mode of the Column

# Apply only one of these: after the first fillna, no missing values remain
df["Grade"] = df["Grade"].fillna(df["Grade"].mean())

df["Grade"] = df["Grade"].fillna(df["Grade"].median())

df["Grade"] = df["Grade"].fillna(df["Grade"].mode()[0])  # mode() returns a Series, so take its first value
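To see the mean imputation at work, here is a minimal sketch on hypothetical grades (graderecord.csv is not reproduced here, so a small frame named demo stands in for it):

```python
import numpy as np
import pandas as pd

# Hypothetical grades standing in for graderecord.csv
demo = pd.DataFrame({"Grade": [80.0, 90.0, np.nan, 100.0]})

mean_value = demo["Grade"].mean()  # NaN is skipped: (80 + 90 + 100) / 3 = 90.0
demo["Grade"] = demo["Grade"].fillna(mean_value)

print(demo["Grade"].tolist())  # [80.0, 90.0, 90.0, 100.0]
```

Assigning the result back, as above, avoids the chained-assignment pitfalls that inplace=True on a column can trigger in recent pandas versions.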

Code: Selecting Rows with No Missing Values

df[df["Grade"].notnull()]

Code: If a Column Is Entirely Empty, Drop It

# Add a column with empty values
import numpy as np

df["newcol"] = np.nan
df.head()

# Drop completely empty columns
df.dropna(axis=1, how="all")

Outlier Treatment

In this blog, I consider two ways of treating outliers.

Method 1: If the data is normally distributed, we follow the standardization (z-score) method. With a 95% confidence interval, the z-score is 1.96 for a 5% alpha value; in other words, 95% of the data lies within 1.96 standard deviations of the mean. So we can drop values below or above this range.


Image by Author (gradedata.csv)

import pandas as pd

df = pd.read_csv("gradedata.csv")
df

meangrade = df["grade"].mean()
stdgrade = df["grade"].std()
higherrange = meangrade + stdgrade * 1.96
lowerrange = meangrade - stdgrade * 1.96

df = df.drop(df[df["grade"] > higherrange].index)
df = df.drop(df[df["grade"] < lowerrange].index)
df
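Since gradedata.csv is not reproduced here, the same filtering can be verified on synthetic data with one planted extreme value:

```python
import numpy as np
import pandas as pd

# Synthetic grades plus one planted outlier (stand-in for gradedata.csv)
rng = np.random.default_rng(0)
grades = np.append(rng.normal(loc=80, scale=5, size=99), 300.0)
demo = pd.DataFrame({"grade": grades})

mean, std = demo["grade"].mean(), demo["grade"].std()
upper = mean + 1.96 * std
lower = mean - 1.96 * std

# Keep only rows within 1.96 standard deviations of the mean
trimmed = demo[(demo["grade"] >= lower) & (demo["grade"] <= upper)]
print(len(demo), len(trimmed))  # the planted 300.0 is dropped
```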

Method 2: This method uses the interquartile range (IQR). The IQR is the difference between the 25% quantile (Q1) and the 75% quantile (Q3). Any value lower than Q1 - 1.5*IQR or greater than Q3 + 1.5*IQR is considered an outlier.


q1 = df["grade"].quantile(.25)
q3 = df["grade"].quantile(.75)
iqr = q3 - q1
highrange = q3 + iqr * 1.5
lowrange = q1 - iqr * 1.5

df = df.drop(df[df["grade"] > highrange].index)
df = df.drop(df[df["grade"] < lowrange].index)
df
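The IQR bounds are easy to check by hand on a small made-up series, where one value is clearly an outlier:

```python
import pandas as pd

# A small hand-made series; 100 is clearly an outlier
s = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 12.0 and 13.75
iqr = q3 - q1                                # 1.75
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 9.375 and 16.375

outliers = s[(s < low) | (s > high)]
print(outliers.tolist())  # [100]
```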

Finding Duplicates

Using Python libraries, we can identify duplicate rows as well as unique rows of a data set.

# Creating a Dataset with Duplicates
import pandas as pd

Emp = ['Jane', 'Johny', 'Boby', 'Jane', 'Mary', 'Jony', 'Melica', 'Melica']
Salary = [9500, 7800, 7600, 9500, 7700, 7800, 9900, 10000]
SalaryList = list(zip(Emp, Salary))
df = pd.DataFrame(data=SalaryList, columns=['Emp', 'Salary'])
df

# Flagging Duplicate Rows (returns a Boolean Series; True marks a repeat)

df.duplicated()

# Displaying Dataset without Duplicates

df.drop_duplicates()

# Drop Rows with Duplicate Emp, Keeping the Last Observation

df.drop_duplicates(['Emp'], keep='last')
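Since duplicated() only returns Booleans, it takes a filter to see the duplicate rows themselves; the keep=False option goes further and marks every member of a duplicate group. A self-contained sketch using the same employee data:

```python
import pandas as pd

demo = pd.DataFrame({
    "Emp": ["Jane", "Johny", "Boby", "Jane", "Mary", "Jony", "Melica", "Melica"],
    "Salary": [9500, 7800, 7600, 9500, 7700, 7800, 9900, 10000],
})

# duplicated() only flags rows; wrap it in a filter to see the rows themselves
repeats = demo[demo.duplicated()]
print(repeats["Emp"].tolist())  # ['Jane'] - the only fully duplicated row

# keep=False marks every member of a duplicate group, not just the repeats
all_dupes = demo[demo.duplicated(["Emp"], keep=False)]
print(sorted(all_dupes["Emp"].tolist()))  # ['Jane', 'Jane', 'Melica', 'Melica']
```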
