Lab4-IDA - Ipynb - Colaboratory
Venkatesh Gauri Shankar
# Exp 4: Finding, categorizing, and estimating missing values in various datasets
## 1. First look at the dataset
import pandas as pd
import numpy as np
df = pd.read_csv("titanic_dataset.csv")
df.head()
   PassengerId  Survived  Pclass  Name                                  Sex     Age   SibSp  Parch  Ticket     Fare     Cabin
0  1            0         3       Braund, Mr. Owen Harris               male    22.0  1      0      A/5 21171  7.2500   NaN
1  2            1         1       Cumings, Mrs. John Bradley (Florence  female  38.0  1      0      PC 17599   71.2833  C8
df.drop("Name",axis=1,inplace=True)
df.head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin
df.drop("Ticket",axis=1,inplace=True)
df.drop("PassengerId",axis=1,inplace=True)
df.drop("Cabin",axis=1,inplace=True)
df.drop("Embarked",axis=1,inplace=True)
df.head()
df.drop("Survived",axis=1,inplace=True)
## 2. How to check whether data is missing
df.info()
<class 'pandas.core.frame.DataFrame'>
print(df.isnull().sum())
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
dtype: int64
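Raw counts are easier to judge next to the number of rows, so a per-column percentage of missing values can help when deciding between deletion and imputation. A minimal sketch on a small made-up frame (the `toy` values are illustrative assumptions, not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values, not the Titanic data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Number of missing values per column
missing_count = toy.isnull().sum()

# Share of missing values per column, in percent
# (the mean of a boolean mask is the fraction of True entries)
missing_pct = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

Here `Age` has 2 of 4 values missing (50 percent) and `Fare` has none.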
## Deleting the columns with missing data (if there are many null values)
updated_df = df.dropna(axis=1)
updated_df.info()
<class 'pandas.core.frame.DataFrame'>
## Deleting the rows with missing data (if there are many null values)
updated_df = df.dropna(axis=0)
updated_df.info()
<class 'pandas.core.frame.DataFrame'>
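The two deletion strategies above differ in what they sacrifice: `axis=1` discards whole features, while `axis=0` discards whole observations. A minimal sketch comparing the resulting shapes on a made-up frame (the `toy` values are assumptions for illustration); it also shows the `thresh` argument of `dropna`, which keeps a column only if it has at least that many non-null values:

```python
import numpy as np
import pandas as pd

# Small illustrative frame (hypothetical values, not the Titanic data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Pclass": [3, 1, 3, 1],
})

# axis=1 drops every column that contains at least one NaN
cols_dropped = toy.dropna(axis=1)

# axis=0 drops every row that contains at least one NaN
rows_dropped = toy.dropna(axis=0)

# thresh=3 keeps only columns with at least 3 non-null values
thresh_kept = toy.dropna(axis=1, thresh=3)

print(toy.shape, cols_dropped.shape, rows_dropped.shape, thresh_kept.shape)
```

Column deletion keeps all 4 rows but loses `Age`; row deletion keeps all 3 columns but loses half of the rows.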
## Central tendency: mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
df.info()
<class 'pandas.core.frame.DataFrame'>
print(df.isnull().sum())
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
dtype: int64
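## Central tendency: median
# Note: Age has no NaNs left after the mean fill above, so this line changes nothing
# unless the CSV is re-loaded (and the columns dropped again) first.
df['Age'] = df['Age'].fillna(df['Age'].median())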
df.info()
<class 'pandas.core.frame.DataFrame'>
print(df.isnull().sum())
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
dtype: int64
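## Central tendency: mode
# Note: Age was already filled earlier, so this line changes nothing
# unless the CSV is re-loaded (and the columns dropped again) first.
df['Age'] = df['Age'].fillna(df['Age'].mode()[0])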
df.info()
<class 'pandas.core.frame.DataFrame'>
print(df.isnull().sum())
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
dtype: int64
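Because a column can only be filled once, the three central-tendency fills are easiest to compare when each one starts from the same unfilled data. A minimal sketch on a made-up series (the `ages` values are assumptions chosen so that mean, median, and mode all differ):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry at the end
ages = pd.Series([20.0, 20.0, 30.0, 50.0, np.nan], name="Age")

# fillna returns a new Series each time, so `ages` itself stays unfilled
mean_filled = ages.fillna(ages.mean())      # mean of 20, 20, 30, 50 is 30.0
median_filled = ages.fillna(ages.median())  # median of 20, 20, 30, 50 is 25.0
mode_filled = ages.fillna(ages.mode()[0])   # most frequent value is 20.0

print(mean_filled.iloc[-1], median_filled.iloc[-1], mode_filled.iloc[-1])
```

The median and mode are less sensitive to the outlying value 50 than the mean, which is one common reason to prefer them for skewed columns.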
## 5. Interpolation
import pandas as pd
import numpy as np
df2 = pd.read_csv("titanic_dataset.csv")
df2.head()
   PassengerId  Survived  Pclass  Name                    Sex     Age   SibSp  Parch  Ticket
2  3            1         3       Heikkinen, Miss. Laina  female  26.0  0      0      STON/O 31012
df2.drop("Survived",axis=1,inplace=True)
df2.drop("Ticket",axis=1,inplace=True)
df2.drop("PassengerId",axis=1,inplace=True)
df2.drop("Cabin",axis=1,inplace=True)
df2.drop("Embarked",axis=1,inplace=True)
df2.drop("Name",axis=1,inplace=True)
df2.head()
df2.info()
<class 'pandas.core.frame.DataFrame'>
print(df2.isnull().sum())
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
dtype: int64
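# interpolate() has no `direction` argument; the parameter is `limit_direction`.
# Assign the result back rather than relying on inplace=True on a column selection.
df2['Age'] = df2['Age'].interpolate(method='linear', limit_direction='forward')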
df2.info()
<class 'pandas.core.frame.DataFrame'>
print(df2.isnull().sum())
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
dtype: int64
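To see what linear interpolation actually writes into the gaps, it helps to run it on a series small enough to check by hand. A minimal sketch, assuming a made-up series `s` (not Titanic data) with one leading gap and one interior gap:

```python
import numpy as np
import pandas as pd

# Hypothetical series: a leading NaN and a two-value gap between 10 and 40
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0])

# Linear interpolation spaces the missing values evenly between the
# known neighbours. With limit_direction='forward', a NaN that has no
# valid value before it (the leading one here) is left untouched.
filled = s.interpolate(method="linear", limit_direction="forward")

print(filled.tolist())
```

The interior gap becomes 20.0 and 30.0, while the first entry stays missing. Note that linear interpolation treats row order as meaningful, which is a strong assumption for a table such as the passenger list, where neighbouring rows are unrelated people.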
## KNN imputation
pip install fancyimpute
Collecting fancyimpute
Collecting knnimpute>=0.1.0
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv")
df
0 1 19.0 15.0 39
1 2 21.0 15.0 81
2 3 20.0 NaN 6
3 4 23.0 16.0 77
4 5 NaN 17.0 40
from fancyimpute import KNN
knn_imputer = KNN()
# imputing the missing value with knn imputer
data = knn_imputer.fit_transform(df)
data
[ 3. , 20. , 15.72267049, 6. ],
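The idea behind KNN imputation can be sketched without any extra library: for each missing cell, find the rows most similar to the incomplete row and average their values in that column. The function below is a simplified illustration and not fancyimpute's algorithm; `knn_impute`, the unweighted average, and `k=2` are assumptions made here, so it will not reproduce the 15.72267049 shown above. The array reuses the five rows printed earlier.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that column over the k nearest rows.

    Distance between two rows is the root mean squared difference over
    the columns that both rows have observed (hypothetical helper).
    """
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        # Candidate donors: other rows with an observed value in column j
        donors = [r for r in range(len(X)) if r != i and not np.isnan(X[r, j])]
        dists = []
        for r in donors:
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if shared.any():
                d = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
                dists.append((d, r))
        dists.sort()  # closest donors first
        nearest = [r for _, r in dists[:k]]
        if nearest:
            filled[i, j] = X[nearest, j].mean()
    return filled

# The five rows shown in the data.csv output above
X = np.array([
    [1, 19.0, 15.0, 39],
    [2, 21.0, 15.0, 81],
    [3, 20.0, np.nan, 6],
    [4, 23.0, 16.0, 77],
    [5, np.nan, 17.0, 40],
])

result = knn_impute(X, k=2)
print(result)
```

As in the `fit_transform(df)` call above, every column (including the first, which looks like an ID) contributes to the distance; in practice, dropping identifier columns and scaling the features first usually gives more meaningful neighbours.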