Lab4-IDA - Ipynb - Colaboratory

Venkatesh Gauri Shankar
# Exp 4: Finding missing values, categorizing them, and estimating replacements in various datasets
## 1. First look at the dataset
import pandas as pd
import numpy as np
df = pd.read_csv("titanic_dataset.csv")
df.head()
   PassengerId  Survived  Pclass  Name                                  Sex     Age   SibSp  Parch  Ticket     Fare     Cabi
0  1            0         3       Braund, Mr. Owen Harris               male    22.0  1      0      A/5 21171  7.2500   NaN
1  2            1         1       Cumings, Mrs. John Bradley (Florence  female  38.0  1      0      PC 17599   71.2833  C8
df.drop("Name",axis=1,inplace=True)
df.head()
PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin
0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN
1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85
2 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN
3 4 1 1 female 35.0 1 0 113803 53.1000 C123
4 5 0 3 male 35.0 0 0 373450 8.0500 NaN
df.drop("Ticket",axis=1,inplace=True)
df.drop("PassengerId",axis=1,inplace=True)
df.drop("Cabin",axis=1,inplace=True)
df.drop("Embarked",axis=1,inplace=True)
df.head()
Survived Pclass Sex Age SibSp Parch Fare
0 0 3 male 22.0 1 0 7.2500
1 1 1 female 38.0 1 0 71.2833
2 1 3 female 26.0 0 0 7.9250
3 1 1 female 35.0 1 0 53.1000
4 0 3 male 35.0 0 0 8.0500
df.drop("Survived",axis=1,inplace=True)
df.head()
Pclass Sex Age SibSp Parch Fare
0 3 male 22.0 1 0 7.2500
1 1 female 38.0 1 0 71.2833
2 3 female 26.0 0 0 7.9250
3 1 female 35.0 1 0 53.1000
4 3 male 35.0 0 0 8.0500
## 2. How to check whether data is missing
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890

Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 891 non-null int64
1 Sex 891 non-null object
2 Age 714 non-null float64
3 SibSp 891 non-null int64
4 Parch 891 non-null int64
5 Fare 891 non-null float64
dtypes: float64(2), int64(3), object(1)
memory usage: 41.9+ KB
print(df.isnull().sum())
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
dtype: int64
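Raw counts are easier to judge as a share of all rows; in the output above, Age is missing in 177 of 891 rows, about 19.9%. The following is a minimal sketch of that calculation, assuming a small made-up frame (`demo` and its values are illustrative, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the lab, the values are made up.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Pclass": [3, 1, 3, 3],
})

# isnull() yields a boolean mask; the mean of a boolean column is the
# fraction of True values, i.e. the fraction of missing entries.
missing_pct = demo.isnull().mean() * 100

# Show the columns with the most missing data first.
print(missing_pct.sort_values(ascending=False))
```

A percentage per column gives a quick basis for choosing between dropping a column, dropping rows, or imputing.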
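## Deleting the column with missing data (if there are many null values)
updated_df = df.dropna(axis=1)
updated_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 891 non-null int64
1 Sex 891 non-null object
2 SibSp 891 non-null int64
3 Parch 891 non-null int64
4 Fare 891 non-null float64
dtypes: float64(1), int64(3), object(1)
memory usage: 34.9+ KB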
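## Deleting the row with missing data (if there are many null values)
updated_df = df.dropna(axis=0)
updated_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 714 entries, 0 to 890
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 714 non-null int64
1 Sex 714 non-null object
2 Age 714 non-null float64
3 SibSp 714 non-null int64
4 Parch 714 non-null int64
5 Fare 714 non-null float64
dtypes: float64(2), int64(3), object(1)
memory usage: 39.0+ KB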
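Dropping every row or column that contains any null can discard more data than intended. As a hedged sketch, `dropna` also accepts `subset` and `thresh` to narrow the rule; the frame `demo` and the `Notes` column below are hypothetical and only serve the illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Notes": [np.nan, np.nan, np.nan, "x"],
})

# Drop a row only when Age is missing; gaps in other columns are tolerated.
rows_kept = demo.dropna(axis=0, subset=["Age"])

# Keep only columns with at least 2 non-null values.
# "Notes" has a single non-null value, so it is dropped.
cols_kept = demo.dropna(axis=1, thresh=2)

print(rows_kept.shape)
print(list(cols_kept.columns))
```

Both calls return new frames and leave `demo` unchanged, which makes it easy to compare strategies side by side.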
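## Central tendency: mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pclass 891 non-null int64
1 Sex 891 non-null object
2 Age 891 non-null float64
3 SibSp 891 non-null int64
4 Parch 891 non-null int64
5 Fare 891 non-null float64
dtypes: float64(2), int64(3), object(1)
print(df.isnull().sum())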
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
dtype: int64
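## Central tendency: median
# Age was already filled with the mean in the previous step, so on this df the call below changes nothing;
# apply it to the unimputed data to get a true median fill.
df['Age'] = df['Age'].fillna(df['Age'].median())
df.info()
print(df.isnull().sum())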
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
dtype: int64
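## Central tendency: mode
# Age no longer contains nulls at this point, so on this df the call below changes nothing;
# apply it to the unimputed data to get a true mode fill.
df['Age'] = df['Age'].fillna(df['Age'].mode()[0])
df.info()
print(df.isnull().sum())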
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
dtype: int64
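Because `fillna` only touches values that are still null, the three central-tendency fills can only be compared if each one starts from the unimputed column. A minimal sketch, assuming a short made-up `Age` series (the values are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two gaps; not taken from the Titanic file.
age = pd.Series([22.0, 38.0, np.nan, 35.0, 35.0, np.nan], name="Age")

# Each fill starts from the same original series, so none of them
# is affected by an earlier imputation.
filled = pd.DataFrame({
    "mean": age.fillna(age.mean()),        # mean of 22, 38, 35, 35
    "median": age.fillna(age.median()),    # middle of the sorted observed values
    "mode": age.fillna(age.mode()[0]),     # most frequent observed value
})

print(filled)
```

The mean is pulled toward outliers, the median is more robust for skewed columns, and the mode is the usual choice for categorical columns.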
## 5. Interpolation.
import pandas as pd
import numpy as np
df2 = pd.read_csv("titanic_dataset.csv")
df2.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Tick
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 211
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 175
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O 31012
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      1138
df2.drop("Survived",axis=1,inplace=True)
df2.drop("Ticket",axis=1,inplace=True)
df2.drop("PassengerId",axis=1,inplace=True)
df2.drop("Cabin",axis=1,inplace=True)
df2.drop("Embarked",axis=1,inplace=True)
df2.drop("Name",axis=1,inplace=True)
df2.head()
Pclass Sex Age SibSp Parch Fare
0 3 male 22.0 1 0 7.2500
1 1 female 38.0 1 0 71.2833
2 3 female 26.0 0 0 7.9250
3 1 female 35.0 1 0 53.1000
4 3 male 35.0 0 0 8.0500
df2.info()

print(df2.isnull().sum())
Pclass 0
Sex 0
Age 177
SibSp 0
Parch 0
Fare 0
dtype: int64
# limit_direction (not "direction") is the pandas keyword that controls the fill direction
df2['Age'] = df2['Age'].interpolate(method='linear', limit_direction='forward')
df2.info()

print(df2.isnull().sum())
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
dtype: int64
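Linear interpolation replaces each gap with a value on the straight line between its nearest observed neighbours, using row position as the x-axis. The sketch below shows this on a short made-up series (`s` is illustrative) and how `limit_direction` treats a gap at the start:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a leading gap, an interior gap, and a trailing gap.
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

# Forward: interior gaps are interpolated, the trailing gap repeats the last
# observed value, and the leading gap stays NaN (nothing precedes it).
forward = s.interpolate(method="linear", limit_direction="forward")

# Both: as above, and the leading gap is filled with the first observed value.
both = s.interpolate(method="linear", limit_direction="both")

print(pd.DataFrame({"original": s, "forward": forward, "both": both}))
```

Row order in the Titanic file carries no natural meaning for Age, so interpolated ages depend on how the rows happen to be sorted; the method is better suited to ordered data such as time series.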
## KNN-Finite difference
pip install fancyimpute
Successfully installed fancyimpute-0.7.0 knnimpute-0.1.0 nose-1.3.7
import pandas as pd
import numpy as np
df = pd.read_csv("data.csv")
df
Cust id Age Annual Income (K$) Spending Score (1-100)
0 1 19.0 15.0 39
1 2 21.0 15.0 81
2 3 20.0 NaN 6
3 4 23.0 16.0 77
4 5 NaN 17.0 40
from fancyimpute import KNN
knn_imputer = KNN()
# imputing the missing value with knn imputer
data = knn_imputer.fit_transform(df)
Imputing row 1/5 with 0 missing, elapsed time: 0.001
data
array([[ 1.        , 19.        , 15.        , 39.        ],
       [ 2.        , 21.        , 15.        , 81.        ],
       [ 3.        , 20.        , 15.72267049,  6.        ],
       [ 4.        , 23.        , 16.        , 77.        ],
       [ 5.        , 19.09437562, 17.        , 40.        ]])
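KNN imputation fills each gap from the rows that are most similar on the columns that are observed. As an alternative sketch, the same idea is available in scikit-learn as `KNNImputer`; this is a swapped-in library, not the `fancyimpute` call used above, and its defaults differ, so the imputed numbers need not match. The array repeats the five rows shown above, with the `Cust id` column left out on the assumption that an identifier should not influence the distance:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Columns: Age, Annual Income (K$), Spending Score (1-100).
# Values copied from the five rows shown above; Cust id is omitted.
X = np.array([
    [19.0, 15.0, 39.0],
    [21.0, 15.0, 81.0],
    [20.0, np.nan, 6.0],
    [23.0, 16.0, 77.0],
    [np.nan, 17.0, 40.0],
])

# n_neighbors=2 is an illustrative choice; weights="distance" gives closer
# rows more influence on the imputed value.
imputer = KNNImputer(n_neighbors=2, weights="distance")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Distances here are computed on raw units, so the spending score dominates; scaling the columns before imputing is a common refinement.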