Lab4-IDA - Ipynb - Colaboratory

Venkatesh Gauri Shankar

# Exp 4: Finding, categorizing, and estimating missing values in various datasets

## 1. First look at the dataset

import pandas as pd
import numpy as np
df = pd.read_csv("titanic_dataset.csv")

df.head()

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN

1 2 1 1 Cumings, Mrs. John Bradley (Florence... female 38.0 1 0 PC 17599 71.2833 C8...

df.drop("Name",axis=1,inplace=True)

df.head()

PassengerId Survived Pclass Sex Age SibSp Parch Ticket Fare Cabin

0 1 0 3 male 22.0 1 0 A/5 21171 7.2500 NaN

1 2 1 1 female 38.0 1 0 PC 17599 71.2833 C85

2 3 1 3 female 26.0 0 0 STON/O2. 3101282 7.9250 NaN

3 4 1 1 female 35.0 1 0 113803 53.1000 C123

4 5 0 3 male 35.0 0 0 373450 8.0500 NaN

df.drop("Ticket",axis=1,inplace=True)

df.drop("PassengerId",axis=1,inplace=True)

df.drop("Cabin",axis=1,inplace=True)

df.drop("Embarked",axis=1,inplace=True)

df.head()

Survived Pclass Sex Age SibSp Parch Fare

0 0 3 male 22.0 1 0 7.2500

1 1 1 female 38.0 1 0 71.2833

2 1 3 female 26.0 0 0 7.9250

3 1 1 female 35.0 1 0 53.1000

4 0 3 male 35.0 0 0 8.0500

df.drop("Survived",axis=1,inplace=True)

df.head()

Pclass Sex Age SibSp Parch Fare

0 3 male 22.0 1 0 7.2500

1 1 female 38.0 1 0 71.2833

2 3 female 26.0 0 0 7.9250

3 1 female 35.0 1 0 53.1000

4 3 male 35.0 0 0 8.0500

## 2. How to know whether data is missing or not

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890


Data columns (total 6 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Pclass 891 non-null int64

1 Sex 891 non-null object

2 Age 714 non-null float64

3 SibSp 891 non-null int64

4 Parch 891 non-null int64

5 Fare 891 non-null float64

dtypes: float64(2), int64(3), object(1)

memory usage: 41.9+ KB

print(df.isnull().sum())

Pclass 0

Sex 0

Age 177

SibSp 0

Parch 0

Fare 0

dtype: int64
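The counts above can also be expressed as a share of each column, which helps when deciding between deleting and imputing. The following is a minimal sketch on a small made-up frame; the `toy` data, the `summary` name, and its column labels are illustrative assumptions, not part of the lab.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few Titanic columns (values are illustrative only).
toy = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

missing_count = toy.isnull().sum()       # number of NaN per column
missing_pct = toy.isnull().mean() * 100  # share of NaN per column, in percent

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary)
```

Applied to the lab's `df`, the same two expressions would report 177 missing `Age` values out of 891 rows, which is roughly 20% of the column.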

## Deleting the column with missing data (if there are many null values)

updated_df = df.dropna(axis=1)

updated_df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890


Data columns (total 5 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Pclass 891 non-null int64

1 Sex 891 non-null object

2 SibSp 891 non-null int64

3 Parch 891 non-null int64

4 Fare 891 non-null float64

dtypes: float64(1), int64(3), object(1)

memory usage: 34.9+ KB

## Deleting the rows with missing data (if there are many null values)

updated_df = df.dropna(axis=0)

updated_df.info()

<class 'pandas.core.frame.DataFrame'>

Int64Index: 714 entries, 0 to 890


Data columns (total 6 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Pclass 714 non-null int64

1 Sex 714 non-null object

2 Age 714 non-null float64

3 SibSp 714 non-null int64

4 Parch 714 non-null int64

5 Fare 714 non-null float64

dtypes: float64(2), int64(3), object(1)

memory usage: 39.0+ KB
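The two deletion strategies can be combined: drop only the columns that are mostly empty, then drop the remaining incomplete rows. This is a rough sketch on made-up data; the `max_missing` cut-off of 0.5 is a hypothetical choice, not a value taken from the lab.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values): Cabin is mostly empty, Age only partly.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

max_missing = 0.5  # hypothetical cut-off: drop columns with more than 50% NaN

# Keep the columns whose fraction of missing values is at or below the cut-off.
keep_cols = toy.columns[toy.isnull().mean() <= max_missing]
reduced = toy[keep_cols]

# Then drop the rows that still contain a NaN.
complete_rows = reduced.dropna(axis=0)
print(complete_rows)
```

Note that `dropna(axis=0)` keeps the original index labels, which is why the row-deletion output above still runs from 0 to 890 with only 714 entries.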

## Central Tendency: mean

df['Age'] = df['Age'].fillna(df['Age'].mean())

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890


Data columns (total 6 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Pclass 891 non-null int64

1 Sex 891 non-null object

2 Age 891 non-null float64

3 SibSp 891 non-null int64

4 Parch 891 non-null int64

5 Fare 891 non-null float64

dtypes: float64(2), int64(3), object(1)

memory usage: 41.9+ KB

print(df.isnull().sum())

Pclass 0

Sex 0

Age 0

SibSp 0

Parch 0

Fare 0

dtype: int64

## Central Tendency: median

# No-op if Age was already filled with the mean above; reload df first to see the median fill.
df['Age'] = df['Age'].fillna(df['Age'].median())

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890


Data columns (total 6 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Pclass 891 non-null int64

1 Sex 891 non-null object

2 Age 891 non-null float64

3 SibSp 891 non-null int64

4 Parch 891 non-null int64

5 Fare 891 non-null float64

dtypes: float64(2), int64(3), object(1)

memory usage: 41.9+ KB

print(df.isnull().sum())

Pclass 0

Sex 0

Age 0

SibSp 0

Parch 0

Fare 0

dtype: int64

## Central Tendency: mode

# No-op if Age was already filled above; reload df first to see the mode fill.
df['Age'] = df['Age'].fillna(df['Age'].mode()[0])

df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890


Data columns (total 6 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Pclass 891 non-null int64

1 Sex 891 non-null object

2 Age 891 non-null float64

3 SibSp 891 non-null int64

4 Parch 891 non-null int64

5 Fare 891 non-null float64

dtypes: float64(2), int64(3), object(1)

memory usage: 41.9+ KB

print(df.isnull().sum())

Pclass 0

Sex 0

Age 0

SibSp 0

Parch 0

Fare 0

dtype: int64
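Because `fillna` only touches values that are still missing, the three fills are easier to compare when each one starts from the same untouched column. The sketch below does that on a short made-up `age` series; the values and the `fills` and `filled` names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative ages with two gaps (not taken from the Titanic file).
age = pd.Series([22.0, 38.0, np.nan, 35.0, 35.0, np.nan, 54.0], name="Age")

# Each statistic is computed from the observed values only (NaN is skipped).
fills = {
    "mean": age.mean(),
    "median": age.median(),
    "mode": age.mode()[0],  # mode() returns a Series; take the first mode
}

# One column per strategy, each filled from the original series.
filled = pd.DataFrame({name: age.fillna(value) for name, value in fills.items()})
print(filled)
```

The original `age` series is left unchanged, so further strategies can be added to `fills` without reloading the data.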

## 5. Interpolation

import pandas as pd

import numpy as np

df2 = pd.read_csv("titanic_dataset.csv")

df2.head()

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket

0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 211...

1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 175...

2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O 31012...

3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 1138...

df2.drop("Survived",axis=1,inplace=True)
df2.drop("Ticket",axis=1,inplace=True)
df2.drop("PassengerId",axis=1,inplace=True)
df2.drop("Cabin",axis=1,inplace=True)
df2.drop("Embarked",axis=1,inplace=True)
df2.drop("Name",axis=1,inplace=True)

df2.head()

Pclass Sex Age SibSp Parch Fare

0 3 male 22.0 1 0 7.2500

1 1 female 38.0 1 0 71.2833

2 3 female 26.0 0 0 7.9250

3 1 female 35.0 1 0 53.1000

4 3 male 35.0 0 0 8.0500

df2.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890


Data columns (total 6 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Pclass 891 non-null int64

1 Sex 891 non-null object

2 Age 714 non-null float64

3 SibSp 891 non-null int64

4 Parch 891 non-null int64

5 Fare 891 non-null float64

dtypes: float64(2), int64(3), object(1)

memory usage: 41.9+ KB

print(df2.isnull().sum())

Pclass 0

Sex 0

Age 177

SibSp 0

Parch 0

Fare 0

dtype: int64

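# The parameter is limit_direction (not direction); assign the result back instead of using inplace on a column.
df2['Age'] = df2['Age'].interpolate(method='linear', limit_direction='forward')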

df2.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 891 entries, 0 to 890


Data columns (total 6 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Pclass 891 non-null int64

1 Sex 891 non-null object

2 Age 891 non-null float64

3 SibSp 891 non-null int64

4 Parch 891 non-null int64

5 Fare 891 non-null float64

dtypes: float64(2), int64(3), object(1)

memory usage: 41.9+ KB

print(df2.isnull().sum())

Pclass 0

Sex 0

Age 0

SibSp 0

Parch 0

Fare 0

dtype: int64
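To see what linear interpolation actually writes into the gaps, it helps to run it on a series small enough to check by hand. The following is a minimal sketch with made-up values; the series `s` and the names `forward` and `both` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative series with a leading gap, an interior gap and a trailing gap.
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

# Forward direction: interior gaps get evenly spaced values between the
# neighbours; the leading NaN has no earlier value and stays missing.
forward = s.interpolate(method="linear", limit_direction="forward")

# Both directions: the leading gap is filled as well.
both = s.interpolate(method="linear", limit_direction="both")

print(pd.DataFrame({"original": s, "forward": forward, "both": both}))
```

Linear interpolation relies on row order. The Titanic rows have no natural ordering, so the filled ages depend on whichever passengers happen to be adjacent in the file.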

## KNN-Finite difference

!pip install fancyimpute


Successfully installed fancyimpute-0.7.0 knnimpute-0.1.0 nose-1.3.7

import pandas as pd

import numpy as np

df = pd.read_csv("data.csv")

df

Cust id Age Annual Income (K$) Spending Score (1-100)

0 1 19.0 15.0 39

1 2 21.0 15.0 81

2 3 20.0 NaN 6

3 4 23.0 16.0 77

4 5 NaN 17.0 40

from fancyimpute import KNN

knn_imputer = KNN()

# imputing the missing value with knn imputer

data = knn_imputer.fit_transform(df)

Imputing row 1/5 with 0 missing, elapsed time: 0.001

data

array([[ 1.        , 19.        , 15.        , 39.        ],
       [ 2.        , 21.        , 15.        , 81.        ],
       [ 3.        , 20.        , 15.72267049,  6.        ],
       [ 4.        , 23.        , 16.        , 77.        ],
       [ 5.        , 19.09437562, 17.        , 40.        ]])
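The idea behind KNN imputation can be sketched in plain NumPy: for each missing cell, find the rows most similar on the columns both rows have observed, then average their values in the missing column. This is a simplified illustration and not fancyimpute's implementation; the `knn_impute` function, the unweighted mean, and `k=2` are assumptions, so the numbers will differ from the output above.

```python
import numpy as np

def knn_impute(X, k=2):
    """Fill each NaN with the mean of that column over the k nearest donor rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        candidates = []
        for r in range(X.shape[0]):
            if r == i or np.isnan(X[r, j]):
                continue  # a donor row must have column j observed
            shared = ~np.isnan(X[i]) & ~np.isnan(X[r])
            if shared.any():
                # Root-mean-square distance over the columns both rows have.
                dist = np.sqrt(np.mean((X[i, shared] - X[r, shared]) ** 2))
                candidates.append((dist, r))
        nearest = sorted(candidates)[:k]
        if nearest:  # leave the cell as NaN if no donor row exists
            filled[i, j] = np.mean([X[r, j] for _, r in nearest])
    return filled

# Age, Annual Income (K$), Spending Score from the table above (Cust id left out).
X = np.array([
    [19.0, 15.0, 39.0],
    [21.0, 15.0, 81.0],
    [20.0, np.nan, 6.0],
    [23.0, 16.0, 77.0],
    [np.nan, 17.0, 40.0],
])
imputed = knn_impute(X, k=2)
print(imputed)
```

The identifier column is left out here because a customer number carries no similarity information. The columns are also unscaled, so the spending score dominates the distance; standardizing the features first is a common refinement.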
