
AP19110010012_Lab Assignment 4 - Jupyter Notebook 05/09/21, 4:11 PM

In [25]: import pandas as pd
import numpy as np
df = pd.read_csv("data_cleaning.csv")
df.head()

Out[25]:
    Make    Colour  Odometer (KM)  Doors    Price
0   Honda   White         35431.0    4.0  15323.0
1   BMW     Blue              NaN    5.0  19943.0
2   Honda   White         84714.0    4.0  28343.0
3   Toyota  White        154365.0    NaN  13434.0
4   Nissan  Blue         181577.0    3.0  14043.0

Finding out which columns contain missing values, and how many

In [26]: df.isna().sum()

Out[26]: Make 1
Colour 1
Odometer (KM) 4
Doors 1
Price 2
dtype: int64
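The raw counts above are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up frame (the `sample` data is hypothetical and only stands in for `data_cleaning.csv`); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for data_cleaning.csv
sample = pd.DataFrame({
    "Make": ["Honda", "BMW", None, "Toyota"],
    "Odometer (KM)": [35431.0, np.nan, 84714.0, np.nan],
    "Price": [15323.0, 19943.0, 28343.0, 13434.0],
})

missing_count = sample.isna().sum()    # number of missing cells per column
missing_share = sample.isna().mean()   # fraction of rows missing per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary)
```

A column with a small share of missing rows is a candidate for dropping those rows, while a large share usually calls for one of the fill strategies.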

Data Preprocessing

1. Ignoring (dropping) the rows when only a few values are missing.

If, as in the data frame above, only the 2nd row had missing data, we could simply
drop that row. But here the missing values are spread over several rows and columns,
so dropping them would discard too much data. Thus, this method is not viable.
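For reference, dropping rows is done with `DataFrame.dropna`. The sketch below uses a small hypothetical `sample` frame (not the assignment data) and shows both dropping every incomplete row and dropping only rows that lack a value in one chosen column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not the contents of data_cleaning.csv
sample = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota"],
    "Odometer (KM)": [35431.0, np.nan, 84714.0, 154365.0],
    "Doors": [4.0, 5.0, 4.0, np.nan],
})

# Drop every row that has at least one missing value
complete_rows = sample.dropna()

# Drop only rows where the odometer reading is missing
has_odometer = sample.dropna(subset=["Odometer (KM)"])

print(len(sample), len(complete_rows), len(has_odometer))
```

Both calls return new frames and leave `sample` unchanged, so the cost of dropping can be inspected before committing to it.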

2. Fill the missing data manually

In [46]: df.loc[1,['Odometer (KM)']] = 100000

In [47]: df.loc[1,['Odometer (KM)']]

Out[47]: Odometer (KM) 100000
Name: 1, dtype: object

This method is not practical when a large number of values are missing.
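In the cell above the column is selected with a list (`['Odometer (KM)']`), which is why the read-back is a one-element Series with `dtype: object`. A possible variant, sketched here on a hypothetical two-row `sample` frame, selects the column by its plain label so that a single scalar is written and read.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not the contents of data_cleaning.csv
sample = pd.DataFrame({
    "Make": ["Honda", "BMW"],
    "Odometer (KM)": [35431.0, np.nan],
})

# Scalar label for the column: sets exactly one cell
sample.loc[1, "Odometer (KM)"] = 100000

# Reading it back with the same scalar labels returns a single value
value = sample.loc[1, "Odometer (KM)"]
print(value, sample["Odometer (KM)"].dtype)
```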


3. Using a global constant to replace the missing values.

In [48]: df = pd.read_csv('data_cleaning.csv')
df.head()

Out[48]:
    Make    Colour  Odometer (KM)  Doors    Price
0   Honda   White         35431.0    4.0  15323.0
1   BMW     Blue              NaN    5.0  19943.0
2   Honda   White         84714.0    4.0  28343.0
3   Toyota  White        154365.0    NaN  13434.0
4   Nissan  Blue         181577.0    3.0  14043.0

In [49]: df.fillna(0.0).head()

Out[49]:
    Make    Colour  Odometer (KM)  Doors    Price
0   Honda   White         35431.0    4.0  15323.0
1   BMW     Blue              0.0    5.0  19943.0
2   Honda   White         84714.0    4.0  28343.0
3   Toyota  White        154365.0    0.0  13434.0
4   Nissan  Blue         181577.0    3.0  14043.0

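Here, we replaced the missing values with the constant 0. Note that fillna returns a
new data frame; since the result is not assigned back, df itself is unchanged.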
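A single constant such as 0 is applied to every column, including text columns like Colour. As a sketch of one alternative, `fillna` also accepts a dictionary that maps column names to their own constants. The `sample` frame and the particular constants chosen below are hypothetical and only illustrate the call.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not the contents of data_cleaning.csv
sample = pd.DataFrame({
    "Colour": ["White", None, "Blue"],
    "Odometer (KM)": [35431.0, np.nan, 84714.0],
    "Doors": [4.0, 5.0, np.nan],
})

# Hypothetical per-column constants (assumed for illustration)
fill_values = {"Colour": "Unknown", "Odometer (KM)": 0.0, "Doors": 4.0}

# Returns a new frame; assign it to keep the result
filled = sample.fillna(value=fill_values)
print(filled)
```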

4. Using the mean to fill the missing values.

In [50]: df = pd.read_csv('data_cleaning.csv')
df.head()

Out[50]:
    Make    Colour  Odometer (KM)  Doors    Price
0   Honda   White         35431.0    4.0  15323.0
1   BMW     Blue              NaN    5.0  19943.0
2   Honda   White         84714.0    4.0  28343.0
3   Toyota  White        154365.0    NaN  13434.0
4   Nissan  Blue         181577.0    3.0  14043.0

Using the column mean (truncated to an integer) to fill the missing values in the
'Odometer (KM)' and 'Doors' columns


In [52]: df.columns[0:2]

Out[52]: Index(['Make', 'Colour'], dtype='object')

In [53]: for i in df.columns[2:4]:
    # assign the result back instead of calling fillna(inplace=True) on a column selection
    df[i] = df[i].fillna(int(df[i].mean()))

In [54]: df.head()

Out[54]:
    Make    Colour  Odometer (KM)  Doors    Price
0   Honda   White         35431.0    4.0  15323.0
1   BMW     Blue         112890.0    5.0  19943.0
2   Honda   White         84714.0    4.0  28343.0
3   Toyota  White        154365.0    4.0  13434.0
4   Nissan  Blue         181577.0    3.0  14043.0
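The loop above picks the columns by position. A possible alternative, sketched on a hypothetical `sample` frame, selects every numeric column by dtype and fills them in one step by passing the per-column means to `fillna`. Unlike the loop, this sketch does not truncate the mean to an integer, and it would also cover a numeric Price column if one were present.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not the contents of data_cleaning.csv
sample = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota"],
    "Odometer (KM)": [30000.0, np.nan, 90000.0, 150000.0],
    "Doors": [4.0, 5.0, 4.0, np.nan],
})

# Pick the numeric columns by dtype rather than by position
numeric_cols = sample.select_dtypes(include="number").columns

# mean() skips NaN by default, giving one mean per numeric column
column_means = sample[numeric_cols].mean()

# fillna with a Series matches its index against the column names
filled = sample.copy()
filled[numeric_cols] = sample[numeric_cols].fillna(column_means)
print(filled)
```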

5. Using the most frequent value to fill the missing values.

In [55]: df = pd.read_csv('data_cleaning.csv')
df.head()

Out[55]:
    Make    Colour  Odometer (KM)  Doors    Price
0   Honda   White         35431.0    4.0  15323.0
1   BMW     Blue              NaN    5.0  19943.0
2   Honda   White         84714.0    4.0  28343.0
3   Toyota  White        154365.0    NaN  13434.0
4   Nissan  Blue         181577.0    3.0  14043.0

Using the most frequent value (mode) of each column


In [56]: df.fillna(df.mode().iloc[0])

Out[56]:
    Make    Colour  Odometer (KM)  Doors    Price
0   Honda   White         35431.0    4.0  15323.0
1   BMW     Blue          17119.0    5.0  19943.0
2   Honda   White         84714.0    4.0  28343.0
3   Toyota  White        154365.0    4.0  13434.0
4   Nissan  Blue         181577.0    3.0  14043.0
5   Honda   Red           42652.0    4.0  23883.0
6   Toyota  Blue         163453.0    4.0   8473.0
7   Honda   White         17119.0    4.0  20306.0
8   Honda   White        130538.0    4.0   9374.0
9   Honda   Blue          51029.0    4.0  26683.0
10  Nissan  White        167421.0    4.0   6010.0
11  Nissan  Green         17119.0    4.0   6160.0
12  Nissan  White        102303.0    4.0  16909.0
13  BMW     White        134181.0    4.0  11121.0
14  Honda   Blue         199833.0    4.0  18946.0
15  Toyota  Blue          17119.0    4.0  16290.0
16  Toyota  Red           96742.0    4.0  34465.0
17  BMW     White        194189.0    5.0  17177.0
18  Nissan  White         67991.0    3.0   6010.0
19  Nissan  Blue          17119.0    4.0   6010.0
20  Toyota  Green        124844.0    4.0  24130.0
21  Honda   White         30615.0    4.0  29653.0
22  Toyota  White        148744.0    4.0  22489.0
23  Honda   Green        130075.0    4.0  21242.0

The missing values have been replaced by the most frequent values.
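One detail of `df.mode().iloc[0]` is worth keeping in mind: `mode()` returns every tied value in sorted order, so when all values in a column are distinct, the first row is simply the smallest value rather than a genuinely frequent one. This may be what happens in the Odometer (KM) column above. The sketch below shows the behaviour on a hypothetical `sample` frame.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; not the contents of data_cleaning.csv
sample = pd.DataFrame({
    "Colour": ["White", "Blue", "White", None],
    "Doors": [4.0, 4.0, 5.0, np.nan],
    "Odometer (KM)": [50000.0, 20000.0, 80000.0, np.nan],  # all distinct
})

# One row per tied mode; columns with fewer modes are padded with NaN
modes = sample.mode()

# First row: the mode of each column, or the smallest value when all tie
most_frequent = modes.iloc[0]

filled = sample.fillna(most_frequent)
print(modes)
print(filled)
```

For categorical columns such as Colour and Doors the mode is a sensible fill value; for a continuous column with mostly distinct values, the mean or median is usually the more meaningful choice.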
