Data Mining Lab 1: Identify a Dataset, Preprocess the Dataset Using Normalization Techniques

2139472 Lab_1
December 3, 2021
1 Data Mining
2 Lab_1
3 Vinay Sirohi
4 2139472
5 Identify a dataset, Preprocess the dataset using normalization techniques
5.0.1 Importing the useful libraries
[1]: import numpy as np

from sklearn.preprocessing import LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
5.1 Step 1: Reading and Understanding the Data

[2]: housing = pd.read_csv('Housing.csv')
housing.head()
[2]: price area bedrooms bathrooms stories mainroad guestroom basement \

0 13300000 7420 4 2 3 yes no no
1 12250000 8960 4 4 4 yes no no
2 12250000 9960 3 2 2 yes no yes
3 12215000 7500 4 2 2 yes no yes
4 11410000 7420 4 1 2 yes yes yes
hotwaterheating airconditioning parking prefarea furnishingstatus
0 no yes 2 yes furnished
1 no yes 3 no furnished
2 no no 2 yes semi-furnished
3 no yes 3 yes furnished
4 no yes 2 no furnished
[3]: housing.shape
[3]: (545, 13)
[4]: housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 545 non-null int64
1 area 545 non-null int64
2 bedrooms 545 non-null int64
3 bathrooms 545 non-null int64
4 stories 545 non-null int64
5 mainroad 545 non-null object
6 guestroom 545 non-null object
7 basement 545 non-null object
8 hotwaterheating 545 non-null object
9 airconditioning 545 non-null object
10 parking 545 non-null int64
11 prefarea 545 non-null object
12 furnishingstatus 545 non-null object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB
5.2 Step 2: Visualising the Data

[5]: plt.figure(figsize=[5,5])
plt.scatter(housing.area, housing.price , color = 'green')
plt.show()
[6]: sns.pairplot(housing)
plt.show()
5.3 Step 3: Data Preparation
5.3.1 The dataset has several columns whose values are 'yes' or 'no'.
5.3.2 To fit a regression line we need numerical values rather than strings, so these columns are converted to 1s and 0s, where 1 is 'yes' and 0 is 'no'.
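The yes/no conversion described above can also be written as an explicit mapping. The sketch below is only an illustration: the rows are made up, and only the column names match the dataset. `LabelEncoder` gives the same codes for these columns because it sorts the class labels, so 'no' becomes 0 and 'yes' becomes 1.

```python
import pandas as pd

# Toy frame with hypothetical rows; only the column names come from the lab's dataset.
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "guestroom": ["no", "no", "yes"],
})

# Explicit mapping: 'yes' -> 1, 'no' -> 0.
yes_no = {"yes": 1, "no": 0}
encoded = toy.copy()
for col in ["mainroad", "guestroom"]:
    encoded[col] = encoded[col].map(yes_no)

print(encoded)
```

An explicit dictionary makes the direction of the encoding visible in the code, and any unexpected value (for example a misspelt 'Yes') shows up as NaN instead of being silently given its own code.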
[7]: housing.columns
[7]: Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',

'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
'parking', 'prefarea', 'furnishingstatus'],
dtype='object')
[8]: labelencoder = LabelEncoder()
# Encode each binary yes/no column as 0/1.
# LabelEncoder sorts the labels, so 'no' -> 0 and 'yes' -> 1.
binary_cols = ['mainroad', 'guestroom', 'basement',
               'hotwaterheating', 'airconditioning', 'prefarea']
for col in binary_cols:
    housing[col] = labelencoder.fit_transform(housing[col])
housing
[8]: price area bedrooms bathrooms stories mainroad guestroom \

0 13300000 7420 4 2 3 1 0
1 12250000 8960 4 4 4 1 0
2 12250000 9960 3 2 2 1 0
3 12215000 7500 4 2 2 1 0
4 11410000 7420 4 1 2 1 1
.. … … … … … … …
540 1820000 3000 2 1 1 1 0
541 1767150 2400 3 1 1 0 0
542 1750000 3620 2 1 1 1 0
543 1750000 2910 3 1 1 0 0
544 1750000 3850 3 1 2 1 0
basement hotwaterheating airconditioning parking prefarea \

0 0 0 1 2 1
1 0 0 1 3 0
2 1 0 0 2 1
3 1 0 1 3 1
4 1 0 1 2 0
.. … … … … …
540 1 0 0 2 0
541 0 0 0 0 0
542 0 0 0 0 0
543 0 0 0 0 0
544 0 0 0 0 0
furnishingstatus
0 furnished
1 furnished
2 semi-furnished
3 furnished
4 furnished
.. …
540 unfurnished
541 semi-furnished
542 unfurnished
543 furnished
544 unfurnished
[545 rows x 13 columns]
5.3.3 Dummy Variables

The variable furnishingstatus has three levels (furnished, semi-furnished, unfurnished). These levels also need to be converted into numeric columns, one 0/1 indicator column per level.
[9]: housing = pd.get_dummies(housing)

housing.head()

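[9]: price area bedrooms bathrooms stories mainroad guestroom \
0 13300000 7420 4 2 3 1 0
1 12250000 8960 4 4 4 1 0
2 12250000 9960 3 2 2 1 0
3 12215000 7500 4 2 2 1 0
4 11410000 7420 4 1 2 1 1
basement hotwaterheating airconditioning parking prefarea \
0 0 0 1 2 1
1 0 0 1 3 0
2 1 0 0 2 1
3 1 0 1 3 1
4 1 0 1 2 0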
furnishingstatus_furnished furnishingstatus_semi-furnished \
0 1 0
1 1 0
2 0 1
3 1 0
4 1 0
furnishingstatus_unfurnished
0 0
1 0
2 0
3 0
4 0
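To make the dummy-variable step concrete, the sketch below applies `pd.get_dummies` to a small made-up column with the same three levels. The `drop_first=True` variant is an assumption added for illustration and is not used in the lab: it keeps two indicators instead of three, which avoids perfectly collinear columns when a linear model also fits an intercept.

```python
import pandas as pd

# Hypothetical rows; the three level names match the furnishingstatus column.
toy = pd.DataFrame({
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished", "furnished"],
})

# One 0/1 indicator column per level, as in the cell above.
full = pd.get_dummies(toy, dtype=int)

# Illustrative alternative: drop the first level and keep k-1 indicators.
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

In the full encoding every row has exactly one indicator set to 1, so the three columns always sum to 1; that redundancy is what `drop_first=True` removes.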
5.4 Step 4: Rescaling the Features

5.4.1 1. Min-Max scaling
[10]: scaler = MinMaxScaler()
[11]: cols = list(housing.columns)

# Rescale every column (including the target, price) to the range [0, 1]
for col in cols:
    housing[col] = housing[col].astype(float)
    housing[[col]] = scaler.fit_transform(housing[[col]])
housing.head()

[11]: price area bedrooms bathrooms stories mainroad guestroom \
0 1.000000 0.396564 0.6 0.333333 0.666667 1.0 0.0
1 0.909091 0.502405 0.6 1.000000 1.000000 1.0 0.0
2 0.909091 0.571134 0.4 0.333333 0.333333 1.0 0.0
3 0.906061 0.402062 0.6 0.333333 0.333333 1.0 0.0
4 0.836364 0.396564 0.6 0.000000 0.333333 1.0 1.0
basement hotwaterheating airconditioning parking prefarea \
0 0.0 0.0 1.0 0.666667 1.0
1 0.0 0.0 1.0 1.000000 0.0
2 1.0 0.0 0.0 0.666667 1.0
3 1.0 0.0 1.0 1.000000 1.0
4 1.0 0.0 1.0 0.666667 0.0
furnishingstatus_furnished furnishingstatus_semi-furnished \
0 1.0 0.0
1 1.0 0.0
2 0.0 1.0
3 1.0 0.0
4 1.0 0.0
furnishingstatus_unfurnished
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
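Min-max scaling maps each value x to (x - min) / (max - min), so the smallest value in a column becomes 0 and the largest becomes 1. The sketch below checks that formula against `MinMaxScaler` on a few made-up numbers; the values are hypothetical and are not taken from Housing.csv.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical area values (one column), not taken from Housing.csv.
x = np.array([[2000.0], [3500.0], [5000.0], [8000.0]])

# Manual formula: (x - min) / (max - min)
manual = (x - x.min()) / (x.max() - x.min())

# The same transformation done by scikit-learn.
scaled = MinMaxScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

Because the result depends only on the column's minimum and maximum, a single extreme value compresses all the other values toward one end of the [0, 1] range.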
5.4.2 2. Standardisation (mean-0, sigma-1)
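[12]: sc = StandardScaler()
# fit_transform returns a NumPy array. It is stored in a separate variable,
# so housing itself is unchanged and the output below still shows the
# min-max scaled values.
housing_std = sc.fit_transform(housing)
housing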

[12]: price area bedrooms bathrooms stories mainroad guestroom \
0 1.000000 0.396564 0.6 0.333333 0.666667 1.0 0.0
1 0.909091 0.502405 0.6 1.000000 1.000000 1.0 0.0
2 0.909091 0.571134 0.4 0.333333 0.333333 1.0 0.0
3 0.906061 0.402062 0.6 0.333333 0.333333 1.0 0.0
4 0.836364 0.396564 0.6 0.000000 0.333333 1.0 1.0
.. … … … … … … …
540 0.006061 0.092784 0.2 0.000000 0.000000 1.0 0.0
541 0.001485 0.051546 0.4 0.000000 0.000000 0.0 0.0
542 0.000000 0.135395 0.2 0.000000 0.000000 1.0 0.0
543 0.000000 0.086598 0.4 0.000000 0.000000 0.0 0.0
544 0.000000 0.151203 0.4 0.000000 0.333333 1.0 0.0
basement hotwaterheating airconditioning parking prefarea \
0 0.0 0.0 1.0 0.666667 1.0
1 0.0 0.0 1.0 1.000000 0.0
2 1.0 0.0 0.0 0.666667 1.0
3 1.0 0.0 1.0 1.000000 1.0
4 1.0 0.0 1.0 0.666667 0.0
.. … … … … …
540 1.0 0.0 0.0 0.666667 0.0
541 0.0 0.0 0.0 0.000000 0.0
542 0.0 0.0 0.0 0.000000 0.0
543 0.0 0.0 0.0 0.000000 0.0
544 0.0 0.0 0.0 0.000000 0.0
furnishingstatus_furnished furnishingstatus_semi-furnished \
0 1.0 0.0
1 1.0 0.0
2 0.0 1.0
3 1.0 0.0
4 1.0 0.0
.. … …
540 0.0 0.0
541 0.0 1.0
542 0.0 0.0
543 1.0 0.0
544 0.0 0.0
furnishingstatus_unfurnished
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
.. …
540 1.0
541 0.0
542 1.0
543 0.0
544 1.0
[545 rows x 15 columns]
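Standardisation replaces each value x with (x - mean) / sigma, so every column ends up with mean 0 and standard deviation 1. The sketch below uses a small made-up frame (the numbers are hypothetical) and wraps the array returned by `StandardScaler` back into a DataFrame so that the column names are kept.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical values; only the column names echo the lab's dataset.
df = pd.DataFrame({
    "area": [2.0, 4.0, 4.0, 6.0],
    "bedrooms": [1.0, 2.0, 3.0, 4.0],
})

sc = StandardScaler()
# fit_transform returns a NumPy array, so rebuild a DataFrame around it.
df_std = pd.DataFrame(sc.fit_transform(df), columns=df.columns, index=df.index)

# Manual formula; ddof=0 matches the population standard deviation
# that StandardScaler uses.
manual = (df - df.mean()) / df.std(ddof=0)

print(df_std.round(3))
print(np.allclose(df_std.values, manual.values))
```

Unlike min-max scaling, the standardised values are not confined to [0, 1]; they are centred on 0 and can be negative.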
5.5 Step 5: Building a linear model
[16]: X=housing.drop('price',axis=1)
y=housing['price']
[17]: X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=.3,random_state=2)
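In this lab the whole frame was rescaled before the split, so the scaler's minimum and maximum were computed from rows that later end up in the test set. A common alternative, sketched below on synthetic data, is to split first and fit the scaler on the training rows only. The data and variable names here are assumptions for illustration, not the lab's own.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and target, standing in for the housing data.
rng = np.random.default_rng(0)
X = rng.uniform(1000, 10000, size=(20, 2))
y = rng.uniform(1e6, 1e7, size=20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics for the test rows

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

With this ordering the scaled test values may fall slightly outside [0, 1], which is expected: the test rows are treated as unseen data.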
[18]: model = LinearRegression()

model.fit(X, y)
[18]: LinearRegression()
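[24]: lin=LinearRegression()
lin.fit(X_train,Y_train)   # fit on the training split only
ypred=lin.predict(X_test)  # predict prices for the held-out test split
[26]: # Actual vs. predicted (min-max scaled) prices on the test split
plt.scatter(Y_test,ypred)
plt.show()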
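The scatter plot gives a visual impression of the fit; a numeric score makes it easier to compare models. The sketch below computes R² and mean squared error on a held-out split. It uses synthetic data with a known linear relationship (an assumption for illustration), so the numbers it prints say nothing about the housing model itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: y is a linear function of X plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.01, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)

lin = LinearRegression().fit(X_train, y_train)
y_pred = lin.predict(X_test)

# Score the predictions on the test split only.
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R^2 = {r2:.3f}, MSE = {mse:.6f}")
```

The same two calls can be applied to `Y_test` and `ypred` from the cells above to put a number on the housing model's fit.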
