
2139472 Lab_1

December 3, 2021

1 Data Mining

2 Lab_1

3 Vinay Sirohi

4 2139472

5 Identify a dataset and preprocess it using normalization techniques
5.0.1 Importing the useful libraries

[1]: import numpy as np


from sklearn.preprocessing import LabelEncoder
from sklearn import feature_selection
from sklearn import model_selection
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

5.1 Step 1: Reading and Understanding the Data


[2]: housing = pd.read_csv('Housing.csv')
housing.head()

[2]: price area bedrooms bathrooms stories mainroad guestroom basement \


0 13300000 7420 4 2 3 yes no no
1 12250000 8960 4 4 4 yes no no
2 12250000 9960 3 2 2 yes no yes
3 12215000 7500 4 2 2 yes no yes
4 11410000 7420 4 1 2 yes yes yes

hotwaterheating airconditioning parking prefarea furnishingstatus
0 no yes 2 yes furnished
1 no yes 3 no furnished
2 no no 2 yes semi-furnished
3 no yes 3 yes furnished
4 no yes 2 no furnished

[3]: housing.shape

[3]: (545, 13)

[4]: housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 545 non-null int64
1 area 545 non-null int64
2 bedrooms 545 non-null int64
3 bathrooms 545 non-null int64
4 stories 545 non-null int64
5 mainroad 545 non-null object
6 guestroom 545 non-null object
7 basement 545 non-null object
8 hotwaterheating 545 non-null object
9 airconditioning 545 non-null object
10 parking 545 non-null int64
11 prefarea 545 non-null object
12 furnishingstatus 545 non-null object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB

5.2 Step 2: Visualising the Data


[5]: plt.figure(figsize=[5,5])
plt.scatter(housing.area, housing.price , color = 'green')
plt.show()

[6]: sns.pairplot(housing)
plt.show()

5.3 Step 3: Data Preparation
5.3.1 The dataset has many columns whose values are 'yes' or 'no'.
5.3.2 To fit a regression line we need numerical values, not strings.
Hence, we convert these columns to 1s and 0s, where 1 is 'yes' and 0
is 'no'.
[7]: housing.columns

[7]: Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',


'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
'parking', 'prefarea', 'furnishingstatus'],
dtype='object')

[8]: labelencoder = LabelEncoder()
housing['mainroad']=labelencoder.fit_transform(housing['mainroad'])
housing['guestroom']=labelencoder.fit_transform(housing['guestroom'])
housing['basement']=labelencoder.fit_transform(housing['basement'])
housing['hotwaterheating']=labelencoder.fit_transform(housing['hotwaterheating'])
housing['airconditioning']=labelencoder.fit_transform(housing['airconditioning'])
housing['prefarea']=labelencoder.fit_transform(housing['prefarea'])
housing

[8]: price area bedrooms bathrooms stories mainroad guestroom \


0 13300000 7420 4 2 3 1 0
1 12250000 8960 4 4 4 1 0
2 12250000 9960 3 2 2 1 0
3 12215000 7500 4 2 2 1 0
4 11410000 7420 4 1 2 1 1
.. … … … … … … …
540 1820000 3000 2 1 1 1 0
541 1767150 2400 3 1 1 0 0
542 1750000 3620 2 1 1 1 0
543 1750000 2910 3 1 1 0 0
544 1750000 3850 3 1 2 1 0

basement hotwaterheating airconditioning parking prefarea \


0 0 0 1 2 1
1 0 0 1 3 0
2 1 0 0 2 1
3 1 0 1 3 1
4 1 0 1 2 0
.. … … … … …
540 1 0 0 2 0
541 0 0 0 0 0
542 0 0 0 0 0
543 0 0 0 0 0
544 0 0 0 0 0

furnishingstatus
0 furnished
1 furnished
2 semi-furnished
3 furnished
4 furnished
.. …
540 unfurnished
541 semi-furnished
542 unfurnished

543 furnished
544 unfurnished

[545 rows x 13 columns]
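
The conversion above works because LabelEncoder sorts the labels, so 'no' becomes 0 and 'yes' becomes 1. A minimal sketch of a more explicit alternative follows. It uses a small made-up frame instead of Housing.csv; the names `toy`, `binary_cols`, `mapped` and `encoded` are illustrative assumptions, not part of the lab.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature frame standing in for the yes/no columns.
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "guestroom": ["no", "no", "yes"],
})
binary_cols = ["mainroad", "guestroom"]

# Explicit mapping: the meaning of 0 and 1 is visible in the code.
mapped = toy.copy()
for col in binary_cols:
    mapped[col] = toy[col].map({"yes": 1, "no": 0})

# LabelEncoder gives the same result here because it sorts the labels
# ('no' < 'yes'); it is documented for target labels rather than features.
encoded = toy.copy()
for col in binary_cols:
    encoded[col] = LabelEncoder().fit_transform(toy[col])

print(mapped)
print(encoded)
```

The explicit mapping has one practical advantage: an unexpected value such as 'Yes' with a capital letter shows up as NaN instead of silently receiving its own integer code.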

5.3.3 Dummy Variables


The variable furnishingstatus has three levels (furnished, semi-furnished, unfurnished). These levels also need to be converted to numeric values, which we do with one 0/1 dummy column per level.

[9]: housing = pd.get_dummies(housing)


housing.head()

[9]: price area bedrooms bathrooms stories mainroad guestroom \


0 13300000 7420 4 2 3 1 0
1 12250000 8960 4 4 4 1 0
2 12250000 9960 3 2 2 1 0
3 12215000 7500 4 2 2 1 0
4 11410000 7420 4 1 2 1 1

basement hotwaterheating airconditioning parking prefarea \


0 0 0 1 2 1
1 0 0 1 3 0
2 1 0 0 2 1
3 1 0 1 3 1
4 1 0 1 2 0

furnishingstatus_furnished furnishingstatus_semi-furnished \
0 1 0
1 1 0
2 0 1
3 1 0
4 1 0

furnishingstatus_unfurnished
0 0
1 0
2 0
3 0
4 0
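
The three dummy columns always sum to 1, so one of them is redundant for a linear model that has an intercept. A minimal sketch, assuming a made-up three-row frame (`toy` and `dummies` are illustrative names), of how `drop_first=True` keeps only two indicator columns:

```python
import pandas as pd

# Hypothetical three-row frame; only the column names come from the lab.
toy = pd.DataFrame({
    "area": [7420, 9960, 3000],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# drop_first=True drops the first level in sorted order ('furnished'),
# leaving k - 1 = 2 indicator columns; dtype=int forces 0/1 integers.
dummies = pd.get_dummies(
    toy, columns=["furnishingstatus"], drop_first=True, dtype=int
)
print(dummies.columns.tolist())
print(dummies)
```

A row with 0 in both remaining columns then stands for the dropped level, 'furnished'.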

5.4 Step 4: Rescaling the Features


5.4.1 1. Min-Max scaling

[10]: scaler = MinMaxScaler()

[11]: cols = list(housing.columns)

for col in cols:
    housing[col] = housing[col].astype(float)
    housing[[col]] = scaler.fit_transform(housing[[col]])
housing.head()

[11]: price area bedrooms bathrooms stories mainroad guestroom \


0 1.000000 0.396564 0.6 0.333333 0.666667 1.0 0.0
1 0.909091 0.502405 0.6 1.000000 1.000000 1.0 0.0
2 0.909091 0.571134 0.4 0.333333 0.333333 1.0 0.0
3 0.906061 0.402062 0.6 0.333333 0.333333 1.0 0.0
4 0.836364 0.396564 0.6 0.000000 0.333333 1.0 1.0

basement hotwaterheating airconditioning parking prefarea \


0 0.0 0.0 1.0 0.666667 1.0
1 0.0 0.0 1.0 1.000000 0.0
2 1.0 0.0 0.0 0.666667 1.0
3 1.0 0.0 1.0 1.000000 1.0
4 1.0 0.0 1.0 0.666667 0.0

furnishingstatus_furnished furnishingstatus_semi-furnished \
0 1.0 0.0
1 1.0 0.0
2 0.0 1.0
3 1.0 0.0
4 1.0 0.0

furnishingstatus_unfurnished
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
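
Min-Max scaling maps each value x to (x - min) / (max - min), so the smallest value in a column becomes 0 and the largest becomes 1. A short sketch checks the formula by hand against MinMaxScaler, using five price values from the tables above. The array name `price` and the choice of rows are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Five price values taken from the tables above; they include the
# largest (13300000) and smallest (1750000) prices shown.
price = np.array([13300000, 12250000, 12215000, 11410000, 1750000], dtype=float)

# Min-Max formula applied by hand.
manual = (price - price.min()) / (price.max() - price.min())

# Same computation through scikit-learn (expects a 2-D array).
scaled = MinMaxScaler().fit_transform(price.reshape(-1, 1)).ravel()

print(np.round(manual, 6))
print(np.allclose(manual, scaled))
```

The hand-computed values agree with the scaled price column shown above (1.000000, 0.909091, 0.906061, 0.836364, 0.000000). MinMaxScaler scales each column independently, so a single `scaler.fit_transform(housing)` call on the whole frame should give the same numbers as the per-column loop.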

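5.4.2 2. Standardisation (mean 0, sigma 1)

[12]: sc = StandardScaler()
# fit_transform returns a new NumPy array; housing itself is left unchanged,
# so the frame displayed below still holds the Min-Max-scaled values.
housing_std = sc.fit_transform(housing)
housing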

[12]: price area bedrooms bathrooms stories mainroad guestroom \


0 1.000000 0.396564 0.6 0.333333 0.666667 1.0 0.0
1 0.909091 0.502405 0.6 1.000000 1.000000 1.0 0.0
2 0.909091 0.571134 0.4 0.333333 0.333333 1.0 0.0
3 0.906061 0.402062 0.6 0.333333 0.333333 1.0 0.0
4 0.836364 0.396564 0.6 0.000000 0.333333 1.0 1.0
.. … … … … … … …
540 0.006061 0.092784 0.2 0.000000 0.000000 1.0 0.0
541 0.001485 0.051546 0.4 0.000000 0.000000 0.0 0.0

542 0.000000 0.135395 0.2 0.000000 0.000000 1.0 0.0
543 0.000000 0.086598 0.4 0.000000 0.000000 0.0 0.0
544 0.000000 0.151203 0.4 0.000000 0.333333 1.0 0.0

basement hotwaterheating airconditioning parking prefarea \


0 0.0 0.0 1.0 0.666667 1.0
1 0.0 0.0 1.0 1.000000 0.0
2 1.0 0.0 0.0 0.666667 1.0
3 1.0 0.0 1.0 1.000000 1.0
4 1.0 0.0 1.0 0.666667 0.0
.. … … … … …
540 1.0 0.0 0.0 0.666667 0.0
541 0.0 0.0 0.0 0.000000 0.0
542 0.0 0.0 0.0 0.000000 0.0
543 0.0 0.0 0.0 0.000000 0.0
544 0.0 0.0 0.0 0.000000 0.0

furnishingstatus_furnished furnishingstatus_semi-furnished \
0 1.0 0.0
1 1.0 0.0
2 0.0 1.0
3 1.0 0.0
4 1.0 0.0
.. … …
540 0.0 0.0
541 0.0 1.0
542 0.0 0.0
543 1.0 0.0
544 0.0 0.0

furnishingstatus_unfurnished
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
.. …
540 1.0
541 0.0
542 1.0
543 0.0
544 1.0

[545 rows x 15 columns]
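
StandardScaler.fit_transform returns a NumPy array, not a DataFrame, so the column names are lost unless the result is wrapped again. A minimal sketch, assuming a small made-up numeric frame (`toy` and `toy_std` are illustrative names), that rebuilds the DataFrame and checks that each column ends up with mean 0 and standard deviation 1:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric frame; the values are illustrative only.
toy = pd.DataFrame({
    "area": [7420.0, 8960.0, 9960.0, 7500.0],
    "bedrooms": [4.0, 4.0, 3.0, 4.0],
})

# fit_transform returns a NumPy array, so rebuild a DataFrame to keep
# the column names and the index.
toy_std = pd.DataFrame(
    StandardScaler().fit_transform(toy),
    columns=toy.columns,
    index=toy.index,
)

# StandardScaler uses the population standard deviation (ddof=0).
print(toy_std.mean().round(6))
print(toy_std.std(ddof=0).round(6))
```

Note that pandas' `std()` defaults to `ddof=1`, which is why `ddof=0` is passed explicitly when checking the result.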

5.5 Step 5: Building a linear model

[16]: X=housing.drop('price',axis=1)
y=housing['price']

[17]: X_train,X_test,Y_train,Y_test=train_test_split(X,y,test_size=.3,random_state=2)

[18]: model = LinearRegression()


model.fit(X, y)

[18]: LinearRegression()

[24]: lin=LinearRegression()
lin.fit(X_train,Y_train)
ypred=lin.predict(X_test)

[26]: plt.scatter(Y_test,ypred)
plt.show()
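
The scatter plot of Y_test against ypred gives a visual check only. A minimal sketch of a numeric check with R^2 and mean squared error follows. It runs on synthetic data, not Housing.csv, so the arrays, coefficients and noise level are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y: y depends linearly on two features plus noise.
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 2))
y = 0.5 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 0.05, size=200)

# Same split proportions as in the lab: 70% train, 30% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2
)

# Fit on the training rows only, then score on the held-out rows.
lin = LinearRegression().fit(X_train, y_train)
y_pred = lin.predict(X_test)

r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R^2 = {r2:.3f}, MSE = {mse:.5f}")
```

On the housing data the same two metric calls could be applied to Y_test and ypred from the cells above.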
