Data Mining Lab 1: Identify a Dataset, Preprocess the Dataset Using Normalization Techniques
December 3, 2021
1 Data Mining
2 Lab_1
3 Vinay Sirohi
4 2139472
  hotwaterheating airconditioning  parking prefarea furnishingstatus
0              no             yes        2      yes        furnished
1              no             yes        3       no        furnished
2              no              no        2      yes   semi-furnished
3              no             yes        3      yes        furnished
4              no             yes        2       no        furnished
[3]: housing.shape
[4]: housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   price             545 non-null    int64
 1   area              545 non-null    int64
 2   bedrooms          545 non-null    int64
 3   bathrooms         545 non-null    int64
 4   stories           545 non-null    int64
 5   mainroad          545 non-null    object
 6   guestroom         545 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   545 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           545 non-null    int64
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB
[6]: sns.pairplot(housing)
plt.show()
5.3 Step 3: Data Preparation
5.3.1 Many columns in the dataset hold the values ‘yes’ or ‘no’.
5.3.2 To fit a regression line we need numerical values, not strings. Hence, we
convert these columns to 1s and 0s, where 1 means ‘yes’ and 0 means ‘no’.
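As a minimal sketch of this conversion, the example below maps the strings to integers with an explicit dictionary, which makes the meaning of 1 and 0 visible in the code. The `toy` frame and its values are made up for illustration and are not taken from the housing data.

```python
import pandas as pd

# Toy stand-in for the housing data; the values are invented for illustration.
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "guestroom": ["no", "no", "yes"],
})

binary_cols = ["mainroad", "guestroom"]

# Map each string to an integer explicitly: 'yes' -> 1, 'no' -> 0.
encoded = toy.copy()
for col in binary_cols:
    encoded[col] = encoded[col].map({"yes": 1, "no": 0})

print(encoded)
```

Any value other than 'yes' or 'no' would become NaN under this mapping, which can help surface unexpected entries.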
[7]: housing.columns
[8]: from sklearn.preprocessing import LabelEncoder

     labelencoder = LabelEncoder()
     # LabelEncoder sorts the labels, so 'no' becomes 0 and 'yes' becomes 1
     housing['mainroad'] = labelencoder.fit_transform(housing['mainroad'])
     housing['guestroom'] = labelencoder.fit_transform(housing['guestroom'])
     housing['basement'] = labelencoder.fit_transform(housing['basement'])
     housing['hotwaterheating'] = labelencoder.fit_transform(housing['hotwaterheating'])
     housing['airconditioning'] = labelencoder.fit_transform(housing['airconditioning'])
     housing['prefarea'] = labelencoder.fit_transform(housing['prefarea'])
     housing
furnishingstatus
0 furnished
1 furnished
2 semi-furnished
3 furnished
4 furnished
.. …
540 unfurnished
541 semi-furnished
542 unfurnished
543 furnished
544 unfurnished
furnishingstatus_furnished furnishingstatus_semi-furnished \
0 1 0
1 1 0
2 0 1
3 1 0
4 1 0
furnishingstatus_unfurnished
0 0
1 0
2 0
3 0
4 0
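The cell that produced the indicator columns above is not shown. One common way to obtain columns with names of this form is `pandas.get_dummies`; the sketch below assumes that approach and uses a made-up `toy` frame, so it may differ from what was actually run.

```python
import pandas as pd

# Toy column with the three furnishing categories; rows are invented.
toy = pd.DataFrame({
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished", "furnished"],
})

# One indicator column per category, named furnishingstatus_<category>.
dummies = pd.get_dummies(toy["furnishingstatus"], prefix="furnishingstatus", dtype=int)

# Replace the original string column with its indicator columns.
encoded = pd.concat([toy.drop(columns="furnishingstatus"), dummies], axis=1)

print(encoded)
```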
housing[col] = housing[col].astype(float)
housing[[col]] = scaler.fit_transform(housing[[col]])
housing.head()
furnishingstatus_furnished furnishingstatus_semi-furnished \
0 1.0 0.0
1 1.0 0.0
2 0.0 1.0
3 1.0 0.0
4 1.0 0.0
furnishingstatus_unfurnished
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
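The scaling lines above call `scaler.fit_transform` one column at a time, but the definition of `scaler` is not shown. Values confined to the range 0 to 1 are consistent with min-max normalization, x' = (x - min) / (max - min), so the sketch below assumes scikit-learn's `MinMaxScaler` and checks it against the formula on a made-up `toy` frame.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy numeric data; the values are invented for illustration.
toy = pd.DataFrame({
    "area": [1000.0, 2000.0, 4000.0],
    "bedrooms": [2.0, 3.0, 4.0],
})

# Assumption: min-max scaling, which rescales each column to [0, 1].
scaler = MinMaxScaler()
scaled = pd.DataFrame(scaler.fit_transform(toy), columns=toy.columns, index=toy.index)

# The same result computed directly from the formula, column by column.
manual = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

Scaling all columns in one call gives the same result as looping over the columns, because the scaler treats each column independently.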
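[12]: from sklearn.preprocessing import StandardScaler

      sc = StandardScaler()
      # fit_transform returns a new NumPy array; housing itself is left unchanged
      housing_std = sc.fit_transform(housing)
      housing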
542 0.000000 0.135395 0.2 0.000000 0.000000 1.0 0.0
543 0.000000 0.086598 0.4 0.000000 0.000000 0.0 0.0
544 0.000000 0.151203 0.4 0.000000 0.333333 1.0 0.0
furnishingstatus_furnished furnishingstatus_semi-furnished \
0 1.0 0.0
1 1.0 0.0
2 0.0 1.0
3 1.0 0.0
4 1.0 0.0
.. … …
540 0.0 0.0
541 0.0 1.0
542 0.0 0.0
543 1.0 0.0
544 0.0 0.0
furnishingstatus_unfurnished
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
.. …
540 1.0
541 0.0
542 1.0
543 0.0
544 1.0
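The displayed values still lie between 0 and 1 because `StandardScaler.fit_transform` returns a new NumPy array and does not modify the DataFrame passed to it. A minimal sketch, on a made-up `toy` frame, of how the standardized values could be kept as a DataFrame with the original column names:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numeric data; the values are invented for illustration.
toy = pd.DataFrame({
    "price": [100.0, 200.0, 300.0],
    "area": [10.0, 20.0, 60.0],
})

sc = StandardScaler()

# Wrap the returned array in a DataFrame so column names and index survive.
standardized = pd.DataFrame(sc.fit_transform(toy), columns=toy.columns, index=toy.index)

# Each column now has mean 0 and (population) standard deviation 1.
print(standardized.mean())
print(standardized.std(ddof=0))
```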
5.4.3 Step 5: Building a linear model
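[16]: from sklearn.linear_model import LinearRegression
      from sklearn.model_selection import train_test_split

      # Separate the features from the target column
      X = housing.drop('price', axis=1)
      y = housing['price']
[17]: # Hold out 30% of the rows for testing
      X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=2)
[18]: LinearRegression()
[24]: lin = LinearRegression()
      lin.fit(X_train, Y_train)
      ypred = lin.predict(X_test)
[26]: # Predicted vs. actual prices on the test set
      plt.scatter(Y_test, ypred)
      plt.show()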
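The scatter plot gives a visual check of the fit. As a hedged sketch of a numerical check, the example below computes R² and RMSE on a held-out split. It uses synthetic data generated on the spot, not the housing data, so the printed numbers say nothing about the model above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and target (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 3))
y = X @ np.array([0.5, 0.3, 0.2]) + rng.normal(0, 0.05, size=200)

# Same split settings as in the lab: 30% test data, fixed random state.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2)

lin = LinearRegression().fit(X_train, y_train)
y_pred = lin.predict(X_test)

# R^2: share of variance explained; RMSE: typical error in the target's units.
r2 = r2_score(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))

print(f"R^2 = {r2:.3f}, RMSE = {rmse:.3f}")
```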