
VEDIC VIDYASHRAM SENIOR SECONDARY SCHOOL

Madurai Road, Thachanallur, Tirunelveli - 627 358

INFORMATICS PRACTICES

PROJECT ON

HOUSE PRICE PREDICTION

Submitted in partial fulfillment of the requirements of the

Practicals of Senior Secondary (CBSE)

(2023 - 2024)

Submitted By : P.KAVIPRIYAN

Grade : XII C
VEDIC VIDYASHRAM SENIOR SECONDARY SCHOOL

Madurai Road, Thachanallur, Tirunelveli - 627 358

CERTIFICATE

This is to certify that the Project Work entitled “HOUSE PRICE
PREDICTION” is the bonafide record of work done by P.KAVIPRIYAN of
Grade XII, Exam No: , in partial fulfillment of the practical classes
of the 12th Standard during the Academic Year 2023-24.

He has taken proper care and shown utmost sincerity in completion of this
project as per the guidelines issued by CBSE.

DATE: INTERNAL EXAMINER

PRINCIPAL EXTERNAL EXAMINER


ACKNOWLEDGEMENT

● It is with a sense of gratitude that I acknowledge the efforts of the entire
host of well-wishers who have contributed in their own special ways to the
success and execution of this project.
● First of all, I express my heartfelt gratitude and indebtedness to my
school CORRESPONDENT, Mr. T. DURAISAMY, MCA, from the
bottom of my heart, for his unlimited support, motivation and
infrastructural aid rendered at all times.
● I would like to express my sincere thanks to my PRINCIPAL,
Mr. C P ENOSH, M.A., M.Phil., B.Ed., for all his substantial and valuable
guidance and moral support, which have helped me to complete this project
with undoubted success.
● I have been immeasurably enriched by working under the
expert supervision of my subject teacher, Mr. S.
SHUNMUGA SUNDARAM, M.E., A.M.I.E., who has the knack of
correcting and directing me in every situation. I convey my special
thanks to him.
● At last, I extend thanks with all my heart to the Teaching and Non-
Teaching staff who have assisted me constructively in my work.
DECLARATION

I hereby declare that the project work entitled HOUSE PRICE

PREDICTION, submitted to the DEPARTMENT OF INFORMATICS

PRACTICES, VEDIC VIDYASHRAM SENIOR SECONDARY

SCHOOL, is a result of my own work, and my indebtedness to other works,

publications and references, if any, has been duly acknowledged.

DATE: P.KAVIPRIYAN
CONTENTS

S.NO  TITLE

01.   ACKNOWLEDGEMENT
02.   DECLARATION
03.   PROBLEM DEFINITION
04.   PROJECT STAGES
05.   OBJECTIVES
06.   EXISTING AND PROPOSED SYSTEM
07.   HARDWARE AND SOFTWARE REQUIREMENT
08.   WORKING DESCRIPTION
09.   CODING
10.   OUTPUT SCREENS
11.   CONCLUSION
12.   REFERENCES
PROBLEM DEFINITION

• People looking to buy a new home tend to be conservative with their
budgets and market strategies. The existing system involves calculating
house prices without the necessary prediction of future market trends and
price increases. The goal of this project is to predict efficient house pricing
for real estate customers with respect to their budgets and priorities.

• By analyzing previous market trends and price ranges, and also upcoming
developments, future prices will be predicted. The functioning of this project
involves a website which accepts the customer's specifications and then
applies the multiple linear regression algorithm from data mining (a small
illustrative sketch is given below).

• This application will help customers invest in an estate without
approaching an agent. It also decreases the risk involved in the transaction.
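As a small illustrative sketch of such a model (the feature names and the tiny dataset below are hypothetical placeholders, not taken from this project):

# Hedged sketch of multiple linear regression on customer specifications.
# Feature names and the tiny dataset are hypothetical placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "area_sqft":  [800, 1000, 1200, 1500, 1800],
    "bedrooms":   [2, 3, 3, 4, 5],
    "age_years":  [10, 8, 5, 3, 1],
    "price_lakh": [60, 78, 90, 115, 140],
})

X = data[["area_sqft", "bedrooms", "age_years"]]
y = data["price_lakh"]

model = LinearRegression().fit(X, y)

# Predict a price for a new customer specification.
new_house = pd.DataFrame([{"area_sqft": 1100, "bedrooms": 3, "age_years": 6}])
print("predicted price (lakh):", model.predict(new_house)[0])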

PROJECT STAGES

The project consists of the following stages:

1. IMPORTING LIBRARIES AND DATASET

2. EXPLORING AND PREPROCESSING THE DATASET

3. MODEL IMPLEMENTATION

4. MODEL TESTING

A minimal skeleton illustrating these stages is sketched below.
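As an illustrative outline under stated assumptions (the function names, file name, and the numeric-only/zero-fill preprocessing are placeholders, not the report's actual code), the four stages can be organized as a small script:

# Illustrative skeleton of the four project stages; function and file names
# are hypothetical placeholders, not taken from the original report.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def import_data(path="train.csv"):           # Stage 1: importing libraries and dataset
    return pd.read_csv(path)

def preprocess(df):                          # Stage 2: exploring and preprocessing
    df = df.select_dtypes("number").fillna(0)  # simple numeric-only, zero-fill example
    return df.drop(columns=["SalePrice"]), df["SalePrice"]

def implement_model(X, y):                   # Stage 3: model implementation
    model = LinearRegression()
    model.fit(X, y)
    return model

def test_model(model, X, y):                 # Stage 4: model testing
    return model.score(X, y)                 # R^2 on held-out data

if __name__ == "__main__":
    X, y = preprocess(import_data())
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    print("R^2:", test_model(implement_model(X_tr, y_tr), X_te, y_te))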

OBJECTIVES

• Create a machine learning model using linear regression and the Boston
housing dataset, while following the machine learning workflow.

High-Level Approach:

● Exploring and analyzing the data used for making predictions

● Creating a simple model using linear regression

● Using the model to carry out predictions and evaluating its efficiency
(a hedged end-to-end sketch follows this list)
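A minimal end-to-end sketch of this workflow is shown below. Note that load_boston was removed from recent scikit-learn releases, so the snippet substitutes the California housing dataset purely as a stand-in; the steps are the same for the Boston data.

# Minimal linear-regression workflow sketch. The California housing data is a
# stand-in: scikit-learn removed load_boston in version 1.2.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)

# Explore: shapes and basic statistics of the data used for prediction.
print(X.shape, y.shape, y.mean(), y.std())

# Split, fit a simple linear model, and evaluate its efficiency.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_tr, y_tr)
pred = model.predict(X_te)
print("RMSE:", np.sqrt(mean_squared_error(y_te, pred)))
print("R^2 :", r2_score(y_te, pred))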

EXISTING AND PROPOSED SYSTEM

EXISTING SYSTEM:
• There are several approaches that can be used to determine the price of a
house; one of them is prediction analysis. The first approach is quantitative
prediction.

• A quantitative approach is an approach that utilizes time-series data [5]. The
time-series approach looks for the relationship between current prices and
previously prevailing prices. The second approach is to use linear regression based
on hedonic pricing. Previous research conducted by Gharehchopogh using a linear
regression approach obtained an error of 0.929 against the actual price. In linear
regression, the coefficients are generally determined using the least-squares
method, but it takes a long time to arrive at the best formula (a worked
least-squares example is sketched below).
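For concreteness, here is a small worked least-squares example via the normal equation; the data values are fabricated purely for illustration.

# Least-squares fit via the normal equation: beta = (X^T X)^(-1) X^T y.
# The tiny dataset below is fabricated purely for illustration.
import numpy as np

area  = np.array([800, 1000, 1200, 1500, 1800])   # sq. ft
rooms = np.array([2, 3, 3, 4, 5])
price = np.array([60, 78, 90, 115, 140])           # in lakhs, hypothetical

X = np.column_stack([np.ones_like(area), area, rooms])   # intercept + features
beta, *_ = np.linalg.lstsq(X, price, rcond=None)          # stable least squares
print("intercept, per-sqft, per-room:", beta)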

• Particle swarm optimization (PSO) has been proposed to find the coefficients,
aiming at an optimal result. Previous research, such as that of Marini and
Walczak, shows that PSO gets better results than other hybrid methods. PSO has
several advantages: in a small search space, it can perform a better solution
search. Although PSO's global search is less than optimal, for this optimization
problem the values of the variables in the regression equation can reach a good
solution using PSO (a toy PSO sketch is given below).

• The land prices are predicted with a new set of parameters using a different
technique. We also predicted the compensation for the settlement of the property.
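A toy PSO sketch for fitting regression coefficients is given below; the swarm size, inertia, and acceleration constants are illustrative choices, not values from the cited research.

# Toy particle swarm optimization (PSO) for linear-regression coefficients,
# minimizing the sum of squared errors. Data and PSO constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(500, 2000, 50)])  # intercept + area
true_beta = np.array([20.0, 0.06])
y = X @ true_beta + rng.normal(0, 2, 50)

def sse(beta):                        # objective: sum of squared errors
    r = y - X @ beta
    return r @ r

n, dim = 30, 2                        # 30 particles, 2 coefficients
pos = rng.uniform(-1, 1, (n, dim)) * np.array([50, 0.1])
vel = np.zeros((n, dim))
pbest = pos.copy()
pbest_val = np.array([sse(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

w, c1, c2 = 0.7, 1.5, 1.5             # inertia and acceleration constants
for _ in range(200):
    r1, r2 = rng.random((n, dim)), rng.random((n, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos += vel
    vals = np.array([sse(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("PSO estimate:", gbest, "vs least squares:",
      np.linalg.lstsq(X, y, rcond=None)[0])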

• Mathematical relationships help us to understand many aspects of everyday
life. When such relationships are expressed with exact numbers, we gain
additional clarity. Regression is concerned with specifying the relationship
between a single numeric dependent variable and one or more numeric
independent variables. House prices increase every year, so there is a need for a
system to predict house prices in the future. House price prediction can help the
developer determine the selling price of a house and can help the customer
arrange the right time to purchase a house.

PROPOSED SYSTEM:
• Nowadays, e-education and e-learning are highly influential, and everything is
shifting from manual to automated systems. The objective of this project is to
predict house prices so as to minimize the problems faced by the customer. The
present method is that the customer approaches a real estate agent to manage
his/her investments and suggest suitable estates. But this method is risky, as the
agent might suggest the wrong estates and thus lead to the loss of the customer's
investments.

• The manual method currently used in the market is outdated and carries high
risk. To overcome this fault, there is a need for an updated and automated
system. Data mining algorithms can be used to help investors invest in an
appropriate estate according to their stated requirements. The new system will
also be cost- and time-efficient, with simple operations. The proposed system
works on the linear regression algorithm.

HARDWARE AND SOFTWARE REQUIREMENT

LIBRARIES:

● NumPy
● Pandas
● scikit-learn (sklearn)
● Matplotlib (matplotlib.pyplot)
● Seaborn

SOFTWARE:

● PYTHON 3.7
● MYSQL 5.0

HARDWARE:

CPU: Intel Dual Core
RAM: 2 GB (minimum) - 4 GB (recommended)
Disk Storage: 500 GB
Operating System: Windows or Linux

WORKING DESCRIPTION

• The sequence diagram (not reproduced here) explains the working of the
system. The proposed system is supposed to be a website with three objects,
namely the Customer, the Web Interface, and the Database Server.

• The database server also includes the computational mechanism described in
the algorithm. When customers first enter the website, they are presented with a
GUI where they can enter inputs such as the type of house, the area in which it
is located, etc.

• A data index search then provides outputs consisting of matching properties.
Now, if the customer wants to check the house price in the future, they can enter
a future date. The system will identify the date and categorize it into quarters.
The algorithm will then compute the rate and return the results to the customer
(a small sketch of the date-to-quarter step follows).
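As an illustration of the date-to-quarter step (the function name and the quarterly growth factors are hypothetical placeholders, not from this report):

# Hypothetical sketch: map a future date to quarters and apply a projected
# growth rate per quarter. The rate table values are invented placeholders.
import pandas as pd

quarterly_growth = {1: 1.010, 2: 1.015, 3: 1.012, 4: 1.020}  # per-quarter factors

def predict_future_price(current_price, future_date):
    ts = pd.Timestamp(future_date)
    quarters_ahead = (ts.to_period("Q") - pd.Timestamp.now().to_period("Q")).n
    price = current_price
    for i in range(max(quarters_ahead, 0)):
        q = (pd.Timestamp.now().to_period("Q") + i + 1).quarter
        price *= quarterly_growth[q]   # compound the rate for each quarter
    return round(price, 2)

print(predict_future_price(50_00_000, "2025-11-15"))  # price in rupees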

CODING
# -*- coding: utf-8 -*-

"""house-price-prediction-top-14-xgboost.ipynb

Automatically generated by Colaboratory.

Original file is located at

https://colab.research.google.com/drive/16p1a388cb30t6r0sgf6w0tahiwqewtw-

My main objectives in this project are:

* Applying exploratory data analysis and trying to get some insights about our
dataset

* Getting the data into better shape by transforming and feature engineering, to
help us in building better models

* Building and tuning a couple of models to get some stable results on predicting
housing prices

"""

import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

"""# meeting the data

We’re going to start by loading the data and taking first look on it as usual. for
the column names we have great dictionary file in our dataset location so we can
get familiar with them in no time. I highly recommend looking at that before you
start working on the dataset.

pg. - 8 -
"""

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

from scipy import stats

df_train = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

df_test = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

df_train.head()

df_test.head()

df_train.shape

df_test.shape

"""As we can see that in train there are 1460 rows with 81 columns and in test
dataset 1459 rows with 80 columns. our dependent variable is **'saleprice'**"""

df_train.describe()

df_test.describe()

df_train.columns , df_test.columns

"""We have 1460 observations of 80 variables in the training dataframe. The


variables are described below:

saleprice - this is the target variable/dependent variable that you're trying to


predict.

* MSSubClass: the building class

* MSZoning: the general zoning classification

* LotFrontage: linear feet of street connected to property

* LotArea: lot size in square feet

* Street: type of road access

* Alley: type of alley access

* LotShape: general shape of property

* LandContour: flatness of the property

* Utilities: type of utilities available

* LotConfig: lot configuration

* LandSlope: slope of property

* Neighborhood: physical locations within Ames city limits

* Condition1: proximity to main road or railroad

* Condition2: proximity to main road or railroad (if a second is present)

* BldgType: type of dwelling

* HouseStyle: style of dwelling

* OverallQual: overall material and finish quality

* OverallCond: overall condition rating

* YearBuilt: original construction date

* YearRemodAdd: remodel date

* RoofStyle: type of roof

* RoofMatl: roof material

* Exterior1st: exterior covering on house

* Exterior2nd: exterior covering on house (if more than one material)

* MasVnrType: masonry veneer type

* MasVnrArea: masonry veneer area in square feet

* ExterQual: exterior material quality

* ExterCond: present condition of the material on the exterior

* Foundation: type of foundation

* BsmtQual: height of the basement

* BsmtCond: general condition of the basement

* BsmtExposure: walkout or garden level basement walls

* BsmtFinType1: quality of basement finished area

* BsmtFinSF1: type 1 finished square feet

* BsmtFinType2: quality of second finished area (if present)

* BsmtFinSF2: type 2 finished square feet

* BsmtUnfSF: unfinished square feet of basement area

* TotalBsmtSF: total square feet of basement area

* Heating: type of heating

* HeatingQC: heating quality and condition

* CentralAir: central air conditioning

* Electrical: electrical system

* 1stFlrSF: first floor square feet

* 2ndFlrSF: second floor square feet

* LowQualFinSF: low quality finished square feet (all floors)

* GrLivArea: above grade (ground) living area square feet

* BsmtFullBath: basement full bathrooms

* BsmtHalfBath: basement half bathrooms

* FullBath: full bathrooms above grade

* HalfBath: half baths above grade

* Bedroom: number of bedrooms above basement level

* Kitchen: number of kitchens

* KitchenQual: kitchen quality

* TotRmsAbvGrd: total rooms above grade (does not include bathrooms)

* Functional: home functionality rating

* Fireplaces: number of fireplaces

* FireplaceQu: fireplace quality

* GarageType: garage location

* GarageYrBlt: year garage was built

* GarageFinish: interior finish of the garage

* GarageCars: size of garage in car capacity

* GarageArea: size of garage in square feet

* GarageQual: garage quality

* GarageCond: garage condition

* PavedDrive: paved driveway

* WoodDeckSF: wood deck area in square feet

* OpenPorchSF: open porch area in square feet

* EnclosedPorch: enclosed porch area in square feet

* 3SsnPorch: three season porch area in square feet

* ScreenPorch: screen porch area in square feet

* PoolArea: pool area in square feet

* PoolQC: pool quality

* Fence: fence quality

* MiscFeature: miscellaneous feature not covered in other categories

* MiscVal: $ value of miscellaneous feature

* MoSold: month sold

* YrSold: year sold

* SaleType: type of sale

* SaleCondition: condition of sale"""

#correlation matrix

corrmat = df_train.corr(numeric_only=True)  # numeric_only is required on newer pandas

f, ax = plt.subplots(figsize=(15, 12))

sns.heatmap(corrmat, vmax=.8, square=True)

#SalePrice correlation matrix

k = 10  #number of variables for heatmap

cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index

cm = np.corrcoef(df_train[cols].values.T)

sns.set(font_scale=1.25)

plt.figure(figsize=(10,10))

hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)

plt.show()

"""**so what is log transformation:-log transformation is used to transform


skewed data to approximately conform to normality.**"""

'''#before log transformation

sns.distplot(df_train['SalePrice']);

fig_saleprice = plt.figure(figsize=(12,5))

result1 = stats.probplot(df_train['SalePrice'], plot=plt)'''

'''#applying log transformation

df_train['SalePrice'] = np.log(df_train['SalePrice'])'''

'''#after log transformation

sns.distplot(df_train['SalePrice']);

fig_saleprice2 = plt.figure(figsize=(12,5))

result3 = stats.probplot(df_train['SalePrice'], plot=plt)'''

"""below code is used to see top 10 highly correlated columns with saleprice in
which overallqual,grlivearea,garagecars,garagearea,totalbsmtsf and 1stflrsf are
highly correlated"""

#below code is used to see which column is more correlated to dependent


variable so first ten columns are more correlated compare to other columns

corr = df_train.corr()["saleprice"]

corr[np.argsort(corr, axis=0)[::-1]]

"""# **outliers**

We are going to plot first 10 highly correlated columns to see how many outliers
we have in our dataset

"""

#scatter plots of the most correlated columns against SalePrice
for col in ['GrLivArea', 'OverallQual', 'GarageCars', 'GarageArea', 'TotalBsmtSF',
            '1stFlrSF', 'FullBath', 'TotRmsAbvGrd', 'YearBuilt']:
    plt.subplots()
    plt.scatter(x=df_train[col], y=df_train['SalePrice'])
    plt.ylabel('SalePrice', fontsize=13)
    plt.xlabel(col, fontsize=13)
    plt.show()

'''#deleting outliers (the last condition presumably meant SalePrice, not 1stFlrSF)

df_train = df_train.drop(df_train[(df_train['GrLivArea']>4000) & (df_train['SalePrice']<300000)].index)

df_train = df_train.drop(df_train[(df_train['GarageArea']>1200) & (df_train['SalePrice']<500000)].index)

df_train = df_train.drop(df_train[(df_train['TotalBsmtSF']>3000) & (df_train['SalePrice']<700000)].index)

df_train = df_train.drop(df_train[(df_train['1stFlrSF']>2700) & (df_train['SalePrice']<700000)].index)'''

#scatterplot

sns.set()

columns = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', '1stFlrSF']

sns.pairplot(df_train[columns], height=3)  # 'size' was renamed to 'height' in newer seaborn

plt.show()

"""# some feature engineering

Here I have merged some columns to just reduce complexity I have tried with all
the columns but I didn't get this much accuracy which I am getting right now

"""

#feature engineering

df_train['totalsf'] = df_train['TotalBsmtSF'] + df_train['1stFlrSF'] + df_train['2ndFlrSF']

df_train = df_train.drop(columns=['1stFlrSF', '2ndFlrSF', 'TotalBsmtSF'])

df_train['wholeexterior'] = df_train['Exterior1st'] + df_train['Exterior2nd']

df_train = df_train.drop(columns=['Exterior1st', 'Exterior2nd'])

df_train['bsmt'] = df_train['BsmtFinSF1'] + df_train['BsmtFinSF2']

df_train = df_train.drop(columns=['BsmtFinSF1', 'BsmtFinSF2'])

df_train['totalbathroom'] = df_train['FullBath'] + df_train['HalfBath']

df_train = df_train.drop(columns=['FullBath', 'HalfBath'])

df_test['totalsf'] = df_test['TotalBsmtSF'] + df_test['1stFlrSF'] + df_test['2ndFlrSF']

df_test = df_test.drop(columns=['1stFlrSF', '2ndFlrSF', 'TotalBsmtSF'])

df_test['wholeexterior'] = df_test['Exterior1st'] + df_test['Exterior2nd']

df_test = df_test.drop(columns=['Exterior1st', 'Exterior2nd'])

df_test['bsmt'] = df_test['BsmtFinSF1'] + df_test['BsmtFinSF2']

df_test = df_test.drop(columns=['BsmtFinSF1', 'BsmtFinSF2'])

df_test['totalbathroom'] = df_test['FullBath'] + df_test['HalfBath']

df_test = df_test.drop(columns=['FullBath', 'HalfBath'])

"""**we're going to merge the datasets here before we start editing it so we don't
have to do these operations twice. Let’s call it features since it has features only.
so our data has 2919 observations and 79 features to begin with...**"""

frames = [df_train,df_test]

df = pd.concat(frames,keys=['train','test'])

"""there are 2919 observations with 76 columns. including the target variable
saleprice and id.the train set has 1460 observations while the test set has 1459
observations, the target variable saleprice is absent in test. the aim of this study is
to train a model on the train set and use it to predict the target saleprice of the test
set."""

df

df_missing = df.isnull().sum().sort_values(ascending=False)

df_missing

"""now we are separating categorical columns and numerical columns for filling
missing values"""

cat_col = df.select_dtypes(include=['object'])

cat_col.isnull().sum()

cat_col.columns

num_col = df.select_dtypes(include=['int64', 'float64'])

num_col.isnull().sum()

num_col.columns

"""In below cell you have your numerical columns so I just replace nan by 0. I
have also tried mode, median and mean but I got best result in 0.if you want to do
it then just fork my notebook and apply that functions. If you want that other
function's code then just comment below I will give you the code in comment
section.

# handling missing data

# Numerical columns

"""

# handling missing values of numerical columns

df['LotFrontage'] = df['LotFrontage'].fillna(value=0)

df['GarageYrBlt'] = df['GarageYrBlt'].fillna(value=0)

df['MasVnrArea'] = df['MasVnrArea'].fillna(value=0)

df['BsmtFullBath'] = df['BsmtFullBath'].fillna(value=0)

df['BsmtHalfBath'] = df['BsmtHalfBath'].fillna(value=0)

df['GarageArea'] = df['GarageArea'].fillna(value=0)

df['GarageCars'] = df['GarageCars'].fillna(value=0)

df['BsmtUnfSF'] = df['BsmtUnfSF'].fillna(value=0)

df['bsmt'] = df['bsmt'].fillna(value=0)        # engineered column, name kept lowercase

df['totalsf'] = df['totalsf'].fillna(value=0)  # engineered column, name kept lowercase

"""I have applied same technique as I applied in numerical columns where I put 0
and here i have replaced all the nan values with none. That means if the original
dataset have nan values, it means that the particular house is doesn't have that
thing. For example, if id no = 220 do not have garage then why we put values
that id no = 220 has a garage.

so i replaced them with none.

# Categorical columns

"""

# handling missing values of categorical columns

df['MSZoning'] = df['MSZoning'].fillna(value='none')

df['GarageQual'] = df['GarageQual'].fillna(value='none')

df['GarageCond'] = df['GarageCond'].fillna(value='none')

df['GarageFinish'] = df['GarageFinish'].fillna(value='none')

df['GarageType'] = df['GarageType'].fillna(value='none')

df['BsmtExposure'] = df['BsmtExposure'].fillna(value='none')

df['BsmtCond'] = df['BsmtCond'].fillna(value='none')

df['BsmtQual'] = df['BsmtQual'].fillna(value='none')

df['BsmtFinType2'] = df['BsmtFinType2'].fillna(value='none')

df['BsmtFinType1'] = df['BsmtFinType1'].fillna(value='none')

df['MasVnrType'] = df['MasVnrType'].fillna(value='none')

df['Utilities'] = df['Utilities'].fillna(value='none')

df['Functional'] = df['Functional'].fillna(value='none')

df['Electrical'] = df['Electrical'].fillna(value='none')

df['KitchenQual'] = df['KitchenQual'].fillna(value='none')

df['SaleType'] = df['SaleType'].fillna(value='none')

df['wholeexterior'] = df['wholeexterior'].fillna(value='none')  # engineered column

"""top 40 correlated columns after data preprocessing"""

#saleprice correlation matrix

k = 40 #number of variables for heatmap

cols = corrmat.nlargest(k, 'saleprice')['saleprice'].index

cm = np.corrcoef(df_main[cols].values.t)

sns.set(font_scale=1.25)

plt.figure(figsize=(10,10))

hm = sns.heatmap(cm, cbar=true, square=true, fmt='.2f', annot_kws={'size': 10},


yticklabels=cols.values, xticklabels=cols.values)

plt.show()

eid = df_main.loc['test']

df_test = df_main.loc['test']

df_train = df_main.loc['train']

eid = eid.Id

df_test = df_test.drop(['SalePrice', 'Id'], axis=1)

x_train = df_train.drop(['SalePrice', 'Id'], axis=1)

y_train = df_train['SalePrice']

from xgboost import XGBRegressor  # imported by name to avoid shadowing the module

xgb_model = XGBRegressor(learning_rate=0.05,
                         colsample_bytree=0.5,
                         subsample=0.8,
                         n_estimators=1000,
                         max_depth=5,
                         gamma=5)

xgb_model.fit(x_train, y_train)

y_pred = xgb_model.predict(df_test)

y_pred

#making the main csv file

main_submission = pd.DataFrame({'Id': eid, 'SalePrice': y_pred})

main_submission.to_csv("submission.csv", index=False)

main_submission.head()

OUTPUT SCREENS
***
#saleprice correlation matrix
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
plt.figure(figsize=(10,10))
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

OUTPUT:-

[Annotated correlation heatmap of the 10 variables most correlated with SalePrice:
SalePrice, OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF,
1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt]
***

#values of correlation
abs(df_train.corr()['SalePrice']).nlargest(10)
***

OUTPUT:-

SalePrice 1.000000
OverallQual 0.790982
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
1stFlrSF 0.605852
FullBath 0.560664
TotRmsAbvGrd 0.533723
YearBuilt 0.522897
Name: SalePrice, dtype: float64
***

#sum of missing data

df.isnull().sum().sort_values(ascending=False)

***
OUTPUT:-

SalePrice: 1459
MSZoning: 4
LotFrontage: 486
Alley: 2721
Utilities: 2
Exterior1st: 1
Exterior2nd: 1
MasVnrType: 24
MasVnrArea: 23
BsmtQual: 81
BsmtCond: 82
BsmtExposure: 82
BsmtFinType1: 79
BsmtFinSF1: 79
BsmtFinType2: 80
BsmtFinSF2: 1
BsmtUnfSF: 1
TotalBsmtSF: 1
Electrical: 1
BsmtFullBath: 2
BsmtHalfBath: 2
KitchenQual: 1
Functional: 2
FireplaceQu: 1420
GarageType: 157
GarageYrBlt: 159
GarageFinish: 159
GarageCars: 1
GarageArea: 1
GarageQual: 159
GarageCond: 159
PoolQC: 2909
Fence: 2348
MiscFeature: 2814
SaleType: 1
Length: 36, dtype: int64
#encoded

df_main = pd.get_dummies(df)
df_main.shape
***

OUTPUT:-

(2919, 339)

***

#rmse
from math import sqrt
from sklearn.metrics import mean_squared_error

#y_pred1..y_pred6 come from additional models (xgb, gbr, rf, lightgbm, svr,
#stacked) whose training code is not reproduced in this report
y_test = y_train.drop([10], axis=0)
print('xgb rmse:', sqrt(mean_squared_error(y_test, y_pred1)))
print('gbr rmse:', sqrt(mean_squared_error(y_test, y_pred2)))
print('rf rmse:', sqrt(mean_squared_error(y_test, y_pred3)))
print('lightgbm rmse:', sqrt(mean_squared_error(y_test, y_pred4)))
print('svr rmse:', sqrt(mean_squared_error(y_test, y_pred5)))
print('stacked rmse:', sqrt(mean_squared_error(y_test, y_pred6)))

***
OUTPUT:-

xgb rmse: 0.1223501568206363
gbr rmse: 0.5585375883105338
rf rmse: 0.43600854434323927
lightgbm rmse: 0.5596622356678556
svr rmse: 0.5246953605047906
stacked rmse: 0.5026308085477498
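Since the training code for those six models is not reproduced in the report, here is a hedged sketch of how such validation predictions could be produced, continuing from the variables above; the model choices and hyperparameters are assumptions, not the report's exact setup.

# Hypothetical reconstruction: fit several regressors on a train/validation
# split and collect per-model predictions. Models and settings are
# illustrative assumptions, not the report's exact configuration.
from math import sqrt
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

x_tr, x_val, y_tr, y_val = train_test_split(x_train, y_train,
                                            test_size=0.2, random_state=0)

models = {
    'xgb': XGBRegressor(n_estimators=500, learning_rate=0.05),
    'gbr': GradientBoostingRegressor(n_estimators=500),
    'rf': RandomForestRegressor(n_estimators=300),
}
for name, model in models.items():
    model.fit(x_tr, y_tr)
    pred = model.predict(x_val)
    print(name, 'rmse:', sqrt(mean_squared_error(y_val, pred)))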

CONCLUSION

• In today's real estate world, it has become tough to store such huge data and
extract it for one's own requirements, and the extracted data should be useful.
The system makes optimal use of the linear regression algorithm and uses such
data in the most efficient way. The linear regression algorithm helps satisfy
customers by increasing the accuracy of estate choice and reducing the risk of
investing in an estate.

• A lot of features could be added to make the system more widely acceptable.
One of the major future scopes is adding the estate databases of more cities,
which will allow the user to explore more estates and reach an accurate decision.
More factors that affect house prices, such as recession, shall be added. In-depth
details of every property will be added to provide ample details of a desired
estate. This will help the system to run on a larger level.

REFERENCES

• Wikipedia

• https://www.crio.do/

• https://www.geeksforgeeks.org/

• https://www.kaggle.com/

• https://www.github.com/

*********

