
House price prediction using gradient boosting and linear regression


Utkarsh Gupta
Computer Science and Engineering
Graphic Era Hill University
Dehradun, Uttarakhand
utkarshgupta2430@gmail.com

Himadri Vaidya
Computer Science and Engineering
Graphic Era Hill University
Dehradun, Uttarakhand
himadrivaidya8@gmail.com

Rahul Chauhan
Computer Science and Engineering
Graphic Era Hill University
Dehradun, Uttarakhand, India
chauhan14853@gmail.com

Chandradeep Bhatt
Computer Science and Engineering
Graphic Era Hill University
Dehradun, Uttarakhand
bhattchandradeep@gmail.com

Abstract: Everyone dreams of buying and living in a house of their own that suits their lifestyle and personality. The main features people look at in a house are its square footage, the number of bedrooms and bathrooms, its location, and the year it was built. This research helps buyers obtain an accurate price for a house that fits their budget and will not strain them financially. It also helps a person buy a house according to his or her requirements. There are many machine learning algorithms available, such as gradient boosting regression, linear regression, polynomial regression, and random forest regression. After studying each algorithm, the one with the best accuracy is chosen to predict the house price.

Keywords: ML algorithm, house price prediction model, regression techniques, gradient boosting regression, linear regression, polynomial regression

I. INTRODUCTION

Buying a house is one of the most important decisions a person makes in life. Everyone dreams of living in a house they like at a price within their budget. The price of a house depends on a variety of factors, such as the number of bedrooms and bathrooms, the location of the house, whether it is ready to move into, and the year it was built. Without data we cannot train a model, which is why data is called the heart of a machine learning model. So, we give certain information, such as the location of the house, the number of bedrooms and bathrooms, and other amenities, to our model so that it can predict the house price accurately.

A machine learning model predicts new values from the previous data given to it. Nowadays house prices are increasing day by day due to overpopulation. People who are unaware of house prices may struggle to get the house they desire. The dataset used in our model is divided into two parts, a training set and a testing set. Any percentage can be used for this split; for example, I have taken 70% of the data to train the model and the remaining 30% to test it. Many algorithms can be used to predict house prices, but here I have used the linear regression algorithm to carry out the task of predicting the house price accurately.

1. Collecting data: Gather a large dataset containing information about houses, such as the location of the house, the number of bedrooms and bathrooms, the year in which the house was built, etc.

2. Data cleaning: Clean the data by removing missing values, outliers, and duplicates.

3. Splitting of data: Split the data into two sets: (i) a training set, which is used to train the model, and (ii) a testing set, which is used to test the model's performance.

4. Train the model: Train a regression model using the training set. In this way the model learns to predict the price of a house from the given data.
5. Evaluating the model: Using the testing set of the data, we judge the model's performance.

6. Deploy the model: When the performance of the model is satisfactory, the model is ready to make predictions on new data.

7. Monitor the model: Keep monitoring the model's performance and update it periodically to ensure that it stays accurate.

Fig.1 Research Flow Diagram (blocks: Data Collection, Data Cleaning, Train-Test Splitting, Model Evaluation, Model Selection, Creating pipeline, Cross Validation, Testing the pipeline on test data, OUTPUT)

This model aims to explain how a real-world problem can be solved using various machine learning techniques, and how accurately predictions can be made using data-driven approaches. The model is also helpful for buyers and sellers, who can make better decisions, and it can improve the overall efficiency of the real estate market.

II. LITERATURE SURVEY

A lot of research has been done on predicting house prices. Using different methodologies, results and predictions of varying quality have been obtained. Bahia [1] studied house price prediction with the help of data mining techniques. He constructed a neural network model from two neural networks: the first was a feed-forward neural network and the second was a cascade-forward neural network. A related study was carried out by Madhuri, who used regression techniques including the LASSO algorithm and ridge regression. In her paper she used common parameters such as the square footage and the price. Another study, by Tang et al. [10], used ensemble learning; this method is described as a well-suited tool for predicting house prices. In that study the random forest algorithm gives lower performance than the ensemble algorithm.

Darshil Shah (2020) [2] showed in his paper that the models already available in the market rely on very old datasets. To solve this problem, he introduced an automated system for predicting house prices with better accuracy, using LightGBM, XGBoost, and random forest techniques to train and test the model.

G. Naga Satish [3] used linear regression, gradient boosting, and lasso regression techniques to predict house values. He selected predictable variables and presented the output as bar graphs, and he found that in his model a 5-bedroom house showed a lower price than a 3-bedroom house. He resolved this problem and built a project that can estimate the price of a house accurately.

P. Durganjali [4] built a model for resale house price prediction using classification algorithms. In her paper she used different algorithms such as random forest, decision tree, and linear regression. She took RMSE as the performance metric for the dataset and then applied these algorithms to get the most accurate result.

Patel and Upadhyay [5] discussed many pruning methods in their paper. ID3 splits on attributes based on entropy. They also discussed another algorithm that builds classification rules with the help of a decision tree. For testing they used the Weka interface.

Shinde and Gawande [6] used algorithms such as lasso, SVR, and logistic regression and then compared their accuracy. Alfiyatin et al. [7] made predictions using particle swarm optimization (PSO) and regression. Their paper reports that using PSO with regression improved the prediction accuracy.

Timothy C. Au [8] discussed the absent levels problem, which arises in decision trees and random forests, and has
explained how this affects the working of the price predictor.

III. METHODOLOGY

In this model, I have used linear regression and polynomial regression to build my project for predicting house prices. In this work, "House Price Prediction Using Machine Learning", I predict the price from a set of features. The model is trained on items such as the number of bedrooms and bathrooms, the area of the house in square feet, and the location of the house. I have taken a specific dataset to train and test my model: from the complete dataset, 70% is used for training the model and the remaining 30% for testing its accuracy. The raw data is stored in .csv format. For implementing the model I have used three libraries: 'numpy', which is used for splitting the dataset into training and testing sets; 'pandas', which is used for loading the '.csv' file in Jupyter Notebook; and 'scikit-learn', which provides various built-in functions that help solve various problems.

Fig.2 Proposed architecture of house price prediction model (blocks: Start, Housing Data Sets, Train data set, Test data set, Regression Algorithm, Build Model, House Price Prediction)

1. Data Collection- Collecting data is a mandatory part of making a machine able to predict prices. Training a machine learning project requires a large amount of data. A good dataset has well-described fields and attributes that can be used to re-label the dataset, and a good dataset gives better accuracy in predicting the price. I tried various datasets from Kaggle and chose the one that best suits my project's aim. After some searching, I found the dataset shown in Fig.3.

Fig.3 Data collected for houses

2. Data cleaning- Data is cleaned so that errors are removed from the dataset, which increases the value of the data. Wrangling tools are used to clean the data; they remove complex information so that the data can be used for predicting house prices.

Fig.4 Preprocessing steps (blocks: Raw data into understandable data, Data Cleaning, Data Transformation, Data Reduction)

3. Pre-processing of the dataset- Pre-processing means breaking the dataset into two parts, a training and a testing
module. The dataset also contains some non-numerical features, such as the house environment, the location, and whether the house is ready to move into. I use the one-hot encoder and label encoder functions from the scikit-learn library to convert the non-numerical features into numerical ones. The dataset also contains some empty values; to fill them I have used the mean of the column, via the simple imputer function, which is also part of scikit-learn.

4. Training the model- The model is trained on old data so that it can make predictions in the future. I have split the dataset into two parts, a training set and a testing set: 70% of the data is used for training and the remaining 30% for testing the model.

[Diagram: Input, ML Algorithm, Output]

5. Testing the model- Once the training process is finished, the model is tested with the testing dataset. After training and testing, the model gives predictions, with an accuracy score, for the processed dataset.

6. Simple Linear Regression- In this model a linear relationship is built between the dependent variable (B) and a single independent variable (A). The relationship is built by fitting a regression line between the dependent and independent variables. The equation of the line is:

$B = x + yA$    (1)

where 'x' and 'y' are the model parameters. When A is 0, B equals 'x', which is the intercept, and 'y' is the slope, which shows the change in B with A. If the value of 'y' is large, then a small change in A leads to a large change in B. The values of 'x' and 'y' are estimated by the least squares method. The predicted values are not always exact, so there will sometimes be a difference between the predicted and actual values; to account for it, an error term is added to equation (1), which helps to predict better values:

$B = x + yA + \varepsilon$    (2)

Some assumptions are made in simple linear regression:

1. The number of parameters must be smaller than the number of observations.

2. The error term $\varepsilon$ must have a mean value of 0 and is taken to be normally distributed.

Fig.5 Mean error value

7. Polynomial Regression- Polynomial regression is an extension of simple linear regression. In linear regression a straight regression line is fitted between the dependent and independent values, but in this case a straight line cannot be fitted between the target and the predictor values because there is no linear relationship. Here, a curve is fitted between the variables instead of a straight line.

This is done by adding polynomial terms of a chosen degree for the nonlinear data, which forms a curvilinear relationship. The equation of polynomial regression is:

$Y = x + y_1 A + y_2 A^2 + y_3 A^3 + \dots + y_n A^n$

where Y is the dependent variable and A is the independent variable.

Advantages of polynomial regression are:

1. This regression technique gives a close approximation of the relationship between the dependent variable and the independent variables.

2. A polynomial of sufficiently high degree can fit the dataset well.

3. By changing the degree, many different curves can be fitted with polynomial regression.

Disadvantages of polynomial regression are:

1. This algorithm is very sensitive to outliers; the presence of outliers increases the variance of the model.
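As an illustration of the two regression models described in this section, the following is a minimal sketch and not the code used in this work. It assumes NumPy and scikit-learn are installed and uses a small synthetic, hypothetical dataset (house area against price) in place of the housing data. The 70/30 split follows the text; the polynomial degree of 2, the random seed, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, hypothetical data: price grows non-linearly with area.
rng = np.random.default_rng(0)
area = rng.uniform(500, 3500, size=200)            # independent variable A
price = 50_000 + 20 * area + 0.03 * area**2        # curved relationship
price = price + rng.normal(0, 10_000, size=200)    # error term

A = area.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix

# 70% of the data for training and 30% for testing, as in the text.
A_train, A_test, y_train, y_test = train_test_split(
    A, price, test_size=0.3, random_state=0
)

# Simple linear regression: B = x + yA, fitted by least squares.
linear = LinearRegression().fit(A_train, y_train)

# Polynomial regression (degree 2 is an assumption): expand A into
# [1, A, A^2] and fit a linear model on the expanded features.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly.fit(A_train, y_train)

# score() returns the coefficient of determination (R^2) on the test set.
r2_linear = linear.score(A_test, y_test)
r2_poly = poly.score(A_test, y_test)
print(f"intercept x = {linear.intercept_:.1f}, slope y = {linear.coef_[0]:.2f}")
print(f"R^2 linear = {r2_linear:.3f}, R^2 polynomial = {r2_poly:.3f}")
```

Because the synthetic prices are generated with a squared term, the curve should fit the test data better than the straight line here; on real housing data the better model has to be determined by the same kind of held-out comparison.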
8. Data Analysis- Before proceeding further, we have to be sure that the data is accurate and ready to use in the model. To check this, I scanned my data based on some features. The result of this analysis is shown in Fig.6.
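A check of this kind could look like the following minimal sketch, which assumes pandas is installed. The sample rows and the column names (bedrooms, bathrooms, area_sqft, price) are hypothetical and are not taken from the dataset used in this work.

```python
import pandas as pd

# Hypothetical sample rows; the column names are assumptions for illustration.
houses = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 5, 3],
    "bathrooms": [1, 2, None, 3, 4, 2],
    "area_sqft": [850, 1200, 1350, 2000, 2600, 1200],
    "price": [95_000, 140_000, 150_000, 230_000, 310_000, 140_000],
})

# Count missing values per column to see what needs cleaning.
missing_per_column = houses.isna().sum()

# Count rows that exactly repeat an earlier row.
duplicate_rows = int(houses.duplicated().sum())

# Correlation of each feature with the price.
correlation_with_price = houses.corr()["price"].drop("price")

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(correlation_with_price.sort_values(ascending=False))
```

Features that correlate strongly with the price are natural candidates for the regression model, while the missing-value and duplicate counts indicate how much cleaning is still needed.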

Fig.6 Data analytics plot

IV. RESULTS

To implement house price prediction, we first import the required libraries for the project: NumPy, pandas, matplotlib.pyplot, and scikit-learn. After the libraries are imported, the data used to train the model, stored in .csv format, is loaded. Next, the data cleaning process removes the errors from the data so that it becomes clean, and then data preprocessing starts. After preprocessing, the model is implemented and evaluated. To evaluate the model I have used the train-test splitting method, in which the data is divided into two parts: 70% of the dataset is used for training and the remaining 30% is used to test the model. After this split, simple regression and polynomial regression are implemented in the model to check the accuracy.

Linear Regression gives an accuracy of 85.64%, whereas the Decision Tree gives an accuracy of 56.02%.

Fig.7 Accuracy plot

V. CONCLUSION

In conclusion, house prices today are increasing regularly due to population growth, and people face problems in getting a good house of their choice that meets their needs. This model helps people buy a house of their own choice within their budget. To know the accurate price of a house we need data such as the location of the house, the number of bedrooms and bathrooms, and whether the house is ready to move into. Often a buyer pays more for a house than its actual price; similarly, a seller often receives less than the original price, and because of this many real estate companies also face problems. The main purpose of the model is to solve the problem people face when they try to predict the house price. My project looks for useful models for predicting the value of a house. I collected a dataset for training and testing my project; first the data was changed into a clean dataset, and after that different models were used to achieve an optimal solution.

REFERENCES

[1] Bahia has made research on price prediction by using data mining techniques. "The idea that was proposed in his paper was that the two types of neural network constructed that will form a neural network model".
[2] Darshil Shah, Harshad Rajput and Jay Chheda, "House Price Prediction Using Machine Learning and RPA", Volume 7 Issue 3 (2020).
[3] G. Naga Satish, Ch. V. Raghavendran, M.D. Sugnana Rao, Ch. Srinivasulu, "House Price Prediction Using Machine Learning", Volume 8 Issue 9 (2019).
[4] Durganjali made a model of resale of house price prediction using classification algorithm.
[5] Gupta, R., Kabundi, A., & Miller, S. M. (2011). Using the US real house price index: Structural and non-structural models with and without fundamentals. Economic Modelling, 28(4), 2013-2021.
[6] Neelam Shinde and Kiran Gawande, “Done a
research of house prices using Predictive
Techniques”.
[7] Adyan Nur Alfiyatin and Hilman Taufiq,
“Researched on House Price Prediction using
Regression Analysis”, (IJACSA) International
Journal of Advanced Computer Science and
Applications.
[8] Timothy C. Au, “Done by using Random Forests,
Decision Trees, and Categorical Predictors: The
Absent Levels Problem”.
[9] D. Banerjee and S. Dutta, "Predicting the housing
price direction using machine learning techniques",
2017 IEEE International Conference on Power,
Control, Signals and Instrumentation Engineering
(ICPCSI), pp. 2998-3000, 2017.
[10] Y. Tang, S. Qiu and P. Gui, "Predicting Housing
Price Based on Ensemble Learning Algorithm",
2018 IEEE International Conference on Artificial
Intelligence and Data Processing (IDAP), pp. 1-5,
2018.
[11] T. D. Phan, "Housing Price Prediction Using
Machine Learning Algorithms: The Case of
Melbourne City, Australia", 2018 IEEE
International Conference on Machine Learning and
Data Engineering (iCMLDE), pp. 35-42, 2018.
[12] Y. Chen, R. Xue and Y. Zhang, "House price
prediction based on machine learning and deep
learning methods," 2021 International Conference
on Electronic Information Engineering and
Computer Science (EIECS), pp. 699-702, 2021
[13] J. J. Wang et al., "Predicting House Price With a
Memristor-Based Artificial Neural Network," in
IEEE Access, vol. 6, pp. 16523-16528, 2018.
[14] Tan F, Cheng C, Wei Z., "Time-Aware Latent
Hierarchical Model for Predicting House Prices",
In 2017 IEEE International Conference on Data
Mining (ICDM), pp. 1111-1116, 2017.
