
Contents:

• 1. Title of the project
• 2. Motivation
• 3. Introduction (problem statement)
• 4. Timeline
• 5. Base papers (research papers)
• 6. References
Title
House Prices Prediction
Motivation
• House prices are a perennially hot topic, and
many factors have to be taken into
consideration when studying them.
• Even though we seldom think about the type
and material of the roof or the height of the basement
ceiling, these details do affect the
final price of a house.
• That is why we find house price prediction
a genuinely complicated problem that is worth
further study. In this project, we would like to treat it
with methods from machine learning.
Introduction

• Data Source and Variables


Kaggle competition:
-- “House Prices: Advanced Regression Techniques”
-- Dataset prepared by Dean De Cock
Variables:
-- 79 variables present in the dataset
Variable named “SalePrice”
- Dependent variable
- Represents the final price at which the house was sold
Remaining 78 variables
- Represent different attributes of the house, such as area, car
parking, and number of fireplaces
Introduction (...ctd)
• Data Processing
Normalizing Response Variable
Training vs. validation split
– Training data – 75%
– Validation data – 25%
Data cleaning
Variable treatments
■ Missing value treatment:
– Continuous variables
– Character variables
■ Outlier treatment
Variable creations:
– Character variables were converted to indicators
– Based on the training data, character variables were grouped further and
new indicators were created
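The processing steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the slide does not say how the response is normalized, so the log transform (`log1p`) is an assumed choice, and the toy rows, the `RoofStyle`-like values, the seed and all function names are hypothetical.

```python
import math
import random

def log_transform(prices):
    """Compress a right-skewed response; log1p is an assumed choice, not the slide's."""
    return [math.log1p(p) for p in prices]

def train_validation_split(rows, train_fraction=0.75, seed=0):
    """Shuffle rows reproducibly and split them into training and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def to_indicators(values, levels):
    """Convert one character variable into 0/1 indicator columns, one per level."""
    return [{f"is_{level}": int(v == level) for level in levels} for v in values]

# Hypothetical toy rows: (sale price, roof style)
rows = [(208500, "Gable"), (181500, "Gable"), (223500, "Hip"), (140000, "Gable"),
        (250000, "Hip"), (143000, "Gable"), (307000, "Gable"), (200000, "Flat")]

train, valid = train_validation_split(rows)

# Levels are taken from the training data only, so a level that appears
# only in the validation data maps to all-zero indicators.
levels = sorted({roof for _, roof in train})
train_y = log_transform([price for price, _ in train])
train_x = to_indicators([roof for _, roof in train], levels)
valid_x = to_indicators([roof for _, roof in valid], levels)
```

Deriving the indicator levels from the training rows alone keeps information from the validation rows out of the feature definitions.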
Base papers
Publisher: IEEE
• SECTION I.
• Introduction
• The development of civilization drives a demand for houses that increases day by day.
Accurate prediction of house prices has always fascinated buyers, sellers and
bankers alike. Many researchers have already worked to unravel the mysteries of
house price prediction, and many theories have emerged as a
consequence of the research contributed by various researchers all over the
world. Some of these theories hold that the geographical location and culture of a
particular area determine how home prices will increase or decrease, whereas
other schools of thought emphasize the socio-economic conditions that largely lie
behind house price rises. A house price is a number from some
defined range, so predicting house prices is naturally a regression task. To
forecast a house price, a person usually tries to locate similar properties in his or her
neighborhood and, based on the collected data, tries to predict the price.
All of this indicates that house price prediction is an emerging research area of regression
that requires knowledge of machine learning. This has motivated us to work in this
domain.
Base Paper (ctd...)
• SECTION II.
• Related Work
• Researchers face two major challenges. The biggest challenge is to identify the optimum number of features
that will help to accurately predict the direction of house prices. Kahn [7] mentions that productivity growth in various
residential construction sectors does affect the growth of housing prices. The model that Kahn worked with shows how
housing prices can appear to follow a trend in which housing wealth rises faster than income for an extended period,
then collapses and experiences an extended decline.
• Lowrance [2] mentions in his doctoral thesis that he found interior living space to be the most influential factor determining
housing prices. He also cites the median income of the census tract that contains the house as a very
influential factor in determining house prices.
• Pardoe [1] uses floor size, lot size category, number of bathrooms, number of bedrooms, standardized age
and garage size as features and applies linear regression techniques to predict house prices.
• The second major challenge is to find the machine learning technique that is the most
effective at accurately predicting house prices. Ng and Deisenroth [4] construct a cell phone based application
using Gaussian processes for regression. Hu et al. [5] use the maximum information coefficient (MIC) to build accurate mathematical
models for predicting house prices. Limsombunchao [6] builds a model using features such as house size, house age, house type,
number of bedrooms, number of bathrooms, number of garages, amenities around the house and geographical location. His work
on house prices in New Zealand compared the accuracy of hedonic and artificial neural network models
and observed that neural networks perform better than hedonic models at accurately predicting
house prices. Bork and Moller [3] use time series based models for predicting house prices.
• The present work differs from all of these: instead of treating the problem as a regression task that tries to
predict a price for the house, it frames the problem as a classification task, i.e. predicting whether the price of the
house will increase or decrease.
Base Paper (ctd...)
• The complete work process can be divided into the following
four segments:
A. Feature definition
B. Feature selection
C. Application of machine learning techniques
D. Performance measurement procedures
• A. Feature Definition
The current work uses data from the web resource
Kaggle.com; the dataset comes from a
competition hosted on that site.
Base Paper (ctd...)
• B. Feature Selection
This work uses feature selection techniques such as the variance inflation factor, information
value and principal component analysis, together with data transformation techniques such as outlier and
missing value treatment and the Box-Cox transformation, for feature
selection and the subsequent transformation process.
These techniques are used in the following way:
• Information Value Computation: The Information Value of a predictor variable is a nonparametric
measure of the amount of information the predictor variable contains about the
target variable. In our work the target variable takes two values, 0 and 1, where 0
indicates a price decrease and 1 indicates a price increase. The information value is
computed for all the features, and the features with the largest information values are
selected as the most important features for further refinement. The Canopy Python tool is
used in this process.
• Data Transformation: The data transformation techniques are applied only to the features
that have been selected by the information value computation. The data
transformation process consists of outlier and null value removal followed by
the Box-Cox transformation. In the Box-Cox transformation process the original value
is transformed into square, inverse and exponential values.
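The Information Value computation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the predictor has already been binned into categories, uses the common weight-of-evidence form IV = Σ (p1 − p0) · ln(p1 / p0), and adds a hypothetical `smoothing` term so that empty cells do not cause a division by zero. The function name and the toy data are assumptions.

```python
import math
from collections import Counter

def information_value(feature_bins, target, smoothing=0.5):
    """Information Value of a binned predictor against a 0/1 target.

    For each bin, p1 and p0 are the shares of target==1 and target==0 rows
    that fall into the bin; IV sums (p1 - p0) * ln(p1 / p0) over all bins.
    The additive smoothing is an assumption to keep every share positive.
    """
    ones = Counter(b for b, t in zip(feature_bins, target) if t == 1)
    zeros = Counter(b for b, t in zip(feature_bins, target) if t == 0)
    bins = sorted(set(feature_bins))
    total1 = sum(ones.values()) + smoothing * len(bins)
    total0 = sum(zeros.values()) + smoothing * len(bins)
    iv = 0.0
    for b in bins:
        p1 = (ones[b] + smoothing) / total1
        p0 = (zeros[b] + smoothing) / total0
        iv += (p1 - p0) * math.log(p1 / p0)
    return iv

# Hypothetical toy data: 1 = price increased, 0 = price decreased
target = [1, 1, 1, 0, 0, 0, 1, 0]
size_bin = ["big", "big", "big", "small", "small", "small", "big", "small"]
noise_bin = ["a", "b", "a", "b", "a", "b", "b", "a"]

iv_size = information_value(size_bin, target)    # separates the classes well
iv_noise = information_value(noise_bin, target)  # same class mix in every bin
```

Features would then be ranked by their value and the largest ones kept; the cutoff is a modelling choice that the slide does not specify.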
Base Paper (ctd...)

• Principal Component Analysis
Principal component analysis (PCA) is applied to the features after the data
transformation process. The principal component analysis is performed with
the “pca” package in Python. This is done to ensure that there is no
multicollinearity in the feature set.
• Variance Inflation Factor
The Variance Inflation Factor (VIF) of a variable is a measure of the correlation of
that variable with the other variables. If the correlation between the variables is high,
the VIF will also be high. As a rule of thumb, we
try to keep a set of variables such that the VIFs of all
the variables are less than 1.5-2.0. We use the “statsmodels” package in Python to
compute the variance inflation factor. Now we find the following table that
contains the most important features with their respective Information Values.
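The VIF rule of thumb above can be illustrated with a small sketch. The general definition is VIF = 1 / (1 − R²), where R² comes from regressing one predictor on all the others. The code below covers only the special case of two predictors, where R² equals the squared Pearson correlation; it is not the statsmodels routine the paper uses, and the function names and toy columns are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x, other):
    """VIF = 1 / (1 - R^2); with a single other predictor, R^2 = r^2."""
    r = pearson_r(x, other)
    return 1.0 / (1.0 - r * r)

# Hypothetical toy columns
floor_area = [1.0, 2.0, 3.0, 4.0, 5.0]
living_area = [1.1, 1.9, 3.2, 3.9, 5.0]    # nearly a copy of floor_area
unrelated = [2.0, -1.0, -2.0, -1.0, 2.0]   # zero correlation with floor_area

vif_collinear = vif_two_predictors(floor_area, living_area)
vif_independent = vif_two_predictors(floor_area, unrelated)
```

With more than two predictors, R² has to come from a multiple regression; in statsmodels this is available as `variance_inflation_factor` in `statsmodels.stats.outliers_influence`.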
Base Paper (ctd...)
C. Machine Learning Technique
Once the features have been selected, we use three techniques: Support Vector Machines
(SVM), Random Forest and Artificial Neural Network (ANN).
• Support Vector Machines
Support vector machines are linear discriminant functions (classifiers); the one with the maximum
margin is considered the best. The margin is defined as the width by which the boundary could be widened
before hitting a data point.
• Random Forest
Random Forests are ensemble classifiers constructed from a set of Decision Trees, with
the output of the classifier being the mode of the outputs of the Decision Trees. Random
Forests combine the “bagging” idea of Breiman with the idea of random selection of features.
The algorithm for inducing a Random Forest was developed by Leo Breiman and Adele Cutler.
• Artificial Neural Network
Artificial neural networks use neurons, or perceptrons, as their basic units. A
perceptron takes a vector of real-valued inputs and computes a linear
combination of them. The output is 1 if the result is
greater than a threshold value and 0 otherwise.
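Two of the decision rules described above are simple enough to sketch directly: the threshold unit of a perceptron and the mode (majority) vote of a Random Forest. This is a sketch only; the weights, the threshold and the tree votes are hypothetical values chosen for illustration, and no training is shown.

```python
from collections import Counter

def perceptron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > threshold else 0

def forest_vote(tree_predictions):
    """Return the mode of the individual tree outputs (majority vote)."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical weights and threshold for a two-input unit
weights = [0.6, 0.4]
threshold = 0.5

fires = perceptron_output([1.0, 0.0], weights, threshold)   # 0.6 > 0.5
silent = perceptron_output([0.0, 1.0], weights, threshold)  # 0.4 <= 0.5

# Hypothetical outputs of five decision trees for one house
label = forest_vote([1, 0, 1, 1, 0])
```

In a trained network the weights and the threshold are learned from data, and in a trained forest each vote comes from a tree grown on a bootstrap sample with random feature selection.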
Base Paper (ctd...)
• D. Performance Measurement
This work uses the following metrics for performance measurement. They are
built from four outcome counts:
– true positives: the classifier predicts 1 when the target value is 1
– true negatives: the classifier predicts 0 when the target value is 0
– false positives: the classifier predicts 1 when the target value is 0
– false negatives: the classifier predicts 0 when the target value is 1
i) Accuracy: We define accuracy as
Accuracy = (tp + tn) * 100 / (tp + tn + fp + fn)    (1)
where tp = true positives, tn = true negatives, fp = false positives and fn = false negatives.
Base Paper (ctd...)
ii) Precision: We define precision as
Precision = (tp * 100) / (tp + fp)    (2)
where tp = true positives and fp = false positives.

iii) Specificity: We define specificity as
Specificity = (tn * 100) / (tn + fp)    (3)
where tn = true negatives and fp = false positives.

iv) Sensitivity: We define sensitivity as
Sensitivity = (tp * 100) / (tp + fn)    (4)
where tp = true positives and fn = false negatives.
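Equations (1) to (4) translate directly into code. The sketch below counts the four outcomes from paired label lists and returns all four metrics as percentages; the function names and the toy labels are hypothetical, and the code assumes every denominator is non-zero.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def classification_metrics(tp, tn, fp, fn):
    """Metrics from equations (1)-(4), in percent; assumes non-zero denominators."""
    return {
        "accuracy": (tp + tn) * 100 / (tp + tn + fp + fn),
        "precision": tp * 100 / (tp + fp),
        "specificity": tn * 100 / (tn + fp),
        "sensitivity": tp * 100 / (tp + fn),
    }

# Hypothetical labels: 1 = price increase, 0 = price decrease
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
metrics = classification_metrics(tp, tn, fp, fn)
```

For these toy labels the counts are tp = 3, tn = 5, fp = 1 and fn = 1, giving an accuracy of 80 and a precision and sensitivity of 75.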
