Internship Report (L.N.V.Kumar)


Real Estate Price Prediction with Machine Learning

A Practical Training Report


submitted
in partial fulfilment
for the award of the Degree of
Bachelor of Technology
in the Department of Computer Science and Engineering

Guide: Dr. Mohit Agarwal, Assistant Professor

Submitted by: L. Nagendra Venkata Kumar (Enrolment No. A20405220073)

Department of Computer Science and Engineering


Amity School of Engineering & Technology
Amity University Rajasthan, Jaipur
November 2020

CANDIDATE’S DECLARATION

I hereby declare that the work presented in this summer training report, entitled
“REAL ESTATE PRICE PREDICTION – MACHINE LEARNING”, submitted to the Department of
Computer Science and Engineering, Amity School of Engineering & Technology, Amity
University Rajasthan, in partial fulfilment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science and Engineering, is a record of my own
training prepared under the guidance of Dr. Mohit Agarwal.

Lankalapalli Nagendra Venkata Kumar


Computer Science and Engineering
Enrolment No.: A20405220073
Amity University Rajasthan, Jaipur

Counter Signed by
Dr. Mohit Agarwal

CERTIFICATE

ACKNOWLEDGEMENT
It is a great pleasure and a matter of immense satisfaction for me to express my deep and
profound gratitude towards all the people who helped and inspired me in this project work.

First, I would like to thank Dr. Mohit Agarwal for the efforts he took, right from the selection
of the project to its completion. He spent his precious time whenever I was in need of his
guidance.

I would also like to thank my parents for their well wishes towards completing this work, and
finally my friends for their love and support all the time.

Lankalapalli Nagendra Venkata Kumar

TABLE OF CONTENTS

Abstract
1. Organization Profile
2. Introduction
   2.1 What is Machine Learning
   2.2 Need of Machine Learning
   2.3 Future of Machine Learning
3. Machine Learning Process
4. Classes of Machine Learning
   4.1 Supervised Learning
   4.2 Unsupervised Learning
   4.3 Semi-supervised Learning
   4.4 Reinforcement Learning
5. Machine Learning Algorithms
   5.1 Supervised Learning Algorithms
      5.1.1 Classification
      5.1.2 Regression
   5.2 Unsupervised Learning Algorithms
      5.2.1 Clustering
6. Applications of Machine Learning
7. Pros and Cons of Machine Learning
   7.1 Advantages of ML
   7.2 Disadvantages of ML
8. Real Estate Price Prediction using ML Project
   8.1 Introduction
   8.2 Statement
   8.3 Project Description
   8.4 Methodology
      8.4.1 Description of Datasets
      8.4.2 Data Cleaning and Integration
      8.4.3 Detection of Outliers
   8.5 Libraries Used
      8.5.1 NumPy
      8.5.2 Pandas
      8.5.3 Matplotlib, Seaborn
      8.5.4 Scikit-learn
   8.6 Evaluation
      8.6.1 Root Mean Square Error
      8.6.2 Evaluation of Result
   8.7 Deployment of Model using Flask
   8.8 Implementation
      8.8.1 Software Requirements
      8.8.2 Hardware Requirements
      8.8.3 Installation
9. Coding and Result
10. Future Scope and Conclusion
Bibliography and References

LIST OF FIGURES

Fig 3.1  Machine Learning Process
Fig 4.1  Machine Learning Types and their Algorithms
Fig 4.2  Reinforcement Learning
Fig 5.1  Machine Learning Algorithms
Fig 8.1  Architecture Diagram
Fig 8.2  Heat Map
Fig 8.3  Flask Working
Fig 8.4  UI of the Project
Fig 9.1  Result of the Prediction
ABSTRACT
This project is done as a mini project for partial completion of the Degree Bachelor of
Technology in Department of Computer Science and Engineering offered by Amity University
Rajasthan.
Nowadays, large amounts of data are available everywhere. It is therefore very important to
analyse this data in order to extract useful information and to develop algorithms based
on this analysis. This can be achieved through data mining and Machine Learning. Machine
Learning is an integral part of Artificial Intelligence and is used to design algorithms based
on data trends and historical relationships between data.

Machine Learning is used in various fields such as bioinformatics, intrusion detection,
information retrieval, game playing, marketing, malware detection, image deconvolution and
so on. This report presents the work done by various authors in the field of Machine Learning
in various application areas, with a focus on real estate price prediction.

Real estate is one of the least transparent industries in our economy. Housing prices keep
changing day in and day out and are sometimes hyped rather than based on actual valuation.
Predicting housing prices from real factors is the main crux of our project. Here we aim to
base our evaluations on every basic parameter that is considered while determining the price.
We use various regression techniques, and our final result is not determined by a single
technique; rather, it is a weighted mean of several techniques, which gives more accurate
results. The results show that this approach yields lower error and higher accuracy than the
individual algorithms applied alone.

In this prediction model, a dataset of a certain area was used for prediction purposes. The raw
data was cleaned, pre-processed and then analysed using Python libraries and machine
learning techniques. The results obtained showed that the methods used can determine the
price of a house. The model can be applied to similar datasets to predict prices.

CHAPTER-1

ORGANIZATION PROFILE
1.1 About the Company

I did my training at IBM. IBM (International Business Machines) ranks among the
world's largest information technology companies, providing a wide spectrum of
hardware, software and services offerings.

IBM, frequently referred to as "Big Blue," got its start in hardware and prospered in that
business for decades, becoming the top supplier of mainframe computers. Over the years, the
company shifted its focus from hardware to software and services. By the 2010s, IBM further
modified its business mix to emphasize such fields as cloud-based services and cognitive
computing. IBM Watson, a cognitive system, has become the company's high-visibility
offering in the latter technology segment.

Products and Services provided by IBM are:

IBM has a large and diverse portfolio of products and services. As of 2016, these offerings
fall into categories such as cloud computing, artificial intelligence, machine learning and
many more.

IBM Cloud includes infrastructure as a service (IaaS), software as a service (SaaS)


and platform as a service (PaaS) offered through public, private and hybrid cloud delivery
models. For instance, the IBM Bluemix PaaS enables developers to quickly create complex
websites on a pay-as-you-go model. IBM SoftLayer is a dedicated server, managed
hosting and cloud computing provider, which in 2011 reported hosting more than 81,000
servers for more than 26,000 customers. IBM also provides Cloud Data Encryption Services
(ICDES), using cryptographic splitting to secure customer data.

IBM also hosts the industry-wide cloud computing and mobile technologies conference
InterConnect each year.

IT outsourcing also represents a major service provided by IBM, with more than 60 data
centers worldwide. alphaWorks is IBM's source for emerging software technologies,
and SPSS is a software package used for statistical analysis. IBM's Kenexa suite
provides employment and retention solutions, and includes BrassRing, an applicant tracking
system used by thousands of companies for recruiting. IBM also owns The Weather Company.

Smarter Planet is an initiative that seeks to achieve economic growth, near-term


efficiency, sustainable development, and societal progress, targeting opportunities such
as smart grids, water management systems, solutions to traffic congestion, and greener
buildings.

Service provisions include Redbooks, which are publicly available online books about best
practices with IBM products, and developerWorks, a website for software developers and IT
professionals with how-to articles and tutorials, as well as software downloads, code samples,
discussion forums, podcasts, blogs, wikis, and other resources for developers and technical
professionals.

IBM Watson is a technology platform that uses natural language processing and machine
learning to reveal insights from large amounts of unstructured data. Watson debuted in 2011
on the American game-show Jeopardy!, where it competed against champions Ken
Jennings and Brad Rutter in a three-game tournament and won. Watson has since been applied
to business, healthcare, developers, and universities. For example, IBM has partnered
with Memorial Sloan Kettering Cancer Center to assist with considering treatment options for
oncology patients. Also, several companies have begun using Watson for call centers, either
replacing or assisting customer service agents. In January 2019, IBM introduced its first
commercial quantum computer, IBM Q System One.

IBM also provides infrastructure for the New York City Police Department through their IBM
Cognos Analytics to perform data visualizations of CompStat crime data.

In March 2020 it was announced that IBM would build the first quantum computer in
Germany. The computer should allow researchers to harness the technology without falling
foul of the EU's increasingly assertive stance on data sovereignty.

CHAPTER-2

INTRODUCTION TO ML

2.1 What is Machine Learning?


Machine Learning is a subfield of Artificial Intelligence (AI). The goal of Machine Learning
generally is to understand the structure of data and fit that data into models that can be
understood and utilized by people.

Although Machine Learning is a field within computer science, it differs from traditional
computational approaches. In traditional computing, algorithms are sets of explicitly
programmed instructions used by computers to calculate or problem solve. Machine Learning
algorithms instead allow for computers to train on data inputs and use statistical analysis in
order to output values that fall within a specific range. Because of this, Machine Learning
facilitates computers in building models from sample data in order to automate decision-
making processes based on data inputs.

Any technology user today has benefitted from Machine Learning. Facial recognition
technology allows social media platforms to help users tag and share photos of friends. Optical
character recognition (OCR) technology converts images of text into movable type.
Recommendation engines, powered by Machine Learning, suggest what movies or television
shows to watch next based on user preferences. Self-driving cars that rely on Machine Learning
to navigate may soon be available to consumers.

Machine Learning is a continuously developing field. Because of this, there are some
considerations to keep in mind as you work with Machine Learning methodologies, or analyze
the impact of Machine Learning processes.

2.2 Need of Machine Learning

Machine Learning is needed for tasks that are too complex for humans to code directly. Some
tasks are so complex that it is impractical, if not impossible, for humans to work out all of the
nuances and code for them explicitly. So instead, we provide a large amount of data to a
Machine Learning algorithm and let the algorithm work it out by exploring that data and
searching for a model that will achieve what the programmers have set out to achieve. How
well the model performs is determined by a cost function provided by the programmer, and
the task of the algorithm is to find a model that minimises the cost function.
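To make the cost-function idea concrete, here is a minimal sketch, not taken from the training material itself, that fits y ≈ w·x by gradient descent on a mean-squared-error cost; the data, learning rate and iteration count are all invented for illustration:

```python
# Sketch: minimise the mean-squared-error cost for y = w * x by gradient descent.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated by the "true" model w = 2

w = 0.0      # initial guess
lr = 0.01    # learning rate (an arbitrary illustrative choice)

for _ in range(1000):
    # Gradient of the MSE cost with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad   # step downhill on the cost surface

print(round(w, 3))  # prints 2.0
```

The algorithm never sees the rule y = 2x; it only searches for the w that makes the cost smallest on the data it was given.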

2.3 Future of Machine Learning:


While Machine Learning algorithms have been around for decades, they've attained new
popularity as Artificial Intelligence (AI) has grown in prominence. Deep learning models in
particular power today's most advanced AI applications. Machine Learning platforms are
among enterprise technology's most competitive realms, with most major vendors, including
Amazon, Google, Microsoft, IBM and others, racing to sign customers up for platform services
that cover the spectrum of Machine Learning activities, including data collection, data
preparation, model building, training and application deployment. As Machine Learning
continues to increase in importance to business operations and AI becomes ever more practical
in enterprise settings, the Machine Learning platform wars will only intensify.

● ML will be an integral part of all AI systems, large or small.


● As ML assumes increased importance in business applications, there is a strong
possibility of this technology being offered as a Cloud-based service known as Machine
Learning-as-a-Service (MLaaS).
● Connected AI systems will enable ML algorithms to “continuously learn,” based on
newly emerging information on the internet.
● Machine Learning will help machines to make better sense of context and meaning of
data.
● Future advancement in “unsupervised ML algorithms” will lead to higher business
outcomes.
● Quantum Computing will greatly enhance the speed of execution of ML algorithms in
high-dimensional vector processing. This will be the next conquest in the field of ML
research.

CHAPTER-3

MACHINE LEARNING PROCESS


We can think of the Machine Learning process as the following steps:

● Formulating a Question
● Finding and Understanding the Data
● Cleaning the Data and Feature Engineering
● Choosing a Model
● Tuning and Evaluating
● Using the Model and Presenting Results.
The whole machine learning process is described below.

Fig-3.1 Machine Learning Process
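The stages above can be sketched in code. This is an illustrative sketch only, assuming scikit-learn is available; the toy dataset and the choice of a linear model stand in for a real question and real data:

```python
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# 1-2. Question and data: predict y from x (toy data generated from y = 3x + 1)
X = [[i] for i in range(20)]
y = [3 * i + 1 for i in range(20)]

# 3-4. Cleaning/feature engineering and model choice, bundled in a pipeline
model = Pipeline([("scale", StandardScaler()), ("reg", LinearRegression())])

# 5. Tuning and evaluating on held-out data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model.fit(X_tr, y_tr)
score = model.score(X_te, y_te)  # R^2 on the test split

# 6. Using the model to make a new prediction
pred = model.predict([[25]])
```

On a real project each stage is far more involved, but the skeleton of fit, evaluate and predict stays the same.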

CHAPTER-4

CLASSES OF MACHINE LEARNING


Types of Machine Learning:
There are some variations in how the types of Machine Learning algorithms are defined, but
they can commonly be divided into categories according to their purpose. The main
categories are the following:

● Supervised Learning.
● Unsupervised Learning.
● Semi-supervised Learning.
● Reinforcement Learning.

Types of Machine Learning and their respective algorithms are as shown below:

Fig- 4.1 Machine Learning types and their algorithms

4.1 Supervised Learning
● Supervised learning can be thought of as function approximation: we train an algorithm
and, at the end of the process, pick the function that best describes the input data, the one that
for a given X makes the best estimation of y (X -> y). Most of the time we cannot recover the
true function that always makes correct predictions, partly because the algorithm relies on
assumptions made by humans about how the computer should learn.
● Here the human experts act as the teacher where we feed the computer with training
data containing the input/predictors and we show it the correct answers (output) and from the
data the computer should be able to learn the patterns.
● Supervised learning algorithms try to model relationships and dependencies between
the target prediction output and the input features such that we can predict the output values
for new data based on those relationships which it learned from the previous data sets.

The main types of supervised learning problems include classification, regression and
forecasting.

4.1.1 Classification: In classification tasks, the machine learning program must draw a
conclusion from observed values and determine to what category new observations belong.
For example, when filtering emails as ‘spam’ or ‘not spam’, the program must look at existing
observational data and filter the emails accordingly.
4.1.2 Regression: In regression tasks, the machine learning program must estimate – and
understand – the relationships among variables. Regression analysis focuses on one dependent
variable and a series of other changing variables – making it particularly useful for prediction
and forecasting.
4.1.3 Forecasting: Forecasting is the process of making predictions about the future based
on the past and present data, and is commonly used to analyse trends.

4.1.4 List of Algorithms:
● K-Nearest Neighbour.
● Naive Bayes.
● Decision Trees.
● Linear Regression.
● Support Vector Machines (SVM).
● Neural Networks.
4.2 Unsupervised Learning
The computer is trained with unlabeled data.

● Here there is no teacher at all; in fact, the computer might be able to teach you new things
after it learns patterns in the data. These algorithms are particularly useful in cases where the
human expert does not know what to look for in the data.
● These algorithms are the family of Machine Learning algorithms mainly used in pattern
detection and descriptive modelling. There are no output categories or labels here based on
which the algorithm can try to model relationships.

The main types of unsupervised learning algorithms include Clustering algorithms and
Association rule learning algorithms.

4.2.1 Clustering: Clustering involves grouping sets of similar data (based on defined
criteria). It is useful for segmenting data into several groups and performing analysis on each
group to find patterns.

4.2.2 Association: Association rule learning is a rule-based Machine Learning method for
discovering interesting relations between variables in large databases.

4.3 Semi-supervised Learning
Semi-supervised learning is similar to supervised learning, but instead uses both labelled and
unlabelled data. Labelled data is essentially information that has meaningful tags so that the
algorithm can understand the data, whilst unlabelled data lacks that information. By using this
combination, Machine Learning algorithms can learn to label unlabelled data.

4.4 Reinforcement Learning
This method aims at using observations gathered from interaction with the environment to
take actions that would maximise the reward or minimise the risk. The reinforcement
learning algorithm (called the agent) continuously learns from the environment in an
iterative fashion. In the process, the agent learns from its experiences of the environment
until it has explored the full range of possible states.

There are many different algorithms that tackle this issue. In fact, Reinforcement Learning is
defined by a specific type of problem, and all its solutions are classed as Reinforcement
Learning algorithms. In this problem, an agent must decide the best action to select based on
its current state. When this step is repeated, the problem is known as a Markov Decision
Process.

Fig 4.2: Reinforcement Learning.

CHAPTER-5

MACHINE LEARNING ALGORITHMS


5.1 Supervised Learning Algorithms:
Supervised learning algorithms try to model relationships and dependencies between the target
prediction output and the input features such that we can predict the output values for new data
based on those relationships which it learned from the previous data sets.
5.1.1 Classification:
In classification tasks, the machine learning program must draw a conclusion from observed
values and determine to what category new observations belong.
5.1.1.1 Decision Tree:
A decision tree is a flow-chart-like tree structure that uses a branching method to illustrate
every possible outcome of a decision. Each node within the tree represents a test on a specific
variable – and each branch is the outcome of that test.
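As a hedged illustration, assuming scikit-learn, a decision tree classifier can be fit and queried in a few lines; the toy data below is invented:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data: one feature, two classes separable around x = 2.5
X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 1, 1, 1]

clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X, y)

# Each internal node is a test on the feature; each leaf carries a class label
pred = clf.predict([[1.5], [4.5]])
print(pred.tolist())  # prints [0, 1]
```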
5.1.1.2 Support Vector Machine:
Support Vector Machine algorithms are supervised learning models that analyse data used
for classification and regression analysis. They essentially filter data into categories, which
is achieved by providing a set of training examples, each marked as belonging to one of two
categories. The algorithm then builds a model that assigns new values to one category or the
other.
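A minimal sketch of this idea, again assuming scikit-learn; the two categories and their training examples are made up:

```python
from sklearn.svm import SVC

# Toy training examples, each marked as one of two categories
X = [[0, 0], [1, 1], [4, 4], [5, 5]]
y = [0, 0, 1, 1]

# A linear-kernel SVM finds the separating boundary with maximum margin
clf = SVC(kernel="linear")
clf.fit(X, y)

# New values are assigned to one category or the other
pred = clf.predict([[0.5, 0.5], [4.5, 4.5]])
```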
5.1.1.3 Naive Bayes Algorithm:
The Naïve Bayes classifier is based on Bayes’ theorem and classifies every value as
independent of any other value. It allows us to predict a class/category, based on a given set of
features, using probability.
Despite its simplicity, the classifier does surprisingly well and is often used because it can
outperform more sophisticated classification methods.
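An illustrative sketch with scikit-learn's Gaussian Naive Bayes; the features and classes are invented:

```python
from sklearn.naive_bayes import GaussianNB

# Toy data: two features, treated as independent given the class
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y = [0, 0, 0, 1, 1, 1]

clf = GaussianNB()
clf.fit(X, y)

# Predict a class for new points using Bayes' theorem on the learned statistics
pred = clf.predict([[1.5, 1.5], [8.5, 8.5]])
```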

5.1.1.4 k-Nearest Neighbour:
K-Nearest Neighbours is one of the most basic yet essential classification algorithms in
Machine Learning. It belongs to the supervised learning domain and finds intense application
in pattern recognition, data mining and intrusion detection.
It is widely used in real-life scenarios since it is non-parametric, meaning it does not
make any underlying assumptions about the distribution of the data (as opposed to other
algorithms such as GMM, which assume a Gaussian distribution of the given data).

We are given some prior data (also called training data), which classifies coordinates into
groups identified by an attribute.
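A brief sketch of k-nearest-neighbour classification, assuming scikit-learn; the "prior data" here is a made-up set of one-dimensional coordinates:

```python
from sklearn.neighbors import KNeighborsClassifier

# Prior (training) data: coordinates classified into groups by an attribute
X = [[0], [1], [2], [6], [7], [8]]
y = [0, 0, 0, 1, 1, 1]

# k = 3: a new point takes the majority label of its 3 nearest neighbours
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
pred = clf.predict([[1.2], [6.8]])
```

Note that kNN does no real "training"; it simply stores the data and measures distances at prediction time.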

5.1.1.5 Random Forest:


Random forests or ‘random decision forests’ is an ensemble learning method, combining
multiple algorithms to generate better results for classification, regression and other tasks. Each
individual classifier is weak, but when combined with others, can produce excellent results.
The algorithm starts with a ‘decision tree’ (a tree-like graph or model of decisions) and an input
is entered at the top. It then travels down the tree, with data being segmented into smaller and
smaller sets, based on specific variables.
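The ensemble idea can be sketched as follows, assuming scikit-learn; the dataset and the number of trees are arbitrary illustrative choices:

```python
from sklearn.ensemble import RandomForestClassifier

# Toy data: two well-separated classes
X = [[0], [1], [2], [3], [10], [11], [12], [13]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# An ensemble of decision trees; each tree alone is weak,
# but their combined (majority-vote) prediction is stronger
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
pred = clf.predict([[1.5], [11.5]])
```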
5.1.1.6 Logistic Regression:
Logistic regression focuses on estimating the probability of an event occurring based on the
previous data provided. It is used to cover a binary dependent variable that is where only two
values, 0 and 1, represent outcomes.
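A minimal sketch, assuming scikit-learn, of a binary (0/1) outcome modelled with logistic regression on invented data:

```python
from sklearn.linear_model import LogisticRegression

# Binary outcome (0 or 1) estimated as a probability from past data
X = [[0], [1], [2], [8], [9], [10]]
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)

proba = clf.predict_proba([[0.5]])[0]  # [P(class 0), P(class 1)]
pred = clf.predict([[9.5]])            # hard 0/1 decision
```

Unlike linear regression, the model outputs a probability squashed between 0 and 1, which is then thresholded into an outcome.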
5.1.2 Regression:
Regression analysis focuses on one dependent variable and a series of other changing variables
– making it particularly useful for prediction and forecasting.
5.1.2.1 Linear Regression:
Linear regression is a statistical approach for modelling the relationship between a dependent
variable with a given set of independent variables. Simple linear regression is an approach for
predicting a response using a single feature. It is assumed that the two variables are linearly
related. Hence, we try to find a linear function that predicts the response value(y) as accurately
as possible as a function of the feature or independent variable(x).
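The simple linear case can be illustrated as below, assuming scikit-learn; the toy data is generated from the line y = 2x + 1, so the fit should recover that slope and intercept:

```python
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1
X = [[1], [2], [3], [4]]
y = [3, 5, 7, 9]

model = LinearRegression()
model.fit(X, y)

slope, intercept = model.coef_[0], model.intercept_
pred = model.predict([[10]])  # expect about 2*10 + 1 = 21
```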

5.1.2.2 Multiple Linear Regression:
Multiple linear regression attempts to model the relationship between two or more features
and a response by fitting a linear equation to the observed data. The steps to perform multiple
linear regression are almost the same as for simple linear regression; the difference lies in the
evaluation. We can use it to find out which factor has the highest impact on the predicted
output and how the different variables relate to each other.
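A sketch with two features, assuming scikit-learn; the data is generated from a known linear equation so the fitted coefficients, which indicate each feature's impact, can be checked:

```python
from sklearn.linear_model import LinearRegression

# Two features; data generated from y = 3*x1 + 0.5*x2 + 2
X = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
y = [3 * a + 0.5 * b + 2 for a, b in X]

model = LinearRegression()
model.fit(X, y)

# Coefficient magnitudes indicate each feature's impact on the prediction:
# here the first feature matters far more than the second
coefs = model.coef_
```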
5.1.2.3 Polynomial Regression:
Polynomial Regression is a form of linear regression in which the relationship between the
independent variable x and dependent variable y is modelled as an nth degree polynomial.
Polynomial regression fits a nonlinear relationship between the value of x and the
corresponding conditional mean of y, denoted E(y |x).
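An illustrative sketch, assuming scikit-learn: the polynomial terms are generated as extra features, and the model remains linear in its coefficients:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Data from y = x^2; a degree-2 fit is nonlinear in x but linear in coefficients
X = np.array([[1], [2], [3], [4], [5]], dtype=float)
y = X[:, 0] ** 2

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)              # columns: [1, x, x^2]

model = LinearRegression().fit(X_poly, y)
pred = model.predict(poly.transform([[6.0]]))  # expect about 6^2 = 36
```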

5.2 Unsupervised Learning Algorithms:


These algorithms are the family of Machine Learning algorithms which are mainly used in
pattern detection and descriptive modelling. However, there are no output categories or labels
here based on which the algorithm can try to model relationships.

5.2.1 Clustering:
Clustering involves grouping sets of similar data.

5.2.1.1 K-means Clustering:


The K Means Clustering algorithm is a type of unsupervised learning, which is used to
categorize unlabelled data, i.e. data without defined categories or groups. The algorithm works
by finding groups within the data, with the number of groups represented by the variable K.
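A minimal sketch of K-means with scikit-learn; the unlabelled toy points form two visible groups:

```python
from sklearn.cluster import KMeans

# Unlabelled points that visibly form two groups; K chooses the group count
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Points in the same group share a label (which label is 0 or 1 is arbitrary)
same_group = bool(labels[0] == labels[1] == labels[2])
```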
5.2.1.2 Hierarchical Clustering:
In data mining and statistics, hierarchical clustering analysis is a method of cluster analysis
which seeks to build a hierarchy of clusters i.e. tree type structure based on the hierarchy.
Basically, there are two types of hierarchical cluster analysis strategies –

5.2.1.2.1 Agglomerative Clustering:


Also known as the bottom-up approach or hierarchical agglomerative clustering (HAC), it
produces a structure that is more informative than the unstructured set of clusters returned by
flat clustering. This clustering algorithm does not require us to pre-specify the number of
clusters. Bottom-up algorithms treat each data point as a singleton cluster at the outset and
then successively merge pairs of clusters until all clusters have been merged into a single
cluster that contains all the data.

5.2.1.2.2 Divisive Clustering:


Also known as the top-down approach. This algorithm also does not require us to pre-specify
the number of clusters. Top-down clustering requires a method for splitting a cluster that
contains the whole data, and proceeds by splitting clusters recursively until individual data
points have been split into singleton clusters.
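Both strategies are available in common libraries; as a hedged sketch, here is scikit-learn's bottom-up (agglomerative) variant on invented points:

```python
from sklearn.cluster import AgglomerativeClustering

# Bottom-up: each point starts as its own cluster, and pairs are merged
# successively until the requested number of clusters remains
X = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

agg = AgglomerativeClustering(n_clusters=2)
labels = agg.fit_predict(X)
```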

Fig 5.1: Machine Learning Algorithms

CHAPTER-6

APPLICATIONS OF MACHINE LEARNING
The following are some of the most common applications of Machine Learning.

1.Machine Learning Applications in Healthcare:


Manufacturing or discovering a new drug is an expensive and lengthy process as thousands of
compounds need to be subjected to a series of tests, and only a single one might result in a
usable drug. Machine learning can speed up one or more of these steps in this lengthy multi-
step process.

Imagine you walk in to visit your doctor with some kind of ache in your stomach. After
reviewing your symptoms, the doctor inputs them into a computer that extracts the latest
research the doctor might need on how to treat your ache. You have an MRI, and a computer
helps the radiologist detect problems that could be too small for the human eye to see. In the
end, a computer scans all your health records and family medical history and compares them
to the latest research to advise a treatment protocol that is particularly tailored to your
problem. Machine learning is all set to make a mark in personalized care.

2. Machine Learning Applications in Finance:


Wondering how banks know about their most valuable account holders? The secret is the
underlying machine learning algorithms, which confirm that the best customers are those
with large balances and loans.

Wells Fargo used machine learning to identify that a group of homemaker mums in Florida
with a huge social media presence were its most influential and preferred banking customers
in terms of referrals. The machine learning algorithm identified patterns that humans had
missed earlier, which helped Wells Fargo target those key customers.

3. Machine Learning Applications in Retail:
According to The Realities of Online Personalization report, 42% of retailers use
personalized product recommendations powered by machine learning technology. It is no
secret that customers look for personalized shopping experiences, and these
recommendations increase conversion rates for retailers, resulting in significant revenue.

● The moment you start browsing for items on Amazon, you see recommendations for
products you are interested in as “Customers Who Bought this Product Also Bought”
and “Customers who viewed this product also viewed”, as well specific tailored product
recommendations on the home page, and through email. Amazon uses Artificial Neural
Networks machine learning algorithms to generate these recommendations for you.
● To make smart personalized recommendations, Alibaba has developed an “Ecommerce
Brain” that makes use of real-time online data to build machine learning models for
predicting what customers want and recommending the relevant products based on their
recent order history, bookmarking, commenting, browsing history, and other actions.

4. Machine Learning Applications in Media:


Machine learning offers the most efficient means of engaging billions of social media users.
From personalizing news feeds to rendering targeted ads, machine learning is the heart of all
social media platforms, for their own and their users' benefit. Social media and chat
applications have advanced to such an extent that users no longer pick up the phone or use
email to communicate with brands; they leave a comment on Facebook or Instagram
expecting a speedier reply than through the traditional channels.

How does Facebook use Machine Learning?

Here are some machine learning examples that you are probably using and enjoying in your
social media accounts without knowing that these interesting features are machine learning
applications:

● Earlier, Facebook used to prompt users to tag their friends, but nowadays the social
network's artificial neural network algorithm identifies familiar faces from the contact list.
The ANN algorithm mimics the structure of the human brain to power facial recognition.

● The professional network LinkedIn knows where you should apply for your next job,
whom you should connect with and how your skills stack up against your peers as you
search for a new job.

CHAPTER-7

PROS AND CONS OF MACHINE LEARNING
7.1 Advantages of Machine Learning:

1. Easily Identifies trends and patterns:


Machine Learning can review large volumes of data and discover specific trends and patterns
that would not be apparent to humans. For instance, an e-commerce website like Amazon uses
it to understand the browsing behaviors and purchase histories of its users in order to serve
them the right products, deals, and reminders, and it uses the results to show them relevant
advertisements.
2. No Human Intervention Needed (Automation):
With ML, you don’t need to babysit your project every step of the way. Since it means giving
machines the ability to learn, it lets them make predictions and also improve the algorithms on
their own. A common example of this is antivirus software: it learns to filter new threats as
they are recognized. ML is also good at recognizing spam.
3. Continuous Improvement:
As ML algorithms gain experience, they keep improving in accuracy and efficiency. This lets
them make better decisions. Say you need to make a weather forecast model. As the amount of
data you have keeps growing, your algorithms learn to make more accurate predictions faster.
4. Handling Multidimensional and multi-variety data:
Machine Learning algorithms are good at handling data that are multidimensional and multi-
variety, and they can do this in dynamic or uncertain environments.
5. Wide Applications:
You could be an e-tailer or a healthcare provider and make ML work for you. Where it does
apply, it holds the capability to help deliver a much more personal experience to customers
while also targeting the right customers.

7.2 Disadvantages of Machine Learning:

For all its power and popularity, Machine Learning isn’t perfect. The following factors
serve to limit it:

1. Data Acquisition:
Machine Learning requires massive data sets to train on, and these should be
inclusive/unbiased and of good quality. There can also be times when you must wait for new
data to be generated.
2. Time and Resources:
ML needs enough time to let the algorithms learn and develop enough to fulfill their purpose
with a considerable amount of accuracy and relevancy. It also needs massive resources to
function, which can mean additional requirements of computing power for you.
3. Interpretation of Results:
Another major challenge is the ability to accurately interpret the results generated by the
algorithms. You must also carefully choose the algorithms for your purpose.
4. High Error-Susceptibility:
Machine Learning is autonomous but highly susceptible to errors. Suppose you train an
algorithm on a data set too small to be inclusive: you end up with biased predictions coming
from a biased training set, which leads to irrelevant advertisements being displayed to
customers. In the case of ML, such blunders can set off a chain of errors that can go undetected
for long periods of time. And when they do get noticed, it takes quite some time to recognize
the source of the issue, and even longer to correct it.

CHAPTER-8

REAL ESTATE PRICE PREDICTION
THROUGH MACHINE LEARNING
8.1 Introduction:
Housing prices are an important reflection of the economy, and housing price ranges are of
great interest for both buyers and sellers. In this project, house prices will be predicted given
explanatory variables that cover many aspects of residential houses. As continuous values,
house prices will be predicted with various regression techniques including Lasso, Ridge, SVM
regression, and Random Forest regression; as discrete price ranges, they will be predicted with
classification methods including Naive Bayes, logistic regression, SVM classification, and
Random Forest classification.

I also performed PCA to improve the prediction accuracy. The goal of this project is to create
a regression model and a classification model that are able to accurately estimate the price of
the house given the features.

8.2 Statement:
To design a machine learning model that effectively predicts the price of the house from the
given features.

8.3 Project Description:


To predict the price of the house we can use any supervised machine learning algorithm.
Random forest, logistic regression, decision tree, k-nearest neighbour and naïve Bayes all fall
under supervised machine learning. In this prediction model I used the random forest, decision
tree and logistic regression algorithms.

8.4 Methodology:

The diagram below describes the methodology used in predicting the prices of houses.

Fig 8.1 Architecture diagram

8.4.1 Description of datasets:

The real estate housing data used in this project is taken from the UCI machine learning
repository (the same dataset is distributed in the "ageron" hands-on machine learning
repository). The data is spread across 20,000 rows and has ten attributes. The description of
the data set is given below.

8.4.2 Data Cleaning and Integration:

Data cleaning is an iterative process; the first iteration focuses on detecting and correcting bad
records. The data taken from the repository has many inconsistencies and null values, and
before it is loaded into the machine learning models it should be corrected in order to achieve
high prediction accuracy. As I am using different tools for prediction, the cleaning process
differs from one tool to another, but the ultimate goal is the same: to gain more accuracy. The
real estate data has some missing information, namely null values in the total_bedrooms
column. I identified those rows and removed the null values using a Python program.
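As a sketch of this step, the incomplete rows can be located and dropped with pandas. The toy frame below stands in for the real dataset (in the project the frame is read from the dataset CSV instead); the column names follow the report.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; two of the four rows
# are missing total_bedrooms, as in the real dataset.
df = pd.DataFrame({
    "total_rooms":    [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

n_missing = df["total_bedrooms"].isnull().sum()              # count the null entries
clean = df.dropna(subset=["total_bedrooms"]).reset_index(drop=True)
```

Here the two incomplete rows are removed; an alternative (not used in this project) would be imputing the column median instead of dropping the rows.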

8.4.3 Detection of Outliers:

An outlier is an extremely high or extremely low value in the data. A value is flagged as an
outlier if it is greater than Q3 + 1.5×IQR or less than Q1 − 1.5×IQR. To find the interquartile
range (IQR), arrange the data in order from the lowest value to the highest, take the median of
the lower half (Q1) and the median of the upper half (Q3), and subtract: IQR = Q3 − Q1. The
upper fence is then Q3 + (1.5)(IQR) and the lower fence is Q1 − (1.5)(IQR). I calculated these
using Python.
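A minimal sketch of the fence computation in Python; the sample values are made up for illustration.

```python
import numpy as np

values = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14])

q1, q3 = np.percentile(values, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # interquartile range
lower = q1 - 1.5 * iqr                     # lower fence
upper = q3 + 1.5 * iqr                     # upper fence

# Any value outside the fences is treated as an outlier
outliers = values[(values < lower) | (values > upper)]
```

For this sample q1 = 12 and q3 = 14, so the fences are 9 and 17, and only the extreme value 102 is flagged.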

8.5 Libraries Used:

I have used many libraries to build the machine learning model and predict the price of the
house.

8.5.1 Numpy:

NumPy, which stands for Numerical Python, is a library consisting of multidimensional array
objects and a collection of routines for processing those arrays. Using NumPy, mathematical
and logical operations on arrays can be performed.NumPy aims to provide an array object that
is up to 50x faster that traditional Python lists.The array object in NumPy is called ndarray, it
provides a lot of supporting functions that make working with ndarray very easy.Arrays are
very frequently used in data science, where speed and resources are very important.In my
project I have used several functionalities of numpy library like slicing, reshaping , searching
, sorting and filtering of data.
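The operations mentioned above can be sketched in a few lines (the values are illustrative only):

```python
import numpy as np

a = np.arange(12)                   # [0, 1, ..., 11]
m = a.reshape(3, 4)                 # reshaping into a 3x4 matrix
col = m[:, 1]                       # slicing: second column of every row
evens = a[a % 2 == 0]               # filtering with a boolean mask
idx = np.where(a == 7)[0]           # searching for a value's position
srt = np.sort(np.array([3, 1, 2]))  # sorting
```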

8.5.2 Pandas:

Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-
use data structures and data analysis tools for the Python programming language. Python with
Pandas is used in a wide range of academic and commercial domains including finance,
economics, statistics, analytics, etc. I used this library in my project to import and read the data
set, and also to manipulate the data, split and join the data, and handle the missing data.

8.5.3 Matplotlib and seaborn:

Matplotlib is one of the most popular Python packages used for data visualization. It is a cross-
platform library for making 2D plots from data in arrays. Seaborn is a library that uses
Matplotlib underneath to plot graphs and visualize statistical distributions. I used both libraries
to visualize the data and find the correlation between the columns.
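A minimal sketch of such a correlation heatmap; the three columns and their values are illustrative, and heatmap.png is a hypothetical output path.

```python
import matplotlib
matplotlib.use("Agg")            # headless backend so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "total_rooms":    [880, 7099, 1467, 1274, 1627],
    "total_bedrooms": [129, 1106, 190, 235, 280],
    "population":     [322, 2401, 496, 558, 565],
})

corr = df.corr()                                # pairwise column correlations
sns.heatmap(corr, annot=True, cmap="coolwarm")  # draw the matrix as a heatmap
plt.savefig("heatmap.png")
```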

Fig 8.2 Heatmap

8.5.4 Scikit-learn:

Scikit-learn (sklearn) is the most useful and robust library for machine learning in Python. It
provides a selection of efficient tools for machine learning and statistical modeling including
classification, regression, clustering and dimensionality reduction via a consistent interface in
Python. I used the decision tree regressor, random forest regressor and logistic regression;
among these, the random forest regressor gave the highest accuracy.

The random forest algorithm can be used for both classification and regression; in the
regression setting it is also called a regression forest. It works by building many decision trees,
each on a random selection of the data and a random selection of the variables, and combining
the outputs of the individual trees into a single prediction. The main advantages of applying
this algorithm to my dataset are that it handles missing values while maintaining accuracy, the
chance of overfitting the model is low, and it copes well with high dimensionality when applied
to large datasets. In regression trees, the outcome is continuous. The data was split into training
and test sets, with 75 percent of the data used for training and the remaining 25 percent for
testing; since I am using random forest, the number of trees used is 200.
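A sketch of this training setup with scikit-learn. The 75/25 split and the 200-tree forest follow the report; the synthetic regression dataset stands in for the housing data, so everything else is illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and prices
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# 75/25 train-test split, then a 200-tree forest as described above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)

r2 = model.score(X_te, y_te)   # R^2 on the held-out 25 percent
```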

8.6 Evaluation:

The idea of regression is to predict a real value, i.e., a number. For a regression model we can
compute several evaluation metrics; the most common one is explained below.

8.6.1 Root Mean Square Error:

RMSE is a popular formula for measuring the error rate of a regression model; however, it can
only be compared between models whose errors are measured in the same units. It is computed
with the formula

RMSE = sqrt( (1/n) × Σ (Pᵢ − Oᵢ)² )

where n is the number of instances in the data, Pᵢ is the predicted value for the i-th instance
and Oᵢ is the actual value. The key idea is that the actual value is subtracted from the predicted
value, each difference is squared, the squares are summed over all instances, the sum is divided
by the number of instances, and the square root of the result is the RMSE. As discussed, these
quantities are used to calculate the error value and help determine how well the algorithm can
predict future prices; the table below lists the root mean square error for the various algorithms.
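The formula translates directly into a few lines of NumPy; the four price pairs below are invented for illustration.

```python
import numpy as np

actual    = np.array([200000.0, 150000.0, 320000.0, 275000.0])   # O, observed prices
predicted = np.array([210000.0, 140000.0, 310000.0, 280000.0])   # P, model outputs

# RMSE = sqrt( sum((P_i - O_i)^2) / n )
rmse = np.sqrt(np.mean((predicted - actual) ** 2))
```

scikit-learn's mean_squared_error computes the quantity under the root, so np.sqrt(mean_squared_error(actual, predicted)) gives the same value.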

8.6.2 Evaluation of results:

8.7 Deployment of model using flask:

The diagram below describes how a model is deployed using the Flask framework.

Fig 8.3 Flask Working

The final data obtained is subjected to a machine learning model. We will mainly use K-fold
cross validation and the GridSearchCV technique to perform hyperparameter tuning and obtain
the best algorithm and parameters for the model. It turns out that the random forest regressor
gives the best results for our data, with a score above 80%, which is not bad.
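A sketch of that tuning step. The parameter grid is hypothetical (the report does not list the exact values tried), and a synthetic dataset stands in for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

# GridSearchCV tries every combination in the grid, scores each one
# with 5-fold cross validation, and keeps the best.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_   # mean cross-validated R^2 of the winner
```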

Now, our model needs to be exported into a pickle file, which serializes Python objects into a
byte stream. Also, we need to export the locations (columns) into a JSON file to be able to
interact with it from the frontend.
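A sketch of the export step; the file names mirror those used later in the report, while the tiny linear model and the column list are placeholders for the real tuned model and dataset columns.

```python
import json
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder model standing in for the tuned random forest (fits y = 2x)
model = LinearRegression().fit(np.array([[1.0], [2.0], [3.0]]),
                               np.array([2.0, 4.0, 6.0]))

# Serialize the fitted model to a byte stream on disk ...
with open("price.pkl", "wb") as f:
    pickle.dump(model, f)

# ... and the column names to JSON for the frontend
with open("columns.json", "w") as f:
    json.dump({"data_columns": ["longitude", "latitude", "total_rooms"]}, f)

# Reload both files to verify the round trip
with open("price.pkl", "rb") as f:
    restored = pickle.load(f)
with open("columns.json") as f:
    columns = json.load(f)["data_columns"]

pred = restored.predict(np.array([[4.0]]))[0]
```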

We will use a Flask server as our backend to host the application locally. In the server folder we
will set up two files:

• server.py

The server.py file will be responsible for handling the routes for fetching the location names and
predicting the house price. It also gets the form data from the front end and feeds it to the util.py.

• util.py

The util.py file is the main brain behind the back end. It has functions to load the JSON and
pickle files. This file takes the form data from server.py and uses the model to predict the
estimated price of the property.

The front end is made up of simple HTML, CSS and JavaScript. The user can enter the latitude,
longitude, total_rooms, total_bedrooms, population, housing age and ocean proximity, and hit
the ‘PREDICT’ button to get the estimated price. The JavaScript file is responsible for
interacting with both the backend Flask server routes and the frontend HTML. It gets the form
data filled in by the user, calls the function that uses the prediction model, and renders the
estimated price in lakhs of rupees.

Fig 8.4 UI of the project

8.8 Implementation:

The project was carried out on a computer system which requires certain hardware and
software specifications. They are given below.

8.8.1 Software Requirements:

The software requirements for this project are the Python programming language, the Flask
framework, Jupyter Notebook, Anaconda, and an operating system.

8.8.2 Hardware Requirements:

The hardware requirements for this project are a 2 GHz Intel processor, 180 GB HDD and 2 GB RAM.

8.8.3 Installation:

We need to install Anaconda to use Jupyter Notebook. Then I installed numpy, pandas,
matplotlib, scikit-learn and flask using the pip command (pickle ships with Python’s standard
library).

CHAPTER-9

CODING AND RESULT

FLASK CODING:

app.py

from flask import Flask, render_template, request
import pickle
from sklearn.ensemble import RandomForestRegressor  # imported so pickle can restore the model

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    if request.method == 'POST':
        longitude = request.form['longitude']
        latitude = request.form['latitude']
        housing_median_age = request.form['housing_median_age']
        total_rooms = request.form['total_rooms']
        total_bedrooms = request.form['total_bedrooms']
        population = request.form['population']
        households = request.form['households']
        median_income = request.form['median_income']
        ocean_proximity = request.form['ocean_proximity']  # assumed to be numerically encoded

        # Assemble the form fields into the single-row feature matrix
        # expected by the trained model
        data = [[float(longitude), float(latitude), float(housing_median_age),
                 float(total_rooms), float(total_bedrooms), float(population),
                 float(households), float(median_income), float(ocean_proximity)]]

        model = pickle.load(open('price.pkl', 'rb'))
        prediction = model.predict(data)[0]
        return render_template('index.html', prediction=prediction)

if __name__ == '__main__':
    app.run()

index.html

<!doctype html>
<html lang="en">
<head>
    <!-- Required meta tags -->
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <title>App</title>
    <style>
        h2 {
            color: darkorange;
            font-size: 55px;
        }
        h3 {
            color: deeppink;
            font-size: 55px;
        }
        button {
            background-color: green;
        }
        body {
            background-repeat: no-repeat;
            background-attachment: fixed;
            background-size: cover;
            background-image: url("{{ url_for('static', filename='images/2.jpg') }}");
        }
    </style>
</head>
<body>
<center>
    <div class="login">
        <h1>Real Estate Price Prediction</h1>
        <form action="/predict" method="POST">
            <input type="text" name="longitude" placeholder="longitude"><br><br>
            <input type="text" name="latitude" placeholder="latitude"><br><br>
            <input type="text" name="housing_median_age" placeholder="housing_median_age"><br><br>
            <input type="text" name="total_rooms" placeholder="total_rooms"><br><br>
            <input type="text" name="total_bedrooms" placeholder="total_bedrooms"><br><br>
            <input type="text" name="population" placeholder="population"><br><br>
            <input type="text" name="households" placeholder="households"><br><br>
            <input type="text" name="median_income" placeholder="median_income"><br><br>
            <input type="text" name="ocean_proximity" placeholder="ocean_proximity"><br><br>
            <button class="btn btn-success">Predict</button>
        </form>
        <div class="row">
            {% if prediction %}
            <h2>Price</h2><h3 class="text-warning">{{ prediction }}</h3>
            {% endif %}
        </div>
    </div>
</center>
</body>
</html>

Fig 9.1 Result of the prediction

CHAPTER-10

FUTURE SCOPE AND CONCLUSION


The main goal of this project was to predict house prices, which we have successfully done
using different machine learning algorithms: random forest, decision tree regressor, linear
regression and logistic regression. It is clear that random forest has higher prediction accuracy
than the others, and this work also identifies each attribute's contribution to the prediction. I
believe this project will be helpful for both people and governments; future work is stated
below.

There is always room for improvement in any software, however efficient the system may
be. The important thing is that the system should be flexible enough for future modifications.
The system has been factored into different modules to let it adapt to further changes. Every
effort has been made to cover all user requirements and make it user friendly.

❑ Goal achieved: The system is able to provide an interface through which the user can
get the desired data.
❑ User friendliness: Though most of the system is supposed to act in the background,
efforts have been made to make the foreground interaction with the user as smooth as
possible.
During our training we learned about machine learning algorithms and the Flask framework.
We learned how to use the data for analysis and how to make the website more secure. This
provides a better, more efficient way of analysing data, which can then be used for many other
purposes.

New software technology can further improve price prediction in the future. The model can be
improved by adding more attributes such as surroundings, marketplaces and many other
variables related to the houses. The predicted data can be stored in databases, and an app can
be created so that people would have a clear idea of prices and could invest their money in a
safer way. If real-time data becomes available, it can be connected to a platform such as H2O,
the machine learning algorithms can be run directly on the live data, and a complete application
environment can be created.

BIBLIOGRAPHY AND REFERENCES


Machine learning reference - https://www.geeksforgeeks.org/machine-learning/

Flask reference - https://flask.palletsprojects.com/en/1.1.x/

Numpy references - https://www.w3schools.com/python/numpy_intro.asp

Pandas references - https://www.w3resource.com/pandas/index.php

