Linear Regression Operations Analytics

Name of the Faculty

Course Operations Analytics

Semester – II Batch of 2021-23.

Assignment Date 25/07/2022.

Submission Date 03/08/2022.

Student Name PRN

Assignment – 2

Project

Title:
Linear regression analysis of a real estate dataset to predict ‘house price per unit area’ based on a
set of input variables.

Introduction:

Data collection:

• For this assignment, we used the Real estate dataset taken from the public dataset
repository Kaggle.
• The data consists of various parameters such as the number of convenience stores near
the house and the distance to the nearest MRT station.
• The problem statement is to predict ‘House price of unit area’ based on the set of input
features.

There are 7 attributes in each record of the dataset:

1. Transaction Date

2. House Age

3. Distance to the nearest MRT station

4. Number of convenience stores

5. Latitude

6. Longitude

7. Price

Objective:

1) To understand the use of linear regression on the Real estate dataset.

2) To understand the use of the ggplot() function in R for generating different plots.

3) To understand how to check whether the model is a good fit.


Data Analysis:

1) Price variable:

getwd()

library(MASS)

library(ggplot2)

RE = Real_estate  # assumes the data frame Real_estate is already loaded in the session

RE

dim(RE)

names(RE)

str(RE)

set.seed(1)

row.number = sample(1:nrow(RE), floor(0.8 * nrow(RE)))  # Split the data: 80% of the rows go to training

train=RE[row.number,]

test=RE[-row.number,]

train

test

#Explore the data

ggplot(train, aes(price)) + geom_density(fill="blue")

#model building 1

#Let us make default model.

model1 = lm(price ~ ., data = train)

summary(model1)

par(mfrow = c(2, 2))  # Arrange the diagnostic plots in a 2 x 2 grid

par(bg="white")

plot(model1)

#Model Building - Model 2

model2 = update(model1, ~.-longitude)

summary(model2)

# Predict model 2

pred1=predict(model2, newdata = test)

pred1

# pred1 is already on the price scale because the model was fit on price, not log(price)

• The dim() function in R can be used to either get or set the dimensions of a given
matrix, array, or data frame. dim(RE) gives the dimensions of the Real estate dataset.
Output:

• The names() function is used to get or set the names of an object.


Output: names(RE)

• The str() function is used to compactly display the internal structure of an R object.


Output: str(RE)

• set.seed(1): sets the seed of the random number generator to 1. The main purpose of
setting a seed is to be able to reproduce a particular sequence of random numbers.
• Split the data: sample() draws a random 80% of the row numbers, which become the
training set; the remaining rows form the testing set. The model is built on the training set.
• After splitting the data 80-20, the training dataset has 331 observations and the
testing dataset has 83 observations.
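The split described above can be sketched on a synthetic data frame. This is only an illustration: `toy` and its columns are hypothetical stand-ins, and the 414 rows are an assumption, chosen because an 80% sample of 414 rows truncates to 331. `floor()` makes the truncation of the non-integer size explicit.

```r
# Synthetic stand-in for the real estate data (hypothetical columns and values).
set.seed(1)
toy <- data.frame(
  house_age = runif(414, 0, 40),
  price     = rnorm(414, mean = 38, sd = 13)
)

n_train   <- floor(0.8 * nrow(toy))               # 80% of the rows, rounded down
train_idx <- sample(seq_len(nrow(toy)), n_train)  # random row numbers for training

toy_train <- toy[train_idx, ]    # rows used to fit the model
toy_test  <- toy[-train_idx, ]   # held-out rows used to check predictions

c(train = nrow(toy_train), test = nrow(toy_test))
```

Because the seed is fixed, rerunning the block selects the same rows each time.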

• ggplot2 is a plotting package in R that is used to create complex plots from data in a
data frame. Its ggplot() function declares the input data frame for a graphic and specifies
the aesthetics.
• We used a density plot to visualize the distribution of the data as a smooth
continuous curve.
• In the above plot we are checking the distribution of the response variable ‘price’.
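As a rough illustration of what a density plot represents, the sketch below computes a kernel density estimate with base R's density() on synthetic prices. The values are made up and are not taken from the dataset. geom_density() draws the same kind of estimate, and the area under the curve is approximately 1.

```r
set.seed(1)
toy_price <- rgamma(331, shape = 8, rate = 0.2)  # hypothetical right-skewed prices

d <- density(toy_price)  # Gaussian kernel density estimate on a grid of x values

# Approximate the area under the curve with the trapezoidal rule.
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area
```

Calling plot(d) draws the curve with base graphics.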

Model 1:

• For model1 we consider all the input variables as predictors.


• The summary() function is used to print summary statistics for this model.
• summary(model1) output:

• Residuals are the differences between the actual values and the values predicted by the
model. The smaller the residuals, the better the model.

Model 1 analysis:

• The multiple R-squared value is used to check how well the model fits, as it
indicates how much of the variation in the dependent variable is captured by the model.
• The F-statistic is used to check whether there is any relationship between the dependent and
independent variables. Its value is 74.19, which is far greater than 1 and
hence indicates a relationship between the dependent and independent variables.
• A multiple R-squared value closer to 1 indicates that the model explains a large share of
the variation in the dependent variable and hence is a good fit.
• In this model, the value is 0.5787, which means that 57.87% of the variation in
the dependent variable is explained by the independent variables. Hence this model can be
considered a moderate fit.
• Based on the p-values we can conclude which variables are more or less significant. The
smaller the p-value, the more significant the variable. From the summary output, we can
see that ‘Longitude’ is the least significant feature, as its p-value is large.
• We can also refer to the stars next to each variable to see which variables are less
significant.
• The par() function is used to set graphical parameters; its mfrow parameter takes a
vector giving the number of rows and columns of the plotting grid.
• par(mfrow=c(2,2)) divides the plotting window into a 2 x 2 grid (2 rows and 2 columns).
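The R-squared, F-statistic, and p-values discussed above can be recomputed by hand. The sketch below does this on synthetic data; `toy`, its columns, and the helper `f_from_r2()` are hypothetical. The last line plugs in the figures reported for model 1, assuming 331 training rows and 6 predictors (the 7 attributes minus price), and gives about 74.2, which agrees with the reported 74.19 up to rounding.

```r
set.seed(1)
n <- 200
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)   # pure noise, unrelated to y
)
toy$y <- 2 + 1.5 * toy$x1 - 0.8 * toy$x2 + rnorm(n)

fit <- lm(y ~ ., data = toy)
s   <- summary(fit)

# R-squared by hand: 1 - residual sum of squares / total sum of squares.
rss <- sum(residuals(fit)^2)
tss <- sum((toy$y - mean(toy$y))^2)
r2  <- 1 - rss / tss

# Overall F-statistic from R-squared, for n observations and k predictors.
f_from_r2 <- function(r2, n, k) (r2 / k) / ((1 - r2) / (n - k - 1))
f_stat <- f_from_r2(r2, n, k = 3)

# Per-coefficient p-values (the column behind the significance stars).
p_values <- coef(s)[, "Pr(>|t|)"]

# Figures reported for model 1; n and k are assumptions stated in the text.
f_model1 <- f_from_r2(0.5787, n = 331, k = 6)
```

Both `r2` and `f_stat` match the values that summary(fit) reports for the synthetic model.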

• Residuals vs Fitted plot: If the red trend line stays close to zero, the residuals average
out to zero across the fitted values, which supports the linearity assumption. From the above
plot, we can see that the red trend line is almost at zero except at the start and the end.
• Normal Q-Q plot: This plot indicates whether the residuals are normally distributed. In the
above graph, almost all the plotted dots lie on the reference line except at the end.
• Scale-Location: This plot shows how the residuals are spread and indicates whether the
residuals have equal variance.
• Residuals vs Leverage: This plot is useful for identifying influential observations. The
dots that fall outside the dashed lines are the influential points.
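The quantities behind the four diagnostic plots can also be inspected numerically. The following is a minimal sketch on synthetic data (`toy` is hypothetical); the 4 / n cutoff for Cook's distance is a common rule of thumb used here as an assumption, not the threshold drawn by plot().

```r
set.seed(1)
n <- 100
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(n)

fit <- lm(y ~ x, data = toy)

res   <- residuals(fit)        # plotted against fitted(fit) in the first panel
std   <- rstandard(fit)        # standardized residuals (Q-Q and Scale-Location panels)
lev   <- hatvalues(fit)        # leverage of each observation
cooks <- cooks.distance(fit)   # influence measure shown in the leverage panel

mean_res <- mean(res)              # essentially zero for a model with an intercept
flagged  <- which(cooks > 4 / n)   # candidate influential rows (rule of thumb)
```

The leverages always sum to the number of estimated coefficients, here 2 (intercept and slope).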

Model 2 Building: We can remove the least significant feature, ‘Longitude’, and do the
analysis again.

• summary(model2) output:

Model 2 analysis:

• The residual standard error is slightly lower than for model 1,
indicating that the error between the predictions and the actual results is smaller for this model.
• The multiple R-squared value is 0.5787, indicating that the model is a moderate fit.
• The F-statistic is 86.28, which is greater than the model 1 value and again indicates a
relationship between the dependent and independent variables.

The predict() function uses the fitted model to predict values for new data, here the rows of
the test set.
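The steps for model 2 can be sketched end to end on synthetic data. Everything below is illustrative: `toy`, its columns, and the coefficients used to generate it are hypothetical. The sketch drops one predictor with update(), predicts on held-out rows, and summarizes the prediction error as a root mean squared error (RMSE). Because the response is modeled on its original scale, the predictions need no back-transformation.

```r
set.seed(1)
n <- 300
toy <- data.frame(
  age      = runif(n, 0, 40),
  distance = rexp(n, rate = 1 / 1000),
  noise    = rnorm(n)   # unrelated predictor, to be dropped
)
toy$price <- 45 - 0.25 * toy$age - 0.006 * toy$distance + rnorm(n, sd = 5)

idx       <- sample(seq_len(n), floor(0.8 * n))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

full    <- lm(price ~ ., data = toy_train)
reduced <- update(full, ~ . - noise)   # same model without `noise`

pred <- predict(reduced, newdata = toy_test)

# Root mean squared error on the held-out rows.
rmse <- sqrt(mean((toy_test$price - pred)^2))

# Partial F-test: does dropping `noise` significantly worsen the fit?
cmp <- anova(reduced, full)
```

Comparing `rmse` for the full and reduced models on the same test rows is one way to judge whether removing a variable hurts predictive accuracy.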
