
DEPARTMENT OF INFORMATION TECHNOLOGY ENGINEERING

Semester: BE Semester-VIII, Information Technology Engineering
Subject: R Programming
Subject Professor In-charge: Prof. Shruti Agrawal
Lab Professor In-charge: Prof. Shruti Agrawal

Student Details
Sr. no   Roll Number   Name               Division   Year
1        18101A0062    Manasi Sherkhane   A          BE
2        18101A0048    Omkar Wadekar      A          BE
3        18101A0052    Ritesh Rahatal     A          BE

Grade and Subject Teacher’s Signature:

Resources / Apparatus Required: Hardware: Computer; Software: RStudio
Project Title: Employee Attrition

Dataset Description: We are using a dataset from Kaggle. The dataset contains
4671 rows and 17 features, such as Price, Bedroom, Bathroom, Floors,
Sqft_Living, and Sqft_Lot.

Source Link: https://www.kaggle.com/code/ysthehurricane/house-price-prediction-using-r-programming/data

Problem Statement

Objective: In today’s world, everyone wishes for a house that suits their
lifestyle and provides amenities according to their needs. House prices
change very frequently, which suggests that they are often exaggerated.
Many factors have to be taken into consideration when predicting house
prices, such as location, number of rooms, carpet area, the age of the
property, and basic local amenities.
The main objective of this project is to develop a house price prediction
system using machine learning techniques.
The steps include:
• Extracting data from a large dataset
• Performing exploratory analysis
• Visualizing the data through plots and interpreting the results
• Using a mining algorithm for prediction: linear regression

Theory
R Language: R is a programming language and software environment
for statistical analysis, graphical representation, and reporting. R was
created by Ross Ihaka and Robert Gentleman at the University of
Auckland, New Zealand, and is currently developed by the R
Development Core Team.
The core of R is an interpreted computer language that allows branching
and looping as well as modular programming using functions. For
efficiency, R allows integration with procedures written in C, C++, .NET,
Python, or FORTRAN.
R is freely available under the GNU General Public License, and
pre-compiled binary versions are provided for various operating systems
such as Linux, Windows, and Mac.
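
The branching, looping, and function features mentioned above can be illustrated with a minimal sketch. The function name price_band, the thresholds, and the prices are all made up for illustration and are not part of the project code.

```r
# Hypothetical helper: classify a price into a band using branching.
price_band <- function(price) {
  if (price < 300000) {
    "low"
  } else if (price < 700000) {
    "mid"
  } else {
    "high"
  }
}

# Made-up prices, used only for illustration.
prices <- c(250000, 480000, 910000)

# Looping: call the function on each element in turn.
bands <- character(length(prices))
for (i in seq_along(prices)) {
  bands[i] <- price_band(prices[i])
}

print(bands)
```

In everyday R code the loop would often be replaced by a vectorised call such as sapply(prices, price_band); the explicit loop is kept here only to show the language feature.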

Features of R
As stated earlier, R is a programming language and software environment
for statistical analysis, graphical representation, and reporting. The
following are the important features of R:
• R is a well-developed, simple, and effective programming language
that includes conditionals, loops, user-defined recursive functions,
and input and output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists,
vectors, and matrices.
• R provides a large, coherent, and integrated collection of tools for
data analysis.
• R provides graphical facilities for data analysis and display, either
directly on the computer or printed on paper.
In conclusion, R is one of the world’s most widely used statistical
programming languages. It is a popular choice among data scientists and is
supported by a vibrant and talented community of contributors.
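
A few of the features listed above (operators on vectors, matrices, and lists, and user-defined recursive functions) are sketched below. All values, including the city name, are made up for illustration.

```r
# Vectors: arithmetic operators work element by element.
sqft  <- c(1200, 1850, 2400)         # made-up living areas
price <- c(300000, 450000, 610000)   # made-up prices
price_per_sqft <- price / sqft

# Matrices: %*% is matrix multiplication, t() is the transpose.
m    <- matrix(1:4, nrow = 2)        # filled column by column
m_sq <- m %*% m
m_t  <- t(m)

# Lists can hold elements of different types.
house <- list(city = "Seattle", bedrooms = 3, sqft = sqft[1])

# A user-defined recursive function.
factorial_rec <- function(n) {
  if (n <= 1) 1 else n * factorial_rec(n - 1)
}

print(price_per_sqft)
print(m_sq)
print(factorial_rec(5))
```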
RStudio: RStudio is a free, open-source IDE that provides a set of
integrated tools designed to help you be more productive with R and
Python. It includes a console, a syntax-highlighting editor that supports
direct code execution, and a variety of robust tools for plotting, viewing
history, debugging, and managing your workspace. Its interface is organized
so that the user can clearly view graphs, data tables, R code, and output
all at the same time. It also offers an import-wizard-like feature that
allows users to import CSV, Excel, SAS, SPSS, and Stata files into R without
having to write the code to do so. The stated purpose of the company behind
RStudio is to create free and open-source software for data science,
scientific research, and technical communication.

GGPLOT2: ggplot2 is a powerful and flexible R package, implemented
by Hadley Wickham, for producing elegant data visualisations using the
Grammar of Graphics. It is a system for 'declaratively' creating graphics,
based on "The Grammar of Graphics". You provide the data, tell ggplot2
how to map variables to aesthetics and which graphical primitives to use,
and it takes care of the details.
The concept behind ggplot2 divides a plot into three fundamental
parts: Plot = Data + Aesthetics + Geometry.
The principal components of every plot can be defined as follows:
• Data is a data frame.
• Aesthetics indicate the x and y variables. They can also be used to
control the color, size, or shape of points, the height of bars, etc.
• Geometry defines the type of graphic (histogram, box plot, line
plot, density plot, dot plot, scatter plot, etc.).
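
The Data + Aesthetics + Geometry idea can be sketched as follows. This is only an illustration: it uses the mtcars data set that ships with R as a stand-in for the housing data, and the title and legend label are arbitrary.

```r
library(ggplot2)

# Data: a data frame (mtcars is bundled with R).
# Aesthetics: which columns map to x, y, and colour.
# Geometry: geom_point() draws a scatter plot.
p <- ggplot(data = mtcars,
            aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(title = "Weight vs fuel economy", colour = "Cylinders")

print(p)
```

Swapping geom_point() for another geometry, such as geom_boxplot() with a suitable mapping, changes the plot type without touching the data or the rest of the call.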

There are two major functions in the ggplot2 package, qplot() and ggplot():

• qplot() stands for quick plot and can be used to produce simple plots
easily (recent ggplot2 releases mark qplot() as deprecated).
• ggplot() is more flexible and robust than qplot() and builds a
plot piece by piece.
Code

#Importing Libraries
library(ggplot2)

#Set the plot size (these options are read by Jupyter/Kaggle R notebooks)
options(repr.plot.width = 12, repr.plot.height = 8)

#Importing Dataset
data <- read.csv(file = '../input/housedata/data.csv')
head(data)

#Data exploration
tail(data)
print(paste("Number of records:", nrow(data)))   # paste() already inserts a space
print(paste("Number of features:", ncol(data)))
summary(data)
colnames(data)
unique(data$city)

#Feature Selection
maindf <- data[, c("price", "bedrooms", "sqft_living", "floors", "sqft_lot",
                   "condition", "view", "yr_built")]
head(maindf)

#Figure out house age (in years, relative to the current year)
maindf$oldbuilt <- as.integer(format(Sys.Date(), "%Y")) - maindf$yr_built

#Drop yr_built now that the age has been derived
drops <- c("yr_built")
maindf <- maindf[, !(names(maindf) %in% drops)]

#Plot Correlation matrix
library(ggcorrplot)

cor(maindf)
corr <- round(cor(maindf), 1)

# Plot
ggcorrplot(corr,
           type = "lower",
           lab = TRUE,
           lab_size = 5,
           colors = c("tomato2", "white", "springgreen3"),
           title = "Correlogram of Housing Dataset",
           ggtheme = theme_bw)
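
As a rough illustration of what each cell of the correlation matrix contains, the sketch below computes the Pearson coefficient from its definition and compares it with cor(). The data are synthetic (generated with an assumed slope and noise level), not the Kaggle file.

```r
set.seed(42)  # synthetic data, for illustration only
sqft_living <- runif(100, min = 500, max = 4000)
price <- 150 * sqft_living + rnorm(100, sd = 50000)

# Pearson correlation: sum of cross-deviations divided by the
# square root of the product of the sums of squared deviations.
dx <- sqft_living - mean(sqft_living)
dy <- price - mean(price)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in equivalent (Pearson is the default method).
r_builtin <- cor(sqft_living, price)

print(c(manual = r_manual, builtin = r_builtin))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.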

#Plot scatterplot matrix
pairs(~ bedrooms + sqft_living + floors + condition, data = maindf,
      main = "Scatterplot Matrix")

#Plot boxplot for checking outliers
par(mfrow = c(2, 3)) # divide graph area into 2 rows and 3 columns
boxplot(maindf$bedrooms, main = "Bedrooms")
boxplot(maindf$sqft_living, main = "sqft_living")
boxplot(maindf$floors, main = "floors")
boxplot(maindf$condition, main = "condition")
boxplot(maindf$view, main = "view")
boxplot(maindf$oldbuilt, main = "oldbuilt")
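
The points a boxplot draws beyond its whiskers can also be listed numerically. The sketch below uses made-up bedroom counts and compares boxplot.stats() with the common 1.5 x IQR rule; the two can differ slightly on other data because boxplot.stats() works from hinges rather than quantiles.

```r
# Made-up bedroom counts with one obvious outlier.
bedrooms <- c(2, 3, 3, 3, 4, 4, 4, 5, 3, 4, 12)

# boxplot.stats() returns the values lying beyond the whiskers in $out.
outliers <- boxplot.stats(bedrooms)$out

# The same idea written out with quartiles and the 1.5 * IQR fences.
q <- quantile(bedrooms, c(0.25, 0.75))
fence <- 1.5 * IQR(bedrooms)
flagged <- bedrooms[bedrooms < q[1] - fence | bedrooms > q[2] + fence]

print(outliers)
print(flagged)
```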

#Plot scatterplots
theme_set(theme_bw())
g <- ggplot(maindf, aes(bedrooms, floors))
g + geom_count(col = "tomato3", show.legend = FALSE) +
  labs(y = "floors",
       x = "bedrooms",
       title = "Bedrooms vs Floors")

par(mfrow = c(1, 1)) # reset the 2 x 3 grid left over from the boxplots
plot(x = maindf$sqft_living, y = maindf$sqft_lot,
     xlab = "sqft_living",
     ylab = "sqft_lot",
     xlim = c(0, 3000),
     ylim = c(0, 20000),
     main = "sqft_living vs sqft_lot")

#Plot density plot to check normality
library(e1071)

par(mfrow = c(2, 3)) # 2 rows and 3 columns

plot(density(maindf$bedrooms), main = "Density Plot: Bedrooms",
     ylab = "Frequency",
     sub = paste("Skewness:", round(e1071::skewness(maindf$bedrooms), 2)))
polygon(density(maindf$bedrooms), col = "green")

plot(density(maindf$sqft_living), main = "Density Plot: sqft_living",
     ylab = "Frequency",
     sub = paste("Skewness:", round(e1071::skewness(maindf$sqft_living), 2)))
polygon(density(maindf$sqft_living), col = "orange")

plot(density(maindf$sqft_lot), main = "Density Plot: sqft_lot",
     ylab = "Frequency",
     sub = paste("Skewness:", round(e1071::skewness(maindf$sqft_lot), 2)))
polygon(density(maindf$sqft_lot), col = "green")

plot(density(maindf$condition), main = "Density Plot: condition",
     ylab = "Frequency",
     sub = paste("Skewness:", round(e1071::skewness(maindf$condition), 2)))
polygon(density(maindf$condition), col = "orange")

plot(density(maindf$floors), main = "Density Plot: floors",
     ylab = "Frequency",
     sub = paste("Skewness:", round(e1071::skewness(maindf$floors), 2)))
polygon(density(maindf$floors), col = "green")

plot(density(maindf$oldbuilt), main = "Density Plot: oldbuilt",
     ylab = "Frequency",
     sub = paste("Skewness:", round(e1071::skewness(maindf$oldbuilt), 2)))
polygon(density(maindf$oldbuilt), col = "orange")
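
To make the skewness figure under each density plot easier to read, here is a minimal base-R sketch of the moment-based skewness coefficient. The function name skew_g1 is made up, and the data are synthetic. e1071::skewness() offers several variants of this statistic, so its output may differ slightly from the plain version below.

```r
# Moment-based sample skewness: g1 = m3 / m2^(3/2),
# where m2 and m3 are the second and third central moments.
skew_g1 <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^(3 / 2)
}

set.seed(1)                    # synthetic data, for illustration only
symmetric    <- rnorm(1000)    # roughly symmetric distribution
right_skewed <- rexp(1000)     # long right tail

print(skew_g1(symmetric))
print(skew_g1(right_skewed))
```

A value near 0 suggests a roughly symmetric distribution, a positive value a longer right tail, and a negative value a longer left tail.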

#Plot univariate linear regression between sqft_living and price
ggplot(maindf, aes(y = price, x = sqft_living)) +
  geom_point() +
  xlim(0, 9000) +
  ylim(0, 5000000) +
  geom_smooth(formula = y ~ x, method = "lm")

#Multiple linear regression
linearmodel <- lm(price ~ bedrooms + sqft_living + floors + sqft_lot +
                    condition + view + oldbuilt,
                  data = maindf)
summary(linearmodel)
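
Once a model has been fitted, its coefficients and R-squared can be inspected and it can be used for prediction. The sketch below does this on synthetic data generated from an assumed relationship; the data frame homes, the coefficients used to generate it, and the new house are all hypothetical and unrelated to the Kaggle results.

```r
set.seed(123)  # synthetic data, for illustration only
n <- 200
homes <- data.frame(
  sqft_living = runif(n, min = 500, max = 4000),
  bedrooms    = sample(1:6, n, replace = TRUE)
)

# Assumed "true" relationship used only to generate prices.
homes$price <- 50000 + 200 * homes$sqft_living + 10000 * homes$bedrooms +
  rnorm(n, sd = 30000)

# Fit a multiple linear regression, as in the project code.
fit <- lm(price ~ sqft_living + bedrooms, data = homes)

# Estimated coefficients and share of variance explained.
print(coef(fit))
r2 <- summary(fit)$r.squared
print(r2)

# Predict the price of a new, hypothetical house.
new_home <- data.frame(sqft_living = 2000, bedrooms = 3)
pred <- predict(fit, newdata = new_home)
print(pred)
```

Because the prices were generated from known coefficients, the estimates should land close to 200 per square foot and 10000 per bedroom, which is a simple way to check that the modelling code behaves as expected before applying it to real data.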

Output

• Plot: Correlation matrix
• Plot: Scatterplot matrix
• Plot: Boxplots for checking outliers
• Plot: Scatterplots
• Plot: Density plots to check normality
• Plot: Univariate linear regression between sqft_living and price
Findings

Result:
• Logistic Regression model accuracy without feature selection: 89.65%
• Logistic Regression model accuracy with feature selection: 90.46%
• Random Forest model accuracy: 84.95%
• Decision Tree model accuracy: 84.49%
• Naïve Bayes model accuracy: 76.77%

Interpretations:
• Of the 3 departments, employees in the Research and Human Resources
departments who are paid less tend to leave the company.
• Across education levels, employees who have the same education level
but are paid less leave the company.
• People who work overtime are much more likely to leave than those who
do not work overtime.
• People at the extreme ends of the stock option level spectrum are more
likely to leave than the ones in between.
• Work environment and job involvement ratings do not have a significant
impact on attrition, but in both cases people who are paid less tend to
leave the company.
• Comparing attrition against salary, work environment, and job
involvement levels shows that employees who are paid less tend to leave
the company, irrespective of how they rate their work environment or
what their job involvement level is.
• As age increases, salary increases.
• The higher the job level, the greater the salary.
• Divorced women are least likely to leave the job and single men are
most likely to leave.
• People who travel frequently are more likely to leave than those who
travel less frequently or who do not travel at all for work.
• By education field, people from HR and Technical Degree backgrounds
are more likely to leave than their counterparts.
• People at lower job levels are more likely to leave than their
counterparts.
