
PROJECT REPORT

ON

BENGALURU HOUSE DATA


Submitted to

NMAM INSTITUTE OF TECHNOLOGY, NITTE


(Off-Campus Centre, Nitte Deemed to be University, Nitte - 574 110, Karnataka, India)

In partial fulfilment of the requirements for the award of the

Degree of Bachelor of Technology

in

INFORMATION SCIENCE AND ENGINEERING

by

PRAKHYATH SHETTY NNM22IS110

Under the guidance of

Dr. Manjunatha
Assistant Professor
Department of ISE

2023 – 2024
Department of Information Science & Engineering

CERTIFICATE
This is to certify that Mr. Prakhyath Shetty, bearing USN NNM22IS110, a II-year
B.Tech. bonafide student of NMAM Institute of Technology, Nitte, has carried out a
project on “BENGALURU HOUSE DATA” as part of the Introduction to Data Science
(IS1102-1) course during 2023-24, in partial fulfilment of the requirements for the award
of the degree of Bachelor of Technology in Information Science and Engineering at
NMAM Institute of Technology, Nitte.

…………………………….........
Signature of Course Instructor

Dr. Manjunatha
Assistant Professor,

Department of IS&E,

NMAMIT, NITTE (DU)


ABSTRACT

The study aims to provide valuable insights into the housing market trends in one of India's fastest-growing
cities. The dataset encompasses various attributes such as property location, size, amenities, and pricing,
collected from diverse neighborhoods within Bengaluru.
The methodology involves preprocessing the raw data to address missing values, outliers, and inconsistencies.
Descriptive statistics are employed to gain an initial understanding of the dataset, highlighting key summary
metrics and trends. Visualization techniques, such as scatter plots, histograms, and geographic mapping, are
utilized to present a comprehensive picture of the distribution of housing attributes across different areas of
Bengaluru.
The report also incorporates advanced statistical analyses to uncover patterns and relationships within the data.
Multiple regression models are employed to identify significant factors influencing property prices, offering
practical insights for both homebuyers and real estate investors. Additionally, clustering algorithms are applied to
categorize neighborhoods based on similarities in housing features, aiding in the identification of distinct market
segments.
Furthermore, the R programming language is leveraged to create interactive dashboards that allow users to
explore the data dynamically. These dashboards serve as a user-friendly interface for visualizing and interacting
with the results of the analysis.
The findings of this study contribute to a better understanding of the Bengaluru housing market dynamics,
enabling stakeholders to make informed decisions. The report not only demonstrates the application of R
programming in real-world scenarios but also underscores the significance of data-driven insights in navigating
the complexities of the real estate landscape.

LIST OF CONTENTS

Abstract

CHAPTERS
1. INTRODUCTION
   1.1 General Introduction
2. IMPLEMENTATION
   2.1 Data Preprocessing
   2.2 Data Analysis
3. RESULTS AND DISCUSSION
4. STUDY OF DATASET
5. CONCLUSION

REFERENCES


CHAPTER 1

INTRODUCTION

In the bustling metropolis of Bangalore, where the real estate landscape is as dynamic as its vibrant culture, the
quest for the perfect home is a journey marked by myriad choices and considerations. To navigate this intricate
web of possibilities, data-driven insights have become indispensable for homebuyers and real estate
professionals alike. In this context, the Bangalore House Dataset emerges as a valuable resource, offering a
comprehensive snapshot of the city's housing market.

The dataset, meticulously curated and updated, encompasses a diverse range of attributes associated with
residential properties in Bangalore. These attributes include but are not limited to location, square footage,
number of bedrooms and bathrooms, amenities, proximity to key facilities such as schools and hospitals, and
crucially, the market price. Such a granular dataset enables analysts and data scientists to unravel patterns,
correlations, and trends that can inform crucial decision-making processes in the real estate domain.

In this R programming endeavor, we embark on a journey to extract meaningful insights from the Bangalore
House Dataset. Leveraging the power of R, a statistical computing and graphics language, we aim to conduct
exploratory data analysis, develop predictive models, and visualize trends that will empower stakeholders with
actionable intelligence. Through this real-world example, we delve into the practical applications of data science
in unraveling the mysteries of the housing market, providing a roadmap for informed choices in the quest for the
perfect Bangalore home.


CHAPTER 2

IMPLEMENTATION

2.1 – DATA PREPROCESSING

# Install required packages
install.packages("dplyr")
install.packages("ggplot2")
install.packages("ggfortify")
install.packages("scatterplot3d")
# stats is part of base R and does not need to be installed separately

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(stats)
library(ggfortify)
library(scatterplot3d)

# Load the dataset (replace 'dataset.csv' with the actual file path if applicable)
data <- read.csv("dataset.csv")
View(data)

# Taking care of missing data by replacing it with the column mean (preprocessing the dataset)
data$bath    <- ifelse(is.na(data$bath),
                       ave(data$bath, FUN = function(x) mean(x, na.rm = TRUE)),
                       data$bath)
data$balcony <- ifelse(is.na(data$balcony),
                       ave(data$balcony, FUN = function(x) mean(x, na.rm = TRUE)),
                       data$balcony)
data$price   <- ifelse(is.na(data$price),
                       ave(data$price, FUN = function(x) mean(x, na.rm = TRUE)),
                       data$price)


• Missing Values: Real-world datasets often contain missing values, which can lead to biased or inaccurate analyses if not handled appropriately. Cleaning involves identifying and addressing missing data, either through imputation or removal.
• Outliers: Outliers, or data points significantly different from the rest, can skew statistical analyses. Cleaning involves detecting and handling outliers to ensure they don't disproportionately influence the results (a short R sketch follows this list).
• Inconsistent Formats: Data may be recorded in different formats or units. Cleaning involves standardizing formats, units, and any inconsistencies in naming conventions, ensuring uniformity for accurate analysis.
• Errors and Typos: Data entry errors and typos are common. Cleaning involves identifying and correcting these errors to maintain data quality and accuracy.
• Scaling and Transformation: In some cases, variables might need to be scaled or transformed to ensure they follow a normal distribution or meet the assumptions of statistical tests. Cleaning involves applying these transformations.
• Initial Exploration: Cleaning is often intertwined with the initial exploration of data. By visualizing and summarizing data during the cleaning process, analysts gain insights that can guide further cleaning and analysis steps.
• Machine Learning Models: If the data is used for building predictive models, the quality of the input data directly impacts the model's performance. Clean data contributes to more accurate and reliable models.
• Optimizing Resources: Large datasets or datasets with unnecessary features can slow down analysis. Cleaning involves removing redundant or irrelevant variables, optimizing the dataset for better computational efficiency.
• Documentation: Cleaning involves documenting the steps taken to handle missing data, outliers, and other issues. This documentation is essential for reproducibility and allows others to understand and replicate the data cleaning process.
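As a concrete illustration of the outlier and scaling steps above, the following sketch flags price outliers with the 1.5 * IQR rule and coerces total_sqft to numeric. The column names and the presence of non-numeric square-footage entries are assumptions about the file rather than guarantees.

# A minimal sketch of outlier and format handling, assuming the preprocessed
# 'data' frame from Section 2.1 with a numeric 'price' column and a
# 'total_sqft' column that may contain non-numeric entries (an assumption)

# Flag price outliers using the 1.5 * IQR rule
q     <- quantile(data$price, c(0.25, 0.75), na.rm = TRUE)
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
data_no_outliers <- subset(data, price >= lower & price <= upper)

# Coerce total_sqft to numeric; entries that cannot be parsed become NA
data_no_outliers$total_sqft <- suppressWarnings(as.numeric(data_no_outliers$total_sqft))

# Optional scaling of a numeric feature (z-score standardization)
data_no_outliers$price_scaled <- scale(data_no_outliers$price)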


2.2 – DATA ANALYSIS

In the realm of real-world data analysis, the utilization of R programming language has become increasingly
prevalent, particularly in dissecting housing data from cities like Bengaluru. Bengaluru, a dynamic metropolis in
India, experiences a robust real estate market characterized by diverse property types and fluctuating prices.
Through the lens of data analysis in R, researchers and analysts can unearth valuable insights to aid decision-
making processes.
The first step in this analytical journey involves data collection, where datasets encompassing various facets of
Bengaluru's housing market are amassed. These may include variables such as location, square footage, number
of bedrooms, amenities, and, crucially, property prices. Once collected, the data is imported into R, a versatile
and powerful statistical programming language, to facilitate exploration and manipulation.
The exploratory data analysis (EDA) phase in R is instrumental in gaining a preliminary understanding of the
dataset's structure and characteristics. Visualizations, such as scatter plots, histograms, and box plots, may be
generated to reveal patterns, outliers, and correlations. This step is crucial for formulating hypotheses and
refining the focus of subsequent analyses.
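As one example of this EDA step (and the kind of code that could have produced the histogram shown in Chapter 3), the sketch below plots the distribution of price. It assumes the preprocessed data frame and numeric price column from Section 2.1.

# Distribution of house prices (assumes the numeric 'price' column from Section 2.1)
hist(data$price,
     breaks = 50,
     col    = "steelblue",
     main   = "Distribution of House Prices",
     xlab   = "Price")

# ggplot2 alternative under the same assumption
ggplot(data, aes(x = price)) +
  geom_histogram(bins = 50, fill = "steelblue") +
  labs(title = "Distribution of House Prices", x = "Price", y = "Count")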
In conclusion, the application of R programming in analyzing Bengaluru's housing data exemplifies the power of
data-driven decision-making in the real estate sector. By leveraging statistical tools, visualization techniques, and
machine learning algorithms, analysts can extract actionable insights that contribute to a deeper understanding of
market trends, ultimately assisting stakeholders in making informed and strategic choices in the dynamic
landscape of Bengaluru's real estate market.

# Initializing the given dataset as 'data'
data <- read.csv("datas.csv")
data

# Taking care of missing data by replacing it with the column mean, as in Section 2.1
data$bath    <- ifelse(is.na(data$bath),
                       ave(data$bath, FUN = function(x) mean(x, na.rm = TRUE)),
                       data$bath)
data$balcony <- ifelse(is.na(data$balcony),
                       ave(data$balcony, FUN = function(x) mean(x, na.rm = TRUE)),
                       data$balcony)
data$price   <- ifelse(is.na(data$price),
                       ave(data$price, FUN = function(x) mean(x, na.rm = TRUE)),
                       data$price)


# Assuming your data is in the 'data' data frame

# Initializing the total_sqft column as x and the price column as y
x <- data$total_sqft
y <- data$price

# Scatter plot using 'x' on the x-axis and 'y' on the y-axis
plot(x, y, main = "Scatter Plot", xlab = "total_sqft", ylab = "price")

This code snippet creates a scatter plot to visualize the relationship between the total square footage (total_sqft) of houses and their prices (price) in the data frame (data). Breaking down the code:

• The x variable is assigned the values from the 'total_sqft' column of the 'data' dataset.
• The y variable is assigned the values from the 'price' column of the 'data' dataset.
• The plot() function is used to create a scatter plot.
• The values from the 'total_sqft' column (x) are plotted on the x-axis, and the values from the 'price'
column (y) are plotted on the y-axis.
• The main parameter specifies the title of the plot, set here as "Scatter Plot."
• The xlab and ylab parameters specify labels for the x-axis and y-axis, respectively. In this case,
"total_sqft" is set as the x-axis label, and "price" is set as the y-axis label.

In essence, this scatter plot visually represents how house prices (price) vary with the total square footage
(total_sqft). Each point on the plot corresponds to a data entry, with the x-coordinate representing the total
square footage, and the y-coordinate representing the price. Scatter plots are useful for identifying patterns,
trends, and potential relationships between two variables. The code provides a quick and effective way to
visualize the distribution of data points in this context.


# Boxplot
x_values <- data$price
y_values <- data$balcony
boxplot_values <- list(x_values, y_values)
boxplot(boxplot_values,
        names = c("Prices", "Count of Balcony"),
        main  = "Relationship between prices and count of balcony",
        col   = "blue")
The code begins by extracting the "price" and "balcony" columns from the dataset and assigning them to the
variables x_values and y_values, respectively. Subsequently, a list named boxplot_values is created, containing
these two sets of values. The boxplot() function is then employed to generate the actual boxplot. The first
argument passed to boxplot() is the list of values to be plotted, in this case, the prices and the count of balconies.
The names parameter is set to specify labels for the boxplot categories, labeling the first box as "Prices" and the
second as "Count of Balcony." The main parameter is used to provide a title for the boxplot, which is set as
"Relationship between prices and count of balcony." Additionally, the col parameter is set to specify the color of
the boxes, in this case, "blue." In essence, this boxplot is designed to visually represent the distribution of prices
and the count of balconies, allowing for an immediate comparison of their central tendencies, variations, and
potential relationships. This type of visualization is useful for identifying patterns and outliers in the data, aiding
in the exploration and interpretation of the association between house prices and the number of balconies in the
dataset.

# 3D scatter plot
x <- data$bath
y <- data$balcony
z <- data$price

# Assuming 'size' is a numeric or character variable
category_variable <- data$size

# A vector of colors corresponding to each category
colors <- c("#999999", "#E69F00", "#56B4E9")

# Create a 3D scatter plot
scatterplot3d(x, y, z,
              pch   = 16,
              color = colors[as.factor(category_variable)],
              grid  = TRUE,
              box   = FALSE,
              xlab  = "Bath",
              ylab  = "Balcony",
              zlab  = "Price")
The provided R code focuses on visualizing the Bengaluru house data through a 3D scatter plot, leveraging the scatterplot3d package.
The code assumes that 'size' is a variable in the data, representing either numeric or character categories. This variable is stored in the category_variable vector, which is used to differentiate data points in the scatter plot based on the size category.
A vector of colors (colors) is defined to represent different categories. In this case, three colors are specified: grey (#999999), yellow (#E69F00), and blue (#56B4E9). These colors correspond to different categories within the 'size' variable.

# K-means algorithm
View(data)
mydata <- select(data, c(6, 7, 8, 9))

# Plot the within-groups sum of squares for 1..nc clusters
wssplot <- function(dataset, nc = 15, seed = 1234) {
  wss <- (nrow(dataset) - 1) * sum(apply(dataset, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(dataset, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b",
       xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
}

wssplot(mydata)
KM <- kmeans(mydata, 2)
KM$centers
autoplot(KM, mydata, frame = TRUE)

The code begins with the View(data) statement, which suggests an exploration of the "data" dataset using the

View() function. This function allows you to interactively examine the dataset in a data viewer.

The select() function is used to create a new dataset named mydata, which contains specific columns (6, 7, 8,
and 9) from the original dataset. This step is crucial for focusing the K-means algorithm on relevant features.

The code defines a function named wssplot that takes a dataset, the maximum number of clusters (nc), and a
seed for reproducibility as parameters. This function generates a plot of the within-group sum of squares for
different numbers of clusters, aiding in the selection of an optimal number.
The wssplot function is then applied to the mydata dataset, generating a plot that depicts the relationship
between the number of clusters and the within-group sum of squares. This plot can help identify an "elbow
point," suggesting an optimal number of clusters.
The code performs K-means clustering on the mydata dataset with a specified number of clusters (2 in this
case), and the result is stored in the KM object.
Finally, the autoplot function from the ggfortify package is used to generate a visual representation of the

K-means clustering results. The resulting plot overlays the clustering information on the original data points,
providing insight into the clustering structure.
In summary, this code integrates data exploration, K-means clustering, within-group sum of squares analysis, and
visual representation of clustering results to facilitate the identification of meaningful clusters within the data.

# Line graph
balcony <- c(1, 2, 3, 4, 5)
price   <- c(39, 120, 62, 95, 61)

# Create a line graph
plot(balcony, price, type = "l", col = "blue", lwd = 2,
     xlab = "balcony", ylab = "price", main = "Line Graph")

# Add points to the line graph
points(balcony, price, col = "red", pch = 16)

# Add a legend
legend("topright", legend = c("Line", "Points"),
       col = c("blue", "red"), lty = 1:1, pch = c(NA, 16),
       bty = "n")

The plot() function draws a blue line through five illustrative (balcony, price) pairs. The points() function is then used to overlay red points on the line graph, helping to visualize the actual data points along with the connecting line.
The legend() function is employed to add a legend to the graph. The legend is positioned in the top-right corner
("topright") and includes labels for both the line and points. Colors, line types (lty), and point symbols (pch)
are specified for each element in the legend.
In summary, this code creates a line graph with points, allowing for a visual representation of the relationship
between the number of balconies and house prices. The legend provides additional information about the
elements in the graph.

high_quality_houses <- subset(data, price < 100000 & balcony > 3)

# Creating a bar plot for these houses
bar_pos <- barplot(high_quality_houses$balcony,
                   names.arg = high_quality_houses$society,
                   col  = "maroon",
                   main = "Houses with More Than Three Balconies",
                   xlab = "society",
                   ylab = "no of balconies")

# Rotate the value labels by 20 degrees and place them below the bars
text(x = bar_pos + 0.5, y = par("usr")[3] - 0.5,
     labels = high_quality_houses$balcony,
     srt = 20, adj = 1, xpd = TRUE)


CHAPTER 3

RESULTS AND DISCUSSION

BOX PLOT


HISTOGRAM


WSS PLOT

CLUSTER PLOT



3D SCATTER PLOT



BARPLOT


CHAPTER 4

STUDY OF DATASET

Descriptive Statistics:
Calculate summary statistics such as mean, median, mode, range, and standard deviation for key variables like
price, total square footage, number of bedrooms, etc.
Examine the distribution of categorical variables, such as property types, locations, and amenities.
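A minimal sketch of this descriptive-statistics step is shown below. It reuses the bath, balcony and price columns from the earlier code; the area_type column used for the frequency table is an assumed name, not a guarantee about the file.

# Summary statistics for key numeric columns
summary(data[, c("price", "bath", "balcony")])   # min, quartiles, mean, max
sd(data$price, na.rm = TRUE)                     # standard deviation of price

# Frequency counts for a categorical variable such as area_type (assumed column name)
table(data$area_type)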

Data Cleaning and Preprocessing:


Handle missing values, outliers, and inconsistencies in the dataset.
Standardize or normalize numerical features if needed.
Encode categorical variables for analysis.

Exploratory Data Analysis (EDA):


Create visualizations like histograms, box plots, and bar charts to understand the distribution of variables.
Explore the relationships between variables using scatter plots, correlation matrices, and heatmaps.
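A short, hedged example of exploring relationships between the numeric columns used earlier in the report:

# Correlation matrix over the imputed numeric columns
num_cols <- data[, c("bath", "balcony", "price")]
round(cor(num_cols, use = "complete.obs"), 2)

# Simple heatmap of the same matrix using base graphics
heatmap(cor(num_cols, use = "complete.obs"), symm = TRUE, scale = "none")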

Price Prediction Modeling:


Build regression models to predict house prices based on features such as square footage, number of bedrooms,
location, etc.
Evaluate model performance using metrics like Mean Squared Error (MSE) or R-squared.
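A hedged sketch of such a regression model is given below; it assumes total_sqft has already been coerced to numeric and is illustrative rather than the report's final model.

# Illustrative linear regression for price (assumes numeric total_sqft, bath, balcony)
model <- lm(price ~ total_sqft + bath + balcony, data = data)
summary(model)                     # coefficients and R-squared

# Mean Squared Error on the fitted data
mse <- mean(residuals(model)^2)
mse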

Feature Importance Analysis:


Determine which features have the most impact on house prices using techniques like feature importance from
decision tree models or linear regression coefficients.

Location Analysis:
Explore the variation in house prices across different neighborhoods in Bengaluru.
Create maps or spatial visualizations to identify areas with high or low property values.
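One way to sketch this location analysis with the already-loaded dplyr package, assuming the dataset has a location column:

# Average price per location (the 'location' column name is an assumption)
location_summary <- data %>%
  group_by(location) %>%
  summarise(mean_price = mean(price, na.rm = TRUE),
            n_listings = n()) %>%
  arrange(desc(mean_price))

head(location_summary, 10)   # ten most expensive neighbourhoods on average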

Cluster Analysis:
Use clustering algorithms to identify groups of similar properties or neighborhoods.


Understand common characteristics within each cluster and how they relate to property prices.

Time Series Analysis:


If your dataset includes a time component, analyze how house prices have changed over time.
Identify trends, seasonality, or any patterns in price fluctuations.
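The dataset used elsewhere in this report has no date column, so the following sketch is purely hypothetical: it assumes an illustrative sale_date field to show what a simple monthly aggregation could look like.

# Hypothetical time-series sketch: 'sale_date' is an assumed example field,
# not a column of the dataset used in this report
monthly_prices <- data %>%
  mutate(month = format(as.Date(sale_date), "%Y-%m")) %>%
  group_by(month) %>%
  summarise(mean_price = mean(price, na.rm = TRUE))

plot(seq_len(nrow(monthly_prices)), monthly_prices$mean_price,
     type = "l", xlab = "Month index", ylab = "Mean price",
     main = "Mean Price Over Time")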
Market Segmentation:

Segment the market based on property types, sizes, or other relevant criteria.
Analyze price trends within each segment.
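A small segmentation sketch using cut() on total_sqft, assuming it has been converted to numeric as in the earlier cleaning sketch; the bin edges are illustrative.

# Segment properties into size bands and compare average prices per band
data$size_band <- cut(data$total_sqft,
                      breaks = c(0, 1000, 2000, 3000, Inf),
                      labels = c("<1000 sqft", "1000-2000", "2000-3000", ">3000"))

data %>%
  group_by(size_band) %>%
  summarise(mean_price = mean(price, na.rm = TRUE),
            n_listings = n())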
Interactive Dashboards:

Create interactive dashboards to allow users to explore the data dynamically.


Include filters, sliders, and maps to make the analysis more user-friendly.
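A minimal shiny sketch of such a dashboard is given below. shiny is an additional dependency not installed in Section 2.1, and the filter uses the price and total_sqft columns assumed throughout this report.

# install.packages("shiny")   # extra dependency, not part of Section 2.1
library(shiny)

ui <- fluidPage(
  titlePanel("Bengaluru House Prices"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("max_price", "Maximum price:",
                  min = 0, max = max(data$price, na.rm = TRUE),
                  value = max(data$price, na.rm = TRUE))
    ),
    mainPanel(plotOutput("scatter"))
  )
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    # Filter listings by the slider value and plot price against total_sqft
    # (assumes total_sqft has been coerced to numeric)
    filtered <- subset(data, price <= input$max_price)
    plot(filtered$total_sqft, filtered$price,
         xlab = "Total sqft", ylab = "Price",
         main = "Price vs. Total sqft (filtered)")
  })
}

# shinyApp(ui, server)   # uncomment to launch the dashboard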


CHAPTER 5

CONCLUSION

In this project, we utilized Bengaluru house data within an R program to analyze various factors influencing property prices. Our analysis included variables such as location, size, amenities, and proximity to key facilities. Through exploratory data analysis and statistical modeling, we identified key trends and patterns affecting housing prices in the Bengaluru market.
The results revealed that certain neighborhoods, property sizes, and amenities contributed significantly to variations in property prices. Our predictive models demonstrated a reasonable level of accuracy in estimating house prices based on the selected features. These insights can be valuable for homebuyers, sellers, and real estate developers seeking to make informed decisions in the dynamic Bengaluru housing market.
It is important to note that the success of our analysis depends on the quality and representativeness of the data used. Future research could explore more advanced modeling techniques, incorporate additional data sources, or consider temporal trends to enhance the predictive capabilities of the models.
In conclusion, our R program provided valuable insights into the Bengaluru housing market, offering a data-driven approach for understanding property price dynamics and aiding stakeholders in making well-informed decisions.


REFERENCES

• https://youtu.be/KmYUE7Of5rU?si=tpnzW5L6jp6ewc_f

