1641 gcs2007 08 Nguyen Xuan Nam Assignment 1



Business Intelligence (FPT University)


Higher Nationals in Computing

Unit 14: Business Intelligence


ASSIGNMENT 1

Learner’s name: Nguyen Xuan Nam


ID: GCS200708
Class: GCS0905A
Subject ID: 1641
Assessor name: Nguyen Xuan Sam

Assignment due: 4/3/2023    Assignment submitted: 4/3/2023


ASSIGNMENT 1 FRONT SHEET

Qualification BTEC Level 5 HND Diploma in Computing

Unit number and title Unit 14: Business Intelligence

Submission date Date Received 1st submission

Re-submission Date Date Received 2nd submission

Student Name Nguyen Xuan Nam Student ID GCS200708

Class GCS0905A Assessor name Nguyen Xuan Sam

Student declaration
I certify that the assignment submission is entirely my own work and I fully understand the consequences of plagiarism. I understand that
making a false declaration is a form of malpractice.

Student’s signature thanh

Grading grid
P1 P2 M1 M2 D1 D2



Summative Feedback:    Resubmission Feedback:

Grade: Assessor Signature: Date:

Signature & Date:


ASSIGNMENT 1 BRIEF
Qualification BTEC Level 5 HND Diploma in Computing

Unit number and title Unit 14: Business Intelligence

Assignment title Assignment 1: Discover business process and BI technologies

Academic Year 2023

Unit Tutor Nguyen Xuan Sam

Issue date Submission date

Submission Format:

Format: The submission is in the form of an individual written report that shows how you have managed
the project. This should be written in a concise, formal business style using single spacing
and font size 12. You are required to make use of headings, paragraphs and subsections as
appropriate, and all work must be supported with research and referenced using the
Harvard referencing system. Please also provide a bibliography using the Harvard
referencing system.
Submission: Students must submit the assignment by the due date and in the way requested by
the Tutors. The form of submission will be a soft copy in PDF posted on the corresponding
course of http://cms.greenwich.edu.vn/
Note: The assignment must be your own work, and not copied by or from another student or from
books etc. If you use ideas, quotes or data (such as diagrams) from books, journals or other sources, you
must reference your sources, using the Harvard style. Make sure that you know how to reference
properly, and that you understand the guidelines on plagiarism. If you do not, you will fail.
Assignment Brief and Guidance:

Your company has been working in [Assumed Domain] for 2 years. For a new, young company,
competition in the market is very high. Therefore, the Board of Directors has decided to apply
Business Intelligence to improve the company's business processes by making better decisions.
The Board of Directors has assigned a small group in the Research & Development Department, including you,
to study business intelligence so that it can be applied in the company in the coming years.
You need to research business processes and decision support processes in the company and
identify the types of data (unstructured, semi-structured or structured) generated by these
processes, with examples. You also need to research the software currently used in the business
processes or decision support processes and evaluate its use (benefits and drawbacks).
Next, you need to understand the types of support for decision-making at different levels
(operational, tactical and strategic) within the company and study which business intelligence
features can help with those types of support. Study the information systems or technologies (of BI) that can
be used in this case, and compare and contrast them to conclude which should be used.
Your group needs to present the research results to the board in a 30-minute presentation.

Learning Outcomes and Assessment Criteria

LO1 Discuss business processes and the mechanisms used to support business decision-making
Pass: P1 Examine, using examples, the terms ‘Business Process’ and ‘Supporting Processes’.
Merit: M1 Differentiate between unstructured and semi-structured data within an organisation.
Distinction: D1 Critically evaluate the project management process and appropriate research methodologies applied.

LO2 Compare the tools and technologies associated with business intelligence functionality
Pass: P2 Compare the types of support available for business decision-making at varying levels within an organisation.
Merit: M2 Justify, with specific examples, the key features of business intelligence functionality.
Distinction: D2 Compare and contrast a range of information systems and technologies that can be used to support
organisations at operational, tactical and strategic levels.


Table of Contents
1 Introduction
1.1 Overview of problems
1.2 Motivations
1.3 Objectives
2 Related works and dataset
2.1 Related works
2.2 Dataset
2.3 Summary
3 Proposed model
3.1 Correlation
3.2 Linear regression
3.3 Multiple regression
4 Simulating scenarios and Results
4.1 Package installation
4.2 Correlation
4.3 Scenarios and analysis
5 Conclusions and future works
5.1 Conclusions
5.2 Future works


ASSIGNMENT 1 ANSWERS

Topic: Predicting house prices in the Housing Price Prediction dataset using Regression models
1 Introduction
1.1 Overview of problems
Real estate pricing is a significant problem today. Figure 1 describes the relationships between house prices
and the factors that affect them.

Figure 1: Factors that affect house prices.

Data scientists have built many machine learning projects for price prediction. With machine learning, we can
predict a value for new data based on features that we already have. One of the most widely used models for
predictive analysis is regression. Its purpose is to predict future results, and it has been applied in many
fields such as economics, business, banking, healthcare, e-commerce, entertainment, and sports. This technique
is therefore a popular choice for building a price prediction model from a set of features.

1.2 Motivations
Real estate is one of the least transparent sectors of our economy. Real estate prices fluctuate daily, and
sometimes prices are inflated rather than based on a sound valuation. When people decide to buy a home, they
look for one that is affordable and meets all their requirements. With machine learning, we can predict house
prices and decide whether a house is worth buying or can be sold for a higher price. In this report, I forecast
home prices in King County, USA. Features such as the size, location, and square footage of the home can be key
factors in determining the price.

1.3 Objectives
In this work, I focus on several questions:

- How does the size of the house affect the house price?
- How does the size of the housing area affect the house price?
- How does the area of the house campus affect the area of the house?
- How does the area of the house affect the number of bathrooms?
- Multiple regression on all features

To answer these questions, I first present the dataset. Extracting information from raw data takes several
steps, starting with data collection; these steps are shown in the figure below.

In this first chapter, I have introduced my work and outlined the goals of the project. The rest of the report
presents my dataset, methods, and results, as well as a demo of the application.

2 Related works and dataset


2.1 Related works
In one study, the authors used several algorithms: Multiple Linear Regression, Ridge Regression, LASSO
Regression, Elastic Net Regression, Ada Boosting Regression, and Gradient Boosting. Their purpose was to
compare the methods by the model error of each. The results show that multiple regression has a fairly low
error, which suggests that multiple regression is a suitable model for predicting housing prices.

In a further study, the authors divide the characteristics affecting housing prices into three categories:
physical (structural) condition, concept, and location. Physical features are characteristics of the house that
can be seen with the human eye, such as the size of the house, the number of bedrooms, the presence of a kitchen
and garage, the presence of a garden, the size of the plot and other structures, and the age of the house.
Conceptual features, on the other hand, are concepts offered by developers to attract buyers, such as a
minimalist home, a healthy and environmentally friendly home, or an elite environment. The study reports that
these characteristics are significantly correlated with real estate prices.


In summary, many studies predict house prices using different machine learning methods or models. In my
project, I use linear regression and multiple regression for model building and forecasting. I take advantage
of all the features in this dataset to build a good model.

2.2 Dataset
2.2.1 Data collection

I obtained the data from Kaggle. The dataset contains house sale prices for 2014-2015. The raw dataset has over
21000 entries and 21 columns. In this dataset, the price column is the dependent variable and the remaining
columns, except ID and Date, are the independent variables. The beginning of the dataset is shown below. As
Figure 3 shows, price is a continuous dependent variable, and the other variables are the independent
variables.

2.2.2 Description dataset

The dataset includes:

- Id: the unique identifier of each house
- Date: the date when the house was sold
- Price: the price of the house (in US dollars)
- Bedrooms: the number of bedrooms
- Bathrooms: the number of bathrooms
- Sqft_living: the square footage of the house
- Sqft_lot: the square footage of the lot
- Floors: the number of floors
- Waterfront: whether the house has a waterfront view
- View: whether the house has a view
- Condition: the overall condition of the house on a scale of 1-5
- Grade: the overall grade of the housing unit on a scale of 0-10
- Sqft_above: the living area of the home, excluding the basement
- Sqft_basement: the living square footage of the basement
- Yr_built: the year the house was built
- Yr_renovated: the year the house was renovated
- Zipcode: the zipcode of the house
- Lat: latitude coordinate
- Long: longitude coordinate
- Sqft_living15: the interior living area of the 15 closest neighbors' homes
- Sqft_lot15: the area of the land lots of the 15 closest neighbors

2.2.3 Data cleaning

Data cleaning is one of the most important steps before exploration, analysis, and modeling in machine learning.
Its purpose is to deal with abnormal data such as missing data, outliers, unwanted data, or inconsistent data.
There are many ways to clean data, for example deleting data, replacing data, or changing the data type of a
value. Before cleaning, I first examine the raw dataset to see what to do next:

2.2.4 Data processing

For data processing, there are many things to do with the raw dataset:

a. Change the datatype of sqft_basement from int into float


b. Change the datatype of the date column and create more columns from it

I change the date column from the object datatype to a date datatype.

c. Change the datatype of yr_renovated

In the raw dataset, this column has the int datatype, so I change it to float.
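The three conversions above can be illustrated with a minimal sketch in plain Python. The sample rows, the "YYYYMMDDT000000" date format, and the new column names year_sold and month_sold are assumptions made for illustration; they are not taken from the original notebook.

```python
from datetime import datetime

# Hypothetical raw rows; the date format is an assumption about the raw file.
raw_rows = [
    {"date": "20141013T000000", "sqft_basement": 0, "yr_renovated": 0},
    {"date": "20150225T000000", "sqft_basement": 400, "yr_renovated": 1991},
]


def process_row(row):
    """Return a copy of the row with converted types and extra date columns."""
    sold = datetime.strptime(row["date"], "%Y%m%dT%H%M%S").date()
    return {
        "date": sold,                                  # object (string) -> date
        "year_sold": sold.year,                        # hypothetical extra column
        "month_sold": sold.month,                      # hypothetical extra column
        "sqft_basement": float(row["sqft_basement"]),  # int -> float
        "yr_renovated": float(row["yr_renovated"]),    # int -> float
    }


processed = [process_row(r) for r in raw_rows]
print(processed[0])
```

In pandas, the same steps would typically be written with pd.to_datetime for the date column and astype(float) for the numeric columns.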

2.3 Summary
In this chapter, I cleaned my raw dataset into a dataset that is easier to explore and analyze.

I believe this is the most important step before doing any prediction or modeling in machine learning. In the
next chapter, I begin to build and explain my model, and I show some visualizations for better analysis.

3 Proposed model
3.1 Correlation
Correlation measures the strength and direction of the linear relationship between two variables (Hauke and
Kossowski, 2011). The correlation coefficient formula follows (David Groebner, 2017).
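For reference, the standard sample form of this coefficient is written out here. The notation (n observations x_i and y_i with means x-bar and y-bar) is an assumption and may differ from the notation in the cited textbook.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```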

The above function is called the Pearson product-moment correlation. To see whether two variables are
correlated, we can look at a scatter plot, as shown below:


The correlation coefficient r can range from +1.0 (perfect positive correlation) to -1.0 (perfect negative
correlation). If r = 0, there is no linear correlation between the x and y variables. If all points on the
scatter plot fall on a straight line, the correlation is perfect. The stronger the linear relationship between
the two variables, the further the correlation is from 0.0. The sign of the correlation coefficient reflects
the direction of the relationship (David Groebner, 2017).
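The range and sign of r can be checked with a short sketch. The function below computes the Pearson coefficient in plain Python; the toy area and price values are hypothetical and chosen so that one pair rises exactly linearly and the other falls exactly linearly.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # A constant sequence has zero variance and leaves r undefined.
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data, not taken from the dataset.
area = [1000, 1500, 2000, 2500, 3000]
price_up = [200, 300, 400, 500, 600]    # rises linearly with area
price_down = [600, 500, 400, 300, 200]  # falls linearly with area

r_up = pearson_r(area, price_up)
r_down = pearson_r(area, price_down)
print(r_up, r_down)
```

The first pair gives a coefficient at the positive end of the range and the second at the negative end, which is the behavior described above.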


3.2 Linear regression


The basic equation of simple linear regression is given in (David Groebner, 2017). In equation 1, y is the
dependent variable (the result) and x is the independent variable; the relationship is shown as follows:
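In its standard form, simple linear regression models y = b0 + b1 * x + e, where b0 is the intercept, b1 is the slope, and e is the error term. The following is a minimal sketch of an ordinary least squares fit of that line; the symbol names and the toy data are assumptions for illustration, not values from the dataset.

```python
def fit_simple_regression(xs, ys):
    """Least squares estimates (intercept b0, slope b1) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical toy data generated from price = 50000 + 200 * sqft.
sqft = [1000, 1500, 2000, 2500]
price = [250000, 350000, 450000, 550000]

b0, b1 = fit_simple_regression(sqft, price)
predicted = b0 + b1 * 1800  # predicted price for an 1800 sqft house
print(b0, b1, predicted)
```

Because the toy data lie exactly on a line, the fit recovers the intercept and slope used to generate them; on real data the residuals would not be zero.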

3.3 Multiple regression


In this project, I use multiple regression to predict the house price from several features of the dataset.
In multiple regression, this is the formula (David Groebner, 2017):
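For reference, the standard form of the multiple regression model with k independent variables is written out here. The symbols (coefficients beta_0 to beta_k, predictors x_1 to x_k, error term epsilon) are assumed notation and may differ from the cited textbook.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \varepsilon
```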


3.4 R-squared and Adjusted R-squared

The coefficient of determination R² or the adjusted R², which indicate how much of the variation in the
response is explained by the model, may be the most frequently used statistics in regression to assess the
goodness of fit of a model (Akossou and Palm, 2013).


Adjusted R-squared measures the proportion of variance explained by the model while penalizing independent
variables that do not meaningfully help explain the dependent variable. R-squared never decreases when an
independent variable is added, whereas adjusted R-squared increases only if the new variable improves the model
more than would be expected by chance.
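The difference between the two statistics can be shown with a minimal sketch, assuming the usual definitions R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1), where n is the number of samples and k the number of predictors. The actual and predicted values are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Adjusted R-squared, which penalizes the number of predictors."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical values, not taken from the dataset.
actual = [3, 5, 7, 9, 11]
predicted = [2.8, 5.3, 6.9, 9.2, 10.8]

r2 = r_squared(actual, predicted)
adj_one = adjusted_r_squared(r2, n_samples=5, n_predictors=1)
adj_two = adjusted_r_squared(r2, n_samples=5, n_predictors=2)
print(r2, adj_one, adj_two)
```

With the same R², the adjusted value is lower when two predictors are assumed than when one is, which is the penalty described above.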

3.5 Model accuracy

The average absolute difference between the actual values and the predicted values is called the MAE (Mean
Absolute Error). The absolute error, also known as L1 loss, is a row-level error calculation that gives the
non-negative difference between prediction and reality. We can evaluate the model's performance on the entire
dataset with the MAE, which is the average of these errors.

The mean squared difference between the actual values and the predicted values is called the MSE (Mean
Squared Error). The squared error, sometimes referred to as L2 loss, is a row-level error calculation that
squares the difference between the predicted and the actual value. We can evaluate the model's performance on
the entire dataset with the MSE, which is the average of these errors.


Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure
the distance between the data points and the regression line, and RMSE measures how spread out these residuals
are.
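The three metrics in this section can be sketched in a few lines of plain Python. The actual and predicted values are hypothetical; in practice, a library such as Scikit-learn provides equivalent functions.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average of the absolute (L1) errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average of the squared (L2) errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the MSE."""
    return math.sqrt(mse(actual, predicted))


# Hypothetical values, not taken from the dataset.
actual = [200, 300, 400, 500]
predicted = [210, 290, 430, 500]

print(mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```

The single large error of 30 raises the RMSE above the MAE because squaring weights large errors more heavily than small ones.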

4 Simulating scenarios and Results


4.1 Package installation
Step 1: Install the basic packages for this work

Three packages need to be installed for data exploration, analysis, and the application: Pandas, Numpy and
Streamlit. We can install them using pip or Anaconda:

Using pip:

Using Anaconda:

Pandas version: 1.5.3

Numpy version: 1.24

Streamlit version: 1.11.1

Step 2: Install packages for data visualization

There are two packages I would use for visualization: seaborn and matplotlib

Use pip:

Use Anaconda:


Matplotlib version: 3.7.1

Seaborn version: 0.12.2

Step 3: Install packages for modeling

I am using the Scikit-learn package for modeling, and this package requires:

- Python (>= 3.8)

- NumPy (>= 1.24)

- SciPy (>= 1.4.1)

- joblib (>= 0.11)

Install Scikit-learn by

Using pip:

Using Anaconda:

Scikit-learn version: 1.1.0

The second package is Statsmodels

Step 4: Install package for map

I am using Folium package for showing map:

Using pip:

Using Anaconda:

Folium version: 0.14.0

After installing all of the required packages for this work, I import all of them in Jupyter Notebook, except
the Streamlit package.
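Before importing, it can help to confirm which versions are actually installed. The following is a minimal sketch using only the Python standard library; the package names come from the installation steps above, and the helper name installed_versions is hypothetical.

```python
from importlib import metadata

# Distribution names taken from the installation steps in this section.
PACKAGES = [
    "pandas", "numpy", "streamlit", "matplotlib",
    "seaborn", "scikit-learn", "statsmodels", "folium",
]


def installed_versions(names):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

The printed versions can then be compared with the versions listed in each step.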


4.2 Correlation
I will explore the correlations in the dataset. I use a heatmap to show them:


From the heatmap, I select the pairs with a high correlation, that is, a correlation coefficient above 0.5:

- price and sqft_above: 0.605567

- price and sqft_living: 0.702035

- bathrooms and sqft_living: 0.754665

- sqft_above and sqft_living: 0.876597
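The selection rule can be sketched as a simple filter over the reported coefficients. The four values below are copied from the list above; the helper name strong_pairs is hypothetical. The sketch also squares each coefficient, because for a regression with a single predictor R-squared equals the square of r.

```python
# Pearson correlations reported from the heatmap in this section.
correlations = {
    ("price", "sqft_above"): 0.605567,
    ("price", "sqft_living"): 0.702035,
    ("bathrooms", "sqft_living"): 0.754665,
    ("sqft_above", "sqft_living"): 0.876597,
}
THRESHOLD = 0.5


def strong_pairs(corrs, threshold):
    """Return {pair: (r, r squared)} for pairs whose |r| exceeds the threshold."""
    return {
        pair: (r, r * r)
        for pair, r in corrs.items()
        if abs(r) > threshold
    }


for (a, b), (r, r2) in strong_pairs(correlations, THRESHOLD).items():
    print(f"{a} vs {b}: r = {r:.6f}, r^2 = {r2:.3f}")
```

The squares for the first three pairs round to 0.367, 0.493 and 0.570, which agree with the R-square values reported for Scenarios 2, 1 and 3 in Section 4.3. The fourth square is about 0.768, which differs from the 0.582 reported for Scenario 4, so that figure may be worth rechecking.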

4.3 Scenarios and analysis


Scenario 1: How does the size of the housing area affect the house price?


With only sqft_living, R-squared = 0.493, which means that sqft_living explains 49.3% of the variation in house
prices. The figure below shows how strong the relationship between sqft_living and house prices is.

For sqft_living from 0 to about 6000, house prices stay in an average range. From nearly 8000 and above, house
prices increase dramatically and rarely decrease. House prices peak near 8000000 when sqft_living reaches over
12000. We can see that sqft_living has a strong influence on the price of the house.

The formula of this model:


Scenario 2: How does the size of the house affect the house price?

With only sqft_above, R-squared = 0.367, which means that sqft_above explains 36.7% of the variation in house
prices. The figure below shows how strong the relationship between sqft_above and house prices is.

For sqft_above from 0 to about 5000, house prices stay in an average range, although they increase more often
than they decrease. For sqft_above above 6000, house prices increase steadily and rarely decrease. House prices
peak near 8000000 when sqft_above reaches over 8300. We can see that sqft_above has a weaker effect on the
price of the house than sqft_living.

The formula of this model:

Scenario 3: How does the area of the house affect the number of bathrooms?

With only sqft_living, R-squared = 0.570, which means that sqft_living explains 57% of the variation in the
number of bathrooms. The figure below shows how strong the relationship between sqft_living and the number of
bathrooms is.


For sqft_living from 0 to about 6000, the number of bathrooms ranges from 0 to 6 on average, but within this
range some houses have more bathrooms than average and some have 0 bathrooms. For sqft_living above 8000, the
number of bathrooms is 4 or more and rarely falls below 4. We can see that sqft_living has a strong influence
on the number of bathrooms in the house.

The formula of this model:

Scenario 4: How does the area of the house campus affect the area of the house?

With only sqft_living, R-squared = 0.582, which means that sqft_living explains 58.2% of the variation in
sqft_above. The figure below shows how strong the relationship between sqft_living and sqft_above is.


For sqft_living from 0 to about 8000, sqft_above increases strongly on average. However, when sqft_living is
more than 8000, sqft_above barely increases and in some cases decreases slightly. We can see that sqft_living
from 0 to about 8000 has a strong influence on sqft_above.

The formula of this model:

5 Conclusions and future works


5.1 Conclusions
The living area of the house (sqft_living) explains 49.3% of the variation in house price, and the size of the
house (sqft_above) explains 36.7%. These are fairly high values for models that use only one feature to predict
the price.

In addition, the living area (sqft_living) explains 57% of the variation in the number of bathrooms, which is
also quite high.

Finally, the living area (sqft_living) explains 58.2% of the variation in the area of the house (sqft_above),
which is high.

5.2 Future works


With more time, information, and budget, I would predict house prices based on more characteristics, such as
the proximity of homes to hospitals, schools, parks, or supermarkets. Another development I would like to
pursue is to build models with other algorithms and compare them with Multiple Regression. Furthermore, I would
develop a better UI/UX for the Streamlit app to make it more user friendly.


REFERENCES
ASHISH. 2023. Housing Price Prediction (Linear Regression) [Online]. Available:
https://www.kaggle.com/code/ashydv/housing-price-prediction-linear-regression/notebook [Accessed
1/3/2023]

MANASA, J., GUPTA, R. & NARAHARI, N. S. 2020. Machine Learning based Predicting House Prices using Regression
Techniques [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9074952 [Accessed 1/3/2023]

FROST, J. 2023. How to Interpret Adjusted R-Squared and Predicted R-Squared in Regression Analysis [Online].
Available: https://statisticsbyjim.com/regression/interpret-adjusted-r-squared-predicted-r-squared-regression/
[Accessed 1/3/2023]

FROST, J. 2023. How To Interpret R-squared in Regression Analysis [Online]. Available:
https://statisticsbyjim.com/regression/interpret-r-squared-regression/ [Accessed 1/3/2023]

CHICCO, D., WARRENS, M. J. & JURMAN, G. 2021. The coefficient of determination R-squared is more
informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation [Online]. Available:
https://pubmed.ncbi.nlm.nih.gov/34307865/ [Accessed 1/3/2023]

HAUKE, J. & KOSSOWSKI, T. 2011. Comparison of Values of Pearson's and Spearman's Correlation Coefficients
on the Same Sets of Data [Online]. Available: https://ideas.repec.org/a/vrs/quageo/v30y2011i2p87-93n9.html
[Accessed 1/3/2023]

VARMA, A., SARMA, A., DOSHI, S. & NAIR, R. 2018. House price prediction using machine learning and neural
networks [Online]. Available: https://ieeexplore.ieee.org/document/8473231 [Accessed 1/3/2023]
