1641 gcs2007 08 Nguyen Xuan Nam Assignment 1
Student declaration
I certify that the assignment submission is entirely my own work and I fully understand the consequences of plagiarism. I understand that
making a false declaration is a form of malpractice.
Grading grid
P1 P2 M1 M2 D1 D2
ASSIGNMENT 1 BRIEF
Qualification BTEC Level 5 HND Diploma in Computing
Submission Format:
Format: The submission is in the form of an individual written report that shows how you have managed
the project. This should be written in a concise, formal business style using single spacing
and font size 12. You are required to make use of headings, paragraphs and subsections as
appropriate, and all work must be supported with research and referenced using the
Harvard referencing system. Please also provide a bibliography using the Harvard
referencing system.
Submission: Students must submit the assignment by the due date and in the way requested by
the Tutors. The form of submission will be a soft copy in PDF posted on the corresponding
course of http://cms.greenwich.edu.vn/
Note: The Assignment must be your own work, and not copied by or from another student or from
books etc. If you use ideas, quotes or data (such as diagrams) from books, journals or other sources, you
must reference your sources, using the Harvard style. Make sure that you know how to reference
properly, and that you understand the guidelines on plagiarism. If you do not, you will definitely fail.
Assignment Brief and Guidance:
Your company has been working in [Assumed Domain] for 2 years. For a new, young company,
the competition in the market is very high. Therefore, the Board of Directors has decided to apply
Business Intelligence to improve the company's business processes by making better decisions.
The Board of Directors assigns a small group, including you, in the Research & Development Department
to study business intelligence and how to apply it in the company in the coming years.
You need to research the business processes and decision support processes in the company and
identify the types of data (unstructured, semi-structured or structured) generated by these
processes, with examples. You also need to research the software currently used in the business
processes or decision support processes and evaluate its use (benefits and drawbacks).
Next, you need to understand the types of support for decision-making at different levels
(operational, tactical and strategic) within the company and study which business intelligence
features can help with those types of support. Study the information systems or technologies (of BI) that can
be used in this case, and compare and contrast them to conclude which should be used.
Your group needs to present the research results to the board in a presentation of 30 minutes.
LO2 Compare the tools and technologies associated with business intelligence functionality
P2 Compare the types of support available for business decision-making at varying levels within an organisation.
M2 Justify, with specific examples, the key features of business intelligence functionality.
D2 Compare and contrast a range of information systems and technologies that can be used to support organisations at operational, tactical and strategic levels.
Table of Contents
1. Introduction
ASSIGNMENT 1 ANSWERS
Data scientists have built many machine learning projects for price prediction. In
machine learning, we can predict a value for new data based on features that we already have. One of the
most widely used models for predictive analysis is regression. The purpose of such a model is to predict future
results, and it has been applied in many fields such as economics, business, banking, healthcare,
e-commerce, entertainment, and sports. Regression is therefore a popular technique for building a
price prediction model from a set of features.
1.2 Motivations
Real estate is one of the least transparent sectors of our economy. Real estate prices fluctuate daily, and sometimes
prices are inflated rather than based on valuations. When people decide to buy a home, they look for one that is
affordable and meets all their requirements. With machine learning, we can predict house prices and
decide whether a house is worth buying, or whether it can be sold for a higher price. In this report, we will forecast home
prices in King County, USA. Features of the home such as its size, location, and square footage can be key
factors in determining the price.
1.3 Objectives
In this work, there are several important questions that I focus on:
– How does the size of the house affect the house price?
– How does the size of the housing area affect the house price?
– How does the area of the house campus affect the area of the house?
To answer the questions in the first chapter, I will show the dataset. There are several steps to get information
from raw data. These steps are shown in Figure 1 below, namely data collection
In the first chapter, I introduced my work and outlined the goals of the project. The rest of this work includes
showcasing my dataset, methods, and results, as well as a demo of the application.
In a further study, the authors divide the characteristics affecting housing prices into three categories:
structural condition, concept, and location. Structural (physical) features are those characteristics of the house that can
be seen with the human eye, such as the size of the house, the number of bedrooms, the presence of a kitchen and
garage, the presence of a garden, the size of the plot and other structures, and the age of the house. Conceptual
features, on the other hand, are concepts offered by developers to attract buyers, such as a
minimalist home, a healthy and environmentally friendly home, or an elite environment. Research has shown that these
characteristics are significantly correlated with real estate prices.
In summary, there are many studies on predicting house prices using different machine learning methods or
models. In my project, I will use simple linear regression and multiple regression for model building and forecasting.
I will make use of all the features in this dataset to build a good model.
2.2 Dataset
2.2.1 Data collection
I obtained the data from Kaggle. The dataset contains house sale prices for 2014-2015. The raw dataset contains over
21,000 entries and 21 columns. In this dataset, the price column is the dependent variable and the remaining
columns, except ID and Date, are the independent variables. Figure 3 shows the first rows of the raw dataset:
the target of this study, price, is a continuous dependent variable, and the other variables are the
independent variables.
Data cleaning is one of the most important steps before exploration, analysis, and modeling in machine learning.
The purpose of data cleaning is to deal with abnormal data such as missing data, outliers, unwanted data, or
inconsistent data. There are many ways to clean data, for example deleting data, replacing data, or changing
the data type of a value. Before cleaning, first examine the raw dataset to see what to do next:
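The examination step can be sketched as follows. This is only a minimal illustration, assuming the raw data has been loaded into a pandas DataFrame; the helper name summarize_raw_data and the small sample rows are made up for the example and are not taken from the actual dataset.

```python
import pandas as pd


def summarize_raw_data(df: pd.DataFrame) -> dict:
    """Collect basic facts about a raw dataset that guide cleaning decisions."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        # Number of missing values in each column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Rows that are exact copies of an earlier row.
        "n_duplicate_rows": int(df.duplicated().sum()),
        # Data type of each column, as text.
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


# Tiny made-up sample standing in for the raw dataset (hypothetical values).
sample = pd.DataFrame(
    {
        "price": [221900.0, 538000.0, None, 538000.0],
        "bedrooms": [3, 3, 2, 3],
        "sqft_living": [1180, 2570, 770, 2570],
    }
)

summary = summarize_raw_data(sample)

# Two of the cleaning options named above: delete duplicate rows,
# then delete rows that still contain missing values.
cleaned = sample.drop_duplicates().dropna()
```

Whether rows with missing values should be deleted or replaced depends on how many there are, which is why the summary is produced before any cleaning is applied.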
For data processing, there are many things to do with the raw dataset:
b. Change the datatype and create more columns for date column
In the raw dataset, this column has the datatype of int, so I change it into float.
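The two conversions described in this step can be sketched as follows. This is a minimal sketch under stated assumptions: the raw date column is assumed to hold strings such as 20141013T000000, the new column names year and month are illustrative, and sqft_living is used for the integer-to-float cast only as a hypothetical example, since the column is not named here.

```python
import pandas as pd

# Made-up rows; the "date" strings follow the assumed raw format.
raw = pd.DataFrame(
    {
        "date": ["20141013T000000", "20150225T000000"],
        "sqft_living": [1180, 770],
    }
)


def add_date_parts(df: pd.DataFrame) -> pd.DataFrame:
    """Parse the date column and derive year and month columns from it."""
    out = df.copy()
    # Convert the text column to a proper datetime type.
    out["date"] = pd.to_datetime(out["date"], format="%Y%m%dT%H%M%S")
    # Create extra columns from the parsed date.
    out["year"] = out["date"].dt.year
    out["month"] = out["date"].dt.month
    return out


prepared = add_date_parts(raw)

# Cast an integer column to float (the column choice is hypothetical).
prepared["sqft_living"] = prepared["sqft_living"].astype(float)
```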
2.3 Summary
In this chapter, I cleaned the raw dataset into a better dataset that is easy to explore and analyze.
I believe this is the most important step before doing any prediction or modeling in machine learning. In the
next chapter, I begin to build and explain my model, and I show some visualizations for better analysis.
3 Proposed model
3.1 Correlation
Correlation measures the strength and direction of the linear relationship between two variables (Hauke and Kossowski, 2011). The
correlation coefficient formula follows (David Groebner, 2017).
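In its standard sample form, stated here from general statistics, the coefficient for n paired observations (x_i, y_i) with sample means x̄ and ȳ can be written as:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```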
The above function is called the Pearson product-moment correlation. To see whether two variables are correlated,
we can look at a scatter plot, as shown below:
The correlation coefficient, r, can range from +1.0 (perfect positive correlation) to -1.0 (perfect
negative correlation). If r = 0, there is no linear correlation between the x and y variables. If all points
on the scatter plot fall on a straight line, the correlation is perfect. As a result, the stronger the linear link
between the two variables, the more the correlation differs from 0.0. The sign of the correlation coefficient reflects
the direction of the relationship (David Groebner, 2017).
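The behaviour of r at its extremes can be checked with a small sketch. The function below is a plain Python illustration of the Pearson coefficient, not the code used in this report; the name pearson_r and the sample values are assumptions made for the example.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


x = [1.0, 2.0, 3.0, 4.0]
perfect_positive = pearson_r(x, [2.0, 4.0, 6.0, 8.0])  # points on a rising line
perfect_negative = pearson_r(x, [8.0, 6.0, 4.0, 2.0])  # points on a falling line
no_linear_link = pearson_r(x, [1.0, -1.0, -1.0, 1.0])  # symmetric pattern, no trend
```

The third case gives r = 0 even though y clearly depends on x, which illustrates that r only captures the linear part of a relationship.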
3.4 R-squared and Adjusted R-squared
The coefficient of determination R² and its adjusted form, adjusted R², which indicate how much of the variation in the response is
explained by the model, may be the most frequently used statistics in regression for assessing the goodness of fit of
a model (Akossou and Palm, 2013).
Adjusted R-squared measures the proportion of variance explained by the model while penalising the number of independent
variables. It therefore only increases when an added independent variable improves the model more than would be expected by chance.
Plain R-squared, by contrast, never decreases when another independent variable is added, even if that variable has no real effect on the dependent variable.
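The difference between the two statistics can be illustrated with a short sketch. It uses the usual definitions R² = 1 - SS_res / SS_tot and adjusted R² = 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of independent variables; the function names and sample numbers are made up for the example.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalise R-squared for the number of independent variables."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Made-up actual values and model predictions.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.8, 5.1, 7.2, 8.9, 11.0]

r2 = r_squared(actual, predicted)
# Same fit quality, attributed to one predictor and then to two predictors.
adj_one = adjusted_r_squared(r2, n_samples=5, n_predictors=1)
adj_two = adjusted_r_squared(r2, n_samples=5, n_predictors=2)
```

With the same R², the adjusted value drops as the number of predictors grows, so an extra variable has to improve the fit enough to offset the penalty.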
The average absolute error between the actual value and the predicted value is called MAE (Mean Absolute
Error). L1 loss, also known as absolute error, is a row-level error calculation that determines the non-negative
difference between prediction and reality. We can better evaluate the model's performance on the entire data
set by computing the MAE, which is the average of these errors.
The mean squared difference between the actual value and the predicted value is called the MSE (Mean
Squared Error). In the row-level error calculation, the squared difference between the predicted and the actual value
is called the squared error, sometimes referred to as the L2 loss. We can better evaluate the model's
performance on the entire data set by computing the MSE, which is the average of these errors.
Root Mean Square Error (RMSE) is the square root of the MSE and can be read as the standard deviation of the residuals (prediction errors). The distance
between the data points and the regression line is measured by the residuals, and the spread of these residuals
is measured by RMSE.
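The three error metrics can be compared side by side in a short sketch. This is a plain Python illustration with made-up values, not the evaluation code of this report; in practice a library such as Scikit-learn provides equivalent functions.

```python
import math


def mae(actual, predicted):
    """Mean Absolute Error: average of the row-level absolute errors (L1 loss)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean Squared Error: average of the row-level squared errors (L2 loss)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the MSE, in the units of the target."""
    return math.sqrt(mse(actual, predicted))


# Made-up actual and predicted values (for example, prices in thousands).
actual = [200.0, 300.0, 400.0, 500.0]
predicted = [210.0, 290.0, 430.0, 500.0]

mae_value = mae(actual, predicted)    # absolute errors are 10, 10, 30, 0
mse_value = mse(actual, predicted)    # squared errors are 100, 100, 900, 0
rmse_value = rmse(actual, predicted)
```

Because the errors are squared before averaging, the single large error of 30 weighs more heavily in MSE and RMSE than in MAE, so RMSE comes out larger than MAE here.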
There are three packages that need to be installed for data discovery, analysis, and application:
Using pip:
Using Anaconda:
There are two packages I would use for visualization: seaborn and matplotlib
Use pip:
Use Anaconda:
Step 3: Install packages for modeling
I am using the Scikit-learn package for modeling, and this package requires:
Install Scikit-learn by
Using pip:
Using Anaconda:
Using pip:
Using Anaconda:
After installing all of the required packages for this work, I will import all of them into a Jupyter notebook.
4.2 Correlation
I will explore the correlations in the dataset, using a heatmap to visualise them.
From the heatmap, I collected the pairs with a high correlation, that is, a correlation score above 0.5:
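Collecting the pairs above the threshold can also be done programmatically from the correlation matrix that a heatmap displays. The sketch below is an illustration only: the helper name high_correlation_pairs and the sample values are assumptions, and the heatmap itself is not drawn here.

```python
import pandas as pd


def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.5):
    """List column pairs whose Pearson correlation is above `threshold`."""
    corr = df.corr()  # pairwise Pearson correlation matrix
    columns = list(corr.columns)
    pairs = []
    for i, first in enumerate(columns):
        # Only look at columns after `first` so each pair appears once.
        for second in columns[i + 1:]:
            score = float(corr.loc[first, second])
            if score > threshold:
                pairs.append((first, second, round(score, 3)))
    # Strongest pairs first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Made-up sample rows; the values are hypothetical.
sample = pd.DataFrame(
    {
        "price": [200, 300, 400, 500, 650],
        "sqft_living": [1000, 1500, 2100, 2400, 3000],
        "yr_built": [1990, 1950, 2005, 1960, 1980],
    }
)

pairs = high_correlation_pairs(sample)
```

The comparison keeps only positive scores above 0.5, matching the criterion stated above; strongly negative pairs would need a comparison on the absolute value instead.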
Using only sqft_living as the predictor, R-squared = 0.493, which means that sqft_living explains 49.3% of the variation in house prices. The figure
below shows how strong the relationship between sqft_living and house price is.
For sqft_living between 0 and 6000, house prices stay around the average, but from nearly 8000 upwards, house prices
rise sharply and rarely fall. House prices peak near 8,000,000 when sqft_living exceeds
12,000. We can see that sqft_living has a strong influence on the price of the house.
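A single-predictor model of this kind can be sketched without any libraries using the closed-form least-squares solution. This is an illustrative sketch with made-up numbers, not the Scikit-learn model used in this report, and it will not reproduce the R-squared value quoted above.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares slope and intercept for the model y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Made-up living areas (square feet) and prices; the values are hypothetical.
sqft_living = [1000.0, 1500.0, 2000.0, 2500.0, 3000.0]
price = [250000.0, 340000.0, 480000.0, 520000.0, 660000.0]

slope, intercept = fit_simple_linear_regression(sqft_living, price)
predicted = [intercept + slope * area for area in sqft_living]
r2 = r_squared(price, predicted)
```

For a model with one predictor, R-squared equals the square of the Pearson correlation between the predictor and the target, so an R-squared of 0.493 corresponds to a correlation of roughly 0.70.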
Scenario 2: How does the size of the house affect the house price?
Using only sqft_above as the predictor, R-squared = 0.367, which means that sqft_above explains 36.7% of the variation in house prices. The figure
below shows how strong the relationship between sqft_above and house price is.
For sqft_above between 0 and about 5000, house prices stay around the average, although they rise more often
than they fall. For sqft_above above 6000, house prices rise steadily and rarely
fall. House prices peak near 8,000,000 when sqft_above exceeds 8300. We can see that
sqft_above has a weaker influence on the price of the house than sqft_living.
Scenario 3: How does the living area of the house affect the number of bathrooms?
Using only sqft_living as the predictor, R-squared = 0.570, which means that sqft_living explains 57% of the variation in the number of bathrooms. The figure
below shows how strong the relationship between sqft_living and the number of bathrooms is.
For sqft_living between 0 and about 6000, the number of bathrooms typically ranges from 0 to 6, but there are
some cases in this range where the number of bathrooms is above
average as sqft_living increases, and some houses with 0 bathrooms. For sqft_living above 8000, the number of bathrooms
is usually 4 or more and rarely falls below 4. We can see that sqft_living has a strong
influence on the number of bathrooms in the house.
Scenario 4: How does the area of the house campus affect the area of the house?
Using only sqft_living as the predictor, R-squared = 0.582, which means that sqft_living explains 58.2% of the variation in sqft_above. The figure
below shows how strong the relationship between sqft_living and sqft_above is.
For sqft_living between 0 and about 8000, sqft_above increases strongly on average. However, when
sqft_living is above 8000, sqft_above hardly increases any further and in some cases decreases
slightly. It can be seen that sqft_living between 0 and about 8000 has a strong influence on sqft_above.
In addition, the living area (sqft_living) explains 57% of the variation in the number of bathrooms of a house, which is
also quite high.
Finally, the size of the house (sqft_living) explains 58.2% of the variation in the area of the house (sqft_above),
which is high.
REFERENCES
ASHISH. 2023. Housing Price Prediction (Linear Regression) [Online]. Available: https://www.kaggle.com/code/ashydv/housing-price-prediction-linear-regression/notebook [Accessed 1/3/2023]
MANASA, J., GUPTA, R. & NARAHARI, N. S. 2020. Machine Learning based Predicting House Prices using Regression Techniques [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9074952 [Accessed 1/3/2023]
FROST, J. 2023. How to Interpret Adjusted R-Squared and Predicted R-Squared in Regression Analysis [Online]. Available: https://statisticsbyjim.com/regression/interpret-adjusted-r-squared-predicted-r-squared-regression/ [Accessed 1/3/2023]
FROST, J. 2023. How To Interpret R-squared in Regression Analysis [Online]. Available: https://statisticsbyjim.com/regression/interpret-r-squared-regression/ [Accessed 1/3/2023]
CHICCO, D., WARRENS, M. J. & JURMAN, G. 2021. The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation [Online]. Available: https://pubmed.ncbi.nlm.nih.gov/34307865/ [Accessed 1/3/2023]
HAUKE, J. & KOSSOWSKI, T. 2011. Comparison of Values of Pearson's and Spearman's Correlation Coefficients on the Same Sets of Data [Online]. Available: https://ideas.repec.org/a/vrs/quageo/v30y2011i2p87-93n9.html [Accessed 1/3/2023]
VARMA, A., SARMA, A., DOSHI, S. & NAIR, R. 2018. House price prediction using machine learning and neural networks [Online]. Available: https://ieeexplore.ieee.org/document/8473231 [Accessed 1/3/2023]