Statistics Spreadsheet Intermediate


STATISTICS & SPREADSHEET
A journey from data preparation and cleaning to Exploratory Data
Analysis (EDA), with statistics on a property listing dataset for
Kuala Lumpur, Malaysia

LILIEK DARMAWAN TH
TABLE OF CONTENTS
Have a quick look at what you'll get inside!

DATA OVERVIEW
A short overview of the dataset.

DATA CLEANING & OUTLIER REMOVAL OVERVIEW
This section justifies the amount of data removed, explains the method
used to impute missing values, and presents the data distribution and
statistical measurements after the cleaning process. A comparison of the
initial and final results is shown on the last page of the section.

EDA & INSIGHTS
Two cases are used to perform the Exploratory Data Analysis (EDA) and
give some possible insights. The first is an overview of the top 3
property listings based on AVG price, and the second is price prediction
using linear regression, covered in the next section.

LINEAR REGRESSION
Further analysis of price prediction using the linear regression method.
DATA OVERVIEW
Dataset about property listings in Kuala Lumpur

The dataset has 5000 property listings in Kuala Lumpur and contains
information about:
Specified location in Kuala Lumpur
Price of property
No. of rooms (bedroom, bathroom, maid room, car park)
Area and type of building (Built-up / Land area)
Property type
Furnishing status

DATA CLEANING
The dataset consists of 5000 property listings; however, some records
lack the information needed for analysis. Before conducting further
analysis, we need to clean and prepare the dataset, including:
Removal of missing and irrelevant values
Handling of missing values
Repair and conversion to proper data types
Removal of duplicate values
DATA REMOVAL RELIABILITY
TOLERANCE = 10%

Removal step                               Rows removed
Removal of data with empty price           36
Removal of data with empty area            100
Removal of duplicates                      164
Removal of outlier 1 (price, area)         92
Removal of data with empty room            76

Removal = 468 / 5000 = 9.36%
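
The removal check above can be reproduced with a few lines of Python. This is a minimal sketch that only re-adds the reported counts and compares the share against the 10% tolerance; the variable names are illustrative and not part of the original spreadsheet.

```python
# Rows removed at each cleaning step, as reported on the slide.
removed = {
    "empty price": 36,
    "empty area": 100,
    "duplicates": 164,
    "outlier 1 (price, area)": 92,
    "empty room": 76,
}
TOTAL_ROWS = 5000
TOLERANCE = 0.10  # maximum share of rows that may be removed

total_removed = sum(removed.values())
removal_share = total_removed / TOTAL_ROWS
within_tolerance = removal_share <= TOLERANCE

print(f"Removed {total_removed} of {TOTAL_ROWS} rows ({removal_share:.2%})")
print("Within tolerance:", within_tolerance)
```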


IMPUTATION OF MISSING VALUE
SIMPLE LINEAR REGRESSION (TRENDLINE METHOD)

DATA SELECTION:
Since the missing values occur in 'Area' for various types of
condominium in Cheras, the sample can be limited to these conditions:
City: Cheras
Type: Condominium
Some outliers removed

AREA = (1.27 * PRICE / 1000 + 453)
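
As a rough illustration, the trendline can be applied as a small function. This sketch assumes PRICE is in RM and AREA is in sq. ft. (the units used elsewhere in the deck); the function name and the sample price are hypothetical.

```python
def impute_area(price_rm: float) -> float:
    """Estimate a missing area (sq. ft., assumed) from price (RM) via the trendline."""
    return 1.27 * price_rm / 1000 + 453

# Hypothetical Cheras condominium listed at RM 500,000 (illustrative value only).
estimated_area = impute_area(500_000)
print(f"Estimated area: {estimated_area:.0f} sq. ft.")
```

The trendline was fitted on Cheras condominiums only, so applying it to other locations or property types would be an extrapolation.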
PRICE DISTRIBUTION

Statistic                 Price
Mean                      RM 1,889,018
Standard Error            RM 26,225
Median                    RM 1,282,500
Mode                      RM 1,200,000
Standard Deviation        RM 1,810,839
Sample Variance           3279138840205
Coef. of Variation        95.86%
Kurtosis                  7.013766036
Skewness                  2.373218475
Range                     RM 11,998,850
Minimum                   RM 1,150
Maximum                   RM 12,000,000
Sum                       RM 9,006,835,892
Count                     4768

Percentile                Price
10                        RM 480,000
20                        RM 628,000
30                        RM 800,000
40                        RM 1,038,000
50                        RM 1,285,000
60                        RM 1,600,000
70                        RM 2,050,000
80                        RM 2,750,000
90                        RM 4,000,000

25                        RM 700,000
50                        RM 1,285,000
75                        RM 2,400,000

IQR                       RM 1,700,000
Upper Fence               RM 4,950,000
Lower Fence               -RM 1,850,000

Price Distribution is skewed positively.
Prices > RM 4,950,000 (Upper Fence) are considered outliers but may not
be removed (to limit the amount of data removed).
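
The fences follow the usual 1.5 * IQR rule. A minimal sketch, using the 25th and 75th percentiles reported for price; the function name is illustrative:

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) using the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# 25th and 75th price percentiles from the table (RM).
price_lower, price_upper = tukey_fences(700_000, 2_400_000)
print(f"Lower fence: RM {price_lower:,.0f}")
print(f"Upper fence: RM {price_upper:,.0f}")
```

A negative lower fence simply means that no low-price outliers are flagged by this rule.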
AREA DISTRIBUTION

Statistic                 Area
Mean                      2,235 sq. ft.
Standard Error            26 sq. ft.
Median                    1,600 sq. ft.
Mode                      1,650 sq. ft.
Standard Deviation        1,824 sq. ft.
Sample Variance           3327596.845
Coef. of Variation        81.62%
Kurtosis                  5.682683647
Skewness                  2.244273777
Range                     11,670 sq. ft.
Minimum                   130 sq. ft.
Maximum                   11,800 sq. ft.
Sum                       10,655,679 sq. ft.
Count                     4768

Percentile                Area
10                        840 sq. ft.
20                        1,003 sq. ft.
30                        1,170 sq. ft.
40                        1,389 sq. ft.
50                        1,600 sq. ft.
60                        1,870 sq. ft.
70                        2,357 sq. ft.
80                        3,200 sq. ft.
90                        4,284 sq. ft.

25                        1,080 sq. ft.
50                        1,600 sq. ft.
75                        2,702 sq. ft.

IQR                       1,622 sq. ft.
Upper Fence               5,135 sq. ft.
Lower Fence               -1,353 sq. ft.

Area Distribution is skewed positively.
Areas > 5,135 sq. ft. (Upper Fence) are considered outliers but may not
be removed (to limit the amount of data removed).
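
The standard error and coefficient of variation in these tables can be derived from the mean, standard deviation and count. The sketch below uses the rounded area figures shown on the slide, so small differences in the last digit against the spreadsheet output are expected; the function names are illustrative.

```python
from math import sqrt

def standard_error(std_dev, count):
    """Standard error of the mean: s / sqrt(n)."""
    return std_dev / sqrt(count)

def coefficient_of_variation(std_dev, mean):
    """Relative spread: s / mean."""
    return std_dev / mean

# Rounded area statistics from the table (sq. ft.).
area_se = standard_error(1824, 4768)
area_cv = coefficient_of_variation(1824, 2235)
print(f"Standard error: {area_se:.1f} sq. ft.")
print(f"Coefficient of variation: {area_cv:.2%}")
```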
PRICE
Skewness     7.26 (before cleaning)     2.37 (after cleaning)
Median       RM 1,282,500
Mean         RM 1,889,018
Mode         RM 1,200,000
Kurtosis     7.01

AREA
Skewness     59.07 (before cleaning)    2.24 (after cleaning)
Median       1,600 sq. ft.
Mean         2,235 sq. ft.
Mode         1,650 sq. ft.
Kurtosis     5.68
EDA & INSIGHTS
We take 2 cases to perform the EDA.

TOP 3 PROPERTY LISTINGS
In this case study, we identify the top 3 property listings based on
average price, supported by sufficient statistical measurements and
accompanied by some insights. Please note that this is only one example
of the EDA that can be performed.

PRICE PREDICTION
The next case study involves a correlation matrix and linear regression
to find a prediction model for price using the given variables. However,
we first need to check some conditions to make sure the regression model
fits the actual data, as shown in the following slides.
TOP 3 BASED ON AVG PRICE
BUKIT KIARA
Statistic           Price           Rooms           Maid Room       Bathrooms       Car Parks
Mean                RM 4,947,432    4.75            0.5             5.5             3.5
Standard Error      RM 180,753      0.25            0.2886751346    0.2886751346    0.2886751346
Median              RM 4,988,888    5               0.5             5.5             3.5
Mode                RM 4,988,888    5               1               6               4
Standard Deviation  RM 361,506      0.5             0.5773502692    0.5773502692    0.5773502692
Sample Variance     130686423860    0.25            0.3333333333    0.3333333333    0.3333333333
Kurtosis            1.756096756     4               -6              -6              -6
Skewness            -0.6759949129   -2              0               0               0
Range               RM 877,707      1               1               1               1
Minimum             RM 4,467,122    4               0               5               3
Maximum             RM 5,344,829    5               1               6               4
Sum                 RM 19,789,727   19              2               22              14
Count               4               4               4               4               4


DAMANSARA HEIGHTS
Statistic           Price           Rooms           Maid Room       Bathrooms       Car Parks
Mean                RM 4,509,822    4.210526316     0.7763157895    4.822368421     2.407894737
Standard Error      RM 248,091      0.1471915485    0.04398002535   0.1751616227    0.1849032189
Median              RM 4,000,000    4               1               4               2
Mode                RM 2,100,000    4               1               4               0
Standard Deviation  RM 3,058,675    1.814699285     0.5422221683    2.15953752      2.279639983
Sample Variance     9355490755550   3.293133496     0.2940048797    4.6636023       5.196758452
Kurtosis            -0.3137257113   -0.1693858052   -0.1867490677   0.1882247896    0.327117642
Skewness            0.8040943885    -0.03534967296  -0.1055613426   0.5684633676    0.8686508788
Range               RM 11,670,000   9               2               10              10
Minimum             RM 330,000      0               0               1               0
Maximum             RM 12,000,000   9               2               11              10
Sum                 RM 685,493,007  640             118             733             366
Count               152             152             152             152             152
FEDERAL HILLS
Statistic           Price           Rooms           Maid Room       Bathrooms       Car Parks
Mean                RM 4,257,500    4.5             1               6               0
Standard Error      RM 649,338      0.5             0               0               0
Median              RM 3,950,000    4               1               6               0
Mode                RM 5,950,000    4               1               6               0
Standard Deviation  RM 1,298,676    1               0               0               0
Sample Variance     1686558333333   1               0               0               0
Kurtosis            -1.303375785    4               0               0               0
Skewness            0.8399526679    2               0               0               0
Range               RM 2,770,000    2               0               0               0
Minimum             RM 3,180,000    4               1               6               0
Maximum             RM 5,950,000    6               1               6               0
Sum                 RM 17,030,000   18              4               24              0
Count               4               4               4               4               4


KEY TAKEAWAYS

Location            Avg of Price    # Props   # Rooms   # Maid rooms   # Bathrooms   # Car Parks
Bukit Kiara         RM 4,947,432    4         4-5       0-1            5-6           3-4
Damansara Heights   RM 4,534,464    152       0-9       0-2            1-11          0-10
Federal Hill        RM 4,257,500    4         4-6       1              6             -

Type of Props
Bukit Kiara: 4-sty Terrace/Link House
Damansara Heights: 2, 2.5, 4-sty Terrace/Link House; Bungalow;
Condominium; Semi-detached House; Serviced Residential
Federal Hill: Bungalow, Semi-detached House, Town House

Among the top 3 property listings based on average price, Damansara
Heights offers the most varied choice in type of property, # of rooms,
maid rooms, bathrooms, and car parks, while Bukit Kiara and Federal
Hills are limited to 4 properties each.

Looking at the descriptive statistics, the figures for both Bukit Kiara
and Federal Hills are easily distorted by any new property listed in
these locations, because of the small number of properties in these 2
locations.
STUDY ABOUT PRICE IN MONT KIARA
CORRELATION MATRIX

Correlation with price:
AREA = 0.91
ROOMS = 0.71
BATHROOM = 0.70
MAID ROOM = 0.47
CAR PARKS = 0.21

The first 4 aspects have moderate to strong correlation with price and
could possibly affect the price in the Mont Kiara location, while the
availability of car parks might not affect the price much.

The larger the area, or the more rooms or bathrooms a property has, the
higher its price tends to be in Mont Kiara.
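
For readers who want to reproduce such correlations outside a spreadsheet, the sketch below computes a Pearson correlation coefficient by hand. The five area/price pairs are made-up illustrative values, not rows from the Mont Kiara data, and the function name is hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical listings (illustrative only): area in sq. ft., price in RM.
area = [800, 1200, 1500, 2000, 2600]
price = [600_000, 850_000, 1_200_000, 1_500_000, 2_100_000]
r = pearson_r(area, price)
print(f"Correlation between area and price: {r:.2f}")
```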
LINEAR REGRESSION CONDITION
This section checks the linear regression conditions before building a
price prediction model for the Mont Kiara location.

01 Non-multicollinearity: High correlation is not allowed between
independent variables
02 Homoscedasticity: Data distribution has finite, similar variance
03 Non-autocorrelation: Data is not of a time-series type
NON-MULTICOLLINEARITY

'Area' vs 'Room' has high correlation. One needs to be excluded.

'Area' has high correlation to most of the other independent variables.
So, 'Area' should be excluded from the regression model.
HOMOSCEDASTICITY CHECK
SIGNIFICANCE TEST

01 Simultaneous (F-test)
02 Partial (t-test)
03 Hypotheses
Reject the null hypothesis, H0, if:
p-value < alpha threshold (5%)

Regression Statistics
Multiple R           0.7687
R Square             0.5909
Adjusted R Square    0.5881
Standard Error       564346.842
Observations         592
SIGNIFICANCE TEST: SIMULTANEOUS

             df     SS                 MS                F              Significance F
Regression   4      270103371078313    67525842769578    212.0204807    0
Residual     587    186952079193022    318487358080
Total        591    457055450271335
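
The F statistic and the R Square values can be re-derived from the sums of squares in this table. The following is a minimal sketch of that arithmetic; the variable names are illustrative.

```python
# Sums of squares and sizes taken from the ANOVA table.
ss_regression = 270_103_371_078_313
ss_residual = 186_952_079_193_022
df_regression = 4          # number of predictors
n_obs = 592
df_residual = n_obs - df_regression - 1   # 587

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual

r_squared = ss_regression / (ss_regression + ss_residual)
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residual

print(f"F = {f_stat:.4f}")
print(f"R Square = {r_squared:.4f}, Adjusted R Square = {adj_r_squared:.4f}")
```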


SIGNIFICANCE TEST: PARTIAL

            Coefficients    Standard Error   t Stat          P-value         Lower 95%       Upper 95%
Intercept   -446545.5325    92590.62526      -4.822794223    0.000001806     -628394.7739    -264696.2912
Rooms       429652.6354     39313.56883      10.9288637      0               352440.254      506865.0167
Maid Room   160801.5799     54300.74872      2.96131423      0.0031871308    54154.17438     267448.9854
Bathrooms   201482.1485     24861.0243       8.104338186     0               152654.7603     250309.5368
Car Parks   14617.51825     19260.46205      0.758939127     0.4481936057    -23210.28993    52445.32644

Coefficients: the coefficient for each variable in the regression model.
Car Parks: less significant, but it doesn't disturb the regression, so
it is still taken into account.
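
Each t Stat in this table is the coefficient divided by its standard error. A short sketch that recomputes them from the reported values (the dictionary layout is just for illustration):

```python
# (coefficient, standard error) pairs taken from the table.
estimates = {
    "Intercept": (-446545.5325, 92590.62526),
    "Rooms": (429652.6354, 39313.56883),
    "Maid Room": (160801.5799, 54300.74872),
    "Bathrooms": (201482.1485, 24861.0243),
    "Car Parks": (14617.51825, 19260.46205),
}

# t statistic = coefficient / standard error
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}

for name, t in t_stats.items():
    print(f"{name:<10} t = {t:.4f}")
```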
SIGNIFICANCE TEST α = 5%

01 Simultaneous (F-test)
Significance F = 0
Reject H0

02 Partial (t-test)
Room         Reject H0
Maid room    Reject H0
Bathroom     Reject H0
Car Park     Fail to reject H0
(Although the test fails to reject H0 for car parks, the variable
doesn't disturb the regression, so it may still be taken into account
in the regression model.)
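
The partial test decisions can be expressed as a simple comparison of each P-value against α = 5%. This is a sketch using the P-values as displayed in the regression output (values shown as 0 are very small numbers rounded by the spreadsheet):

```python
ALPHA = 0.05

# P-values as displayed in the regression output.
p_values = {
    "Rooms": 0.0,
    "Maid Room": 0.0031871308,
    "Bathrooms": 0.0,
    "Car Parks": 0.4481936057,
}

# Reject H0 (coefficient equals zero) when the p-value is below alpha.
decisions = {
    name: "reject H0" if p < ALPHA else "fail to reject H0"
    for name, p in p_values.items()
}

for name, decision in decisions.items():
    print(f"{name:<10} {decision}")
```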
PRICE PREDICTION
From the linear regression

PRICE =
- 446,545 + 429,652 * Room + 160,801 * Maid Room
+ 201,482 * Bathroom + 14,617 * Car Parks

INTUITIVELY..
The price of 1 room is RM 429,652
The price of 1 maid room is RM 160,801
The price of 1 bathroom is RM 201,482
The price of 1 car park is RM 14,617

However, because the constant (the starting price) is negative
(- RM 446,545), the regression model only gives a sensible price for
listings with a minimum number of rooms.
PRICE PREDICTION
From the linear regression

CHECK FOR...
3 Rooms + 2 Bathrooms + 2 Car Parks + 100 sq.ft.

PRICE = - 446,545 + 429,652 * Room + 160,801 * Maid Room
+ 201,482 * Bathroom + 14,617 * Car Parks
= - 446,545 + 429,652 (3) + 160,801 (0)
+ 201,482 (2) + 14,617 (2)
= RM 1,274,612 (computed with the unrounded coefficients)

Accurate to the given data
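
The check can be reproduced with a small function. This sketch uses the unrounded coefficients from the partial significance test table, which is why it gives RM 1,274,612; the function and parameter names are illustrative. Area is not an input because 'Area' was excluded from the model.

```python
# Unrounded coefficients from the regression output (RM).
INTERCEPT = -446545.5325
COEF_ROOMS = 429652.6354
COEF_MAID_ROOMS = 160801.5799
COEF_BATHROOMS = 201482.1485
COEF_CAR_PARKS = 14617.51825

def predict_price(rooms, maid_rooms, bathrooms, car_parks):
    """Predicted Mont Kiara price (RM) from the fitted linear model."""
    return (
        INTERCEPT
        + COEF_ROOMS * rooms
        + COEF_MAID_ROOMS * maid_rooms
        + COEF_BATHROOMS * bathrooms
        + COEF_CAR_PARKS * car_parks
    )

price = predict_price(rooms=3, maid_rooms=0, bathrooms=2, car_parks=2)
print(f"RM {price:,.0f}")
```

With this intercept, a listing with only 1 room and nothing else still gets a negative prediction, which illustrates why the model needs a minimum number of rooms to be meaningful.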
ANY FURTHER DISCUSSION?

LET'S CONNECT!
LILIEK DARMAWAN TH
www.linkedin.com/in/liek-darmawan
