
Part A

1. Method of analyzing the data set


The "House Price Data Project" data set contains 18 450 observations with the essential
information for trading a property: price, number of bedrooms, number of bathrooms, size, whether
the house is furnished or unfurnished, and which level the property is on. I will analyze the 501
observations from observation 2400 to observation 2900 of this data set, concentrating on
descriptive and inferential statistics.

2. Data sources
According to Anderson et al. (2017), data is a collection of information that contains letters,
numbers, photos, and so on. Data helps individuals better envision an issue and is used in a
variety of sectors such as technology, engineering, science, and business. In business and
economics in particular, primary data and secondary data are the two types of data that are most
often used.
2.1. Primary data
Primary data is data that has not previously been gathered: it is obtained for the first time
by the researcher, typically using one of four methods: experiment, observation, interview, or
survey (Froese, 2000). In practice, when secondary data cannot meet the study needs or cannot be
found, researchers must conduct surveys and gather primary data. Examples of primary data include
information about customers' current usage of a product or their appraisal of the product's
characteristics and related services.
Advantages: As the original source of information, primary data has the benefit of being more
accurate and trustworthy (because the researcher can directly monitor and evaluate the research
process). In addition, primary data consists mostly of fresh, up-to-date, and timely
information.
Disadvantages: Because primary data must be produced through the research process itself,
gathering it may take a long time and cost a lot of money. Furthermore, the knowledge gained from
primary data might be relatively limited, since it comes only from the study of a few
researchers.
2.2. Secondary data
Secondary data is information gathered by one researcher from the work of other researchers
(Vartanian, 2010). In other words, it is an indirect data source. Data that has already been
gathered and is accessible to the public or, in certain cases, is already available inside an
organization is secondary data. Typically, secondary data is gathered from the following
sources: data warehouses of businesses and organizations, governmental agency data warehouses,
the Internet, periodicals, and research papers (research journals). Examples of secondary data
include socioeconomic surveys and scientific reports.
Advantages: Secondary data is simple to obtain in a short period of time and at minimal cost.
Furthermore, the data range is broad, since it is gathered and processed by many different
researchers for a variety of study aims.
Disadvantages: Because secondary data has usually already been processed, determining the quality
and credibility of the data source can be challenging. The data may originate from many sources,
pass through many processing stages, and have been collected for objectives that are unclear to
the new user. In addition, many information sources may be outdated.
3. Data collection methods
Primary data is usually collected by two methods:
Quantitative data collecting:
According to Sapsford and Jupp (1996), it is based on mathematical calculations and takes a
variety of forms, such as closed-ended questions, correlation and regression approaches, and
mean, median, and mode metrics. This approach is less costly and faster than qualitative data
collecting procedures.
Qualitative data collecting:
According to Taylor (2017), this method deals with non-quantifiable items. One-on-one
interviews, surveys, observations, case studies, focus groups, and other approaches are used to
collect this information.
Secondary data provides a plethora of information on many topics, such as consumers, suppliers,
workers, goods, business and economics, and so on. The approach for this sort of data is
therefore to gather it from a variety of sources, such as:
Published data: government public papers, journals, commercial records, historical and
statistical sources, and so forth.
Unpublished data: diaries, letters, unpublished biographies, or customer service statistics.
Part B

Four variables have missing values: the price of the property (1 missing value), the number of
bedrooms (1), the number of bathrooms (1), and the area of the property in m2 (2), for a total of
5 missing values.
Outliers can only appear in the quantitative variables, which are price, bedrooms, bathrooms, and
area.
The outliers found in these variables all lie above the upper limit (Q3 + 1.5 × IQR). The upper
limits of the variables price, bedrooms, bathrooms, and area are 6 150 000, 4.5, 4.5, and 288
respectively. From this, it can be concluded that:
- Outliers of the price of the property are values greater than 6 150 000
- Outliers of the number of bedrooms are values greater than 4.5
- Outliers of the number of bathrooms are values greater than 4.5
- Outliers of the area of the property in m2 are values greater than 288.
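
The upper-limit rule used here can be sketched in a few lines of Python. This is only an illustration of the Q3 + 1.5 × IQR formula, not the SPSS procedure itself; the function names are hypothetical, and the quartiles plugged in are the 25th and 75th percentiles of price and bedrooms reported in the Statistics table later in this part, on the assumption that the same quartiles were used to set the limits.

```python
def upper_fence(q1, q3, k=1.5):
    """Return the upper outlier limit Q3 + k * IQR, with IQR = Q3 - Q1."""
    return q3 + k * (q3 - q1)


def is_upper_outlier(value, q1, q3):
    """True when a value lies above the upper limit."""
    return value > upper_fence(q1, q3)


# Quartiles taken from the Percentiles rows of the Statistics table
# (assumption: the same quartiles were used when the limits were set).
price_limit = upper_fence(650_000, 2_850_000)
bedrooms_limit = upper_fence(2, 3)

print(price_limit)     # 6150000.0
print(bedrooms_limit)  # 4.5
print(is_upper_outlier(13_000_000, 650_000, 2_850_000))  # True
```

With these inputs the sketch reproduces the limits of 6 150 000 for price and 4.5 for bedrooms quoted above.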

Accordingly, the outliers are removed using the filter described above. Once filtered out, these
cases are shown with a slash through their case number, as shown below.
This is the result after replacing the missing values with the mean of each variable.
2.
The variable “level” has been recoded into 2 groups: group 1 contains the Ground, 1st, and 2nd
levels, and group 2 contains the remaining levels.
- For qualitative variables, I used frequency tables for all 3 variables. In addition, I used a
bar chart for the type of property variable and pie charts for the Furnished and
Level_group variables.

The type of property

                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Apartment    454         90.6      90.6            90.6
        Duplex       33          6.6       6.6             97.2
        Penthouse    7           1.4       1.4             98.6
        Studio       6           1.2       1.2             99.8
        Unknown      1           0.2       0.2             100.0
        Total        501         100.0     100.0
As can be seen from the frequency table and bar chart above, the type of property is mostly
apartment, with 454 out of 501 observations, accounting for 90.6%. Next is duplex with 33
observations, accounting for 6.6%. Penthouse and studio account for almost equal and negligible
proportions, with 1.4% and 1.2% respectively. The least common type is unknown, with only
1 observation (0.2%).
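
As a cross-check of the frequency table, the percent and cumulative percent columns can be recomputed from the counts. The sketch below is a minimal illustration in Python (the function name `frequency_table` is hypothetical, and this is not the SPSS routine); it uses only the frequencies listed in the table.

```python
# Frequencies as listed in "The type of property" table.
counts = {"Apartment": 454, "Duplex": 33, "Penthouse": 7, "Studio": 6, "Unknown": 1}


def frequency_table(counts):
    """Return (category, frequency, percent, cumulative percent) rows."""
    total = sum(counts.values())
    rows, running = [], 0
    for category, n in counts.items():
        running += n
        rows.append(
            (category, n, round(100 * n / total, 1), round(100 * running / total, 1))
        )
    return rows


for row in frequency_table(counts):
    print(row)
# First row: ('Apartment', 454, 90.6, 90.6)
```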

Is the property Furnished or not

                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   No       324         64.7      64.7            64.7
        Yes      177         35.3      35.3            100.0
        Total    501         100.0     100.0

The pie chart shows whether the property is furnished or not. Unfurnished properties make up
the majority, with 324 observations (64.7%), almost double the 177 observations (35.3%) of
furnished properties.

Level_group

                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid   1        281         56.1      56.1            56.1
        2        220         43.9      43.9            100.0
        Total    501         100.0     100.0

The Level_group variable has been recoded from the level variable into 2 groups: group 1
includes the ground, 1st, and 2nd levels, and group 2 includes all remaining levels. Group 1
accounts for more than half of the chart with 281 observations (56.1%), about 12 percentage
points more than group 2 with 220 observations (43.9%).

For quantitative variables, I used summary statistics for all 4 variables (price, bedrooms,
bathrooms, and area), which were cleaned above by replacing missing values with the mean of
each variable. In addition, I used histograms for the two variables price and area, and
boxplots for the two variables bedrooms and bathrooms.

Statistics

                          SMEAN(Price)        SMEAN(Bedrooms)   SMEAN(Bathrooms)   SMEAN(Area)
N             Valid       501                 501               501                501
              Missing     0                   0                 0                  0
Mean                      2038710.2160000     2.766             2.078              156.926
Std. Error of Mean        78889.94471267      .0322             .0366              2.6375
Median                    1633000.0000000     3.000             2.000              150.000
Mode                      3500000.00000       3.0               2.0                140.0
Std. Deviation            1765795.94284291    .7206             .8197              59.0348
Variance                  3118035311760.496   .519              .672               3485.109
Skewness                  1.809               .739              .795               1.061
Std. Error of Skewness    .109                .109              .109               .109
Kurtosis                  5.149               3.867             2.392              2.072
Std. Error of Kurtosis    .218                .218              .218               .218
Range                     12940000.00000      6.0               6.0                391.0
Minimum                   60000.00000         1.0               1.0                39.0
Maximum                   13000000.00000      7.0               7.0                430.0
Sum                       1021393818.21600    1385.8            1041.1             78619.9
Percentiles   25          650000.0000000      2.000             2.000              118.500
              50          1633000.0000000     3.000             2.000              150.000
              75          2850000.0000000     3.000             3.000              185.500

The above table shows statistics such as the mean, median, mode, standard deviation, minimum,
and maximum for the 501 properties, covering price, number of bedrooms, number of bathrooms,
and area of the properties in m2.
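
Some rows of the table follow directly from others. As a small worked example, the standard error of the mean is the standard deviation divided by the square root of N. The Python sketch below (with a hypothetical helper name) checks this for price and area using the figures in the table.

```python
import math


def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation / sqrt(sample size)."""
    return std_dev / math.sqrt(n)


# Std. Deviation and N taken from the Statistics table.
se_price = standard_error(1765795.94284291, 501)
se_area = standard_error(59.0348, 501)

print(round(se_price, 2))  # close to the reported 78889.94
print(round(se_area, 4))   # close to the reported 2.6375
```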

As can be seen from the chart above, the price axis of the histogram for these 501 properties
runs from 0 to 13 000 000 and the frequency axis from 0 to 100; the observed prices themselves
range from 60 000 to 13 000 000. The average price of a property is 2 038 710, and most
properties are priced at or below 2 500 000. The price of these 501 properties has relatively
high variability around the mean, with a standard deviation of about 1 765 795.
The area of the 501 properties, measured in m2, ranges from 39 m2 to 430 m2. As can be seen
from the frequency axis, areas between 130 and 150 m2 occur most often. This also partly
explains why the mode of the area of these 501 properties is 140 m2.

Both the number of bedrooms and the number of bathrooms range from 1 to 7. Their means are
also fairly close, at about 2.8 and 2.1 respectively. Meanwhile, the mode of the number of
bedrooms is 3, while the mode of the number of bathrooms is 2.

3. Correlation
3.1. Quantitative variables
As can be seen from the correlations table above, bedrooms, bathrooms, and area each have a
Sig. value of 0.000 (< 0.01) for their correlation with price, which means each of them is
significantly correlated with price. All three variables also have positive Pearson
correlation values, so each is positively correlated with price.
In terms of strength, the relationship between price and bathrooms is the strongest of the
three, with a Pearson correlation of 0.573, assessed as a moderate positive correlation (as
shown in the table above). Next are the correlation between price and area and the correlation
between price and bedrooms, with Pearson correlations of 0.459 and 0.381 respectively. Both of
these are low positive correlations, with values in the range of 0.3 to 0.5.
The scatter plots show an upward trend for all three variables: number of bathrooms, number of
bedrooms, and area of the property in m2. Combined with the positive correlations with price
found above, this suggests that properties with more bedrooms and bathrooms tend to have a
higher price, and that the larger the area of the property, the higher its price tends to
be.
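
For reference, the Pearson coefficient and the strength labels used above can be sketched in Python as follows. This is a minimal illustration, not the SPSS output: the function names are hypothetical, the bathroom and price lists are made-up values used only to show the call, and the cut-offs other than the 0.3 to 0.5 "low" band stated above (in particular 0.7 as the upper bound of "moderate") are an assumed rule of thumb.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def strength(r):
    """Label the size of a correlation (0.7 cut-off is an assumption)."""
    size = abs(r)
    if size < 0.3:
        return "negligible"
    if size < 0.5:
        return "low"
    if size < 0.7:
        return "moderate"
    return "high"


# Hypothetical toy data, only to demonstrate the call.
bathrooms = [1, 2, 2, 3, 4]
prices = [900_000, 1_500_000, 2_100_000, 2_600_000, 4_000_000]
print(strength(pearson_r(bathrooms, prices)))

# Labels for the coefficients reported in the correlations table.
for name, r in [("bathrooms", 0.573), ("area", 0.459), ("bedrooms", 0.381)]:
    print(name, strength(r))
```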

3.2. Qualitative variables


As can be seen from the table above, the four defined types of property (apartment, duplex,
penthouse, and studio) all have different mean prices and standard deviations. The highest
average price belongs to the duplex, at about 3.5 million, followed by the penthouse, which
is slightly lower at about 3.25 million. The two types with the lowest average price are
apartment and studio, at about 1.9 million and 980 thousand respectively. The duplex also has
the largest standard deviation, 2376362, followed by penthouses, apartments, and studios with
steadily decreasing values of 2100030, 1667243, and 413650 respectively. From the mean and
standard deviation of price for each type of property, I calculated the coefficient of
variation of each type by dividing its standard deviation by its mean price and multiplying
by 100. This shows that the relative variation of apartments is the largest, followed by
duplex, penthouse, and studio. In short, the price of a property appears to depend on what
type of property it is.
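
The coefficient of variation calculation described above can be reproduced with a short Python sketch. The standard deviations are the ones quoted in the text; the means are the rounded "about" figures from the same paragraph, so the resulting percentages are approximations rather than the exact values behind the report.

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation in percent: 100 * std. deviation / mean."""
    return 100 * std_dev / mean


# (approximate mean price, standard deviation) per type, as quoted in the text.
summary = {
    "Apartment": (1_900_000, 1_667_243),
    "Duplex": (3_500_000, 2_376_362),
    "Penthouse": (3_250_000, 2_100_030),
    "Studio": (980_000, 413_650),
}

cv = {kind: coefficient_of_variation(sd, mean) for kind, (mean, sd) in summary.items()}
ranking = sorted(cv, key=cv.get, reverse=True)

for kind in ranking:
    print(kind, round(cv[kind], 1))
```

Even with the rounded means, the ordering comes out as apartment, duplex, penthouse, studio, which agrees with the ranking stated above.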

The average prices of furnished and unfurnished properties differ only slightly, by about
100 000 (2 074 572 for unfurnished properties and 1 973 064 for furnished properties). The
coefficients of variation of the two groups are also similar, at 89 compared to 81. In
addition, furnished properties would usually be expected to cost more than unfurnished
properties, but in these 501 observations the opposite is true. From these two points, it can
be said that the price of a property is almost independent of whether it is furnished or not.

The mean price of the properties in group 1 (ground, 1st, and 2nd levels) is 2 285 151,
noticeably higher than the average price of 1 723 937 for properties on the remaining levels.
Group 1 also has a smaller coefficient of variation than group 2 (81 vs. 92). This suggests
that the price of a property depends on its level: low-level properties on the ground, 1st,
and 2nd floors are on average more expensive than properties on the remaining levels.

4. The rationale for choosing the methods of communication


All of Part B's data is evaluated using descriptive statistics and displayed in charts suited
to each kind of variable, to make the results easier to follow. To quantify and illustrate
qualitative data, frequency tables, bar charts, and pie charts are employed as numerical and
graphical summaries. Because a frequency table counts the categories directly, it removes the
need to code qualitative variables numerically, which saves time and effort. Bar charts let
readers compare the frequencies of the categories at a glance. With straightforward and basic
figures, pie charts show each category as a share of the whole.
For quantitative data, summary statistics were used to summarize the data set and present the
most information in the simplest manner feasible. The mean is one of the measures in this
approach and describes the center of the distribution. If the mean deviates markedly from the
median, the data set is skewed; the more dispersed the data, the larger the standard
deviation. The relationships between variables were then examined using cross-tables and
scatter plots. Cross-tables make it easier to evaluate and compare attributes across groups.
For scatter plots, the dependent variable is plotted on the Y-axis and the independent
variable on the X-axis, using dots to indicate the values of the two variables. The plot then
shows how the dependent variable tends to change with the independent variable, and whether
the correlation between the two variables is strong or weak.

Part C
1. T-test

Independent Samples Test

                                                      Levene's Test for
                                                      Equality of Variances
                                                      F       Sig.     t        df        Sig. (2-tailed)
The price of property   Equal variances assumed       .741    .390     -.615    499       .539
                        Equal variances not assumed                    -.642    409.384

In this section I used an independent-samples t-test to compare the prices of furnished and
unfurnished properties. Levene's F test is first used to check whether the two groups have
equal variances, at the 0.05 level of significance.
$H_0: \sigma_{\text{furnished}} = \sigma_{\text{unfurnished}}$
$H_1: \sigma_{\text{furnished}} \neq \sigma_{\text{unfurnished}}$
p-value (Sig.) = 0.390 > 0.05 -> Do not reject H0
=> Equal variances assumed.
Next, let $\mu_{\text{furnished}}$ be the mean price of furnished properties and
$\mu_{\text{unfurnished}}$ the mean price of unfurnished properties. The t-test is used to
determine whether these two means are equal.
$H_0: \mu_{\text{furnished}} = \mu_{\text{unfurnished}}$
$H_1: \mu_{\text{furnished}} \neq \mu_{\text{unfurnished}}$
p-value (Sig.) = 0.539 > 0.05 -> Do not reject H0
Therefore, it can be concluded that there is no statistically significant difference between
the mean price of furnished houses and the mean price of unfurnished houses at the 0.05
significance level.
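
The equal-variances t statistic can be roughly reproduced from the group summaries given in Part B. The Python sketch below is only an approximation under stated assumptions: the group sizes (177 furnished, 324 unfurnished) and means come from Part B, the standard deviations are not reported directly and are rebuilt here as coefficient of variation × mean using the rounded values 81 (assumed to belong to furnished) and 89 (assumed to belong to unfurnished), and the p-value uses a normal approximation instead of the t distribution, which is close for 499 degrees of freedom. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df


mean_f, n_f = 1_973_064, 177  # furnished
mean_u, n_u = 2_074_572, 324  # unfurnished
sd_f = 0.81 * mean_f          # assumption: SD rebuilt from the rounded CV of 81
sd_u = 0.89 * mean_u          # assumption: SD rebuilt from the rounded CV of 89

t, df = pooled_t(mean_f, sd_f, n_f, mean_u, sd_u, n_u)
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))  # normal approximation

print(round(t, 2), df, round(p_approx, 2))
```

Under these assumptions the result is close to the reported t of -.615 with 499 degrees of freedom and a p-value of about 0.54, which supports not rejecting H0.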

Independent Samples Test

                                                      Levene's Test for
                                                      Equality of Variances
                                                      F        Sig.     t        df        Sig. (2-tailed)
The price of property   Equal variances assumed       1.661    .199     -.339    187       .735
                        Equal variances not assumed                     -.349    100.981   .728

In this section I used an independent-samples t-test to compare the prices of properties that
belong to group 1 and group 2 of the variable “Level_group”. Levene's F test is first used to
check whether the two groups have equal variances, at the 0.05 level.

$H_0: \sigma_{\text{group 1}} = \sigma_{\text{group 2}}$
$H_1: \sigma_{\text{group 1}} \neq \sigma_{\text{group 2}}$
p-value (Sig.) = 0.199 > 0.05 -> Do not reject H0
=> Equal variances assumed.
Next, let $\mu_{\text{group 1}}$ be the mean price of properties in group 1 and
$\mu_{\text{group 2}}$ the mean price of properties in group 2. The t-test is used to
determine whether these two means are equal.
$H_0: \mu_{\text{group 1}} = \mu_{\text{group 2}}$
$H_1: \mu_{\text{group 1}} \neq \mu_{\text{group 2}}$
p-value (Sig.) = 0.735 > 0.05 -> Do not reject H0
Therefore, it can be concluded that there is no statistically significant difference between
the mean price of houses in group 1 and the mean price of houses in group 2 at the 0.05
significance level.

2. Regression model
Model Summary

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .590 (a)   .348       .342                1432770.80597328

a. Predictors: (Constant), Level by group 1,2, Furnished_Dummy, Number of
bedrooms, Number of bathrooms, The Area of the property by m2

As can be seen from the table above, the R Square value is 0.348. This means
that the five independent variables furnished, bedrooms, bathrooms, area, and
level_group explain 34.8% of the variation in the dependent variable “price”.
Besides, the Adjusted R Square value of 0.342 means that, after adjusting for the
number of predictors, random error and extraneous factors account for 65.8% of
the variance.

ANOVA (a)

Model            Sum of Squares         df    Mean Square           F        Sig.
1   Regression   542865725567831.250    5     108573145113566.250   52.889   .000 (b)
    Residual     1016151930312412.200   495   2052832182449.318
    Total        1559017655880243.500   500

a. Dependent Variable: The price of property
b. Predictors: (Constant), Level by group 1,2, Furnished_Dummy, Number of
bedrooms, Number of bathrooms, The Area of the property by m2
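
The Model Summary and ANOVA figures are linked by simple formulas, which the following Python sketch recomputes from the sums of squares. It is a cross-check only, using the regression and residual sums of squares from the ANOVA table with n = 501 observations and k = 5 predictors; the variable names are my own.

```python
import math

# Sums of squares from the ANOVA table.
ss_regression = 542865725567831.250
ss_residual = 1016151930312412.200
n, k = 501, 5  # observations and predictors

ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
f_stat = (ss_regression / k) / (ss_residual / (n - k - 1))
std_error_estimate = math.sqrt(ss_residual / (n - k - 1))

print(round(r_squared, 3))      # 0.348
print(round(adj_r_squared, 3))  # 0.342
print(round(f_stat, 2))         # 52.89
print(round(std_error_estimate))
```

The recomputed values agree with the reported R Square of .348, Adjusted R Square of .342, F of 52.889, and standard error of the estimate of about 1 432 771.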

Test for overall significance

$H_0: R^2 = 0$
$H_1: R^2 > 0$
P_value = 0.000 < α = 0.05 => Reject H0
=> The model is significant overall

Coefficients (a)

                                    Unstandardized Coefficients   Standardized Coefficients
Model                               B              Std. Error     Beta       t        Sig.
1   (Constant)                      -246674.553    327138.090                -.754    .451
    Number of bedrooms              -135011.351    130210.798     -.055      -1.037   .300
    Number of bathrooms             1043997.729    115064.034     .485       9.073    .000
    The Area of the property by m2  5327.491       1618.073       .178       3.292    .001
    Furnished_Dummy                 -274087.585    136315.600     -.074      -2.011   .045
    Level by group 1,2              -173570.367    132372.260     -.049      -1.311   .190

a. Dependent Variable: The price of the property
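$$\widehat{\text{Price}} = -246674.553 - 135011.351\,\text{Bedrooms} + 1043997.729\,\text{Bathrooms} + 5327.491\,\text{Area} - 274087.585\,\text{Furnished} - 173570.367\,\text{Level\_group}$$

Level_group = 1
$$\widehat{\text{Price}} = -246674.553 - 135011.351\,\text{Bedrooms} + 1043997.729\,\text{Bathrooms} + 5327.491\,\text{Area} - 274087.585\,\text{Furnished} - 173570.367 \times 1$$
Level_group = 2
$$\widehat{\text{Price}} = -246674.553 - 135011.351\,\text{Bedrooms} + 1043997.729\,\text{Bathrooms} + 5327.491\,\text{Area} - 274087.585\,\text{Furnished} - 173570.367 \times 2$$
Furnished = 0
$$\widehat{\text{Price}} = -246674.553 - 135011.351\,\text{Bedrooms} + 1043997.729\,\text{Bathrooms} + 5327.491\,\text{Area} - 274087.585 \times 0 - 173570.367\,\text{Level\_group}$$
Furnished = 1
$$\widehat{\text{Price}} = -246674.553 - 135011.351\,\text{Bedrooms} + 1043997.729\,\text{Bathrooms} + 5327.491\,\text{Area} - 274087.585 \times 1 - 173570.367\,\text{Level\_group}$$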

=> Conclusions (each holding the other variables constant):
- For each additional bedroom, the mean price of the property decreases by 135 011
- For each additional bathroom, the mean price of the property increases by 1 043 998
- For each additional m2 of area, the mean price of the property increases by 5 327
- Furnished properties have a mean price 274 088 lower than unfurnished properties
(coefficient -274087.585)
- Properties in group 2 (3rd floor and higher) have a mean price 173 570 lower than properties
in group 1 (ground, 1st, and 2nd floors) (coefficient -173570.367)
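
To make these interpretations concrete, the fitted equation can be written as a small Python function. This is a sketch using the unstandardized coefficients from the Coefficients table; the function name is hypothetical, the coding of Furnished_Dummy as 1 = furnished and 0 = unfurnished is assumed from the equations above, and the example property (3 bedrooms, 2 bathrooms, 150 m2) is made up for illustration.

```python
# Unstandardized coefficients (B) from the Coefficients table.
COEFFICIENTS = {
    "constant": -246674.553,
    "bedrooms": -135011.351,
    "bathrooms": 1043997.729,
    "area": 5327.491,
    "furnished": -274087.585,    # assumed coding: 1 = furnished, 0 = unfurnished
    "level_group": -173570.367,  # coded 1 or 2
}


def predict_price(bedrooms, bathrooms, area, furnished, level_group):
    """Predicted price from the fitted regression equation."""
    c = COEFFICIENTS
    return (
        c["constant"]
        + c["bedrooms"] * bedrooms
        + c["bathrooms"] * bathrooms
        + c["area"] * area
        + c["furnished"] * furnished
        + c["level_group"] * level_group
    )


# Hypothetical example property on a low level (group 1).
unfurnished = predict_price(3, 2, 150, furnished=0, level_group=1)
furnished = predict_price(3, 2, 150, furnished=1, level_group=1)

print(round(unfurnished, 3))              # about 2061840.135
print(round(furnished - unfurnished, 3))  # the Furnished_Dummy coefficient
```

The difference between the two predictions equals the Furnished_Dummy coefficient, which is what "holding the other variables constant" means in practice.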

Test for coefficients:

$H_0: \beta_i = 0$
$H_1: \beta_i \neq 0$

- Bedrooms: P_value = 0.300 > α = 0.05 => Do not reject H0
=> There is no statistically significant relationship between price and bedrooms once the
other variables are in the model
- Bathrooms: P_value = 0.000 < α = 0.05 => Reject H0
=> The price variable and the bathrooms variable are related
- Area: P_value = 0.001 < α = 0.05 => Reject H0
=> The price variable and the area variable are related
- Furnished: P_value = 0.045 < α = 0.05 => Reject H0
=> The price variable and the furnished variable are related
- Level_group: P_value = 0.190 > α = 0.05 => Do not reject H0
=> There is no statistically significant relationship between price and level_group once the
other variables are in the model.

=> We can narrow down the model by excluding the variables “Bedrooms” and
“Level_group”.
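
Each t value in the Coefficients table is the coefficient divided by its standard error, and the decision on H0 follows from the corresponding p-value. The Python sketch below recomputes the t ratios from the table; the p-values use a normal approximation rather than the t distribution with 495 degrees of freedom, which is an assumption made to keep the example within the standard library, and the helper names are hypothetical.

```python
from statistics import NormalDist

# (B, Std. Error) from the Coefficients table.
coefficients = {
    "Bedrooms": (-135011.351, 130210.798),
    "Bathrooms": (1043997.729, 115064.034),
    "Area": (5327.491, 1618.073),
    "Furnished": (-274087.585, 136315.600),
    "Level_group": (-173570.367, 132372.260),
}


def t_ratio(b, std_error):
    """t statistic for H0: beta = 0."""
    return b / std_error


def approx_p_value(t):
    """Two-sided p-value using a normal approximation (large df assumed)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


significant = []
for name, (b, se) in coefficients.items():
    t = t_ratio(b, se)
    p = approx_p_value(t)
    if p < 0.05:
        significant.append(name)
    print(f"{name}: t = {t:.3f}, approximate p = {p:.3f}")

print(significant)  # ['Bathrooms', 'Area', 'Furnished']
```

The recomputed t ratios match the table, and the same three variables (bathrooms, area, and furnished) come out as significant at the 0.05 level.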

3. Evaluation of summary statistics and hypothesis testing


Tables and graphs were used to depict the data for part B's summary statistics, a subset of
descriptive statistics. Before doing analysis and charting, the variables must be divided into
two groups: qualitative variables and quantitative variables. Applying a method to the wrong
type of variable would produce numbers that do not represent the qualities of the data, so each
analysis can only be performed once its input conditions are met. Meanwhile, part C's
hypothesis testing procedure uses inferential statistics. It entails selecting a sample from a
population, doing calculations, evaluating the data from that sample, comparing the findings
with the hypothesis, and then deciding whether or not to reject the null hypothesis in light of
the outcomes.
Summary statistics are used to compile data from a sample in a concise manner. According to
Brune (2007), using inferential statistics, the data collected from a sample are extrapolated
to the whole population from which the sample was drawn. In contrast to inferential statistics,
descriptive statistics only describe the characteristics of the sample from which the data are
produced, whereas inferential statistics employ measurements from the sample to deduce the
population's characteristics. In other words, descriptive statistics help explain existing data
by summarising the sample, while inferential statistics try to make inferences about the
population beyond the data that is presently available. In order to reach statistically
meaningful conclusions, descriptive statistics and inferential statistics must be combined in a
proper way.
