060 Techniques of Data Analysis
Objectives
Overall: Reinforce your understanding from the main
lecture
Specific:
* Concepts of data analysis
* Some data analysis techniques
* Some tips for data analysis
Statistical Methods
Something to do with statistics
Statistics: meaningful quantities about a sample of
objects, things, persons, events, phenomena, etc.
Widely used in social sciences.
Range from simple to complex issues. E.g.
* correlation
* ANOVA
* MANOVA
* regression
* econometric modelling
Two main categories:
* Descriptive statistics
* Inferential statistics
Descriptive statistics
Use sample information to explain/make
abstraction of population phenomena.
Common phenomena:
* Association (e.g. 1,2.3 = 0.75)
* Tendency (left-skew, right-skew)
* Causal relationship (e.g. if X, then, Y)
* Trend, pattern, dispersion, range
Used in non-parametric analysis (e.g. chi-square, t-test, 2-way ANOVA)
[Figure: No. of houses by district, 1991 and 2000 (districts: Batu Pahat, Johor Bahru, Kluang, Kota Tinggi, Mersing, Muar, Pontian, Segamat)]
[Figure: Loan to property sector (RM million)]
[Figure: Proportion (%) by class (0-4, 10-14, 20-24, 30-34, 40-44, 50-54, 60-64, 70-74)]
[Figure: % prediction error]
Inferential statistics
Using sample statistics to infer some
phenomena of population parameters
Common phenomena: cause-and-effect
* One-way r/ship
Y = f(X)
* Multi-directional r/ship
Y1 = f(Y2, X, e1)
Y2 = f(Y1, Z, e2)
* Recursive
Y1 = f(X, e1)
Y2 = f(Y1, Z, e2)
Examples of relationship
Dep=9t 215.8
Dep=7t 192.6
Coefficients (a)
                Unstandardized Coefficients   Standardized Coefficients
Model 1         B            Std. Error       Beta          t          Sig.
(Constant)      1993.108     239.632                        8.317      .000
Tanah           -4.472       1.199            -.190         -3.728     .000
Bangunan        6.938        .619             .705          11.209     .000
Ansilari        4.393        1.807            .139          2.431      .017
Umur            -27.893      6.108            -.241         -4.567     .000
Flo_go          34.895       89.440           .020          .390       .697
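As a reading aid for the coefficients output above, the sketch below recomputes each t value as B divided by its Std. Error. The dictionary name and layout are illustrative assumptions; the numbers are copied from the table, and small differences in the last digit come from rounding in the printed values.

```python
# (B, Std. Error) pairs copied from the coefficients table.
coefficients = {
    "(Constant)": (1993.108, 239.632),
    "Tanah": (-4.472, 1.199),
    "Bangunan": (6.938, 0.619),
    "Ansilari": (4.393, 1.807),
    "Umur": (-27.893, 6.108),
    "Flo_go": (34.895, 89.440),
}

# t statistic for each term: coefficient divided by its standard error.
t_values = {name: b / se for name, (b, se) in coefficients.items()}

for name, t in t_values.items():
    print(f"{name:<12} t = {t:8.3f}")
```

A large absolute t (and a small Sig. value) suggests the coefficient differs from zero; here Flo_go has the smallest absolute t of the six terms.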
Wrong technique
Correct technique
Correct technique
Using a regression
parameter
Multi-dimensional
scaling, Likert scaling
Simple regression
coefficient
Using R2
Hold-out samples
MAPE
Multi-dimensional
scaling, Likert scaling
Multi-dimensional
scaling, Likert scaling
Principles of analysis
Goal of an analysis:
* To explain cause-and-effect phenomena
* To relate research to real-world events
* To predict/forecast real-world
phenomena based on research
* To find answers to a particular problem
* To make conclusions about real-world events
based on the problem
* To learn a lesson from the problem
Number
6
4
10
15
Basic concepts
Central tendency
Variability
Probability
Statistical Modelling
Basic Concepts
SD
= 120,000
SST
= 210,000
DST
J.B. houses
=?
Central Tendency
Measure                                      Advantages    Disadvantages
Mean (sum of all values ÷ no. of values)
Median (middle value)
Mode (most frequent value)
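A minimal sketch of the three measures, using Python's standard `statistics` module. The price list is hypothetical (it is not lecture data) and was chosen so that one large value pulls the mean away from the median and mode.

```python
from statistics import mean, median, mode

# Hypothetical house prices in RM '000 (illustrative only, not lecture data).
prices = [120, 150, 150, 180, 400]

print("Mean:  ", mean(prices))    # sum of all values / no. of values
print("Median:", median(prices))  # middle value of the sorted data
print("Mode:  ", mode(prices))    # most frequent value
```

In this made-up sample the single expensive house raises the mean well above the median, which is one reason the median is often preferred for skewed price data.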
Thus,
10 12
2
14 24 18 20 12
$\bar{x} = \sum fx / \sum f = 96/12 = 8$, where $\sum fx = 96$ and $\sum f = 12$
Class       Midpoint x    fx
135-140     137.5         687.5
140-145     142.5         1282.5
145-150     147.5         885.0
150-155     152.5         305.0
155-160     157.5         157.5

Rental classes (RM/month): 130-135, 135-140, 140-145, 145-150, 150-155
Rental (RM/month): > 135, > 140, > 145, > 150, > 155
Cumulative frequency: 17, 23, 25
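The grouped-data mean can be sketched from the class midpoints and fx values of the rental table above. The class frequencies are not shown in the table, so the sketch recovers them as fx / x; treat that step, and the variable names, as assumptions.

```python
# Class midpoints (x) and fx products from the rental table.
midpoints = [137.5, 142.5, 147.5, 152.5, 157.5]
fx = [687.5, 1282.5, 885.0, 305.0, 157.5]

# Assumption: frequencies are recovered as f = fx / x (not given in the table).
freqs = [round(v / m) for v, m in zip(fx, midpoints)]

# Mean of a frequency distribution: sum of fx divided by sum of f.
grouped_mean = sum(fx) / sum(freqs)

print("Frequencies:", freqs)
print(f"Grouped mean rental: RM {grouped_mean:.2f} per month")
```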
Variability
Indicates dispersion, spread, variation, deviation
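For single population or sample data:
$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ (population variance); $\sigma$ = population standard deviation
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ (sample variance); $s$ = sample standard deviation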
Variability (contd.)
Why is a measure of dispersion important?
Consider returns from two categories of shares:
* Shares A (%) = {1.8, 1.9, 2.0, 2.1, 3.6}
* Shares B (%) = {1.0, 1.5, 2.0, 3.0, 3.9}
Mean A = mean B = 2.28%
But, different variability!
Var(A) = 0.557, Var(B) = 1.367
* Would you invest in category A shares or
category B shares?
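The comparison of the two share categories can be checked with a short sketch. It uses the sample variance (divisor n - 1) from Python's `statistics` module, which reproduces the figures quoted above; the variable names are illustrative.

```python
from statistics import mean, variance

# Returns (%) for the two categories of shares, as listed above.
shares_a = [1.8, 1.9, 2.0, 2.1, 3.6]
shares_b = [1.0, 1.5, 2.0, 3.0, 3.9]

mean_a, mean_b = mean(shares_a), mean(shares_b)
# statistics.variance is the sample variance (divides by n - 1).
var_a, var_b = variance(shares_a), variance(shares_b)

print(f"Mean A = {mean_a:.2f}%, Var(A) = {var_a:.3f}")
print(f"Mean B = {mean_b:.2f}%, Var(B) = {var_b:.3f}")
```

Both categories have the same mean return, but category B varies more around it, so it carries more risk for the same expected return.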
Variability (contd.)
Coefficient of variation (COV): std. deviation as
% of the mean:
$COV = \frac{s}{\bar{x}} \times 100$
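A small sketch of the coefficient of variation, applied to the share returns from the earlier variability example. The helper name `coefficient_of_variation` is an assumption for illustration, and the sample standard deviation (divisor n - 1) is used.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation expressed as a percentage of the mean."""
    return stdev(values) / mean(values) * 100


# Share returns (%) from the earlier variability example.
shares_a = [1.8, 1.9, 2.0, 2.1, 3.6]
shares_b = [1.0, 1.5, 2.0, 3.0, 3.9]

cov_a = coefficient_of_variation(shares_a)
cov_b = coefficient_of_variation(shares_b)

print(f"COV(A) = {cov_a:.1f}%")
print(f"COV(B) = {cov_b:.1f}%")
```

Because COV is unit-free, it also allows dispersion to be compared across variables measured on different scales.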
Variability (contd.)
Std. dev. of a frequency distribution
The following table shows the age distribution of second-time home buyers:
x^
Probability Distribution
Defined by a probability density function (pdf).
Many types: Z, t, F, gamma, etc.
God-given nature of the real world event.
General form:
$\int_{-\infty}^{\infty} f(x)\,dx = 1$ (continuous)
$\sum_{x} p(x) = 1$ (discrete)
E.g. sum of two dice:
             Dice2
Dice1    1    2    3    4    5    6
1        2    3    4    5    6    7
2        3    4    5    6    7    8
3        4    5    6    7    8    9
4        5    6    7    8    9   10
5        6    7    8    9   10   11
6        7    8    9   10   11   12
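The two-dice example can be turned into a discrete probability distribution by counting how often each sum occurs among the 36 equally likely outcomes. This is a minimal sketch assuming fair dice; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely (Dice1, Dice2) outcomes and their sums.
outcomes = [d1 + d2 for d1, d2 in product(range(1, 7), repeat=2)]
counts = Counter(outcomes)

# Probability of each sum = number of ways to obtain it / 36.
distribution = {
    total: Fraction(n, len(outcomes)) for total, n in sorted(counts.items())
}

for total, prob in distribution.items():
    print(f"P(sum = {total:2d}) = {prob}")
```

The probabilities add up to 1, which is the defining property of a discrete distribution.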
Discrete values
Discrete values
[Figure: histogram of Rental (RM/sq.ft.) against Frequency; Mean = 4.0628, Std. Dev. = 1.70319, N = 32]
P(Rental = RM 8) = 0
0.206
P(Rental 7) = 0.028
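* Bell-shaped, symmetrical
* Has a function of
$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
where $\mu$ = mean of variable x
$\sigma$ = std. dev. of x
$\pi$ = ratio of circumference of a
circle to its diameter = 3.14
e = base of natural log = 2.71828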
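The bell-shaped density can be sketched directly from the quantities defined above (mean, std. dev., pi and e). The function name `normal_pdf` is an assumption, and the example parameters are the mean and std. dev. reported for the rental histogram earlier in these notes.

```python
from math import exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Normal density at x for mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


# Mean and std. dev. reported for the rental histogram (RM/sq.ft.).
mu, sigma = 4.0628, 1.70319

for rental in (2.0, 4.0628, 6.0):
    print(f"f({rental}) = {normal_pdf(rental, mu, sigma):.4f}")
```

The density is highest at the mean and is the same at equal distances on either side of it, which is the symmetry noted above.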
Probability distribution
1 = ?
2 = ?
3 = ?
Probability distribution
* Has the following distribution of observation
Probability distribution
There are various other types and/or shapes of
distribution. E.g.
Note: p(AGE=age) 1
How to turn this graph into
a probability distribution
function (p.d.f.)?
Z-Distribution
P(X = x) is given by the area under the curve
Has no standard algebraic method of integration, so Z ~ N(0,1) is used
It is called the normal distribution (ND)
The standard normal distribution (SND) is the standard reference/approximation of other distributions. Since there
are various f(x) forming NDs, an SND is needed
To transform f(x) into f(z):
$Z = \frac{x - \mu}{\sigma} \sim N(0, 1)$
E.g. $Z = \frac{160 - 155}{5.4} = 0.926$
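The transformation to the standard normal scale is a one-line calculation. The sketch below reproduces the worked example (x = 160, mean = 155, std. dev. = 5.4); the function name `z_score` is illustrative.

```python
def z_score(x, mean, std_dev):
    """Standardise x: subtract the mean, then divide by the std. dev."""
    return (x - mean) / std_dev


z = z_score(160, 155, 5.4)
print(f"Z = {z:.3f}")
```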
Z-distribution (contd.)
When $X = \mu$, Z = 0
When $X = \mu + \sigma$, Z = 1
When $X = \mu + 2\sigma$, Z = 2
When $X = \mu + 3\sigma$, Z = 3 and so on.
It can be proven that $P(X_1 < X < X_k) = P(Z_1 < Z < Z_k)$
SND shows the probability to the right of any
particular value of Z.
Example
Normal distribution: Questions
Your sample found that the mean price of affordable homes in Johor
Bahru, Y, is RM 155,000 with a variance of RM 3.8 x 10^7. On the basis of a
normality assumption, how sure are you that:
(a) The mean price is really RM 160,000
(b) The mean price is between RM 145,000 and 160,000
Answer (a):
$Z = \frac{160,000 - 155,000}{\sqrt{3.8 \times 10^7}} = 0.811$
Always remember: to convert to SND, subtract the mean and divide by the std. dev.
Normal distribution: Questions
Answer (b):
$Z_1 = \frac{X_1 - \mu}{\sigma} = \frac{145,000 - 155,000}{\sqrt{3.8 \times 10^7}} = -1.622$
$Z_2 = \frac{X_2 - \mu}{\sigma} = \frac{160,000 - 155,000}{\sqrt{3.8 \times 10^7}} = 0.811$
P(Z < -1.622) = 0.0524; P(Z > 0.811) = 0.2087
P(145,000 < X < 160,000)
= 1 - (0.0524 + 0.2087)
= 0.7389
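As a cross-check on Answer (b), the tail areas can be computed rather than looked up. This sketch assumes Python 3.8+ for `statistics.NormalDist`; the variable names are illustrative, and computed values can differ from figures read off a printed Z table.

```python
from math import sqrt
from statistics import NormalDist

mean_price = 155_000
std_dev = sqrt(3.8e7)  # the variance is given, so take its square root

# Convert both price limits to the standard normal scale.
z1 = (145_000 - mean_price) / std_dev
z2 = (160_000 - mean_price) / std_dev

snd = NormalDist()              # standard normal distribution, N(0, 1)
p_below = snd.cdf(z1)           # area to the left of Z1
p_above = 1 - snd.cdf(z2)       # area to the right of Z2
p_between = 1 - (p_below + p_above)

print(f"Z1 = {z1:.3f}, Z2 = {z2:.3f}")
print(f"P(145,000 < X < 160,000) = {p_between:.4f}")
```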
Normal distribution: Questions
You are told by a property consultant that the
average rental for a shop house in Johor Bahru is
RM 3.20 per sq. ft. After searching, you discovered
the following rental data:
2.20, 3.00, 2.00, 2.50, 3.50, 3.20, 2.60, 2.00,
3.10, 2.70
What is the probability that the rental is greater
than RM 3.00?
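One possible way to approach this question is sketched below. It assumes rentals are normally distributed and uses the sample mean and sample standard deviation as the distribution's parameters; with only ten observations a t-distribution could also be argued for, so treat the result as indicative.

```python
from statistics import NormalDist, mean, stdev

# Rental data from the question.
rentals = [2.20, 3.00, 2.00, 2.50, 3.50, 3.20, 2.60, 2.00, 3.10, 2.70]

sample_mean = mean(rentals)
sample_sd = stdev(rentals)  # sample standard deviation (divisor n - 1)

# Assumption: rentals follow a normal distribution with these parameters.
rental_dist = NormalDist(sample_mean, sample_sd)
p_above_3 = 1 - rental_dist.cdf(3.00)

print(f"Sample mean = {sample_mean:.2f}, std. dev. = {sample_sd:.3f}")
print(f"P(rental > 3.00) = {p_above_3:.3f}")
```

Note that the sample mean falls below the consultant's quoted average of RM 3.20.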
Student's t-Distribution
Similar to Z-distribution:
* t(0,) but n1
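* $-\infty < t < +\infty$
* Flatter with thicker tails
* As $n \to \infty$, $t(0, \sigma) \to N(0, 1)$
* Has a function of
$f(t) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{t^2}{v}\right)^{-\frac{v+1}{2}}$
where $\Gamma$ = gamma function; v = n-1 = d.o.f.; $\pi$ = 3.142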
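The properties listed above (thicker tails, convergence to N(0,1) as n grows) can be illustrated with a short sketch of the standard Student's t density. The function names are assumptions; `math.lgamma` is used instead of `math.gamma` only to avoid overflow at large degrees of freedom.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(t, v):
    """Student's t density at t with v degrees of freedom."""
    log_coeff = lgamma((v + 1) / 2) - lgamma(v / 2) - 0.5 * log(v * pi)
    return exp(log_coeff - (v + 1) / 2 * log(1 + t * t / v))


def standard_normal_pdf(z):
    """Standard normal density, for comparison."""
    return exp(-z * z / 2) / sqrt(2 * pi)


for v in (1, 5, 30, 1000):
    print(f"v = {v:4d}: f(0) = {t_pdf(0, v):.4f}, f(3) = {t_pdf(3, v):.4f}")
print(f"N(0,1):   f(0) = {standard_normal_pdf(0):.4f}, "
      f"f(3) = {standard_normal_pdf(3):.4f}")
```

For small v the density at t = 3 is clearly larger than the normal density (thicker tails), and the two become close as v increases.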
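Student's t-Distribution
Given n independent measurements, $x_i$, let
$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
where $\mu$ is the population mean, $\bar{x}$ the sample mean, and $s$ the sample standard deviation.
Student's t-Distribution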
Student's t-distribution can be derived by:
* transforming Student's z-distribution using
* defining
The resulting probability and cumulative
distribution functions are:
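Student's t-Distribution
$f_r(t) = \frac{\Gamma\left(\frac{r+1}{2}\right)}{\sqrt{r\pi}\,\Gamma\left(\frac{r}{2}\right)} \left(1 + \frac{t^2}{r}\right)^{-\frac{r+1}{2}} = \frac{\left(\frac{r}{r + t^2}\right)^{\frac{r+1}{2}}}{\sqrt{r}\,B\left(\frac{r}{2}, \frac{1}{2}\right)}$
$F_r(t) = \int_{-\infty}^{t} f_r(u)\,du = \frac{1}{2} + \frac{1}{2}\left[1 - I\left(\frac{r}{r + t^2}; \frac{r}{2}, \frac{1}{2}\right)\right] \operatorname{sgn}(t)$
where $r \equiv n-1$ is the number of degrees of freedom, $-\infty < t < \infty$, $\Gamma(\cdot)$ is the gamma function,
$B(a,b)$ is the beta function, and $I(z;a,b)$ is the regularized beta function defined by
$I(z; a, b) = \frac{B(z; a, b)}{B(a, b)}$, with $B(z; a, b)$ the incomplete beta function.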
Correlation
Co-existence. E.g.
* left shoe & right shoe, sleep & lying down, food & drink
Indicates some co-existence relationship. E.g.
* Linearly associated (-ve or +ve)
* Co-dependent, independent
Formula:
$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
But, nothing to do with cause-and-effect (C-A-E) r/ship!
Example: After a field survey, you have the following
data on the distance to work and distance to the city
of residents in J.B. area. Interpret the results.
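A minimal sketch of how such survey data could be summarised with Pearson's correlation coefficient. The survey table is not reproduced in these notes, so the distances below are hypothetical values chosen only to illustrate a negative association; the function name `pearson_r` is also an assumption.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical distances in km (illustrative only, not the survey data).
distance_to_work = [2.0, 4.0, 6.0, 8.0, 10.0]
distance_to_city = [9.0, 7.5, 6.0, 3.5, 2.0]

r = pearson_r(distance_to_work, distance_to_city)
print(f"r = {r:.3f}")
```

A value of r near -1 would mean residents living closer to work tend to live farther from the city; it would not by itself show that one distance causes the other.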
Contingency
A form of conditional co-existence:
* If X, then, NOT Y; if Y, then, NOT X
* If X, then, ALSO Y
* E.g.
+ if they choose to live close to workplace,
then, they will stay away from city
+ if they choose to live close to city, then, they
will stay away from workplace
+ they will stay close to both workplace and city
Test yourselves!
Q1: Calculate the mean and variance of the following data:
PRICE - RM '000
SQ. M OF FLOOR
Q2: Calculate the mean price of the following low-cost houses, in various
localities across the country:
36
37
38
39
40
41
42
43
14
10
36
73
27
20
17
Test yourselves!
Q3: From sample information, a population of housing
estates is believed to have a normal distribution of X ~ N(155,
45). What is the general adjustment to obtain a Standard
Normal Distribution of this population?
Test yourselves!
Q5: Find:
(AGE > 30-34)
(AGE 20-24)
( 35-39 AGE < 50-54)
Test yourselves!
Q6: You are asked by a property marketing manager to ascertain whether
or not distance to work and distance to the city are equally important
factors influencing people's choice of house location.
You are given the following data for the purpose of testing:
Explore the data as follows:
Create histograms for both distances. Comment on the shape of the
histograms. What is your conclusion?
Construct scatter diagram of both distances. Comment on the output.
Explore the data and give some analysis.
Set a hypothesis that means of both distances are the same. Make
your conclusion.
Thank you