Course: Statistiek Voor Premasters
Course material:
Course-week 6
Lecturers:
Jochem de Bresser K539
Pieter-Jan Pauwelyn
Coordinator:
Pieter-Jan Pauwelyn
Overview, week 6 (Ch 5 & 19)
• Scatter plot, sample covariance and sample correlation
coefficient (for more information see chapter 5).
The type of questions we are studying
Examples:
I. How can we explain Y from X, if X =
‘weekly sales’ and Y = ‘weekly profit’?
II. What is the relationship between Y
and X, if X = ‘advertising costs’ and
Y = ‘weekly sales’?
III. What is the relationship between Y
and X, if X = ‘the height of a person’ and
Y = ‘the weight of a person’?
Introduction
General setting: population with k + 1 variables, the dependent
variable Y and the k independent variables $X_1, \ldots, X_k$.
[Figure: two scatter plots, hourly wage against age and happiness against GNI per capita.]
$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) > 0$
$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) < 0$
Population dataset
population covariance: $\sigma_{X,Y} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)$
Remarks:
- notice the different notations
- $s_{X,Y}$ is often used as an estimate of the unknown $\sigma_{X,Y}$
- division by n – 1 is used since it generally gives better
estimates
Example 5.A (continued) X = ‘weekly sales (in units of €10^5)’
Y = ‘weekly profit (in units of €10^4)’
    i    xi    yi    xi - x̄    yi - ȳ    (xi - x̄)²    (yi - ȳ)²    (xi - x̄)(yi - ȳ)
    1     2     5      -2       -0.62         4          0.38            1.23
    2     3     6      -1        0.38         1          0.15           -0.38
    3     4     7       0        1.38         0          1.92            0.00
    4     6     7       2        1.38         4          1.92            2.77
    5     3     8      -1        2.38         1          5.69           -2.38
    6     5     9       1        3.38         1         11.46            3.38
    7     4     2       0       -3.62         0         13.07            0.00
    8     2     2      -2       -3.62         4         13.07            7.23
    9     1     3      -3       -2.62         9          6.84            7.85
   10     8     9       4        3.38        16         11.46           13.54
   11     3     7      -1        1.38         1          1.92           -1.38
   12     5     5       1       -0.62         1          0.38           -0.62
   13     6     3       2       -2.62         4          6.84           -5.23
total    52    73       0        0.00        46         75.08           26.00
$s_X^2 = 46 / 12 = 3.83$
$s_Y^2 = 75.08 / 12 = 6.26$
$s_{X,Y} = 26.00 / 12 = 2.17$
Indeed, the covariance is positive!
But is it large?
We don’t have a reference point!
sample correlation coefficient: $r = r_{X,Y} = \frac{s_{X,Y}}{s_X s_Y}$
population correlation coefficient: $\rho = \rho_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y}$
$s_X^2 = 3.83$, $s_Y^2 = 6.26$, $s_{X,Y} = 2.17$
$r = \frac{2.17}{\sqrt{3.83}\,\sqrt{6.26}} = 0.44$
- As expected, there exists a positive linear relationship between
‘sales’ and ‘profit’.
- However, this relationship is not very strong.
- Apparently, there are more factors that influence the variation of
‘profit’....
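The calculation for Example 5.A can be retraced with a short script. This is only a minimal sketch: the lists hold the 13 (x, y) pairs from the table, and the helper names `sample_covariance` and `sample_correlation` are chosen here for illustration and are not part of the course material.

```python
from math import sqrt

# Data of Example 5.A: x = weekly sales, y = weekly profit
x = [2, 3, 4, 6, 3, 5, 4, 2, 1, 8, 3, 5, 6]
y = [5, 6, 7, 7, 8, 9, 2, 2, 3, 9, 7, 5, 3]


def sample_covariance(xs, ys):
    """Sample covariance with division by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    return cross / (n - 1)


def sample_correlation(xs, ys):
    """Sample correlation r = s_XY / (s_X * s_Y)."""
    s_xy = sample_covariance(xs, ys)
    s_x = sqrt(sample_covariance(xs, xs))  # covariance of X with itself is the variance
    s_y = sqrt(sample_covariance(ys, ys))
    return s_xy / (s_x * s_y)


s_xy = sample_covariance(x, y)
r = sample_correlation(x, y)
print(round(s_xy, 2), round(r, 2))
```

Unlike the covariance, r always lies between -1 and 1, which supplies the reference point that the covariance lacks.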
Short-cut formulae for covariance
When using a pocket calculator, it is better to use the calculation formulae
below; they give the same answers but are easier to handle.
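Results:
1) $\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y) = \sum_{i=1}^{N} x_i y_i - N \mu_X \mu_Y$
2) $\sigma_{X,Y} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y) = \frac{1}{N} \sum_{i=1}^{N} x_i y_i - \mu_X \mu_Y$
3) $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n \bar{x} \cdot \bar{y}$
4) $s_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1} \sum_{i=1}^{n} x_i y_i - \frac{n}{n-1} \bar{x} \cdot \bar{y}$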
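Result 3 follows from expanding the product and using $\sum x_i = n\bar{x}$ and $\sum y_i = n\bar{y}$. The derivation below is a sketch added for clarity; all sums run over $i = 1, \ldots, n$.

```latex
\begin{align*}
\sum (x_i - \bar{x})(y_i - \bar{y})
  &= \sum x_i y_i - \bar{y} \sum x_i - \bar{x} \sum y_i + n \bar{x} \bar{y} \\
  &= \sum x_i y_i - n \bar{x} \bar{y} - n \bar{x} \bar{y} + n \bar{x} \bar{y} \\
  &= \sum x_i y_i - n \bar{x} \bar{y}
\end{align*}
```

For the data of Example 5.A: $\sum x_i y_i = 318$ and $n \bar{x} \bar{y} = 13 \times 4 \times (73/13) = 292$, so the sum of cross products is $318 - 292 = 26$, which matches the table total. Dividing both sides by n – 1 gives result 4; the same steps with N, $\mu_X$ and $\mu_Y$ give results 1 and 2.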
Regression line (for Y on X)
Introduction
General setting: population with two variables, the
dependent variable Y and the independent variable X.
In general, Y will not be strictly linear in X. That is, Y is not
precisely equal to $\beta_0 + \beta_1 X$. Instead, it is written:
$Y = \beta_0 + \beta_1 X + \varepsilon$
The variable $\varepsilon = Y - (\beta_0 + \beta_1 X)$ is called the error (variable).
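[Figure: scatter plot with the line $y = b_0 + b_1 x$; for a data point $(x_i, y_i)$ the vertical deviation between $y_i$ and $b_0 + b_1 x_i$ is marked.]
How to choose $b_0$ and $b_1$? We use the least squares method.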
The vertical deviations, also called errors or residuals, are
$y_1 - (b_0 + b_1 x_1),\; y_2 - (b_0 + b_1 x_2),\; \ldots,\; y_n - (b_0 + b_1 x_n)$
Find $b_0$ and $b_1$ such that the sum of the squared deviations
is as small as possible; this results in the (best-fitting) least-squares line.
[Figure: population cloud with the line $y = \beta_0 + \beta_1 x$, and sample cloud with the line y = 2.107 + 0.070x.]
$B_1 = \frac{S_{X,Y}}{S_X^2}$ and $B_0 = \bar{Y} - B_1 \bar{X}$
$\hat{Y} = B_0 + B_1 X$
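The least-squares estimates can be computed directly from these two formulas. The sketch below applies them to the sales and profit data of Example 5.A, chosen only because that dataset is fully listed in these notes; the function name `least_squares_line` is a label introduced for this illustration.

```python
# Data of Example 5.A: x = weekly sales, y = weekly profit
x = [2, 3, 4, 6, 3, 5, 4, 2, 1, 8, 3, 5, 6]
y = [5, 6, 7, 7, 8, 9, 2, 2, 3, 9, 7, 5, 3]


def least_squares_line(xs, ys):
    """Return (b0, b1) with b1 = s_XY / s_X^2 and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys)) / (n - 1)
    s_x2 = sum((a - x_bar) ** 2 for a in xs) / (n - 1)
    b1 = s_xy / s_x2
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares_line(x, y)
print(f"y_hat = {b0:.3f} + {b1:.3f} x")
```

Because the factor 1/(n – 1) appears in both the numerator and the denominator of $b_1$, the slope here equals 26/46, the ratio of the two totals in the Example 5.A table.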
Example 19.3 (continued) Supermarket; Y = ‘weekly sales (× €1000)’.
X = ‘weekly advertising costs (× €1000)’.
$\bar{x} = 2.2429$, $\bar{y} = 84.6$
$s_X^2 = 0.1362$, $s_{X,Y} = 2.47$
$\hat{y} = 43.847 + 18.17x$
Suppose next week's advertising costs are €2100, then
$\hat{y} = 43.847 + 18.17 \times 2.1 = 82.0$
We predict that next week's sales will be €82000.
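The same prediction as a small sketch in code. The coefficients are the rounded values of the sample regression line quoted above; the function name `predict_sales` is hypothetical.

```python
B0 = 43.847  # intercept of the sample regression line (x EUR 1000)
B1 = 18.17   # slope of the sample regression line


def predict_sales(advertising_costs):
    """Predicted weekly sales (x EUR 1000) for advertising costs given in x EUR 1000."""
    return B0 + B1 * advertising_costs


y_hat = predict_sales(2.1)  # EUR 2100 expressed in units of EUR 1000
print(round(y_hat, 1))
```

Note that the input must be in the units of X, so €2100 enters as 2.1 and the result is read in thousands of euros.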
How precise is this estimate?
To get an idea about the predicting performance of the
regression line, we can calculate the residuals
(observations of the error term for the data points).
- predictions of $y_i$:
  $\hat{y}_1 = 43.84 + 18.172 \times 2.0 = 80.180$
  ...
  $\hat{y}_7 = 43.84 + 18.172 \times 2.9 = 96.533$
- residuals $e_i$ (observations of the error term):
  $e_1 = y_1 - \hat{y}_1 = 80.2 - 80.180 = 0.020$
  ...
  $e_7 = y_7 - \hat{y}_7 = 93.4 - 96.533 = -3.133$
- sum of squared errors (SSE):
  $SSE = (0.020)^2 + \cdots + (-3.133)^2 = 72.510$
Example 19.3 Supermarket; Y = ‘weekly sales (× €1000)’.
X = ‘weekly advertising costs (× €1000)’.
Sample regression line: $\hat{y} = 43.847 + 18.17x$
Interpretation coefficients:
- Slope: spending an extra €1000 each week on advertising
leads on average to an extra € 18170 each week in sales.
- We cannot interpret the intercept of this equation, since
x = 0 lies outside the range of the observed x-values.
Estimator of the model variance $\sigma_\varepsilon^2$
The model variance measures the variation around the line
of means in the population cloud. Its natural estimator will
measure the variation around the sample regression line in
the sample cloud.
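[Figure: sample cloud with the line $y = b_0 + b_1 x$; the residual $e_i$ is the vertical distance between $y_i$ and $b_0 + b_1 x_i$ at $x_i$.]
$s_\varepsilon^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2 \quad \left(= \frac{SSE}{n-2}\right)$
$S_\varepsilon = \sqrt{S_\varepsilon^2}$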
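As a quick numerical illustration, the estimate can be computed from the SSE of the supermarket example (SSE = 72.510, n = 7, both taken from these notes). The function name `model_variance` is introduced only for this sketch.

```python
from math import sqrt


def model_variance(sse, n):
    """Estimate of the model variance: SSE divided by n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("need more than two observations")
    return sse / (n - 2)


s_eps2 = model_variance(72.510, 7)  # supermarket example
s_eps = sqrt(s_eps2)                # estimated standard deviation of the error term
print(round(s_eps2, 3), round(s_eps, 3))
```

The division by n – 2 rather than n – 1 reflects that two parameters, intercept and slope, are estimated before the residuals can be formed.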
$L = B_1 - t_{\alpha/2;\,n-2} \frac{S_\varepsilon}{\sqrt{(n-1) s_X^2}}$ and $U = B_1 + t_{\alpha/2;\,n-2} \frac{S_\varepsilon}{\sqrt{(n-1) s_X^2}}$
This confidence interval fits the general form
estimator ± constant × SE(estimator)
We often write:
$SD(B_1) = \sigma_{B_1} = \frac{\sigma_\varepsilon}{\sqrt{(n-1) s_X^2}}$ = standard deviation of $B_1$
$SE(B_1) = S_{B_1} = \frac{S_\varepsilon}{\sqrt{(n-1) s_X^2}}$ = standard error of $B_1$
Example 19.6 supermarket: Y = ‘weekly sales (× €1000)’; X
= ‘weekly advertising costs (× €1000)’; n = 7.
Question: Does X influence Y? Use a 95% CI.
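One way to approach this question is sketched below. It uses $b_1 = 18.172$ and $s_{B_1} = 4.2124$, the values listed for the same supermarket data in Example 19.8 of these notes. The critical value $t_{0.025;5} = 2.5706$ is read from a standard t-table and is an assumption of this sketch, not a number given in the slides.

```python
b1 = 18.172      # estimated slope (from Example 19.8)
se_b1 = 4.2124   # standard error of B1 (from Example 19.8)
t_crit = 2.5706  # t_{0.025; 5} from a standard t-table (assumption of this sketch)

# estimator +/- constant * SE(estimator)
margin = t_crit * se_b1
lower, upper = b1 - margin, b1 + margin
print(f"95% CI for beta_1: ({lower:.2f}, {upper:.2f})")

# X is taken to influence Y at this confidence level if 0 lies outside the interval
zero_excluded = not (lower <= 0 <= upper)
print(zero_excluded)
```

If the interval excludes 0, the data support the claim that advertising costs influence weekly sales at the 95% confidence level.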
General form of the test statistic:
(estimator – hinge) / SE(estimator)
Example 19.8 supermarket: Y = ‘weekly sales
(× €1000)’; X = ‘weekly advertising costs (× €1000)’;
n = 7. Question: If the weekly advertising costs are increased by
€1000, will the average weekly sales increase by more
than €15000? Use a test with $\alpha = 0.05$.
Data:
$b_1 = 18.172$; $s_X^2 = 0.1362$; $s_\varepsilon^2 = \frac{SSE}{5} = \frac{72.510}{5} = 14.502$
$s_{B_1} = \frac{s_\varepsilon}{\sqrt{(n-1) s_X^2}} = 4.2124$
(i) Hypotheses: $H_0: \beta_1 \leq 15$ vs $H_1: \beta_1 > 15$; $\alpha = 0.05$
(ii) Test statistic: $T = \frac{B_1 - 15}{S_\varepsilon / \sqrt{(n-1) s_X^2}} = \frac{B_1 - 15}{SE(B_1)} = \frac{B_1 - 15}{S_{B_1}}$
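The remaining steps of this test are not reproduced in these notes. The sketch below shows how they might be carried out with the data of Example 19.8. The critical value $t_{0.05;5} = 2.015$ comes from a standard t-table and is an assumption of this sketch; recomputing the standard error from the rounded inputs gives about 4.21, in line with the 4.2124 listed above.

```python
from math import sqrt

n = 7
b1 = 18.172      # estimated slope
s_x2 = 0.1362    # sample variance of X
s_eps2 = 14.502  # estimated model variance, SSE / (n - 2)
hinge = 15       # boundary value of beta_1 under H0
t_crit = 2.015   # t_{0.05; 5} from a standard t-table (assumption of this sketch)

# Standard error of B1: s_eps / sqrt((n - 1) * s_X^2)
se_b1 = sqrt(s_eps2) / sqrt((n - 1) * s_x2)

# Observed value of the test statistic T = (B1 - 15) / S_B1
val = (b1 - hinge) / se_b1

# Right-tailed test: reject H0 only if val exceeds the critical value
reject_h0 = val > t_crit
print(round(se_b1, 2), round(val, 2), reject_h0)
```

Under these assumptions the observed value stays well below the critical value, so this sketch would not reject $H_0$ at $\alpha = 0.05$.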
Conjectures:
- We expect a positive linear relationship between Y and X.
- But 1 cm extra height will on average correspond to less
than 0.75 kg extra weight.
The first conjecture is $\beta_1 > 0$. We will use the p-value and
the printout to comment on this conjecture.
(i) Hypotheses: $H_0: \beta_1 \geq 0.75$ vs $H_1: \beta_1 < 0.75$; $\alpha = 0.02$
(ii) Test statistic: $T = \frac{B_1 - 0.75}{S_\varepsilon / \sqrt{(n-1) s_X^2}} = \frac{B_1 - 0.75}{S_{B_1}}$
(iv) val: $val = \frac{0.584 - 0.75}{0.078} = -2.128$
(v) Conclusion: Reject $H_0$, since $-2.128 < -2.065$. The data
indicate that the second conjecture is also
true.
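Steps (iv) and (v) can be retraced numerically. In this sketch the estimate 0.584, the standard error 0.078 and the critical value -2.065 are all taken from the lines above; the variable names are chosen for illustration only.

```python
b1 = 0.584       # estimated slope (kg per cm), read from the printout
se_b1 = 0.078    # standard error of B1, read from the printout
hinge = 0.75     # boundary value of beta_1 under H0
t_crit = -2.065  # left-tail critical value quoted in step (v)

# Observed value of the test statistic T = (B1 - 0.75) / S_B1
val = (b1 - hinge) / se_b1

# Left-tailed test: reject H0 if val falls below the critical value
reject_h0 = val < t_crit
print(round(val, 3), reject_h0)
```

The direction of the comparison follows the alternative hypothesis: for $H_1: \beta_1 < 0.75$ the rejection region lies in the left tail.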
We can also calculate a 95% confidence interval for $\beta_1$ and
use it to comment on the conjectures.
$b_1 \pm t_{\alpha/2;\,n-2} \cdot s_{B_1} = 0.584 \pm 1.9699 \times 0.078$
$= 0.584 \pm 0.1537$
$= (0.430, 0.737)$
We have 95% confidence that this interval captures $\beta_1$. Since
all values in the interval are positive, we also have at least 95% confidence
that $\beta_1 > 0$. The interval lies entirely to the left of
0.75, which means that we can also conclude that $\beta_1 < 0.75$.
Additional Exercise 4
When the economy is growing, finance ministers often fear
that inflation will grow too. In this exercise we will
study whether this fear is justified.
The file Xrc05-14.xls contains data from the Federal
Statistical Office of Germany about X = ‘GDP growth (%)
of a country in 2005’ and Y = ‘inflation rate (%) of that
country in 2005’. The dataset contains measurements of
both X and Y for 165 countries. We will use the data of 164
of these countries to study the relationship between Y and
X; the observation for Zimbabwe is excluded since this
country's inflation rate of 302.2 would influence the
results too much. Use the SPSS regression output below to
answer the questions.
Additional Exercise 4 (continued)
a) Why do we take inflation rate as dependent and
GDP growth as independent variable (and not the
other way around)?
b) Use the printout to determine the equation of the
regression line. Interpret the regression coefficient.
c) Calculate the predicted inflation for Denmark, with
a GDP growth of 3.2% and inflation 1.7%. Also
calculate the accompanying residual.
d) Test whether the variable X is significant in
explaining the variation of Y; use a significance
level of α = 0.05.
e) Give the p-value of the test in d) (use the printout).
Interpret your answer.
Wrap up
This week:
• Sample correlation and covariance
• Simple linear regression (basics)
Next week:
• Linear regression using software tools
• ANOVA