
Course: Statistiek voor Premasters

Course material:
Course-week 6
Lecturers:
Jochem de Bresser K539
Pieter-Jan Pauwelyn

Coordinator:
Pieter-Jan Pauwelyn
Overview, week 6 (Ch 5 & 19)
• Scatter plot, sample covariance and sample correlation
coefficient (for more information see chapter 5).

• Simple linear regression model (Ch 19)


• Estimation of the coefficients,
• Model assumptions,
• Interpretation of the coefficients,
• Confidence intervals and tests for the slope.

The type of questions we are studying

Examples:
I. How can we explain Y from X, if X =
‘weekly sales’ and Y = ‘weekly profit’?
II. What is the relationship between Y
and X, if X = ‘advertising costs’ and
Y = ‘weekly sales’?
III. What is the relationship between Y
and X, if X = ‘the height of a person’ and
Y = ‘the weight of a person’?
Introduction
General setting: population with k + 1 variables, the dependent
variable Y and the k independent variables X1, ..., Xk.

Objective: in what way does Y depend on X1, ..., Xk? What is
the strength of that relationship?
If we know that relationship, it can be used to predict Y
for a population element for which only the x-data (the
x-characteristics) are measured. How will Y respond
if Xi changes?
For the time being, only one independent variable X is
considered and we consider linear models. Note that in
general not all population elements can be observed, so our
analysis will be based on a sample.
5.1 Scatter plot, Covariance and Correlation
Quantitative variables X and Y
population dataset: (x1, y1), (x2, y2), ..., (xN, yN)
sample dataset: (x1, y1), (x2, y2), ..., (xn, yn)
Questions:
- Do the y-data tend to be linearly related to the corresponding x-
data?
That is, do the pairs of data concentrate around a straight line?
- If so, what is the degree (strength) of that linear relationship?
Objectives:
- We want a plot that visually shows the relationship between the
y-data and the x-data.
- We want so-called measures of association that
measure the strength of the linear relationship.
Scatter plot: plot that depicts the pairs (xi, yi)

[Figure, left panel: 20 households; RECR EXP (vertical axis) against HINC (horizontal axis).]
An increasing straight line is suggested; the y- and x-data
are said to be positively linearly related.

[Figure, right panel: Euro Area countries; Deficit (vertical axis) against GDP growth (%) (horizontal axis).]
A decreasing straight line is suggested; the y- and x-data
are said to be negatively linearly related.
[Figure, left panel: 40 individuals; hourly wage (vertical axis) against age (horizontal axis).]
A mountain-shaped quadratic relationship is suggested.

[Figure, right panel: 16 countries; happiness (vertical axis) against GNI per capita (horizontal axis).]
Some other relationship is suggested.
Covariance
Example 5.A X = ‘weekly sales (in units of 10^5 euro)’
Y = ‘weekly profit (in units of 10^4 euro)’
x: 2 3 4 6 3 5 4 2 1 8 3 5 6   (mean x̄ = 4)
y: 5 6 7 7 8 9 2 2 3 9 7 5 3   (mean ȳ = 5.62)

[Figure: scatter plot of the 13 pairs. The lines x = x̄ and y = ȳ are included; they divide the plane into the quadrants I (upper right), II (upper left), III (lower left) and IV (lower right).]

The scatter plot seems to suggest a weak positive linear relationship.
- We search for measures of the strength of the linear relationship;
- a positive (negative) result must refer to a positive (negative) linear
relationship.

In quadrants I and III: (xi - x̄)(yi - ȳ) > 0
In quadrants II and IV: (xi - x̄)(yi - ȳ) < 0
-------------------------------------------------------------------------------------------------------------
If the dots in I and III are more frequent and
farther from (x̄, ȳ), it is expected that:

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) > 0$$

If the dots in II and IV are more frequent and
farther from (x̄, ȳ), it is expected that:

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) < 0$$

This motivates the following definitions:
Definitions of covariance
Sample dataset: sample covariance

$$s_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

Population dataset: population covariance

$$\sigma_{X,Y} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y)$$

Remarks:
- notice the different notations
- sX,Y is often used as an estimate of the unknown σX,Y
- division by n – 1 is used since it in general gives better
estimates
Example 5.A (continued) X = ‘weekly sales (in units of 10^5 euro)’
Y = ‘weekly profit (in units of 10^4 euro)’

    i    xi    yi   xi - x̄   yi - ȳ   (xi - x̄)²   (yi - ȳ)²   (xi - x̄)(yi - ȳ)
    1     2     5     -2      -0.62       4         0.38          1.23
    2     3     6     -1       0.38       1         0.15         -0.38
    3     4     7      0       1.38       0         1.92          0.00
    4     6     7      2       1.38       4         1.92          2.77
    5     3     8     -1       2.38       1         5.69         -2.38
    6     5     9      1       3.38       1        11.46          3.38
    7     4     2      0      -3.62       0        13.07          0.00
    8     2     2     -2      -3.62       4        13.07          7.23
    9     1     3     -3      -2.62       9         6.84          7.85
   10     8     9      4       3.38      16        11.46         13.54
   11     3     7     -1       1.38       1         1.92         -1.38
   12     5     5      1      -0.62       1         0.38         -0.62
   13     6     3      2      -2.62       4         6.84         -5.23
total    52    73      0       0.00      46        75.08         26.00
$$s_X^2 = 46 / 12 = 3.83$$

$$s_Y^2 = 75.08 / 12 = 6.26$$

$$s_{X,Y} = 26.00 / 12 = 2.17$$
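The three values above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the helper names `mean` and `sample_covariance` are illustrative choices, not part of the course material.

```python
# Data of Example 5.A (13 weekly observations).
x = [2, 3, 4, 6, 3, 5, 4, 2, 1, 8, 3, 5, 6]  # weekly sales
y = [5, 6, 7, 7, 8, 9, 2, 2, 3, 9, 7, 5, 3]  # weekly profit


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_covariance(a, b):
    """Sum of the products of deviations from the means, divided by n - 1."""
    a_bar, b_bar = mean(a), mean(b)
    cross = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    return cross / (len(a) - 1)


# The sample variance is the sample covariance of a variable with itself.
s_x2 = sample_covariance(x, x)
s_y2 = sample_covariance(y, y)
s_xy = sample_covariance(x, y)

print(round(s_x2, 2), round(s_y2, 2), round(s_xy, 2))  # 3.83 6.26 2.17
```

Writing the variance as the covariance of a variable with itself keeps the sketch short and makes the parallel between the two definitions visible.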
Indeed, the covariance is positive.
But is it large?
We do not have a reference point.

Idea: divide the covariance by the product of the two
standard deviations.
-------------------------------------------------------------------------------------------------------------
Correlation

sample correlation coefficient:

$$r = r_{X,Y} = \frac{s_{X,Y}}{s_X \, s_Y}$$

population correlation coefficient:

$$\rho = \rho_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \, \sigma_Y}$$

remarks: - notice the different notations
- correlation coefficients are dimensionless
property: –1 ≤ r ≤ 1 and –1 ≤ ρ ≤ 1
• A correlation coefficient measures the degree of linear
relationship, the strength of the tendency of the x- and
y-data to concentrate around one straight line.
• In case of +1, all dots lie precisely on one increasing
line.
• In case of -1, all dots lie precisely on one decreasing
line.
• If the correlation coefficient is close to +1 (resp. -1),
then the y-data and the x-data are strongly positively
(resp. negatively) linearly related.
• If r = 0, it is said that the y-data and the x-data are
uncorrelated.
• If ρ = 0, it is said that X and Y are uncorrelated.
• But notice that correlation only measures linear
dependence between Y and X.
Example 5.A (continued)
X = ‘weekly sales (in units of 10^5 euro)’
Y = ‘weekly profit (in units of 10^4 euro)’
Recall:

$$s_X^2 = 3.83, \qquad s_Y^2 = 6.26, \qquad s_{X,Y} = 2.17$$

$$r = \frac{2.17}{\sqrt{3.83}\,\sqrt{6.26}} = 0.44$$
- As expected, there exists a positive linear relationship between
‘sales’ and ‘profit’.
- However, this relationship is not very strong.
- Apparently, there are more factors that influence the variation of
‘profit’....
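As a check on the value 0.44, the sample correlation coefficient of Example 5.A can be computed directly from the raw data. The sketch below uses only the standard library; the function name `sample_correlation` is an illustrative choice.

```python
import math

# Data of Example 5.A (13 weekly observations).
x = [2, 3, 4, 6, 3, 5, 4, 2, 1, 8, 3, 5, 6]  # weekly sales
y = [5, 6, 7, 7, 8, 9, 2, 2, 3, 9, 7, 5, 3]  # weekly profit


def sample_correlation(a, b):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    s_ab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)) / (n - 1)
    s_a = math.sqrt(sum((ai - a_bar) ** 2 for ai in a) / (n - 1))
    s_b = math.sqrt(sum((bi - b_bar) ** 2 for bi in b) / (n - 1))
    return s_ab / (s_a * s_b)


r = sample_correlation(x, y)
print(round(r, 2))  # 0.44
```

Because the factors 1/(n – 1) cancel in the ratio, the result does not depend on whether n or n – 1 is used as divisor, as long as the same divisor is used throughout.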
-------------------------------------------------------------------------------------------------------------
Short-cut formulae for covariance
When using a pocket calculator, it is better to use the calculation formulae
below, which give the same answers but are easier to handle.
Results:

1) $$\sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y) = \sum_{i=1}^{N} x_i y_i - N \mu_X \mu_Y$$

2) $$\sigma_{X,Y} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_X)(y_i - \mu_Y) = \frac{1}{N} \sum_{i=1}^{N} x_i y_i - \mu_X \mu_Y$$

3) $$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n \, \bar{x} \, \bar{y}$$

4) $$s_{X,Y} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \frac{1}{n-1} \sum_{i=1}^{n} x_i y_i - \frac{n}{n-1} \, \bar{x} \, \bar{y}$$

Interpretation of the population covariance:
mean of the products minus the product of the means
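Result 3 follows by expanding the product and using that the sum of the x-data equals n times their mean (and likewise for the y-data). A short derivation, added here as a sketch of the missing step:

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})
  &= \sum_{i=1}^{n} x_i y_i - \bar{y} \sum_{i=1}^{n} x_i - \bar{x} \sum_{i=1}^{n} y_i + n \, \bar{x} \, \bar{y} \\
  &= \sum_{i=1}^{n} x_i y_i - n \, \bar{x} \, \bar{y} - n \, \bar{x} \, \bar{y} + n \, \bar{x} \, \bar{y} \\
  &= \sum_{i=1}^{n} x_i y_i - n \, \bar{x} \, \bar{y}
\end{aligned}
```

For Example 5.A the sum of the products xi yi is 318 and n x̄ ȳ = 13 × 4 × (73/13) = 292, so the right-hand side gives 318 – 292 = 26, the same total as in the table. Result 1 is obtained in the same way with N, µX and µY; results 2 and 4 follow after division by N and n – 1, respectively.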
General objectives
- Interested in the variation of a variable Y.
- We try to explain this variation from the
dependence of Y on some other variable X.
- We talk about regression of Y on X. In this
context, Y is called the dependent variable and X
is called the independent variable.
Regression line (for Y on X)
Introduction
General setting: population with two variables, the
dependent variable Y and the independent variable X.
In general, Y will not be strictly linear in X. That is, Y is not
precisely equal to β0 + β1X. Instead, it is written:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

The variable ε = Y – (β0 + β1X) is called the error (variable).

The coefficient β0 is the y-intercept and β1 is the slope.

What is the best-fitting line that estimates this linear
relationship?
The least-squares method

[Figure: sample cloud with a candidate line y = b0 + b1x; for a data point (xi, yi) the vertical deviation between yi and b0 + b1xi is marked.]

How to choose b0 and b1? We use the least-squares method.
The vertical deviations, also called errors or residuals, are
y1 – (b0 + b1x1), y2 – (b0 + b1x2), ..., yn – (b0 + b1xn)
Find b0 and b1 such that the sum of the squared deviations
is as small as possible; this results in the (best-fitting)
least-squares line.
[Figure, left panel: population cloud with the line y = β0 + β1x. Right panel: sample cloud with the fitted line y = 2.107 + 0.070x.]

Population cloud:
- The population cloud is unobserved; at best, we
have a vague idea of it.
- The equation y = β0 + β1x of the
(best-fitting) least-squares line
is also unknown.

Sample cloud:
- The sample cloud is observed.
- The equation y = b0 + b1x of the
(best-fitting) least-squares line is
known. It estimates the
unknown population line.
We will consider formal methods that generate estimates of
the parameters on the basis of a sample (X1, Y1), ..., (Xn, Yn).

For the estimation of β0 and β1 we adopt the least-squares
method (Chapter 5), which generates estimates by
determining the straight line that in a sense fits best
through the sample cloud. This method yields the
following estimators of β1 and β0, respectively:

$$B_1 = \frac{S_{X,Y}}{S_X^2} \quad \text{and} \quad B_0 = \bar{Y} - B_1 \bar{X}$$

The sample regression line is denoted by

$$\hat{Y} = B_0 + B_1 X$$
Example 19.3 (continued) Supermarket; Y = ‘weekly sales (x
€1000)’. X = ‘weekly advertising costs (x €1000)’.

Sample:
x    2.0   2.2   2.0   2.5   1.8   2.3   2.9
y   80.2  85.6  74.3  93.9  77.8  87.0  93.4

$$\bar{x} = 2.2429, \quad \bar{y} = 84.6, \quad s_X^2 = 0.1362, \quad s_{X,Y} = 2.475$$

Then:

$$b_1 = \frac{2.475}{0.1362} = 18.17$$

$$b_0 = 84.6 - 18.17 \times 2.24286 = 43.847$$

$$\hat{y} = 43.847 + 18.17x$$

Note the difference in notations: B1 versus b1 and B0 versus b0
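The estimates of Example 19.3 can be reproduced from the raw data. The following is a minimal sketch using only the standard library; the function name `least_squares` is an illustrative choice.

```python
# Data of Example 19.3 (both variables in units of 1000 euro).
x = [2.0, 2.2, 2.0, 2.5, 1.8, 2.3, 2.9]          # weekly advertising costs
y = [80.2, 85.6, 74.3, 93.9, 77.8, 87.0, 93.4]   # weekly sales


def least_squares(x, y):
    """Return (b0, b1) with b1 = s_XY / s_X^2 and b0 = ybar - b1 * xbar."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    squares = sum((xi - x_bar) ** 2 for xi in x)
    b1 = cross / squares  # the common factor 1/(n - 1) cancels
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(x, y)
prediction = b0 + b1 * 2.1  # predicted sales for advertising costs of 2100 euro

print(round(b1, 3), round(b0, 3), round(prediction, 1))
```

Working with the unrounded slope should give an intercept of about 43.84 rather than the 43.847 obtained above from the rounded slope 18.17; the difference is a rounding effect and the prediction at x = 2.1 stays at about 82.0 either way.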
Suppose next week's advertising costs are €2100, then
ŷ = 43.847 + 18.17 × 2.1 = 82.0
We predict that next week's sales will be €82000.
How precise is this estimate?
To get an idea of the predictive performance of the
regression line, we can calculate the residuals
(observations of the error term for the data points).
- predictions of yi:
ŷ1 = 43.84 + 18.172 × 2.0 = 80.184
...
ŷ7 = 43.84 + 18.172 × 2.9 = 96.539
- residuals ei (observations of the error term):
e1 = y1 - ŷ1 = 80.2 - 80.184 = 0.016
...
e7 = y7 - ŷ7 = 93.4 - 96.539 = -3.139
- sum of squared errors (SSE):
SSE = (0.016)² + ... + (−3.139)² = 72.510
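The residuals and SSE can be computed for all seven data points at once. This is a sketch with the standard library only; it refits the line with unrounded coefficients, so individual residuals may differ in the last decimals from values based on rounded coefficients.

```python
# Data of Example 19.3 (both variables in units of 1000 euro).
x = [2.0, 2.2, 2.0, 2.5, 1.8, 2.3, 2.9]          # weekly advertising costs
y = [80.2, 85.6, 74.3, 93.9, 77.8, 87.0, 93.4]   # weekly sales

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least-squares estimates of slope and intercept.
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Fitted values, residuals and the sum of squared errors.
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
sse = sum(e ** 2 for e in residuals)

for xi, yi, fi, ei in zip(x, y, fitted, residuals):
    print(f"x={xi:.1f}  y={yi:.1f}  fitted={fi:.3f}  residual={ei:+.3f}")
print("SSE =", round(sse, 3))
```

A useful sanity check is that the residuals of a least-squares line with an intercept sum to zero (up to floating-point error).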

But the value of SSE is difficult to interpret. We need to
know more about the precision of the estimates of β0 and
β1; for that, we need some model assumptions.
Simple linear regression model
A simple linear regression model is the combination of the
following requirements (model assumptions):
1. Y = β0 + β1X + ε and E(ε | x) = 0 for all values x of X
(linearity, this is the basic assumption)
2. For all fixed x of X, the subpopulations with X = x have
the same variance (called σε²) of ε (homoskedasticity)
3. For all fixed x of X, the distribution of ε on the
subpopulation with X = x is normal (normality)
4. ε1, ..., εn are independent
5. εi and Xj are independent for all i, j

Conditions (1), (2), (3) can be summarized as: ε ~ N(0, σε²)
conditionally on X = x (for all x). Here, σε is called the
standard deviation of the model.
Consequences for Y

[Figure: distributions of Y along the vertical lines x = x1 and x = x2.]
So, along all vertical lines the distributions
only have different means.

For all values x of X it holds that: Y ~ N(β0 + β1x, σε²)

The unknown population line is called the line of means:
E(Y | x) = β0 + β1x (which is an alternative way to
formulate the basic assumption)

Although incorrect, we will often write E(ε) and E(Y)
instead of E(ε | x) and E(Y | x).
Interpretation of the parameters
Note that:
  E(Y | x+1) = β0 + β1x + β1
  E(Y | x)   = β0 + β1x
  --------------------------------- −
  E(Y | x+1) − E(Y | x) = β1

• β1 is just the change (positive or negative) of the
expectation of Y if x increases by 1
• β0 is just the expectation of Y if x = 0
(but this interpretation is only allowed if x = 0 is possible
within the population)
• σε² measures for the population the variation around
the line of means
Example 19.3 Supermarket; Y = ‘weekly sales (x €1000)’.
X = ‘weekly advertising costs (x €1000)’.
Sample regression line: ŷ = 43.847 + 18.17x
Interpretation of the coefficients:
- Slope: spending an extra €1000 each week on advertising
leads on average to an extra €18170 each week in sales.
- We cannot interpret the intercept of this equation, since
0 does not belong to the x-data.
Estimator of the model variance σε²
The model variance measures the variation around the line
of means in the population cloud. Its natural estimator will
measure the variation around the sample regression line in
the sample cloud.

[Figure: sample cloud with the line y = b0 + b1x; the residual ei is the vertical deviation of yi from b0 + b1xi.]

$$s_\varepsilon^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2 = \frac{SSE}{n-2}$$

n – 2 is used since this, in general, yields better estimates.
The estimator of the model variance is based on the sample
of the random pairs (X1, Y1), ..., (Xn, Yn), the random
intercept B0 and the random slope B1, so it is a random
variable.

$$S_\varepsilon^2 = \frac{1}{n-2} \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2, \qquad S_\varepsilon = \sqrt{S_\varepsilon^2}$$

Sε is called the standard error of the estimated model.

Example 19.3 (continued) Supermarket; Y = ‘weekly sales (x
€1000)’. X = ‘weekly advertising costs (x €1000)’.

$$s_\varepsilon = \sqrt{\frac{72.510}{7-2}} = 3.808$$

Note the difference in notations: Sε versus sε
A model is useful if β1 ≠ 0. In this case it is said that the
variable X is significant in the model. Note that β1 = 0 indicates
that there is no linear relationship between Y and X (which
doesn’t mean that there is no relationship between Y and X).

We can use a confidence interval or a test to answer the
question whether we can conclude that β1 ≠ 0. To do so we
need to know more about the distribution of the estimator
B1.
Properties of the estimators
Suppose that the simple linear regression model is valid.
If it is already given that X1 = x1, ..., Xn = xn, then:
• B1, B0 and Sε² are unbiased and consistent estimators of β1,
β0 and σε².
• B1 and B0 are normally distributed:

$$B_1 \sim N\left(\beta_1, \; \frac{\sigma_\varepsilon^2}{(n-1)s_X^2}\right) \quad \text{and} \quad B_0 \sim N\left(\beta_0, \; \sigma_\varepsilon^2 \left(\frac{1}{n} + \frac{\bar{x}^2}{(n-1)s_X^2}\right)\right)$$

These are conditional results. For instance:
- E(B1 | x1, …, xn) = β1. Since the right-hand side doesn’t
depend on x1, …, xn, it also holds unconditionally.
- V(B1 | x1, …, xn) = σε² / ((n – 1) sX²), which does depend on
x1, …, xn.
Thanks to these results, we can use a normal distribution
to calculate probabilities concerning B1 and B0, at least if
the model requirements are valid.
We will formulate an interval estimator and hypothesis
tests for the parameter β1.

$$Z = \frac{B_1 - \beta_1}{\sigma_\varepsilon / \sqrt{(n-1)s_X^2}} \sim N(0, 1)$$

Z is a pivot.
To get rid of the unknown σε we replace it by its (consistent)
estimator Sε at the cost of getting a t-distribution with n – 2
df (recall Sε had n – 2 df):

$$T = \frac{B_1 - \beta_1}{S_\varepsilon / \sqrt{(n-1)s_X^2}} \sim t_{n-2}$$
Confidence interval for the slope
Rewriting yields the interval estimator:

$$L = B_1 - t_{\alpha/2;\,n-2} \frac{S_\varepsilon}{\sqrt{(n-1)s_X^2}} \quad \text{and} \quad U = B_1 + t_{\alpha/2;\,n-2} \frac{S_\varepsilon}{\sqrt{(n-1)s_X^2}}$$

When the data are available and substituted, a
100(1 – α)% confidence interval (l, u) follows.
This confidence interval fits the general form
estimator ± constant × SE(estimator)

We often write:

$$SD(B_1) = \sigma_{B_1} = \frac{\sigma_\varepsilon}{\sqrt{(n-1)s_X^2}} \quad \text{(standard deviation of } B_1\text{)}$$

$$SE(B_1) = S_{B_1} = \frac{S_\varepsilon}{\sqrt{(n-1)s_X^2}} \quad \text{(standard error of } B_1\text{)}$$
Example 19.6 Supermarket: Y = ‘weekly sales (x €1000)’; X
= ‘weekly advertising costs (x €1000)’; n = 7.
Question: Does X influence Y? Use a 95% CI.

95% CI:

$$b_1 = 18.172; \quad s_X^2 = 0.1362; \quad s_\varepsilon^2 = \frac{SSE}{5} = \frac{72.510}{5} = 14.502$$

$$s_{B_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_X^2}} = 4.2124; \quad t_{0.025;5} = 2.5706 \quad \text{(Excel: T.INV(0.975,5))}$$

$$l = b_1 - t_{0.025;5} \, s_{B_1} = 18.17 - 2.5706 \times 4.2124 = 7.342$$

$$u = b_1 + t_{0.025;5} \, s_{B_1} = 18.17 + 2.5706 \times 4.2124 = 29.00$$

There is 95% confidence that β1 lies in (7.342, 29.00).
Since 0 is not in this interval, it can be concluded that
β1 ≠ 0. The model is useful; X and Y are linearly related.
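The interval can be recomputed from the raw data of the supermarket example. The sketch below uses the standard library only and therefore takes the critical value t0.025;5 = 2.5706 from the example above as a given constant instead of computing it; all variable names are illustrative.

```python
import math

# Data of the supermarket example (both variables in units of 1000 euro).
x = [2.0, 2.2, 2.0, 2.5, 1.8, 2.3, 2.9]          # weekly advertising costs
y = [80.2, 85.6, 74.3, 93.9, 77.8, 87.0, 93.4]   # weekly sales

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sum_sq_x = sum((xi - x_bar) ** 2 for xi in x)    # equals (n - 1) * s_X^2

# Least-squares estimates.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum_sq_x
b0 = y_bar - b1 * x_bar

# Estimated standard deviation of the model and standard error of the slope.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s_eps = math.sqrt(sse / (n - 2))
se_b1 = s_eps / math.sqrt(sum_sq_x)

# Critical value copied from the example (Excel: T.INV(0.975,5)); an assumption here.
T_CRIT = 2.5706

lower = b1 - T_CRIT * se_b1
upper = b1 + T_CRIT * se_b1
print(round(se_b1, 3), round(lower, 2), round(upper, 2))
```

Because no intermediate rounding is applied, the bounds may differ slightly in the third decimal from the hand calculation; the conclusion that 0 lies outside the interval is unaffected.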
Hypothesis testing for a testing problem with hinge b;
three types of testing problems:
(a) test H0: β1 ≤ b against H1: β1 > b
(b) test H0: β1 ≥ b against H1: β1 < b
(c) test H0: β1 = b against H1: β1 ≠ b
Start with the pivot Z, adapt it in accordance with the
worst-case scenario β1 = b and replace the remaining
parameter σε by the consistent estimator Sε. This yields
the following random variable as test statistic:

$$T = \frac{B_1 - b}{S_\varepsilon / \sqrt{(n-1)s_X^2}} \sim t_{n-2} \quad \text{if } \beta_1 = b$$
Note that the test statistic is of the general form

(estimator − hinge) / SE(estimator)

The test procedures guarantee that the probability of
rejecting H0 incorrectly is at most α. This leads to the
rejection regions
(a) reject H0 ⇔ t ≥ tα;n−2
(b) reject H0 ⇔ t ≤ −tα;n−2
(c) reject H0 ⇔ t ≤ −tα/2;n−2 or t ≥ tα/2;n−2
Example 19.8 Supermarket: Y = ‘weekly sales
(x €1000)’; X = ‘weekly advertising costs (x €1000)’;
n = 7. Question: If the weekly ads costs are increased by
€1000, will the average weekly sales be increased by more
than €15000? Use a test with α = 0.05.
Data:

$$b_1 = 18.172; \quad s_X^2 = 0.1362; \quad s_\varepsilon^2 = \frac{SSE}{5} = \frac{72.510}{5} = 14.502$$

$$s_{B_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_X^2}} = 4.2124$$
(i) Hypotheses: H0: β1 ≤ 15 vs H1: β1 > 15; α = 0.05

(ii) Test statistic:

$$T = \frac{B_1 - 15}{S_\varepsilon / \sqrt{(n-1)s_X^2}} = \frac{B_1 - 15}{SE(B_1)} = \frac{B_1 - 15}{S_{B_1}}$$

(iii) Rejection region:
reject H0 ⇔ t ≥ t0.05;5 = 2.0150 (Excel: T.INV(0.95,5))

(iv) val:

$$val = \frac{18.17 - 15}{4.2124} = 0.753$$

(v) Conclusion: Do not reject H0, since 0.753 < 2.0150. We
cannot conclude that the answer is “yes”.
p-value = P(T ≥ 0.753) = 0.2427 (Excel: 1-T.DIST(0.753,5,1))
We would have needed α ≥ 0.2427 to draw the conclusion “yes”.
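The value of the test statistic and the p-value can be approximated without a statistics package. The sketch below takes b1 = 18.17 and sB1 = 4.2124 from the example as given, and obtains the upper-tail probability of the t-distribution by numerically integrating its density with Simpson's rule; this integration approach and the helper names `t_pdf` and `t_upper_tail` are illustrative choices, not the method used in the course (which uses Excel).

```python
import math


def t_pdf(t, df):
    """Density of the t-distribution with df degrees of freedom."""
    const = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return const * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=1000):
    """P(T >= t) for t >= 0, using Simpson's rule on [0, t] and symmetry around 0."""
    h = t / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


b1, se_b1 = 18.17, 4.2124   # values taken from the example
hinge, df = 15.0, 5         # H0: beta1 <= 15, with n - 2 = 5 degrees of freedom

val = (b1 - hinge) / se_b1
p_value = t_upper_tail(val, df)
reject = val >= 2.0150      # critical value t_{0.05;5} from the example

print(round(val, 3), round(p_value, 4), reject)
```

The rejection decision based on the critical value and the one based on comparing the p-value with α = 0.05 agree: H0 is not rejected.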
Example 19.9 large population of adult men
Y = ‘weight (kg)’; X = ‘height (cm)’; n = 242;
Dataset: Xmp19-04b.sav

Formulate the model that explains the variation in Y from
X. Use SPSS to estimate the regression equation and to find
the values of SSE and sε.

Conjectures:
- We expect a positive linear relationship between Y and X.
- But 1 cm extra height will on average correspond to less
than 0.75 kg extra weight.

We use suitable tests (α = 0.02) to comment on the
conjectures.
Basic assumption: E(Y) = β0 + β1x.

We see that b0 = −26.707 and b1 = 0.584.

ŷ = −26.707 + 0.584x
The intercept cannot be interpreted, since 0 is not in the
data range. Interpretation of the slope: if the height increases by
1 cm, then the weight on average increases by 0.584 kg.
SPSS also provides the values of SSE and sε.

[Figure: SPSS printout with the SSE entry marked.]
The first conjecture is β1 > 0. We will use the p-value and
the printout to comment on this conjecture.

[Figure: SPSS printout, marking the value of the test with hinge 0 (H0: β1 = 0 vs. H1: β1 ≠ 0) and the p-value of the two-sided test.]

To test whether β1 > 0, the t-test with hinge 0 yields val =
7.465. Since the first three decimals of the p-value for the
two-sided test are all 0, it remains small after division by 2.
The null hypothesis that β1 ≤ 0 is rejected convincingly.
The second conjecture states that β1 < 0.75.

[Figure: SPSS printout with b1 and sB1 marked.]
(i) Hypotheses: H0: β1 ≥ 0.75 vs H1: β1 < 0.75; α = 0.02

(ii) Test statistic:

$$T = \frac{B_1 - 0.75}{S_\varepsilon / \sqrt{(n-1)s_X^2}} = \frac{B_1 - 0.75}{S_{B_1}}$$

(iii) Rejection region: reject H0 ⇔ t ≤ −t0.02;240 = −2.0650

(iv) val:

$$val = \frac{0.584 - 0.75}{0.078} = -2.128$$

(v) Conclusion: Reject H0, since −2.128 < −2.065. The data
indicate that the second conjecture is also true.
We can also calculate a 95% confidence interval for β1 and
use it to comment on the conjectures.
b1 ± tα/2;n−2 × sB1 = 0.584 ± 1.9699 × 0.078
= 0.584 ± 0.1537
= (0.430, 0.737)
We have 95% confidence that this interval captures β1. Since
all values are positive, we also have at least 95% confidence
that β1 > 0. The interval lies completely to the left of
0.75, which means that we can also conclude that β1 < 0.75.
Additional Exercise 4
When the economy is growing, finance ministers often fear
that inflation will grow too. In this exercise we will
study whether this fear is justified.
The file Xrc05-14.xls contains data from the Federal
Statistical Office of Germany about X = ‘GDP growth (%)
of a country in 2005’ and Y = ‘inflation rate (%) of that
country in 2005’. The dataset contains measurements of
both X and Y for 165 countries. We will use the data of 164
of these countries to study the relationship between Y and
X; the observations of Zimbabwe are excluded since the
inflation rate of 302.2 of this country would influence the
results too much. Use the SPSS regression output below to
answer the questions.
Additional Exercise 4 (continued)

[Figure: SPSS regression output]
a) Why do we take inflation rate as dependent and
GDP growth as independent variable (and not the
other way around)?
b) Use the printout to determine the equation of the
regression line. Interpret the regression coefficient.
c) Calculate the predicted inflation for Denmark, with
a GDP growth of 3.2% and inflation 1.7%. Also
calculate the accompanying residual.
d) Test whether the variable X is significant in
explaining the variation of Y; use a significance
level of α = 0.05.
e) Give the p-value of the test in d) (use the printout).
Interpret your answer.
Wrap up
This week:
• Sample correlation and covariance
• Simple linear regression (basics)

Next week:
• Linear regression using software tools
• ANOVA
