
Question 1 – Three rank tests (7 + 9 + 7 = 23)

We will look at three different rank tests, one by one:

a. The Wilcoxon Signed Rank Sum Test
b. Spearman’s Rank Correlation Test
c. The Wilcoxon Rank Sum Test

For each test, we have drawn random samples: either two independent samples X and Y, or one
sample of matched pairs (X, Y) – whatever is appropriate for the specified test.

Observations are on one of two levels of measurement: quantitative or ordinal. Quantitative
observations are numerical (1, 2, 3, etc.) and ordinal observations are given as letters
(A, B, C, etc.), with A being the lowest ordinal value.

Instructions
The tables below show the observed values (the samples) and a lot of empty cells. In these empty
cells, you must write the appropriate rank numbers associated with the observed values. There may
be more cells than strictly necessary; these may remain empty or may be used for something else.

Example
X      2   3   5   7  11  13  17  19
rank   1   2   3   4   5   6   7   8

1a. Wilcoxon Signed Rank Sum

Determine the ranks as used by the Wilcoxon Signed Rank Sum Test of this table:
X   10   6  11  11   1   4   6   3
Y    8   9   5   1   7   2   3   5

(You may not need all of the empty cells.)

Now, calculate the Wilcoxon Signed Rank Sum. Write the calculation in the answer box: [ANSWER
BOX]

Answer 1a. – Wilcoxon Signed Rank Sum


X                  10     6    11    11     1     4     6     3
Y                   8     9     5     1     7     2     3     5
difference X−Y      2    −3     6    10    −6     2     3    −2
rank                2   4.5   6.5     8   6.5     2   4.5     2

The Wilcoxon Signed Rank Sum: $T^+ = 2 + 6.5 + 8 + 2 + 4.5 = 23$
or: $T^- = 4.5 + 6.5 + 2 = 13$
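The ranking in this answer can be checked with a minimal Python sketch. It ranks the absolute differences, gives tied values the average of their positions, and sums the ranks by sign. The helper name `midranks` is an illustrative choice, not something the exam defines.

```python
def midranks(values):
    """Return ranks 1..n; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


x = [10, 6, 11, 11, 1, 4, 6, 3]
y = [8, 9, 5, 1, 7, 2, 3, 5]

diffs = [a - b for a, b in zip(x, y)]      # no zero differences occur in this sample
ranks = midranks([abs(d) for d in diffs])  # rank the absolute differences

t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
print(ranks, t_plus, t_minus)
```

The two sums always add up to n(n + 1)/2 = 36 for n = 8, which is a quick check on the hand calculation.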
1b. Spearman’s Rank Correlation Coefficient

Determine the ranks of the observed values below, and write them in empty cells of this table:
X    7   1  12   8   3  12   5   7
Y    C   G   E   H   E   A   B   I

(You may not need all of the empty cells.)

Now, calculate Spearman’s Rank Correlation coefficient (2 decimals). Write a short calculation in the
answer box: [ANSWER BOX]

Answer 1b. – Spearman’s Rank Correlation Coefficient


X        7     1    12     8     3    12     5     7
rank   4.5     1   7.5     6     2   7.5     3   4.5
Y        C     G     E     H     E     A     B     I
rank     3     6   4.5     7   4.5     1     2     8

The Spearman’s Rank Correlation coefficient:
$r_s = \frac{s_{XY}}{s_X s_Y} = \frac{-1.1785}{\sqrt{5.8571} \cdot \sqrt{5.9286}} = -0.20$
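As a check on this answer, the sketch below computes the coefficient as the ordinary correlation of the two rank series, which is the form the answer uses and the appropriate one when ties are present. The function names `midranks` and `spearman` are illustrative, and the letters are ranked alphabetically on the assumption that A is the lowest value, as the exam states.

```python
import math


def midranks(values):
    """Return ranks 1..n; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def spearman(x, y):
    """Correlation of the ranks: s_XY / (s_X * s_Y), all computed on ranks."""
    rx, ry = midranks(x), midranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    s_xy = sum((a - mx) * (b - my) for a, b in zip(rx, ry)) / (n - 1)
    var_x = sum((a - mx) ** 2 for a in rx) / (n - 1)
    var_y = sum((b - my) ** 2 for b in ry) / (n - 1)
    return s_xy / (math.sqrt(var_x) * math.sqrt(var_y))


x = [7, 1, 12, 8, 3, 12, 5, 7]
y = list("CGEHEABI")  # ordinal letters, compared alphabetically

r_s = spearman(x, y)
print(midranks(x), midranks(y), round(r_s, 2))
```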

1c. Wilcoxon Rank Sum

Determine the ranks of the observed values below, and write them in empty cells of this table:
X    J   F   K   K   A   D   F   C
Y    H   I   E   A   G   B   C   E

(You may not need all of the empty cells.)

Now, calculate the Wilcoxon Rank Sum as used in a test. Write the calculation in the answer box:
[ANSWER BOX]

Answer 1c. – Wilcoxon Rank Sum


X         J     F     K     K     A     D     F     C
rank     14   9.5  15.5  15.5   1.5     6   9.5   4.5
Y         H     I     E     A     G     B     C     E
rank     12    13   7.5   1.5    11     3   4.5   7.5

The Wilcoxon Rank Sum: $T_1 = 14 + 9.5 + 15.5 + 15.5 + 1.5 + 6 + 9.5 + 4.5 = 76$
or: $T_2 = 12 + 13 + 7.5 + 1.5 + 11 + 3 + 4.5 + 7.5 = 60$
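For two independent samples the ranks are assigned in the pooled sample of 16 observations, not within each sample. A minimal sketch of that step follows; `midranks` is again an illustrative helper name, and the letters are compared alphabetically.

```python
def midranks(values):
    """Return ranks 1..n; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


x = list("JFKKADFC")
y = list("HIEAGBCE")

pooled_ranks = midranks(x + y)       # rank all 16 observations together
rank_x = pooled_ranks[:len(x)]
rank_y = pooled_ranks[len(x):]

t1 = sum(rank_x)
t2 = sum(rank_y)
print(rank_x, rank_y, t1, t2)
```

The two rank sums add up to 16 · 17 / 2 = 136, so either one determines the other.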
Question 2 – Dependence / independence (18)
(this is exercise 10, from the homework exercises of week 3, in Syllabus A – weeks 1-2-3)

New drugs are usually tested by giving a randomly selected group of people the drug, and another
randomly selected group of people (named the control group) a placebo. Each person is then asked
whether she or he suffered serious side effects. Suppose that for a new drug, the following data
were collected:
                      serious side effects
medication       suffered     did not suffer
new drug               41                165
placebo                28                161
Use a $\chi^2$-test to determine whether we can conclude, at the 10% significance level, that
differences exist between the new drug and the placebo in terms of reported side effects. Follow these steps:
[1] Assumptions and conditions, [2] Hypotheses, [3] Test statistic and its distribution, [4] Rejection
region, [5] Sample outcome, [6] Confrontation and decision, [7] Conclusion

Answer 2
[1] Assumptions and conditions
• Random sample
• Available: nominal data (two yes/no-variables)
• Required: minimally nominal data
• All expected frequencies $e_i \geq 5$

[2] Hypotheses
$H_0$: the two classifications are independent
$H_1$: the two classifications are dependent

[3] Test statistic and its distribution


$\chi^2 = \sum_{i=1}^{4} \frac{(f_i - e_i)^2}{e_i} \sim \chi^2[\text{df}]$
or
$\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{(f_{ij} - e_{ij})^2}{e_{ij}} \sim \chi^2[\text{df}]$

with $\text{df} = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1$


[4] Rejection region
$\chi^2 \geq \chi^2_{crit} = \chi^2_{0.10,\,1} = 2.706$

[5] Sample outcome


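$\chi^2_{obs} = \frac{(41 - 36)^2}{36} + \frac{(165 - 170)^2}{170} + \frac{(28 - 33)^2}{33} + \frac{(161 - 156)^2}{156} = 1.76$

from (observed frequencies, with expected frequencies in parentheses):
                      serious side effects
medication       suffered     did not suffer     total
new drug          41 (36)          165 (170)       206
placebo           28 (33)          161 (156)       189
total                  69                326       395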

[6] Confrontation and decision


$\chi^2_{obs} < \chi^2_{crit} \implies$ do not reject $H_0$

[7] Conclusion
Given the significance level of 10%, there is not sufficient evidence to infer that differences
exist in side effects between the new drug and the placebo.
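Steps [3] to [6] can be reproduced with a short Python sketch. It derives the expected frequencies from the row and column totals and compares the statistic with the critical value 2.706 quoted in step [4]. One caveat: the hand calculation rounds the expected frequencies to whole numbers and arrives at 1.76, whereas the unrounded expected frequencies give roughly 1.77. The decision is the same either way.

```python
observed = [
    [41, 165],  # new drug: suffered, did not suffer
    [28, 161],  # placebo:  suffered, did not suffer
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency under independence: row total * column total / n
expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]

chi2_obs = sum(
    (f - e) ** 2 / e
    for f_row, e_row in zip(observed, expected)
    for f, e in zip(f_row, e_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)
chi2_crit = 2.706  # chi-square critical value for alpha = 0.10, df = 1 (table value from step [4])
reject_h0 = chi2_obs >= chi2_crit

print([[round(e, 2) for e in row] for row in expected])
print(df, round(chi2_obs, 2), reject_h0)
```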
Question 3 – Confidence interval (9)
A statistician uses two random samples of clothes hangers, in order to test the maximum weight
(kilograms) that two different brands of clothes hangers can carry. The results:

                              brand 1    brand 2
sample size                        18         16
mean max. weight                 10.4       12.2
st. deviation max. weight         7.2        4.4
$s^2 / n$                           …          …
Use a 90%-confidence interval to estimate the difference between mean maximum weights.
(3 decimals)

Answer 3

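$\frac{s_1^2}{n_1} = 2.88$ and $\frac{s_2^2}{n_2} = 1.21$

$\text{df} = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}} = \frac{(2.88 + 1.21)^2}{\frac{2.88^2}{18 - 1} + \frac{1.21^2}{16 - 1}} = 28.57 \approx 29$

$1 - \alpha = 90\% \implies \frac{1}{2}\alpha = 5\% \implies t_{\alpha/2,\,\text{df}} = t_{0.05,\,29} = 1.699$

$\mu_1 - \mu_2 = \bar{x}_1 - \bar{x}_2 \pm t_{\alpha/2,\,\text{df}} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
$= 10.4 - 12.2 \pm 1.699 \cdot \sqrt{2.88 + 1.21} = -1.8 \pm 3.436$
$\implies -5.236 < \mu_1 - \mu_2 < 1.636$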
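The arithmetic of this answer can be retraced with a small Python sketch of the unequal-variances interval. The critical value 1.699 is copied from the answer (a table value for 29 degrees of freedom) and is not computed here. The variable names are illustrative.

```python
import math

n1, mean1, sd1 = 18, 10.4, 7.2  # brand 1
n2, mean2, sd2 = 16, 12.2, 4.4  # brand 2

v1 = sd1 ** 2 / n1  # s1^2 / n1
v2 = sd2 ** 2 / n2  # s2^2 / n2

# Approximate degrees of freedom for unequal variances
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

t_crit = 1.699  # t_{0.05, 29}, taken from the answer above (table lookup)
margin = t_crit * math.sqrt(v1 + v2)
diff = mean1 - mean2

lower, upper = diff - margin, diff + margin
print(round(v1, 2), round(v2, 2), round(df, 2))
print(round(margin, 3), round(lower, 3), round(upper, 3))
```

Because the interval contains 0, the data do not show a difference between the two mean maximum weights at this confidence level.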
Question 4 – Regression I (2 + 2 + 5 + 7 + 4 + 2 + 7 = 29)
In a number of cities in the world, the cost of living has been measured yearly as an index. This index
expresses the cost of living in a city as a percentage of the cost of living in New York City. Thus for
New York City the value of the index is 100 (representing 100%).
rank2007 city index2007 index2006
1 Moscow 134.4 123.9
2 London 126.3 110.6
3 Seoul 122.4 121.7
4 Tokyo 122.1 119.1
5 Hong Kong 119.4 116.3
6 Copenhagen 110.2 101.1
7 Geneva 109.8 103
8 Osaka 108.4 108.3
9 Zurich 107.6 100.8
10 Oslo 105.8 100
11 Milan 104.4 96.9
12 St Petersburg 103 99.7
13 Paris 101.4 93.1
14 Singapore 100.4 92
15 New York 100 100

. sum index2007 index2006

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
   index2007 |         15    111.7067    10.61807        100      134.4
   index2006 |         15    105.7667     10.3067         92      123.9

. reg index2007 index2006, r

Linear regression                               Number of obs     =         15
                                                F(1, 13)          =      77.75
                                                Prob > F          =     0.0000
                                                R-squared         =     0.8370
                                                Root MSE          =      . . .

------------------------------------------------------------------------------
| Robust
index2007 | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
index2006 | .94254 .1069 8.82 0.000 .71161 1.1735
_cons | 12.017 10.80 1.11 0.286 -11.312 35.346
------------------------------------------------------------------------------

The following questions concern the regression analysis above.


4a. Find the value of the residual for Geneva.
The predicted value is: 12.017 + 0.94254 ∙ 103 = 109.1
So the residual is: 109.8 − 109.1 = 0.7
4b. Interpret the intercept.
The intercept does not have a sensible interpretation since index2006 = 0 is outside the
range of possible and observed values.
4c. Find the (missing) value of Root MSE.
$TSS = (15 - 1) \cdot 10.61807^2 = 1578.4$
$SSR = TSS \cdot (1 - R^2) = 1578.4 \cdot (1 - 0.8370) = 1578.4 \cdot 0.163 = 257.28$
$MSE = \frac{SSR}{n - 2} = \frac{257.28}{13} = 19.79 \implies \text{Root MSE} = \sqrt{19.79} = 4.449$
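The same chain from the standard deviation of index2007 to Root MSE, written as a minimal Python sketch. All inputs are read off the Stata output above. The code carries full precision between steps, so intermediate values can differ from the hand calculation in later decimals.

```python
import math

n = 15
sd_y = 10.61807     # Std. dev. of index2007 from the summary table
r_squared = 0.8370  # R-squared from the regression output

tss = (n - 1) * sd_y ** 2    # total sum of squares
ssr = tss * (1 - r_squared)  # sum of squared residuals
mse = ssr / (n - 2)          # two estimated coefficients: slope and constant
root_mse = math.sqrt(mse)

print(round(tss, 1), round(ssr, 2), round(mse, 2), round(root_mse, 3))
```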
Use the heteroskedasticity-robust standard errors in the next question 4d.
4d. Approximate the p-value when you test whether the slope of the population regression line is
smaller than 1.
$t_{act} = \frac{0.94254 - 1}{0.1069} = -0.538$
$-t_{0.10,\,[df=13]} = -1.350$
$-0.538 > -1.350 \implies p\text{-value} > 0.10$
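For a more precise approximation than the table bound, the left-tail probability can be computed numerically. The sketch below integrates the Student t density with Simpson's rule so that it needs only the standard library. The helpers `t_pdf` and `t_cdf` are illustrative names, and using 13 degrees of freedom follows the answer above.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), via Simpson's rule on [0, |x|] plus symmetry around 0."""
    b = abs(x)
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


slope, robust_se = 0.94254, 0.1069
t_act = (slope - 1) / robust_se  # H0: slope = 1 against H1: slope < 1
p_value = t_cdf(t_act, df=13)    # left-tail probability

print(round(t_act, 3), round(p_value, 2))
```

The result is a one-sided p-value of about 0.30, consistent with the conclusion that it exceeds 0.10.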
4e. Comment on the (in)validity of the outcome that you found in the previous question.
The sample size $n = 15$ is small, so we cannot rely on the validity of the robust standard
errors, which are only valid for large $n$. As a consequence, we cannot rely on the validity of
the t-statistic and the approximate p-value from question 4d, which use this standard error.
4f. What would be the estimated regression equation when index2007 was regressed without
any explanatory variable (so only on a constant)?
When you regress only on a constant, the estimated coefficient is equal to the sample mean.
The estimated equation is then: $\widehat{\mathit{index2007}} = 111.7067$.
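A quick numerical illustration of why the constant equals the sample mean: among all constants, the mean gives the smallest sum of squared residuals. The sketch uses the 15 index2007 values from the table. The helper `ssr` and the offsets tried are illustrative.

```python
index2007 = [134.4, 126.3, 122.4, 122.1, 119.4, 110.2, 109.8, 108.4,
             107.6, 105.8, 104.4, 103.0, 101.4, 100.4, 100.0]

mean_2007 = sum(index2007) / len(index2007)


def ssr(constant):
    """Sum of squared residuals when every city is predicted by one constant."""
    return sum((y - constant) ** 2 for y in index2007)


# Moving the constant away from the mean in either direction increases the SSR.
for offset in (-1.0, -0.1, 0.0, 0.1, 1.0):
    print(round(mean_2007 + offset, 4), round(ssr(mean_2007 + offset), 2))
```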
4g. The index values for Paris are: index2006 = 93.1 and index2007 = 101.4. Suppose that
Paris is taken as the base, instead of New York, and that Pindex (Paris index) expresses the
cost of living in a city as a percentage of the cost of living in Paris. So for Paris: Pindex2006 =
100 and Pindex2007 = 100. Find the estimated regression equation (including a constant)
when Pindex2007 is regressed on Pindex2006.
Estimated is:
$\widehat{\mathit{index2007}} = 12.017 + 0.94254 \cdot \mathit{index2006}$
We will use:
$\mathit{Pindex2007} = \frac{\mathit{index2007}}{101.4} \times 100$
and
$\mathit{Pindex2006} = \frac{\mathit{index2006}}{93.1} \times 100$
So:
$\widehat{\mathit{Pindex2007}} = \frac{\widehat{\mathit{index2007}}}{101.4} \times 100$
$= \frac{12.017}{101.4} \times 100 + \frac{0.94254}{101.4} \times 100 \cdot \mathit{index2006}$
$= 11.851 + 0.92953 \cdot \frac{\mathit{Pindex2006}}{100} \times 93.1$
$= 11.851 + 0.865 \cdot \mathit{Pindex2006}$
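The rescaling can be verified with a few lines of Python, starting from the reported coefficients. The last lines check that the prediction for a city is the same whether it is computed in New York units and then converted, or computed directly in Paris units. Variable names are illustrative.

```python
b0, b1 = 12.017, 0.94254               # estimated constant and slope (New York base)
paris_2006, paris_2007 = 93.1, 101.4   # Paris values on the New York-based index

# Pindex2007 = index2007 * 100 / 101.4  and  index2006 = Pindex2006 * 93.1 / 100
p_intercept = b0 * 100 / paris_2007
p_slope = b1 * paris_2006 / paris_2007

print(round(p_intercept, 3), round(p_slope, 3))

# Consistency check for Paris itself (index2006 = 93.1, Pindex2006 = 100)
via_new_york = (b0 + b1 * paris_2006) * 100 / paris_2007
via_paris = p_intercept + p_slope * 100
print(round(via_new_york, 3), round(via_paris, 3))
```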
Question 5 – Regression II (3 + 10 + 4 + 4 = 21)
Consider a simple linear regression model with one explanatory variable: $Y_i = \beta_0 + \beta_1 X_i + u_i$

It is assumed that a random sample $(Y_i, X_i)$ for $i = 1, \dots, n$ is drawn, and that $(Y_i, X_i)$ has finite
fourth moments. It is further assumed that $E(u_i) = 0$ and $var(u_i) = \sigma^2$, and that $u_i$ is
distributed independently of $X_i$.

It is also known that $X_i > 0$ is a random variable that by definition must always be positive, which
implies that $E(X_i) = \mu_X > 0$; and that its effect on $Y_i$ is strictly positive, so $\beta_1 > 0$. Suppose that
an econometrician is interested in estimating the constant $\beta_0$ and considers using the simple
sample mean $b_0 = \bar{Y}$ as an estimator of the constant $\beta_0$.

5a. It follows from the assumptions that $E(u_i \mid X_i) = 0$ and $var(u_i \mid X_i) = \sigma^2$. Explain why.
Since $u_i$ is distributed independently of $X_i$, it follows that the conditional distribution of $u_i$
for given $X_i$ is the same as the unconditional distribution of $u_i$. This means in particular
that the conditional expected value and variance are equal to the unconditional expected value
and variance, so $E(u_i \mid X_i) = E(u_i) = 0$ and $var(u_i \mid X_i) = var(u_i) = \sigma^2$.
5b. Derive $E(b_0 \mid X_1, \dots, X_n)$. Next, derive $E(b_0)$ and show that it is unequal to $\beta_0$.
$E(b_0 \mid X_1, \dots, X_n) = E(\bar{Y} \mid X_1, \dots, X_n)$
$= E\left(\frac{1}{n} \sum_{i=1}^{n} Y_i \mid X_1, \dots, X_n\right)$
$= E\left(\frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 X_i + u_i) \mid X_1, \dots, X_n\right)$
$= \frac{1}{n} \sum_{i=1}^{n} \left(\beta_0 + \beta_1 X_i + E(u_i \mid X_1, \dots, X_n)\right)$
$= \frac{1}{n} \sum_{i=1}^{n} \left(\beta_0 + \beta_1 X_i + E(u_i \mid X_i)\right)$, as $(X_i, u_i)$ is independent of $X_j$ for $j \neq i$ (random sample)
$= \frac{1}{n} \sum_{i=1}^{n} (\beta_0 + \beta_1 X_i + 0)$
$= \beta_0 + \beta_1 \bar{X}$
So: $E(b_0) = E[E(b_0 \mid X_1, \dots, X_n)] = E[\beta_0 + \beta_1 \bar{X}] = \beta_0 + \beta_1 \cdot E[\bar{X}] = \beta_0 + \beta_1 \mu_X \neq \beta_0$
5c. Based on the result in part b, can you tell whether $b_0$ is biased, and whether $b_0$ is consistent?
Explain exactly what you check for.
$E(b_0) = \beta_0 + \beta_1 \mu_X > \beta_0$ (as $\beta_1 > 0$ and $\mu_X > 0$), so $b_0$ has a positive bias.
$E(b_0) = \beta_0 + \beta_1 \mu_X$ does not depend on $n$, so it does not converge to $\beta_0$ when $n$ increases to
infinity. By the law of large numbers, $b_0 = \bar{Y}$ converges in probability to $E(Y_i) = \beta_0 + \beta_1 \mu_X \neq \beta_0$.
This means that $b_0$ has an asymptotic bias, so that it cannot be consistent.
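A small Monte Carlo sketch can make the bias and the lack of consistency visible. Everything numerical here is a hypothetical choice for illustration only: the parameter values, the uniform distribution for X (positive, with mean 2), the standard normal errors, and the seed. With these choices the sample mean of Y should settle near 1 + 0.5 · 2 = 2 for every sample size, not near the constant 1.

```python
import random

random.seed(0)

beta0, beta1 = 1.0, 0.5  # hypothetical true parameters, with beta1 > 0
mu_x = 2.0               # mean of X under the uniform(1, 3) choice below


def draw_ybar(n):
    """Draw one sample of size n from the model and return the sample mean of Y."""
    total = 0.0
    for _ in range(n):
        x = random.uniform(1.0, 3.0)  # X > 0 by construction
        u = random.gauss(0.0, 1.0)    # error drawn independently of X
        total += beta0 + beta1 * x + u
    return total / n


reps = 300
averages = {}
for n in (20, 200, 2000):
    averages[n] = sum(draw_ybar(n) for _ in range(reps)) / reps
    print(n, round(averages[n], 3))

print("beta0 =", beta0, " beta0 + beta1 * mu_X =", beta0 + beta1 * mu_x)
```

Increasing n tightens the estimates around 2, not around 1, which is what an asymptotic bias looks like in a simulation.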
5d. Is the least-squares estimator of $\beta_0$ the BLUE? Explain.
The LS-assumptions are satisfied, as:
(1) $E(u_i \mid X_i) = 0$ holds (see part a)
(2) $(Y_i, X_i)$ for $i = 1, \dots, n$ is a random sample, so they are i.i.d.
(3) $(Y_i, X_i)$ has finite fourth moments.
In addition $var(u_i \mid X_i) = \sigma^2$ is constant (homoskedasticity).
Under these conditions the Gauss-Markov theorem implies that the LS-estimator of $\beta_0$ is BLUE.

~ The End ~
