Chapter 5 Stratified Random Sampling Completed

Chapter 5 Stratified Random Sampling

A stratified random sample is obtained by dividing the popln into groups (called strata) and
obtaining a simple random sample from each stratum.

We control (ie decrease) variation by selecting the strata so that there is little variation within
strata. For a given cost, we obtain more information than for simple random sampling.

Reasons for doing stratified random sampling:

• Force representation of all groups, especially the smaller groups.
• The margin of error can be made smaller than for SRS if the strata are chosen in such a
way that the responses within each group are similar to one another. The estimated
variance of each estimator is calculated as a weighted sum of the estimated variances
within each stratum.
• Costs are often reduced if strata take geographic location into account (less travel, etc).
• Estimates can be provided for each stratum.

Eg Opinion polls: may split by gender or location.
Eg Power consumption: split by type of housing.

Another example:

CPI (Consumer Price Index) = average change in price for a fixed collection of goods and
services. There are 85 strata chosen on the basis of geographic location, popln size, popln
change, % urban.

Notation

\(L\) = number of strata

\(N_i\) = size of stratum \(i\) = number of individuals in stratum \(i\), \(i = 1, 2, \dots, L\)

\(N\) = popln size = number of individuals in popln \(= N_1 + N_2 + \dots + N_L\)

\(n_i\) = number of individuals in sample from stratum \(i\), \(i = 1, 2, \dots, L\)

\(n\) = sample size = total number of individuals in sample \(= n_1 + n_2 + \dots + n_L\)

Estimating the popln mean (μ)

\[ \bar{y}_{st} = \frac{N_1}{N}\bar{y}_1 + \frac{N_2}{N}\bar{y}_2 + \dots + \frac{N_L}{N}\bar{y}_L \]

This is a weighted average of the averages from all groups, with weight proportional to stratum
size.

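Estimated variance of \(\bar{y}_{st}\): Show that

\[ \hat{V}(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{s_i^2}{n_i} \]

\[ \hat{V}(\bar{y}_{st}) = \hat{V}\left(\frac{N_1}{N}\bar{y}_1 + \frac{N_2}{N}\bar{y}_2 + \dots + \frac{N_L}{N}\bar{y}_L\right) \]

\[ = \hat{V}\left(\frac{N_1}{N}\bar{y}_1\right) + \hat{V}\left(\frac{N_2}{N}\bar{y}_2\right) + \dots + \hat{V}\left(\frac{N_L}{N}\bar{y}_L\right) \]

Why can we do this? V(X+Y) = V(X) + V(Y) + 2Cov(X,Y), and Cov(X,Y) = 0 because the readings from different strata are indep of one another.

\[ = \frac{N_1^2}{N^2}\hat{V}(\bar{y}_1) + \frac{N_2^2}{N^2}\hat{V}(\bar{y}_2) + \dots + \frac{N_L^2}{N^2}\hat{V}(\bar{y}_L) \]

Why can we do this? \(V(aX) = a^2 V(X)\)

\[ = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{s_i^2}{n_i} \]

since \(\hat{V}(\bar{y}_i) = \left(1-\frac{n_i}{N_i}\right)\frac{s_i^2}{n_i}\) from Chap 4.

Bound on error of estimate: \(B = 2\sqrt{\hat{V}(\bar{y}_{st})}\)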

Example 5 A work-study investigator wishes to estimate how long employees spend per week in union
meetings. Employees are stratified by type of job, as follows: floor workers, supervisors, janitorial staff
and administrators. A s.r.s. is obtained from each stratum, as follows:

Stratum   Job                Time spent (hours)                                     Average   Variance   Stratum size   Sample size
                                                                                    (ȳ_i)     (s_i²)     (N_i)          (n_i)
I         Floor worker       0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0 0 0.5 0 0.5 0.5    0.367     0.052      150            15
II        Supervisor         2 2.5 1.5 2 1.5 2                                      1.92      0.142      15             6
III       Janitorial staff   0 0.5 0.5 0                                            0.25      0.083      10             4
IV        Administrator      1 1 1 1.5                                              1.125     0.0625     12             4
Overall                                                                             0.78      0.4926     187            29

Make boxplots to check for outliers. (Why? The sample mean is affected by outliers when the sample is
small.)

(a) Estimate the average time spent by employees in union meetings per week, and place a bound on
the error of the estimate. Interpret the resulting interval.

Boxplots first:

[Figure: Boxplot of Floor, Supervisor, Exec, Admin. Vertical axis: Data (hours), from 0.0 to 2.5.]

We see no outliers so the sample averages will not be affected by outliers.

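\[ \bar{y}_{st} = \frac{N_1}{N}\bar{y}_1 + \frac{N_2}{N}\bar{y}_2 + \dots + \frac{N_L}{N}\bar{y}_L \]

\[ = \frac{150}{187}(0.367) + \frac{15}{187}(1.92) + \frac{10}{187}(0.25) + \frac{12}{187}(1.125) = 0.53 \]

\[ \hat{V}(\bar{y}_{st}) = \frac{N_1^2}{N^2}\hat{V}(\bar{y}_1) + \frac{N_2^2}{N^2}\hat{V}(\bar{y}_2) + \dots + \frac{N_L^2}{N^2}\hat{V}(\bar{y}_L) \]

\[ = \frac{1}{187^2}\left[150^2\left(1-\frac{15}{150}\right)\frac{0.0524}{15} + 15^2\left(1-\frac{6}{15}\right)\frac{0.142}{6} + 10^2\left(1-\frac{4}{10}\right)\frac{0.083}{4} + 12^2\left(1-\frac{4}{12}\right)\frac{0.0625}{4}\right] = 0.002193 \]

∴ \(B = 2\sqrt{0.002193} = 0.09\)

The CI is 0.53 ± 0.09 = (0.44, 0.62) hours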

Interpret: We estimate that the average number of hours per week spent in union meetings by all
employees is somewhere between .44 hours and .62 hours.
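
The calculation in part (a) can also be checked with a short script. This is only a sketch: the function name `stratified_mean` and the dictionary layout are illustrative choices, and the assignment of the 29 readings to the four strata is reconstructed from the table (it reproduces the stated stratum averages and variances). Small differences from the hand calculation come from rounding of the \(s_i^2\).

```python
from math import sqrt
from statistics import mean, variance

# Example 5 readings (hours per week), grouped by stratum.
# Assumption: values are assigned to strata so that they match the
# averages and variances reported in the table.
samples = {
    "floor":      [0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0, 0, 0.5, 0, 0.5, 0.5],
    "supervisor": [2, 2.5, 1.5, 2, 1.5, 2],
    "janitorial": [0, 0.5, 0.5, 0],
    "admin":      [1, 1, 1, 1.5],
}
stratum_sizes = {"floor": 150, "supervisor": 15, "janitorial": 10, "admin": 12}


def stratified_mean(samples, stratum_sizes):
    """Return (ybar_st, estimated variance, bound B = 2*sqrt(variance))."""
    N = sum(stratum_sizes.values())
    # Weighted average of stratum means, weights N_i / N.
    ybar_st = sum(stratum_sizes[k] * mean(y) for k, y in samples.items()) / N
    # (1/N^2) * sum of N_i^2 (1 - n_i/N_i) s_i^2 / n_i
    v_hat = sum(
        stratum_sizes[k] ** 2 * (1 - len(y) / stratum_sizes[k]) * variance(y) / len(y)
        for k, y in samples.items()
    ) / N ** 2
    return ybar_st, v_hat, 2 * sqrt(v_hat)


ybar_st, v_hat, B = stratified_mean(samples, stratum_sizes)
print(round(ybar_st, 2), round(B, 2))
```

The printed values should agree with the estimate of about 0.53 hours and the bound of about 0.09 hours found above.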

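Notice that if this had been a SRS then we'd have:

\(\bar{y} = 0.776\) (compare with \(\bar{y}_{st} = 0.53\)). Why are they so different? Floor workers make up
150/187 = 80% of the popln but only 15/29 = 52% of the sample, so the stratified estimate gives
their (low) readings more weight than the plain sample average does.

\[ \hat{V}(\bar{y}) = \left(1-\frac{29}{187}\right)\frac{0.4926}{29} = 0.0144 \quad ∴ B = 2\sqrt{\hat{V}(\bar{y})} = 0.24 \]

(Compare with \(B_{st} = 0.09\).)

HW Suppose that the stratified sample on p 120 was a SRS. Calculate \(\bar{y}\) and \(B\) and compare with
\(\bar{y}_{st}\) and \(B_{st}\).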
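
As a quick check of the SRS comparison, the sketch below plugs the Overall column of the Example 5 table into the Chapter 4 formula with the finite population correction. The variable names are illustrative only.

```python
from math import sqrt

# Overall column of the Example 5 table, treated as if it were a single SRS.
n, N = 29, 187
s2 = 0.4926                      # overall sample variance

v_hat = (1 - n / N) * s2 / n     # estimated variance of the SRS mean
B = 2 * sqrt(v_hat)              # bound on the error of estimate
print(round(v_hat, 4), round(B, 2))
```

The bound comes out at roughly 0.24, well above the stratified bound of about 0.09.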

Estimating a population total (τ ):

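\(\bar{y}_{st}\) estimates \(\mu = \dfrac{\tau}{N}\)

∴ \(\hat{\tau}_{st} = N\bar{y}_{st}\) estimates \(N\mu = \tau\)

\[ \hat{\tau}_{st} = N\bar{y}_{st} = N_1\bar{y}_1 + N_2\bar{y}_2 + \dots + N_L\bar{y}_L \]

Notice that \(N_1\bar{y}_1\) estimates \(\tau_1\) (the total for stratum 1), etc

\[ \hat{V}(\hat{\tau}_{st}) = \hat{V}(N\bar{y}_{st}) = N^2\hat{V}(\bar{y}_{st}) = \sum_{i=1}^{L} N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{s_i^2}{n_i} \]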

Example 5 (continued) Refer to the union-meeting data for the four job strata (floor workers, supervisors,
janitorial staff and administrators) given earlier. The summary statistics are:

Stratum                I              II           III                IV              Overall
Job                    Floor worker   Supervisor   Janitorial staff   Administrator
Average (ȳ_i)          0.367          1.92         0.25               1.125           0.78
Variance (s_i²)        0.052          0.142        0.083              0.0625          0.4926
Stratum size (N_i)     150            15           10                 12              187
Sample size (n_i)      15             6            4                  4               29

(b) Estimate the total number of hours lost per week on account of union meetings. Place a bound on
the error of the estimate. Interpret the resulting interval.

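Solution 1: Use \(\hat{\tau}_{st} = N\bar{y}_{st}\) and \(\hat{V}(\hat{\tau}_{st}) = \hat{V}(N\bar{y}_{st}) = N^2\hat{V}(\bar{y}_{st})\) with \(\bar{y}_{st} = 0.53\) and \(\hat{V}(\bar{y}_{st}) = 0.002193\)

\(\hat{\tau}_{st} = N\bar{y}_{st} = 187 \times 0.53 \approx 99.8\) hours (carrying the unrounded value of \(\bar{y}_{st}\); 187 × 0.53 on its own gives 99.1)

\(\hat{V}(\hat{\tau}_{st}) = \hat{V}(N\bar{y}_{st}) = N^2\hat{V}(\bar{y}_{st}) = 187^2 \times 0.002193 = 76.687\)

\(B = 2\sqrt{76.687} = 17.51\)

Solution 2: Direct:

\(\hat{\tau} = \hat{\tau}_{st} = N\bar{y}_{st} = N_1\bar{y}_1 + N_2\bar{y}_2 + \dots + N_L\bar{y}_L = 150 \times 0.367 + \dots + 12 \times 1.125 = 99.8\) hours

\[ \hat{V}(\hat{\tau}) = \sum_{i=1}^{L} N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{s_i^2}{n_i} = 150^2\left(1-\frac{15}{150}\right)\frac{0.0524}{15} + \dots + 12^2\left(1-\frac{4}{12}\right)\frac{0.0625}{4} \approx 76.7 \]

so \(B \approx 17.5\), as in Solution 1.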

The CI is 99.8 ± 17.51=( 82.29 , 117.31 )

Interpret: We estimate that the total number of hours spent by all employees in union meetings
per week is somewhere between 82.29 hours and 117.31 hours.
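
The direct route (Solution 2) is easy to script from the summary statistics alone. A minimal sketch, using the rounded values from the table; the list names are illustrative.

```python
from math import sqrt

# Summary statistics from the Example 5 table (strata I to IV).
N_i  = [150, 15, 10, 12]            # stratum sizes
n_i  = [15, 6, 4, 4]                # stratum sample sizes
ybar = [0.367, 1.92, 0.25, 1.125]   # stratum sample means
s2   = [0.0524, 0.142, 0.083, 0.0625]  # stratum sample variances

# tau_hat = sum of N_i * ybar_i
tau_hat = sum(N * y for N, y in zip(N_i, ybar))
# V(tau_hat) = sum of N_i^2 (1 - n_i/N_i) s_i^2 / n_i
v_tau = sum(N ** 2 * (1 - n / N) * s / n for N, n, s in zip(N_i, n_i, s2))
B = 2 * sqrt(v_tau)
print(round(tau_hat, 1), round(v_tau, 2), round(B, 2))
```

Up to rounding, this gives the same total (just under 100 hours) and bound (about 17.5 hours) as Solution 1.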

The allocation problem: The allocation problem is: Given a desired margin of error, how many
individuals need to be randomly selected from each stratum?

Basic ideas:

• If the stratum is large, it should have greater representation in the sample ie we want
\(n_i \propto N_i\)
• If the readings in one stratum are more variable than the readings in another stratum then
we need more info from the stratum with the larger variability ie \(n_i \propto \sigma_i\)
• If it's expensive to get readings from a stratum then we want fewer readings from that
stratum ie \(n_i\) should decrease as the cost \(c_i\) increases (optimal allocation, below, uses \(n_i \propto 1/\sqrt{c_i}\))

There are three solutions to the allocation problem.

Proportional allocation takes into account only the size of each stratum. The weight of each
stratum in the sample is \(w_i = \dfrac{N_i}{N}\)

Neyman allocation takes into account the sizes of the strata and the variation within each
stratum. The weight of each stratum in the sample is \(w_i = \dfrac{N_i\sigma_i}{\sum_j N_j\sigma_j}\)

Optimal allocation takes into account the sizes of the strata, the variation within each stratum
and the cost of sampling from each stratum. The weight of each stratum in the sample is

\[ w_i = \frac{N_i\sigma_i/\sqrt{c_i}}{\sum_j N_j\sigma_j/\sqrt{c_j}} \]

In order to solve an allocation problem:

1) Find the weights \(w_1, w_2, \dots, w_L\) using proportional, Neyman or optimal weights, then
2) find the total sample size (\(n\)), then
3) find the sample size \(n_i\) from each stratum.

To find \(n\):

\[ n = \frac{\sum N_i^2\sigma_i^2/w_i}{\dfrac{N^2B^2}{4} + \sum N_i\sigma_i^2} \]

To find \(n_i\): \(n_i = w_i n\)

Notice that the weights always sum to 1, regardless of the method.

Example 6 Suppose we wish to estimate, to within $200, the average starting salary (p.a.) of teaching
graduates from last year. There were 150 teaching graduates: 90 female and 60 male, with
\(\sigma_F\) = $800 and \(\sigma_M\) = $500. We specify proportional samples. Find the required sample size (\(n\)) and the
gender breakdown within the sample.

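Solution: \(B\) = $200, \(N = 150\), \(N_F = 90\), \(N_M = 60\), \(\sigma_F\) = $800, \(\sigma_M\) = $500

Step 1: Weights according to proportional sampling:

\[ w_F = \frac{N_F}{N} = \frac{90}{150} = .6 \]

\[ w_M = \frac{N_M}{N} = \frac{60}{150} = .4 \]

Step 2: Find \(n\)

\[ n = \frac{\sum N_i^2\sigma_i^2/w_i}{\dfrac{N^2B^2}{4} + \sum N_i\sigma_i^2} = \frac{\dfrac{90^2 \cdot 800^2}{.6} + \dfrac{60^2 \cdot 500^2}{.4}}{\dfrac{150^2 \cdot 200^2}{4} + (90 \cdot 800^2 + 60 \cdot 500^2)} = 36.59 \]

so we need \(n = 37\)

Step 3: Find \(n_i\): \(n_i = w_i n\)

\(n_F = w_F n = .6 \times 37 = 22\)

\(n_M = w_M n = .4 \times 37 = 15\)

Check: 22+15=37!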

Note: If there are more than two groups, it will be quicker to use a spreadsheet.
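
As an alternative to a spreadsheet, the three steps can be wrapped in a small function. This is a sketch only: the function `allocate` and its arguments are made up for illustration, and rounding \(n\) up and the \(n_i\) to the nearest integer is one common convention rather than the only possible one. It is applied here to the numbers of Example 6.

```python
from math import sqrt, ceil


def allocate(N_i, sigma_i, B, cost_i=None, method="proportional"):
    """Return (weights, unrounded n) for estimating a mean to within B.

    method is "proportional", "neyman" or "optimal" (optimal needs cost_i).
    """
    N = sum(N_i)
    if method == "proportional":
        raw = list(N_i)
    elif method == "neyman":
        raw = [Ni * s for Ni, s in zip(N_i, sigma_i)]
    elif method == "optimal":
        raw = [Ni * s / sqrt(c) for Ni, s, c in zip(N_i, sigma_i, cost_i)]
    else:
        raise ValueError(f"unknown method: {method}")
    total = sum(raw)
    w = [r / total for r in raw]                       # Step 1: weights sum to 1
    num = sum(Ni ** 2 * s ** 2 / wi for Ni, s, wi in zip(N_i, sigma_i, w))
    den = N ** 2 * B ** 2 / 4 + sum(Ni * s ** 2 for Ni, s in zip(N_i, sigma_i))
    return w, num / den                                # Step 2: total sample size


# Example 6: women and men, proportional allocation, B = $200.
w, n = allocate([90, 60], [800, 500], B=200)
n_total = ceil(n)                                      # round n up
n_i = [round(wi * n_total) for wi in w]                # Step 3: n_i = w_i * n
print(w, round(n, 2), n_total, n_i)
```

Passing `method="neyman"` or `method="optimal"` with a `cost_i` list switches the weights without changing the rest of the calculation.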

Example 7.1 Find allocation for given error bound Suppose we wish to estimate to within $200 the
average starting salary (p.a.) of the 150 teaching graduates from 2020. We find that the cost of surveying
graduates who have left town is greater than the cost of surveying those who have not left town: $5 for a
graduate who is in town and $10 for a graduate who has left town. Stratify on gender and on location (ie
out of town or not). Of the 90 female graduates, 35 have left town and 55 have not. Of the 60 male
graduates, 35 have left town and 25 have not. Assume that the approximate standard deviations of salaries
for women and men are respectively $800 and $500.

(a) Find the optimal n1 , n2 , n3 , n 4 .


(b) What is the cost of the sample in (a)?

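Solution: (a) \(B = 200\), \(N = 150\)

Strata: 1 = women in town (\(N_1 = 55\), \(c_1\) = $5); 2 = women who have left town (\(N_2 = 35\), \(c_2\) = $10);
3 = men in town (\(N_3 = 25\), \(c_3\) = $5); 4 = men who have left town (\(N_4 = 35\), \(c_4\) = $10).

Optimal allocation so

\[ w_i = \frac{N_i\sigma_i/\sqrt{c_i}}{\sum_j N_j\sigma_j/\sqrt{c_j}} \]

Set up spreadsheet:

Step 1: Compute the weights. (Check that they sum to 1!)

Compute \(N_i\sigma_i/\sqrt{c_i}\), add them up (use Sum in Stat>Basic Stats>Display descriptive) and then get the
weights. Be sure to write down the weights.

\(w_1 = .496203\), \(w_2 = .223280\), \(w_3 = .140967\), \(w_4 = .139550\)

Step 2: To compute \(n\), compute each \(N_i^2\sigma_i^2/w_i\) and add them up. Also compute each \(N_i\sigma_i^2\), then add them
up.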

Get the sum in the numerator of the formula for n (adding up these):

Get the sum in the denominator of the formula for n (adding up these):

Then find n. Show your work for finding n.

\[ n = \frac{\sum N_i^2\sigma_i^2/w_i}{\dfrac{N^2B^2}{4} + \sum N_i\sigma_i^2} = \frac{10715884138}{\dfrac{150^2 \cdot 200^2}{4} + 72600000} = 36.0077 \]

Step 3: Then compute the sample size for each group as \(n_i = w_i n\).

\(n_1 = 18\), \(n_2 = 8\), \(n_3 = 5\), \(n_4 = 5\)
Check: 18+8+5+5=36

(b) What is the cost of the sample in (a)? Show your work!

Total cost = 18*5+8*10+5*5+5*10= $245
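
The spreadsheet steps of Example 7.1 can be mirrored in a few lines of code. A sketch under one labelled assumption: the strata are ordered as women in town, women out of town, men in town, men out of town (this ordering reproduces the weights listed above). The variable names are illustrative.

```python
from math import sqrt

# Assumed stratum order: women in town, women out of town, men in town, men out of town.
N_i   = [55, 35, 25, 35]
sigma = [800, 800, 500, 500]
cost  = [5, 10, 5, 10]
B, N = 200, 150

# Step 1: optimal weights, proportional to N_i * sigma_i / sqrt(c_i).
raw = [Ni * s / sqrt(c) for Ni, s, c in zip(N_i, sigma, cost)]
w = [r / sum(raw) for r in raw]

# Step 2: total sample size from the formula for n.
num = sum((Ni * s) ** 2 / wi for Ni, s, wi in zip(N_i, sigma, w))
den = N ** 2 * B ** 2 / 4 + sum(Ni * s ** 2 for Ni, s in zip(N_i, sigma))
n = num / den

# Step 3: stratum sample sizes and the cost of the sample.
n_i = [round(wi * n) for wi in w]
total_cost = sum(k * c for k, c in zip(n_i, cost))
print([round(wi, 6) for wi in w], round(n, 4), n_i, total_cost)
```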

Next: what if there’s a limit on the cost of sampling?

Example 7.2 Find allocation for given cost of sampling. Suppose we wish to estimate the average
starting salary (p.a.) of the 150 teaching graduates from 2020. We have $150 to cover the cost of the
sample. We find that the cost of surveying graduates who have left town is greater than the cost of
surveying those who have not left town: $5 for a graduate who is in town and $10 for a graduate who has
left town. Stratify on gender and on location (ie out of town or not). Of the 90 female graduates, 35 have
left town and 55 have not. Of the 60 male graduates, 35 have left town and 25 have not. Assume that the
approximate standard deviations of salaries for women and men are respectively $800 and $500. This is
the same as before.

(a) Find the optimal n1 , n2 , n3 , n 4 and check that the total cost is at most $150.
(b) What is the error bound (B) for the allocation in (a)?

Solution: (a)

Total cost = $150

w 1=¿ .496203 w 2=.223280 w 3=.140967 w 4=¿ .139550

The weights are the same as before because the sigma’s, costs and stratum sizes are as before.

\[ 150 = \sum n_i c_i = \sum w_i n c_i = n\sum w_i c_i \]

\(\sum w_i c_i = 6.814\) (do in spreadsheet)

∴ \(150 = n\sum w_i c_i = n \times 6.814\). Solve for n: \(n = \dfrac{150}{6.814} = 22.01\). This is our new total sample size.

Now we have to figure out how many from each stratum using the weights: (I called the new sample sizes
nn_i)

\(n_i = w_i n\):

\(n_1 = 11\), \(n_2 = 5\), \(n_3 = 3\), \(n_4 = 3\)

These add to 22.

Find the total cost of this sample: 11*5+5*10+3*5+3*10 = $150. Yes! We could not exceed this.

(b) What is the error bound (B) for the allocation in (a)? Use spreadsheet!

First, calculate \(\hat{V}(\bar{y}_{st})\):

\[ \hat{V}(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{s_i^2}{n_i} = \frac{414366667}{150^2} = 18416.29631 \]

∴ \(B = 2\sqrt{\hat{V}(\bar{y}_{st})} = 2\sqrt{18416.29631}\) = $271. Bigger than what it was before ($200).
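
Both parts of Example 7.2 can be reproduced in code. This is a sketch with the same assumed stratum ordering as in Example 7.1 (women in town, women out of town, men in town, men out of town); the planning values \(\sigma_i\) stand in for the \(s_i\) in the variance formula, as in the spreadsheet calculation.

```python
from math import sqrt

# Assumed stratum order: women in town, women out of town, men in town, men out of town.
N_i   = [55, 35, 25, 35]
sigma = [800, 800, 500, 500]
cost  = [5, 10, 5, 10]
budget, N = 150, 150

# Optimal weights (same as in Example 7.1).
raw = [Ni * s / sqrt(c) for Ni, s, c in zip(N_i, sigma, cost)]
w = [r / sum(raw) for r in raw]

# (a) budget = n * sum(w_i c_i), so solve for n and allocate.
n = budget / sum(wi * c for wi, c in zip(w, cost))
n_i = [round(wi * n) for wi in w]
total_cost = sum(k * c for k, c in zip(n_i, cost))

# (b) error bound implied by this allocation.
v_hat = sum(
    Ni ** 2 * (1 - k / Ni) * s ** 2 / k for Ni, k, s in zip(N_i, n_i, sigma)
) / N ** 2
B = 2 * sqrt(v_hat)
print(round(n, 2), n_i, total_cost, round(B))
```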

Notice that, in Example 7.1, part (a), n is chosen so that the estimate of the average salary is within $200
(= B) of μ. Consequently n is chosen to satisfy the restriction on B:

\[ n = \frac{\sum N_i^2\sigma_i^2/w_i}{\dfrac{N^2B^2}{4} + \sum N_i\sigma_i^2} \]

The cost turns out to be $245. Compare this situation with the situation in Example 7.2. In Example 7.2,
part (a), n is chosen so that the cost is at most $150. (The error bound turns out to be $271.). The weights
are the same in both Example 7.1 and 7.2 (Weights are calculated according to optimal allocation. )

Example 8 Refer to Examples 7.1 and 7.2. Suppose that all the out-of-town graduates are surveyed at a
reunion, so the cost of surveying out-of-town graduates is the same as the cost of surveying a graduate
who did not leave town. We had N 1=55 ; N 2=35 ; N 3 =25; N 4=35 ; σ 1=σ 2 =$ 800; σ 3=σ 4=$ 500.
Suppose that we wish to estimate the average salary of the 150 graduates to within $200. Find
n1 , n2 , n3 , n 4 .

Solution: What sort of allocation are we using? (proportional, Neyman or optimal?) Neyman, because the costs are now the same for all strata.

Step 1: Calculate the weights: \(w_i = \dfrac{N_i\sigma_i}{\sum_j N_j\sigma_j}\)

First get the numerators \(N_i\sigma_i\)

Then add them up to get the denominator:

Now get the weights:

\[ w_i = \frac{N_i\sigma_i}{\sum_j N_j\sigma_j} \]

\(w_1 = .431373\), \(w_2 = .274510\), \(w_3 = .122549\), \(w_4 = .171569\)

Check that the weights sum to 1:

Step 2: Calculate n:

\[ n = \frac{\sum N_i^2\sigma_i^2/w_i}{\dfrac{N^2B^2}{4} + \sum N_i\sigma_i^2} \]

To get the numerator:

Calculate the \(N_i^2\sigma_i^2/w_i\) and then add them up using Sum in Stat>Basic Stats>Display descriptive:

Now get the summation in the denominator: \(\sum N_i\sigma_i^2\)

Calculate the \(N_i\sigma_i^2\)

Add up the \(N_i\sigma_i^2\) using Sum in Stat>Basic Stats>Display descriptive:

Now substitute into the formula for n:

\[ n = \frac{\sum N_i^2\sigma_i^2/w_i}{\dfrac{N^2B^2}{4} + \sum N_i\sigma_i^2} = \frac{10404000000}{\dfrac{150^2 \cdot 200^2}{4} + 72600000} = 34.96 \]

So use n=35

Step 3: Apply the weights to the total sample size.

\(n_i = w_i n\):

\(n_1 = 15\), \(n_2 = 10\), \(n_3 = 4\), \(n_4 = 6\)

Check \(\sum n_i = 15+10+4+6 = 35 = n\), as we specified.
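
Example 8 in code form, as a sketch. One point worth noticing: with Neyman weights the numerator \(\sum N_i^2\sigma_i^2/w_i\) collapses to \((\sum N_i\sigma_i)^2\), which is why it equals \(102000^2 = 10404000000\) here. The variable names are illustrative.

```python
from math import ceil

N_i   = [55, 35, 25, 35]
sigma = [800, 800, 500, 500]
B, N = 200, 150

# Step 1: Neyman weights, proportional to N_i * sigma_i.
raw = [Ni * s for Ni, s in zip(N_i, sigma)]
total = sum(raw)
w = [r / total for r in raw]

# Step 2: total sample size.
num = sum((Ni * s) ** 2 / wi for Ni, s, wi in zip(N_i, sigma, w))
den = N ** 2 * B ** 2 / 4 + sum(Ni * s ** 2 for Ni, s in zip(N_i, sigma))
n = num / den
n_total = ceil(n)

# Step 3: apply the weights to the rounded-up total.
n_i = [round(wi * n_total) for wi in w]
print([round(wi, 6) for wi in w], round(n, 2), n_i)
```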

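Next: Where does the formula for n come from?

\[ n = \frac{\sum N_i^2\sigma_i^2/w_i}{\dfrac{N^2B^2}{4} + \sum N_i\sigma_i^2} \]

Derivation:

Recall (from Chapter 4) that

\[ V(\bar{y}_i) = \left(\frac{N_i-n_i}{N_i-1}\right)\frac{\sigma_i^2}{n_i} \approx \left(\frac{N_i-n_i}{N_i}\right)\frac{\sigma_i^2}{n_i} = \left(1-\frac{n_i}{N_i}\right)\frac{\sigma_i^2}{n_i} \]

Now

\[ B = 2\sqrt{V(\bar{y}_{st})} = 2\sqrt{\sum\frac{N_i^2}{N^2}V(\bar{y}_i)} \approx 2\sqrt{\sum\frac{N_i^2}{N^2}\left(1-\frac{n_i}{N_i}\right)\frac{\sigma_i^2}{n_i}} \]

\[ ∴ \frac{B^2}{4} = \sum\frac{N_i^2}{N^2}\left(1-\frac{n_i}{N_i}\right)\frac{\sigma_i^2}{n_i} \]

We want to find \(n = \sum n_i\)

Let \(n_i = w_i n\) where \(\sum w_i = 1\)

\[ ∴ \frac{B^2}{4} = \frac{1}{N^2}\sum N_i^2\left(1-\frac{w_i n}{N_i}\right)\frac{\sigma_i^2}{w_i n} = \frac{1}{N^2}\sum\left(\frac{N_i^2\sigma_i^2}{w_i n} - N_i\sigma_i^2\right) \]

\[ \frac{N^2B^2}{4} = \frac{1}{n}\sum\frac{N_i^2\sigma_i^2}{w_i} - \sum N_i\sigma_i^2 \]

\[ ∴ \frac{1}{n}\sum\frac{N_i^2\sigma_i^2}{w_i} = \frac{N^2B^2}{4} + \sum N_i\sigma_i^2 \]

\[ ∴ n = \frac{\sum N_i^2\sigma_i^2/w_i}{\dfrac{N^2B^2}{4} + \sum N_i\sigma_i^2} \]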
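
The derivation can be sanity-checked numerically: take the unrounded \(n\) from the formula, set \(n_i = w_i n\), substitute back into the (approximate) variance of \(\bar{y}_{st}\), and confirm that the bound returns to the target \(B\). The sketch below does this with the Example 6 inputs; the variable names are illustrative.

```python
from math import sqrt

# Example 6 inputs: proportional weights, target bound B = 200.
N_i, sigma, w, B = [90, 60], [800, 500], [0.6, 0.4], 200
N = sum(N_i)

# Formula for n (left unrounded on purpose).
num = sum((Ni * s) ** 2 / wi for Ni, s, wi in zip(N_i, sigma, w))
den = N ** 2 * B ** 2 / 4 + sum(Ni * s ** 2 for Ni, s in zip(N_i, sigma))
n = num / den
n_i = [wi * n for wi in w]

# Plug the n_i back into V(ybar_st) = sum (N_i^2/N^2)(1 - n_i/N_i) sigma_i^2 / n_i.
var = sum(
    Ni ** 2 * (1 - k / Ni) * s ** 2 / k for Ni, k, s in zip(N_i, n_i, sigma)
) / N ** 2
B_back = 2 * sqrt(var)
print(round(n, 2), round(B_back, 6))
```

In practice \(n\) and the \(n_i\) are rounded, so the bound actually achieved differs slightly from the target.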

General Guidelines

1) If the stratum means are widely different then stratified random sampling with
proportional allocation will yield an estimator with smaller variance than that for SRS.
2) If costs are nearly the same for all strata but the stratum variances are widely different
then optimal allocation will yield an estimator with smaller variance than that from
proportional allocation.

Note: The allocation problem for estimating a population total (τ): weights are as for means
(proportional or Neyman or optimal). A bound of \(B\) on τ is the same as a bound of \(\dfrac{B}{N}\) on μ.

Modify the formula for n:

\[ n = \frac{\sum N_i^2\sigma_i^2/w_i}{N^2\dfrac{B^2}{4N^2} + \sum N_i\sigma_i^2} = \frac{\sum N_i^2\sigma_i^2/w_i}{\dfrac{B^2}{4} + \sum N_i\sigma_i^2} \quad\text{where } B \text{ is the bound on } \tau \]

Then calculate the sample sizes. To find \(n_i\): \(n_i = w_i n\)

Estimating p

\[ \hat{p}_{st} = \sum_{i=1}^{L}\frac{N_i}{N}\hat{p}_i \]

This is a weighted average of the proportions from each stratum. The weights are proportional to
the sizes of the strata. (More weight is given to the large strata and smaller weights are given to the
small strata.)

\[ \hat{V}(\hat{p}_{st}) = \hat{V}\left(\sum_{i=1}^{L}\frac{N_i}{N}\hat{p}_i\right) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\,\hat{V}(\hat{p}_i) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{\hat{p}_i\hat{q}_i}{n_i-1} \]

From Chapter 4: \(\hat{V}(\hat{p}) = \left(1-\dfrac{n}{N}\right)\dfrac{\hat{p}\hat{q}}{n-1}\)

Example. Refer to the situation in Exercise 5.6 p 154 of the text. A school desires to estimate the
average score that may be obtained on a reading comprehension exam for students in the sixth
grade. The school’s students are grouped into three tracks, with the fast learners in track I, the
slow learners in track III, and the rest in track II. The school decides to stratify on tracks because
this method should reduce the variability of test scores. The sixth grade contains 55 students in
track I, 80 in track II, and 65 in track III. A stratified random sample of 50 students is
proportionally allocated and yields simple random samples of n1 = 14, n2 = 20, and n3 = 16
from tracks I, II, and III. The test is administered to the sample of students.

Results are here

Estimate the proportion of sixth-graders who scored 60 or more and place a bound on the error of
the estimate.

Solution:

\(\hat{p}_{I} = \dfrac{14}{14} = 1\), \(\hat{p}_{II} = \dfrac{12}{20} = .6\), \(\hat{p}_{III} = \dfrac{2}{16} = .125\)

The estimate of p is .5556

Spreadsheet:

\[ \hat{p}_{st} = \sum_{i=1}^{L}\frac{N_i}{N}\hat{p}_i = .55563 \]

\[ \hat{V}(\hat{p}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\,\hat{V}(\hat{p}_i) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{\hat{p}_i\hat{q}_i}{n_i-1} = \frac{1}{200^2}(83.85554) = .0020964 \]

∴ \(B = 2\sqrt{.0020964} = .092\)

The 95% CI for p is (.46, .65)

Interpret: We estimate that between 46% and 65% of Grade 6 students scored 60 or more.
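
The stratified proportion and its bound can be computed directly from the counts. A minimal sketch using the track sizes, sample sizes and numbers scoring 60 or more given above; the variable names are illustrative.

```python
from math import sqrt

# Tracks I, II, III: stratum sizes, sample sizes, and number scoring 60 or more.
N_i = [55, 80, 65]
n_i = [14, 20, 16]
hits = [14, 12, 2]
N = sum(N_i)

p_i = [h / n for h, n in zip(hits, n_i)]

# p_st = sum (N_i / N) p_i
p_st = sum(Ni * p for Ni, p in zip(N_i, p_i)) / N

# V(p_st) = (1/N^2) sum N_i^2 (1 - n_i/N_i) p_i q_i / (n_i - 1)
v_hat = sum(
    Ni ** 2 * (1 - n / Ni) * p * (1 - p) / (n - 1)
    for Ni, n, p in zip(N_i, n_i, p_i)
) / N ** 2
B = 2 * sqrt(v_hat)
print(round(p_st, 4), round(v_hat, 6), round(B, 3))
```

Track I contributes nothing to the variance because every sampled student there scored 60 or more, so \(\hat{p}_I\hat{q}_I = 0\).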

What if we had not stratified? Then \(\hat{p} = \dfrac{28}{50} = .56\)

Then the 95% CI for p is

\[ \hat{p} \pm 2\sqrt{\left(1-\frac{n}{N}\right)\frac{\hat{p}\hat{q}}{n-1}} = \frac{28}{50} \pm 2\sqrt{\left(1-\frac{50}{200}\right)\frac{.56 \times .44}{50-1}} = .56 \pm .123 \]

Compare the margins of error for stratified and unstratified. Which is better? Stratified is better
because the margin of error is smaller (.123 if not stratified, about .09 if stratified).

Difference of two proportions.

Example: Refer to Exercise 5.6 page 154. In terms of the percentage scoring over 60%, is the
difference between Tracks I and II significant?

Solution:

\(\hat{p}_{I} - \hat{p}_{II} = 1 - .6 = .4\)

\(\hat{V}(\hat{p}_{I} - \hat{p}_{II}) = \hat{V}(\hat{p}_{I}) + \hat{V}(\hat{p}_{II}) = 0 + .0094737\)

∴ \(B = 2\sqrt{.0094737} = .19\)

The 95% CI for \(p_{I} - p_{II}\) is (.21, .59)

What does this tell us about the significance of the difference in % scoring over 60%? Zero is not
in the CI so the difference (in terms of the % scoring over 60%) between tracks I and II is
“significant”.

Note: the procedures for differences are the same as for two independent SRS because
stratification means drawing a SRS independently from each stratum.
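
A short sketch of the same calculation, treating the Track I and Track II samples as independent SRSs. The helper `var_p` is a hypothetical name for the Chapter 4 variance formula with the finite population correction.

```python
from math import sqrt


def var_p(p, n, N):
    """Estimated variance of a sample proportion from an SRS of n out of N."""
    return (1 - n / N) * p * (1 - p) / (n - 1)


# Track I: 14 of 14 sampled (from 55); Track II: 12 of 20 sampled (from 80).
p1, n1, N1 = 14 / 14, 14, 55
p2, n2, N2 = 12 / 20, 20, 80

diff = p1 - p2
v_diff = var_p(p1, n1, N1) + var_p(p2, n2, N2)   # independent samples: variances add
B = 2 * sqrt(v_diff)
ci = (diff - B, diff + B)
print(round(diff, 2), round(v_diff, 7), round(B, 2), [round(x, 2) for x in ci])
```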

The allocation problem for estimating a proportion: same as for estimating means, except replace
\(\sigma_i\) with \(\sqrt{p_i q_i}\).

Example 9: A survey is conducted amongst new residents of Abbotsford to discover whether their
housing needs have adequately been met. Stratification was by type of housing (apartment, free-standing
house or in between (duplex, townhouse, shared house)). Figures for summer 2020 were used as initial
estimates for summer 2021.

          Apartment   House   In between   Total
N_i       137         41      90           268
p̂_i       0.9         0.7     0.6
c_i       $4          $1      $3

Costs of contacting apartment dwellers were high because of the large number of call-backs required to
find some-one at home.

(a) Find n1 , n2 , n3and n so as to estimate to within 5% the percentage who are happy with their
accommodation. Also find the cost of this sample.
(b) Find the optimal stratum sample sizes assuming that we have only $150 for sampling.

Solution:

(a) B = 0.05

Step 1: Find \(w_i\)

\[ w_i = \frac{N_i\sqrt{p_i q_i}/\sqrt{c_i}}{\sum_j N_j\sqrt{p_j q_j}/\sqrt{c_j}} \]

Spreadsheet:

Then do Stat>Basic Stats to get the Sum:

Then get the weights (divide the last column above by the 64.7944):

\(w_1 = .3172\), \(w_2 = .2900\), \(w_3 = .3929\)

Step 2: Find n:

\[ n = \frac{\sum N_i^2 p_i q_i/w_i}{\dfrac{N^2B^2}{4} + \sum N_i p_i q_i} \]

Spreadsheet: Calculate the terms in each sum

Then use Stat>Basic Stats to get the Sums:

Now calculate n:

\[ n = \frac{\sum N_i^2 p_i q_i/w_i}{\dfrac{N^2B^2}{4} + \sum N_i p_i q_i} = \frac{11492}{\dfrac{268^2 \cdot .05^2}{4} + 42.54} = 131.4 \]

Step 3: Find \(n_i = w_i n\):

\(n_1 = 42\); \(n_2 = 38\); \(n_3 = 52\)

Check: This allocation yields n=132 (Compare with n = 131.4 calculated above.)

The cost of this sample is 42*$4+38*$1+52*$3=$362
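
Part (a) of Example 9 as a script, for comparison with the spreadsheet. A sketch only: the variable names are illustrative, and \(n\) is rounded up to 132 before the weights are applied, which is the convention that reproduces the stratum sample sizes above.

```python
from math import sqrt, ceil

# Apartment, house, in between: sizes, planning proportions, unit costs.
N_i  = [137, 41, 90]
p    = [0.9, 0.7, 0.6]
cost = [4, 1, 3]
B = 0.05
N = sum(N_i)

pq = [pi * (1 - pi) for pi in p]

# Step 1: optimal weights with sigma_i replaced by sqrt(p_i q_i).
raw = [Ni * sqrt(v) / sqrt(c) for Ni, v, c in zip(N_i, pq, cost)]
w = [r / sum(raw) for r in raw]

# Step 2: total sample size.
num = sum(Ni ** 2 * v / wi for Ni, v, wi in zip(N_i, pq, w))
den = N ** 2 * B ** 2 / 4 + sum(Ni * v for Ni, v in zip(N_i, pq))
n = num / den

# Step 3: stratum sample sizes and total cost.
n_i = [round(wi * ceil(n)) for wi in w]
total_cost = sum(k * c for k, c in zip(n_i, cost))
print([round(wi, 4) for wi in w], round(n, 1), n_i, total_cost)
```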

(b) Find the optimal stratum sample sizes assuming that we have only $150 for sampling.

Scale each sample size downwards:

150/362=.414

New n1 = .414*42= 17

New n2= .414*38=16

New n3 = .414*52=22

Total sample size=55

Total cost= $150

Will the bound for this allocation be larger or smaller than 5%? Larger How do you know? The
sample size n is smaller so the error bound will be larger

How would you calculate this bound?

Ans: \(B = 2\sqrt{\hat{V}(\hat{p}_{st})}\) where

\[ \hat{V}(\hat{p}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{\hat{p}_i\hat{q}_i}{n_i-1} \]
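
To see roughly how much larger the bound becomes under the $150 budget, the formula can be evaluated for the scaled-down allocation (17, 16, 22). This is a sketch under a stated assumption: the planning values 0.9, 0.7 and 0.6 are used in place of the sample proportions \(\hat{p}_i\), which are not known until the survey is done.

```python
from math import sqrt

N_i = [137, 41, 90]
n_i = [17, 16, 22]          # allocation from part (b)
p   = [0.9, 0.7, 0.6]       # planning values standing in for the sample proportions
N = sum(N_i)

# V(p_st) = (1/N^2) sum N_i^2 (1 - n_i/N_i) p_i q_i / (n_i - 1)
v_hat = sum(
    Ni ** 2 * (1 - k / Ni) * pi * (1 - pi) / (k - 1)
    for Ni, k, pi in zip(N_i, n_i, p)
) / N ** 2
B = 2 * sqrt(v_hat)
print(round(B, 3))
```

Under these assumptions the bound is close to 10 percentage points, about double the 5% target of part (a).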

Additional comments

1) Variation within each stratum should be small otherwise V(estimator) will be larger for
stratified sampling than for SRS.
2) Many surveys ask more than one question. What if we need to estimate both popln mean
and popln %? We have to choose the stratum sample sizes in order to satisfy error bounds
on both mean and %.

Move on to Chap 10!! The rest of Chap 5 will not be examinable

Example 10 What if we have two targets? (eg an error bound for estimating mu and an error bound for
estimating p. How do we find a suitable sample size and allocate the sample?) Do for HW.

Suppose that an investigator wishes to find out (a) what percentage of local teachers supported a recent
strike action and (b) their average length of service in BC. There are 250 teachers in the area. The
percentage is to be estimated to within 10 percentage points and the average is to be estimated to within
one year. The teachers are stratified by salary: 100 teachers who earn less than $x p.a. form stratum 1 and
the rest form stratum 2. From past records, experience in BC ranges from 0 to 5 years in stratum 1 and 5-
30 years in stratum 2. Find the sample size and allocation necessary to achieve both of these bounds.

Post-stratification

Divide up respondents into strata after sampling. We must know the proportional representation
of each stratum in the popln ie we must know \(W_i = \dfrac{N_i}{N}\)

eg Suppose we know that the popln is 52% female and 48% male. We do not know \(N_1\) =
number of males in popln and \(N_2\) = number of females in popln.

We have \(W_1 = .48\) and \(W_2 = .52\).

\[ \bar{y}_{st} = \sum_{i=1}^{L}\frac{N_i}{N}\bar{y}_i \quad\text{as before.} \]

For stratified sampling, when the \(n_i\) are known in advance of sampling, we have:

\[ \hat{V}(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{s_i^2}{n_i} \]

But now, in post-stratified sampling, the \(n_i\) are random (not known in advance and may change
in repeated sampling).

\[ \hat{V}_p(\bar{y}_{st}) = \frac{1}{n}\left(1-\frac{n}{N}\right)\sum_{i=1}^{L} w_i s_i^2 + \frac{1}{n^2}\sum_{i=1}^{L}\left(1-\frac{N_i}{N}\right)s_i^2 \qquad\text{Formula **} \]

= \(\hat{V}(\bar{y}_{st})\) under proportional allocation + increase due to post-stratification

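To derive this, first note that, under proportional allocation, \(w_i = \dfrac{N_i}{N}\) and \(n_i = w_i n = \dfrac{N_i n}{N}\).

Then

\[ \hat{V}(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{s_i^2}{n_i} = \sum_{i=1}^{L} w_i^2\left(1-\frac{n}{N}\right)\frac{s_i^2}{w_i n} = \frac{1}{n}\left(1-\frac{n}{N}\right)\sum_{i=1}^{L} w_i s_i^2 \]

Now derive the result:

\[ \hat{V}_p(\bar{y}_{st}) = \frac{1}{n}\left(1-\frac{n}{N}\right)\sum_{i=1}^{L} w_i s_i^2 + \frac{1}{n^2}\sum_{i=1}^{L}\left(1-\frac{N_i}{N}\right)s_i^2 \]

\[ \hat{V}(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2\left(1-\frac{n_i}{N_i}\right)\frac{s_i^2}{n_i} = \sum_{i=1}^{L} w_i^2\left(1-\frac{n_i}{N_i}\right)\frac{s_i^2}{n_i} \]

where \(w_i = \dfrac{N_i}{N}\). The reason for this is that, in a random sample, the size of each group in the
sample will theoretically be proportional to the size of that group in the population, that is,
\(w_i = \dfrac{N_i}{N} \approx \dfrac{n_i}{n}\)

\[ \hat{V}(\bar{y}_{st}) = \sum_{i=1}^{L} w_i^2\left(1-\frac{n_i}{N_i}\right)\frac{s_i^2}{n_i} = \sum w_i^2\frac{s_i^2}{n_i} - \sum\frac{N_i^2}{N^2}\frac{s_i^2}{N_i} = \sum w_i^2\frac{s_i^2}{n_i} - \frac{1}{N}\sum w_i s_i^2 \]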

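In the case of post-stratified samples, \(n_i\) is random so replace \(\dfrac{1}{n_i}\) with an approximation of \(E\left(\dfrac{1}{n_i}\right)\)

\(E\left(\dfrac{1}{n_i}\right)\) is like an average value of \(\dfrac{1}{n_i}\). \(n_i\) is hypergeometric so \(E(n_i) = \dfrac{N_i}{N}n = w_i n\)

Then

\[ E\left(\frac{1}{n_i}\right) \approx \frac{1}{n w_i} + \frac{1-w_i}{n^2 w_i^2} \]

This works well only if n is large.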

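\[ \hat{V}_p(\bar{y}_{st}) = \sum w_i^2 s_i^2\left(\frac{1}{n w_i} + \frac{1-w_i}{n^2 w_i^2}\right) - \frac{1}{N}\sum w_i s_i^2 \]

\[ = \sum w_i^2 s_i^2\frac{1}{n w_i} + \sum w_i^2 s_i^2\frac{1-w_i}{n^2 w_i^2} - \frac{1}{N}\sum w_i s_i^2 \]

\[ = \frac{1}{n}\sum w_i s_i^2 + \frac{1}{n^2}\sum s_i^2(1-w_i) - \frac{1}{N}\sum w_i s_i^2 = \left(\frac{1}{n}-\frac{1}{N}\right)\sum w_i s_i^2 + \frac{1}{n^2}\sum(1-w_i)s_i^2 \]

\[ = \frac{1}{n}\left(1-\frac{n}{N}\right)\sum w_i s_i^2 + \frac{1}{n^2}\sum(1-w_i)s_i^2, \quad\text{as required.} \]

For proportions: replace \(s_i^2\) with \(\dfrac{n_i p_i q_i}{n_i-1}\) in the formula **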

Example 11 An opinion poll on EI payments was undertaken by phone (cell and landline): 1000 people
were asked whether EI payments were adequate for basic needs. Respondents were stratified according to
whether they were employed or not. (Respondents were asked during the survey about their employment
status.) The results were summarised as follows:

             Number of respondents   EI adequate (Yes)   p̂_i
Unemployed   120                     48                  0.4
Employed     880                     704                 0.8

(a) The national unemployment figure was 6%. Post-stratify and find ^pst , the estimate of the
proportion of the population who think that EI is adequate for basic needs.
(b) Give a 95% confidence interval for the proportion of the population who think
that EI is adequate for basic needs.

Solution: First use the usual estimator to estimate the proportion of the population who think that EI
is adequate for basic needs. What’s wrong with this estimator?

\(\hat{p}\) = ____ / ____ = ??

Is ?? a good estimate of the % of the popln who think that EI is adequate for basic needs? Why or why
not?

Solution (a): In order to give each person in the popln the same weight, we need the unemployed to count
6% and the employed to count 94%.

ie \(w_U\) = ____   \(w_E\) = ____

Then \(\hat{p}_{st}\) = ____

Compare this with \(\hat{p}\)

Note: Use the same idea for estimating μ and τ

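Solution (b)

\[ \hat{V}_p(\bar{y}_{st}) = \frac{1}{n}\left(1-\frac{n}{N}\right)\sum_{i=1}^{L} w_i s_i^2 + \frac{1}{n^2}\sum_{i=1}^{L}\left(1-\frac{N_i}{N}\right)s_i^2 \]

For proportions, \(s_i^2 = \dfrac{n_i p_i q_i}{n_i-1}\)

So \(s_1^2 = \dfrac{n_1 p_1 q_1}{n_1-1}\) = ____ / ____ = ?  and  \(s_2^2 = \dfrac{n_2 p_2 q_2}{n_2-1}\) = ____ / ____ = ?

\(\hat{V}_p(\bar{y}_{st}) = \dfrac{1}{n}\left(1-\dfrac{n}{N}\right)(\ \ \ \ ) + \dfrac{1}{n^2}\left[(1-w_U)(\ \ \ \ ) + (1-w_E)(\ \ \ \ )\right]\)

= ____

∴ \(B = 2\sqrt{\ \ \ \ }\) = ____

The CI for p is ____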

Interpret:
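
The steps of Example 11 can be organised as code, which is also a way to check a hand calculation. This is a sketch, not the official solution: the function name `post_stratified_variance` is illustrative, and because the popln size \(N\) is not given, the factor \(1 - n/N\) is taken to be 1 (an assumption that is reasonable only when the popln is very large relative to \(n = 1000\)).

```python
from math import sqrt


def post_stratified_variance(n, w, s2, N=None):
    """Formula **: (1/n)(1 - n/N) sum w_i s_i^2 + (1/n^2) sum (1 - w_i) s_i^2.

    If N is None, the finite population correction 1 - n/N is taken as 1.
    """
    fpc = 1.0 if N is None else 1 - n / N
    first = fpc / n * sum(wi * s for wi, s in zip(w, s2))
    second = sum((1 - wi) * s for wi, s in zip(w, s2)) / n ** 2
    return first + second


# Unemployed, employed: respondents, "yes" answers, popln weights (6% / 94%).
n_i = [120, 880]
yes = [48, 704]
w = [0.06, 0.94]
n = sum(n_i)

p_i = [y / k for y, k in zip(yes, n_i)]
p_naive = sum(yes) / n                                 # usual (unweighted) estimator
p_st = sum(wi * pi for wi, pi in zip(w, p_i))          # post-stratified estimator

# For proportions, s_i^2 = n_i p_i q_i / (n_i - 1).
s2 = [k * pi * (1 - pi) / (k - 1) for k, pi in zip(n_i, p_i)]
v_hat = post_stratified_variance(n, w, s2)
B = 2 * sqrt(v_hat)
print(round(p_naive, 3), round(p_st, 3), round(B, 3))
```

The unweighted estimate is pulled down because the unemployed make up 12% of respondents but only 6% of the popln; the post-stratified estimate corrects for this.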

Caveats:

1) \(\dfrac{N_i}{N}\) must be known accurately

2) Be careful if post-stratification is used to adjust for non-response.

Eg Suppose that

• the response variable (y) is height,
• men are less likely to respond than women (why might this happen?), and
• we have \(W_M = .49\) and \(W_F = .51\).

Then \(\bar{y}_{st} = .49\,\bar{y}_M + .51\,\bar{y}_F\)

This will be too high because short men tend not to answer, making \(\bar{y}_M\) too high.

The problem is that the non-response caused bias. This *can’t* be fixed by adjusting for the
proportion of men in the popln. The bias has affected y M . Can’t fix this because we don’t know
what % of men are tall, short, etc. Also, we can’t assume that the proportions of tall/short men in
the popln are the same as in the sample.

Notice that this is exactly the problem faced by pollsters trying to predict outcomes
of political elections when certain types of voters (eg Trump supporters) are less likely to
participate in polls than other voters. We can't predict the outcome of the vote because the % of
Trump supporters in the sample is not the % of Trump supporters in the popln, and we can't
assume that those who did respond will vote like those who did not respond.

HW. Exercises 5.1, 5.3, 5.5, 5.7, 5.9, 5.11, 5.13, 5.15, 5.19 (Assume that the stratum sizes are the
same), 5.27, 5.31, 5.33

Below are the formulae for the allocation problem, all on one page for your convenience.

Sample size calculation and allocation to strata: Calculate \(w_i\), then \(n\), then \(n_i = w_i n\)

Proportional Allocation: Accounts for sizes of strata

\[ w_i = \frac{N_i}{N} \]

Optimal allocation: Accounts for sizes of strata, different variances and costs

\[ w_i = \frac{N_i\sigma_i/\sqrt{c_i}}{\sum_j N_j\sigma_j/\sqrt{c_j}} \]

Neyman Allocation: Assumes costs the same for all strata

\[ w_i = \frac{N_i\sigma_i}{\sum_j N_j\sigma_j} \]

\[ n = \frac{\sum N_i^2\sigma_i^2/w_i}{\dfrac{N^2B^2}{4} + \sum N_i\sigma_i^2} \qquad\text{Formula (*)} \]

Same for τ except \(\dfrac{B^2}{4N^2}\) replaces \(\dfrac{B^2}{4}\). For p, replace \(\sigma_i\) with \(\sqrt{p_i q_i}\)
