Download as pdf or txt
Download as pdf or txt
You are on page 1of 7

3 UNEQUAL PROBABILITY SAMPLING

In Section 2, with a SRS we know the sampling units all have equal probability of being selected
in the sample. For other sampling procedures, dierent units in the population have dierent
probabilities of being included in a sample. The dierent inclusion probabilities depend on either
(i) the type of sampling procedure or (ii) the probabilities may be imposed by the researcher
to obtain better estimates by including more important units with higher probability. In
either case, the unequal inclusion probabilities must be taken into account to derive estimates
of population parameters.
3.1 Hansen-Hurwitz Estimation
One method of estimating and when the probability of selecting sampling units is not
equal is Hansen-Hurwitz estimation.
Situation: (i) A sample of size n is to be selected, (ii) sampling is done with replacement,
and (iii) the probability of selecting the i
th
unit equals p
i
on each draw (selection) of a
sampling unit. That is, the probability is constant and equal to p
i
for every selection of a
sampling unit.
3.1.1 Hansen-Hurwitz Estimators
The Hansen-Hurwitz estimator of is

hh
=
1
n
n

i=1
y
i
p
i
.
The variance of
hh
is
var(
hh
) =
1
n
N

i=1
p
i
_
y
i
p
i

_
2
.
Because the true variance is unknown, we use the data to estimate it. The estimated
variance of our estimator is
var(
hh
) =
1
n(n 1)
n

i=1
p
i
_
y
i
p
i

hh
_
2
.
When estimating , divide
hh
by N and var(
hh
) by N
2
. The Hansen-Hurwitz estimators
are:

hh
=
1
Nn
n

i=1
y
i
p
i
and var(
hh
) =
1
N
2
n(n 1)
n

i=1
p
i
_
y
i
p
i

hh
_
2
.
These estimators are unbiased.
29
Example: Hansen-Hurwitz estimation with selection probabilities p
i
that are propor-
tional to Size and with p
i
(approximately) directly proportional to y
i
The total abundance = 16. There are N = 5 sampling units. The gure shows the units,
labeled 1 to 5, and the ve y
i
values.
1 2 3 4 5
y
i
7 4 0 2 3
p
i
.4 .3 .1 .1 .1
Sample Units y-values P[S = s]
HH

Var(
HH
)
1 1,2 7,4 0.24 15.4167 4.3403
2 1,3 7,0 0.08 8.7500 76.5625
3 1,4 7,2 0.08 18.7500 1.5625
4 1,5 7,3 0.08 23.7500 39.0625
5 2,3 4,0 0.06 6.6667 44.4444
6 2,4 4,2 0.06 16.6667 11.1111
7 2,5 4,3 0.06 21.6667 69.4444
8 3,4 0,2 0.02 10.0000 100.0000
9 3,5 0,3 0.02 15.0000 225.0000
10 4,5 2,3 0.02 25.0000 25.0000
11 1,1 7,7 0.16 17.5000 0.0000
12 2,2 4,4 0.09 13.3333 0.0000
13 3,3 0,0 0.01 0.0000 0.0000
14 4,4 2,2 0.01 20.0000 0.0000
15 5,5 3,3 0.01 30.0000 0.0000
Because sampling is done with replacement, a unit may occur more than once in the
sample. The estimator uses that units y
i
value as many times as it was selected.
The ideal case for Hansen-Hurwitz estimation occurs when each selection probability p
i
is
proportional to y
i
for all i. That is p
i
y
i
/. Then y
i
/p
i
for all i and var(
hh
) is close
to 0.
Therefore, in practice, if we believe the y
i
values are nearly proportional to some known
variable (like sample unit size), we should assign selection probabilities p
i
proportional to
the value of that known variable. This would yield an estimator with small variance.
3.1.2 Condence Interval Estimation
For large samples, approximate (1 )100% condence intervals for and are

hh
z

_
var(
hh
)
hh
z

_
var(
hh
)
For small samples (Thompson suggests for n < 50), replace z

with t

from the t-
distribution having n 1 degrees of freedom.
30
Figure 4: Hansen-Hurwitz Estimation
with Selection Probabilities p
i
that are Proportional to Size
The total abundance = 13354. There are N = 49 sampling units. The gure shows the unit
labels and the forty-nine y
i
values.
1 2 3 10 11 12 13
292 (344) 401 218 243 272 267
4 5 6 14 15 16 17
337 418 ((467)) 252 273 (279) 307
7 8 9 18 19 20 21
419 475 526 285 308 321 346
22 26 30 34 35 36 37
227 249 (278) 158 156 170 171
23 27 31 38 39 40 41
242 278 305 171 180 181 195
24 28 32 42 43 44 45
262 280 (322) 187 185 (193) 216
25 29 33 46 47 48 49
(293) 293 333 183 191 202 203
For i = 1, 2, . . . , 9, p
i
= .04, for i = 10, 11, . . . , 33, p
i
= .02, and for i = 34, 35, . . . , 49,
p
i
= .01. Then, we see

p
i
= (9)(.04) + (24)(.02) + (16)(.01) = .36 + .48 + .16 = 1.
A sample of size 8 was selected proportional to size with replacement. The sample units were
2, 6, 6, 16, 25, 30, 32, 44.
3.2 Horvitz-Thompson Estimation
A second method of estimating and when the probability of selecting sampling units
is not equal is Horvitz-Thompson estimation.
Situation: (i) A sample of size n is to be selected, (ii) sampling is done with replacement
or without replacement.
3.2.1 Inclusion Probabilities
Suppose that we are taking a probability sample from a population of size N with y-values
y
1
, y
2
, . . . , y
N
.
It is assumed that the researchers goal is to estimate the population total or mean per
unit and an associated variance or condence interval.
The rst order inclusion probability
i
is the probability unit i will be included by a
sampling design.
The second order inclusion probability
ij
is the probability that unit i and unit j
will both be included by a sampling design.
31
3.2.2 Horvitz-Thompson Estimators
A practical application of the
i
s and the
ij
s is for estimation purposes. The
i
s and
the
ij
s are used to determine the
(i) Horvitz-Thompson estimators
ht
and
ht
of the population total and mean
(ii) Variances of these estimators, var(
ht
) and var(
ht
)
(iii) Estimates of the variances of these estimators, var(
ht
) and var(
ht
).
When the goal is to estimate the population total or the mean , and the
i
s are known,
Horvitz and Thompson (1952) showed

ht
=

i=1
y
i

i
and
ht
=
1
N

i=1
y
i

i
(19)
are unbiased estimators of and where is the eective sample size. The eective
sample size is the number of distinct units in the sample. When sampling without
replacement, = n.
Because the summation is over the distinct units in the sample, the estimator does not
depend on the number of times a unit may be selected. This allows for Horvitz-Thompson
estimators to be used for sampling plans with replacement or without replacement of units.
One form of the true variance of the estimator
ht
is
var(
ht
) =
N

i=1
_
1

i
1
_
y
2
i
+ 2
N1

i=1
N

j=i+1
_

ij

j
1
_
y
i
y
j
.
Because this variance is typically unknown, we must estimate it from the data. An esti-
mator of this variance is
var(
ht
) =

i=1
_
1

2
i

i
_
y
2
i
+ 2

i=1

j>i
_
1

ij
_
y
i
y
j
. (20)
It follows directly that var(
ht
) =
1
N
2
var(
ht
).
If
ij
> 0 for all i, j = 1, 2, . . . , N, then var(
ht
) is an unbiased estimator.
Example: Horvitz-Thompson Estimation with inclusion probabilities
i
that are propor-
tional to size and
i
is (approximately) directly proportional to y
i
The total abundance = 16. There are N = 5 sampling units. The gure shows the units,
labeled 1 to 5, and the ve y
i
values.
1 2 3 4 5
y
i
7 4 0 2 3
p
i
.4 .3 .1 .1 .1
32
Sample Units y-values P[S = s]
HT

Var(
HT
)
1 1,2 7,4 0.24 18.7806 11.4440
2 1,3 7,0 0.08 10.9375 43.0664
3 1,4 7,2 0.08 21.4638 13.0803
4 1,5 7,3 0.08 26.7270 65.4002
5 2,3 4,0 0.06 7.8431 30.1423
6 2,4 4,2 0.06 18.3695 18.3450
7 2,5 4,3 0.06 23.6326 79.7593
8 3,4 0,2 0.02 10.5263 89.7507
9 3,5 0,3 0.02 15.7895 201.9391
10 4,5 2,3 0.02 26.3158 24.0997
11 1,1 7,7 0.16 10.9375 43.0664
12 2,2 4,4 0.09 7.8431 30.1423
13 3,3 0,0 0.01 0 0
14 4,4 2,2 0.01 10.5263 89.7507
15 5,5 3,3 0.01 15.7895 201.9391
The inclusion probabilities are
1
= .64,
2
= .51, and
3
=
4
=
5
= .19.
A second unbiased variance estimator (if all
ij
> 0) is the Sen-Yates-Grundy variance
estimator:
var
syg
(
ht
) =

j>i
_
y
i

y
j

j
_
2
_

ij
1
_
(21)
Warning: Some designs will occasionally generate samples that yield negative variance
estimates even though the true variance must be nonnegative.
The following formula of Brewer and Hanif is considered a conservative alternative to the
previous variance estimation formulas:
var(
bh
) =
N
N
s
2
t

where s
2
t
=
1
( 1)

i=1
(t
i

ht
)
2
and t
i
=
y
i

i
.
It is considered conservative because it will always yield a nonnegative variance estimate
and tends to be larger than the actual variance. Thus, var
syg
(
ht
) is a biased estimator.
For additional information on the Horvitz-Thompson estimators and the Sen-Yates-Grundy
and the Brewer and Hanif variance estimators, see Hedayat and Sinha (1991).
3.2.3 Condence Interval Estimation
For large samples, approximate 100(1 )% condence intervals for and are

ht
z

_
var(
ht
)
ht
z

_
var(
ht
) (22)
where z

is the the upper /2 critical value from the standard normal distribution.
33
For smaller samples ( < 50, Thompson (1992)), approximate 100(1 )% condence
intervals for and are

ht
t

_
var(
ht
)
ht
t

_
var(
ht
) (23)
where t

is the the upper /2 critical value from the t( 1) distribution.


3.3 Sampling with Probabilities Proportional to Size (PPS)
Suppose that n sampling units are selected with replacement with selection probabilities
proportional to the sizes of the units from a nite population of N units.
Let p
i
be the probability that unit i is selected during the sampling with replacement
process. Then p
i
=
N
i
N
T
where N
i
is the size of unit i and N
T
=
N

i=1
N
i
= the total size
of the population of N units.
If sampling is done without replacement, determining the inclusion probabilities is often
very complex. Thus, we will only consider sampling with replacement and use Horvitz-
Thompson estimation.
If sampling is done with replacement, the rst order inclusion probability

i
= Pr(unit i will be included in the sample) (24)
= 1 Pr(unit i will not be included in the sample)
= 1 (1 p
i
)
n
To nd the second order inclusion probability, we use the principle of inclusion/exclusion.
That is, for two events A and B,
Pr(A and B) = Pr(A) + Pr(B) Pr(A or B)
Pr(A B) = Pr(A) + Pr(B) Pr(A B)
If A is the event that unit i is in the sample and B is the event that unit j is in the sample,
then

ij
= Pr(both unit i and unit j will be included in the sample) (25)
= Pr(unit i will be included in the sample)
+ Pr(unit j will be included in the sample)
Pr(either unit i or unit j will be included in the sample)
=
i
+
j
[1 Pr(neither unit i nor unit jwill be included in the sample)]
=
i
+
j
[1 (1 p
i
p
j
)
n
]
The values for
i
and
ij
can be substituted into equations (19) to (23) to generate esti-
mates of and , to estimate variances, and to generate condence intervals.
34
Example: Hansen-Hurwitz Estimation with selection probabilities p
i
that are propor-
tional to size and p
i
is approximately inversely proportional to y
i
The total abundance = 16. There are N = 5 sampling units. The gure shows the units,
labeled 1 to 5, and the ve y
i
values. We will now select all samples of size n = 2.
1 2 3 4 5
y
i
0 2 7 4 3
p
i
.4 .3 .1 .1 .1
Sample Units y-values P[S = s]
HH

Var(
HH
)
1 1,2 0,2 0.24 3.3333 11.1111
2 1,3 0,7 0.08 35.0000 1225.0000
3 1,4 0,4 0.08 20.0000 400.0000
4 1,5 0,3 0.08 15.0000 225.0000
5 2,3 2,7 0.06 38.3333 1002.7778
6 2,4 2,4 0.06 23.3333 277.7778
7 2,5 2,3 0.06 18.3333 136.1111
8 3,4 7,4 0.02 55.0000 225.0000
9 3,5 7,3 0.02 50.0000 400.0000
10 4,5 4,3 0.02 35.0000 25.0000
11 1,1 0,0 0.16 0.0000 0.0000
12 2,2 2,2 0.09 6.6667 0.0000
13 3,3 7,7 0.01 70.0000 0.0000
14 4,4 4,4 0.01 40.0000 0.0000
15 5,5 3,3 0.01 30.0000 0.0000
This is a very poor sampling plan. This will occur when the largest y
i
values in the population
are associated with the smallest p
i
values.
Sampling with replacement is less precise than sampling without replacement (i.e., the
variance of the estimator will be larger). However, when the sampling fraction f = n/M
is small, the probability that any unit will appear twice in the sample is also small. In
this case, sampling with replacement, for all practical purposes, is equivalent to sampling
without replacement.
Thus, the loss of some precision using sampling with replacement can oset the complex-
ity of having to determine the second order inclusion probability when sampling is done
without replacement.
35

You might also like