Sampling Lesson 10: Sampling Without Replacement: The Binomial and Hypergeometric Probability Models


SAMPLING

LESSON 10: SAMPLING WITHOUT REPLACEMENT
THE BINOMIAL AND HYPERGEOMETRIC PROBABILITY MODELS

Introduction:
In the previous lesson, we re-examined coin tossing from a statistical perspective. What happens in
each flip of the coin is unaffected by earlier and later flips. That is, the probability p of getting a
head remains the same through all n flips of the coin. In other words, the flips, or more precisely
their outcomes, are independent. As a sampling procedure, this is called sampling with equal
probability with replacement, more commonly referred to as simple random sampling¹ (with
replacement).
Also, in the previous lesson you learned that the probability of observing x heads in n independent
tosses of a coin is given by
$$P(X = x) = {}_{n}C_{x}\, p^{x} (1 - p)^{n - x} \quad \text{for } x = 0, 1, 2, \ldots, n$$
where the chance p of getting heads is a fixed constant. This is called the binomial probability mass
function (pmf). The binomial pmf is the result of sampling for a proportion when the sampling is
with equal probability and with replacement.
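As a minimal sketch, the binomial pmf can be evaluated directly from the formula. The function name `binomial_pmf` and the values n = 10, p = 0.5 are illustrative assumptions, not taken from the lesson.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x): x successes in n independent trials, each with success chance p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Illustrative values (an assumption): n = 10 tosses of a fair coin.
n, p = 10, 0.5
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]

print(probs[5])    # chance of exactly 5 heads in 10 tosses
print(sum(probs))  # the probabilities over x = 0, 1, ..., n add up to 1
```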
Consider now a box that contains N marbles, M of which are white and the rest of other colors, so that
p =M/N is the proportion of white marbles.

Mix the marbles well. Then draw n < N marbles successively, without putting back those previously
drawn. In this case, the outcomes of the draws are no longer independent. Hence the variable X =
number of white marbles in the sample will not behave according to the binomial pmf². From basic
counting techniques (through a branch of mathematics called combinatorics), the number of ways n
marbles can be drawn from N marbles is
$${}_{N}C_{n} = \frac{N!}{(N-n)!\,n!}$$
The number of ways x white marbles can be drawn from the M white marbles is ${}_{M}C_{x}$, and the
number of ways the remaining n − x marbles can be drawn from the N − M non-white marbles is
${}_{N-M}C_{n-x}$. Therefore, the probability of having x white marbles in the sample is
$$P(X = x) = \frac{({}_{M}C_{x})({}_{N-M}C_{n-x})}{{}_{N}C_{n}} \quad \text{for } x = 0, 1, 2, \ldots, n$$
(values of x with x > M or n − x > N − M are impossible and have probability 0).

This is the hypergeometric pmf.

BRIEF EXPLANATION
¹ Strictly speaking, random does not mean equal probability. However, very early on in the history of
statistics, simple random was used to describe sampling with equal probability, and the usage persisted.
² X will follow the binomial pmf if the marbles are drawn with replacement: draw, replace, mix the
marbles, and repeat n times. Then the outcomes of the draws are independent and you are drawing from
the same population (of marbles) each time.


When N and M, and hence P = M/N, are known, the hypergeometric pmf can be computed exactly,
without uncertainty. This is a mere exercise in probability.
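To illustrate that exact computation, here is a small sketch of the hypergeometric pmf. The box contents (N = 20, M = 8, n = 5) and the name `hypergeom_pmf` are hypothetical, chosen only for illustration.

```python
from math import comb

def hypergeom_pmf(x: int, N: int, M: int, n: int) -> float:
    """P(X = x): x white marbles in n draws without replacement
    from a box of N marbles, M of which are white."""
    if x < 0 or x > n:
        return 0.0
    # math.comb(a, b) is 0 when b > a, so impossible counts get probability 0.
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)

# Hypothetical box (an assumption): N = 20 marbles, M = 8 white, n = 5 drawn.
N, M, n = 20, 8, 5
pmf = [hypergeom_pmf(x, N, M, n) for x in range(n + 1)]
mean = sum(x * q for x, q in enumerate(pmf))

print(pmf)
print(sum(pmf))  # adds up to 1
print(mean)      # equals n * M / N, the expected number of white marbles
```

The mean n·M/N equals n·P, the same expected count as under the binomial model; the two models differ in their spread, not in their center.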

Main Lesson: Sampling Marbles from a Box without Replacement


In practice, the fraction P of marbles that are white is unknown. There are many important real-world
sampling situations in which P is not known and the sampling is done without replacement, hence the
hypergeometric pmf is useful. The aim is to make inferences about P. Think of the proportion of
voters for candidate A. This is now a problem in statistics, no longer an exercise in computing exact
probabilities.
Since in each draw equal probability is assigned to all remaining marbles, it is intuitively clear that
the sampling procedure gives every marble the same probability of being in the sample. Hence the
procedure is called sampling with equal probability without replacement, or simple random sampling
without replacement.
Furthermore, the equal probability property of the sampling procedure suggests that assigning equal
importance or weight to the outcome of every draw should lead to a reasonable estimate for P.
Picture the outcome of a sample as a sequence of n 1's and 0's, where 1 stands for white and 0 for not
white; the sum of the sequence is X = number of whites. Giving equal weight, with the sum of the
weights = 1, means a weight of 1/n for each draw. This leads to the estimate p = X/n, which is the
same formula as in sampling with equal probability with replacement.
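The estimate p = X/n can be tried out on a simulated box. This is only a sketch: the box (N = 1000, M = 300), the sample size n = 50, the number of repetitions, and the seed are all assumptions made for illustration.

```python
import random

random.seed(0)  # arbitrary fixed seed so the sketch is reproducible

# Hypothetical box (an assumption): N = 1000 marbles, M = 300 white (1), the rest not white (0).
N, M, n = 1000, 300, 50
box = [1] * M + [0] * (N - M)

sample = random.sample(box, n)  # n draws without replacement
X = sum(sample)                 # number of white marbles in the sample
p_hat = X / n                   # equal weight 1/n on each draw

# Averaged over many repeated samples, the estimate should land close to P = M/N.
estimates = [sum(random.sample(box, n)) / n for _ in range(5000)]
average_estimate = sum(estimates) / len(estimates)

print(p_hat, average_estimate, M / N)
```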

Standard Error of Sample Mean when Sampling without Replacement


How much lower is the SE of the mean without replacement? As was indicated in the previous lesson,
under conditions of sampling with replacement, the standard error (SE) of the sampling distribution
of the mean is proportional to the population standard deviation σ and inversely proportional to the
square root of the sample size n:
$$SE = \frac{\sigma}{\sqrt{n}}$$
On the other hand, when sampling is conducted without replacement, the SE is
$$SE = \frac{\sigma}{\sqrt{n}} \sqrt{\left[1 - \left(\frac{n}{N}\right)\right]\left[\frac{N}{N-1}\right]} = \frac{\sigma}{\sqrt{n}} \sqrt{\left(\frac{N-n}{N-1}\right)}$$

The term $\sqrt{(N-n)/(N-1)}$ is called the finite population correction (fpc) and the ratio n/N is
the sampling rate, where n and N are the sample size and population size, respectively. When
the sampling rate is small enough, the two SEs for the mean (where sampling is conducted
with and without replacement) can be assumed to be virtually the same.
But how small is “small enough”? It depends on the situation, although n/N < 0.05 is a workable
rule of thumb in many real situations. In the majority of actual sampling applications, N is very large
relative to n, so the fpc is replaced by 1.
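The effect of the sampling rate on the fpc can be seen numerically. The sketch below uses hypothetical values (σ = 10, n = 100, and three population sizes); the helper names `fpc` and `se_mean` are assumptions for illustration.

```python
from math import sqrt
from typing import Optional

def fpc(n: int, N: int) -> float:
    """Finite population correction for sample size n and population size N."""
    return sqrt((N - n) / (N - 1))

def se_mean(sigma: float, n: int, N: Optional[int] = None) -> float:
    """SE of the sample mean: with replacement if N is None, else without replacement."""
    se = sigma / sqrt(n)
    return se if N is None else se * fpc(n, N)

# Hypothetical values (an assumption): sigma = 10, n = 100.
sigma, n = 10.0, 100
for N in (200, 2_000, 200_000):
    print(N, n / N, round(fpc(n, N), 4), round(se_mean(sigma, n, N), 4))
```

With a sampling rate of 0.5 the fpc is about 0.71, at 0.05 it is about 0.97, and at 0.0005 it is essentially 1, which is why the correction is usually dropped for very large populations.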

EXAMPLE: (Heights, Weights, and BMIs of Female Students Revisited):


In the previous lesson, the heights, weights, and BMIs of 15 students formed the population to be
studied, with the (population) average height, (population) average weight, and (population) average
BMI as the parameters of interest.
The entire sampling distributions for the average height, average weight, and average BMI for a
sample of size n = 2 students were extensively tabulated and illustrated. In addition, the behaviour
of the sampling distribution was also simulated with 10,000 experiments (using statistical software)
for sample sizes n=3, n=5, n=9, and n=12. It was observed that the expected value (EV) of each
sampling distribution always hit the mark (i.e. the relevant population average), but the SE got
smaller with increasing sample size. In addition, the sampling distribution could be approximated
rather well with a normal curve when the sample size was increased.
Here, we revisit the results, but this time under the assumption of sampling without replacement.
If a random sample of size n=2 were to be obtained, then there would be (15)(14) = 210 possible
equally likely ordered samples to be selected. The full list of samples is given in Table 1.
Table 1: Distinct samples of size two (without replacement) and average heights, weights,
and BMIs of each sample

Each row lists three samples side by side. Within each group of six columns, the columns are:
Sample, First Student, Second Student, Average Height (m), Average Weight (kg), Average BMI.

1 1 2 1.58 45 18.25669 71 6 1 1.633 42.5 15.94628 141 11 1 1.56 43 17.93642
2 1 3 1.58 44.5 18.04028 72 6 2 1.573 47.5 19.33087 142 11 2 1.5 48 21.321
3 1 4 1.645 42.5 15.70052 73 6 3 1.573 47 19.11446 143 11 3 1.5 47.5 21.10459
4 1 5 1.33 50 36.27112 74 6 4 1.638 45 16.7747 144 11 4 1.565 45.5 18.76483
5 1 6 1.633 42.5 15.94628 75 6 5 1.323 52.5 37.3453 145 11 5 1.25 53 39.33543
6 1 7 1.57 39 15.8805 76 6 7 1.563 41.5 16.95468 146 11 6 1.553 45.5 19.0106
7 1 8 1.62 45.5 17.39699 77 6 8 1.613 48 18.47117 147 11 7 1.49 42 18.94481
8 1 9 1.53 41.1 17.90025 78 6 9 1.523 43.6 18.97443 148 11 8 1.54 48.5 20.46131
9 1 10 1.58 47 19.12234 79 6 10 1.573 49.5 20.19652 149 11 9 1.45 44.1 20.96456
10 1 11 1.56 43 17.93642 80 6 11 1.553 45.5 19.0106 150 11 10 1.5 50 22.18666
11 1 12 1.63 47 17.72412 81 6 12 1.623 49.5 18.7983 151 11 12 1.55 50 20.78843
12 1 13 1.57 38 15.43605 82 6 13 1.563 40.5 16.51023 152 11 13 1.49 41 18.50037
13 1 14 1.59 45 17.97746 83 6 14 1.583 47.5 19.05164 153 11 14 1.51 48 21.04177
14 1 15 1.655 51.5 18.73083 84 6 15 1.648 54 19.80501 154 11 15 1.575 54.5 21.79514
15 2 1 1.58 45 18.25669 85 7 1 1.57 39 15.8805 155 12 1 1.63 47 17.72412
16 2 3 1.52 49.5 21.42486 86 7 2 1.51 44 19.26508 156 12 2 1.57 52 21.1087
17 2 4 1.585 47.5 19.0851 87 7 3 1.51 43.5 19.04867 157 12 3 1.57 51.5 20.89229
18 2 5 1.27 55 39.6557 88 7 4 1.575 41.5 16.70891 158 12 4 1.635 49.5 18.55253
19 2 6 1.573 47.5 19.33087 89 7 5 1.26 49 37.27951 159 12 5 1.32 57 39.12313
20 2 7 1.51 44 19.26508 90 7 6 1.563 41.5 16.95468 160 12 6 1.623 49.5 18.7983
21 2 8 1.56 50.5 20.78158 91 7 8 1.55 44.5 18.40539 161 12 7 1.56 46 18.73251
22 2 9 1.47 46.1 21.28483 92 7 9 1.46 40.1 18.90864 162 12 8 1.61 52.5 20.24901
23 2 10 1.52 52 22.50693 93 7 10 1.51 46 20.13074 163 12 9 1.52 48.1 20.75226
24 2 11 1.5 48 21.321 94 7 11 1.49 42 18.94481 164 12 10 1.57 54 21.97436
25 2 12 1.57 52 21.1087 95 7 12 1.56 46 18.73251 165 12 11 1.55 50 20.78843
26 2 13 1.51 43 18.82064 96 7 13 1.5 37 16.44445 166 12 13 1.56 45 18.28807
27 2 14 1.53 50 21.36204 97 7 14 1.52 44 18.98585 167 12 14 1.58 52 20.82947
28 2 15 1.595 56.5 22.11541 98 7 15 1.585 50.5 19.73922 168 12 15 1.645 58.5 21.58284
29 3 1 1.58 44.5 18.04028 99 8 1 1.62 45.5 17.39699 169 13 1 1.57 38 15.43605
30 3 2 1.52 49.5 21.42486 100 8 2 1.56 50.5 20.78158 170 13 2 1.51 43 18.82064
31 3 4 1.585 47 18.86869 101 8 3 1.56 50 20.56517 171 13 3 1.51 42.5 18.60423
32 3 5 1.27 54.5 39.43929 102 8 4 1.625 48 18.22541 172 13 4 1.575 40.5 16.26447
33 3 6 1.573 47 19.11446 103 8 5 1.31 55.5 38.79601 173 13 5 1.26 48 36.83507
34 3 7 1.51 43.5 19.04867 104 8 6 1.613 48 18.47117 174 13 6 1.563 40.5 16.51023
35 3 8 1.56 50 20.56517 105 8 7 1.55 44.5 18.40539 175 13 7 1.5 37 16.44445
36 3 9 1.47 45.6 21.06842 106 8 9 1.51 46.6 20.42514 176 13 8 1.55 43.5 17.96094
37 3 10 1.52 51.5 22.29052 107 8 10 1.56 52.5 21.64723 177 13 9 1.46 39.1 18.4642
38 3 11 1.5 47.5 21.10459 108 8 11 1.54 48.5 20.46131 178 13 10 1.51 45 19.68629
39 3 12 1.57 51.5 20.89229 109 8 12 1.61 52.5 20.24901 179 13 11 1.49 41 18.50037
40 3 13 1.51 42.5 18.60423 110 8 13 1.55 43.5 17.96094 180 13 12 1.56 45 18.28807
41 3 14 1.53 49.5 21.14563 111 8 14 1.57 50.5 20.50235 181 13 14 1.52 43 18.54141
42 3 15 1.595 56 21.899 112 8 15 1.635 57 21.25572 182 13 15 1.585 49.5 19.29478
43 4 1 1.645 42.5 15.70052 113 9 1 1.53 41.1 17.90025 183 14 1 1.59 45 17.97746
44 4 2 1.585 47.5 19.0851 114 9 2 1.47 46.1 21.28483 184 14 2 1.53 50 21.36204
45 4 3 1.585 47 18.86869 115 9 3 1.47 45.6 21.06842 185 14 3 1.53 49.5 21.14563
46 4 5 1.335 52.5 37.09953 116 9 4 1.535 43.6 18.72866 186 14 4 1.595 47.5 18.80587
47 4 6 1.638 45 16.7747 117 9 5 1.22 51.1 39.29926 187 14 5 1.28 55 39.37647
48 4 7 1.575 41.5 16.70891 118 9 6 1.523 43.6 18.97443 188 14 6 1.583 47.5 19.05164
49 4 8 1.625 48 18.22541 119 9 7 1.46 40.1 18.90864 189 14 7 1.52 44 18.98585
50 4 9 1.535 43.6 18.72866 120 9 8 1.51 46.6 20.42514 190 14 8 1.57 50.5 20.50235
51 4 10 1.585 49.5 19.95076 121 9 10 1.47 48.1 22.15049 191 14 9 1.48 46.1 21.0056
52 4 11 1.565 45.5 18.76483 122 9 11 1.45 44.1 20.96456 192 14 10 1.53 52 22.2277
53 4 12 1.635 49.5 18.55253 123 9 12 1.52 48.1 20.75226 193 14 11 1.51 48 21.04177
54 4 13 1.575 40.5 16.26447 124 9 13 1.46 39.1 18.4642 194 14 12 1.58 52 20.82947
55 4 14 1.595 47.5 18.80587 125 9 14 1.48 46.1 21.0056 195 14 13 1.52 43 18.54141
56 4 15 1.66 54 19.55924 126 9 15 1.545 52.6 21.75897 196 14 15 1.605 56.5 21.83618
57 5 1 1.33 50 36.27112 127 10 1 1.58 47 19.12234 197 15 1 1.655 51.5 18.73083
58 5 2 1.27 55 39.6557 128 10 2 1.52 52 22.50693 198 15 2 1.595 56.5 22.11541
59 5 3 1.27 54.5 39.43929 129 10 3 1.52 51.5 22.29052 199 15 3 1.595 56 21.899
60 5 4 1.335 52.5 37.09953 130 10 4 1.585 49.5 19.95076 200 15 4 1.66 54 19.55924
61 5 6 1.323 52.5 37.3453 131 10 5 1.27 57 40.52136 201 15 5 1.345 61.5 40.12984
62 5 7 1.26 49 37.27951 132 10 6 1.573 49.5 20.19652 202 15 6 1.648 54 19.80501
63 5 8 1.31 55.5 38.79601 133 10 7 1.51 46 20.13074 203 15 7 1.585 50.5 19.73922
64 5 9 1.22 51.1 39.29926 134 10 8 1.56 52.5 21.64723 204 15 8 1.635 57 21.25572
65 5 10 1.27 57 40.52136 135 10 9 1.47 48.1 22.15049 205 15 9 1.545 52.6 21.75897
66 5 11 1.25 53 39.33543 136 10 11 1.5 50 22.18666 206 15 10 1.595 58.5 22.98107
67 5 12 1.32 57 39.12313 137 10 12 1.57 54 21.97436 207 15 11 1.575 54.5 21.79514
68 5 13 1.26 48 36.83507 138 10 13 1.51 45 19.68629 208 15 12 1.645 58.5 21.58284
69 5 14 1.28 55 39.37647 139 10 14 1.53 52 22.2277 209 15 13 1.585 49.5 19.29478
70 5 15 1.345 61.5 40.12984 140 10 15 1.595 58.5 22.98107 210 15 14 1.605 56.5 21.83618

Figure 2 illustrates the sampling distributions for the average height, average weight, and average
BMI for samples of size n=2.
Figure 2: Sampling distributions of the sample means for samples of n=2 female students
(selected at random without replacement) for their (a) heights, (b) weights, and (c) BMI levels
Computations can also be readily made for the EVs and the SEs of the sampling distributions for the
average height, average weight, and average BMI when a sample size n=2 is taken (where sampling
is done without replacement). They yield:

      Average Height   Average Weight   Average BMI
EV    1.52             48.21            22.09
SE    0.10             5.04             6.70
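The weight column of these results can be reproduced by brute-force enumeration of all 210 ordered samples. In the sketch below, the individual weights are not given in this lesson; they are back-calculated from the pairwise averages in Table 1 (for example, students 1 and 2 average 45 kg) and should be treated as an assumption.

```python
from itertools import permutations
from math import sqrt

# Weights (kg) of students 1..15, back-calculated from the pairwise average
# weights in Table 1. These individual values are an assumption.
weights = [40, 50, 49, 45, 60, 45, 38, 51, 42.2, 54, 46, 54, 36, 50, 63]

# All ordered samples of size 2 without replacement: 15 * 14 of them.
samples = list(permutations(range(len(weights)), 2))
means = [(weights[i] + weights[j]) / 2 for i, j in samples]

# EV and SE of the sampling distribution of the average weight.
ev = sum(means) / len(means)
se = sqrt(sum((m - ev) ** 2 for m in means) / len(means))

print(len(samples), round(ev, 2), round(se, 2))
```

Because every ordered sample is equally likely, the EV and SE are simply the mean and the (population-style) standard deviation of the 210 sample averages.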

• Recall the EVs and the SEs of the sampling distributions for the average height, average
weight, and average BMI for samples of size n=2 (when sampling is conducted with replacement).
They were:

      Average Height   Average Weight   Average BMI
EV    1.52             48.21            22.09
SE    0.11             5.23             6.96
As we pointed out earlier, the EV for sampling distributions of means, whether with or without
replacement, is the target population parameter, i.e. the population mean. The SE for sampling
without replacement, however, is less than the SE for sampling with replacement. Recall that the SE
for means when sampling is done with replacement for a sample of size n is given by
$$SE = \frac{\sigma}{\sqrt{n}}$$
and that the SE without replacement is this value multiplied by the fpc, $\sqrt{(N-n)/(N-1)}$,
which is less than 1 whenever n > 1.
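As a quick check on the two sets of reported SEs, multiplying the with-replacement values by the fpc for N = 15 and n = 2 should roughly recover the without-replacement values. The arithmetic below uses only the rounded figures reported in this lesson, so agreement is expected only up to rounding.

```latex
\begin{align*}
\text{fpc} &= \sqrt{\frac{N-n}{N-1}} = \sqrt{\frac{15-2}{15-1}} = \sqrt{\frac{13}{14}} \approx 0.964 \\
\text{Average weight: } & 5.23 \times 0.964 \approx 5.04 \quad (\text{reported: } 5.04) \\
\text{Average BMI: } & 6.96 \times 0.964 \approx 6.71 \quad (\text{reported: } 6.70)
\end{align*}
```

The height SEs (0.11 and 0.10) are rounded too coarsely for the same check to be informative.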
EXAMPLE (continued):
Always remember that, as the sample size increases, the sampling distribution tends to be better
approximated by a normal curve with center given by the EV and standard deviation given by the SE.
Here is a simulation experiment conducted with a statistical software package called Stata. The
results of 10,000 simulated random samples without replacement for sample sizes n=3, n=5, n=9,
and n=12 from the box model representing the heights, weights, and BMIs of the N=15 students
are shown in Figure 3, Figure 4, and Figure 5, respectively.
Figure 3. Sampling Distribution of the Sample Mean Height (taken from a random sample
without replacement) of size (i) n=3; (ii) n=5; (iii) n=9; (iv) n=14

Figure 4. Sampling Distribution of the Sample Mean Weight (taken from a random sample
without replacement) of size (i) n=3; (ii) n=5; (iii) n=9; (iv) n=14

Figure 5. Sampling Distribution of the Sample Mean BMI (taken from a random sample
without replacement) of size (i) n=3; (ii) n=5; (iii) n=9; (iv) n=14
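The lesson's simulations were run in Stata. A rough Python analogue for the average weight might look like the sketch below. The population values are back-calculated from the pairwise averages in Table 1 and are an assumption, as is the seed; the sample sizes follow those named in the text.

```python
import random
from math import sqrt
from statistics import fmean, pstdev

random.seed(1)  # arbitrary fixed seed so the sketch is reproducible

# Weights (kg) of the N = 15 students, back-calculated from Table 1 (an assumption).
population = [40, 50, 49, 45, 60, 45, 38, 51, 42.2, 54, 46, 54, 36, 50, 63]
N = len(population)
sigma = pstdev(population)  # population standard deviation

results = {}
for n in (3, 5, 9, 12):
    # 10,000 simulated samples of size n, drawn without replacement.
    means = [fmean(random.sample(population, n)) for _ in range(10_000)]
    theory = sigma / sqrt(n) * sqrt((N - n) / (N - 1))  # SE with the fpc
    results[n] = (fmean(means), pstdev(means), theory)
    print(n, round(results[n][0], 2), round(results[n][1], 2), round(theory, 2))
```

For each n, the simulated mean of the sample averages should sit near the population mean, and the simulated spread should sit near the SE formula with the fpc.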
What justifies the choice of sampling “without replacement” over “with replacement”? As was pointed
out, more information is gained when sampling is done without replacement.
EXAMPLE 2:
A janitor has 20 keys, and one of them is the key to a locked office door. Should he sample the
keys with or without replacement?
If he randomly tries the keys one by one, but does not set aside the ones he has tried, then he is
sampling with replacement. In this case, the long-run average number of tries to unlock the
door is 20.
If he tries the keys one by one, setting aside the ones that do not work, then he is sampling
without replacement. In this case, the right key is equally likely to be at any position from 1 to
20, so the long-run average number of tries to unlock the door is (1 + 20)/2 = 10.5, or about 11.
In this case, sampling without replacement makes more sense than sampling with replacement.
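The two long-run averages in the janitor example can be checked with a short sketch that computes the exact expectations and then simulates both strategies. The function names, the seed, and the number of repetitions are assumptions for illustration.

```python
import random
from fractions import Fraction

K = 20  # number of keys, exactly one of which opens the door

# Without replacement: the right key is equally likely to be the 1st, 2nd, ..., K-th key tried.
expected_without = sum(Fraction(k, K) for k in range(1, K + 1))

# With replacement: each try succeeds with chance 1/K, so the number of tries
# is geometric with mean 1 / (1/K) = K.
expected_with = Fraction(K)

def tries_with_replacement(rng: random.Random) -> int:
    """Pick a key at random each time; key 0 plays the role of the right key."""
    tries = 1
    while rng.randrange(K) != 0:
        tries += 1
    return tries

def tries_without_replacement(rng: random.Random) -> int:
    """Try all keys in a random order; return the position of the right key."""
    order = rng.sample(range(K), K)
    return order.index(0) + 1

rng = random.Random(2)  # arbitrary fixed seed
reps = 20_000
sim_with = sum(tries_with_replacement(rng) for _ in range(reps)) / reps
sim_without = sum(tries_without_replacement(rng) for _ in range(reps)) / reps

print(float(expected_with), sim_with)
print(float(expected_without), sim_without)
```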

KEY POINTS:

• Sampling with replacement results in independent draws that are unaffected by previous
outcomes, but in practice sampling without replacement is more common, since we want as
much information as possible. Additional information is gained whenever a new unit is drawn,
but no new information is gained from a unit that has already been drawn (which can happen
when sampling is done with replacement).
• When selecting a relatively small sample from a large population, the draws are approximately
independent whether we sample with replacement or without replacement.
• While the standard error (SE) of the sampling distribution of the mean is
$$SE = \frac{\sigma}{\sqrt{n}}$$
when sampling with replacement, the SE for the mean when sampling without replacement is
smaller, and is given by
$$SE = \frac{\sigma}{\sqrt{n}} \sqrt{\left[1 - \left(\frac{n}{N}\right)\right]\left[\frac{N}{N-1}\right]} = \frac{\sigma}{\sqrt{n}} \sqrt{\left(\frac{N-n}{N-1}\right)}$$
where σ is the population standard deviation, while n and N are the sample size and
population size, respectively.
o The term $\sqrt{(N-n)/(N-1)}$ is called the finite population correction (fpc) and the ratio
n/N is the sampling rate.
o When the sampling rate is small enough, the two SEs (for with and without
replacement) can be assumed to be virtually the same. In the majority of actual sampling
applications, N is very large so that the fpc is replaced by 1.
• For the special case when the sample mean is actually a proportion, the EV of the sampling
distribution of the sample proportion p is the population proportion P; the standard error (SE) is
$$SE = \sqrt{\frac{P(1-P)}{n}} \sqrt{\left(\frac{N-n}{N-1}\right)}$$
where n and N are the sample size and population size, respectively.
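The SE formula for a proportion can be cross-checked against the hypergeometric pmf from the introduction: the standard deviation of X/n computed directly from the pmf should match the formula. The box below (N = 50, M = 15, n = 10) and the function name `se_proportion` are hypothetical.

```python
from math import comb, sqrt

def se_proportion(P: float, n: int, N: int) -> float:
    """SE of the sample proportion under sampling without replacement."""
    return sqrt(P * (1 - P) / n) * sqrt((N - n) / (N - 1))

# Hypothetical box (an assumption): N = 50 marbles, M = 15 white, sample of n = 10.
N, M, n = 50, 15, 10
P = M / N

# Exact hypergeometric probabilities for X = 0, 1, ..., n white marbles.
pmf = [comb(M, x) * comb(N - M, n - x) / comb(N, n) for x in range(n + 1)]

# EV and standard deviation of the sample proportion X/n, straight from the pmf.
ev = sum((x / n) * q for x, q in enumerate(pmf))
sd = sqrt(sum((x / n - ev) ** 2 * q for x, q in enumerate(pmf)))

print(round(ev, 4), round(sd, 4), round(se_proportion(P, n, N), 4))
```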
APPLICATION AND ASSESSMENT

Do the following.
1. A city has 300,000 registered voters, with 120,000 of them poor. A survey organization is
about to take a random sample of 1,000 registered voters. Describe the sampling distribution
of the fraction of poor among the 1,000 sampled voters.

2. Consider a school district that has 10,000 11th graders. In this district, the average weight of
an 11th grader is 45 kg, with a standard deviation of 10 kg. Suppose you draw a random
sample of 50 students. What is the probability that the average weight of a sampled student
will be less than 42.5 kg?

3. A survey organization wants to take a simple random sample without replacement to
estimate the proportion of voters who are in favour of voting for candidate A in the next
election. To keep costs down, they want to take a sample as small as possible, but their client
would only tolerate chance errors of 1 percentage point or so in the estimate. Should they use
a sample of size 100, 2,500, or 10,000? Note that past experience suggests that the population
percentage is in the range 20% to 40%.

4. A simple random sample of 400 persons 15 years old and above is taken in Naga City. The
total years of schooling of all the sampled persons is 3,230, so that the average educational
attainment is 3230/400 ≈ 8.1 years. The standard deviation of the sample data is 4.1 years.
Describe the sampling distribution of the sample mean.
