Preventing Bad/Biased Samples


Many so-called “standard” statistical analyses presented in introductory statistics courses assume that the data of
interest are independent and identically distributed (or “i.i.d.”) observations. As discussed in the lectures earlier
this week, simple random sampling (SRS) is the closest probability sampling analog to i.i.d., in that the sampling
mechanism used to generate the observations will produce (at least approximately) independent and identically
distributed observations. Although SRS produces samples with this convenient “i.i.d.” property, which facilitates
“standard” statistical analyses, it is seldom used when sampling from real populations. One reason is that SRS, while
producing unbiased estimates (recall that this means the estimates from hypothetical repeated samples drawn using SRS
have a mean equal to the true population mean), can still generate “bad” samples with substantial sampling error
(where an estimate based on the sample is quite different from the population parameter of interest).

Consider, for example, a national sample of 1,000 cell phone numbers selected using SRS. While in expectation a
sample will include a representative mix of numbers from area codes across the nation, every possible sample of
1,000 numbers is equally likely under SRS. This means that one particular sample containing only Florida area codes
is just as likely as any one particular sample that is spread representatively across the states. Ideally, we would
like to use design strategies to reduce the chances of such a “bad sample” occurring. The major statistical problem
with the simple random “Florida” sample is that any estimate we compute after collecting data from it will likely be
very different from the true population parameter that we are trying to estimate, especially if the variable of
interest tends to take on very different values in Florida relative to the rest of the nation. Because SRS does
nothing to rule out these extreme samples, the sampling distribution of estimates based on simple random samples can
be quite variable.
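The point about unbiased but variable estimates can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the population of 1,000 units (100 of which take the value 1, standing in for a subgroup that differs from the rest), the sample size of 50, and the function name srs_estimates are all hypothetical and do not come from the cell phone example.

```python
import random
import statistics

def srs_estimates(population, n, reps, seed=0):
    """Draw `reps` simple random samples of size n (without replacement)
    and return the sample mean from each one."""
    rng = random.Random(seed)
    return [statistics.mean(rng.sample(population, n)) for _ in range(reps)]

# Hypothetical population: 100 units with value 1, 900 units with value 0.
population = [1] * 100 + [0] * 900
true_mean = statistics.mean(population)

estimates = srs_estimates(population, n=50, reps=2000)

print("true population mean:   ", true_mean)
print("mean of the estimates:  ", statistics.mean(estimates))
print("std. dev. of estimates: ", statistics.stdev(estimates))
print("smallest / largest:     ", min(estimates), max(estimates))
```

Under these assumptions, the average of the estimates should sit close to the true mean (unbiasedness), while the smallest and largest individual estimates show how far a single “bad” sample can land from it.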

A very common sampling technique used to reduce the sampling variance that can arise from these so-called “bad
samples” in SRS is stratification. You’ve already been introduced to stratification in an earlier lecture. When we
conduct stratified sampling, we first allocate portions of our sample to all possible divisions (or “strata”) of the
population of interest (e.g., states), and then select a random sample within each division. This ensures that some
sample will be selected from every division, and that the overall sample will therefore be representative of the
target population with respect to the stratifying variable. For example, suppose that we knew that 55% of students
enrolled in a particular college were females, and 45% were males. Using a technique known as proportionate
allocation, if we wanted to draw a sample of 1,000 students from this college, we would randomly select 550 females
from a list of all females enrolled, and 450 males from a list of all males enrolled. This ensures that our entire
sample of size 1,000 won’t include only females!
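The college example can be sketched in a few lines of Python. This is a minimal illustration, not a production sampling routine: the total enrollment of 10,000, the student identifiers, and the function names proportionate_allocation and stratified_sample are assumptions made for the example. The largest-remainder rounding is one reasonable way to handle shares that are not whole numbers; the reading does not specify one.

```python
import random

def proportionate_allocation(stratum_sizes, n):
    """Split a total sample size n across strata in proportion to their sizes."""
    total = sum(stratum_sizes.values())
    exact = {name: n * size / total for name, size in stratum_sizes.items()}
    alloc = {name: int(share) for name, share in exact.items()}
    # Give any units lost to rounding down to the strata with the largest remainders.
    leftover = n - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc

def stratified_sample(frames, n, seed=None):
    """Draw a simple random sample within each stratum, sized by proportionate allocation."""
    rng = random.Random(seed)
    alloc = proportionate_allocation({name: len(units) for name, units in frames.items()}, n)
    return {name: rng.sample(units, alloc[name]) for name, units in frames.items()}

# Hypothetical enrollment lists: 10,000 students, 55% female and 45% male.
frames = {
    "female": [f"F{i}" for i in range(5500)],
    "male": [f"M{i}" for i in range(4500)],
}

allocation = proportionate_allocation({k: len(v) for k, v in frames.items()}, 1000)
sample = stratified_sample(frames, 1000, seed=1)

print(allocation)
print({name: len(units) for name, units in sample.items()})
```

Because the allocation is fixed before any student is selected, every sample drawn this way contains exactly 550 females and 450 males, so an all-female sample cannot occur.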
