Module 0 - Data Collection

Module 0 Gathering Data (Assigned readings)
To determine a population parameter (a numerical summary of a population) exactly, researchers
would need to conduct a census.
What is a Census?
- A census is a special sample that includes everyone: it “samples” the entire population.
- There are problems with taking a census:
o Too expensive
o Undercoverage (may not actually include everyone)
o Too time-consuming
Because of these problems, researchers prefer to collect data from a sample instead, and summaries
that are found from data in a sample are called sample statistics. There are two types of conclusions
(or inferences) that a researcher can make with sample statistics.
Two types of Inferences:
a) Population Inference
- Results from the sample can be generalized to an entire population (as estimates)
b) Causal (cause-and-effect) Inference
- The difference in the responses is caused by the difference in treatments when comparing
the results from two treatment groups.
Population Inference
We should only make population inferences when we have random sampling (in other words,
when the individuals in the sample are randomly selected from the population).
- Randomizing helps to eliminate the effect of unknown extraneous factors, even ones that we
may not have thought about.
o Randomizing makes sure that, on the average, the sample looks like the rest of the
population.
- Non-random sampling leads to biased results (results that tend to over- or under- emphasize
some characteristics of the population)
o There is usually no way to fix a biased sample and no way to salvage useful information
from it.
o The best way to avoid bias is to select individuals for the sample at random.
- When we do not have random sampling from a population, conclusions should be restricted to
the sample. That is, we should not generalize our results in the sample to anyone else.
Random Sampling Methods
1) Simple Random Samples (SRS)

- SRS of size n: every possible sample of size n from the population has the same chance of being selected.
o Ex: put all the names of the individuals in the population in a box and draw names to
complete the sample
- Samples drawn at random generally differ from one another.
o Each draw of random numbers selects different people for our sample.
o These differences lead to different values for the variables we measure.
o We call these sample-to-sample differences sampling variability.
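The box-of-names idea and sampling variability can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the population of 1,000 yes/no opinions, its 40% true proportion, and the helper name `simple_random_sample` are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, rng):
    """Draw n individuals without replacement; every possible set of n
    individuals has the same chance of being the sample."""
    return rng.sample(population, n)

# Hypothetical population: 1 = "yes", 0 = "no" (true proportion is 0.40).
population = [1] * 400 + [0] * 600

rng = random.Random(2024)  # fixed seed so the run is repeatable
estimates = []
for _ in range(5):
    sample = simple_random_sample(population, 50, rng)
    estimates.append(sum(sample) / len(sample))  # sample proportion (a statistic)

# The five statistics will generally differ from one another and from 0.40.
print(estimates)
```

Each pass through the loop plays the role of a new draw from the box, and the spread among the printed proportions is the sampling variability described above.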
Example:
Suppose a local school district decides to randomly test high school students for their overall
school experience. There are three high schools in the district, each with grades 9-12. The
school board pools all of the students together and randomly samples 250 students. Is this a
simple random sample?
A. Yes, because each student is equally likely to be chosen.
B. Yes, because they could have chosen any sample of 250 students from throughout
the district.
C. No, because we can’t guarantee that there are students from each school in the
sample.
D. No, because we can’t guarantee that there are students from each grade in the
sample.
E. There is not enough info to know whether it is a simple random sample.
Example: A random sample of 2000 Canadians was asked to name their favorite fast-food
restaurant in 2019. ABC Restaurant had the highest percentage with 10% of Canadians
ranking it as their favorite restaurant. Which is TRUE?
I. The population of interest is all Canadians.
II. 10% is a statistic and not the actual percentage of all Canadians who would rank
this restaurant as their favorite.
III. This sampling design should provide a reasonably accurate estimate of the
actual percentage of all Canadians who would rank this restaurant as their
favorite.
A. I only
B. II only
C. III only
D. I, II, and III
2) Stratified Random Sampling

- The population is first divided into homogeneous groups, called strata; an SRS is then taken
within each stratum, and the results are combined.
- Stratified random sampling can reduce bias.
- Stratifying can also reduce the variability of our results.
- Ex: Suppose you want to estimate the proportion of Canadians that support federal party X
based on an appropriate representation from each province.
o You could break up the population by province (strata) and select an SRS from each
province.
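A minimal sketch of stratified sampling, assuming three made-up provinces with made-up voter IDs. The function name `stratified_sample` and the choice of an equal sample size per stratum are illustrative assumptions; the notes do not say how many individuals to take from each province.

```python
import random

def stratified_sample(strata, n_per_stratum, rng):
    """Take an SRS of n_per_stratum from every stratum and combine them."""
    sample = []
    for name, members in strata.items():
        for person in rng.sample(members, n_per_stratum):
            sample.append((name, person))
    return sample

# Hypothetical strata: three provinces with made-up voter IDs.
strata = {
    "BC": [f"BC-{i}" for i in range(100)],
    "ON": [f"ON-{i}" for i in range(300)],
    "QC": [f"QC-{i}" for i in range(200)],
}

sample = stratified_sample(strata, 10, random.Random(7))
counts = {name: sum(1 for s, _ in sample if s == name) for name in strata}
print(counts)  # every province contributes exactly 10 individuals
```

Because an SRS is drawn inside every stratum, no province can be left out of the sample by chance, which is the point of stratifying.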
Example: Suppose a store owner decides to randomly test the hygiene of his staff. There are
10 departments in his store, and each department has 20 staff members. He plans to test 40
of his staff members by randomly choosing 4 staff members from each department. Is this a
simple random sample?
A. Yes, because the staff members were chosen at random.
B. Yes, because each staff member is equally likely to be chosen.
C. Yes, because stratified samples are a type of simple random sample.
D. No, because not all possible groups of 40 staff members could have been the sample.
This is a stratified sample.
E. No, because a random sample of departments was not first chosen.
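The counting behind this example can be checked directly. The sketch below uses only the numbers given in the example (10 departments, 20 staff each, 4 chosen per department); the variable names are illustrative.

```python
from math import comb

departments = 10
staff_per_department = 20
chosen_per_department = 4

total_staff = departments * staff_per_department      # 200
sample_size = departments * chosen_per_department     # 40

# Number of different samples each design can produce.
srs_samples = comb(total_staff, sample_size)
stratified_samples = comb(staff_per_department, chosen_per_department) ** departments

# Chance that one particular staff member ends up in the sample.
p_srs = sample_size / total_staff                            # 40/200
p_stratified = chosen_per_department / staff_per_department  # 4/20

print(p_srs, p_stratified)               # 0.2 0.2
print(stratified_samples < srs_samples)  # True
```

Each staff member has the same 20% chance of selection under both designs, so equal individual chances are not enough to make a sample an SRS. The stratified design can never produce, for example, a group of 40 drawn from only two departments, whereas an SRS could.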
3) Systematic Random Sampling

- Start from a randomly selected individual, then sample every kth person.
- When there is no reason to believe that the order of the list could be associated in any way
with the responses sought, systematic sampling can give a representative sample.
- Systematic sampling can be much less expensive than true random sampling.
- Ex: Suppose you want to estimate the proportion of individuals that support federal party X in
your area. You can set up a booth in your area and ask every 50th person your question.
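A minimal sketch of systematic sampling with a random starting point. The list of 1,000 numbered passers-by and the function name `systematic_sample` are assumptions made for illustration.

```python
import random

def systematic_sample(ordered_list, k, rng):
    """Pick a random start among the first k positions, then take every kth item."""
    start = rng.randrange(k)
    return ordered_list[start::k]

# Hypothetical stream of 1,000 passers-by, numbered in arrival order.
passers_by = list(range(1000))
sample = systematic_sample(passers_by, 50, random.Random(3))
print(len(sample))  # 20, because 1000 / 50 = 20 whatever the start
```

Only the starting point is random; every later pick is determined by it, which is why the order of the list must be unrelated to the responses.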
4) Cluster Random Sampling

- Split the population into similar groups (or clusters), select one or a few clusters at random,
and perform a census within each of them.
o This sampling design is called cluster sampling.
o If each cluster fairly represents the full population, cluster sampling will give us an
unbiased sample.
- Cluster sampling is not the same as stratified sampling. Consider how.
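A minimal sketch of cluster sampling, assuming 10 made-up groups of 20 people each. The function name `cluster_sample` and the ID format are illustrative only.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Choose n_clusters whole clusters at random, then take everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [person for name in chosen for person in clusters[name]]
    return chosen, sample

# Hypothetical clusters: 10 groups of 20 people each.
clusters = {f"group-{g}": [f"g{g}-p{i}" for i in range(20)] for g in range(10)}

chosen, sample = cluster_sample(clusters, 2, random.Random(11))
print(chosen, len(sample))
```

The randomness here acts on whole groups: a stratified sample takes a few individuals from every group, while a cluster sample takes every individual from a few groups.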
Example: Suppose a store owner decides to randomly test the hygiene of his staff. There are
10 departments in his store, and each department has 20 staff members. He plans to test 40
of his staff members by randomly choosing two departments and checking everyone in these two
departments. Is this cluster sampling?
A. Yes, because the staff members were chosen at random.
B. No, because each staff member is equally likely to be chosen.
C. Yes, because cluster samples are a type of simple random sample.
D. No, because not all possible groups of 40 staff members could have been the sample.
E. Yes, because each department is a cluster, and everyone within the two randomly
chosen departments is the sample.
Example: A statistics teacher wants to know how her students feel about an introductory statistics
course. She decides to administer a survey to a random sample of students taking the course. She has
several sampling plans to choose from. Name the sampling strategy in each.
a. There are four levels of students taking the class: 1st year, 2nd year, 3rd year, and 4th year.
Randomly select 15 students from each level.
Stratified Random Sample
b. Divide the class into 4 similar groups, and randomly select one of the groups and survey every
student in that group.
Cluster Random Sample
c. Each student has a seven-digit student number. Randomly choose 60 numbers.
Simple Random Sample

d. Using the class roster, select every fifth student from the list.
Systematic Random Sample
Example: You want to determine the proportion of university students that have “jobs” while
attending school. What kind of sample can you get if:
1) you have a complete list of students?
SRS (Simple random sample)
2) you have a complete list of students and want to make sure each faculty is appropriately
represented.
Stratified Random Sample. Select an SRS from each faculty.
3) you do not have a complete list of students but believe that, at any given time, the group of
students in each classroom across campus all individually form a representative sample of the
entire student body.
Cluster Sample. Select an SRS of classrooms, then go to each selected classroom and
sample everyone.
4) you do not have a complete list of students. You believe that students walking in front of the
Registrar’s building throughout the day are a good representation of the entire student body.
Systematic Random Sample. Because of the massive number of students walking past your
booth, you can’t sample everyone, but you sample every 20th person that walks by.
Recall: Bias is the tendency for a sample to differ from the corresponding population in some
systematic way.
Sources of Bias:
1) Selection Bias (Undercoverage): when some portion of the population is not sampled at all or
has a smaller representation in the sample than it has in the population.
- Usually the people that are not covered differ from the rest of the population, so bias exists.
o Ex: a sample survey of households will miss persons with no fixed address and prison
inmates.
o Ex: an opinion poll conducted by telephone will miss the households without
residential phones.
2) Response Bias: refers to anything in the survey design that influences the responses.
- respondents may lie, especially if asked about illegal or unpopular behavior.
o Ex: Did you lie to your friends last week?
3) Voluntary Response Bias: occurs when individuals can choose on their own whether to
participate in the sample.
o Ex: an internet poll asking people how they feel about the healthcare system. People
can choose whether they want to participate.
4) Nonresponse Bias: occurs when a large proportion of those sampled fail to respond.
o Ex: a telephone survey is conducted to observe the eating habits of office workers;
those who are selected but happen to be away on vacation can’t respond to the
survey.
o Ex: a large number of magazine subscribers did not respond to a survey made by this
magazine.
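Undercoverage can be illustrated with a small calculation. Every number below is made up for the sketch (10,000 households, 2,000 of them without a phone, and different support rates in the two groups); none of it comes from a real survey.

```python
import random

# Hypothetical population of 10,000 households (all numbers are made up).
# Each record is (has_phone, supports_policy).
population = (
    [(True, True)] * 2400 + [(True, False)] * 5600      # 8,000 with a phone
    + [(False, True)] * 1400 + [(False, False)] * 600   # 2,000 without
)

def proportion_supporting(households):
    """Fraction of the given households that support the policy."""
    return sum(1 for _, supports in households if supports) / len(households)

true_p = proportion_supporting(population)   # parameter: 3800/10000
frame = [h for h in population if h[0]]      # only phone households can be reached
frame_p = proportion_supporting(frame)       # 2400/8000

# Even a perfectly random sample from the frame estimates frame_p, not true_p.
poll = random.Random(5).sample(frame, 1000)
poll_p = proportion_supporting(poll)

print(true_p, frame_p)  # 0.38 0.3
print(poll_p)
```

The telephone poll is random within the households it can reach, but those households differ from the ones it misses, so the estimate is centred on the wrong value no matter how large the sample is.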
Example:
Name and describe the kind of bias that might be present if a statistics teacher decides that, instead
of randomly selecting students to survey on how they feel about the course, she just asks students to
volunteer for the survey.
Voluntary response bias: the bias would probably be towards those students who say they
enjoy the course.
Example:
A chemistry professor who teaches a large lecture class surveys the students who attend his class on
how he can make the class more interesting to get more students to attend. This survey method
suffers from what?
A. nonresponse bias
B. response bias
C. undercoverage
D. none of the above
Example:
A question posted on a Canadian website asked visitors to the site whether they think that a
particular Canadian Government bill should pass.
Population – all Canadian adults
Parameter – proportion that feels the bill should pass
Sample – those visiting the web site who responded
Method – voluntary response (no randomization employed)
Bias – voluntary response bias; those who visit the website and respond may be predisposed
to a particular answer.
Example:
In order to determine the proportion of the voting population that supports a new government policy,
a local news organization carried out a survey. The results showed that only 32% of the people
answering the survey support the policy.
Consider where they obtained their evidence (data) and answer these questions:
Can these results be generalized to the entire voting population? Comment on any problems with the
ways in which the data were collected.
i. An online survey was conducted. They asked individuals to log onto their website and offer their
opinion.
No random selection; there is voluntary response bias, so generalizing to any population is not
possible.
Selection/Undercoverage bias: People without a computer (or tv) can’t respond!
ii. They randomly selected individuals at the local mall and asked for their opinion.
Even though individuals were selected at random, they were not a random sample from the
population of interest.
We could perhaps generalize the result to the population of shoppers at the local mall. We
should also take into account the time the survey was conducted.
Selection/Undercoverage Bias: You have to be at the mall to respond.
Nonresponse Bias: People may refuse to respond.
iii. They randomly selected phone numbers and called people.
So far, the best method, but still…
Selection/Undercoverage Bias: People without a phone cannot be selected.
Nonresponse Bias: People hang up!
Causal (cause-and-effect) Inference

- We should only make cause-and-effect (causal) inferences when we have random allocation.
o When there is no random allocation, the difference in responses could have been
caused by lurking variables
o Lurking variables are variables that are related to both group membership and to the
response. These are other variables that could possibly explain the result.
Example:
After many dogs and cats suffered health problems caused by contaminated foods, a researcher is
trying to find out whether a newly formulated pet food is safe. The proposed experiment will feed some
dogs the new food and some cats a food known to be safe, and a veterinarian will check the response.
Why would it be a bad design to feed the test food to some dogs and the safe food to some cats?
There are lurking variables not accounted for. We would not be able to tell whether any
differences in animals’ health were attributable to the food they had eaten or to the
differences in how the two species responded.
Example:
For children between the ages of 6 and 10, the larger their shoe size, the better they do on a
particular math test. Does this mean that larger feet cause students to do better on the test?
Probably not. Age is a plausible confounding variable that better explains test performance:
older children tend to score higher. Age also explains shoe size, since older children tend to
have larger feet.
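The shoe-size example can be mimicked with a small made-up data set in which age drives both variables. Everything here is hypothetical: the ages 6 to 10, the shoe sizes and scores built from age, and the helper `pearson_r` are assumptions for illustration only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: age drives both shoe size and test score; within an age,
# shoe size and score vary independently of each other.
children = [
    (age, age + d_shoe, 10 * age + d_score)
    for age in range(6, 11)
    for d_shoe in (-1, 0, 1)
    for d_score in (-5, 0, 5)
]

shoes = [shoe for _, shoe, _ in children]
scores = [score for _, _, score in children]
overall_r = pearson_r(shoes, scores)
print(round(overall_r, 2))  # 0.83: strong association across all children

within_r = {}
for age in range(6, 11):
    group = [(shoe, score) for a, shoe, score in children if a == age]
    within_r[age] = pearson_r([s for s, _ in group], [t for _, t in group])
print(within_r)  # 0.0 for every age: no association among same-age children
```

In this constructed data the association between shoe size and score disappears once children of the same age are compared, which is what one would expect if age, not foot size, is behind the pattern.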
Study Designs
There are two main types of study designs:
1) Observational Studies
2) Randomized Experiment
Observational Study
- the investigator observes individuals and measures variables of interest but does NOT attempt
to influence the responses.
- Example: Study the impact of the distance between a home and a power plant on cancer incidence.
o Since the researchers did not assign households to live at a certain distance (i.e., they
did not influence the distance) and simply observed the households, it was an
observational study.
o Because researchers in this example first identified the distance of the subjects’ home
to the power plant and then collected data on whether they had cancer, this was a
retrospective study.
▪ Historical data is useful when an outcome is rare
▪ Historical data, however, may contain many types of observation error.
- Had the researchers identified subjects in advance and collected data as events unfolded, the
study would have been a prospective study.
o Observe explanatory and response variables
o More costly than a retrospective study, but design could better avoid the many types
of observation error.
- Observational studies are valuable for discovering trends and possible relationships.
- However, it is not possible for observational studies to demonstrate a causal relationship.
Example: There are more cats being diagnosed with organ enlargement than in the past. A
researcher identified 50 kittens not already diagnosed with organ enlargement and followed the cats
for several years to see if any developed organ enlargement. This is a(n)
A. Randomized experiment
B. Survey
C. Prospective study
D. Retrospective study
Randomized, Comparative Experiments

- An experiment is a study design that allows us to prove a cause-and-effect relationship.
- An experiment:
a. Manipulates factor levels to create treatments.
b. Randomly assigns subjects to these treatment levels.
c. Compares the responses of the subject groups across treatment levels.
- Example: Study the impact of exercise on blood pressure.
We can enroll individuals in the study and randomly assign exercise schedules to them.
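Random assignment of subjects to treatments can be sketched as a shuffle followed by a split. The 20 made-up volunteers, the two group labels, and the function name `randomly_allocate` are assumptions for illustration; the notes do not specify the number of groups or subjects.

```python
import random

def randomly_allocate(subjects, rng):
    """Shuffle the subjects and split them into two treatment groups."""
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical volunteers; the group labels are illustrative.
volunteers = [f"subject-{i}" for i in range(20)]
exercise_group, control_group = randomly_allocate(volunteers, random.Random(42))
print(len(exercise_group), len(control_group))  # 10 10
```

Because chance alone decides who lands in which group, lurking variables should be balanced across the groups on average, which is what allows a causal comparison.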
Summary:
1. Random Selection of Individuals (Random Sampling):
- the individuals in the sample are selected randomly from the population
=> population inferences are allowed
2. Random Allocation (Random Assignment):
- the individuals are randomly assigned to different treatment groups
=> causal inferences are allowed
NOTE: We cannot make causal inferences from observational studies.
Example:
According to a recent national article, the female average wage grew faster than the male average
wage at ABC Bank. These results are based on a random sample of females and males working at ABC
Bank across Canada in 1971.
i. Can we conclude that, in general, the female average wage grew faster than males’ at any bank?
No. We do not have a random sample from ALL banks.
ii. Can we conclude that, in general, the average female wage grew faster than males’ at any branch
of this bank (across Canada)?
Yes. We have a random sample from THIS bank.
iii. Is there evidence of sexual discrimination in this bank? In other words, do these results imply that
the average female wage grew faster than males’ because they are female?
Because this was an observational study, NO causal inference should be made.
Example:
We propose 2 designs to test the effectiveness of a new medication in relieving migraines:
Design 1:
In order to test the effectiveness of a new medication, a random sample of individuals was
chosen from a particular population. The drug was administered to the individuals the instant
they began to experience pain. After a fixed period of time elapsed, they were asked to rate
the effectiveness of the drug (in terms of pain reduction) on a 10-point scale, from 1 (the drug
was ineffective) to 10 (no pain). The results showed the average rating was 6.6.
The study was repeated.
Design 2:
This time, each individual was randomly assigned to one of two treatment groups.
One group took the new drug and the other took a placebo. The results showed that the
average rating for the drug group was 8.6 and the average rating for the placebo group was
6.7.
i. What do these results imply about the whole population? Can we generalize these results to
everyone in the population?
Yes, because we have random selection, we can generalize these results to the population
of interest.
ii. Can we conclude that the drug is effective in relieving migraines? Did the drug cause a decrease in
pain?
Not in Design 1: There is no random allocation.
Yes in Design 2: The individuals were randomly assigned to treatment groups.
iii. Consider the variable that is being measured. Does a rating of 8 mean the same thing for all
individuals?
It is hard to tell, but most likely not. As a result, there may be response bias.