Learning objectives

• Why sampling is required
• Sampling criteria
• Various terms related to sampling
• Sample size determination
• Methods of sampling
Why sampling is required?
• Even when study of the entire population is possible, in many investigations it leads to:
– High cost
– Long time requirements
– Lack of accuracy
• Sampling is recommended to overcome these problems. Further, certain situations do not permit us to study the whole population.
Sampling criteria

• The characteristics essential for inclusion in the target population, for example:
– Between the ages of 18 and 45
– Ability to speak English
– Admitted for gall bladder surgery
– Diagnosed with diabetes within the last month
Various terms related to sampling
• Population: the totality or aggregate of all the individuals with the specified characteristics.
• Three types of population
– Finite population: one in which there is a finite number of members (e.g. the no. of students in a …)
– Infinite population: contains an infinite number of members (e.g. all possible Hb values in a given …)
– Hypothetical population: one which is assumed for theoretical purposes (e.g. no. of rates)
Sampling frame

• A listing of every member of the population, using the sampling criteria to define membership in the population.
• Subjects are selected from the sampling frame.
Sampling unit

• An individual unit of a
• Person (subject)
• House
• Street
• Event
• Behavior
• Sample: any group of individuals taken from a population. The sample is supposed to be representative (unbiased, unselected, random).
• Sample size: the number of individuals or observations in the sample.
• Sampling variation: the variation between one sample and another.
• Sampling error: the difference between the population mean and the sample mean.
Sample size determination
• Statistical methods can be used to determine sample sizes for RCTs and for single-parameter estimation (e.g. prevalence).
• Why does it matter?
– Too few subjects → realistic improvements are unlikely to be distinguished from chance variation.
– Too many subjects → an unnecessary number of subjects receive the inferior treatment.
• Efficient use of resources
Prior information

• Some prior information is necessary for a sample size calculation to be possible.
– Clinically important difference, or an expected difference between groups
– Estimate of variability (continuous response) or control group 'success' proportion (binary response)
Estimating a single population mean to a required precision

Standard error of mean (SE) = SD / sqrt(n)

95% confidence interval = mean ± 1.96 × SE

If we know a sensible value for σ (the standard deviation) and the desired confidence interval half-width (the ± margin), we can obtain n, the number of observations.

95% confidence interval half-width = 1.96 × SD / sqrt(n)

E.g. in a preliminary study it was observed that the SD of fasting sugar level is 48. With a desired confidence interval half-width of 20, how many individuals are required to estimate the population mean to within 20?

Half-width = 20 = 1.96 × 48 / sqrt(n), and solving for n gives

n = (1.96 × 48 / 20)^2 ≈ 22.1

i.e. a sample of about 22 (23 if rounded up) would enable us to estimate the population mean to within 20 (with 95% confidence).
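The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `n_for_mean` and its parameters are assumptions introduced here, and the 20 in the example is treated as the half-width (the ± margin) of the interval.

```python
import math
from statistics import NormalDist

def n_for_mean(sd, half_width, confidence=0.95):
    """Observations needed so that the CI for a mean is mean ± half_width.

    Hypothetical helper: rearranges half_width = z * sd / sqrt(n) for n.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return (z * sd / half_width) ** 2

n_exact = n_for_mean(sd=48, half_width=20)
print(round(n_exact, 1))   # 22.1
print(math.ceil(n_exact))  # 23 (rounding up guarantees the precision)
```

Rounding up rather than to the nearest integer is the cautious choice, since a sample of 22 gives an interval very slightly wider than ±20.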
Estimating a proportion to a required precision
• Useful when estimating a prevalence
• Problem: the standard error of a proportion depends on the proportion itself, the quantity we are trying to estimate!
– We need an initial estimate

95% confidence interval = p ± 1.96 × sqrt(p(1-p)/n)
Example: prevalence of a disease

• Suppose we are trying to estimate the prevalence of HIV infection, which we suspect to be about 3%, and we want the 95% confidence interval to extend 0.5% on either side.
Half-width = 0.005 = 1.96 × sqrt(0.03 × (1 - 0.03)/n)
n = 1.96^2 × 0.03 × 0.97 / 0.005^2 ≈ 4472
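The prevalence example can be checked with a short Python sketch. The helper name `n_for_proportion` is an assumption made for illustration; it simply rearranges the half-width formula above for n.

```python
import math
from statistics import NormalDist

def n_for_proportion(p, half_width, confidence=0.95):
    """Sample size so that the CI for a proportion is p ± half_width.

    Hypothetical helper: p is the initial guess at the proportion.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return z ** 2 * p * (1 - p) / half_width ** 2

# Suspected prevalence 3%, desired margin 0.5% on either side
print(math.ceil(n_for_proportion(p=0.03, half_width=0.005)))  # 4472
```

Because p(1-p) is largest at p = 0.5, using 0.5 as the initial estimate gives the most conservative sample size when no prior guess is available.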
Differences between independent groups
• Sample size calculations are usually used to help design comparative studies, in which two groups are compared for some primary outcome.
– Continuous data – difference between means
– Categorical data – difference between proportions
Formula: difference between means
• The most commonly used value for significance (α) is 0.05, giving z(1-α/2) = 1.96
• The most commonly used value for power (1-β) is 80%, giving z(1-β) = 0.84
• n = 2 × (1.96 + 0.84)^2 × σ^2 / d^2
• n = 2 × 7.849 × σ^2 / d^2 (7.849 uses the unrounded z-values)
• n = 15.7 × σ^2 / d^2
• For 90% power
• n = 21 × σ^2 / d^2
A sample size of n in each group will have 80% or 90% power to detect a difference in means of d, assuming that the common standard deviation is σ, and that the test will be performed at the 5% significance level (two-sided).
• In a trial to compare the effects of two oral contraceptives on blood pressure (over one year), it is anticipated that one drug will increase diastolic blood pressure by 3 mmHg, and the other will not change it. The SD (of the changes in blood pressure) in both groups is expected to be 10 mmHg. How many patients are required for this difference to be significant at the 5% level (with 80% power)?
• n = 2 × 7.849 × 100 / 9 ≈ 174.4, rounded up to 175 women per group
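As a rough check of the blood pressure example, the two-group formula can be written as a small Python function. The name `n_per_group_means` is an assumption for illustration; the z-values are computed rather than hard-coded, which is why the multiplier comes out as 7.849 rather than 7.84.

```python
import math
from statistics import NormalDist

def n_per_group_means(sd, diff, alpha=0.05, power=0.80):
    """Per-group n to detect a difference in means `diff` (two-sided test).

    Hypothetical helper implementing n = 2 (z_a + z_b)^2 sd^2 / diff^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return 2 * (z_alpha + z_beta) ** 2 * sd ** 2 / diff ** 2

print(math.ceil(n_per_group_means(sd=10, diff=3)))              # 175
print(math.ceil(n_per_group_means(sd=10, diff=3, power=0.90)))  # 234
```

Halving the detectable difference quadruples the required sample size, since d enters the formula squared.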
Formula for a difference in proportions
For 5% significance and 80% power, this reduces

• p1 = expected proportion in the control group
• p2 = expected proportion in the intervention group (= p1 + d)
• In a randomised clinical trial, the placebo response is anticipated to be 25%, and the active treatment response 65%. How many patients are needed if a two-sided test at the 5% level is planned, and a power of 80% is required?
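The formula for two proportions is not reproduced in the text above, so the sketch below assumes one commonly used form, n = (z(1-α/2) + z(1-β))^2 × [p1(1-p1) + p2(1-p2)] / (p1 - p2)^2 per group. Other variants (for example with a pooled variance or a continuity correction) give somewhat different answers, so treat the result as indicative only. The function name `n_per_group_proportions` is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n to detect p1 vs p2 (two-sided test).

    Assumed form (unpooled variances); not taken from the source text.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2

# Placebo response 25%, active treatment response 65%
print(math.ceil(n_per_group_proportions(0.25, 0.65)))  # 21
```

Under this assumed formula the trial would need about 21 patients per group.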
Practical issues
• Allow for drop-outs and non-consent when planning sample size, particularly when subjects are being followed up for a long period of time.
• Sample size calculations can only be as accurate as the assumptions that go into them. A pilot study may be necessary to obtain suitable estimates.
• When a sample is greater than 5% of the population from which it is being selected and the sample is chosen without replacement, the finite population correction factor should be used to get a more precise estimate.
• Accordingly, the revised sample size n1 = n / (1 + n/N), where N is the population size.
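The two adjustments above might be applied as in the following sketch. The finite population correction follows the formula just given; the drop-out inflation n / (1 - rate) is a common convention that is assumed here rather than stated in the text, and the population size of 20,000 and the 10% drop-out rate are made-up illustration values.

```python
import math

def finite_population_correction(n, population_size):
    """Revised sample size n1 = n / (1 + n/N)."""
    return n / (1 + n / population_size)

def inflate_for_dropout(n, dropout_rate):
    """Assumed convention: recruit n / (1 - rate) to end up with about n."""
    return n / (1 - dropout_rate)

# Hypothetical: 4472 needed, but the whole population is only 20,000
print(math.ceil(finite_population_correction(4472, 20_000)))  # 3655

# Hypothetical: 175 per group needed, 10% expected to drop out
print(math.ceil(inflate_for_dropout(175, 0.10)))  # 195
```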
Other methods
• Other sample sizing methods are available
– equivalence studies, where the aim is to show
that two groups do not differ by more than a
specified amount, d.
– matched case-control studies with a binary
outcome (exposed or unexposed), which
requires specification of the anticipated odds
ratio and the proportion of pairs with differing
Sampling Methods
Probability Sampling
• Simple random sampling
• Stratified random sampling
• Systematic sampling
• Cluster (area) sampling
• Multistage sampling
Non-Probability Sampling
• Deliberate (quota) sampling
• Convenience sampling
• Purposive sampling
• Snowball sampling
Simple Random Sampling
• Equal probability
• Techniques
– Lottery
– Table of random numbers
• Advantage
– Most representative group
• Disadvantage
– Difficult to identify every member of a population
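A computer-generated draw is the modern equivalent of the lottery or random number table. The sketch below is a minimal illustration with a made-up sampling frame of 100 subject labels; the seed is fixed only so that the draw can be reproduced.

```python
import random

# Hypothetical sampling frame: every member of the population is listed
frame = [f"subject_{i:03d}" for i in range(1, 101)]

rng = random.Random(42)           # fixed seed for a reproducible draw
sample = rng.sample(frame, k=10)  # each member has an equal chance, no repeats

print(sample)
```

`random.sample` draws without replacement, so no subject can appear twice.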
Stratified Random Sampling
• Technique
– Divide population into various strata
– Randomly sample within each stratum
– Sample from each stratum should be proportional
• Advantage
– Better in achieving representativeness on control
• Disadvantage
– Difficult to pick appropriate strata
– Difficult to identify every member in the population
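Proportional stratified sampling could be sketched as follows. The function `stratified_sample`, the "F"/"M" strata and the 10% sampling fraction are all illustrative assumptions; the point is only that the same fraction is drawn at random within every stratum.

```python
import random
from collections import defaultdict

def stratified_sample(frame, stratum_of, fraction, rng):
    """Draw the same fraction at random from each stratum (hypothetical helper)."""
    strata = defaultdict(list)
    for unit in frame:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for members in strata.values():
        k = round(len(members) * fraction)  # proportional allocation
        sample.extend(rng.sample(members, k))
    return sample

# Made-up frame: 60 units in stratum "F", 40 in stratum "M"
frame = [("F", i) for i in range(60)] + [("M", i) for i in range(40)]
sample = stratified_sample(frame, stratum_of=lambda u: u[0],
                           fraction=0.10, rng=random.Random(1))

print(sum(1 for s in sample if s[0] == "F"))  # 6
print(sum(1 for s in sample if s[0] == "M"))  # 4
```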
Systematic Sampling
– Given a sampling frame, decide on the sample size
• This sets the sampling ratio and k, the sampling interval

For example:
• If you have 100 in a population and want a sample of 10, the ratio is 1/10 and k = 10
• Randomize the order of cases in the sampling frame
• Use a random number table to select the first case
• Sample every kth case thereafter
• Advantage
– Quick, efficient, saves time and energy
• Disadvantage
– Not entirely bias free; not every item has an equal chance of being selected
– System for selecting subjects may introduce systematic error
– Cannot generalize beyond the population actually sampled
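The procedure for the 100-into-10 example can be sketched as below. The helper name `systematic_sample` is an assumption, the frame is made up, and the optional step of randomizing the frame order is left out for brevity.

```python
import random

def systematic_sample(frame, n, rng):
    """Pick a random start within the first k cases, then every kth case."""
    k = len(frame) // n       # sampling interval
    start = rng.randrange(k)  # random first case among positions 0..k-1
    return frame[start::k][:n]

frame = list(range(1, 101))  # hypothetical frame of 100 numbered cases
sample = systematic_sample(frame, n=10, rng=random.Random(7))

print(sample)
```

Whatever the starting point, consecutive selected cases are exactly k = 10 positions apart.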
Cluster (Area) Sampling
• Randomly select groups (cluster) – all members of groups are
• Appropriate when
– you can't obtain a list of the members of the population
– you have little knowledge of population characteristics
– the population is scattered over a large geographic area
• Advantage
– More practical, less costly
• Conclusions should be stated in terms of cluster (sample unit –
• Sample size is the number of clusters
Multistage Sampling
• Stage 1
– randomly sample clusters (schools)
• Stage 2
– randomly sample individuals from the schools
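The two stages can be sketched as follows, with made-up schools and pupils. The function `two_stage_sample` is a hypothetical helper; taking every pupil in the chosen schools instead of a random subset would turn this into plain cluster sampling.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, rng):
    """Stage 1: sample clusters. Stage 2: sample individuals within each."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in chosen}

# Hypothetical population: 5 schools with 30 pupils each
clusters = {f"school_{c}": [f"{c}{i:02d}" for i in range(30)] for c in "ABCDE"}

sample = two_stage_sample(clusters, n_clusters=2, n_per_cluster=3,
                          rng=random.Random(3))
print(sample)
```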
Deliberate (Quota) Sampling
• Similar to stratified random sampling
• Technique
– Quotas set using some characteristic of the
population thought to be relevant
– Subjects selected non-randomly to meet quotas (usually convenience sampling)
• Disadvantage
– selection bias
– Cannot set quotas for all characteristics important to
Other sampling methods
• Convenience Sampling
– Intact classes, volunteers, survey respondents (low return), a typical group, a typical person
– Disadvantage: selection bias
• Purposive Sampling
– Establish criteria necessary for being included in the study and find a sample to meet the criteria.
• Snowball sampling
The Three Universal Assumptions of Analysis of Variance
1. Independence
2. Normality
3. Homogeneity of Variance
• Overview of the concepts
• Model I (Assessing treatment effects)
• Comparison of mean values of several groups.
• Why ANOVA?
• ANOVA is an extension of the commonly used t-test for
comparing the means of two groups.
• The aim is a comparison of mean values of several groups.
• The tool is an assessment of variances.
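To make "comparing means by assessing variances" concrete, the sketch below computes the one-way ANOVA F statistic by hand: the variance between group means divided by the variance within groups. The function name and the three small groups are invented for illustration; a real analysis would also convert F to a P-value using the F distribution.

```python
from statistics import mean

def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)  # variance explained by group means
    ms_within = ss_within / (n - k)    # pooled variance inside the groups
    return ms_between / ms_within

# Made-up data: three groups of three observations
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(round(one_way_anova_f(groups), 2))  # 13.0
```

A large F means the group means differ by more than the within-group scatter would suggest.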
• Model I: t-test versus ANOVA
• Why not multiple t-tests?
• With several groups, many t-tests are necessary for pair-wise comparisons, e.g. 6 tests for 4 groups.
• Multiple comparisons inflate the type I error rate, i.e. too often one will get a "significant" result (a P-value below 5%) by chance alone.
• Thus, ANOVA is useful when dealing with several groups.
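The size of the problem can be estimated with a short calculation. The sketch assumes, as a simplification, that the pairwise tests are independent (they are not exactly, since they share data), so the figure is an approximation. The function name is hypothetical.

```python
from math import comb

def familywise_error(n_groups, alpha=0.05):
    """Number of pairwise tests and the chance of at least one false positive.

    Assumes independent tests, which is only approximately true.
    """
    m = comb(n_groups, 2)  # number of pairwise comparisons
    return m, 1 - (1 - alpha) ** m

m, fwer = familywise_error(4)
print(m)               # 6
print(round(fwer, 3))  # 0.265
```

With 4 groups, the chance of at least one spurious "significant" result is roughly 26% rather than the nominal 5%.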
Model I ANOVA – Short summary
• Plot your data
• Generally, the procedure is robust towards deviations from normality. However, it is sensitive to outliers, so investigate for outliers within groups.
• Testing for variance homogeneity may be carried out by Bartlett's test.
• Cochran's test can be used to test for variance outliers.
• Control group (C) versus treatment groups
• Often, focus is on effects in treatment groups versus the control group.
• Apply Dunnett's test based on the principle of "least significant difference" (LSD), i.e. critical t-values for differences between treatment groups and the control group are adjusted

