
Introduction to Sampling
Situo Liu
Spry, Inc.
10/25/2013
Ways to deal with Big Data
Big Analytics - use distributed database systems
(Hadoop) and parallel programming
(MapReduce)
Sampling - use a representative sample to
estimate the population
Sampling in Hadoop
Hadoop isn't the king of interactive analysis
Sampling is a good way to grab a set of data and then
play with it locally (R or Excel)
Pig has a handy SAMPLE keyword
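As a rough local analogue of per-record random sampling, the sketch below keeps each record independently with a fixed probability. It is written in Python rather than Pig; the function name `bernoulli_sample` and its `seed` parameter are illustrative assumptions and not part of any Hadoop or Pig API.

```python
import random

def bernoulli_sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`.

    The resulting sample size is random; on average it is
    fraction * len(records).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

# Hypothetical data set: 100,000 row ids, sampled at roughly 1%.
rows = list(range(100_000))
subset = bernoulli_sample(rows, 0.01, seed=42)
print(len(subset))  # close to 1000, but usually not exactly 1000
```

Because each record is decided on its own, this kind of sampling needs only one pass over the data and no knowledge of the total record count, which is why it suits large distributed data sets.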
Elements of a Sample
Sample - a subset of individuals within a statistical population, used to
estimate characteristics of the whole population.
Target Population - collection of observations we want to study
Sampled Population - all possible observation units that might
have been sampled
Sampling Frame - list of all sampling units (student roster, list of
phone numbers)
Sampling Unit - unit we actually sample (e.g. household)
Observational Unit - element to be measured (e.g. individual
people in the household)
Sampling Techniques (1)
Probability Sampling
Every unit in the population has a chance (greater than zero) of
being selected in the sample, and this probability can be
accurately determined.
Not every observational unit has to have the same probability of
selection, but every observational unit's probability is known.
Nonprobability Sampling
Some elements of the population have no chance of selection
(these are sometimes referred to as 'out of coverage'), or
the probability of selection can't be accurately determined.
Because the selection of elements is nonrandom, nonprobability
sampling does not allow the estimation of sampling errors.
Sampling Techniques (2)
Probability Sampling
Simple Random Sampling
Systematic Sampling
Stratified Sampling
Cluster or Multistage Sampling
Probability Proportional to Size Sampling
Panel sampling
Nonprobability Sampling
Accidental sampling / Convenience sampling / Haphazard
Quota sampling
Purposive sampling / Judgmental sampling
Capture-Recapture sampling (determine population size)
Line-intercept sampling
http://upload.wikimedia.org/wikipedia/en/0/09/LiTrSa42008Geelhoed.svg

Simple Random Sampling - SRS
Definition: for a size n simple random sample, every possible
subset of n units in the population has the same chance of
being in the sample
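Requirement: a unique identifier for every unit is needed for
implementation
Advantage: easy to understand and implement
Disadvantage: uses no auxiliary information, so estimates often have larger
variance than a well-designed stratified sample of the same size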
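A minimal sketch of simple random sampling in Python, assuming the population is available as a collection of unique identifiers. The name `simple_random_sample` and the example ids are hypothetical.

```python
import random

def simple_random_sample(unit_ids, n, seed=None):
    """Draw n distinct units so that every size-n subset is equally likely."""
    rng = random.Random(seed)
    # random.sample draws without replacement from the given sequence.
    return rng.sample(list(unit_ids), n)

# Hypothetical population of 1000 units identified by the integers 1..1000.
population_ids = range(1, 1001)
sample = simple_random_sample(population_ids, 50, seed=7)
print(sorted(sample)[:5])
```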

Systematic Sampling
Definition: Systematic sampling relies on arranging the study
population according to some ordering scheme and then
selecting elements at regular intervals through that ordered
list. Systematic sampling involves a random start and then
proceeds with the selection of every kth (k=population
size/sample size) element from then onwards.
Requirement: Ordering scheme for population
Advantage: easy to implement, very efficient
Disadvantage: vulnerable to periodicities
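The random start and fixed interval described above can be sketched as follows. This is an illustration under simple assumptions: the population fits in an ordered list, and k is taken as the integer part of population size / sample size. The function name is hypothetical.

```python
import random

def systematic_sample(ordered_units, n, seed=None):
    """Pick a random start in the first interval, then every k-th unit."""
    population_size = len(ordered_units)
    if not 0 < n <= population_size:
        raise ValueError("n must be between 1 and the population size")
    k = population_size // n          # sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)          # random start in [0, k)
    # If the population size is not a multiple of n, the last few units
    # can never be selected in this simplified version.
    return [ordered_units[start + i * k] for i in range(n)]

units = list(range(100))              # hypothetical ordered population
picked = systematic_sample(units, 10, seed=1)
print(picked)
```

If the ordering has a cycle whose length matches k (for example, daily data sampled every 7th record), every selected unit falls at the same point of the cycle, which is the periodicity risk noted above.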
Stratified sampling (1)
Definition: Where the population embraces a number of
distinct categories, the frame can be organized by these
categories into separate "strata." Each stratum is then
sampled as an independent sub-population, out of which
individual elements can be randomly selected.
Requirement: the population can be divided into distinct,
non-overlapping strata, chosen for their relevance to the
criterion in question
Variability within strata is minimized
Variability between strata is maximized
The variables upon which the population is stratified are
strongly correlated with the desired dependent variable.
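A minimal sketch of stratified sampling, assuming the same sampling fraction is applied in every stratum (proportional allocation) and that each stratum is sampled by simple random sampling. The function name, the `stratum_of` callback, and the example regions are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, fraction, seed=None):
    """Draw an independent simple random sample from each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for u in units:
        strata[stratum_of(u)].append(u)
    sample = {}
    for name, members in strata.items():
        # Same fraction in every stratum; keep at least one unit per stratum.
        n_h = max(1, round(fraction * len(members)))
        sample[name] = rng.sample(members, n_h)
    return sample

# Hypothetical population: (id, region) pairs with unequal region sizes.
population = ([(i, "north") for i in range(600)]
              + [(i, "south") for i in range(600, 1000)])
picked = stratified_sample(population, lambda u: u[1], 0.1, seed=3)
print({name: len(rows) for name, rows in picked.items()})
```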


Stratified sampling (2)
Advantage:
Inferences can be made about specific subgroups
Very likely to give more efficient statistical estimates
Will not be less efficient than SRS, provided that each
stratum's sample size is proportional to the stratum's size in the population.
Data may be more readily available for individual pre-existing strata within
a population than for the overall population
Because strata are sampled independently, different sampling approaches can be
used for different subgroups
Disadvantage:
Complexity in implementation and estimation
Stratifying on multiple criteria can be tricky
Requires a specified minimum sample size per group

Cluster Sampling (1)
Definition: the entire population is divided into groups,
or clusters, and a random sample of these clusters is
selected. All observations in the selected clusters are included
in the sample.
Requirement: does not require a complete list of every unit in
the population, only a sampling frame at the cluster level
Variability within clusters is maximized
Variability between clusters is minimized
The variables upon which the population is divided into
clusters are not strongly correlated with the desired
dependent variable.
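One-stage cluster sampling, as defined above, can be sketched like this. The sketch assumes the frame is a mapping from cluster name to the units in that cluster; the function name and the school data are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick clusters at random, then keep every unit in the chosen clusters."""
    rng = random.Random(seed)
    # Only the list of cluster names is needed to draw the sample.
    chosen = rng.sample(sorted(clusters), n_clusters)
    units = [u for name in chosen for u in clusters[name]]
    return chosen, units

# Hypothetical cluster-level frame: 20 schools with 30 students each.
schools = {f"school_{i}": [f"s{i}_{j}" for j in range(30)] for i in range(20)}
chosen, students = cluster_sample(schools, 4, seed=11)
print(chosen, len(students))
```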
Cluster Sampling (2)
Advantages:
Easy to implement
Cost-effective
Disadvantages:
Complexity in estimation
May not reflect the diversity across clusters
Provides less information per observation than SRS
Units in the same cluster carry redundant information
Standard errors may be higher than with other sampling designs

Probability Proportional to Size Sampling - PPS
Definition: the selection probability for each element is
set to be proportional to its size measure.
The techniques above are usually presented as equal probability of selection (EPS) designs
Requirement: an auxiliary variable / size measure that is correlated with
the variable of interest
Advantage:
May improve accuracy for a given sample size by concentrating
sample on large elements that have the greatest impact on
estimation
Used in business and auditing as monetary unit sampling (MUS)
Disadvantage:
Complexity for implementation and estimation
Different portions of the population may be over- or under-represented
due to the variation in selection probabilities
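A simplified sketch of PPS selection, assuming draws are made with replacement so that each draw picks a unit with probability equal to its size divided by the total size. PPS without replacement needs a more careful scheme and is not shown. The function name and the invoice amounts are hypothetical.

```python
import random

def pps_sample_with_replacement(units, sizes, n, seed=None):
    """Draw n units with replacement, each draw proportional to size."""
    if len(units) != len(sizes):
        raise ValueError("units and sizes must have the same length")
    if any(s <= 0 for s in sizes):
        raise ValueError("size measures must be positive")
    rng = random.Random(seed)
    # random.choices draws with replacement using the given weights.
    return rng.choices(units, weights=sizes, k=n)

# Hypothetical audit: four invoices, one of them much larger than the rest.
invoices = ["A", "B", "C", "D"]
amounts = [10, 20, 30, 940]
draws = pps_sample_with_replacement(invoices, amounts, 1000, seed=5)
print(draws.count("D") / len(draws))  # near 0.94, the share of invoice D
```

Because selection probabilities differ, any estimate built from such a sample has to weight each drawn unit by the inverse of its selection probability.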

Representativeness of the sample
Match between target population and
sampled population
Method of drawing sample

Two kinds of Errors
Non-sampling error - can be reduced by careful design of the survey
Selection bias - part of target population is not in sampled population
(target population may not have a natural frame, the mode of data
collection may restrict frame)
Coverage Error - the extent to which the Sampling Frame does not cover
the Target population
Measurement bias - measuring instrument has tendency to differ
from true value in one direction
Measurement error (Errors of Observation)
Deviations of measurement
Inaccurate measurement
Item nonresponse (didn't understand, didn't see, or refused the question)
Unit nonresponse (not home, not approached by interviewer, refused the call)
Sampling error - results from taking a sample instead of whole
population, can be quantified by statistics, reduced by increasing
sample size

Sample Size Calculation
In order to know what our sample size needs to be, we must
decide in advance the maximum estimation error we are
willing to tolerate.
Determine the nature of the estimate: proportion or mean
Choose the confidence level of the estimate (equivalently, the significance level)

Proportion (1)
Proportion: $\hat{p} = X/n$
where X is the number of 'positive' observations and n is the sample size
When the observations are independent, X has a
binomial distribution with variance $np(1-p)$, so $\hat{p}$ has variance $p(1-p)/n$
The maximum variance of $\hat{p}$ is $0.25/n$, when
p=0.5
For sufficiently large n, the distribution of $\hat{p}$ will be closely
approximated by a normal distribution. Around 95% of this
distribution's probability lies within 2 standard deviations of
the mean, so the interval
$\left(\hat{p} - 2\sqrt{0.25/n},\ \hat{p} + 2\sqrt{0.25/n}\right)$
will form a 95% confidence interval for the true proportion.
Proportion (2)
If this interval needs to be no more than W units wide, the
equation
$4\sqrt{0.25/n} = W$
can be solved for n, yielding $n = 4/W^2 = 1/B^2$, where $B = W/2$ is the
error bound on the estimate,
i.e., the estimate is usually given as within ± B. So,
for B = 10% one requires n = 100,
for B = 5% one needs n = 400,
for B = 3% the requirement is n ≈ 1111 (roughly 1000),
while for B = 1% a sample size of n = 10000 is required.
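The rule n = 1/B^2 is easy to check numerically. The helper below is a sketch with a hypothetical name; it rounds up to the next whole unit and subtracts a tiny tolerance before rounding, an added guard against floating-point round-off.

```python
import math

def sample_size_for_proportion(error_bound):
    """Smallest whole n satisfying n >= 1 / B^2 for error bound B."""
    if not 0 < error_bound < 1:
        raise ValueError("error_bound must be between 0 and 1")
    # The small tolerance keeps exact cases such as B = 0.1 from
    # being pushed up by floating-point noise.
    return math.ceil(1 / error_bound ** 2 - 1e-9)

for b in (0.10, 0.05, 0.03, 0.01):
    print(b, sample_size_for_proportion(b))
```

Halving the error bound quadruples the required sample size, which is the main practical consequence of the formula.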
Mean (1)
A proportion is a special case of a mean. When estimating the
population mean using an independent and identically
distributed (iid) sample of size n, where each data value has
variance $\sigma^2$, the standard error of the sample mean is:
$\sigma/\sqrt{n}$
This expression describes quantitatively how the estimate
becomes more precise as the sample size increases. Using the
central limit theorem to justify approximating the sample
mean with a normal distribution yields an approximate 95%
confidence interval of the form
$\left(\bar{x} - 2\sigma/\sqrt{n},\ \bar{x} + 2\sigma/\sqrt{n}\right)$
Mean (2)
If we wish to have a confidence interval that is W units in
width, we would solve
$4\sigma/\sqrt{n} = W$
for n, yielding the sample size $n = 16\sigma^2/W^2$.
For example, if we are interested in estimating the amount by which a
drug lowers a subject's blood pressure with a confidence
interval that is 6 units wide, and we know that the standard
deviation of blood pressure in the population is 15, then the
required sample size is $16 \times 15^2 / 6^2 = 100$.
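The blood pressure example can be reproduced with a short helper. This is a sketch with a hypothetical function name, assuming the population standard deviation is known and an approximate 95% interval of total width W is wanted.

```python
import math

def sample_size_for_mean(sigma, width):
    """n = 16 * sigma^2 / W^2, rounded up to a whole number of units."""
    if sigma <= 0 or width <= 0:
        raise ValueError("sigma and width must be positive")
    # Tiny tolerance guards against floating-point round-off before ceil.
    return math.ceil(16 * sigma ** 2 / width ** 2 - 1e-9)

print(sample_size_for_mean(sigma=15, width=6))  # 16 * 225 / 36 = 100
```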

Stratified Sample Size (1)
The sample can often be split up into sub-samples. Typically, if
there are k such sub-samples (from k different strata) then
each of them will have a sample size ni, i = 1, 2, ..., k. These ni
must conform to the rule that n1 + n2 + ... + nk = n (i.e. that
the total sample size is given by the sum of the sub-sample
sizes). Selecting these ni optimally can be done in various
ways, using (for example) Neyman's optimal allocation.
There are many reasons to use stratified sampling:[7] to
decrease variances of sample estimates, to use partly non-
random methods, or to study strata individually. A useful,
partly non-random method would be to sample individuals
where easily accessible, but, where not, sample clusters to
save travel costs.
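One common way to choose the stratum sample sizes is Neyman's allocation, which sets each stratum's sample size proportional to the stratum's population size times its standard deviation. The sketch below assumes those standard deviations are known or estimated in advance. The function name, the rounding rule, and the example numbers are illustrative assumptions.

```python
def neyman_allocation(total_n, stratum_sizes, stratum_sds):
    """Split total_n across strata in proportion to N_h * sigma_h."""
    products = [n_h * s_h for n_h, s_h in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    raw = [total_n * p / total for p in products]
    # Largest-remainder rounding so the integer sizes still sum to total_n.
    alloc = [int(r) for r in raw]
    leftover = total_n - sum(alloc)
    by_remainder = sorted(range(len(raw)),
                          key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    # Note: this sketch does not cap an allocation at the stratum size.
    return alloc

sizes = [5000, 3000, 2000]   # hypothetical stratum population sizes
sds = [10.0, 20.0, 40.0]     # hypothetical stratum standard deviations
print(neyman_allocation(200, sizes, sds))
```

In this example the smallest stratum receives the largest share of the sample because it is the most variable.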
Stratified Sample Size (2)
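In general, for H strata, a weighted sample mean is
$\bar{x}_w = \sum_{h=1}^{H} W_h \bar{x}_h$
where $\bar{x}_h$ is the sample mean in stratum h and $W_h$ is the stratum weight (often the stratum's share of the population, $N_h/N$).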
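A weighted sample mean over strata can be computed as in the sketch below. It assumes each stratum's weight is its share of the population, which is a common choice but not the only one. The function name and the numbers are hypothetical.

```python
def stratified_mean(stratum_sizes, stratum_sample_means):
    """Sum over strata of (N_h / N) * sample mean of stratum h."""
    if len(stratum_sizes) != len(stratum_sample_means):
        raise ValueError("need one sample mean per stratum")
    population_size = sum(stratum_sizes)
    return sum(n_h / population_size * mean_h
               for n_h, mean_h in zip(stratum_sizes, stratum_sample_means))

# Hypothetical strata: 600 and 400 units with sample means 50 and 70.
print(stratified_mean([600, 400], [50.0, 70.0]))  # approximately 58
```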



Thank You
sliu@spryinc.com
www.spryinc.com
