Introduction to Biostatistics: Associate Professor Georgi Iskrov, PhD, Department of Social Medicine
Statistics means
never having to say you are certain!
Population vs Sample
Population: parameters μ, σ, σ²
Sample: statistics x̄, s, s²
Population vs Sample
• A population includes all objects of interest, whereas a
sample is only a portion of the population.
– Parameters are associated with populations and statistics with
samples
– Parameters are usually denoted using Greek letters (μ, σ) while
statistics are usually denoted using Roman letters (X, s)
• There are several reasons why we do not work with
populations.
– They are usually large, and it is often impossible to get data for
every object we're studying
– Collecting data is rarely free of cost, and the more items
surveyed, the larger the cost
Descriptive vs Inferential statistics
• We compute statistics, and use them to estimate
parameters.
• The computation is the first part of the statistical analysis
(Descriptive Statistics) and the estimation is the second
part (Inferential Statistics).
• Descriptive Statistics
The procedure used to organize and summarize masses
of data
• Inferential Statistics
The methods used to find out something about a
population, based on a sample
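The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is simulated (10,000 hypothetical blood pressure values), and the names `population`, `sample`, `x_bar`, and `s` are chosen here, not taken from the slides.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 systolic blood pressure values (mmHg).
population = [rng.gauss(120, 15) for _ in range(10_000)]
mu = statistics.fmean(population)       # parameter (normally unknown)
sigma = statistics.pstdev(population)   # parameter (normally unknown)

# Descriptive step: summarize a simple random sample of 100 values.
sample = rng.sample(population, 100)
x_bar = statistics.fmean(sample)        # statistic
s = statistics.stdev(sample)            # statistic (n - 1 denominator)

# Inferential step: use the statistics as estimates of the parameters.
print(f"mu = {mu:.1f}, estimated by x_bar = {x_bar:.1f}")
print(f"sigma = {sigma:.1f}, estimated by s = {s:.1f}")
```

In a real study only `sample` would be available; `mu` and `sigma` are computed here solely to show how close the estimates come.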
Probability
• A measure of the likelihood that a particular event
will happen.
• It is expressed by a value between 0 (cannot happen) and
1 (sure to happen).
• First, note that we talk about the probability of an
event, but what we measure is the rate in a group.
• If we observe that 5 babies in every 1 000 have
congenital heart disease, we say that the probability of a
(single) baby being affected is 5 in 1000 or 0.005.
Probability vs Statistics
• Probability: General => Specific
– Population => Sample
– Model => Data
• Statistics: Specific => General
– Sample => Population
– Data => Model
Sampling
• Individuals in the population vary from one another with
respect to an outcome of interest.
Sampling
• When a sample is drawn there is no certainty that it will
be representative of the population.
Sampling
• Sampling is the procedure used to select the members of the
population to be included in the study.
• Because the target population is usually large, researchers study a limited
number of cases chosen to represent it and use them to reach conclusions
about the whole population.
• Random error can be conceptualized as sampling
variability.
• Bias (systematic error) is a difference between an
observed value and the true value due to all causes
other than sampling variability.
• Accuracy is a general term denoting the absence of
error of all kinds.
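The difference between random error and bias can be illustrated with a small simulation. This is a sketch under assumptions made here: the population is simulated, and the biased design (only units with values above 45 can be reached) is a hypothetical example of selection bias.

```python
import random
import statistics

rng = random.Random(1)
population = [rng.gauss(50, 10) for _ in range(20_000)]  # hypothetical outcome values
true_mean = statistics.fmean(population)

def mean_of_sample_means(draw, repeats=500):
    """Average the sample mean over many repeated samples."""
    return statistics.fmean(statistics.fmean(draw()) for _ in range(repeats))

# Unbiased design: simple random sample of 50 units.
def random_draw():
    return rng.sample(population, 50)

# Biased design (assumed for illustration): only units above 45 can be reached.
reachable = [v for v in population if v > 45]

def biased_draw():
    return rng.sample(reachable, 50)

random_error_only = mean_of_sample_means(random_draw) - true_mean
with_bias = mean_of_sample_means(biased_draw) - true_mean
print(f"average deviation, random sampling: {random_error_only:+.2f}")
print(f"average deviation, biased sampling: {with_bias:+.2f}")
```

Individual random samples still miss the true mean (sampling variability), but their deviations average out over repeated samples. The deviation of the biased design does not average out, however many samples are drawn.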
Sampling
• Stages of sampling:
– Defining target population
– Determining sample size
– Selecting a sampling method
• Properties of a good sample:
– Random selection
– Representativeness by structure
– Representativeness by number of cases
Sampling
• Non-probability sampling:
– Judgment (purposive) sampling;
– Convenience sampling;
– Snowball sampling;
– Quota (proportional) sampling.
• Probability sampling:
– Simple random sampling;
– Systematic sampling;
– Stratified sampling;
– Cluster sampling.
Advantages of probability sampling
• Provides a quantitative measure of the extent of variation
due to random effects
• Provides data of known quality
• Better control over non-sampling sources of errors
• Mathematical statistics and probability can be
applied to analyze and interpret the data
Disadvantages of non-probability sampling
• Units are selected purposively, so no level of confidence
can be attached to the results
• Selection bias likely
• Bias unknown
• No mathematical basis for inference
• Non-probability sampling should not be used when
scientific inference is the goal
• Provides false economy
Non-probability sampling
• Judgment: Sample group members are selected on the
basis of the researcher's judgment
+ Time efficiency
− Samples are not highly representative
− Unscientific approach
− Personal bias
• Convenience: Obtaining participants conveniently with
no requirements whatsoever
+ High levels of simplicity and ease
+ Usefulness in pilot studies
− Highest level of sampling error
− Selection bias
Non-probability sampling
• Snowball: Sample group members nominate additional
members to participate in the study
+ Makes it possible to recruit hidden populations
− Over-representation of a particular network
− Reluctance of sample group members to nominate
additional members
Probability sampling
• Simple random sampling: Each element has an equal
chance (probability) of being selected from a list of all
population units (sample of n from N population).
+ Highly effective if all subjects participate in data
collection
− High level of sampling error when sample size is small
• Systematic sampling: Every kth member of the population
is included in the study, starting from a randomly chosen unit
+ Time efficient
+ Cost efficient
− High sampling bias if periodicity exists
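A minimal sketch of both methods, assuming a complete sampling frame is available as a Python list. The function names and the frame of 100 patient IDs are hypothetical, and the sampling interval is written `k` to avoid confusion with the population size N.

```python
import random

def simple_random_sample(units, n, rng):
    """Draw n units without replacement; every unit has the same chance."""
    return rng.sample(units, n)

def systematic_sample(units, n, rng):
    """Take every k-th unit after a random start within the first interval."""
    k = len(units) // n           # sampling interval
    start = rng.randrange(k)      # only this choice is random
    return [units[start + i * k] for i in range(n)]

rng = random.Random(7)
patients = list(range(1, 101))    # hypothetical sampling frame of 100 patient IDs

srs = simple_random_sample(patients, 10, rng)
sys_sample = systematic_sample(patients, 10, rng)
print(sorted(srs))
print(sys_sample)
```

The systematic sample is always evenly spaced (here every 10th ID), which is why it fails when the list itself has a period that matches the interval.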
Probability sampling
• In simple random sampling we expect units to be
“equally” distributed.
• In reality, the units selected at random may be unevenly
distributed.
Simple random vs systematic sampling
• Systematic sampling has many advantages:
– Spreads the sample more evenly over the population than simple
random sampling
– Simple to implement
– May be started without a complete listing frame (say, interview of
every 9th patient coming to a clinic).
– With ordered list, the variance may be smaller than in simple
random sampling
• However:
– In systematic sampling, only the first unit is selected at random,
the rest being selected according to a predetermined pattern.
– Systematic sampling is to be applied only if the given population
is logically homogeneous.
– Simple random sampling is free of classification error and
requires minimum advance knowledge of the population
Stratified sampling
• The total population is divided into smaller groups, or
strata, from which the sample is drawn. The strata are
formed on the basis of shared characteristics.
Stratified sampling
• Proportionate allocation uses a sampling fraction in each of the
strata that is proportional to that of the total population. For
instance, if the population consists of X total individuals, m of which
are male and f female (and where m + f = X), then the relative size
of the two samples (x1 = m/X males, x2 = f/X females) should reflect
this proportion.
• Optimum allocation (or disproportionate allocation) uses a
sampling fraction in each stratum that is proportional both to the
stratum's share of the total population (as in proportionate allocation)
and to the standard deviation of the variable within the stratum.
Larger samples are taken in the strata with the greatest
variability to generate the least possible overall sampling
variance.
+ Effective representation of all subgroups
+ Precise estimates in cases of homogeneity or heterogeneity within
strata
− Knowledge of strata membership is required
− Complex to apply in practice
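The two allocation rules can be compared numerically. The sketch below assumes two hypothetical strata with made-up sizes and standard deviations; the function names are illustrative, and the optimum rule is implemented as "proportional to stratum size times stratum standard deviation", following the description above.

```python
def proportionate_allocation(n, sizes):
    """Split a total sample of n across strata in proportion to stratum size."""
    total = sum(sizes.values())
    return {name: n * size / total for name, size in sizes.items()}

def optimum_allocation(n, sizes, sds):
    """Split n in proportion to stratum size times stratum standard deviation."""
    weights = {name: sizes[name] * sds[name] for name in sizes}
    total = sum(weights.values())
    return {name: n * w / total for name, w in weights.items()}

# Hypothetical strata: population sizes and standard deviations of the outcome.
sizes = {"male": 600, "female": 400}
sds = {"male": 10.0, "female": 30.0}

print(proportionate_allocation(100, sizes))   # {'male': 60.0, 'female': 40.0}
print(optimum_allocation(100, sizes, sds))
```

With these numbers the optimum rule moves about two thirds of the sample to the smaller but more variable stratum. The results are fractional and would be rounded to whole subjects in practice.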
Cluster sampling
• Cluster sampling is a sampling plan used when mutually
homogeneous yet internally heterogeneous groupings are evident in
a population. The total population is divided into these groups and a
simple random sample of the groups is selected. The elements in
each cluster are then sampled.
Cluster sampling
• If all elements in each sampled cluster are sampled, then
this is referred to as a "one-stage" cluster sampling
plan.
• If a simple random sub-sample of elements is selected
within each of these groups, this is referred to as a "two-
stage" cluster sampling plan.
+ Time and cost efficient
− Group-level information needs to be known
− Usually higher sampling errors compared to alternative
sampling methods
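The one-stage and two-stage plans differ only in what happens inside the selected clusters, which a short sketch makes explicit. The clinics, patient labels, and the function `cluster_sample` are hypothetical and introduced only for this illustration.

```python
import random

def cluster_sample(clusters, n_clusters, rng, n_per_cluster=None):
    """One-stage if n_per_cluster is None, otherwise two-stage."""
    chosen = rng.sample(sorted(clusters), n_clusters)   # stage 1: sample clusters
    sample = []
    for name in chosen:
        members = clusters[name]
        if n_per_cluster is None:
            sample.extend(members)                      # take every element
        else:
            sample.extend(rng.sample(members, n_per_cluster))  # stage 2: sub-sample
    return chosen, sample

rng = random.Random(3)
# Hypothetical population: 6 clinics (clusters) with 20 patients each.
clinics = {f"clinic_{c}": [f"c{c}_p{p}" for p in range(20)] for c in range(6)}

_, one_stage = cluster_sample(clinics, 2, rng)
_, two_stage = cluster_sample(clinics, 2, rng, n_per_cluster=5)
print(len(one_stage), len(two_stage))   # 40 10
```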
Stratified vs cluster sampling
• The main difference between cluster sampling and
stratified sampling is that in cluster sampling the cluster
is treated as the sampling unit, so sampling is done on
a population of clusters (at least in the first stage). In
stratified sampling, the sampling is done on elements
within each stratum.
– In stratified sampling, a random sample is drawn from each of
the strata, whereas in cluster sampling only the selected clusters
are sampled.
• A common motivation of cluster sampling is to reduce
costs by increasing sampling efficiency. This contrasts
with stratified sampling where the motivation is to
increase precision.
Sample size calculation
• Law of Large Numbers: As the number of trials of a
random process increases, the percentage difference
between the expected and actual values goes to zero.
• Application in biostatistics: Bigger sample size, smaller
margin of error.
• A properly designed study will include a justification for
the number of experimental units (people/animals) being
examined.
• Sample size calculations are necessary to design
experiments that are large enough to produce useful
information and small enough to be practical.
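The Law of Large Numbers stated above can be observed with a simulated fair coin. This is a sketch only: the coin, the seed, and the checkpoints are assumptions made here for illustration.

```python
import random

rng = random.Random(0)
p = 0.5                      # expected proportion of heads for a fair coin

heads = 0
tosses = 0
for n in (10, 100, 1_000, 10_000, 100_000):
    while tosses < n:
        heads += rng.random() < p   # True counts as 1
        tosses += 1
    observed = heads / tosses
    print(f"n = {n:>7}: observed = {observed:.4f}, gap = {abs(observed - p):.4f}")
```

The gap between the observed and expected proportion tends to shrink as the number of tosses grows, which is the same reason a larger sample gives a smaller margin of error.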
Sample size calculation
• Supports the validity of clinical trials and intervention studies
• Assures that the intended study will have the desired
power for correctly detecting a (clinically meaningful)
difference in the outcome under study if such a
difference truly exists
• Two objectives:
– Measure with a precision:
• Precision analysis
– Assure that the difference is correctly detected
• Power analysis
Sample size calculation
• Generally, the sample size for any study depends on:
– Acceptable level of confidence;
– Expected effect size and absolute error of precision;
– Underlying scatter in the population;
– Power of the study.
$$n = \frac{Z^2 \cdot SD^2}{d^2}$$
• Z – standard normal value for the chosen confidence level (1.96 for 95%);
• SD – standard deviation;
• d – absolute error of precision (margin of error).
Sample size calculation
• Sources of variance information:
– Published studies (concerns: geographical, contextual, time
issues – external validity)
– Previous studies
– Pilot studies
• Sample size estimation depends on the study design
– as variance of an estimate depends on the study
design.
Sample size calculation
• For quantitative variables:
$$n = \frac{Z^2 \cdot SD^2}{d^2}$$
• A researcher is interested in knowing the average
systolic blood pressure in pediatric age group at 95%
level of confidence and precision of 5 mmHg. Standard
deviation, based on previous studies, is 25 mmHg.
$$n = \frac{1.96^2 \cdot 25^2}{5^2} = 96.04 \Rightarrow 97$$
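The same calculation can be wrapped in a small helper. This is a sketch: the function name `sample_size_mean` is chosen here, and rounding up to the next whole subject is assumed, in line with the 96.04 => 97 step above.

```python
import math

def sample_size_mean(z, sd, d):
    """n = Z^2 * SD^2 / d^2, rounded up to a whole number of subjects."""
    return math.ceil((z ** 2) * (sd ** 2) / (d ** 2))

# Blood pressure example: 95% confidence (Z = 1.96), SD = 25 mmHg, d = 5 mmHg.
n = sample_size_mean(z=1.96, sd=25, d=5)
print(n)   # 97
```

Because d is squared in the denominator, halving the margin of error to 2.5 mmHg roughly quadruples the required sample (385 subjects with the same Z and SD).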
Sample size calculation
• For qualitative variables:
$$n = \frac{Z^2 \cdot p \cdot (100 - p)}{d^2}$$
• Z – standard normal value for the chosen confidence level (1.96 for 95%)
• p – expected proportion in population (in percent)
• d – absolute error of precision (margin of error, in percent)
Sample size calculation
• For qualitative variables:
$$n = \frac{Z^2 \cdot p \cdot (100 - p)}{d^2}$$
• A researcher is interested in knowing the proportion of
diabetes patients having hypertension. According to a
previous study, the expected proportion is no more than 15%.
The researcher wants to estimate this proportion with a 5%
absolute error of precision and a 95% confidence level.
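Plugging these values into the formula for qualitative variables gives the required sample size. The sketch below assumes p and d are entered in percent, as the (100 − p) term implies, and that the result is rounded up; the function name is illustrative.

```python
import math

def sample_size_proportion(z, p, d):
    """n = Z^2 * p * (100 - p) / d^2, with p and d in percent."""
    return math.ceil((z ** 2) * p * (100 - p) / (d ** 2))

# Hypertension example: 95% confidence (Z = 1.96), p = 15%, d = 5%.
n = sample_size_proportion(z=1.96, p=15, d=5)
print(n)   # 196
```

The product p(100 − p) is largest at p = 50, so when no prior estimate of the proportion exists, using p = 50 gives the most conservative sample size (385 with the same Z and d).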
Frequency distribution      Yes   Yes   Yes   Yes
Median, percentiles         No    Yes   Yes   Yes
Mean, standard deviation    No    No    Yes   Yes
Ratio                       No    No    No    Yes
Data processing
• Some visual ways to summarize data:
– Tables
– Graphs
• Bar charts
• Histograms
• Box plots
Frequency table
• Elements
– Formal
1. Title
2. Main column
3. Main row
4. Legend
– Logical
Frequency table
Simple table
Table 1. Anti-HBs (+) outcomes per group from a HBV screening study*
Screened group          Number of Anti-HBs (+) cases    %
Children of 7 y.        3                               10%
Children of 11 y.       7                               23%
Children of 17 y.       3                               10%
Roma people             1                               3%
Contacts in family      3                               10%
Health professionals    13                              43%
Total                   30                              100%
(Title: the table caption; main row: the column headings; main column: the screened groups)
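The percentage column of such a table is simply each count divided by the total. A minimal sketch using the counts from Table 1 (the variable names are chosen here):

```python
# Counts taken from Table 1 (Anti-HBs (+) cases per screened group).
counts = {
    "Children of 7 y.": 3,
    "Children of 11 y.": 7,
    "Children of 17 y.": 3,
    "Roma people": 1,
    "Contacts in family": 3,
    "Health professionals": 13,
}
total = sum(counts.values())

# Relative frequency of each group, rounded to a whole percent.
percents = {group: round(100 * n / total) for group, n in counts.items()}

for group, n in counts.items():
    print(f"{group:<22}{n:>4}{percents[group]:>5}%")
print(f"{'Total':<22}{total:>4}{100:>5}%")
```

Because each percentage is rounded separately, the rounded values add up to 99 here, while the Total row is reported as 100%.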
Residence
Smolyan Zlatograd Rudozem Subtotal
Risk group
Total: 350