
Introduction to biostatistics

Associate Professor Georgi Iskrov, PhD


Department of Social Medicine
Outline
• Population vs sample
• Descriptive vs inferential statistics
• Sampling methods
• Sample size calculation
• Level of measurement
• Graphical summaries
Definition of biostatistics
The science of collecting, organizing, analyzing, interpreting and
presenting data for the purpose of more effective decisions in a
clinical context.

“Turning data into knowledge”


(Patrick Heagerty)
Why do we need to use statistical methods?
• Why do we need to use statistical methods?
– To draw the strongest possible conclusions from limited amounts of
data;
– To generalize from a particular set of data to a more general
conclusion.
• What do we need to pay attention to?
– Bias
– Probability

Statistics means
never having to say you are certain!
Population vs Sample

Population: parameters μ, σ, σ²
Sample: statistics x̄, s, s²
• Population includes all objects of interest whereas
sample is only a portion of the population.
– Parameters are associated with populations and statistics with
samples
– Parameters are usually denoted using Greek letters (μ, σ) while
statistics are usually denoted using Roman letters (x̄, s)
• There are several reasons why we do not work with
populations.
– They are usually large, and it is often impossible to get data for
every object we're studying
– Sampling does not usually occur without cost, and the more items
surveyed, the larger the cost
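The distinction between parameters and statistics can be illustrated with a minimal sketch. The population below is simulated and purely hypothetical (systolic blood pressure values with an assumed mean of 120 mmHg and SD of 15 mmHg); only Python's standard `random` and `statistics` modules are used.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: systolic blood pressure (mmHg) of 10,000 people
population = [random.gauss(120, 15) for _ in range(10_000)]

# Parameters describe the whole population (Greek letters)
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)  # population SD: divides by N

# Statistics describe a sample (Roman letters)
sample = random.sample(population, 50)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)           # sample SD: divides by n - 1

print(f"Population: mu = {mu:.1f}, sigma = {sigma:.1f}")
print(f"Sample:     x_bar = {x_bar:.1f}, s = {s:.1f}")
```

The sample values are close to, but not equal to, the population values; in a real study only the sample values would be observed.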
Descriptive vs Inferential statistics

Population (parameters) => Sampling => Sample (statistics)
Sample (statistics) => Inferential statistics => Population (parameters)
• We compute statistics, and use them to estimate
parameters.
• The computation is the first part of the statistical analysis
(Descriptive Statistics) and the estimation is the second
part (Inferential Statistics).
• Descriptive Statistics
The procedure used to organize and summarize masses
of data
• Inferential Statistics
The methods used to find out something about a
population, based on a sample
Probability
• A measure of the likelihood that a particular event
will happen.
• It is expressed by a value between 0 and 1.

0.0 = cannot happen; 1.0 = sure to happen
• First, note that we talk about the probability of an
event, but what we measure is the rate in a group.
• If we observe that 5 babies in every 1 000 have
congenital heart disease, we say that the probability of a
(single) baby being affected is 5 in 1000 or 0.005.
Probability vs Statistics

Probability
General => Specific
Population => Sample
Model => Data

Statistics
Specific => General
Sample => Population
Data => Model
Sampling
• Individuals in the population vary from one another with
respect to an outcome of interest.
Sampling
• When a sample is drawn there is no certainty that it will
be representative for the population.

[Figure: Sample A and Sample B drawn from the same population]
Sampling
• Sampling – a specific principle used to select members
of the population to be included in the study.
• Because the target population is large, researchers have no choice but to
study a number of cases within the population in order to represent the
population and to reach conclusions about it.
• Random error can be conceptualized as sampling
variability.
• Bias (systematic error) is a difference between an
observed value and the true value due to all causes
other than sampling variability.
• Accuracy is a general term denoting the absence of
error of all kinds.
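The difference between random error and bias can be sketched with a small simulation. Everything below is an illustrative assumption: the population is simulated, and the "biased" selection rule (only values above 110 can be sampled) is a hypothetical example of systematic error.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 measurements
population = [random.gauss(120, 15) for _ in range(10_000)]
true_mean = statistics.fmean(population)

# Random error: means of repeated random samples scatter around the true mean
random_means = [
    statistics.fmean(random.sample(population, 50)) for _ in range(1000)
]

# Bias: a hypothetical selection rule that excludes all values of 110 or less
eligible = [x for x in population if x > 110]
biased_means = [
    statistics.fmean(random.sample(eligible, 50)) for _ in range(1000)
]

print(f"True mean:                 {true_mean:.1f}")
print(f"Average of random samples: {statistics.fmean(random_means):.1f}")
print(f"Average of biased samples: {statistics.fmean(biased_means):.1f}")
```

Repeating the random sampling averages the random error away, while the biased samples stay shifted no matter how many times the sampling is repeated.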
[Figures: Sample A and Sample B positioned relative to the population]
Sampling
• Stages of sampling:
– Defining target population
– Determining sample size
– Selecting a sampling method
• Properties of a good sample:
– Random selection
– Representativeness by structure
– Representativeness by number of cases
Sampling
• Non-probability sampling:
– Judgment (purposive) sampling;
– Convenience sampling;
– Snowball sampling;
– Quota (proportional) sampling.
• Probability sampling:
– Simple random sampling;
– Systematic sampling;
– Stratified sampling;
– Cluster sampling.
Advantages of probability sampling
• Provides a quantitative measure of the extent of variation
due to random effects
• Provides data of known quality
• Better control over non-sampling sources of errors
• Mathematical statistics and probability can be
applied to analyze and interpret the data
Disadvantages of non-probability sampling
• Units are selected purposively, so no level of confidence can be
attached to the results
• Selection bias likely
• Bias unknown
• No mathematical basis for inference
• Should not be used when scientific inference is the goal
• Provides false economy
Non-probability sampling
• Judgment: Sample group members are selected on the
basis of judgment of researcher
+ Time efficiency
− Samples are not highly representative
− Unscientific approach
− Personal bias
• Convenience: Obtaining participants conveniently with
no requirements whatsoever
+ High levels of simplicity and ease
+ Usefulness in pilot studies
− Highest level of sampling error
− Selection bias
Non-probability sampling
• Snowball: Sample group members nominate additional
members to participate in the study
+ Possibility to recruit hidden population
− Over-representation of a particular network
− Reluctance of sample group members to nominate
additional members
Probability sampling
• Simple random sampling: Each element has an equal
chance (probability) of being selected from a list of all
population units (sample of n from N population).
+ Highly effective if all subjects participate in data
collection
− High level of sampling error when sample size is small
• Systematic sampling: Every k-th member of the population
is included in the study (k = N/n is the sampling interval)
+ Time efficient
+ Cost efficient
− High sampling bias if periodicity exists
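The two methods above can be sketched as follows. The list of patient IDs, the sample size, and the variable names are hypothetical; the sampling interval is assumed to be k = N // n, with only the starting point chosen at random.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

N = 100                            # population size
n = 10                             # desired sample size
population = list(range(1, N + 1)) # hypothetical patient IDs 1..100

# Simple random sampling: every unit has the same chance of selection
simple_random = random.sample(population, n)

# Systematic sampling: random start, then every k-th unit
k = N // n                         # sampling interval
start = random.randrange(k)        # random position within the first interval
systematic = population[start::k]

print("Simple random:", sorted(simple_random))
print("Systematic:   ", systematic)
```

If the list has a periodic pattern whose period matches k, the systematic sample picks up the same phase of the pattern every time, which is the periodicity problem noted above.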
Probability sampling
• In simple random sampling we expect units to be
“equally” distributed.
Probability sampling
• In reality, the random selection may be unevenly spread. [Figure not reproduced]
Simple random vs systematic sampling
• Systematic sampling has many advantages:
– Spreads the sample more evenly across the population than simple
random sampling
– Simple to implement
– May be started without a complete listing frame (say, interview of
every 9th patient coming to a clinic).
– With ordered list, the variance may be smaller than in simple
random sampling
• However:
– In systematic sampling, only the first unit is selected at random,
the rest being selected according to a predetermined pattern.
– Systematic sampling is to be applied only if the given population
is logically homogeneous.
– Simple random sampling is free of classification error and
requires minimum advance knowledge of the population
Stratified sampling
• The total population is divided into smaller groups or
strata to complete the sampling process. The strata are
formed based on some common characteristics.
Stratified sampling
• Proportionate allocation uses the same sampling fraction in each of
the strata, so each stratum's share of the sample equals its share of
the total population. For instance, if the population consists of X
total individuals, m of which are male and f female (where m + f = X),
then the relative sizes of the two samples (x1 = m/X males, x2 = f/X
females) should reflect this proportion.
• Optimum allocation (or disproportionate allocation) gives each
stratum a sample size that is proportional both to the stratum's share
of the total population (as in proportionate allocation) and to the
standard deviation of the variable within the stratum.
Larger samples are taken in the strata with the greatest
variability to generate the least possible overall sampling
variance.
+ Effective representation of all subgroups
+ Precise estimates, especially when units are homogeneous within
strata and heterogeneous between strata
− Knowledge of strata membership is required
− Complex to apply in practice
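The two allocation rules can be sketched as below. The function names, stratum sizes, and standard deviations are hypothetical. The optimum rule is implemented here as sample size proportional to (stratum size × stratum SD), which is often called Neyman allocation; this is an assumption about what the slide intends. Simple rounding is used, so in general the allocated sizes may not add up exactly to n.

```python
def proportionate_allocation(strata_sizes, n):
    """Allocate n in proportion to each stratum's share of the population."""
    total = sum(strata_sizes.values())
    return {name: round(n * size / total) for name, size in strata_sizes.items()}


def optimum_allocation(strata_sizes, strata_sds, n):
    """Allocate n in proportion to (stratum size x stratum SD)."""
    weights = {name: strata_sizes[name] * strata_sds[name] for name in strata_sizes}
    total = sum(weights.values())
    return {name: round(n * w / total) for name, w in weights.items()}


# Hypothetical strata: 400 males and 600 females, with different variability
sizes = {"male": 400, "female": 600}
sds = {"male": 10.0, "female": 20.0}

print(proportionate_allocation(sizes, 100))
print(optimum_allocation(sizes, sds, 100))
```

With these numbers the more variable stratum receives a larger share under optimum allocation than under proportionate allocation.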
Cluster sampling
• Cluster sampling is a sampling plan used when mutually
homogeneous yet internally heterogeneous groupings are evident in
a population. The total population is divided into these groups and a
simple random sample of the groups is selected. The elements in
each selected cluster are then sampled.
Cluster sampling
• If all elements in each sampled cluster are sampled, then
this is referred to as a "one-stage" cluster sampling
plan.
• If a simple random sub-sample of elements is selected
within each of these groups, this is referred to as a "two-
stage" cluster sampling plan.
+ Time and cost efficient
− Group-level information needs to be known
− Usually higher sampling errors compared to alternative
sampling methods
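One-stage and two-stage cluster sampling can be sketched as follows. The clinics, patient labels, and the numbers of clusters and patients are all hypothetical.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical population: 20 clinics (clusters) with 30 patients each
clusters = {
    f"clinic_{c}": [f"clinic_{c}_patient_{i}" for i in range(30)]
    for c in range(20)
}

# Stage 1: simple random sample of clusters (the cluster is the sampling unit)
chosen = random.sample(sorted(clusters), 4)

# One-stage plan: every element of each selected cluster is included
one_stage = [patient for c in chosen for patient in clusters[c]]

# Two-stage plan: a simple random sub-sample within each selected cluster
two_stage = [
    patient for c in chosen for patient in random.sample(clusters[c], 10)
]

print(chosen)
print(len(one_stage), len(two_stage))
```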
Stratified vs cluster sampling
• The main difference between cluster sampling and
stratified sampling is that in cluster sampling the cluster
is treated as the sampling unit, so sampling is done on
a population of clusters (at least in the first stage). In
stratified sampling, the sampling is done on elements
within each stratum.
– In stratified sampling, a random sample is drawn from each of
the strata, whereas in cluster sampling only the selected clusters
are sampled.
• A common motivation of cluster sampling is to reduce
costs by increasing sampling efficiency. This contrasts
with stratified sampling where the motivation is to
increase precision.
Sample size calculation
• Law of Large Numbers: As the number of trials of a
random process increases, the difference between the
observed average and the expected value tends to zero.
• Application in biostatistics: Bigger sample size, smaller
margin of error.
• A properly designed study will include a justification for
the number of experimental units (people/animals) being
examined.
• Sample size calculations are necessary to design
experiments that are large enough to produce useful
information and small enough to be practical.
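The Law of Large Numbers can be illustrated with a short simulation. The event probability (0.5) and the helper name `observed_proportion` are assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

P_EVENT = 0.5   # assumed true probability of the event


def observed_proportion(n_trials):
    """Simulate n_trials independent trials and return the observed proportion."""
    hits = sum(random.random() < P_EVENT for _ in range(n_trials))
    return hits / n_trials


for n_trials in (10, 100, 1_000, 10_000, 100_000):
    observed = observed_proportion(n_trials)
    print(f"n = {n_trials:>6}: observed = {observed:.4f}, "
          f"difference = {abs(observed - P_EVENT):.4f}")
```

The difference does not shrink at every single step, but it tends to get smaller as the number of trials grows, which is why a bigger sample gives a smaller margin of error.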
Sample size calculation
• Supports the validity of clinical trials and intervention studies
• Assures that the intended study will have the desired
power to correctly detect a (clinically meaningful)
difference in the outcome under study, if such a
difference truly exists
• Two objectives:
– Measure with a precision:
• Precision analysis
– Assure that the difference is correctly detected
• Power analysis
Sample size calculation
• Generally, the sample size for any study depends on:
– Acceptable level of confidence;
– Expected effect size and absolute error of precision;
– Underlying scatter in the population;
– Power of the study.

• A larger sample size is needed for: high power, a small effect,
lots of scatter.
• A smaller sample size is enough for: low power, a large effect,
little scatter.
Sample size calculation
• For quantitative variables:

$n = \frac{Z^2 \cdot SD^2}{d^2}$

• Z – standard normal value for the chosen confidence level (e.g., 1.96 for 95%);
• SD – standard deviation;
• d – absolute error of precision (margin of error).
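The formula for quantitative variables can be sketched as a small helper. The function name `sample_size_mean` is hypothetical; the inputs are those of the systolic blood pressure example in this deck (Z = 1.96, SD = 25 mmHg, d = 5 mmHg), and the result is rounded up because a fraction of a subject cannot be recruited.

```python
import math


def sample_size_mean(z, sd, d):
    """Sample size for estimating a mean: n = Z^2 * SD^2 / d^2, rounded up."""
    return math.ceil((z ** 2) * (sd ** 2) / (d ** 2))


# 95% confidence (Z = 1.96), SD = 25 mmHg, precision d = 5 mmHg
n = sample_size_mean(z=1.96, sd=25, d=5)
print(n)
```

Halving the margin of error d multiplies the required sample size by about four, since d enters the formula squared.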
Sample size calculation
• Sources of variance information:
– Published studies (concerns: geographical, contextual, time
issues – external validity)
– Previous studies
– Pilot studies
• Sample size estimation depends on the study design
– as variance of an estimate depends on the study
design.
Sample size calculation
• For quantitative variables:

$n = \frac{Z^2 \cdot SD^2}{d^2}$

• A researcher is interested in knowing the average
systolic blood pressure in the pediatric age group at a 95%
level of confidence and a precision of 5 mmHg. The standard
deviation, based on previous studies, is 25 mmHg.

$n = \frac{1.96^2 \cdot 25^2}{5^2} = 96.04$, rounded up to 97
Sample size calculation
• For qualitative variables:

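$n = \frac{Z^2 \cdot p \cdot (100 - p)}{d^2}$

• Z – standard normal value for the chosen confidence level (e.g., 1.96 for 95%)
• p – expected proportion in the population, in percent
• d – absolute error of precision (margin of error), in percent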
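The formula for qualitative variables can be sketched in the same way. The function name `sample_size_proportion` is hypothetical; p and d are expressed in percent, as in the formula, and the inputs are those of the hypertension example in this deck (p = 15%, d = 5%, Z = 1.96).

```python
import math


def sample_size_proportion(z, p, d):
    """Sample size for estimating a proportion (p and d in percent), rounded up."""
    return math.ceil((z ** 2) * p * (100 - p) / (d ** 2))


# 95% confidence (Z = 1.96), expected proportion 15%, precision 5%
n = sample_size_proportion(z=1.96, p=15, d=5)
print(n)

# With no prior estimate, p = 50 gives the largest (most cautious) sample size
print(sample_size_proportion(z=1.96, p=50, d=5))
```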
Sample size calculation
• For qualitative variables:

$n = \frac{Z^2 \cdot p \cdot (100 - p)}{d^2}$

• A researcher is interested in knowing the proportion of
diabetes patients having hypertension. According to a
previous study, the proportion is no more than 15%.
The researcher wants to calculate the sample size with a 5%
absolute precision error and a 95% confidence level.

$n = \frac{1.96^2 \cdot 15 \cdot (100 - 15)}{5^2} = 195.92$, rounded up to 196
When do you need biostatistics?

BEFORE you start your study!


After that, it will be too late…
Planning
Research programme:
1. Aim
2. Object
3. Units of observation
4. Indices of observation
5. Place
6. Time
7. Statistical analyses
8. Methodology
One vs Many
• Many measurements on one subject are not the same
thing as one measurement on many subjects.
– With many measurements on one subject, you get
to know the one subject quite well but you learn
nothing about how the response varies across
subjects.
– With one measurement on many subjects, you
learn less about each individual, but you get a good
sense of how the response varies across subjects.
Paired vs Unpaired
• Data are paired when two or more measurements are
made on the same observational unit (subjects, couples,
and so on).
• Data are unpaired when only one type of
measurement is made on each unit.
Data processing
• Data check and correction
• Data coding
• Data aggregation
– According to the data usage:
• Primary
• Secondary
– According to the number of indices:
• Simple
• Complex

• It is always a good idea to summarize your data (at
least for important variables).
• You become familiar with the data and the characteristics of the
sample that you are studying.
• You can also identify problems with data collection or errors in
the data.
Variables vs Data
• A variable is something whose value can vary.
• Data are the values you get when you measure a
variable.

             Mr. Smith   Mrs. Johns   Mrs. Oliver
Age          36          43           56
Sex          Male        Female       Female
Blood type   0           A            A
Quantitative (metric) variables
• Continuous
– Measured units
– Metric continuous variables can be properly measured and have
units of measurement.
– Continuous values on proper numeric line or scale
– Data are real numbers (located on the number line).
• Discrete
– Integer values on proper numeric line or scale
– Metric discrete variables can be properly counted and have
units of measurement – ‘numbers of things’.
– Counted units
– Data are real numbers (located on the number line).
Qualitative (categorical) variables
• Nominal
– Values in arbitrary categories
– Ordering of the categories is completely arbitrary. In other words,
categories cannot be ordered in any meaningful way.
– No units!
– Data do not have any units of measurement.
• Ordinal
– Values in ordered categories
– Ordering of the categories is not arbitrary. It is now possible to
order the categories in a meaningful way.
– No units!
– Data do not have any units of measurement.
Levels of measurement
• There are four levels of measurement: Nominal, Ordinal,
Interval, and Ratio. These go from lowest level to highest
level.
• Data is classified according to the highest level which it
fits. Each additional level adds something the previous
level didn't have.
– Nominal is the lowest level. Only names are meaningful here.
– Ordinal adds an order to the names.
– Interval adds meaningful differences.
– Ratio adds a zero so that ratios are meaningful.
Levels of measurement
• Nominal scale – e.g., genotype
You can code it with numbers, but the order is arbitrary
and any calculations would be meaningless.
• Ordinal scale – e.g., pain score from 1 to 10
The order matters but not the difference between values.
• Interval scale – e.g., temperature in °C
The difference between two values is meaningful.
• Ratio scale – e.g., height
It has a clear definition of 0. When the variable equals 0,
there is none of that variable. When working with ratio
variables, but not interval variables, you can look at the
ratio of two measurements.
Variables
• Different types of data require different kinds of analysis.

                           Nominal   Ordinal   Interval   Ratio
Frequency distribution     Yes       Yes       Yes        Yes
Median, percentiles        No        Yes       Yes        Yes
Mean, standard deviation   No        No        Yes        Yes
Ratio                      No        No        No         Yes
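The table above can be encoded as a simple lookup, for example to check whether a planned summary measure suits a variable. The names `LEVELS`, `MINIMUM_LEVEL`, and `is_meaningful` are hypothetical; the content mirrors the table row by row.

```python
# Levels of measurement, from lowest to highest
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Lowest level at which each summary becomes meaningful (from the table above)
MINIMUM_LEVEL = {
    "frequency distribution": "nominal",
    "median, percentiles": "ordinal",
    "mean, standard deviation": "interval",
    "ratio": "ratio",
}


def is_meaningful(summary, level):
    """Return True if `summary` is meaningful for data measured at `level`."""
    return LEVELS.index(level) >= LEVELS.index(MINIMUM_LEVEL[summary])


print(is_meaningful("median, percentiles", "nominal"))      # blood type
print(is_meaningful("mean, standard deviation", "ratio"))   # height
```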
Data processing
• Some visual ways to summarize data:
– Tables
– Graphs
• Bar charts
• Histograms
• Box plots
Frequency table
• Elements
– Formal
1. Title
2. Main column
3. Main row
4. Legend
– Logical
Frequency table
Simple table
Table 1. Anti-HBs (+) outcomes per group from an HBV screening study*

Screened group         Number of Anti-HBs (+) cases   %
Children of 7 y.       3                              10%
Children of 11 y.      7                              23%
Children of 17 y.      3                              10%
Roma people            1                              3%
Contacts in family     3                              10%
Health professionals   13                             43%
Total                  30                             100%

* Part of TPTBHB Project

(On the slide, the caption is labelled as the title, the header line as the
main row, the first column as the main column, and the footnote as the legend.)
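The percentage column of Table 1 can be recomputed from the counts with a few lines of code. The counts are taken from the table; the variable names are hypothetical, and percentages are rounded to whole numbers as on the slide.

```python
# Counts of Anti-HBs (+) cases per screened group, as given in Table 1
counts = {
    "Children of 7 y.": 3,
    "Children of 11 y.": 7,
    "Children of 17 y.": 3,
    "Roma people": 1,
    "Contacts in family": 3,
    "Health professionals": 13,
}

total = sum(counts.values())

# Relative frequency of each group, as a whole-number percentage
percentages = {group: round(100 * k / total) for group, k in counts.items()}

for group, k in counts.items():
    print(f"{group:<22}{k:>4}{percentages[group]:>5}%")
print(f"{'Total':<22}{total:>4}")
```

Because each percentage is rounded separately, the rounded values need not add up to exactly 100%.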


Frequency table
Contingency table (cross tabulation)
Table 2. HBV high-risk groups to be screened by residence*

Risk group \ Residence   Smolyan   Zlatograd   Rudozem   Subtotal
Contacts in family       65        20          15        100
Health professionals     98        30          22        150
Roma people              65        20          15        100
Total                                                    350

* Part of TPTBHB Project
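The marginal totals of a contingency table such as Table 2 can be computed as sketched below. The cell counts are taken from the table; the column totals are not shown on the slide and are derived here, and the variable names are hypothetical.

```python
# Cell counts from Table 2: risk group (rows) by residence (columns)
table = {
    "Contacts in family":   {"Smolyan": 65, "Zlatograd": 20, "Rudozem": 15},
    "Health professionals": {"Smolyan": 98, "Zlatograd": 30, "Rudozem": 22},
    "Roma people":          {"Smolyan": 65, "Zlatograd": 20, "Rudozem": 15},
}

# Row totals (the "Subtotal" column)
row_totals = {group: sum(cells.values()) for group, cells in table.items()}

# Column totals (one per residence)
residences = ["Smolyan", "Zlatograd", "Rudozem"]
column_totals = {r: sum(cells[r] for cells in table.values()) for r in residences}

grand_total = sum(row_totals.values())

print(row_totals)
print(column_totals)
print(grand_total)
```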


Graphical summaries
Variable                              Graph                                         Statistics
One qualitative                       Bar chart, pie chart                          Frequency table, relative frequency table, proportion
Two qualitative                       Side-by-side bar chart, segmented bar chart   Two-way table, difference in proportions
One quantitative                      Dotplot, histogram, boxplot                   Measures of central tendency, measures of spread; other: five number summary, percentiles, distribution shape
One quantitative by one qualitative   Side-by-side boxplots, stacked dotplots       Statistics broken down by group, difference in means
Two quantitative                      Scatterplot                                   Correlation
Bar chart
• Bar chart is a way to visually represent qualitative data.
• Data is displayed either horizontally or vertically and
allows viewers to compare items, such as amounts,
characteristics, and frequency.
• Bars are often arranged in order of frequency, so that the
most frequent categories are emphasized.
• Bar charts can be either single, stacked, or grouped.
Pie chart
• Pie chart is helpful when graphing qualitative data, where
the information describes a trait or attribute and is not
numerical.
• Each slice of pie represents a different category, and
each trait corresponds to a different slice of the pie—with
some slices usually noticeably larger than others.
Histogram
• A histogram is used with quantitative data. Ranges of
values, called classes, are listed at the bottom, and the
classes with greater frequencies have taller bars.
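How values are grouped into classes can be sketched without any plotting library. The ages and the class width of 10 years are hypothetical; each "#" stands for one observation.

```python
from collections import Counter

# Hypothetical ages of 13 patients
ages = [23, 25, 31, 34, 36, 38, 41, 43, 45, 47, 52, 56, 67]

CLASS_WIDTH = 10

# Map every value to the lower bound of its class, then count per class
class_counts = Counter((age // CLASS_WIDTH) * CLASS_WIDTH for age in ages)

# Text histogram: one row per class, bar length equals class frequency
for lower in sorted(class_counts):
    upper = lower + CLASS_WIDTH - 1
    print(f"{lower}-{upper}: {'#' * class_counts[lower]}")
```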
Histogram
• A histogram often looks similar to a bar chart, but they
are different because of the level of measurement of the
data:
– A bar chart is for categorical data, and the x-axis has
no numeric scale
– A histogram is for quantitative data, and the x-axis is
numeric.
Boxplot
• Boxplot is a method for graphically depicting groups of
numerical data through their quartiles.
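The quartiles behind a boxplot can be computed with Python's standard `statistics.quantiles`. The data are hypothetical. The 1.5 × IQR rule used below to flag outliers is a common convention and is an assumption here, not something stated on the slide; different software may also compute quartiles slightly differently.

```python
import statistics

# Hypothetical measurements
data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 30]

# Quartiles: Q1, median (Q2), Q3
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the length of the box

# Common convention (assumed): points beyond 1.5 x IQR from the box are outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"min = {min(data)}, Q1 = {q1}, median = {median}, Q3 = {q3}, max = {max(data)}")
print(f"IQR = {iqr}, outliers = {outliers}")
```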
Scatterplot
• Scatterplot is a type of plot using Cartesian coordinates
to display values for two variables for a set of data.
• Data are displayed as a collection of points, each having the
value of one variable determining the position on the horizontal
axis and the value of the other variable determining the position
on the vertical axis.
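The statistic usually paired with a scatterplot is the correlation between the two variables. A minimal sketch of the Pearson correlation coefficient follows; the paired values and the function name `pearson_r` are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired measurements: x on the horizontal axis, y on the vertical
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(f"r = {r:.2f}")
```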
