
NATURE OF BIOSTATISTICS AND DATA PROCESSING

STATISTICS
• An art of summarizing data
• Tool in decision making
• METHOD or DATA

USES:
• Data reduction technique
• Tool for analyzing research projects and clinical trials
• Tool for objective appraisal and evaluation of programs
• Tool in decision-making process and policy making

2 AREAS OF STATISTICS
1. MATHEMATICAL STATISTICS - concerned with the development of new methods of statistical inference; requires detailed knowledge of abstract mathematics for its implementation
2. APPLIED STATISTICS - involves applying the methods of mathematical statistics to specific subject areas such as BIOSTATISTICS

BIO MEANS LIFE; STATISTICS IS the science dealing with the collection, organization, analysis, and interpretation of numerical data

BIOSTATISTICS - A special branch of statistics which deals with quantitative and qualitative aspects of vital phenomena. It is the application of statistical methods to the life sciences like biology, medicine, and public health.

TWO BRANCHES OF BIOSTATISTICS
1. DESCRIPTIVE
• Methods of summarizing and presenting data
• Computation of measures of central tendency and variability
• Tabulation and graphical presentation
• Facilitate understanding, analysis, and interpretation of data
2. INFERENTIAL
• Methods of arriving at conclusions and generalizations about a target population based on information from a sample
• Estimation of parameters and hypothesis testing

USES OF BIOSTATISTICS
• Epidemiology: distribution & determinants of health-related states and events
• Demography: a study of the human population
• Health Economics: functioning of the health care system and health-affecting behaviors
• Genetics and Genomics: heredity; genes and functions

TERMINOLOGIES
• Population - the whole defined group or entirety
• Sample - portion or subset of the population
• Parameter - measure of a characteristic of the population
• Statistic - any quantity computed from sample values
• Data - pieces of information
• Constant - unchanging value
• Variable - changing value
Types of Quantitative Data
• Discrete - assumes a finite or countable number of values
• Continuous - assumes an infinite number of possible values within a range

INTRODUCTION TO DATA PROCESSING
• Problem Identification / Hypothesis
• Objective Formulation
• Review of Related Literature
• Research Design
• Sampling Design and Estimation
• Data Collection and Processing
• Data Analysis
• Writing the Report
• Dissemination of results

TYPES OF DATA
According to Source:
• Primary Data
• Secondary Data
According to Function:
• Dependent
• Independent

CATEGORIES OF DATA
Types of Variable
• Qualitative - categories are simply descriptions or labels to distinguish one group from another.
• Quantitative - categories can be measured and ordered according to quantity or amount and can be expressed numerically. Can either be discrete or continuous.

SCALE OF MEASUREMENT OF VARIABLES
1. Nominal
• Simply used as names or identifiers of a category
• Always qualitative
• Does not represent any amount or quantity
2. Ordinal
• Represents an ordered series of relationships
• May be qualitative or quantitative
3. Interval
• Does not have a true-zero starting point
• Always quantitative
4. Ratio
• Modified interval level which includes a true zero as a starting point
• Always quantitative

DATA PROCESSING
Systematic procedure to ensure that the information/data gathered are complete, consistent and suitable for analysis.

Data Processing Flowchart
1) DATA COLLECTION
2) DATA PROCESSING
   a) CODING
   b) ENCODING
   c) EDITING
3) ANALYSIS

DATA CODING
- Conversion of verbal/written information into numbers which can be more easily encoded, counted and tabulated.
Why do we code? To allow researchers to work with a small amount of qualitative information and to have a smooth coding and billing process.
Example:
• "0" for male and "1" for female
• "1" for agree and "2" for disagree

TYPES OF CODE
A. Field Code: actual value or information given by the respondent
Example:
• Age (yrs) - 30 years old
• Weight (lbs) - 180 lbs
• Height (cm) - 144 cm

B. Bracket Code: recorded as a range of values rather than actual values
Example: Monthly income
• 1 - less than Php 5,000
• 2 - Php 5,000 to Php 10,000
• 3 - Above Php 10,000

C. Factual Code: codes are assigned to a list of categories of a given variable
Example: Civil Status
• 1 - Single
• 2 - Married
• 3 - Widowed

D. Pattern Code: applicable for questions with multiple responses
Example: Symptoms of COVID-19
o Fever
o Dry cough
o Sore throat
o Loss of taste or smell
o Others:

RULES IN CODE CONSTRUCTION
• The number of codes must be kept to a minimum (preferably < 8).
• Codes should be exhaustive and mutually exclusive
• Adopt a coding convention for questions with similar answers.

CODING MANUAL - A document which contains a record of all codes assigned to the responses to all questions in the data collection forms.
Minimum information that must be included in a coding manual:
• Variable name
• Variable description
• Coding instructions

DATA ENCODING
Entering the data/responses in a spreadsheet
• MS Excel
• MS Access
• Epi Info

DATA EDITING - Inspection and correction of any errors or inconsistencies in the information collected
• Done during data collection, during encoding, and before data analysis
TYPES OF EDITING
A. Field Editing
• Reviewing the accomplished data collection forms
• Decoding of abbreviations or special symbols
• Making callbacks/messages for verification/clarification of incomplete answers

B. Central Editing
• Checking of inconsistencies and incorrect entries after receiving the questionnaire from the field
• Checking of encoded data

IMPORTANCE OF DATA EDITING
• Make corrections as early as possible
• Reduce non-response or incomplete answers
• Eliminate inconsistencies, incorrect info.
• Make the entries clear, legible and comprehensive
• Prepare data for analysis

What to check when editing data?
• Check for duplicate entries
• Check whether the total for each variable is the same as the sample size
• For qualitative data, check if categories are consistent with what is specified in the coding manual
• For quantitative data, check the minimum and maximum if they are logical given the possible values of the variable

DATA PRESENTATION AND PRINCIPLES OF SAMPLING

PRINCIPLES OF SAMPLING
POPULATION (N) - the complete collection or totality of all possible values of the variables.
SAMPLE (n) - a subset or sub-collection of elements drawn from a population

STAGES IN THE SELECTION OF A SAMPLE
1. Define the Target Population
2. Select a Sampling frame
3. Determine if a probability or non-probability sampling method will be chosen.
4. Plan procedure for selecting sampling units
5. Determine sample size
6. Select actual sampling units
7. Conduct fieldwork

Determining the Sample
There is no general rule regarding the sample size. However, the higher the percentage of the sample, the higher the validity of the study. The bigger the population, the smaller the percentage taken as the sample. For a specific calculation of the sample for the purpose of adequate sampling, the use of Slovin's formula, presented below, is advised, as given by Pagoso.

SIMPLIFIED FORMULA FOR PROPORTION
• N = Population

• n = sample size
• e = margin of error or percentage of error

$$n = \frac{N}{1 + N e^{2}}$$

SAMPLING TECHNIQUE
1. Simple Random - In this technique, elements of the sample are selected through lottery.
2. Systematic - This technique of sampling is done by taking every k-th element from a numbered list of the population as part of the sample.
3. Cluster - The population under this technique is divided into sections (or clusters), and some of these clusters are randomly selected as the members of the sample.
4. Stratified - In this technique, the population is subdivided into at least two different sub-populations (or strata) that share the same characteristics, and then the elements of the sample are drawn from each stratum proportionately.

SAMPLING - is the act of studying or examining only a segment of the population to represent the whole.
Two key features:
• Representative of the population
• Adequate sample size

BASIC CONCEPTS: a population is an entire group of individuals or items of interest in the study (universe).

TERMS:
• Target population
• Sampling population
• Sampling unit
• Elementary unit/element
• Sampling frame
• Sampling error

ADVANTAGES OF SAMPLING
• Cheaper
• Faster
• Better quality of information can be collected
• More comprehensive data may be obtained
• Only possible method for destructive procedures

CRITERIA OF A GOOD SAMPLING DESIGN
• Representative of the population
• Adequate sample size
• Practical and feasible
• Economy and efficiency

BASIC SAMPLING DESIGN
NON-PROBABILITY SAMPLING - the probability of each member of the population being selected as part of the sample is difficult to determine or cannot be specified
PROBABILITY SAMPLING - each member of the population has a known non-zero chance of being selected as part of the sample.

NON-PROBABILITY SAMPLING DESIGNS
• Judgmental/Purposive - based on an expert's subjective judgement.
• Accidental / Haphazard - those who are available
• Quota - samples of a fixed size
• Snowball - the individual to be included is identified by a member who was previously included
• Convenience - units are easily accessible

NON-PROBABILITY SAMPLING DESIGNS
Advantages
• Easier to execute
• Only possible means
Disadvantages
• More likely to produce biased results
• No defined rules to compute estimates
• Cannot compute the reliability of estimates.

PROBABILITY SAMPLING DESIGNS
• Simple random sampling
• Systematic sampling
• Stratified sampling
• Cluster sampling
• Multi-stage sampling

Simple random sampling
• Most basic type
• Every element in the population has an equal chance of being included in the sample
• In this technique, elements of the sample are selected using either the lottery method or random numbers generated by a calculator, Excel, Epi Info, etc.
ADVANTAGES:
• Simple design
• Simple analysis
DISADVANTAGES
• Not cost efficient
• Requires a sampling frame

SYSTEMATIC SAMPLING - done by taking every k-th element from a numbered list of the population as part of the sample.
The researcher computes the sampling interval (k = N/n).
HOW TO DO SYS:
1. Compute the sampling interval (k = N/n)
2. Draw a random number between 1 and k. This will be the starting point.
3. Count intervals from the starting point until the sample size is reached.

Systematic Sampling (SYS)
Advantages
• Less time consuming and easier to perform
• Can sometimes result in a representative sample
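As a rough illustration of Slovin's simplified formula and of the systematic sampling steps (compute k = N/n, draw a random start between 1 and k, then take every k-th unit), here is a minimal Python sketch. The function names, the decision to round the sample size up, and the use of integer division for k are assumptions made for this example and are not part of the notes.

```python
import math
import random


def slovin_sample_size(population_size, margin_of_error):
    """Slovin's simplified formula n = N / (1 + N * e^2).

    Rounding up to a whole unit is an assumption of this sketch.
    """
    n = population_size / (1 + population_size * margin_of_error ** 2)
    return math.ceil(n)


def systematic_sample(units, sample_size, rng=random):
    """Take every k-th unit after a random start between 1 and k."""
    k = len(units) // sample_size      # sampling interval k = N/n (integer part)
    start = rng.randint(1, k)          # random starting point, 1..k inclusive
    # Positions are 1-based in the notes, so subtract 1 for list indexing.
    return [units[start - 1 + i * k] for i in range(sample_size)]


# Hypothetical sampling frame: units numbered 1 to 1000.
frame = list(range(1, 1001))
n = slovin_sample_size(len(frame), 0.05)   # 1000 / (1 + 1000 * 0.0025) = 285.7...
sample = systematic_sample(frame, n, random.Random(1))
print(n, len(sample))
```

With N = 1000 and e = 0.05 the formula gives about 285.7, so the sketch selects 286 units at an interval of k = 3.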
Disadvantages
• Units could be widely spread out
• Systematic bias

STRATIFIED SAMPLING
• The population is first divided into non-overlapping groups called strata
• Samples are selected from each stratum through SRS or SYS.

PROCESS IN STRATIFIED SAMPLING
1. Divide the sampling population into strata
2. Obtain the sampling frame for each stratum
3. Compute the sampling fraction, p = n/N
4. Select a random sample from each stratum using the sampling fraction p

STRATIFIED RANDOM SAMPLING
Advantages
• Ensures subgroups are adequately represented
• Accurate estimates for each stratum can be obtained
• Produces more reliable results
Disadvantages
• May require a very large sample if reliable estimates for each stratum are wanted

CLUSTER SAMPLING
• The selection of groups of study units (clusters) instead of the selection of study units individually.
• Clusters are usually of the same size and the characteristics of units across clusters are homogeneous or similar.

MULTI-STAGE SAMPLING
• a procedure carried out in phases that usually involves more than one sampling method.
• often used in community-based studies
• Sampling design: 4-stage stratified, systematic, cluster, simple sampling design

Multi-Stage Sampling
Advantages
• Cost efficient design
• Sampling frame for all elementary units not required
• Sample is easier to select
Disadvantages
• More complicated design to implement
• More complicated analysis
• Needs bigger sample size to achieve sample precision

DATA PRESENTATION
Tabular and Graphical Presentation of Data
METHODS OF PRESENTING DATA

Textual - used to provide contextual information; fundamentally presented in paragraphs or sentences.

Tabular - A table is one of the simplest means to summarize a set of observations and can be used for all types of numerical data. It is useful in summarizing and comparing quantitative information coming from different variables and different units, which can consequently be presented together.

Graphical - Graphs simplify complex information by using images and emphasizing data patterns or trends, and are useful for summarizing, explaining, or exploring quantitative data.

TEXT PRESENTATION
Example:
Information about the incidence rates of acute diarrhea among children aged 3-7 years old in 2016-2017 can be presented with the use of a few numbers:
"The incidence rate of acute diarrhea among children aged 3-7 years old was 10% in 2016 and 15% in 2017; no significant difference of incidence rates was found between the two years."

FREQUENCY DISTRIBUTION
• A summary of the data can make things easier. It lists all classes and their frequencies.
• The number of times that something occurs is known as its frequency.
• For nominal and ordinal data, a frequency distribution is usually composed of a set of classes or categories along with the numerical counts that correspond to each one.
• For discrete or continuous data, a frequency distribution breaks down the range of values of the observations into a series of distinct, non-overlapping intervals.

RELATIVE FREQUENCY
• The proportion of the total number of observations that appears in that interval.
• It is computed by dividing the number of values within an interval by the total number of values in the table, multiplied by 100% to obtain the percentage of values in the interval.
• Relative frequencies are useful for comparing sets of data that contain unequal numbers of observations.
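The relative frequency computation (count in an interval divided by the total count, multiplied by 100%) can be sketched in a few lines of Python. The interval labels and counts below are made up for illustration, and the function name is an assumption of this example.

```python
def relative_frequencies(counts):
    """Return the percentage of all observations falling in each interval."""
    total = sum(counts.values())
    return {interval: 100 * count / total for interval, count in counts.items()}


# Hypothetical frequency distribution: age interval -> number of observations.
age_counts = {"20-29": 5, "30-39": 10, "40-49": 5}

for interval, percent in relative_frequencies(age_counts).items():
    print(interval, percent)
```

Because each value is expressed as a share of its own total, two tables with different numbers of observations can be compared directly.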
CUMULATIVE RELATIVE FREQUENCY
• The percentage of the total number of observations that have a value less than or equal to the upper limit of the interval.
• It is calculated by summing the relative frequencies for the specified interval and all previous ones.

GRAPHICAL PRESENTATION
TYPES:
• Pie chart
• Bar graph
• Component bar graph
• Line graph
• Histogram
• Frequency polygon
• Stem and leaf plot
• Box plot
• Scatter plot

A. PIE CHART
• Circle subdivided into a number of slices; the area of each slice represents the relative proportion of data points falling into a given category
• Used to show how a whole is divided into its component parts, which could be breakdowns of groups or totals

BAR CHARTS
• Popular type of graph used to display a frequency distribution for nominal or ordinal data.

• In a bar chart, the various categories into which the observations fall are presented along a horizontal axis.
• A vertical bar is drawn above each category such that the height of the bar represents either the frequency or the relative frequency of observations within that class.

Line Graph
• plot of dots joined with lines over some period of time in a sequential series
• time series charts
• horizontal axis: time series
• vertical axis: variable values

Histogram
• bars are used to depict the number or relative frequencies of data points falling into a given class
• bars are drawn over the true limits of the classes; no gaps exist in between
• horizontal axis: continuous quantitative variable
• vertical axis: number or relative frequencies
• preferred for grouped interval data

What is the difference between a bar graph and a histogram?

Bar graphs are usually used to display "categorical data", that is, data that fits into categories. For example, suppose that I offered to buy donuts for six people and three said they wanted chocolate covered, two said plain and one said with icing sugar.

Histograms, on the other hand, are usually used to present "continuous data", that is, data that represents a measured quantity. The data would then be collected into categories to present a histogram.

Frequency Polygon
Similar to histogram except that:
• frequencies are plotted against the corresponding midpoints of the classes
• adjacent points are joined with lines and the plot is tied down to the horizontal axis, resulting in a multi-sided polygon

STEM-AND-LEAF PLOT
• primarily for small sets of data; provides rank-ordered lists, and it is easier to restore the original value of the observation
• lines give more information than bars in a histogram
• used to show the actual data value instead of using bars to represent the height of an interval

BOX PLOT
• shows a description of a large set of quantitative data
• includes center, spread, shape, tail length, and outlying data points
• can be presented horizontally or vertically
• height of rectangle is arbitrary and has no specific meaning
• used for comparing the distributions of several variables or the distribution of a single variable in several groups on the same scale

SCATTER PLOT
• shows the relationship between two quantitative variables
• gives a rough estimate of the type and degree of correlation between the variables

Graphical Presentation

ADVANTAGES
• main features & implications of the data can be grasped at a glance
• more attractive & appealing to a wider range of readers
• simplifies concepts that would otherwise have been expressed in many words
• shows trends & patterns of a large set of data
• makes comparisons striking
• can readily clarify data

DISADVANTAGES
• cannot show as many sets of facts
• can only show approximate values
• requires more time to construct
• may cause results to be misinterpreted


MEASURES OF CENTRAL
TENDENCY, DISPERSION,
AND LOCATION
OBJECTIVES:
At the end of the unit, students should be able to:
• Describe the basic formula of the measures of central tendency, dispersion, and location
• Explain the uses and limitations of these measures
• Correctly interpret these measures
Central Tendency & Dispersion

• Types of Distributions: Normal, Skewed
• Central Tendency: Mean, Median, Mode
• Dispersion: Variance, Standard Deviation
DESCRIPTIVE STATISTICS
are concerned with describing the
characteristics of frequency distributions

• Where is the center?
• What is the range?
• What is the shape [of the distribution]?
Frequency Table: Test Scores

Observation (scores)    Frequency (# occurrences)
65                      1
70                      2
75                      3
80                      4
85                      3
90                      2
95                      1

Q: What is the range of test scores?
A: 30 (95 minus 65)
Q: When calculating the mean, one must divide by what number?
A: 16 (total # occurrences)
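The two answers above can be checked with a short Python sketch that works directly from the frequency table. The variable names are chosen for this example only; the mean of the table is not stated in the notes and is computed here from the listed scores and frequencies.

```python
# Test-score frequency table from the notes: score -> number of occurrences.
score_freq = {65: 1, 70: 2, 75: 3, 80: 4, 85: 3, 90: 2, 95: 1}

# Total number of observations: the number the mean is divided by.
n_obs = sum(score_freq.values())

# Range = highest score minus lowest score.
score_range = max(score_freq) - min(score_freq)

# Mean from a frequency table: weight each score by its frequency.
mean_score = sum(score * count for score, count in score_freq.items()) / n_obs

print(n_obs, score_range, mean_score)
```

Each score is multiplied by its frequency before summing, which is equivalent to listing all 16 observations and averaging them.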
Frequency Distributions
[Figure: histogram of the test scores; horizontal axis: Test Score (65 to 95); vertical axis: Frequency (# occurrences)]
Normally Distributed Curve
Skewed Distributions

We say the distribution is skewed to the left when the "tail" is to the left.
We say the distribution is skewed to the right when the "tail" is to the right.
Voter Turnout in 50 States - 1940

Q: Is this distribution positively or negatively skewed?


A: Negatively
Q: Would we say this distribution is
skewed to the left or right?
A: Left (skewed in direction of tail)
Characteristics - Normal Distribution
• It is symmetrical: half the values are to one side of the center (mean), and half the values are on the other side.
• The distribution is single-peaked, not bimodal or multi-modal.
• Most of the data values will be "bunched" near the center portion of the curve. As values become more extreme they become less frequent, with the "outliers" being found at the "tails" of the distribution and few in number.
• The Mean, Median, and Mode are the same in a perfectly symmetrical normal distribution.
• The percentage of values that occur in any range of the curve can be calculated using the Empirical Rule.
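The notes name the Empirical Rule without stating it. As commonly given, about 68%, 95%, and 99.7% of values in a normal distribution fall within 1, 2, and 3 standard deviations of the mean. The sketch below, which uses Python's standard `statistics.NormalDist`, is one way to check those figures; the helper name `share_within` is an assumption of this example.

```python
from statistics import NormalDist


def share_within(k, dist=NormalDist()):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    upper = dist.cdf(dist.mean + k * dist.stdev)
    lower = dist.cdf(dist.mean - k * dist.stdev)
    return upper - lower


for k in (1, 2, 3):
    # Prints roughly 68.3, 95.4 and 99.7 percent.
    print(k, round(100 * share_within(k), 1))
```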
Summarizing Distributions
Two key characteristics of a frequency distribution
are especially important when summarizing data
or when making a prediction:
• CENTRAL TENDENCY
  - What is in the "middle"?
  - What is most common?
  - What would we use to predict?
• DISPERSION
  - How spread out is the distribution?
  - What shape is it?
The MEASURES of Central Tendency

• Three measures of central tendency are commonly used in statistical analysis: MEAN, MEDIAN, and MODE.
• Each measure is designed to represent a "typical" value in the distribution.
• The choice of which measure to use depends on the shape of the distribution (whether normal or skewed).
Mean - Average
• Most common measure of central tendency.
• Is sensitive to the influence of a few extreme values (outliers), thus it is not always the most appropriate measure of central tendency.
• Best used for making predictions when a distribution is more or less normal (or symmetrical).
• Symbolized as:
  - x̄ for the mean of a sample
  - μ for the mean of a population


Finding the Mean

• Formula for the mean: $\bar{X} = \frac{\sum x}{N}$
• Given the data set: {3, 5, 10, 4, 3}

$$\bar{X} = \frac{3 + 5 + 10 + 4 + 3}{5} = \frac{25}{5} = 5$$
Find the Mean
Q: 85, 87, 89, 91, 98, 100
A: 91.67

Q: 5, 87, 89, 91, 98, 100


A: 78.3 (Extremely low score lowered the Mean)
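The two exercises above can be reproduced with Python's standard `statistics.mean`; this is only a sketch to show how a single extreme value pulls the mean down, and the variable names are chosen for the example.

```python
from statistics import mean

scores = [85, 87, 89, 91, 98, 100]
scores_with_outlier = [5, 87, 89, 91, 98, 100]   # first score replaced by an extreme low value

print(round(mean(scores), 2))                # 550 / 6
print(round(mean(scores_with_outlier), 2))   # 470 / 6
```

Only one of the six values changed, yet the mean drops by more than 13 points, which is why the mean is described as sensitive to outliers.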
Median
• Used to find the middle value (center) of a distribution.
• Used when one must determine whether the data values fall into either the upper 50% or lower 50% of a distribution.
• Used when one needs to report the typical value of a data set, ignoring the outliers (few extreme values in a data set).
  - Example: median salary, median home prices in a market
• Is a better indicator of central tendency than the mean when one has a skewed distribution.
To compute the median
• First, order the values of X from low to high:
  85, 90, 94, 94, 95, 97, 97, 97, 97, 98
• Then count the number of observations: 10.
• When the number of observations is even, average the two middle numbers to calculate the median.
• In this example, the two middle values are 95 and 97, so 96 is the median (middle) score.
Median
Find the median of each data set:
• 4 5 6 6 7 8 9 10 12
• 5 6 6 7 8 9 10 12
• 5 6 6 7 8 9 10 100,000
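One way to check these three exercises is with Python's standard `statistics.median`, which sorts the data and averages the two middle values when the count is even. The list names below are chosen for this sketch.

```python
from statistics import median

odd_count = [4, 5, 6, 6, 7, 8, 9, 10, 12]       # 9 values: the middle one is the median
even_count = [5, 6, 6, 7, 8, 9, 10, 12]         # 8 values: average the two middle ones
with_outlier = [5, 6, 6, 7, 8, 9, 10, 100_000]  # same as above, but the last value is extreme

print(median(odd_count), median(even_count), median(with_outlier))
```

The second and third data sets have the same median even though the largest value changes from 12 to 100,000, which illustrates why the median is preferred for skewed data.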
Mode
• Used when the most typical (common) value is desired.
• Often used with categorical data.
• The mode is not always unique. A distribution can have no mode, one mode, or more than one mode. When there are two modes, we say the distribution is bimodal.
EXAMPLES:
a) {1,0,5,9,12,8} - No mode
b) {4,5,5,5,9,20,30} – mode = 5
c) {2,2,5,9,9,15} - bimodal, mode 2 and 9
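The three examples can be reproduced with a small helper. This sketch follows the convention used in the notes, where a data set in which every value occurs once has no mode; the function name `modes` is an assumption of this example. Python's built-in `statistics.multimode` would instead return every value in that case.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []   # every value occurs once: treated as "no mode"
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([1, 0, 5, 9, 12, 8]))        # no mode
print(modes([4, 5, 5, 5, 9, 20, 30]))    # one mode
print(modes([2, 2, 5, 9, 9, 15]))        # bimodal
```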
Measures of Variability
• Central tendency doesn't tell us everything. Dispersion (also called deviation or spread) tells us a lot about how the data values are distributed.
• We are most interested in:
  - Standard Deviation (σ) and
  - Variance (σ²)
Why can’t the mean tell us everything?

• The mean describes the average outcome.
• The question becomes: how good a representation of the distribution is the mean? How good is the mean as a description of central tendency, or how accurate is the mean as a predictor?
• ANSWER: it depends on the shape of the distribution. Is the distribution normal or skewed?
Dispersion
• Once you determine that the data of interest are normally distributed, ideally by producing a histogram of the values, the next question to ask is: How spread out are the values about the mean?
• Dispersion is a key concept in statistical thinking.
• The basic question being asked is: how much do the values deviate from the mean? The more "bunched up" the values are around the mean, the better your ability to make accurate predictions.
Mean Absolute Deviation
The key concept for describing normal distributions and making predictions from them is called deviation from the mean.

We could just calculate the average distance between each observation and the mean.
• We must take the absolute value of each distance, otherwise they would just cancel out to zero!

Formula:

$$\text{Mean Absolute Deviation} = \frac{\sum \left| \bar{X} - X_i \right|}{n}$$
Mean Absolute Deviation: An Example

Data: X = {6, 10, 5, 4, 9, 8}    X̄ = 42 / 6 = 7

Steps:
1. Compute X̄ (average)
2. Compute X̄ – Xi and take the absolute value to get the absolute deviations
3. Sum the absolute deviations
4. Divide the sum of the absolute deviations by N

X̄ – Xi     Abs. Dev.
7 – 6       1
7 – 10      3
7 – 5       2
7 – 4       3
7 – 9       2
7 – 8       1
Total:      12

Mean absolute deviation = 12 / 6 = 2
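The four steps of the worked example translate directly into Python. The sketch below is a minimal version using only built-ins; the function name is chosen for this example.

```python
def mean_absolute_deviation(values):
    """Average absolute distance between each observation and the mean."""
    n = len(values)
    mean = sum(values) / n                       # step 1: compute the mean
    abs_devs = [abs(mean - x) for x in values]   # step 2: absolute deviations
    return sum(abs_devs) / n                     # steps 3 and 4: sum, then divide by N


data = [6, 10, 5, 4, 9, 8]
print(mean_absolute_deviation(data))   # 12 / 6
```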
What Does it Mean?
• On average, each value is two units away from the mean.

Is it Really that Easy?
• No!
• Absolute values are difficult to manipulate algebraically
• Absolute values cause enormous problems for calculus (the absolute value function is not differentiable at zero)
• We need something else…
Variance and Standard Deviation
• Instead of taking the absolute value, we square the deviations from the mean. This yields a non-negative value for every deviation.
• This will result in measures we call the Variance and the Standard Deviation

                      Sample    Population
Standard Deviation    s         σ
Variance              s²        σ²
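The notes introduce the symbols but do not give the formulas at this point. Under the usual convention, which is an assumption here rather than something stated in the source, the population variance divides the sum of squared deviations by N and the sample variance divides it by n - 1; the standard deviation is the square root of the variance. A minimal Python sketch, reusing the data set from the mean absolute deviation example:

```python
import math


def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample) or N (population)."""
    n = len(values)
    mean = sum(values) / n
    squared_devs = sum((x - mean) ** 2 for x in values)
    return squared_devs / (n - 1 if sample else n)


def standard_deviation(values, sample=True):
    """Square root of the variance."""
    return math.sqrt(variance(values, sample))


data = [6, 10, 5, 4, 9, 8]   # mean 7, squared deviations sum to 28
print(variance(data, sample=True), variance(data, sample=False))
print(standard_deviation(data, sample=True), standard_deviation(data, sample=False))
```

For this data set the squared deviations are 1, 9, 4, 9, 4, and 1, so the sample variance is 28 / 5 and the population variance is 28 / 6.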
Standard Deviation
