Biostatistics MLT 1-5

Biostatistics Handout
Chapter I: Introduction & Descriptive Statistics
By Israel M.
10/02/2022 1
Objectives
• Define nominal, ordinal, discrete and continuous data
and describe the differences between these types of
data.
• Use graphs to interpret various types of data.
• Define and calculate measures of location of a data set.
• Define and calculate measures of dispersion (variability) of a data set.
Lecture Topics
• Definition of terminologies/Introduction
• Types of numerical data
• Tables and Graphs
• Measures of Central Tendency
• Measures of Dispersion
Introduction
• Statistics: A field of study concerned with:
– The collection, organization, analysis, summarization
and interpretation of numerical data, and
– Making inferences about specific random
phenomena on the basis of relatively limited sample
material.
• Biostatistics: the application of statistical
methods to the fields of biological and medical
sciences.
Types of statistics
• The field of Statistics is subdivided into 2 main areas:
– Mathematical: development of new methods of

statistical inference
– Applied: application of the methods of

mathematical statistics to specific areas, such as
public health
Descriptive statistics:
• Exploratory data analysis
• Ways of organizing and summarizing data
• Identifying the general features and trends in a set of
data and extracting useful information
• Conveying the final results of a study
Inferential statistics:
• Confirmatory data analysis
• Statistical inference makes use of information from a
sample to draw conclusions (inferences) about the
population from which the sample was taken
• Used to draw conclusions about a population based
on the information obtained from a sample of
observations drawn from that population
Using Statistics (Two Categories)
Descriptive Statistics
• Organize
• Summarize
• Display
Example: tables, graphs, numerical summary measures
Inferential Statistics
• Predict and forecast values of population parameters
• Test hypotheses about values of population parameters
• Make decisions
Uses of biostatistics
 Assessment of health status
 Provide methods of organizing information
 Health program evaluation
 Resource allocation
 Magnitude of association
– Strong vs weak association between exposure and outcome
• Assessing risk factors
– Cause & effect relationship
• Evaluation of a new vaccine or drug

– What can be concluded if the proportion of people free
from the disease is greater among the vaccinated than the
unvaccinated?
– How effective is the vaccine (drug)?
– Is the effect due to chance or some bias?
• Drawing inferences
– Information from sample to population
Data
 Encompasses observations on one or more variables.
 Set of data is a collection of observed values
representing one or more characteristics of some
objects or units.
 Are numbers which can be measurements or can be
obtained by counting
 Age and height of students in the class
Types of Data
1. Primary data: collected from the items or individual
respondents directly by the researcher for the
purpose of a study.
 More reliable and relatively accurate.
 More expensive and time consuming
2. Secondary data: data that were previously collected and
statistically processed by another person or organization,
and that are now used by others for a different
purpose.
Sources of data
• Routinely kept records
• Population Surveys
• Experiments
• Reports
• Literature
• Etc
Variable
• Variable: a characteristic which takes different values
in different persons, places, or things.
• Any aspect of an individual or object that is measured
(e.g., BP) or recorded (e.g., age, sex) and can take
different values.
• Variables can be broadly classified into:
– Categorical (or Qualitative) or
– Quantitative (or numerical variables).
Categorical variable:
 Cannot be measured in quantitative form as we
measure height or weight but only sorted by name or
categories
 Do not have numerical values
 The notion of magnitude is absent or implicit
 Can be nominal or ordinal

Example: Occupation, marital status, and sex
Quantitative variable:
• Measured (or counted) and expressed numerically.
• Take numerical values whose "size" is meaningful.
• Answer questions such as "how many?" or "how
much?"
• Example: age, height, weight, heart rate, blood
pressure, number of children in a household, and
cholesterol level
Quantitative variable is divided into two:
1. Discrete: Can only have a limited number of
discrete values (usually whole numbers).
• E.g., the number of daily admissions to a hospital
• Characterized by gaps or interruptions in the values

(integers).
• Both the order and magnitude of the values matter.
2. Continuous variable:
 It can have an infinite number of possible values in any given
interval.
 Both the magnitude and the order of the values matter
 Does not possess the gaps or interruptions
 Example: age, height, weight
Measurement Scales
 At what level does the measurement take place?
 The forms in which data is found or the scales on

which data is measured
 Classified as nominal, ordinal, interval, and ratio.
 Stated in terms of increasing information content

Nominal scale
 The simplest type of data, in which the values fall

into unordered categories or classes
 Consists of “naming” observations or classifying

them into various mutually exclusive and
collectively exhaustive categories
 Uses names, labels, or symbols to assign each

measurement.
 Examples: Blood type, sex, race, marital status,
etc.
Nominal scale
• Binary or dichotomous variable: nominal data that
can take on only two possible (distinct) values
• The only valid operations for variables
represented by a nominal scale are the
determination of “=” or “≠”
Ordinal scale
• Assigns each measurement to one of a limited number of
categories that are ranked in terms of order.
• The observations can be ranked from the smallest to the
largest or from the least important to the most important.
• Although non-numerical, the categories can be considered to have a
natural ordering
• Examples: Severity of injury (four levels), patient status,
cancer stages, social class, etc.
Level of severity of injury:
1. Fatal injury
2. Severe
3. Moderate
4. Minor
• These numbers serve only to indicate a pecking order of levels
of the variable; the differences between these numerical values
are meaningless.
Interval scale
• Measured on a continuum and differences between
any two numbers on a scale are of known size.
• Distance between observations is meaningful.
• Here the numbers assigned to the observations
indicate order and possess the property that the
difference between any two consecutive values is the
same as the difference between any other two
consecutive values.
Interval scale
• Has a zero point, but its location is arbitrary: zero does not
mean absence of the quantity (e.g., 0 °C on the Celsius scale).
Hence ratios of interval scale values have no
meaning.
Ratio scale
• Measurement begins at a true zero point and
the scale has equal intervals.
– Examples: Height, age, weight, BP, etc.
• Note on meaningfulness of “ratio”:
– Someone who weighs 80 kg is two times as
heavy as someone else who weighs 40 kg.
This is true even if weight had been
measured in other units.
Population and Sample
 Population refers to a collection of people or objects
that share common observable characteristics. E.g.,
all of the people who live in your city, all women
diagnosed with breast cancer during the last five
years
 Have something in common for which we wish to

draw conclusions at a particular time (target
population)
Sample
 Is a subset of the population
 We use samples in making inferences about
populations
Parameter and Statistic
• Parameter: A descriptive measure computed from
the data of a population.
• Used to describe the attributes of populations
– E.g., the mean (µ) age of the target population
• Statistic: A descriptive measure computed from the
data of a sample.
– E.g., sample mean age (x̄)
• A sample statistic is used in place of the unknown population
parameter (estimation)
Descriptive Statistics
 Techniques used to organize and summarize a set of
data in a concise way.
– Organization of data
– Summarization of data
– Presentation of data
 Raw data: Numbers that have not been summarized

and organized.
 Before summarization and organization, we need to
know the types of variables and measurement scales
of our data.
 Before displaying or analyzing data, classify the

variables into their different types.
Descriptive statistics include:
 Tables
 Graphs
 Numerical summary measures
- Measures of central tendency
- Measures of variability
Frequency Distributions (Tables)
• The actual summarization and organization of data
starts from frequency distribution.
• Frequency distribution: a table which has a list of

each of the possible values that the data can assume
along with the number of times each value occurs
(frequency).
• Provides one of the most convenient ways to

summarize or display grouped data.
Frequency distribution for categorical data
 Count (tally) the number of observations in each

category
 Present the number as frequency (How often)
 Relative frequency: percentage of the total number

of observations
Example: survey conducted among 600 students of

certain college indicated that 200 of them were
female students.
Example
Sex       Frequency   Relative frequency (%)
Male         400             66.7
Female       200             33.3
Total        600            100.0
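The relative frequencies in this table can be reproduced with a few lines of Python. This is only an illustrative sketch; the variable names are assumptions and not part of the handout.

```python
# Counts taken from the survey example above (600 students).
counts = {"Male": 400, "Female": 200}

total = sum(counts.values())

# Relative frequency = category count / total, expressed as a percentage.
relative = {sex: round(100 * n / total, 1) for sex, n in counts.items()}

for sex, n in counts.items():
    print(f"{sex:<8}{n:>6}{relative[sex]:>8.1f}")
print(f"{'Total':<8}{total:>6}{100.0:>8.1f}")
```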
Frequency distribution for Quantitative data
 For a continuous variable (e.g. – age), the frequency

distribution of the individual ages is not so
interesting.
 Form the groups by amalgamating continuous values
into classes of intervals
 Select a set of continuous, non-overlapping intervals
such that each value can be placed in one, and only
one of the intervals.
 The first consideration is how many intervals to
include
• Arrange the data in ascending order (ordered array)
• There is no clear-cut rule on the number of intervals

or classes
• With too many intervals, the data are not

summarized enough for a clear visualization of how
they are distributed.
• Too few intervals are undesirable because the data

are over summarized, and some of the details of the
distribution may be lost.
• Between 5 and 15 intervals are acceptable
• Of course, this also depends on the number of

observations, we can and should use more intervals
for larger data sets
• The widths of the intervals must also be decided
• Intervals generally should be of the same width
 To determine the number of class intervals and the
corresponding width, we may use:
Sturges' rule: $K = 1 + 3.322 \log_{10} n$
Class width: $W = \dfrac{L - S}{K}$
Where
K = number of class intervals
n = number of observations
W = width of the class interval
L = the largest value
S = the smallest value
Example: weights in pounds of 57 children at
a day-care center:
• 68 63 42 27 30 36 28 32 79 27 22 23 24 25 44 65 43
25 74 51 36 42 28 31 28 25 45 12 57 51 12 32 49 38
42 27 31 50 38 21 16 24 69 47 23 22 43 27 49 28 23
19 46 30 43 49 12
• K = 1 + 3.322 (log10 57) = 6.83 ≈ 7
• Maximum value = 79, minimum value = 12
• Range = 79 - 12 = 67
• Width: 67/7 = 9.57 ≈ 10
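As a rough check of this arithmetic, the sketch below applies Sturges' rule to the 57 weights and tallies them into equal-width classes. Rounding both K and the width up, and starting the first class at 10, are assumptions chosen to match the handout's intervals (10-19, 20-29, ...); other choices are equally defensible.

```python
import math

# Weights (lb) of the 57 children, copied from the example above.
weights = [
    68, 63, 42, 27, 30, 36, 28, 32, 79, 27, 22, 23, 24, 25, 44, 65, 43,
    25, 74, 51, 36, 42, 28, 31, 28, 25, 45, 12, 57, 51, 12, 32, 49, 38,
    42, 27, 31, 50, 38, 21, 16, 24, 69, 47, 23, 22, 43, 27, 49, 28, 23,
    19, 46, 30, 43, 49, 12,
]

n = len(weights)
k_exact = 1 + 3.322 * math.log10(n)   # Sturges' rule
k = math.ceil(k_exact)                # assumption: round up to a whole class count
smallest, largest = min(weights), max(weights)
width = math.ceil((largest - smallest) / k)  # assumption: round the width up

start = 10                            # assumption: first class begins at 10
freq = [0] * k
for w in weights:
    freq[(w - start) // width] += 1   # index of the class containing w

print(f"K = {k_exact:.2f} -> {k} classes, width = {width}")
for i, f in enumerate(freq):
    lower = start + i * width
    print(f"{lower}-{lower + width - 1}: {f}")
```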
 Determining the frequencies or the number of values
or measurements for each interval
 Present the proportion or relative frequency in

addition to frequency for each interval
• Cumulative frequency: the number of observations in a class
plus those in all preceding classes.
• The cumulative relative frequency or cumulative
percentage: gives the percentage of
observations less than or equal to the upper boundary
of a particular class interval.
• The cumulative relative frequency can be obtained by
summing the relative frequencies of a particular class interval
and of all the preceding class intervals.
Weight      Frequency   Relative        Cumulative       Cumulative relative
interval    (f)         frequency (%)   frequency (cf)   frequency (%)
10-19          5           8.8              5               8.8
20-29         19          33.3             24              42.1
30-39         10          17.5             34              59.6
40-49         13          22.8             47              82.4
50-59          4           7.0             51              89.4
60-69          4           7.0             55              96.4
70-79          2           3.5             57              99.9 (≈ 100)
Total         57         100
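The cumulative columns can be derived from the class frequencies with a running sum. The sketch below is illustrative only; it computes each cumulative percentage directly from the cumulative count, so a few values differ by 0.1 from the table, which adds up already-rounded percentages (82.4, 89.4, 96.4).

```python
from itertools import accumulate

labels = ["10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-79"]
freq = [5, 19, 10, 13, 4, 4, 2]       # class frequencies from the table

n = sum(freq)
rel_pct = [round(100 * f / n, 1) for f in freq]

# Running total of the frequencies gives the cumulative frequency (cf).
cum_freq = list(accumulate(freq))
cum_pct = [round(100 * cf / n, 1) for cf in cum_freq]

for row in zip(labels, freq, rel_pct, cum_freq, cum_pct):
    print("{:<8}{:>4}{:>8}{:>6}{:>8}".format(*row))
```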
 59.6% of the children in the data set have a
weight of 39.5 lb or less
 Question: What percentage of children in the

data set have a body weight of 59.5 lb or less?
• True limits: those limits that make an interval of a
continuous variable continuous in both directions
• A true boundary is the average of the upper limit of one
interval and the lower limit of the next-higher interval.
• Used for smoothing of the class intervals
• When limits are recorded in whole numbers, subtract 0.5 from the
lower limit and add 0.5 to the upper limit
• Mid-point: the value of the interval which lies midway
between the lower and the upper limits of a class.
Weight      True limits    Mid-point    f
interval
10-19        9.5-19.5        14.5        5
20-29       19.5-29.5        24.5       19
30-39       29.5-39.5        34.5       10
40-49       39.5-49.5        44.5       13
50-59       49.5-59.5        54.5        4
60-69       59.5-69.5        64.5        4
70-79       69.5-79.5        74.5        2
Total                                   57
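The true limits and mid-points in this table follow mechanically from the stated class limits. A minimal sketch, assuming the gap between adjacent classes is constant (here 1, so half the gap is 0.5):

```python
# Stated class limits from the weight table.
intervals = [(10, 19), (20, 29), (30, 39), (40, 49),
             (50, 59), (60, 69), (70, 79)]

# Gap between the upper limit of one class and the lower limit of the next.
gap = intervals[1][0] - intervals[0][1]

# True limits: move each stated limit outward by half the gap.
true_limits = [(lo - gap / 2, hi + gap / 2) for lo, hi in intervals]

# Mid-point: average of the lower and upper true limits.
midpoints = [(lo + hi) / 2 for lo, hi in true_limits]

for (lo, hi), (tlo, thi), m in zip(intervals, true_limits, midpoints):
    print(f"{lo}-{hi}: true limits {tlo}-{thi}, mid-point {m}")
```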
Simple or one-way table
Two-way table
 Shows two characteristics & is formed when either

the column or the row is divided into two or more
parts.
Higher Order Table
 Desired to represent three or more characteristics in a
single table.
Variable Frequency Percent
Sex
Male
Female
Occupation
Student
Farmer
Merchant
Marital Status
Single
Married
Guidelines for constructing tables
 Should be as simple as possible
 Should be self-explanatory
 Clear title telling what, when and where, how
classified and placed above the table
 Each row and column should be labeled
 Numerical entities of zero should be explicitly written
 State clearly the unit of measurement used,
 Explain codes and abbreviations in the foot-note,
 Show totals,
 If data is not original, indicate the source in foot-note
Diagrammatic Representation
• They have greater attraction than mere figures
• They give a quick overall impression of the data
• They are easier to remember than mere figures
• They facilitate comparison
• Used to understand patterns and trends
• Well designed graphs can be a powerful means of communicating
a great deal of information
Limitations of Diagrammatic
Representation
 Used only for purposes of comparison.
 Is not an alternative to tabulation
 Give only an approximate idea
 Fail to bring to light small differences
Specific types of graphs include:
• Bar graph (categorical data)
• Pie chart (categorical data)
• Histogram
• Frequency polygon
• Stem-and-leaf plot
• Box plot
• Scatter plot
• Line graph
Bar charts (or graphs)
 Suitable when there are several groups
 Categories are listed on the horizontal axis (X-axis),
arranged: alphabetically, size of their proportions, or
on some other rational basis
 Frequencies or relative frequencies are represented on
the Y-axis (ordinate)
 The height of the bar represents the frequency or
relative frequency of occurrence of cases that belong to
particular category
 The width of the bar has no meaning
Simple bar chart
 The bar represents the whole of the magnitude.

 The height of each bar indicates the size (frequency)
of the figure represented.
[Figure: simple bar chart. Distribution of patients in hospital X by source of referral, 1999. X-axis: source of referral (other hospital, GP, OPD, casualty, other); Y-axis: number of patients, 0 to 800.]
Component (or Sub-divided) Bar chart:
 Bars are sub-divided into component parts of the figure.
 These sorts of diagrams are constructed when each total is
built up from two or more component figures.
• Actual Component Bar Diagrams: the overall height of the
bars & the individual component lengths represent actual
figures
• Percentage Component Bar Diagram: the individual
component lengths represent the percentage each component
forms of the overall total. Note that a series of such bars will all
be the same total height, i.e., 100 percent.
Example: Plasmodium species distribution for
confirmed malaria cases, Batu, 2003
[Figure: percentage component bar chart. X-axis: month of 2003 (August, October, December); Y-axis: percent, 0 to 100; components: P. falciparum, P. vivax, mixed.]
Multiple Bar Charts
 Component figures are shown as separate bars
adjoining each other.
 The height of each bar represents the actual value of

the component figure
 It depicts distributional pattern of more than one

variable
Prevalence of self-reported breathlessness among school
children, 1998
[Figure: multiple bar chart. X-axis: parents smoking (neither, one, both); Y-axis: breathlessness, per cent, 0 to 35; bars: child never smoked, child smoked occasionally, child smoked one/week or more.]
Method of constructing bar chart
 The bars should be of equal width and should
be separated from one another so as not to
imply continuity.
 All the bars should rest on the same line called

the base
 Label both axes clearly
Pie chart
 Can be used for categorical and quantitative discrete
data
 Shows the relative frequency for each category by

dividing a circle into sectors, the angles of which are
proportional to the relative frequency
 The size of each wedge is determined by its angular

measurement
Steps to construct a pie-chart
• Construct a frequency table
• Change the frequency into percentage (P)
• Change the percentages into degrees, where: degrees =
(percentage / 100) × 360°
• Draw a circle and divide it accordingly
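The steps above can be sketched in Python using the male/female counts from the earlier frequency-table example (400 and 200 out of 600). The variable names are illustrative assumptions; the drawing step itself is left out.

```python
# Step 1: frequency table (counts from the sex example earlier in the handout).
counts = {"Male": 400, "Female": 200}
total = sum(counts.values())

# Step 2: frequencies to percentages.
percent = {k: 100 * v / total for k, v in counts.items()}

# Step 3: percentages to degrees (a full circle is 360 degrees).
degrees = {k: p / 100 * 360 for k, p in percent.items()}

for k in counts:
    print(f"{k}: {percent[k]:.1f}% -> {degrees[k]:.0f} degrees")
```

The sector angles should always add up to 360 degrees, which is a quick check on the arithmetic.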
Distribution of cause of death for females, in England and Wales, 1989
[Figure: pie chart with the following sectors]
Circulatory system     42%
Neoplasms              30%
Respiratory system     13%
Others                  8%
Digestive system        4%
Injury and poisoning    3%
Histograms
 Quantitative data
 The fraction of data in each class interval is represented
by a rectangle
 A bar graph with the class intervals (Base) listed on the

x-axis and the frequency (relative frequency) of
occurrence of the values in the interval on the y-axis
 Area is the fraction of data (relative frequency of data)

that fall in the class interval/proportional in the same
way to their interval frequencies.
Example: Distribution of weights of 57 children
Frequency polygon
 To draw a frequency polygon we connect the mid-
point of the tops of the cells of the histogram by a
straight line
 The total area under the frequency polygon is equal to

the area under the histogram
 Useful when comparing two or more frequency

distributions by drawing them on the same diagram
Steps to construct frequency polygon
• Calculate mid points for each class interval (halfway

between the endpoints)
• Join the mid points
• To close the polygon, the midpoints of two additional

intervals are needed: one to the left of the first interval
and one to the right of the last interval observed, both
of these with zero observed frequencies.
Example: Distribution of weights of 57 children
Numerical summary statistics
 To summarize data by means of just a few numerical
measures, particularly before inferences or
generalizations are drawn from the data.
 Measures for describing the location (or typical

value) of a set of measurements and
 Their variation or dispersion are used for these

purposes.
Measures of Central Tendency (MCT)
 On the scale of values of a variable there is a certain

stage at which the largest number of items tend to cluster.
 Since this stage is usually in the centre of distribution,

the tendency of the statistical data to get concentrated at
a certain value is called “central tendency”
 The various methods of determining the point about

which the observations tend to concentrate are called
MCT.
• The objective of calculating MCT is to determine a
single figure which may be used to represent the
whole data set.
• In that sense it is an even more compact description

of the statistical data than the frequency distribution.
• Since a MCT represents the entire data, it facilitates

comparison within one group or between groups of
data.
Characteristics of a good MCT
A MCT is good or satisfactory if it possesses the following
characteristics.
 It should be based on all the observations
 It should not be affected by the extreme values
 It should be as close to the maximum number of
values as possible
 It should have a definite value
 It should not be subjected to complicated and tedious
calculations
 It should be capable of further algebraic treatment
 It should be stable with regard to sampling
 The most common measures of central
tendency include:
– Arithmetic Mean
– Median
– Mode
Arithmetic Mean
• Center (average) of data set
• Most widely used measure of central location
• The mean can be used for discrete and continuous
data
• It is seriously affected by unusual/extreme values
called “outliers”.
Ungrouped Data
• The mean is the sum of the individual values in a data set
divided by the number of values in the data set:
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
• The sample mean x̄ is the sample analog to the mean
of a finite population (µ).
Example
 The heart rates for n=10 patients were as follows
(beats per minute):
167, 120, 150, 125, 150, 140, 40, 136, 120, 150
What is the arithmetic mean for the heart rate of these
patients?
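A short Python sketch of this calculation follows. It also recomputes the mean without the unusually low reading of 40 to illustrate how sensitive the mean is to a single extreme value; that second step is an added illustration, not part of the handout's exercise.

```python
# Heart rates (beats per minute) of the 10 patients in the example.
heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]

# Arithmetic mean: sum of the values divided by the number of values.
mean_all = sum(heart_rates) / len(heart_rates)

# Same calculation after dropping the extreme low value (40).
without_low = [x for x in heart_rates if x != 40]
mean_without_low = sum(without_low) / len(without_low)

print(f"Sum = {sum(heart_rates)}, n = {len(heart_rates)}, mean = {mean_all:.1f}")
print(f"Mean without the value 40 = {mean_without_low:.1f}")
```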
Grouped data
• Occasionally, data, especially secondhand data, are
presented in the grouped form of a frequency table.
• The mean of grouped data is $\bar{x} = \dfrac{\sum f m}{\sum f}$,
where f denotes the frequency (i.e., the number of
observations in an interval),
• m the interval midpoint, & the summation is across the
intervals.
• m is obtained by calculating the average of the interval
lower true boundary and the upper true boundary.
• We assume that all values falling into a particular class
interval are located at the mid-point of the interval
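As an illustration, the grouped mean can be computed for the weight data of the 57 children using the mid-points and frequencies from the earlier frequency table. This is a sketch under the stated assumption that every value sits at its class mid-point, so the result only approximates the mean of the raw data.

```python
# Class mid-points (m) and frequencies (f) from the weight table.
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5]
freq = [5, 19, 10, 13, 4, 4, 2]

# Grouped mean = sum of f * m divided by sum of f.
sum_fm = sum(f * m for f, m in zip(freq, midpoints))
n = sum(freq)
grouped_mean = sum_fm / n

print(f"sum(f*m) = {sum_fm}, n = {n}, grouped mean = {grouped_mean:.1f} lb")
```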
Example
 When the data are skewed, the mean is
“dragged” in the direction of the skewness &
in this case, the mean is a poor measure of
central location or does not reflect the center
of the sample.
Properties of the Arithmetic Mean
• For a given set of data there is one and only one
arithmetic mean (uniqueness).
• Easy to calculate and understand (simple).
• Influenced by each and every value in a data set
• Greatly affected by the extreme values.
• In case of grouped data, if any class interval is open-ended, the
arithmetic mean cannot be calculated.
Median
• The middle observation, which divides the set into equal
halves.
• One half of the sample has values lying below the median
and one half of the sample has values lying above the median
• Appropriate for discrete and continuous data, and can
also be used for ordinal data
• Median is another name for the 50th percentile. It is appropriate
for describing measurement data and is “robust to outliers,”
that is, not affected much by unusual values.
 If the number of observations n is odd, there will be a
unique median
 If n is even, there is strictly no middle observation,
but the median is defined by convention as the
average of the two middle observations
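The odd/even convention can be written as a small Python function. The sketch below is illustrative; it is applied to two data sets that appear elsewhere in this handout (the 10 heart rates, and the five Method 1 cholesterol readings).

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # unique middle observation
    return (ordered[mid - 1] + ordered[mid]) / 2 # average of the two middle ones

heart_rates = [167, 120, 150, 125, 150, 140, 40, 136, 120, 150]  # n = 10 (even)
cholesterol = [177, 193, 195, 209, 226]                          # n = 5 (odd)

print(median(heart_rates))   # average of the 5th and 6th ordered values
print(median(cholesterol))   # the 3rd ordered value
```

Note that the extreme heart rate of 40 barely affects the median, unlike the mean.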
Properties of the median
 There is only one median for a given set of data
(uniqueness)
 The median is easy to calculate
 Median is a positional average and hence it is
insensitive to very large or very small values
 It is determined mainly by the middle points and less
sensitive to the remaining data points (weakness).
Mode
 The mode is the most frequently occurring value
among all the observations in a set of data.
 It is not influenced by extreme values.

 It is possible to have more than one mode or no mode.
 It is not a good summary of the majority of the data.

 Can be used for all types of data, but may be especially
useful for nominal and ordinal measurements
 Example
• Data are: 1, 2, 3, 4, 4, 4, 4, 5, 5, 6
• Mode is 4 “Unimodal”
 Example
• Data are: 1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8
• There are two modes – 2 & 5
• This distribution is said to be “bi-modal”
 Example
• Data are: 2.62, 2.75, 2.76, 2.86, 3.05, 3.12
• No mode, since all the values are different
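These three cases can be checked with a small helper. This is a sketch that follows the handout's convention of reporting "no mode" when every value occurs only once; the function name `modes` is an illustrative assumption.

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    if not values:
        return []
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # all values are different: no mode (handout's convention)
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 3, 4, 4, 4, 4, 5, 5, 6]))              # unimodal
print(modes([1, 2, 2, 2, 3, 4, 5, 5, 5, 6, 6, 8]))        # bimodal
print(modes([2.62, 2.75, 2.76, 2.86, 3.05, 3.12]))        # no mode
```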
Properties of mode
 It is not affected by extreme values
 Often its value is not unique
 The main drawback of mode is that often it does not
exist
Skewness
• If extremely low or extremely high observations are
present in a distribution, then the mean tends to shift
towards those scores.
Skewed to the right (positively skewed):
• The upper, or right, tail of the distribution is longer
(“fatter”) than the lower, or left, tail
• mode < median < mean
Skewed to the left (negatively skewed):
• The lower, or left, tail of the distribution is longer than the
upper, or right, tail
• mean < median < mode
Symmetrical distribution
 A curve is symmetrical if one half of the curve

is the mirror image of the other half.
 Mean, median, and mode should all be

approximately the same
Symmetric (B) and skewed distributions: right skewed (A) and left
skewed (C).
(Source: Centers for Disease Control and Prevention (1992).
Principles of Epidemiology, 2nd Edition, Figure 3.5, p. 151.)
Measures of Dispersion
 Measurements tend to be different from one another.
 For example, height measurements taken from a

sample of persons obviously will not be all identical
 To describe these differences in height or other

biological characteristics, statisticians use the term
variability.
 We need to know something about the
variability or spread of the values — whether
they tend to be clustered close together, or
spread out over a broad range
 Measures that quantify the variation or

dispersion of a set of data from its central
location -measures of dispersion
• Two samples of cholesterol measurements on a
given person with different techniques
– Method 1: 177, 193, 195, 209, 226
– Method 2: 192, 197, 200, 202, 209
 Calculate mean for both data sets?
 What can you say about the variability in the

data sets?
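One way to explore these two questions is sketched below with Python's standard `statistics` module. Both methods give the same mean, so the difference between them only shows up in the measures of spread (range and sample standard deviation).

```python
import statistics

method1 = [177, 193, 195, 209, 226]
method2 = [192, 197, 200, 202, 209]

summary = {}
for name, data in (("Method 1", method1), ("Method 2", method2)):
    summary[name] = {
        "mean": statistics.mean(data),
        "range": max(data) - min(data),
        "sd": statistics.stdev(data),   # sample SD (n - 1 in the denominator)
    }

for name, s in summary.items():
    print(f"{name}: mean = {s['mean']}, range = {s['range']}, SD = {s['sd']:.2f}")
```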
Measures of dispersion include:
 Range
 Quartiles and percentiles
 Inter-quartile range
 Variance and Standard deviation
 Coefficient of variation
Range
• The difference between the highest and lowest
observations in a data set.
Example
– Data values: 45, 70, 95, 100, 125
– Range = 125 - 45 = 80
• A data set with a higher range exhibits more
variability
Properties of range
• The value of the range is determined by only two of
the original observations.
• Very sensitive to extreme observations
• The larger the sample size, the larger the range tends to be
Percentiles
• Numerical values that divide an ordered data set into
100 equal parts.
• Position of the pth percentile = p(n + 1), where p is the required
percentile expressed as a proportion (e.g., 0.25 for the 25th)
• The pth percentile is a value Vp such that p% or
less of the observations are < Vp and (100-p)% or less
of the observations are > Vp.
• E.g., the 25th percentile is the value of a variable such that
25% of the observations are less than that value and
75% of the observations are greater.
The pth percentile is:
• The p(n+1)th observation, if p(n+1)
is an integer
• The average of the kth and (k+1)th observations if p(n+1)
is not an integer, where k is the largest integer less
than p(n+1).
• E.g., if p(n+1) = 3.6, the average of the 3rd and 4th observations
• Given a sample of size n = 60, find the 30th percentile
of the data set.
p(n+1) = 0.30(60+1) = 18.3
= Average of the 18th and 19th observations
– 30% of the observations are below this value and
70% of them are above it
Quartile
• Divide the ranked data into four equal parts
• There are three quartiles
[Figure: ordered data split as 25% | Q1 | 25% | Q2 | 25% | Q3 | 25%]
a) The first quartile (Q1): 25% of all the
ranked observations are less than Q1.
b) The second quartile (Q2): 50% of all the
ranked observations are less than Q2. The
second quartile is the median.
c) The third quartile (Q3): 75% of all the
ranked observations are less than Q3.
 1st quartile = 0.25 (n+1)th
 2nd quartile = 0.5 (n+1)th
 3rd quartile = 0.75 (n+1)th
 20th percentile = 0.2 (n+1)th
 15th percentile = 0.15 (n+1)th
Example:
• Given the data set: 8, 9, 9, 10, 13, 15, 16, 19, 20.
Find the first quartile
n = 9
Position of Q1 = 0.25 (9+1) = 2.5
Take the average of the second and the third values
Q1 = (9 + 9)/2 = 9
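The p(n + 1) rule used in this example can be sketched as a Python function. The function name and the error handling for positions that fall outside the data are assumptions added for illustration. Statistical software often uses different interpolation rules, so results from other tools may differ slightly from this one.

```python
import math

def percentile(values, p):
    """Percentile for proportion p (e.g. 0.25) using the handout's p(n + 1) rule."""
    ordered = sorted(values)
    n = len(ordered)
    pos = p * (n + 1)                  # 1-based position in the ordered data
    if pos < 1 or pos > n:
        raise ValueError("p(n + 1) falls outside the data; rule not applicable")
    nearest = round(pos)
    if math.isclose(pos, nearest):     # position is a whole number
        return ordered[nearest - 1]
    k = math.floor(pos)                # largest integer below the position
    return (ordered[k - 1] + ordered[k]) / 2

data = [8, 9, 9, 10, 13, 15, 16, 19, 20]
q1 = percentile(data, 0.25)   # position 2.5: average of 2nd and 3rd values
q2 = percentile(data, 0.50)   # position 5: the 5th value (the median)
q3 = percentile(data, 0.75)   # position 7.5: average of 7th and 8th values
iqr = q3 - q1
print(q1, q2, q3, iqr)
```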
Interquartile range (IQR)
 Indicates the spread of the middle 50% of the
observations, and used with median
IQR = Q3 - Q1
 A large IQR indicates a large amount of variability

among the middle 50% of the observations and a
small IQR indicates a small amount of variability
Interquartile range (IQR)
• Suppose the first and third quartiles for weights of
girls 12 months of age are 8.8 kg and 10.2 kg,
respectively.
IQR = 10.2 kg – 8.8 kg = 1.4 kg
• i.e., the middle 50% of the infant girls weigh between 8.8 and
10.2 kg.
Properties of IQR:
 It is not based on all observations but only on two

specific values
 It is important in selecting cut-off points in the

formulation of clinical standards
 Since it excludes the lowest and highest 25% values,

it is not affected by extreme values
 Less sensitive to the size of the sample

Variance
 Used to measure the dispersion of values

relative to the mean.
 When values are close to their mean (narrow

range) the dispersion is less than when there is
scattering over a wide range.
– Population variance = σ²
– Sample variance = S²
• Population variance:
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
• Sample variance:
$S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \dfrac{\sum x_i^2 - (\sum x_i)^2 / n}{n - 1}$
Where
• n is the sample size and $\bar{X}$ is the sample mean
• N = the total number of elements in the
population and $\mu$ = the population mean
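The two forms of the sample variance (the definitional form and the computational shortcut) give the same result, as the sketch below illustrates with the Method 1 cholesterol readings from earlier in the handout. For contrast, the last line divides by n instead of n − 1, which is the population formula applied to the same five numbers.

```python
x = [177, 193, 195, 209, 226]   # Method 1 cholesterol measurements
n = len(x)
mean = sum(x) / n

# Definitional form: squared deviations from the mean, divided by n - 1.
s2_definition = sum((xi - mean) ** 2 for xi in x) / (n - 1)

# Computational form: uses only the sum of x and the sum of x squared.
s2_shortcut = (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1)

# Population formula (divide by n) applied to the same values, for comparison.
var_divide_by_n = sum((xi - mean) ** 2 for xi in x) / n

print(s2_definition, s2_shortcut, var_divide_by_n)
```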
Standard deviation (σ, s)
• It is the square root of the variance.
• Describes the variability among individual values in a
given data set
• This produces a measure having the same scale as
that of the individual values.
$\sigma = \sqrt{\sigma^2}$ and $S = \sqrt{S^2}$
• The SD has the advantage of being expressed
in the same units of measurement as the mean
• The sample standard deviation is the square root of the
average of the squared deviations from the
sample mean, where the average is taken with n − 1 in the denominator.
• Example: Blood Cholesterol Measurements for a
Sample of 10 Persons
Coefficient of variation (CV)
• When two data sets have different units of
measurement, or their means differ sufficiently in
size, the CV should be used as a measure of
dispersion.
• It is the preferred measure for comparing the variability of
two sets of observations.
• Data with a smaller coefficient of variation are considered
more consistent.
• Is independent of the unit of measurement
• CV is the ratio of the SD to the mean
multiplied by 100.
$CV = \dfrac{S}{\bar{x}} \times 100$
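A minimal sketch of the CV calculation, applied to the two cholesterol measurement methods from the dispersion example earlier in the handout. The helper name `coefficient_of_variation` is an illustrative assumption.

```python
import statistics

def coefficient_of_variation(data):
    """CV (%) = sample standard deviation / mean * 100."""
    return statistics.stdev(data) / statistics.mean(data) * 100

method1 = [177, 193, 195, 209, 226]
method2 = [192, 197, 200, 202, 209]

cv1 = coefficient_of_variation(method1)
cv2 = coefficient_of_variation(method2)
print(f"CV method 1 = {cv1:.1f}%, CV method 2 = {cv2:.1f}%")
```

The method with the smaller CV gives the more consistent measurements relative to its mean.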
Example: suppose two samples of human males
yield the following results:
CV for sample 1 is 6.9

CV for sample 2 is 12.5
The weight of sample 2 is more variable
Methods of Data Collection
• Before any statistical work can be done, data must be
collected
• Data are the key component of research
• The information needed to address the primary study
hypothesis (objective) determines what data need to
be collected and how they should be collected (methods).
Data collection techniques
 Observation
 Interviewing
 Focus group discussion (FGD)
 Use of available data
Observation
 Involves systematically selecting, watching and
recording behavior and characteristics of living
things, objects or phenomena
• Participant observation – the observer takes part in the situation he or she observes
– E.g., a doctor hospitalized with a broken hip, who now observes hospital procedures ‘from within’
• Non-participant observation – the observer watches the situation, openly or concealed, but does not participate
10/02/2022 120
• Observation can give additional, more accurate information on the behavior of people than interviews or questionnaires
• Observations can also be made on objects
– E.g., the presence or absence of latrines and their state of cleanliness
• Observations are time consuming, so they are most often used in small-scale studies
• The investigator's or observer's own biases may affect what is recorded
10/02/2022 121
Interviewing
• It involves oral questioning of respondents, either individually or as a group
• Can be conducted as a face-to-face interview, a telephone interview or a self-administered questionnaire.
• A well-designed questionnaire is required to collect reliable and valid information.
• Self-administered questionnaires are simpler and cheaper than face-to-face interviews, since questionnaires can be administered to many persons simultaneously
10/02/2022 122
• A written questionnaire can be administered in different ways, such as by:
– Sending questionnaires by mail
– Gathering all or part of the respondents in one place at one time, giving oral or written instructions, and letting them fill out the questionnaires
– Hand-delivering questionnaires to respondents and collecting them later
10/02/2022 123
Focus group discussions (FGDs)
 FGDs allow a group of 8-12 informants to
freely discuss a certain subject with the
guidance of a facilitator.
10/02/2022 124
Questionnaire Designing
• A questionnaire is an instrument used to obtain information about respondents in a sample survey.
• A poorly designed questionnaire reduces the reliability of the responses in the investigation.
• Questions may be “open ended”, where the subject answers in his or her own words, or “closed”, where the subject chooses from a number of fixed alternative responses.
10/02/2022 125
Examples
 Have you ever heard of sexually transmitted

infections?
 Have you ever experienced STIs?
10/02/2022 126
In questionnaire design remember to:
 Use familiar and appropriate language
• Avoid abbreviations, double negatives, etc.
• Avoid asking about two elements in one question
 Avoid embarrassing and painful questions
 Avoid language that suggests a response
 Start with simpler questions
 For open ended questions, provide sufficient space
for the response
 Arrange questions in logical sequence
10/02/2022 127
Chapter II Probability and
Probability Distribution
10/02/2022 128
Objectives
• At the end of the sequence of lectures, you will be able
to:
– Describe the union, intersection, complement and mutually exclusive events, using a Venn diagram.
– Define relative frequency probability for practical use.
– Describe and implement the formula for finding the
probability of one outcome given another.
– Describe independence of factors, why it is important in a
study, and how it is used.
– Describe dependence of factors, why it is important in a
study, and its impact.
10/02/2022 129
• “Doubt is not a pleasant condition, but certainty is absurd.” Voltaire (1694-1778)
• Probability is a way of quantifying uncertainty.
• We can use probability to measure risk so as to help us in decision making, and we also use it to assess our inferences.
10/02/2022 130
What is Probability?
 Determines the likelihood of occurrence of events
that are subject to chance.
• Assumes a “stochastic” or “random” process, i.e., the outcome is not predetermined: there is an element of chance
E.g.,
 Probability that the head comes up on a coin toss
 Probability that a sick patient who receives a new
medical treatment will survive for five or more
years
10/02/2022 131
Some faces showed
up more frequently
than others
10/02/2022 132
Why probability
• Many events in life are uncertain.
• Probability is a formal way to measure the chance of these uncertain events.
• Experiment: any process with an uncertain outcome
• Sample space: the set of all possible outcomes of an experiment
• Event: any subset of the sample space, which may or may not happen when the experiment is performed
10/02/2022 133
Event
• An event is simply a set of descriptions; it is a
proposition
• An event can occur or not occur
• Represented by uppercase letters such as A, B, C……
• We focus on whether the event occurred or did not occur or, looking to the future, whether it will or will not occur.
10/02/2022 134
Combination of events
• We have introduced events, so now let us start
expanding on the grammar of events by combining
events to create new events
1. Intersection
• Given two events, say A and B, create a new event
that occurs if both A and B occur.
• The new event is denoted A∩B (read “A intersection B”).
10/02/2022 135
Intersection
• Let A represent the event that a randomly
selected newborn is LBW, and B the event that
he or she is from a multiple birth
• The intersection of A and B is the event that the infant is both LBW and from a multiple birth
10/02/2022 136
Intersection
10/02/2022 137
2. Union
• The union of A and B, denoted A∪B (A or B), is the event that occurs if either A or B, or both, occur. It is not the exclusive “or”: it is either A, or B, or both.
• In the example above, the union of A and B is the event that the newborn is either LBW or from a multiple birth, or both
10/02/2022 138
Union
10/02/2022 139
3. Complement
• The complement of an event A, denoted by Ā or Aᶜ, is the event that A does not occur
• Consists of all the outcomes in which event A does NOT occur
P(Ā) = P(not A) = 1 – P(A)
10/02/2022 140
• If the event A is “I live to be 25”, then the event Aᶜ is “I do not live to be 25”, that is, dead by 25.
10/02/2022 141
Null event
• It is amazing how much we can do just with
those three operations.
• For example, we can define the null event, usually denoted by Ø. It is the event that cannot happen. It is a contradiction.
10/02/2022 142
Null event
• The event A cannot happen at the same time as Aᶜ, so A∩Aᶜ = Ø. You can't have it both ways.
10/02/2022 143
Mutually Exclusive Events
• The idea of null event leads us to a very important
collection of events, and these are mutually exclusive
events.
• Mutually exclusive events are also called disjoint events.
• A pair of mutually exclusive events cannot happen
simultaneously. So, either one happens, the other, or
neither, but you cannot have both of them happening
together.
• In particular, we note that A and Aᶜ are mutually exclusive events.
10/02/2022 144
Mutually Exclusive Events
Example
– A = “live to be 25”
– B = “die before 10th birthday”
• Then you cannot have both, and A∩B = Ø. So this can be thought of as an extension of the idea of a complement.
10/02/2022 145
What is probability of an event?
• Probability of an event is the relative frequency of
the set of outcomes over an indefinitely large
(infinite) number of trials.
• In real life, experiments cannot be performed an infinite number of times
• Instead, probabilities of events are estimated from the empirical probabilities obtained from large samples.
10/02/2022 146
What is probability of an event?
 Theoretical probability models are constructed from
which probabilities of many different events can be
computed.
• Probability of an event E, P(E): a number between 0 and 1 representing the proportion of times that event E is expected to happen when the experiment is done over and over again under the same conditions
10/02/2022 147
Classical Probability
• If an experiment is repeated n times under essentially identical conditions and the event A occurs m times, then as n gets large the ratio m/n approaches the probability of A:
$P(A) = \frac{m}{n}$
• Probability is symmetric around a half. At the edges
we are certain—at zero we are certain it will not
happen, at one we are certain it will happen. We have
maximal uncertainty at the center, when p=1/2.
10/02/2022 148
Classical probability cont…
• If you toss a coin 100 times and heads comes up 40 times, P(H) = 40/100 = 0.4, or 40%
• Suppose that out of N = 100,000 persons of a certain target population, a total of 5,500 are positive reactors to a certain screening test; then what is the probability of being positive?
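The two relative-frequency calculations above, and the idea that m/n settles down as n grows, can be sketched as follows. The function names, the fair-coin assumption and the fixed seed are illustrative choices, not part of the handout.

```python
import random

def relative_frequency(m, n):
    """Estimate P(A) as the number of occurrences m over the number of trials n."""
    return m / n

def simulate_heads(n_tosses, seed=0):
    """Toss a simulated fair coin n_tosses times and return the proportion of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return relative_frequency(heads, n_tosses)

print(relative_frequency(40, 100))        # coin example
print(relative_frequency(5500, 100000))   # screening test example
for n in (100, 10_000, 100_000):
    print(n, simulate_heads(n))           # proportion of heads drifts toward 0.5
```

For the screening example this gives 5,500/100,000 = 0.055, i.e. about 5.5% of the target population are positive reactors.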
10/02/2022 149
Classical probability cont…
For any event A
Complement
10/02/2022 150
Additive Rule
• If A and B are mutually exclusive events, they cannot happen at the same time (A∩B = Ø)
• So, the probability of either of the two events occurring is obtained by adding the probabilities of each event (Additive rule)
P(A∪B) = P(A or B) = P(A) + P(B)
10/02/2022 151
Additive rule cont…
• When A and B are not mutually exclusive, Pr(A or B) = Pr(A) + Pr(B) cannot be used.
• The reason is that in such a situation A and B overlap (A∩B) in a Venn diagram, and the elements in the overlap are counted twice.
• So the general formula, the Additive Law, is that the probability of A∪B is the probability of A plus the probability of B minus the probability of A∩B:
P(A∪B) = P(A) + P(B) − P(A∩B)
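Intersection, union, complement and the general Additive Law can be checked on a small sample space. The sketch below assumes a fair six-sided die with equally likely outcomes; the events A and B are hypothetical examples, not taken from the handout.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair die (equally likely outcomes)
sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # hypothetical event: the roll is even
B = {4, 5, 6}   # hypothetical event: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

intersection = A & B             # both A and B occur
union = A | B                    # A or B or both occur
complement_A = sample_space - A  # A does not occur

# General Additive Law: the overlap is subtracted so it is not counted twice
assert prob(union) == prob(A) + prob(B) - prob(intersection)
# Complement rule
assert prob(complement_A) == 1 - prob(A)
print(prob(union), prob(intersection), prob(complement_A))
```

Here A and B overlap in {4, 6}, so simply adding P(A) and P(B) would give 1 rather than the correct 2/3.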
10/02/2022 152
10/02/2022 153
Conditional Probability
• As time evolves, we gather information, information
that, if relevant, will possibly change our
probabilities.
• Acknowledging that our probability depends on the information at hand, and how that probability changes, leads us to the concept of conditional probability
10/02/2022 154
Conditional Probability
• The chance that a particular event happens may depend on the outcome of some other event
• The probability of B, given that A has happened, is the probability that both happen divided by the probability of A:
P(B|A) = P(A∩B) / P(A), provided that P(A) ≠ 0.
10/02/2022 155
Conditional Probability cont…
• E.g., suppose in country X the chance that a person lives to age 25 is 0.95, whereas the chance that he or she lives to age 65 is 0.65.
• Suppose that the event B is that a person will live to be 65, and the event A is that a person is alive at age 25. Then the event B given A is that a 25-year-old person will be alive at 65.
• What is the chance that a person 25 years of age survives to age 65?
10/02/2022 156
Solution
• B = “A person will be alive at 65”
• A = “A person will be alive at age 25”
• B | A = “A 25-year-old person will be alive at 65”
• A∩B = “A person will reach 25 and 65” = “A person reaches 65”
• Then, Pr(B|A) = Pr(A∩B) / Pr(A) = 0.65/0.95 = 0.684. That is, a person aged 25 has a 68.4 percent chance of living to age 65.
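The same calculation as a short Python sketch; the function and variable names are illustrative.

```python
def conditional_probability(p_a_and_b, p_a):
    """Return P(B|A) = P(A and B) / P(A); undefined when P(A) is 0."""
    if p_a == 0:
        raise ValueError("P(A) must be non-zero")
    return p_a_and_b / p_a

p_alive_25 = 0.95   # P(A)
p_alive_65 = 0.65   # P(A and B): reaching 65 implies having reached 25

print(round(conditional_probability(p_alive_65, p_alive_25), 3))
```

The key step is recognising that A∩B is the same event as B here, because nobody reaches 65 without first reaching 25.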
10/02/2022 157
Multiplicative law and Independence
• Multiplicative law: for any two events A and B, P(A∩B) = P(A) × P(B|A) = P(B) × P(A|B)
10/02/2022 158
Multiplicative law and Independence
• Two events A and B are independent if the occurrence or nonoccurrence of one does not in any way affect the occurrence or nonoccurrence of the other.
• Knowing that A happens does not influence our probability of B happening:
– P(A∩B) = P(A) × P(B) (independent events)
– P(B|A) = P(B), P(A|B) = P(A)
• P(A∩B) ≠ P(A) × P(B) (dependent events)
10/02/2022 159
Note
• When the two events are mutually exclusive, the additive law says that the probability of the union is the sum of the probabilities.
• When the events are independent, the multiplicative law says that the probability of the intersection is the product of the probabilities.
• Clearly, if two events with nonzero probabilities are mutually exclusive, they cannot be independent.
10/02/2022 160
Exercise
• Suppose that we are conducting a hypertension screening program in the home. Suppose that the hypertensive status of the mother doesn't depend at all on the hypertensive status of the father. Let event A: mother’s DBP > 95 and event B: father’s DBP > 95, with Pr(A) = 0.1 and Pr(B) = 0.2.
– Are the two events mutually exclusive?
– What is the probability that both the mother and the father are hypertensive?
– What is the probability that either the mother or the father, or both, are hypertensive?
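One way to check your answers to this exercise is the short sketch below. It relies only on the stated independence of A and B; the variable names are illustrative.

```python
p_mother = 0.1   # Pr(A): mother's DBP > 95
p_father = 0.2   # Pr(B): father's DBP > 95

# Multiplicative law for independent events
p_both = p_mother * p_father

# General additive law (the overlap is subtracted once)
p_either = p_mother + p_father - p_both

# Mutually exclusive events would have an intersection with probability 0
mutually_exclusive = (p_both == 0)

print(mutually_exclusive, round(p_both, 2), round(p_either, 2))
```

Since Pr(A∩B) is greater than 0, the events are not mutually exclusive, which is why the simple sum 0.1 + 0.2 would overstate the probability of the union.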
10/02/2022 161
Properties of Probability
1. The numerical value of a probability always lies between 0 and 1, inclusive.
0 ≤ P(E) ≤ 1
– A value of 0 means the event cannot occur
– A value of 1 means the event definitely will occur
– A value of 0.5 means that the probability that the event will occur is the same as the probability that it will not occur.
10/02/2022 162
Properties of Probability
2. The sum of the probabilities of all mutually exclusive and exhaustive outcomes is equal to 1.
P(E1) + P(E2) + .... + P(En) = 1.
3. For two mutually exclusive events A and B,
P(A or B) = P(A∪B) = P(A) + P(B).
If not mutually exclusive:
P(A or B) = P(A) + P(B) - P(A and B)
10/02/2022 163
Clarification aid:
• If A and B are mutually exclusive then (Additive Law)
P(A∪B) = P(A) + P(B)
• If A and B are independent then (Multiplicative Law)
P(A∩B) = P(A) × P(B)
• So union is additive. You add things up and make them bigger when you unite them.
• Whereas, when you take an intersection it is like multiplying things and making them smaller.
10/02/2022 164
Probability Distributions
10/02/2022 165
Probability models
• Now we are going to start applying what we learned about probability, beginning with numbers and models
• Mathematical models are an idealization in which things are exactly right or wrong; we now look at how to use such models to approximate, or model, reality.
10/02/2022 166
Probability Distributions
 Describe the probability of events
 A device used to describe the behaviour that a
random variable may have by applying the theory of
probability.
 Parameters are characteristics of probability
distributions.
• The statistics that we use to estimate parameters are also random variables.
 We are interested in the distributions of these
statistics and will use them to make inferences about
population parameters.
10/02/2022 167
Random Variable
 Any quantity or characteristic that is able to
assume a number of different values such that
any particular outcome is determined by
chance
 Why random?
10/02/2022 168
Random Variable
• A discrete random variable is able to assume only a finite or countable number of outcomes. E.g., marital status: an individual can be single, married, divorced, or widowed
• A continuous random variable can take on any value in a specified interval (such as weight or height)
10/02/2022 169
Random Variable
Dichotomous or binary variable
• The simplest random variable
• Dichotomous (Bernoulli): X = 0 or 1
P(X=1) = p
P(X=0) = 1-p
e.g. Heads, Tails
True, False
Success, Failure
Vaccinated, Not vaccinated
10/02/2022 170
Dichotomous or binary variable
e.g. Suppose that 80% of the villagers are vaccinated. What is the probability that a villager chosen at random is vaccinated?
1=success (vaccinated person)
0=failure (unvaccinated person)
1 Trial
• P(0) = 1-p = 0.2
• P(1) = p = 0.8
10/02/2022 171
Discrete Probability Distributions
• For a discrete random variable, the probability distribution specifies each of the possible outcomes of the random variable along with the probability that each will occur (probability mass function)
• Random variables are denoted by capital letters and their values by lower-case letters:
P(X = x)
10/02/2022 172
 The following data shows the number of diagnostic
services a patient receives
10/02/2022 173
a. What is the probability that a patient receives
exactly 3 diagnostic services?
b. What is the probability that a patient receives at most

one diagnostic service?
c. What is the probability that a patient receives at least

four diagnostic services?
10/02/2022 174
Answers
a. P(X=3) = 0.031
b. P (X≤1) = P(X = 0) + P(X = 1)
= 0.671 + 0.229
= 0.900
c. P(X ≥ 4) = P(X = 4) + P(X = 5)
= 0.010 + 0.006
= 0.016
10/02/2022 175
The Expected Value of a Discrete RV
• Let X be a discrete random variable which takes the values x1, . . . , xn.
• The expected value or mean of X is the number E(X) or µ:
$E(X) = \sum_{i=1}^{n} x_i \, P(X = x_i)$
10/02/2022 176
The Expected Value of a Discrete RV cont…
• Multiply each possible outcome by its associated probability and sum over all outcomes that have a probability greater than 0
• For the diagnostic service data:
Mean (X) = 0(0.671) + 1(0.229) + 2(0.053) + 3(0.031) + 4(0.010) + 5(0.006)
= 0.498 ≈ 0.5
• We would expect an average of 0.5 services for each visit
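The expected value and the earlier cumulative probabilities can be reproduced from the probability mass function. In this sketch the dictionary holds the probabilities used in the calculation above; the names are illustrative.

```python
# Number of diagnostic services -> probability, as used in the worked example
services_pmf = {0: 0.671, 1: 0.229, 2: 0.053, 3: 0.031, 4: 0.010, 5: 0.006}

def expected_value(pmf):
    """E(X): each outcome weighted by its probability, then summed."""
    return sum(x * p for x, p in pmf.items())

def prob_where(pmf, condition):
    """Sum the probabilities of all outcomes that satisfy the condition."""
    return sum(p for x, p in pmf.items() if condition(x))

print(round(expected_value(services_pmf), 3))                 # mean number of services
print(round(prob_where(services_pmf, lambda x: x <= 1), 3))   # at most one
print(round(prob_where(services_pmf, lambda x: x >= 4), 3))   # at least four
```

A quick sanity check on any probability mass function is that its probabilities sum to 1, which holds for these values.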
10/02/2022 177
Binomial Distribution
• Considers a dichotomous (binary) random variable
• Is based on the Bernoulli trial
– A single trial of an experiment that can result in only one of two mutually exclusive outcomes (success or failure; dead or alive; sick or well; male or female)
• The binomial distribution applies in situations where there are only two possible outcomes
10/02/2022 178
Binomial Distribution cont…
 Now a binomial random variable counts the number
of successes in n independent trials each associated
with a Bernoulli(p) random variable
Example:
 We are interested in determining whether a newborn
infant will survive until his/her 70th birthday
 Let Y represent the survival status of the child at age
70 years
 Y = 1 if the child survives and Y = 0 if he/she does
not
10/02/2022 179
• Are the outcomes mutually exclusive and
exhaustive?
• Suppose that 72% of infants born survive to age 70 years
P(Y = 1) = p = 0.72
P(Y = 0) = 1 − p = 0.28
10/02/2022 180
Binomial assumptions
• The experiment consists of n identical trials.
• There is a fixed number, n, of such trials.
• The probability of A (success), denoted by p, remains the same from trial to trial. The probability of B (failure) is denoted by q, where q = 1 - p.
• The trials are independent.
10/02/2022 181
• If an experiment is repeated n times and the outcome is independent from one trial to another, the probability that outcome A occurs exactly x times is:
$P(X = x) = \binom{n}{x} p^{x} q^{n-x}, \quad x = 0, 1, \ldots, n$
10/02/2022 182
Notations
• n the fixed number of trials
• x the number of successes in the n trials
• p the probability of success
• q the probability of failure (1 - p)
• nCx the binomial coefficient: the number of ways in which x successes can be arranged among the n trials
10/02/2022 183
Factorial
 For any positive integer n, we define n
factorial as: n(n-1)(n-2)...(1).
 We denote n factorial as n!.
 The number n! is the number of ways in which

n objects can be ordered.
 By definition 1! = 1 and 0! = 1.
10/02/2022 184
Combinations
• Combinations are the possible selections of r items from a group of n items regardless of the order of selection. The number of combinations is denoted $\binom{n}{r}$ and is read as n choose r.
• An alternative notation is nCr.
• We define the number of combinations of r out of n elements as
$\binom{n}{r} = {}_{n}C_{r} = \frac{n!}{r!\,(n-r)!}$
For example:
$\binom{6}{3} = {}_{6}C_{3} = \frac{6!}{3!\,(6-3)!} = \frac{6!}{3!\,3!} = \frac{6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(3 \cdot 2 \cdot 1)(3 \cdot 2 \cdot 1)} = \frac{6 \cdot 5 \cdot 4}{3 \cdot 2 \cdot 1} = \frac{120}{6} = 20$
10/02/2022 185
Mean and Variance of Binomial distribution
• The mean of the binomial distribution is np and the variance is np(1 - p)
• Example: Assume that, when a child is born, the probability it is a girl is ½ and that the sex of the child does not depend on the sex of an older sibling.
a) Find the probability distribution for the number of girls in a family with 4 children.
b) Find the mean and the standard deviation of this distribution.
10/02/2022 186
a) Probability distribution
X = r      0       1      2      3      4
P(X = r)   0.0625  0.25   0.375  0.25   0.0625
b) Mean = np = 4 × 1/2 = 2
Variance = np(1 - p) = 4 × 1/2 × 1/2 = 1, so SD = √1 = 1
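The distribution in part (a) and the summary values in part (b) can be reproduced with a small binomial function. This is a sketch using Python's standard `math.comb`; the function name `binomial_pmf` is illustrative.

```python
from math import comb, sqrt

def binomial_pmf(x, n, p):
    """P(X = x) for a binomial variable: nCx * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 4, 0.5   # 4 children, probability of a girl is 1/2
distribution = [binomial_pmf(x, n, p) for x in range(n + 1)]

mean = n * p
variance = n * p * (1 - p)
sd = sqrt(variance)

print(distribution)
print(mean, variance, sd)
```

The same function applies to any binomial question: "at most" and "at least" probabilities are obtained by summing `binomial_pmf` over the relevant values of x.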
10/02/2022 187
Continuous Probability Distributions
• A continuous random variable X can take on any
value in a specified interval or range
• With a large number of class intervals, the frequency polygon begins to resemble a smooth curve.
• The probability distribution of a continuous random variable X is represented by a smooth curve called a probability density function
10/02/2022 188
Continuous Probability Distributions cont…
• Instead of assigning probabilities to specific
outcomes of the random variable X, probabilities
are assigned to ranges of values
• The probability associated with any one particular
value is equal to 0
• Therefore, P(X=x) = 0
• Also, P(X ≥ x) = P(X > x)
• We calculate:
Pr [ a < X < b], the probability of an
interval of values of X.
10/02/2022 189
Normal Distribution
• The most important probability distribution in statistics
• Frequently called the “Gaussian distribution” or the bell-shaped curve (“Bell Curve”), because it looks like a bell.
• Variables such as blood pressure, weight, height, serum
cholesterol level, and IQ score — are approximately
normally distributed
10/02/2022 190
• The concept of “probability of X = x” in the discrete probability distribution is replaced by the “probability density function f(x)”
• A random variable X is said to follow the ND if and only if its probability density function is:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$
10/02/2022 191
10/02/2022 192
• The notation N(µ, σ²) denotes a normal distribution with mean µ and variance σ².
1. The mean µ tells you about location
– Increase µ: location shifts right
– Decrease µ: location shifts left
– Shape is unchanged
2. The variance σ² tells you about the narrowness or flatness of the bell
– Increase σ²: bell flattens. Extreme values are more likely
– Decrease σ²: bell narrows. Extreme values are less likely
– Location is unchanged
10/02/2022 193
Properties of the Normal Distribution
• A probability distribution of a continuous variable. It extends from minus infinity (-∞) to plus infinity (+∞).
• Symmetrical about its mean, µ.
• The mean, the median and the mode are equal. It is unimodal.
• The total area under the curve above the x-axis is 1 square unit.
• The curve never touches the x-axis.
• As the value of σ increases, the curve becomes flatter, and vice versa.
10/02/2022 194
Properties of the Normal Distribution
cont…
• The distribution is completely determined by the parameters µ and σ.
• The height of the frequency curve, which is called the probability density, cannot be taken as the probability of a particular value.
10/02/2022 195
Normal probabilities Empirical Rules
• The probability that a normal random variable will be
– within 1 standard deviation of its mean (on either side) is 0.6826, or approximately 0.68.
– within 2 standard deviations of its mean is 0.9544, or approximately 0.95.
– within 3 standard deviations of its mean is 0.9974.
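These three probabilities can be checked numerically. The sketch below uses the standard identity P(|Z| < k) = erf(k/√2) with Python's `math.erf`; the function name is illustrative. The values it prints (about 0.6827, 0.9545 and 0.9973) differ from the figures above only in the fourth decimal place, because the figures above come from doubling four-decimal table entries.

```python
from math import erf, sqrt

def prob_within(k):
    """P(mean - k*SD < X < mean + k*SD) for any normal random variable X."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(prob_within(k), 4))
```

The result does not depend on the particular mean or SD, which is what makes the rule usable for any normal distribution.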
10/02/2022 196
10/02/2022 197
The Standard Normal Distribution
• We have different normal distributions depending on the values of µ and σ².
• We cannot tabulate every possible distribution
• Tabulated normal probability calculations are available only for the ND with µ = 0 and σ² = 1.
10/02/2022 198
The Standard Normal Distribution
• The standard normal random variable, Z, is the normal random variable with mean µ = 0 and standard deviation σ = 1: Z ~ N(0, 1²).
[Figure: Standard Normal Distribution, density f(z) plotted against z, with µ = 0 and σ = 1]
10/02/2022 199
10/02/2022 200
Finding Probabilities of the SND: P(0 < Z <
1.56)
 Outcomes of the random variable Z are denoted by z;
 The whole number and tenths decimal place of z are
listed in the column to the left of the table, and
 The hundredths decimal place is shown in the row
across the top
• For a particular value of z, the entry in the body of
the table specifies the area beneath the curve to the
right of z, or P(Z> z)
10/02/2022 201
Some sample values of z and their corresponding areas are as follows
z Area in the right tail
0.00 0.5000
1.65 0.049
1.96 0.025
2.58 0.005
3.00 0.001
10/02/2022 202
Since the SND is symmetric about 0, the area under the curve to the right of z is equal to the area to the left of -z.
-z Area in the left tail
0.00 0.5000
-1.65 0.049
-1.96 0.025
-2.58 0.005
-3.00 0.001
10/02/2022 203
Finding Probabilities of the Standard Normal
Distribution: P(0 < Z < 1.56)
Standard Normal Probabilities (table entries give the area under the curve between 0 and z)
[Figure: Standard Normal Distribution, density f(z) plotted against z, with the area between 0 and 1.56 marked]
• Look in the row labeled 1.5 and the column labeled .06 to find P(0 ≤ Z ≤ 1.56) = 0.4406
z .00 .01 .02 .03 .04 .05 .06 .07 .08 .09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
3.0 0.4987 0.4987 0.4987 0.4988 0.4988 0.4989 0.4989 0.4989 0.4990 0.4990
10/02/2022 204
Z 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.0000 0.0040 0.0080 0.0120 0.0160 0.0199 0.0239 0.0279 0.0319 0.0359
0.1 0.0398 0.0438 0.0478 0.0517 0.0557 0.0596 0.0636 0.0675 0.0714 0.0753
0.2 0.0793 0.0832 0.0871 0.0910 0.0948 0.0987 0.1026 0.1064 0.1103 0.1141
0.3 0.1179 0.1217 0.1255 0.1293 0.1331 0.1368 0.1406 0.1443 0.1480 0.1517
0.4 0.1554 0.1591 0.1628 0.1664 0.1700 0.1736 0.1772 0.1808 0.1844 0.1879
0.5 0.1915 0.1950 0.1985 0.2019 0.2054 0.2088 0.2123 0.2157 0.2190 0.2224
0.6 0.2257 0.2291 0.2324 0.2357 0.2389 0.2422 0.2454 0.2486 0.2517 0.2549
0.7 0.2580 0.2611 0.2642 0.2673 0.2704 0.2734 0.2764 0.2794 0.2823 0.2852
0.8 0.2881 0.2910 0.2939 0.2967 0.2995 0.3023 0.3051 0.3078 0.3106 0.3133
0.9 0.3159 0.3186 0.3212 0.3238 0.3264 0.3289 0.3315 0.3340 0.3365 0.3389
1.0 0.3413 0.3438 0.3461 0.3485 0.3508 0.3531 0.3554 0.3577 0.3599 0.3621
1.1 0.3643 0.3665 0.3686 0.3708 0.3729 0.3749 0.3770 0.3790 0.3810 0.3830
1.2 0.3849 0.3869 0.3888 0.3907 0.3925 0.3944 0.3962 0.3980 0.3997 0.4015
1.3 0.4032 0.4049 0.4066 0.4082 0.4099 0.4115 0.4131 0.4147 0.4162 0.4177
1.4 0.4192 0.4207 0.4222 0.4236 0.4251 0.4265 0.4279 0.4292 0.4306 0.4319
1.5 0.4332 0.4345 0.4357 0.4370 0.4382 0.4394 0.4406 0.4418 0.4429 0.4441
1.6 0.4452 0.4463 0.4474 0.4484 0.4495 0.4505 0.4515 0.4525 0.4535 0.4545
1.7 0.4554 0.4564 0.4573 0.4582 0.4591 0.4599 0.4608 0.4616 0.4625 0.4633
1.8 0.4641 0.4649 0.4656 0.4664 0.4671 0.4678 0.4686 0.4693 0.4699 0.4706
1.9 0.4713 0.4719 0.4726 0.4732 0.4738 0.4744 0.4750 0.4756 0.4761 0.4767
2.0 0.4772 0.4778 0.4783 0.4788 0.4793 0.4798 0.4803 0.4808 0.4812 0.4817
2.1 0.4821 0.4826 0.4830 0.4834 0.4838 0.4842 0.4846 0.4850 0.4854 0.4857
2.2 0.4861 0.4864 0.4868 0.4871 0.4875 0.4878 0.4881 0.4884 0.4887 0.4890
2.3 0.4893 0.4896 0.4898 0.4901 0.4904 0.4906 0.4909 0.4911 0.4913 0.4916
2.4 0.4918 0.4920 0.4922 0.4925 0.4927 0.4929 0.4931 0.4932 0.4934 0.4936
2.5 0.4938 0.4940 0.4941 0.4943 0.4945 0.4946 0.4948 0.4949 0.4951 0.4952
2.6 0.4953 0.4955 0.4956 0.4957 0.4959 0.4960 0.4961 0.4962 0.4963 0.4964
2.7 0.4965 0.4966 0.4967 0.4968 0.4969 0.4970 0.4971 0.4972 0.4973 0.4974
2.8 0.4974 0.4975 0.4976 0.4977 0.4977 0.4978 0.4979 0.4979 0.4980 0.4981
2.9 0.4981 0.4982 0.4982 0.4983 0.4984 0.4984 0.4985 0.4985 0.4986 0.4986
10/02/2022 205
To find P(Z < -2.47):
1. Find the table area for 2.47: P(0 < Z < 2.47) = 0.4932
z ... .06 .07 .08
2.3 ... 0.4909 0.4911 0.4913
2.4 ... 0.4931 0.4932 0.4934
2.5 ... 0.4948 0.4949 0.4951
2. P(Z < -2.47) = 0.5 - P(0 < Z < 2.47) = 0.5 - 0.4932 = 0.0068
[Figure: Standard Normal Distribution, density f(z) plotted against z, showing the table area for 2.47 (0.4932) and the area to the left of -2.47 (0.0068)]
10/02/2022 206
To find P(1 ≤ Z ≤ 2):
1. Find the table area for 2.00
F(2) = P(Z ≤ 2.00) = 0.5 + 0.4772 = 0.9772
2. Find the table area for 1.00
F(1) = P(Z ≤ 1.00) = 0.5 + 0.3413 = 0.8413
3. P(1 ≤ Z ≤ 2.00) = P(Z ≤ 2.00) - P(Z ≤ 1.00)
= 0.9772 - 0.8413 = 0.1359
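The three table look-ups worked above can be cross-checked without a printed table. This is a minimal sketch that builds the standard normal cumulative distribution function from Python's `math.erf`; the function names are illustrative.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """F(z) = P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def area_0_to_z(z):
    """Area between 0 and z, i.e. the quantity listed in the body of the table."""
    return std_normal_cdf(z) - 0.5

print(round(area_0_to_z(1.56), 4))                        # P(0 < Z < 1.56)
print(round(std_normal_cdf(-2.47), 4))                    # P(Z < -2.47)
print(round(std_normal_cdf(2) - std_normal_cdf(1), 4))    # P(1 <= Z <= 2)
```

Any interval probability reduces to a difference of two values of F, which is the same subtraction done by hand in step 3 above.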
10/02/2022 207
Exercise
1. Compute P(-1 ≤ Z ≤ 1.5)
2. Find the area under the SND from 0 to 1.45
3. Compute P(-1.66 < Z < 2.85)
4. Compute P (Z< 1.045)
10/02/2022 208
Z - Transformation
• If a random variable X ~ N(µ, σ²), then we can transform it to the SND with the help of the Z-transformation:
$Z = \frac{x - \mu}{\sigma}$
• Z represents the Z-score for a given x value
10/02/2022 209
• This process is known as standardization and
gives the position on a normal curve with μ=0
and σ=1, i.e., the SND, Z.
• A Z-score is the number of standard deviations that a given x value is above or below the mean.
10/02/2022 210
Example
• The diastolic blood pressures (DBP) of males 35–44 years of age are normally distributed with µ = 80 mm Hg and σ² = 144 mm Hg², so σ = 12 mm Hg
• Therefore, a DBP of 80 + 12 = 92 mm Hg lies 1 SD above the mean
• Suppose individuals with DBP above 95 mm Hg are considered to be hypertensive
10/02/2022 211
a. What is the probability that a randomly selected male has a DBP above 95 mm Hg?
Z = (95 – 80) / 12 = 1.25
P(Z > 1.25) = 0.5 – 0.3944 = 0.1056
• Thus we can make statements like: If we choose a person at random from this population, the probability is 0.1056 that the person has a DBP above 95 mm Hg
10/02/2022 212
b. What is the probability that a randomly
selected male has a DBP above 110 mm Hg?
Z = (110 – 80) / 12 = 2.50
P (Z > 2.50) = 0.0062
• Approximately 0.6% of the population has a DBP above 110 mm Hg
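Both parts of this example can be verified with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), which standardizes internally. The helper names below are illustrative.

```python
from statistics import NormalDist

# DBP of males aged 35-44: mean 80 mm Hg, SD 12 mm Hg (from the example)
dbp = NormalDist(mu=80, sigma=12)

def z_score(dist, x):
    """Number of standard deviations that x lies above or below the mean."""
    return (x - dist.mean) / dist.stdev

def prob_above(dist, x):
    """P(X > x), the upper-tail area beyond x."""
    return 1 - dist.cdf(x)

for cutoff in (95, 110):
    print(cutoff, z_score(dbp, cutoff), round(prob_above(dbp, cutoff), 4))
```

Working on the original scale or on the Z scale gives the same probability; the Z-transformation is only needed when the answer has to be read from a standard normal table.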
10/02/2022 213
Exercise
• Each child born to a particular set of parents has a probability of 0.25 of having blood type O. These parents have 5 children.
• What is the probability that
a. Exactly 2 of them have blood type O?
b. At most 2 have blood type O?
c. At least 4 have blood type O?
d. Exactly 2 do not have blood type O?
10/02/2022 214
Chi-square Distribution
10/02/2022 215
Introduction
• The chi-square (χ²) distribution is one of the probability distributions
• The χ² distribution is not symmetrical; it is always skewed to the right
• The distribution only takes positive values between 0 and infinity
• The skewness diminishes as n gets larger
• It depends on the degrees of freedom, df = (R-1)(C-1), where R and C are the number of rows and columns respectively (the only parameter of the distribution)
10/02/2022 216
Test of significance using the χ²
• Used for categorical data analysis
• It compares the actual observed frequency in each group with the expected frequency
• Allows us to test for association between categorical (nominal) variables
• The null hypothesis for this test is that there is no association between the variables.
– E.g., the proportion of disease is the same regardless of exposure
• The alternative hypothesis (HA) is that there is an association between the variables
10/02/2022 217
Assumptions of the χ² test
• Each of the observations should be independent of the other observations
• It should not be used when the total number of observed values is < 40 and any expected value is < 5
• For a large sample, 80% of the cells should have an expected frequency of at least 5
• No observed cell should be 0
10/02/2022 218
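Calculation of χ² value
• General formula:
$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
• Oij – observed frequency of the ith row and jth column
• Eij – expected frequency of the ith row and jth column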
10/02/2022 219
Calculation of expected frequency
 The expected frequency in each cell is the
product of the row and column totals divided
by the sum of all the observed frequencies (i.e.
sample size)
10/02/2022 220
 Counts in the Chi-Square Test of a 2x2 table
are represented as “a”, “b”, “c” and “d”.
10/02/2022 221
Example 1
 Compute the expected table for the oral contraceptive (OC)
use and myocardial infarction (MI) data below

OC-use group     MI status over 3 years
                 Yes     No        Total
OC users         13      4,987     5,000
Non-OC users     7       9,993     10,000
Total            20      14,980    15,000
10/02/2022 222
Solution
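Expected frequencies:

OC-use group     MI status over 3 years
                 Yes     No        Total
OC users         6.7     4,993.3   5,000
Non-OC users     13.3    9,986.7   10,000
Total            20      14,980    15,000
• χ² ≈ 9.04 without continuity correction (≈ 7.67 with Yates' continuity
correction), df = 1; for the uncorrected value, 0.001 < p-value < 0.005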
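The expected table and the χ² statistic for the OC/MI data can be reproduced with a few lines of plain Python. This is a minimal sketch; the variable names are ours, and the Yates-corrected value is included only for comparison.

```python
# Observed 2x2 table: rows = OC users, non-OC users; columns = MI yes, MI no
observed = [[13, 4987],
            [7, 9993]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected frequency = (row total x column total) / sample size
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum of (O - E)^2 / E over all cells
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Same statistic with Yates' continuity correction (|O - E| - 0.5)
chi2_yates = sum((abs(o - e) - 0.5) ** 2 / e
                 for obs_row, exp_row in zip(observed, expected)
                 for o, e in zip(obs_row, exp_row))

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(expected, round(chi2, 2), round(chi2_yates, 2), df)
```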
10/02/2022 223
Example 2: Observed Numbers
Response by Treatment
10/02/2022 224
Expected Numbers
10/02/2022 225
 A study was conducted to investigate the possible
cause of gastroenteritis outbreak following a lunch
served in a high school cafeteria. Among the 225
students who ate the sandwiches, 109 became ill.
While, among the 38 students who did not eat the
sandwiches, 4 became ill.
 Present the data by 2x2 contingency table
 Test hypothesis for difference in the proportion of

gastroenteritis among those who ate sandwiches and
who didn't
10/02/2022 229
Chapter III - Sampling
Methods and Sample Size
Determination
10/02/2022 230
Introduction
In reality there is rarely enough time, energy, money,
labour/manpower, equipment, or access to suitable
sites to measure every single item or site within the
parent population.
 In making inferences about populations we use
samples.
 A sample is a subset of the population

10/02/2022 231
Terminologies
 Sampling is the process of selecting a portion of the
population to represent the entire population.
 Sampling allows one to obtain a representative

picture about the population, without studying the
entire population.
 Reference population (target population): the

population of interest, to which the investigators
would like to generalize the results of the study.
10/02/2022 232
Terminologies cont…
• Sampling population: the subset of the target
population from which a sample will be drawn.
 Study population: the actual group in which the

study is conducted = Sample
 Study unit: the units on which information will be
collected: people, households, …
10/02/2022 233
Terminologies cont…
 Sampling frame: the list of all the units in the
reference population, from which a sample is to be
picked.
 Sampling fraction: the ratio of the number of units

in the sample to the number of units in the reference
population (n/N)
 Sample size: The number of units in the sample.
10/02/2022 234
Why sampling?
 Often, it is too expensive or impossible to
collect information on an entire population.
 For appropriately chosen samples, accurate

statistical estimates of population parameters
are possible.
10/02/2022 235
Advantages & disadvantages of sampling
Advantages
 Saves resources
 Improves quality of data
Disadvantages
 Sampling error: errors in the selection of a sample

 A different sample would give a different estimate, the
difference being due to sampling variation.
 Should be minimized
10/02/2022 236
Sampling methods
 The scientific procedure of selecting those sampling
units that provide the required estimates, with
associated margins of uncertainty arising from
examining only a part and not the whole.
 Selecting a sample in an efficient, appropriate
way is a challenge
 The sample should represent the population as accurately
as possible so that meaningful inferences about
population characteristics can be drawn from the results
of the sample.
10/02/2022 237
 While selecting a SAMPLE, there are basic
questions:
– What is the group of people (POPULATION)

from which we want to draw a sample?
– How many people do we need in our sample?
– How will these people be selected?
10/02/2022 238
Methods of Sample Selection
 There are two broad categories of sampling

methods
1. Probability sampling methods
2. Non-probability sampling methods
10/02/2022 239
Probability sampling methods
 Every sampling unit has a known and non-zero
probability of selection into the sample.
 Involves random selection of a sample
 Reliable estimates can be produced and
 Inferences can be made about the population.
 Might be costly
10/02/2022 240
How can random samples be selected?
1. Simple Random Sampling
2. Systematic random sampling

3. Stratified random sampling
4. Cluster sampling
5. Multi-stage sampling
10/02/2022 241
Simple Random Sampling
 Least biased of all sampling techniques, there is no
subjectivity - each member of the total population has an
equal chance of being selected
 One of the easiest and most convenient methods for

achieving reliable inferences about a population is to take a
simple random sample
 The selection is usually made with the help of random

numbers.
 Needs sampling frame

10/02/2022 242
To select a simple random sample you need to:
 Make a numbered list of all the units in the population
 Each unit should be numbered from 1 to N (where N is

the size of the population)
 Select the required number using lottery method, table

of random numbers or computer programs (ensure
randomness of sample).
10/02/2022 243
Example (random numbers)
 Suppose there are N=850 students in a school from

which a sample of n=10 students is to be taken.
 The students are numbered from 1 to 850.
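A computer-generated version of this selection can be sketched with Python's standard `random` module. The seed value below is an arbitrary assumption, used only so that the draw is reproducible.

```python
import random

N, n = 850, 10            # population size and sample size from the example
rng = random.Random(2022) # arbitrary seed (assumption) for reproducibility

# Draw n distinct student numbers from 1..N, each equally likely
sample_ids = sorted(rng.sample(range(1, N + 1), n))
print(sample_ids)
```

Because `sample` draws without replacement, no student number can appear twice.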
10/02/2022 244
Systematic sampling
 Selection of individuals from the sampling frame
systematically rather than randomly
 We have to number the data items from 1 to N

(sampling frame).
 If the sample size is n, calculate the sampling
interval k by dividing N by n. Then generate a random
number between 1 and N/n and select that data item
as the first unit in the sample.
10/02/2022 245
Systematic sampling cont…
• Other items in the sample are obtained by adding the
sampling interval N/n successively to the random
number.
 Individuals are chosen at regular intervals (every kth)

from the sampling frame. The first unit to be selected is
taken at random from among the first k units.
 Advantage of this method is that the sample is evenly

distributed over the entire data.
10/02/2022 246
Systematic sampling cont…
 It can be more biased than simple random sampling: once the
random start is chosen the remaining units are fixed, so not
all possible samples have an equal chance of being selected
 It creates a problem (bias) if the elements appear in a
cyclical pattern in the list instead of being uniformly
distributed throughout the list
10/02/2022 247
Example
 To select a sample of 100 from a population of 400,

you would need a sampling interval of 400 ÷ 100 = 4.
 Therefore, K = 4.
 You will need to select one unit out of every four units
to end up with a total of 100 units in your sample.
 Select a number between 1 and 4 from a table of

random numbers.
10/02/2022 248
Example cont…
 If you choose 3, the third unit on your frame would

be the first unit included in your sample;
 The sample might consist of the following units to

make up a sample of 100: 3 (the random start), 7, 11,
15, 19...395, 399 (up to N, which is 400 in this case).
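The selection rule in this example can be sketched as a small function. The function name `systematic_sample` is ours, and the sketch assumes N is an exact multiple of n, as it is here.

```python
import random

def systematic_sample(N, n, start=None, rng=random):
    """Return the unit numbers of a systematic sample of size n from units 1..N."""
    k = N // n                      # sampling interval
    if start is None:
        start = rng.randint(1, k)   # random start among the first k units
    return [start + i * k for i in range(n)]

# Random start fixed at 3 to reproduce the example above
sample = systematic_sample(400, 100, start=3)
print(sample[:5], sample[-2:])
```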
10/02/2022 249
Stratified random sampling
 It is done when the population is known to have
heterogeneity with regard to some factors, and those
factors are used for stratification
 The population is divided into homogeneous,

mutually exclusive groups called strata according to
a characteristic of interest (e.g., sex, geographic area,
prevalence of disease, etc.).
10/02/2022 250
Stratified random sampling cont…
 A separate sample is taken independently from
each stratum.
 Any of the sampling methods mentioned in this

section (and others that exist) can be used to sample
within each stratum.
 Produces an unbiased estimate of the population

mean with better precision than does simple random
sampling with the same total sample size n.
10/02/2022 251
• Sampling frame for the entire population has to be
prepared separately for each stratum
• You need to decide the sample size for each stratum

(Equal allocation or proportionate allocation).
 Equal allocation:
– Allocate equal sample size to each stratum
10/02/2022 252
Proportionate allocation:
$n_j = \frac{n}{N} N_j$
 n_j is the sample size of the jth stratum
 N_j is the population size of the jth stratum
 n = n_1 + n_2 + ... + n_k is the total sample size
 N = N_1 + N_2 + ... + N_k is the total population size
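As an illustration of proportionate allocation, the sketch below applies n_j = (n/N) N_j to three hypothetical strata (the stratum names and sizes are invented for the example). Rounding each n_j can make the total differ from n by a unit or two in general, so the result should be checked.

```python
def proportionate_allocation(strata_sizes, n):
    """Allocate a total sample of size n to strata in proportion to their sizes."""
    N = sum(strata_sizes.values())
    return {name: round(n * N_j / N) for name, N_j in strata_sizes.items()}

# Hypothetical strata sizes (assumption, for illustration only)
strata = {"A": 5000, "B": 3000, "C": 2000}
alloc = proportionate_allocation(strata, n=500)
print(alloc, sum(alloc.values()))
```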
10/02/2022 253
Cluster Sampling
 Method of sampling in which the element selected is a
group (as distinguished from an individual), called a
cluster.
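 The most widely used method to reduce cost
 Clusters should be similar to one another, each ideally
reflecting the diversity of the population; in stratified
sampling, by contrast, strata differ from one another while
units within a stratum are homogeneous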
 The sampling unit is a cluster, and the sampling frame is

a list of these clusters
10/02/2022 254
Cluster Sampling cont…
 These clusters are often geographic units (e.g.,
districts, villages, etc.)
Steps
 Divide the population into groups or clusters.
 Select clusters randomly and include all units within

selected clusters in the sample.
10/02/2022 255
Example
 In a school based study, we assume students of the

same school are homogeneous.
 We can randomly select sections and include all
students of the selected sections only
10/02/2022 256
Advantages
 A list of all the individual study units in the reference
population is not required. It is sufficient to have a list
of clusters
 Cost reduction
Disadvantages
 Sampling error is usually higher than for a simple
random sample of the same size.
 It is usually better to survey a large number of small
clusters instead of a small number of large clusters.
10/02/2022 257
Multi-stage sampling
 Similar to the cluster sampling, except that it involves
picking a sample from within each chosen cluster,
rather than including all units in the cluster.
 This method is appropriate when the reference

population is large and widely scattered. Selection is
done in stages until the final sampling unit
10/02/2022 258
Multi-stage sampling cont…
 This type of sampling requires at least two stages.
 The primary sampling unit (PSU) is the sampling

unit in the first sampling stage (e.g., kebeles).
 The secondary sampling unit (SSU) is the sampling

unit in the second sampling stage (e.g., households),
etc.
10/02/2022 259
Multi-stage sampling cont…
 You do not need to have a list of all of the units in the
population. All you need is a list of clusters and list of
the units in the selected clusters.
 Saves a great amount of time and effort by not having

to create a list of all the units in a population.
10/02/2022 260
Non-probability sampling
 Every item has an unknown chance of being selected
 There is an assumption that there is an even

distribution of a characteristic of interest within the
population
 Since elements are chosen arbitrarily, there is no way

to estimate the probability of any one element being
included in the sample.
10/02/2022 261
Non-probability sampling
 Inappropriate if the aim is to measure variables and
generalize findings obtained from a sample to the
population
 They are quick, inexpensive and convenient.
10/02/2022 262
The most common types of NPS
1. Convenience or haphazard sampling

2. Volunteer sampling
3. Judgment sampling
4. Quota sampling
10/02/2022 263
Convenience or haphazard sampling
 Sometimes referred to as accidental sampling.
 Study units that happen to be available at the time of
data collection are selected.
 It can deliver accurate results when the population is
homogeneous.
 For example, a scientist could use this method to
determine whether a lake is polluted or not.
 Assuming that the lake water is well-mixed, any
sample would yield similar information.
10/02/2022 264
Volunteer sampling
 As the term implies, this type of sampling occurs
when people volunteer to be involved in the study.
 In psychological experiments or pharmaceutical

trials (drug testing), for example, it would be difficult
and unethical to enlist random participants from the
general public.
 In these instances, the sample is taken from a group

of volunteers.
10/02/2022 265
Quota sampling
 Sampling is done until a specific number of units
(quotas) for various sub-populations have been
selected.
 In this method the investigator interviews as many

people in each category of study unit as he can find
until he has filled his quota.
10/02/2022 266
Advantages of NPS
 Easy
 Less expensive
 Does not require sampling frame
Disadvantages of NPS
 Not representative of the population
 Bias
10/02/2022 267
Sample Size Determination
 In studies concerned with estimating some
characteristic of a population, sample size
calculations are important to ensure that estimates are
obtained with required precision or confidence.
 Too small a sample is a waste of time and resources, and
the results have no practical use
 Too large a sample is a waste of resources, and data
quality may be compromised
10/02/2022 268
Sample size determination depends on the:
– Objective of the study
– Design of the study

– Accuracy of the measurements to be made
– Degree of precision required for generalization

– etc
10/02/2022 269
Determination of Sample Size for
Estimating Means
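 To estimate the population mean μ:
$n = \frac{(Z_{\alpha/2})^2 \sigma^2}{d^2}$
where σ is the population standard deviation and d is the margin
of error (half the width of the confidence interval)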
10/02/2022 270
Example
• Find the minimum sample size needed to estimate
the drop in heart rate (µ) for a new study using a
higher dose of propranolol than the standard one.
We require that the two-sided 95% CI for µ be no
wider than 5 beats per minute and the sample
standard deviation for change in heart rate equals 10
beats per minute.
Since the CI should be no wider than 5 beats per minute, d = 5/2 = 2.5:
$n = \frac{(1.96)^2 (10)^2}{(2.5)^2} = 61.47 \approx 62$ patients
 We round up to the next whole number if the
calculation yields a number that is not itself an integer.
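The same calculation, including the rounding-up rule, can be sketched as follows. The function name `sample_size_mean` is ours.

```python
from math import ceil

def sample_size_mean(z, sigma, d):
    """Minimum n so that the CI for a mean has margin of error at most d."""
    return ceil((z * sigma / d) ** 2)

# 95% CI (z = 1.96), SD = 10 beats/min, CI width 5 so d = 2.5
n = sample_size_mean(z=1.96, sigma=10, d=2.5)
print(n)
```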
10/02/2022 271
Determination of Sample Size for
Estimating Proportions
 To estimate the population proportion P:
$n = \frac{(Z_{\alpha/2})^2 P(1-P)}{d^2}$
10/02/2022 272
Example
 A survey is being planned to determine what
proportion of family in a certain area are medically
indigent. It is believed that the proportion can not be
greater than 0.35. A 95% confidence interval is
desired with d = 0.05. What size sample of families
should be selected?
$n = \frac{(1.96)^2 (0.35)(0.65)}{(0.05)^2} = 349.6 \approx 350$ families
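A short sketch of this calculation, under the same inputs as the example (the function name `sample_size_proportion` is ours):

```python
from math import ceil

def sample_size_proportion(z, p, d):
    """Minimum n so that the CI for a proportion has margin of error at most d."""
    return ceil(z**2 * p * (1 - p) / d**2)

# 95% CI (z = 1.96), anticipated proportion 0.35, margin of error 0.05
n = sample_size_proportion(z=1.96, p=0.35, d=0.05)
print(n)
```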
10/02/2022 273
Note
 σ² and P are usually not known, so we:
 Do a pilot or preliminary sample:
– Select a pilot sample and estimate σ² with
the sample variance, S²
 Or take values from previous or similar studies
 For P, you can take the conservative value P = 50%
10/02/2022 274
Sampling Distribution
10/02/2022 275
Introduction
 The target of a scientist’s investigation is a population
with certain characteristic of interest: E.g., systolic
blood pressure
 A numerical characteristic of a target population is
called a parameter: e.g., the population mean μ
(average SBP) or the population proportion p (a
drug’s response rate).
10/02/2022 276
Introduction
 It would be too time consuming or too costly to
obtain the totality of population information in order
to learn about the parameter(s) of interest.
 We take sample to reach conclusions about

population parameter
 Inferential statistics are the statistical methods used

to draw conclusions from a sample and make
inferences to the entire population.
10/02/2022 277
Introduction
 Statistics obtained from a sample help us draw
inferences or conclusions about population parameters.
E.g., the sample mean x̄ and the sample proportion p̂.
 If we take a different sample, we almost certainly have
a different numerical value for that same statistic
 We consider sample statistics as variables that take
different values from sample to sample (random
variables).
10/02/2022 278
Sampling Distributions
 The distribution of values of a statistic

obtained from repeated random samples of the
same size from a given population
10/02/2022 279
Main types of sampling distributions
• Distribution of the sample mean
• Distribution of the difference between two means
• Distribution of the sample proportion
• Distribution of the difference between two

proportions
10/02/2022 280
Construction of sampling distributions
 From a population of size N, randomly draw

all possible samples of size n.
 Compute the statistic of interest for each

sample.
 Create a frequency distribution of the statistic.
10/02/2022 281
Sampling distribution of sample mean
 Suppose we have a population of size N=4,
constituting the ages of four outpatients.
x, Age (years): 18, 20, 22, 24
$\mu = \frac{\sum x_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21$
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} = 2.236$
10/02/2022 282
 Now consider all possible samples, with
replacement, of size 2

16 possible samples (1st observation in rows, 2nd observation in columns):
1st Obs    18       20       22       24
18         18,18    18,20    18,22    18,24
20         20,18    20,20    20,22    20,24
22         22,18    22,20    22,22    22,24
24         24,18    24,20    24,22    24,24

16 sample means:
1st Obs    18    20    22    24
18         18    19    20    21
20         19    20    21    22
22         20    21    22    23
24         21    22    23    24
10/02/2022 283
Sample mean (x̄)   Freq   P(x̄)
18 1 0.0625
19 2 0.1250
20 3 0.1875
21 4 0.2500
22 3 0.1875
23 2 0.1250
24 1 0.0625
10/02/2022 284
Sampling distribution of all sample means
[Figure: bar chart of the distribution of sample means; x-axis: x̄ from 18 to 24, y-axis: P(x̄) from 0 to 0.3]
10/02/2022 285
Mean and SD of sample means
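$\mu_{\bar{x}} = \frac{\sum \bar{x}_i}{16} = \frac{18 + 19 + 19 + \cdots + 24}{16} = 21$
$\sigma_{\bar{x}} = \sqrt{\frac{\sum (\bar{x}_i - \mu_{\bar{x}})^2}{16}} = \sqrt{\frac{(18 - 21)^2 + (19 - 21)^2 + \cdots + (24 - 21)^2}{16}} = 1.58$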
10/02/2022 286
Mean and SD of sample means
• We note that the mean of the sampling
distribution has the same value as the mean of
the original population.
• However, the variance is not equal to the original
population variance; it is equal to the
population variance divided by the sample
size used to obtain the sampling distribution
(here 5/2 = 2.5, so the SD is √2.5 = 1.58).
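These two properties can be verified by brute force for the four-outpatient population. The sketch below enumerates all 16 samples of size 2 drawn with replacement, using only the standard library.

```python
from itertools import product
from statistics import mean, pvariance

ages = [18, 20, 22, 24]   # population of N = 4 outpatients
n = 2                     # sample size

# All possible ordered samples of size n, drawn with replacement
samples = list(product(ages, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(ages)                      # population mean
var = pvariance(ages)                # population variance
mu_xbar = mean(sample_means)         # mean of the sample means
var_xbar = pvariance(sample_means)   # variance of the sample means

print(len(samples), mu, mu_xbar, var, var_xbar)
```

The mean of the sample means matches the population mean, and the variance of the sample means equals the population variance divided by n.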
10/02/2022 287
Standard error
• The standard deviation of any sample statistic is
called its standard error
• SE is determined by both the sample size and the

degree of variability among the individual
observations
• Quantifies the variability among means of repeated

samples drawn from that population
10/02/2022 288
Properties of sampling distribution of mean
• The standard deviation/error of the

distribution of sample means is equal to the
standard deviation of the population divided
by the square root of the sample size
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
10/02/2022 289
z-score/value
 Helps in computing the probability of
obtaining a sample with a mean of some
specified magnitude.
$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$
Where: x̄ = sample mean
μ = population mean
σ = population standard deviation
n = sample size
10/02/2022 290
Distribution of the sample proportion
 The sample proportion is derived from counts
or frequency data.
 Sample proportion: $\hat{p} = \frac{x}{n} = \frac{\text{number of successes in the sample}}{\text{sample size}}$
 Population proportion = p or π
10/02/2022 291
Properties
• The mean of the distribution, μ_p̂, will be equal
to the true population proportion, p, and the
variance of the distribution will be equal to
p(1 - p)/n = pq/n.
$\mu_{\hat{p}} = p, \quad \sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$
• The sampling distribution of p̂ will be
approximately normal when the sample size n
is large
10/02/2022 292
z-Value for Proportions
 Standardize p̂ to a z value with the formula:
$z = \frac{\hat{p} - p}{\sigma_{\hat{p}}} = \frac{\hat{p} - p}{\sqrt{p(1 - p)/n}}$
10/02/2022 293
Summary for SE
Statistic                  Standard error
• Sample mean, x̄           SE(x̄) = s / sqrt(n)
• Sample proportion, p̂     SE(p̂) = sqrt[p(1 - p) / n]
10/02/2022 294
Exercises
1. Suppose a population has mean μ = 50 and
standard deviation σ = 16. Suppose a random
sample size of 64 is selected.
 Find the probability that the sample mean is >53
Steps
 Write the given information
 Sketch a normal curve
 Convert the mean to z-score
 Find the corresponding area under the SND curve
10/02/2022 295
 $z = \frac{53 - 50}{16/\sqrt{64}} = \frac{3}{2} = 1.5$
 The area under the SND curve above z = 1.5
is 0.0668, so P (z > 1.5) = 0.0668
 The probability that the sample mean is greater
than 53 is 0.0668.
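The steps of this exercise can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), which replaces the table lookup. A minimal sketch:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 50, 16, 64   # given information
x_bar = 53                  # sample mean of interest

se = sigma / sqrt(n)        # standard error of the mean
z = (x_bar - mu) / se       # convert the sample mean to a z-score
p = 1 - NormalDist().cdf(z) # area under the standard normal curve above z

print(z, round(p, 4))
```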
10/02/2022 296
2. According to a recent estimate, 19.4% of the under-
five children in a population are stunted
 What is the probability that in a random sample of
size 150 from this population fewer than 15% will be
stunted?
• Find the z-score: $z = \frac{0.15 - 0.194}{\sqrt{0.194(0.806)/150}} = -1.36$
• A value of z < -1.36 gives an area of 0.0869, which is
the probability P (z < -1.36) = 0.0869
• The probability that p̂ < 0.15 is 0.0869.
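A sketch of the same calculation in Python follows. Because it keeps the unrounded z-score instead of the table value z = -1.36, the probability it prints differs slightly from 0.0869 in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.194, 150   # population proportion stunted, sample size
p_hat = 0.15        # sample proportion of interest

se = sqrt(p * (1 - p) / n)      # standard error of the sample proportion
z = (p_hat - p) / se            # z-score for the sample proportion
prob = NormalDist().cdf(z)      # P(p_hat < 0.15)

print(round(z, 2), round(prob, 4))
```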
10/02/2022 297
Chapter IV Estimation
10/02/2022 298
Introduction
• The values of population parameters are usually not known
• The sample from a population is used to provide the

estimates of the population parameters
• The process of drawing conclusions about an entire

population based on the data in a sample is known as
statistical inference.
• Methods of inference usually fall into one of two broad

categories: estimation or hypothesis testing.
10/02/2022 299
Estimation
 Is concerned with estimating/computation of

the values of specific population parameters
based on sample statistics.
 For example, sample means are used to

estimate population means; sample
proportions, to estimate population
proportions.
10/02/2022 300
Estimation
 The statistic itself is called an estimator
 The value or values that the estimator assumes are called

estimates.
 Different samples could have produced different
results. The amount of variation that exists among the
estimates from the different possible samples is the
sampling error.
10/02/2022 301
Types of estimation
 An estimate of a population parameter may be
expressed in two ways: point estimate and interval
estimate
Point estimate
 A point estimate of a population parameter is a single
value of a statistic.
 For example, the sample mean x̄ is a point estimate of
the population mean μ. Similarly, the sample
proportion p̂ is a point estimate of the population
proportion P.
10/02/2022 302
 The value of your sample statistic (e.g., your
sample mean or sample correlation) is used to
estimate the population parameter (e.g., the
population mean or the population correlation).
 For example, if you take a random sample from

University students and you find that the
average/mean age for the students in your sample is
23 years, then your best guess or your point estimate
for the population of students in the University will
be 23 years.
10/02/2022 303
Properties of good estimate
 Unbiasedness: if one could take repeated samples of
size n from the population, the average of these
estimates would equal the value of the population
parameter. The sample mean is an unbiased estimator of
the population mean μ (as is the sample median when
the population is symmetric).
 Small/minimum variance: the estimator which has

the smallest variance or dispersion about the true
parameter.
10/02/2022 304
Interval estimate
 The probability of getting a sample statistic value that

is exactly equal to the corresponding population
parameter is usually quite small.
 An interval estimate is defined by two numbers,

between which a population parameter is said to lie.
 For example, a < μ < b is an interval estimate of the
population mean μ. It indicates that the population
mean is greater than a but less than b.
10/02/2022 305
• An interval estimate (also called a confidence
interval) is a range of numbers inferred from the
sample that has a known probability of capturing
the population parameter over the long run (i.e.,
over repeated sampling).
• An interval estimate provides more information about

a population characteristic than does a point estimate
• Such interval estimates are called confidence

intervals.
10/02/2022 306
10/02/2022 307
 A point estimate does not give any indication
on how far away the parameter lies.
 A more useful method of estimation is to

compute an interval which has a high
probability of containing the parameter.
 A confidence interval is a guess (point

estimate) together with a “safety net” (interval)
of guesses of a population characteristic.
10/02/2022 308
 A CI has 3 components:
– A point estimate (e.g., the sample mean)
– The standard error of the point estimate (e.g.,
SEM = σ/√n)
– A confidence coefficient (conf. coeff)
10/02/2022 309
 Confidence coefficient is the measure of how
confident we want to be, critical value
 The “safety net” (confidence interval) that we

construct has “lower” and “upper” limits defined
Lower limit = (point estimate) – (confidence

coefficient)(SE)
Upper limit = (point estimate) + (confidence

coefficient)(SE)
10/02/2022 310
 CIs also give information about the precision of an
estimate.
 How much uncertainty is associated with a point

estimate of a population parameter?
 CIs will be wider with a small sample size or high
sampling variability
 Wider CIs indicate less certainty.
10/02/2022 311
Confidence interval (CI)
e.g. CI for means
95% CI (tolerance of error α = 5%): x̄ – 1.96 SE up to x̄ + 1.96 SE
[Figure: normal curve with central area 1 – α between the lower and upper limits of the 95% CI, and an area of α/2 in each tail]
Indicates the amount of random error in the estimate
Can be calculated for any statistic, e.g.: means, proportions, ...
10/02/2022 312
Confidence level
 Also called degree of confidence
 The confidence with which the interval will contain the
unknown population parameter
 Usually 90%, 95%, 99%

 Also written (1 - α)
 The coefficient to be multiplied with the standard

error of the mean should be determined accordingly.
10/02/2022 313
Example
Confidence level   Coefficient (z)
99%                2.576
95%                1.96
90%                1.645
E.g., 95% CI: point estimate ± 1.96 σ/√n
10/02/2022 314
Example: 95% CI
 If the data collection and analysis could be replicated many

times, the CI should include the true value of the measure
95% of the time.
 95% of all CIs of all possible random samples would contain

the unknown population parameter; the remaining 5% would
not. (Probabilistic)
 We are 100 (1-α)% [e.g., 95%] confident that the single

computed interval contains the unknown population
parameter. (Practical)
10/02/2022 315
CI for a Single Population Mean
(normally distributed)
 Is the population normally distributed?
 If the population standard deviation (σ) is not
known, use the sample standard deviation (S).
 A 100(1-α)% C.I. for μ is: $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
10/02/2022 316
Finding the Critical Value
10/02/2022 317
Example
 Suppose that the mean of percentage of bile for 31
male patients is 84.64, and the standard deviation 24.
Find the 95% CI.
Point estimate: x̄ = 84.64, S = 24
95% CI: $84.64 \pm 1.96 \times 24/\sqrt{31} = (76.2; 93.1)$
 We are 95% confident that the true mean

percentage of bile is between 76.2 and 93.1
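The interval in this example can be reproduced as follows. The sketch uses the coefficient 1.96 from the slides; with the standard deviation estimated from only 31 patients, a t-based coefficient would be a reasonable alternative and would give a slightly wider interval.

```python
from math import sqrt

x_bar, s, n = 84.64, 24, 31  # sample mean, sample SD, sample size
z = 1.96                     # confidence coefficient for 95%

se = s / sqrt(n)             # standard error of the mean
lower = x_bar - z * se
upper = x_bar + z * se

print(f"95% CI: ({lower:.1f}; {upper:.1f})")
```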
10/02/2022 318
CI for a single population proportion
• The distribution of the sample proportion is
approximately normal if sample size is large
$SE = \sqrt{\frac{P(1 - P)}{n}}$
10/02/2022 319
10/02/2022 320
 Lower limit = Point Estimate - (Critical Value) x
(Standard Error of Estimate)
 Upper limit = Point Estimate + (Critical Value) x

(Standard Error of Estimate)
Hence, $\hat{p} \pm 1.96 \sqrt{\hat{p}(1 - \hat{p})/n}$
is an approximate 95% CI for the true proportion p.
10/02/2022 321
Example
 A random sample of 100 people shows that 40 are
smokers. Form a 95% CI for the true proportion of
smokers.
 Point estimate = p̂ = 40/100 = 0.4
 SE(p̂) = √(0.4(0.6)/100) = 0.049
 95% CI: 0.4 ± 1.96(0.049) = (0.304; 0.496)
 We are 95% confident that the true proportion of
smokers in the population is between 30.4% and
49.6%
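A minimal sketch of this interval in Python, assuming 40 smokers out of 100 as implied by the point estimate 0.4:

```python
from math import sqrt

x, n = 40, 100               # number of smokers, sample size
z = 1.96                     # confidence coefficient for 95%

p_hat = x / n                           # point estimate
se = sqrt(p_hat * (1 - p_hat) / n)      # standard error of the proportion
lower = p_hat - z * se
upper = p_hat + z * se

print(f"95% CI: ({lower:.3f}; {upper:.3f})")
```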
10/02/2022 322
Hypothesis Testing
10/02/2022 323
What is Hypothesis Testing?
• A statistical hypothesis is an assumption
about a population parameter. This assumption
may or may not be true.
• Hypothesis testing refers to the formal

procedures used by statisticians to accept or
reject statistical hypotheses.
10/02/2022 324
What is Hypothesis Testing?
 Hypothesis testing aids in reaching a decision
(conclusion) concerning a population by
examining a sample from that population.
 If sample data are not consistent with the

statistical hypothesis, the hypothesis is
rejected.
10/02/2022 325
Types of statistical hypotheses
1. Null Hypothesis, HO
 Specifies a hypothesized real value, or values, for a
parameter
 Is a statement claiming that there is no difference between

the hypothesized value and the population value
 HO is a statement of agreement (or no difference)

 HO is always about a population parameter, not about a
sample statistic
10/02/2022 326
Null Hypothesis, HO
 Begin with the assumption that the Null

hypothesis is true
 Always contains “=”, “ ≤” or “≥ ” sign
 May or may not be rejected
10/02/2022 327
2. The Alternative Hypothesis, HA
 Specifies a real value or range of values for a

parameter that will be considered when the HO is
rejected.
 Is a statement of what we will believe is true if our

sample data causes us to reject HO.
 Is a statement that disagrees (opposes) with Ho
10/02/2022 328
How to Test Hypotheses
1. State the hypotheses
 The hypotheses are stated in such a way that
they are mutually exclusive
Example
H0: μ = μ0      H0: μ ≤ μ0      H0: μ ≥ μ0
H1: μ ≠ μ0      H1: μ > μ0      H1: μ < μ0
two-tailed      one-tailed      one-tailed
10/02/2022 329
How to Test Hypotheses…
2. State the assumptions necessary for computing

probabilities
 A distribution is approximately normal

(Gaussian)
 Variance is known or unknown
 Guides to select appropriate test statistic

10/02/2022 330
3. Decide on the appropriate test statistic for the
hypothesis (Z, t, χ², ...).
 Test Statistic is some function of the data that uses

estimates of the parameters we are interested in and
whose sampling distribution is known when we assume
the null hypothesis is true.
 It is a value computed from the sample data that is used

in making the decision about the rejection of the null
hypothesis
10/02/2022 331
Which test statistic to use?
a. Use the Z-statistic if:
 Data are measured on a quantitative scale, and are either
– Normally distributed and σ² is known, or
– Not normally distributed but the sample size is
large (by the central limit theorem) and σ² is
known.
b. When the parameter in the Ho involves
categorical data, you may use a chi-square
statistic as the test statistic
10/02/2022 332
Z-statistic
 Test statistic = (Statistic - Parameter) / (Standard
deviation of statistic)
 Test statistic = (Statistic - Parameter) / (Standard error
of statistic)
 Where Parameter is the value appearing in the null
hypothesis, and Statistic is the point estimate of
Parameter.
 As part of the analysis, you may need to compute the
standard deviation or standard error of the statistic.
10/02/2022 333
4. Specify the desired level of significance for the
statistical test (α = 0.05, 0.01, etc.)
10/02/2022 334
Level of significance
 You are the one who decides on the
significance level to use in your research study.
 A significance level is not an empirical result;

it is the level that you set so that you will
know what probability value will be small
enough for you to reject the HO.
10/02/2022 335
5. Determine the critical value
 A value the test statistic must attain to be

declared significant.
 The values of the boundaries of the critical

region
10/02/2022 336
10/02/2022 337
6. Obtain sample evidence and compute the test statistic
7. Reach a decision and draw the conclusion

 If the numerical value of the test statistic falls in the
rejection region, we reject the Ho, conclude that HA is
true (or accepted).
 If the test statistic does not fall in the rejection region,

we do not reject H0, conclude that HO may be true.
10/02/2022 338
Decision Rules
 For rejecting the null hypothesis
 Statisticians describe these decision rules in two ways -

with reference to a P-value or with reference to a
region of acceptance.
 P-value. The strength of evidence against a null
hypothesis is measured by the P-value. The P-value is the
probability of observing a test statistic at least as extreme
as the one observed, assuming the Ho is true. If the P-value is
less than the significance level, we reject the HO.
10/02/2022 339
Region of acceptance
 The region of acceptance is a range of values.
 If the test statistic falls within the region of

acceptance, the null hypothesis is not rejected.
 The region of acceptance is defined so that the chance

of making a Type I error is equal to the significance
level.
10/02/2022 340
Region of acceptance
• The set of values outside the region of acceptance is called the region of rejection.
• If the test statistic falls within the region of rejection, H0 is rejected. In such cases, we say that the hypothesis has been rejected at the α level of significance.
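For a two-tailed Z test, the region of acceptance can be computed from α with the inverse normal distribution. The sketch below assumes a standard normal test statistic; the helper names are hypothetical.

```python
from statistics import NormalDist


def acceptance_region(alpha=0.05):
    """Return (lower, upper) bounds of the two-tailed acceptance region."""
    upper = NormalDist().inv_cdf(1 - alpha / 2)  # critical value
    return -upper, upper


def in_rejection_region(z, alpha=0.05):
    """True when z lies outside the acceptance region."""
    lower, upper = acceptance_region(alpha)
    return z < lower or z > upper


lower, upper = acceptance_region(0.05)
print(round(lower, 2), round(upper, 2))  # -1.96 1.96
```

The bounds place α/2 of the probability in each tail, so the chance of landing in the rejection region when H0 is true equals α.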
Decision Errors
• Two types of errors can result from a hypothesis test.
• Type I error: occurs when the researcher rejects a null hypothesis when it is true.
• The probability of committing a Type I error is called the significance level.
• This probability is also called alpha, and is often denoted by α.
Decision Errors
• Type II error: occurs when the researcher fails to reject a null hypothesis that is false.
• The probability of committing a Type II error is called beta, and is often denoted by β.
• The probability of not committing a Type II error is called the power of the test.
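The relation between α, β and power can be made concrete for a two-tailed Z test with known variance. This is a minimal sketch; the function name and the alternative mean of 28 used in the illustration are assumptions, not values from the handout.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power (1 - beta) of a two-tailed Z test when the true mean is mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # How far the true mean sits from mu0, in standard-error units.
    shift = (mu1 - mu0) / (sigma / sqrt(n))
    # beta: probability that Z stays inside the acceptance region.
    beta = std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)
    return 1 - beta


# Hypothetical scenario: H0 says mu = 30, but the true mean is 28.
power = z_test_power(mu0=30, mu1=28, sigma=sqrt(20), n=40)
print(round(power, 2))
```

When the true mean equals the hypothesized mean, the same function returns α, which is the Type I error probability.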
Types of errors
Decision based on          Truth: No diff               Truth: Diff
the p-value                (H0 to be not rejected)      (H0 to be rejected, H1)
H0 not rejected (No diff)  Right decision (1 - α)       Type II error (β)
H0 rejected (H1) (Diff)    Type I error (α)             Right decision (1 - β)
• H0 is “true” but rejected: Type I or α error
• H0 is “false” but not rejected: Type II or β error
Hypothesis Test for a Mean (normally distributed)
• Known variance
Example 1
• The mean age of a random sample of 40 individuals is 27. If the variance of the population is 20, can we conclude that the mean age of the population is different from 30 years?
Step I
• H0: μ = 30
• HA: μ ≠ 30
Step II: n = 40, σ² = 20, sample mean = 27, normally distributed population
Step III: test statistic
The Z-statistic is appropriate: $Z = \dfrac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
Step IV: level of significance, α = 0.05
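Step V: Rejection region and critical value
With α = 0.05 in a two-tailed test, the critical values are -1.96 and +1.96; we reject H0 if Z < -1.96 or Z > +1.96.
Step VI: compute the test statistic
$Z = \dfrac{27 - 30}{\sqrt{20/40}} = -4.24$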
Step VII:
• Since -4.24 falls in the rejection region, we reject H0 and conclude that the mean age of the population is not 30.
• P-value: P(|Z| > 4.24) = 2P(Z > 4.24). Since there are two parts to the rejection region in a two-tailed test, the P-value is twice P(Z > 4.24).
• P-value: < 0.0002
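The arithmetic of Example 1 can be checked with a short script. It uses only the numbers given in the example; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values from Example 1.
n, sample_mean, mu0, variance, alpha = 40, 27, 30, 20, 0.05

# Test statistic: Z = (x_bar - mu0) / (sigma / sqrt(n)).
z = (sample_mean - mu0) / sqrt(variance / n)

# Two-tailed critical value and P-value.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
p_two_sided = 2 * NormalDist().cdf(-abs(z))

print(round(z, 2))           # -4.24
print(abs(z) > z_crit)       # True: reject H0
print(p_two_sided < 0.0002)  # True
```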
Example 2
• Suppose that a random sample of 40 people gave a mean age of 27. If the population variance is 20, can we conclude that μ < 30? Take the significance level α = 5%.
• H0: μ = 30, HA: μ < 30
• Ztab = -1.645
• Zcal = (27 - 30) / √(20/40) = -4.24
• Zcal < Ztab, so we reject H0.
• P-value < 0.0001 this time, because it is a one-tailed test and not a two-tailed test.
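The one-tailed version of the test differs only in where the rejection region sits. The sketch below wraps the calculation in a hypothetical helper, `lower_tail_z_test`, and applies it to the numbers of Example 2.

```python
from math import sqrt
from statistics import NormalDist


def lower_tail_z_test(sample_mean, mu0, variance, n, alpha=0.05):
    """Left-tailed Z test of H0: mu = mu0 against HA: mu < mu0."""
    z_cal = (sample_mean - mu0) / sqrt(variance / n)
    z_tab = NormalDist().inv_cdf(alpha)  # all of alpha in the lower tail
    p = NormalDist().cdf(z_cal)          # one-tailed P-value
    return z_cal, z_tab, p, z_cal < z_tab


z_cal, z_tab, p, reject = lower_tail_z_test(27, 30, 20, 40)
print(round(z_cal, 2), round(z_tab, 3), reject)  # -4.24 -1.645 True
```

Because the whole of α is placed in one tail, the P-value here is half of the two-tailed value for the same Z.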
HT about a single population proportion
• H0: P = π0
• HA: P ≠ π0
• Test statistic: $z = \dfrac{p - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}$
• π0 is the hypothesized value of the population proportion in the null hypothesis, and p is the sample proportion.
Example
• Suppose 8.2% of patients over 40 years of age survive at least five years after diagnosis with lung cancer. In a sample of 52 persons under 40 who have been diagnosed with lung cancer, 11.5% survived five years. Can we conclude that the proportion of lung cancer patients surviving at least five years after diagnosis is the same among persons under 40? (α = 0.05)
• H0: P = 0.082
• HA: P ≠ 0.082
• With α = 0.05, the critical values of Z are -1.96 and +1.96. We reject H0 if Zcal < -1.96 or Zcal > +1.96.
• Zcal = (0.115 - 0.082) / √(0.082 × 0.918 / 52) ≈ 0.87
• Since Zcal lies between -1.96 and +1.96, we do not reject the null hypothesis.
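The proportion test can be sketched the same way. The code assumes the hypothesized proportion 0.082 stated in H0; `one_proportion_z` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(p_hat, p0, n):
    """Z statistic for H0: P = p0, using the standard error under H0."""
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)


z = one_proportion_z(p_hat=0.115, p0=0.082, n=52)
p_val = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed P-value

print(round(z, 2))     # 0.87
print(abs(z) > 1.96)   # False: do not reject H0
print(p_val > 0.05)    # True
```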
Chapter V - Demography
Introduction
• Demography is the science that studies human populations with respect to size, distribution, composition, and social mobility; how and why these features vary; and the effect of all these on health, social, ethical, and economic conditions.
• Size: the number of persons in the population at a given time.
• Distribution: the arrangement of the population across the territory of the nation by geographical area, residential area, climatic zone, etc.
• Composition (Structure): the distribution of a population into its various groupings, mainly by age and sex.
• Change: refers to the increase or decline of the total population or its components. The components of change are birth, death, and migration.
• Demography is concerned with this essential ‘numbering of the people’ and with understanding population dynamics: how populations change in response to the interplay between fertility, mortality, and migration.
• The health and health-care needs of a population cannot be measured or met without a knowledge of its size and characteristics.
• This understanding is a prerequisite for making the forecasts about future population size and structure which should underpin healthcare planning.
Uses of Demographic Data in Public Health
• Planning
– Health service provision
– Quantities of health personnel and health facilities
– Types of services
• Health indicators
Sources of Demographic Data
1. Census
• Periodic complete enumeration of a population
• A population census is taken to determine the size of the population of a country at a given date and to obtain statistical information on various demographic, economic and social characteristics of every individual in the population.
Characteristics of census
• Universality
• Simultaneity
• Individual enumeration
• There are two main schemes for enumerating a population in a census.
– De facto: the enumeration is done according to the place where each person is actually found on the day of the census
– De jure: the enumeration (or count) is done according to the usual or legal place of residence
Common errors in census data
• Omission and over-enumeration.
• Misreporting of age due to memory lapse, preference for terminal digits, and over- or underestimation.
• Overstating of status within an occupation.
• Under-reporting of births due to reference-period problems and memory lapse.
• Under-reporting of deaths due to memory lapse and a tendency not to report deaths, particularly infant deaths.
2. Sample survey
• From data obtained from a part of the population (the sample), a sample survey infers information valid for the whole population.
• A sample survey is a lighter operation than a census, needing less time, fewer hands and less funding.
• Its smaller size compared with a census allows collection of more in-depth information that can then be generalized.
3. Vital events registration
• Changes in population numbers take place every day. Additions are made by births or through new arrivals from outside the area. Reductions take place because of deaths, or through people leaving the area.
• It is an ongoing recording of vital events as they occur.
Population pyramid
• A population pyramid presents the population of an area or country in terms of its composition by age and sex at a point in time.
• By convention, males are shown on the left of the pyramid, females on the right, young persons at the bottom, and the elderly at the top.
Population pyramid
• The pyramid consists of a series of bars, each drawn proportionately to represent the relative contribution of each age-sex group (often in five-year groupings) to the total population.
• The shape of the pyramid reflects the major influences on births and deaths, plus any changes due to migration, over the three or four generations preceding the date of the pyramid.
Population Pyramid
• A triangular, broad-based pyramid reflects a high birth rate over a long period of time.
• Only a small proportion of persons have survived into the older age groups; as a result, the median age is relatively young.
[Figure: Population Pyramid, Ethiopia, 2007]
Demographic Transition
• Demographic transition is a term used to describe the major demographic trends of the past two centuries.
• The change in population basically consists of a shift from an equilibrium condition of high birth and death rates, characteristic of agrarian societies, to a newer equilibrium in which both birth and death rates are at much lower levels.
• Pre-transitional: characterized by high mortality and high fertility, with low (moderate) population growth (young population). Type I
• Transitional: characterized by high birth rate and reduced death rate, with high (rapid) growth rate (young population).
Demographic Transition
• Post-transitional: characterized by low birth and death rates with stable, moderate growth rate. Narrow-based pyramid with steeper sides.
• Nowadays, in developed nations we observe a decreasing birth rate with long survival (e.g., Japan).
Vital statistics
Dependency Ratio
• It describes the relation between the potentially self-supporting portion of the population and the dependent portions (young and aged) of the population.
Sex Ratio
• The sex ratio (Sx) at a given age x (or age group) is obtained by dividing the number of x-aged males by the number of x-aged females.
• Sx = (M/F) × 100, where M and F are the numbers of males and females, respectively (at age x for an age-specific ratio, or in total for the overall sex ratio).
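Both ratios reduce to simple divisions. In the sketch below, the dependency ratio is expressed per 100 persons of working age, with "young" taken as under 15 and "aged" as 65 and over; those cut-offs are a common convention assumed here, not stated in the handout, and all counts are made up for illustration.

```python
def sex_ratio(males, females):
    """Males per 100 females."""
    return 100 * males / females


def dependency_ratio(young, aged, working_age):
    """Dependants (young + aged) per 100 persons of working age.

    Assumed age groups: young = under 15, aged = 65 and over,
    working_age = 15 to 64.
    """
    return 100 * (young + aged) / working_age


# Hypothetical counts, for illustration only.
print(sex_ratio(males=4900, females=5000))
print(round(dependency_ratio(young=4500, aged=300, working_age=5200), 1))
```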
Measures of Fertility
• Crude birth rate
• General fertility rate
• Age specific fertility rate
• Total fertility rate
• Gross reproduction rate
• Net reproduction rate
1) Crude birth rate (CBR)
• Is the number of live births in a year per 1000 mid-year population in the same year.
$\text{CBR} = \dfrac{\text{Number of live births in a year}}{\text{Mid-year population of the same year}} \times 1000$
2) General fertility rate (GFR)
• Is the number of live births in a specified period per 1000 women aged 15-49 years.
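The two rates differ only in their denominators, as the following sketch shows. The function names and the counts are hypothetical.

```python
def crude_birth_rate(live_births, midyear_population):
    """Live births per 1000 mid-year population."""
    return live_births / midyear_population * 1000


def general_fertility_rate(live_births, women_15_49):
    """Live births per 1000 women aged 15-49."""
    return live_births / women_15_49 * 1000


# Hypothetical one-year counts for a small area.
births, population, women = 1150, 25000, 6000
print(round(crude_birth_rate(births, population), 1))
print(round(general_fertility_rate(births, women), 1))
```

The GFR is larger than the CBR for the same births because its denominator is restricted to the women who can actually give birth.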
3) Age specific fertility rate (ASFR)
• Because fertility varies within the childbearing years, demographers often measure fertility according to the age of the mother.
• ASFR is used to measure the reproductive performance of women of a given age, thus showing variation in fertility by age.
4) Total fertility rate (TFR)
• It estimates the total number of live births 1,000 women would have if they all lived through their entire reproductive period and were subject to a given set of ASFRs.
• It is the sum of the age-specific fertility rates for each year of age from 15 to 49 years.
• It is the average number of children that a synthetic (artificial) cohort of women would have at the end of reproduction, if there were no mortality among women of reproductive age; that is, each woman lives to 49 years of age, a reproductive span of about 35 years. (A cohort is a group of persons who share a common experience within a defined period.)
$\text{TFR} = \sum_{i=15}^{49} \dfrac{B_i}{P_i^f} \times 1000$ for a single-year age classification
$\text{TFR} = 5 \times \sum_{i=1}^{7} \dfrac{B_i}{P_i^f} \times 1000$ for a 5-year age classification
• Where $B_i = B_f + B_m$ = births of both sexes to mothers at age i, and $P_i^f$ = female population at age (age interval) i.
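The 5-year formula above can be sketched as follows. The births and female population counts are invented for illustration, and `total_fertility_rate` is a hypothetical helper name.

```python
def total_fertility_rate(births, women, width=5):
    """TFR per 1000 women from births B_i and female population P_i^f.

    births[i] and women[i] refer to the same age group; width is the
    number of years covered by each group (1 for single years of age).
    """
    if len(births) != len(women):
        raise ValueError("births and women must have the same length")
    return width * sum(b / w * 1000 for b, w in zip(births, women))


# Hypothetical data for the seven groups 15-19, 20-24, ..., 45-49.
births = [100, 250, 260, 200, 150, 60, 20]
women = [1000] * 7

tfr = total_fertility_rate(births, women)
print(round(tfr))            # per 1000 women
print(round(tfr / 1000, 1))  # children per woman
```

Multiplying by the group width accounts for each woman spending five years in every 5-year age group.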
5) Gross Reproduction Rate (GRR):
• Is the TFR restricted to only female births
• The GRR is a standardized rate similar to the TFR, except that it is the sum of age-specific rates that include only live female births in the numerators.
• GRR gives the average number of daughters a synthetic cohort (group) of women would have at the end of reproduction, in the absence of mortality.
• $\text{GRR} = \sum_i \dfrac{b_i(f)}{P_i} \times 1000$, summed over the reproductive age groups (and multiplied by 5 for a 5-year age classification, as for the TFR), where bi(f) indicates the births of baby girls in age group i and Pi represents the average female population of age group i.
• It shows how many baby girls (potential future mothers) would be born to 1000 women passing through their childbearing years. It is an indication of the extent to which women reproduce themselves during a generation, assuming no mortality.
• Let Bf = number of female births
• Let Bm+f = number of male and female births, i.e. total births
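The formula that uses Bf and Bm+f did not survive in the slide text. A common approximation, assumed here rather than taken from the handout, scales the TFR by the proportion of births that are female: GRR ≈ TFR × Bf / Bm+f. The sketch below uses hypothetical numbers.

```python
def gross_reproduction_rate(tfr, female_births, total_births):
    """Approximate GRR as TFR times the female share of all births.

    This is an assumed shortcut; it presumes the female share of births
    is the same at every age of the mother.
    """
    if not 0 <= female_births <= total_births or total_births == 0:
        raise ValueError("need 0 <= female_births <= total_births, total_births > 0")
    return tfr * female_births / total_births


# Hypothetical inputs: TFR of 5200 per 1000 women, 488 girls per 1000 births.
grr = gross_reproduction_rate(tfr=5200, female_births=488, total_births=1000)
print(round(grr, 1))  # daughters per 1000 women
```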
6) Net reproduction rate (NRR)
• The GRR measures the production of females.
• It makes no allowance for the fact that some women may die during the childbearing years.
• Thus, for a more accurate measure of the replacement of mothers by their daughters in the hypothetical cohort, we turn to the Net Reproduction Rate (NRR).
• NRR is the average number of daughters that would be born to a woman if she passed through her lifetime from birth to the end of her reproductive years conforming to the age-specific fertility and mortality rates of a given year.
• Replacement level fertility is said to have been reached when NRR = 1.0
– Surviving women in the hypothetical cohort have exactly enough daughters (on average) to replace themselves in the population
• NRR = 1: stationary population (i.e., 1 daughter per woman)
• NRR < 1: declining population
• NRR > 1: growing population
• The GRR and the NRR ask whether a given set of fertility rates implies that the population will grow, exactly replace itself, or decline.
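The handout gives no formula for the NRR. One common textbook formulation, assumed here, weights each female age-specific fertility rate by the proportion of women surviving from birth to that age group. All rates and survival proportions below are hypothetical.

```python
def net_reproduction_rate(female_asfr, survival, width=5):
    """NRR per woman under the assumed formulation.

    female_asfr[i]: daughters born per woman per year in age group i.
    survival[i]: proportion of women surviving from birth to age group i.
    """
    if len(female_asfr) != len(survival):
        raise ValueError("inputs must have the same length")
    return width * sum(f * s for f, s in zip(female_asfr, survival))


def interpret_nrr(nrr, tol=1e-9):
    """Classify the implied long-run trend of the population."""
    if abs(nrr - 1) <= tol:
        return "stationary"
    return "growing" if nrr > 1 else "declining"


# Hypothetical rates for the age groups 15-19, ..., 45-49.
female_asfr = [0.05, 0.12, 0.13, 0.10, 0.07, 0.03, 0.01]
survival = [0.90, 0.88, 0.86, 0.84, 0.82, 0.80, 0.78]

nrr = net_reproduction_rate(female_asfr, survival)
print(round(nrr, 3), interpret_nrr(nrr))
```

Setting every survival proportion to 1 removes mortality and gives the GRR, so the NRR can never exceed the GRR.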
Population Growth and Projection
• The rate of increase or decline of the population size by natural causes (births and deaths) can be estimated crudely by using the measures related to births and deaths.
• Population projection provides information on the future size and composition of the population of a given area.
• Based on this rate of increase (r), the population (Pt) of an area with a current population size of P0 can be projected to some time t in the short term.
• Geometric projection model: $P_t = P_0 (1 + r)^t$
• Exponential projection model: $P_t = P_0 e^{rt}$
• Doubling time: $t = \dfrac{\ln 2}{r} \approx \dfrac{70}{r \text{ (in percent)}}$
• For example, if the CBR = 46 and CDR = 18 per 1000 population, and the population size was 25,460 in 1998, then the crude rate of natural increase = 46 - 18 = 28 per 1000 = 2.8 percent per year.
• The estimated population in 2003, after 5 years, using the first formula will be P2003 = P1998 (1 + 0.028)^t = 25,460 (1 + 0.028)^5 = 25,460 (1.028)^5 = 25,460 (1.148) = 29,230
• The population of the area in 2003 will be about 29,230.
• For the above example, the doubling time (t) would be about 25 years.
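The worked example can be reproduced in Python. The geometric model matches the calculation above; the exponential model and the ln 2 / r doubling time are the standard forms and are assumed here, since the slide formulas did not survive extraction.

```python
from math import exp, log


def geometric_projection(p0, r, t):
    """P_t = P_0 * (1 + r)^t, with r as a proportion per year."""
    return p0 * (1 + r) ** t


def exponential_projection(p0, r, t):
    """P_t = P_0 * e^(r * t)."""
    return p0 * exp(r * t)


def doubling_time(r):
    """Years needed for the population to double at growth rate r."""
    return log(2) / r


# Values from the example: CBR = 46, CDR = 18 per 1000, P_1998 = 25,460.
r = (46 - 18) / 1000
print(round(geometric_projection(25460, r, 5)))  # 29230
print(round(doubling_time(r), 1))                # 24.8
```

The exponential model gives a slightly larger figure than the geometric one for the same r because it compounds growth continuously rather than once a year.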
The End
Thank You