Biostatistics Lecture Notes
1.0 BIOSTATISTICS
This topic serves as an introduction to the whole module. It describes the main areas covered
under the heading statistics in health and introduces the idea of statistical investigations.
A particular problem for health managers is that most decisions need to be taken in the light of
incomplete information. That is, not everything is known about current health processes, and
very little (if anything) will be known about future situations.
The techniques described in this module enable structures to be built up which help management
to alleviate this problem.
Statistics can be described as a collection of techniques and methods that may be used to solve
problems that arise when one wants to draw general conclusions from data from epidemiological
and other types of empirical studies.
Statistics is the science of collecting, summarising, presenting and interpreting data, and of using
them to test hypotheses. It involves:
(a) the selection, collection and organisation of basic facts into meaningful data, and then
(b) the summarising, presentation and analysis of data into useful information (Francis
2004:1)
The gap between facts as they are recorded anywhere in the health institution/environment and
information which is useful to management is usually a large one. (a) and (b) above describe the
processes that enable this gap to be bridged.
1.3 Probability can be thought of as the ability to attach limits to areas of uncertainty.
Health management decisions are based on numerous pieces of information obtained from
different sources. They may have used one, some or all of the techniques which have been
described as statistical methods or probability. What the decisions all have in common,
however, is that they are the final product of a general structure (or set of processes) known as
an investigation or survey.
Some significant factors are listed as follows:
a) Investigations can be fairly trivial affairs, such as looking at the patient attendance at
the health institution.
b) Investigations can be carried out in isolation or in conjunction with others.
c) Investigations can be regular (routine or ongoing) or ‘one off’.
d) Investigations are carried out on a population. A population is the entirety of people or
items (technically known as members) being considered.
Stages in an investigation
However small or large an investigation is, there are certain identifiable landmark
stages through which it should normally pass. These are listed as follows:
There are different methods used in choosing the subjects for an investigation and the different
ways of collecting data.
1.6 Data refers to facts, opinions and statistics that have been collected together and recorded
for reference or for analysis (refer to Francis, Introduction to Computers).
There are two main sources of data, namely primary data and secondary data.
1.7 Primary data is the name given to data that are used for the specific purpose for which
they were collected. (E.g. Censuses and samples)
1.8 Secondary data is the name given to data that are being used for some purpose other
than that for which they were originally collected.
1.9 Censuses
A census is the name given to a survey which examines every member of a population.
The Zambian government statistical office carries out many official censuses. Some of these are:
1. A population census, taken every ten years, obtaining information such as age, sex,
relationship to head of household, occupation, education, and number of rooms in the place of
dwelling for the whole population of Zambia
2. Demographic Health Surveys (DHS)
3. Economic surveys, etc.
A census has the advantage of completeness and of being accepted as representative, but has the
disadvantages of cost and delay: the collection and release of results can take up to two years.
1.10 Samples
In practice, due to various limitations including time and financial resources, most of the
information obtained by organisations about any population will come from examining a
small, representative subset of the population. This is called a sample.
The advantages of sampling are usually lower cost, time and resources; results can be available
within a short time.
A general disadvantage is a natural reluctance by the lay person to accept the results as
representative. Other disadvantages depend on the particular method of sampling used.
1.11 Bias
a) Selection bias. This can occur if the sample is not truly representative of the population.
b) Structure and wording bias. This can result from badly worded questions.
c) Interviewer bias. If the subjects of an investigation are personally interviewed, the
interviewer might project biased opinions or an attitude that does not gain the full
cooperation of the subjects.
d) Recording bias. This can result from badly recorded answers or clerical errors made by
an untrained workforce.
Certain sampling methods require each member of the population under consideration to be
known and identifiable. The structure which supports this identification is called a sampling
frame.
The sampling techniques most commonly used in health surveys can be split into three
categories.
a) Random sampling. This ensures that each and every member of the population under
consideration has an equal chance of being selected as part of the sample.
Two types of random sampling used are:
(i) Simple random
(ii) Stratified (random) sampling
b) Quasi-random sampling (‘quasi’ means ‘almost’ or ‘nearly’). This type of technique, while
not satisfying the criterion given in (a) above, is generally thought to be as representative
as random sampling under certain conditions. It is used when random sampling is either
not possible or too expensive to consider.
Two types that are commonly used are :
(i) Systematic sampling
(ii) Multistage sampling
c) Non-random sampling. This is used when neither of the above techniques is possible or
practical. Two well-used types are:
(i) Cluster sampling
(ii) Quota sampling.
The two categories above that involve randomness (random and quasi-random sampling)
normally require the use of random sampling numbers. These consist of the ten digits 0 to 9,
generated in a random fashion (normally by a computer; note that some scientific calculators
also perform this function) and arranged in groups for reading convenience. The term ‘generated
in a random fashion’ can be interpreted as ‘the chance of any one digit occurring in any position
in the table is no more or less than the chance of any other digit occurring’.
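A table of random sampling numbers of the kind described above can be sketched in Python. This is an illustration, not part of the original notes; the function name and the group size of 5 are assumptions chosen for readability.

```python
import random

def random_digit_groups(n_groups, group_size=5, seed=None):
    """Generate groups of random digits (0-9), as used in
    tables of random sampling numbers."""
    rng = random.Random(seed)
    return [
        "".join(str(rng.randint(0, 9)) for _ in range(group_size))
        for _ in range(n_groups)
    ]

# e.g. a small table of six 5-digit groups
table = random_digit_groups(6, seed=42)
print(" ".join(table))
```

Each digit is drawn independently and uniformly from 0 to 9, which matches the ‘generated in a random fashion’ interpretation given above.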
There is no universal formula for calculating the size of a sample. However, as a starting point,
there are two facts that are well known from statistical theory and should be remembered.
1. The larger the size of sample, the more precise will be the information given about the
population.
2. Above a certain size, little extra information is given by increasing the size. This entails
that a sample need only be large enough to be reasonably representative of the
population.
Data collection is the means by which information is obtained from the selected subjects of an
investigation. There are several methods, the common ones being:
(a) Individual (personal) interview. This method is probably the most expensive, but has the
advantage of completeness and accuracy.
Other factors involved are
i. interviewers need to be trained;
ii. interviews need arranging
iii. can be used to advantage for pilot surveys, since questions can be thoroughly
tested;
iv. uniformity of approach if only one interviewer is used;
v. an interviewer can see or sense if a question has not been fully understood and it
can be followed up on the spot
vi. this form of data can be used in conjunction with random or quasi random
sampling.
b) Postal questionnaire
c) Street(informal) interview
d) Telephone interview
e) Direct observation. This is normally considered to be the most accurate form of data
collection, but it is very labour-intensive and cannot be used in many situations.
Andre Francis (2004:16) states that “If a questionnaire is used in a statistical survey, its
design requires careful consideration. A badly designed questionnaire can cause many
administrative problems and may cause incorrect deductions to be made from statistical
analyses of the results. One major reason why pilot surveys are carried out is to check
typical responses to the questions. Some important factors in the design of questionnaires
include the following:
A useful check on the adequacy of the design of a questionnaire can be given by conducting a
pilot survey”.
1.18 The use of secondary data
Secondary data are typically used when:
a) the time, manpower and resources necessary for your own survey are not available, or
b) the data already exist and provide most, if not all, of the information required.
The advantages of using secondary data are saving on time, manpower and resources in sampling
and data collection. In other words, somebody else has done the “spade work” already.
Surname, First name (Year), Title of Article (2nd Edition) Publisher; Place of Publication
The disadvantages of using secondary data can be formidable, and careful examination of the
sources of data is essential. Problems include the following:
Internal
a) patient file
b) patient registers
c) information on raw materials for the institution
External
Without doubt the most important external secondary data sources are official statistics
supplied by the central statistical office and other government departments. These include:
This topic is concerned with the forms that data can take, how data are measured and the
errors and approximations that are often made in their description.
Data classification
Continuous data: data that cannot be measured precisely; their values can only be
approximated to, e.g. dimensions (lengths, heights), weights, areas, volumes, temperatures and times.
Unpredicted errors
An interval (range)
Rounding errors
Errors in expressions
Error rules: the range of error for the addition, subtraction, multiplication and division of two
numbers, each of which is subject to error, is given by the following.
Range of error
If the value of the number X lies in the range [a, b] and Y lies in the range [c, d] then:
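The rules themselves are the standard interval-arithmetic results, which can be sketched in Python as an illustration (the function names are assumptions; each rule is stated in a comment):

```python
def add_intervals(x, y):
    # If X is in [a, b] and Y is in [c, d], then X + Y is in [a + c, b + d]
    (a, b), (c, d) = x, y
    return (a + c, b + d)

def sub_intervals(x, y):
    # X - Y is in [a - d, b - c]
    (a, b), (c, d) = x, y
    return (a - d, b - c)

def mul_intervals(x, y):
    # X * Y: take the min and max over all four "corner" products
    (a, b), (c, d) = x, y
    products = [a * c, a * d, b * c, b * d]
    return (min(products), max(products))

def div_intervals(x, y):
    # X / Y (assuming 0 does not lie inside [c, d])
    (a, b), (c, d) = x, y
    quotients = [a / c, a / d, b / c, b / d]
    return (min(quotients), max(quotients))

# e.g. a length measured as 10 +/- 1 added to one measured as 5 +/- 0.5
print(add_intervals((9, 11), (4.5, 5.5)))   # (13.5, 16.5)
```

The result of each operation is the tightest interval guaranteed to contain the true value, given that each operand is only known to lie within its stated range.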
This topic deals with the organisation and presentation of numeric (univariate) data. It
describes how numeric data can be organised into frequency distributions of various types,
their graphical presentation and their interpretation.
Raw statistical data: Before the data obtained from a statistical survey or investigations have
been worked on they are referred to as raw data.
Data arrays. One way of extracting some information from data is by arranging them into
size order. This is known as a data array.
Simple frequency distributions: a simple frequency distribution consists of a list of data
values, each showing the number of items having that value (called the frequency).
Tally chart is a chart made of rows and columns for data value, tally marks and the total.
This information may then be recorded in a frequency distribution table.
A grouped frequency distribution summarises data into groups of values, each showing the
number of items having values in that group, e.g. hospital attendance by age group:

Age group   Attendance
0-1         200
2-5         100
6-16        50
20-25       20
Note: individual data values cannot be identified with this type of structure.
a) All data values being represented must be contained within exactly one class;
overlapping classes must not occur.
b) The classes of the distribution must be arranged in size order.
c) There should normally be 8 to 10 classes in total, with not fewer than 5 or more than 15.
d) Class descriptions should be easy to interpret, with ranges that naturally describe the data
being presented.
As a rule of thumb, divide the range by 10 and adjust the figure upward to obtain a standard class width.
Tally
Frequency table
Histograms
The types of diagrams described in this topic include pictograms, various types of bar charts, pie
charts, line and strata graphs, Z-charts, Gantt charts and semi-logarithmic graphs. Since all the
above graphs and charts display either non-numeric frequency distributions or time series, the
topic begins by describing these types of data structure.
Non-numeric frequency distributions take a similar form to their numeric counterparts, except
that the groups (or classes) describe qualitative (i.e. non-numeric) characteristics.
Time series
The name given to data describing the values of some variable over successive time periods is a
time series, e.g. by year or month.
Pictograms
Pie charts
A pie chart shows the totality of the data being represented using a single circle (a ‘pie’). The
circle is split into sectors, the size of each being drawn in proportion to the class frequency. Each
sector can be shaded or coloured differently if desired.
Line diagrams
A line diagram plots the values of a time series as a sequence of points joined by straight lines
Others
Strata charts
ACTIVITY 1
Define statistics and state the importance of statistics in the health sector.
Descriptive methods only describe variations in data. They do not solve problems that were raised
in connection with the examples. Descriptive statistics enable you to describe (and compare)
variables numerically. Statistics to describe a variable focus on two aspects:
a) Central tendency
The three ways of measuring central tendency most used in research are the mean, the
median and the mode.
b) Dispersion
As well as describing the central tendency of a variable, it is important to describe how the data
values are dispersed around the central tendency. This is only possible for quantifiable data. Two
of the most frequently used ways of describing the dispersion are:
- the extent to which values differ from the mean (the standard deviation) (Saunders, pp. 351-367)
From the foregoing we have the data before us; our next step, from the statistical point of view, is
to deal with basic analysis of univariate data (data obtained from measuring just one
attribute). ‘Statistical measures’ is the name given to this type of analysis, the measures
themselves being split into various groups.
Measures of location: commonly called averages, these are the most well-known measures of
numeric data.
Measures of dispersion: describe how spread out a set (or distribution) of values is.
Measures of skewness: show how evenly a set of items is distributed; these, together with
various measures of dispersion, are dealt with later.
That is, the arithmetic mean = the sum of all the values / the number of values.
Where n is the number of items in the set, the sum
x1 + x2 + x3 + x4 + ... + xn
is written as ∑x.
∑ is the Greek capital S (for ‘sum’), and ∑x can simply be translated as ‘add up all the x
values under consideration’. For the example above:
∑x = 4 + 5 + 6 + 12 + 2 = 29
so that the arithmetic mean is
x̄ = (x1 + x2 + ... + xn)/n = ∑x/n = 29/5 = 5.8
x     10   12   13    14   16   19
f      2    8   17     5    1    1
fx    20   96   221   70   16   19

n = ∑f = 34,  ∑fx = 442

The mean of a frequency distribution is sometimes called a weighted mean, the frequencies
acting as weights:
x̄ = ∑fx / ∑f = 442/34 = 13.0
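The calculation x̄ = ∑fx/∑f for the table above can be sketched in Python (an illustration, not part of the original notes; the x and f values are taken from the table):

```python
x = [10, 12, 13, 14, 16, 19]   # the distinct data values
f = [2, 8, 17, 5, 1, 1]        # their frequencies

sum_f = sum(f)                                  # n = sum of f = 34
sum_fx = sum(fi * xi for fi, xi in zip(f, x))   # sum of fx = 442
mean = sum_fx / sum_f
print(sum_fx, sum_f, mean)                      # 442 34 13.0
```

Each fx term weights a value by how many times it occurs, so dividing by ∑f gives the same answer as averaging the full list of 34 raw values.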
Summary
Activity
What is a measure of location?
How is arithmetic mean defined?
Why is the special notation x1 + x2 + ... + xn used?
What does ∑fx mean?
Why is the formula for the arithmetic mean of a frequency distribution different from that
for the mean of a set?
2.2 Median
The median of a set of data is the value of the item which lies exactly halfway along the set
(arranged into size order).
For a set with an odd number (n) of items, the median can be precisely identified as the value of
the (n+1)/2 th item. Thus in a size-ordered set of 15 items, the median would be the
(15+1)/2 th = 8th item along.
Median for a simple frequency distribution
x     10   12   13   14   16   19
f      2    8   17    5    1    1

Step 1  Form the cumulative frequency column, F.
Step 2  Calculate (∑f + 1)/2 = (34 + 1)/2 = 17.5

x     f     F
10    2     2
12    8     10
13    17    27
14    5     32
16    1     33
19    1     34
      34 = ∑f = n

Step 3  Find the first F value that exceeds (∑f + 1)/2 = 17.5; here this is F = 27.
Step 4  The median is the x-value corresponding to the F value identified in Step 3:
median = 13.
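The cumulative-frequency steps above can be sketched in Python (an illustration, not part of the original notes):

```python
x = [10, 12, 13, 14, 16, 19]
f = [2, 8, 17, 5, 1, 1]

n = sum(f)                 # 34
position = (n + 1) / 2     # 17.5

# accumulate frequencies (the F column) until the running total
# first reaches or exceeds the median position
F = 0
for xi, fi in zip(x, f):
    F += fi
    if F >= position:
        median = xi
        break
print(median)   # 13
```

The loop stops at x = 13, whose cumulative frequency (27) is the first to exceed 17.5, agreeing with Steps 3 and 4 above.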
Activity
2.3 Mode
The mode of a set of data is that value which occurs most often or, equivalently, has the largest
frequency.
The mode of the set 2, 1, 3, 3, 1, 1, 2, 4 is 1, since that value occurs most often (three times).
For a simple frequency distribution, the mode is the x-value with the largest frequency. For example:

x    4   5   6    7    8   9   10
f    2   5   21   18   9   2   1

Here the mode is 6, since it has the largest frequency (21).
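Both mode calculations can be sketched in Python (an illustration, not part of the original notes):

```python
from collections import Counter

# mode of a raw set: the value with the highest count
data = [2, 1, 3, 3, 1, 1, 2, 4]
mode_of_set = Counter(data).most_common(1)[0][0]
print(mode_of_set)   # 1

# mode of a simple frequency distribution: the x-value
# whose frequency is largest
x = [4, 5, 6, 7, 8, 9, 10]
f = [2, 5, 21, 18, 9, 2, 1]
mode_of_dist = x[f.index(max(f))]
print(mode_of_dist)  # 6
```

Counter does the tallying that a tally chart does by hand; for the frequency distribution the tallying has already been done, so the mode is simply read off against the largest f.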
Measurement of dispersion
Measures of dispersion describe how spread out or scattered a set or distribution of numeric
data is. There are different bases on which the spread of data can be measured:
a) Spread about the mean. This is concerned with measuring the distance between the
items and their common mean. Two measures of this type are used:
i. the mean deviation
ii. the standard deviation (a measure that is particularly important and widely used)
b) Central percentage spread of items. These measures have links with the median:
i. the 10-to-90 percentile range
ii. the quartile deviation
c) Overall spread of items. This measure is called the range.
2.4 Range
The range is defined as the numerical difference between the smallest and largest values of the
items in a set or distribution. For example, for a set whose smallest value is 2 and largest value is 9:
Range = 9 − 2 = 7
Characteristics of a range
The mean deviation is a measure of dispersion that gives the average absolute difference (i.e.
ignoring ‘minus’ signs) between each item and the mean.
The mean deviation (md) is calculated using the formula:
md = ∑|x − x̄| / n        (for a set)
md = ∑f|x − x̄| / ∑f      (for a frequency distribution)
NOTE: the modulus symbol |…| means ‘the absolute value of’ and simply ignores the sign of
the expression inside it,
e.g. |−6| = |6| = 6
Example The mean deviation for a set
Question
Calculate the mean deviation of 43, 75, 48, 39, 51, 47, 50, 47
Answer
The mean is x̄ = (43 + 75 + 48 + 39 + 51 + 47 + 50 + 47)/8 = 400/8 = 50.
md = ∑|x − x̄| / n = (7 + 25 + 2 + 11 + 1 + 3 + 0 + 3)/8 = 52/8
= 6.5
In other words, each value in the set is, on average, 6.5 units away from the common mean.
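The same calculation can be sketched in Python (an illustration, not part of the original notes):

```python
data = [43, 75, 48, 39, 51, 47, 50, 47]

mean = sum(data) / len(data)                 # 400 / 8 = 50.0
abs_devs = [abs(xi - mean) for xi in data]   # |x - mean| for each item
md = sum(abs_devs) / len(data)               # mean deviation
print(mean, md)                              # 50.0 6.5
```

The abs() call plays the role of the modulus signs: each deviation contributes its size, never its sign.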
Question
The data in table 1 following relates to the number of successful sales made by the salesmen
employed by a large microcomputer firm in a particular quarter.
Calculate the mean and the mean deviation of the number of sales.
Answer:
The standard layout and calculations are shown in Table 2. The mean is calculated first, then used
to find the mean deviation.

Sales      f     x     fx     |x − x̄|    f|x − x̄|
0 to 4     1     2     2      13.3       13.3
5 to 9     14    7     98     8.3        116.2
10 to 14   23    12    276    3.3        75.9
15 to 19   21    17    357    1.7        35.7
20 to 24   15    22    330    6.7        100.5
25 to 29   6     27    162    11.7       70.2
Totals     80          1225              411.8

x̄ = ∑fx / ∑f = 1225/80 = 15.3 (1D)

For example, for the 20-24 group, |x − x̄| = |22 − 15.3| = |6.7| = 6.7.

md = ∑f|x − x̄| / ∑f
= [1(13.3) + 14(8.3) + 23(3.3) + 21(1.7) + 15(6.7) + 6(11.7)] / 80
= 411.8/80
= 5.1 sales
a) The mean deviation can be regarded as a good representative measure of dispersion that
is not difficult to understand. It is useful for comparing the variability between
distributions of like nature.
b) Its practical disadvantage is that it can be complicated and awkward to calculate if the
mean is anything other than a whole number.
c) Because of the modulus sign, the mean deviation is virtually impossible to handle
theoretically and thus not used in more advanced analysis.
The standard deviation is defined as ‘the root of the mean of the squares of the deviations from the
common mean’ of the set of values.
A procedure for calculating the standard deviation is described and at the same time,
demonstrated using the set of values, 2, 4, 6 and 8.
It is calculated as follows:
STEP 1  Calculate the mean:
x̄ = (2 + 4 + 6 + 8)/4 = 20/4 = 5
STEP 2  Find the sum of the squares of the deviations of the items from the mean:
∑(x − x̄)² = (2−5)² + (4−5)² + (6−5)² + (8−5)² = 9 + 1 + 1 + 9 = 20
STEP 3  Divide this sum by the number of items and take the square root:
s = √(∑(x − x̄)²/n) = √(20/4) = √5 = 2.24 (2D)
(For a small sample, say of size less than 30, the sum is divided by n − 1 rather than n.)
An equivalent ‘computational’ formula, which is usually easier to work with, is:
s = √(∑x²/n − x̄²)
For the set above, ∑x² = 4 + 16 + 36 + 64 = 120, and 120/4 = 30, so
s = √(30 − 5²) = √5 = 2.24, as before.
For a frequency distribution the corresponding formula is:
s = √( ∑fx²/∑f − (∑fx/∑f)² )
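Both forms of the formula can be sketched in Python for the set 2, 4, 6, 8 (an illustration, not part of the original notes):

```python
data = [2, 4, 6, 8]
n = len(data)

# definition formula: s = sqrt( sum((x - mean)^2) / n )
mean = sum(data) / n                               # 5.0
sum_sq_dev = sum((xi - mean) ** 2 for xi in data)  # 9 + 1 + 1 + 9 = 20
s_def = (sum_sq_dev / n) ** 0.5

# computational formula: s = sqrt( sum(x^2)/n - mean^2 )
sum_x2 = sum(xi ** 2 for xi in data)               # 120
s_comp = (sum_x2 / n - mean ** 2) ** 0.5

print(round(s_def, 2), round(s_comp, 2))           # 2.24 2.24
```

The two formulas are algebraically identical; the computational form avoids subtracting the mean from every item, which is why it is preferred for hand or calculator work.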
Question
The data in Table 2 bellow relates to the number of successful sales made by the salesmen
employed by a large microcomputer firm in a particular quarter. Calculate the mean and standard
deviation of the number of sales.
Answer
The standard layout and calculations are shown in table 3 and the subsequent text.
Table 2 (the given data):

Sales      f
0 to 4     1
5 to 9     14
10 to 14   23
15 to 19   21
20 to 24   15
25 to 29   6

Table 3 (the working):

Sales      f     x     fx     fx²
0 to 4     1     2     2      4
5 to 9     14    7     98     686
10 to 14   23    12    276    3312
15 to 19   21    17    357    6069
20 to 24   15    22    330    7260
25 to 29   6     27    162    4374
Totals     80          1225   21705

x̄ = ∑fx / ∑f = 1225/80 = 15.3 (1D)

s = √( ∑fx²/∑f − (∑fx/∑f)² )
= √(21705/80 − (1225/80)²)
= √(271.31 − 234.47)
= √36.84
= 6.1 sales
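The grouped-data calculation can be sketched in Python using the class mid-points (an illustration, not part of the original notes):

```python
# class mid-points and frequencies from the sales table
x = [2, 7, 12, 17, 22, 27]
f = [1, 14, 23, 21, 15, 6]

sum_f = sum(f)                                        # 80
sum_fx = sum(fi * xi for fi, xi in zip(f, x))         # 1225
sum_fx2 = sum(fi * xi ** 2 for fi, xi in zip(f, x))   # 21705

mean = sum_fx / sum_f                                 # 15.3125
s = (sum_fx2 / sum_f - mean ** 2) ** 0.5              # standard deviation
print(round(mean, 1), round(s, 1))                    # 15.3 6.1
```

Note that, as for the mean, the result is only an estimate: each item is represented by the mid-point of its class.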
b) Use of a calculator. When calculating (and at the same time accumulating) the values of fx
and fx², use the type of technique described in the previous example. It will ensure the
least number of keystrokes for all the information required in the table. For example, in
the case of the fx values, the calculator procedure is:
1 × 2 M+ and write down the value (2)
14 × 7 M+ and write down the value (98)
… etc., down to:
6 × 27 M+ and write down the value (162)
RM gives the value of ∑fx (1225)
It is sometimes necessary to compare two different distributions with regard to variability. For
example, if two machines were engaged in the production of identical components, it would be of
considerable value to compare the variation of certain critical dimension of their output.
However, the standard deviation is used as a measure for comparison only when the units in the
distribution are the same and the respective means are roughly comparable.
In the majority of cases where distributions need to be compared with respect to variability, the
following measure, known as the coefficient of variation, is much more appropriate and is
considered the standard measure of relative variation.
Coefficient of variation: cv = (standard deviation / mean) × 100%
In words, the coefficient of variation calculates the standard deviation as a percentage of the
mean.
Since the standard deviation is being divided by the mean, the actual units of measurements
cancel each other out, leaving the measure unit free and thus very useful for relative
comparison.
Over a period of three months the daily number of components produced by two comparable
machines was measured, giving the following statistics:
Machine A: cv = 8.4%
Machine B: cv = 8.2%
Thus, although the standard deviation for machine B is higher in absolute terms, the dispersion
for machine A is higher in relative terms.
Skewness was described in the previous chapter, where it was shown that the degree of skewness
could be measured by the difference between the mean and the mode. However, for most
practical purposes it is usual to require a measure of skewness to be unit free (i.e. a coefficient),
and the following expression, known as Pearson’s measure of skewness (Psk), is of this type:
Psk = (mean − mode) / standard deviation
or, using the approximation mean − mode ≈ 3 × (mean − median),
Psk = 3 × (mean − median) / standard deviation
Thus the skewness of two different sets of employee remuneration can be compared if, perhaps,
one is given in terms of weekly wages and the other in terms of annual salary. Note that:
The greater the value of Psk (positive or negative), the more the distribution is skewed.
Question
The data in Table 2 bellow relates to the number of successful sales made by the salesmen
employed by a large microcomputer firm in a particular quarter. Calculate the mean and standard
deviation of the number of sales.
The mean and standard deviation (earlier calculated) are 15.3 and 6.1 respectively.
Estimate the value of the mode and thus calculate Pearson’s measure of skewness.
Answer
The modal class is 10-14 (since it has the largest frequency), with a lower class bound of
L = 9.5. The differences between the largest frequency and those on either side are
D1 = 23 − 14 = 9 and D2 = 23 − 21 = 2, and the modal class width is w = 5 (i.e. the values
10, 11, 12, 13 and 14). The mode can then be estimated as:
mode = L + D1/(D1 + D2) × w = 9.5 + 9/11 × 5 = 13.6 (1D)
Psk = (mean − mode) / standard deviation
= (15.3 − 13.6) / 6.1
= +0.28 (2D)
This value of Psk demonstrates a small degree of right skew, which can be confirmed by
inspecting the given frequency distribution table.
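The mode estimate and Pearson's measure can be sketched in Python (an illustration, not part of the original notes; the mean and standard deviation are taken from the earlier calculations):

```python
# grouped-data mode estimate: mode = L + D1 / (D1 + D2) * w
L = 9.5         # lower bound of the modal class (10-14)
D1 = 23 - 14    # modal frequency minus frequency of the class below
D2 = 23 - 21    # modal frequency minus frequency of the class above
w = 5           # modal class width

mode = L + D1 / (D1 + D2) * w            # about 13.6
mean, sd = 15.3, 6.1                     # from the earlier calculations

psk = (mean - mode) / sd                 # Pearson's measure of skewness
print(round(mode, 1), round(psk, 2))     # 13.6 0.28
```

A positive psk indicates right skew, agreeing with the conclusion above.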
Summary
a) The standard deviation of a set of values is defined as ‘the root of the mean of the
squared deviations from the mean’.
b) This most important measure of dispersion is always paired with the arithmetic mean
because:
i. it is defined in terms of the mean
ii. it is relatively easy to handle theoretically
iii. both measures are needed when dealing with normal distributions
c) The standard deviation is usually calculated using the ‘computational’ formula.
d) For distributions that are approximately normal the standard deviation should cover
approximately one-sixth of the range.
e) The coefficient of variation is an alternative (unit free) measure to the standard deviation
when comparing distributions. It is found by calculating the standard deviation as a
percentage of the mean.
A quartile is the name given to an item that lies a certain proportion of the way along a size-ordered set
or distribution. Quartiles are normally considered in groups, the members of which split the
data set into equal portions.
a) The median, as we already know, is the middle item of a size-ordered set. Thus, by
definition, it can be said to split a set (or frequency distribution) into two equal portions.
In other words, it is a particular example of a quartile.
b) Quartiles are best suited to types of business data that:
(i) Are particularly susceptible to extremes: wages of employees; turnover of
companies; value of customer orders; in fact, any distribution that has at least a
moderate amount of skew.
(ii) Have distributions that have either open-ended classes or data that are difficult,
expensive or impossible to obtain at extremes.
Remember that this is just the type of data that was described in chapter 7 as being best suited for
analysis using the median.
c) A particular problem at this stage is the fact that no measure of dispersion so far
introduced can be paired naturally with the median, in order to represent data of the type
described above.
Both the mean deviation and the standard deviation are defined in terms of the mean and thus
unsuitable.
Thus it is necessary to find measures of dispersion and skewness, based on the idea of splitting
size –ordered data into equal portions, that will naturally partner the median. Such a measure, the
quartile deviation, is developed in the following sections.
Quartiles
A (size-ordered) set of data can be split up into four equal portions. The three values that do this,
lying respectively one-quarter, one-half, and three-quarters of the way along the set, are known as
‘quartiles’.
For example the set 7,4,5,3,3,9,8 can be size-ordered and the quartiles identified as follows:
3,3, 4,5,7,8,9
For this particular set of data, the quartiles are easily identified as 3, 5 and 8 respectively and are
shown above. (Note that the middle quartile is the median.)
To summarise:
The three quartiles of an ordered set or frequency distribution are those values that lie one-
quarter, one-half and three-quarters of the way along the group, and are respectively called, the
lower, middle and upper quartiles (Q1, Q2 and Q3).
The median is the middle quartile (Q2).
For an ordered set of data, just as the median can be identified as the value of the (n+1)/2 th item,
the other two quartiles can be identified as the values of the (n+1)/4 th item (lower quartile, Q1)
and the 3(n+1)/4 th item (upper quartile, Q3).
Note that although the median is, by definition, the middle quartile, the term ‘quartiles’ is often
used to describe only the lower and the upper quartiles, Q1 and Q3 respectively.
The identification of the quartiles enables the measure of dispersion to be defined. This is known
as the quartile deviation and is defined as half the range of the middle 50% of items (i.e. the
difference between the lower and the upper quartiles divided by 2). It is thus sometimes referred
to as the ‘semi-interquartile range’.
To summarise:
qd = (Q3 − Q1) / 2
It was stressed earlier that, because of the importance of the median, there is a need for a measure
of dispersion to pair with it. This measure is the quartile deviation. Q3-Q1 gives what is called the
inter quartile range which is the range covered by the central 50% of items, and dividing by 2
gives (what can only be described as) the average distance between the median and the quartiles.
Thus, approximately, it can be considered that approximately 50% of all items lie within one
quartile deviation either side of the median. Generally, from now on the median and quartile
deviation will be calculated and referred to as a linked pair.
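The quartiles and the quartile deviation for the earlier set 7, 4, 5, 3, 3, 9, 8 can be sketched in Python (an illustration, not part of the original notes):

```python
data = sorted([7, 4, 5, 3, 3, 9, 8])   # -> [3, 3, 4, 5, 7, 8, 9]
n = len(data)

# for an ordered set, Q1, Q2 and Q3 lie at positions (n+1)/4,
# (n+1)/2 and 3(n+1)/4 (here n = 7, so positions 2, 4 and 6)
q1 = data[(n + 1) // 4 - 1]
q2 = data[(n + 1) // 2 - 1]      # the median
q3 = data[3 * (n + 1) // 4 - 1]

qd = (q3 - q1) / 2               # the quartile deviation
print(q1, q2, q3, qd)            # 3 5 8 2.5
```

For n = 7 the quartile positions are whole numbers; for other set sizes interpolation between adjacent items would be needed.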
a) The quartiles have the property that they split a distribution into four equal segments,
which means effectively that the area under the frequency curve is divided into four equal
parts.
[Figure: a frequency curve split by Q1, the median (Q2) and Q3 into four equal areas of 25% each]
b) When calculating the median and quartiles for a simple discrete frequency
distribution (consisting of discrete values which are not grouped), the technique used is
essentially the same as that for a set. However, the form of the data requires the same
approach as that used for the calculation of the median (chapter 7, sections 5 and 6).
Example 3, which follows, demonstrates this technique for the calculation of the median
and quartiles.
c) For a grouped frequency distribution, the quartiles can only be estimated (as was the case
for the mean, standard deviation and median). The technique normally used is graphical.
The procedure is described in section 11 and demonstrated in example 4.
The normal distribution is the name given to a type of distribution of continuous data that
occurs frequently in practice. It is a distribution of ‘natural phenomena’, such as heights and
weights. Its main features are:
a) It has a (frequency) curve that is symmetric about the mean of the distribution. In other words,
one half of the curve is a mirror image of the other.
b) The majority of the values tend to cluster about the mean, with the greatest frequency at
the mean itself.
c) The frequencies of the values taper away the further they are from the mean.
A confidence interval is an estimated range of values about a point estimate that indicates the
degree of statistical precision that describes the estimate. The level of confidence is set
arbitrarily, but for any given level of confidence the width of the interval expresses the
precision of the measurement: a wider interval implies less precision, and a narrower interval
implies more precision. The upper and lower boundaries of the interval are the confidence
limits. (K. J. Rothman, 1998:124)
The study of normal distributions helps in the process of estimating certain population
measures based only on the results of small samples. Two measures of interest (for the
syllabuses covered in this text) are the mean of a population and a population
proportion.
Confidence Limits specify a range of values within which some unknown population
value (mean or proportion) lies with a stated degree of confidence. They are always based
on the results of a sample.
The following statement is typical of what needs to be able to be calculated and
understood:
‘95% confidence limits for the lead time of a particular type of stock item order are 4.1 to
7.4 working days’
The above would have been calculated on the basis of a representative sample, and states
that there is a 95% probability that the true (unknown) mean lead time lies
between 4.1 and 7.4 working days.
The following gives a technique for evaluating a confidence interval for an unknown
population mean.
Confidence interval for the mean
Given a random sample from some population, a confidence interval (CI) for the unknown
population mean is:
x̄ ± z·s/√n
Where: x̄ = the sample mean
s = the sample standard deviation
n = the sample size
z = the confidence factor (1.64 for 90%: 1.96 for 95%: 2.58 for
99%).
Note: s/√n is known as the standard error (se) of the mean
As an example of the use of these confidence limits, suppose a sample of 100 invoices yielded a
mean gross value of $45.50 and a standard deviation of $3.24. The 95% confidence interval is:
45.50 ± (1.96)(3.24/√100)
=(44.865, 46.135)
In words, there is a 95% probability that the mean of the complete population of invoices (from
which the sample was taken) is between 44.9 and 46.1.
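The invoice example can be sketched in Python (an illustration, not part of the original notes; the function name is an assumption):

```python
import math

def mean_ci(xbar, s, n, z=1.96):
    """Confidence interval for the mean: xbar +/- z * s / sqrt(n)."""
    se = s / math.sqrt(n)        # the standard error of the mean
    return (xbar - z * se, xbar + z * se)

low, high = mean_ci(45.50, 3.24, 100)   # the invoice example, 95% level
print(round(low, 3), round(high, 3))    # 44.865 46.135
```

Swapping z = 1.96 for 1.64 or 2.58 would give the 90% or 99% limits, as in the table of confidence factors above.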
The following gives the technique for evaluating the confidence interval for unknown population
proportion.
Given a random sample from some population, a confidence interval (CI) for the unknown
population proportion is:
p ± z√(p(1 − p)/n)
where: p = the sample proportion
n = the sample size
z = the confidence factor (1.64 for 90%; 1.96 for 95%; 2.58 for 99%).
As an example of the use of these confidence limits, suppose 4 faulty components are discovered in
a random sample of 20 finished components taken from a production line. What statement can we
make about the defective rate of all finished components?
p = 4/20 = 0.2 and √(p(1 − p)/n) = √(0.2 × 0.8/20) = 0.089, so the 95% confidence interval is:
0.2 ± (1.96)(0.089)
= 0.2 ± 0.175 (3D)
= (0.025, 0.375)
Therefore we can state that there is a 95% chance that the defective rate of finished
components lies between 0.025 and 0.375.
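The faulty-components example can be sketched in Python (an illustration, not part of the original notes; the function name is an assumption):

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Confidence interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)   # standard error of the proportion
    return (p - z * se, p + z * se)

low, high = proportion_ci(4, 20)        # 4 faulty out of 20, 95% level
print(round(low, 3), round(high, 3))    # 0.025 0.375
```

The wide interval reflects the small sample: with only 20 components, the defective rate is estimated very imprecisely.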
Tests of significance are directly connected to confidence limits and are based on normal
distribution concepts.
Suppose we suspect that the value of type A customer monthly orders has changed from last year. Last year's type A customer average monthly order was $234.50. We now take a random sample of 20 customers and calculate a mean of $241.52 and standard deviation $13.92. Is the difference significant? That is, is the value 241.52 far enough from 234.50 to suspect that things have changed? The following gives a structure for answering this question.
To test whether a sample of size n, with mean x̄ and standard deviation s, could be considered as having been drawn from a population with mean µ, calculate the test statistic:
z = (x̄ − µ)/(s/√n)
In the test, we are looking for evidence of a difference between x̄ and µ. The evidence is found if z lies outside the limits given by the confidence factor (±1.96 at the 5% level). If z lies within the limits, we say there is 'no evidence' that the sample mean is different from the population mean.
For example, in the above situation, we have x̄ = 241.52; µ = 234.50; s = 13.92; n = 20:
z = (241.52 − 234.50)/(13.92/√20) = 7.02/3.11 = 2.26
Thus there is evidence of a difference, since z lies outside the range -1.96 to +1.96. This can now
be translated as ‘there is evidence that the value of type A customer monthly orders has
changed’.
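The significance test above can be sketched in Python (function name my own):

```python
import math

def z_statistic(sample_mean, pop_mean, sd, n):
    """z = (sample mean - population mean) / (s/sqrt(n)); |z| > 1.96
    is evidence of a difference at the 5% significance level."""
    return (sample_mean - pop_mean) / (sd / math.sqrt(n))

# Type A customer example: x̄ = 241.52, µ = 234.50, s = 13.92, n = 20
z = z_statistic(241.52, 234.50, 13.92, 20)
print(round(z, 2))      # 2.26
print(abs(z) > 1.96)    # True: evidence of a change
```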
ACTIVITY
Descriptive statistics look at three measures, namely measures of location, spread and skewness.
Write short notes on the three, giving the parameters used to measure each:
a) Mean
b) Mode
c) Median
d) Mean deviation
e) Standard deviation
f) Confidence interval for the mean
In part one we defined statistics as: a collection of techniques and methods that may be used to solve problems that arise when one wants to draw general conclusions from data from epidemiological and other types of empirical studies. We further stated that statistics is the science of collecting, summarising, presenting and interpreting data, and of using them to test hypotheses.
Analysis is defined as the ability to break down data and to clarify the nature of the component parts and the relationship between them (Saunders 2003:472).
Example 1: Spread of infectious diseases. Data on children exposed to siblings with measles, chicken pox or mumps.
1. To what degree are we permitted to generalise the information on attack rates to other exposed children? Are the results reliable?
2. Can we be sure that attack rates are highest for measles and lowest for mumps?
[Table: mean weight before and after treatment, with group size n, by treatment group; values not reproduced]
Statistical problems
The examples so far have illustrated some routine descriptive statistical techniques:
Histograms (Example 2)
Means (Example 2)
Descriptive methods only describe variations in data. They do not solve the problems that were raised in connection with the examples.
What is sampling?
Sampling involves the selection of a number of study units from a defined study population (Anita 1995:205). Whatever your research question(s) and objectives, you will need to collect data to answer them. If you collect and analyse data from every possible case or group member, this is termed a census. However, for many research questions and objectives it will be impossible for you either to collect or to analyse all the data available to you, owing to restrictions of time, money and often access. Sampling techniques provide a range of methods that enable you to reduce the amount of data you need to collect by considering only data from a subgroup rather than all possible cases or elements.
[Diagram: a random sample 'A' of cases or elements drawn from a population]
Sampling strategies
The sampling techniques available to you can be divided into two types: probability sampling and non-probability sampling.
Probability sampling involves random selection procedures to ensure that each unit of the sample is chosen on the basis of chance. All units of the study population should have an equal, or at least a known, chance of being included in the sample (Anita 1995:208).
With probability samples the chance, or probability, of each case being selected from the
population is known and is usually equal for all cases. This means that it is possible to answer
research questions and to achieve objectives that require you to estimate statistically the
characteristics of the population from the sample. (Saunders 2003:152)
This is the simplest form of probability sampling. To select a simple random sample you need to:
- make a numbered list of all the units in the population from which you want to draw a sample;
- select the required number of sampling units, using a 'lottery' method or a table of random numbers.
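The 'lottery' method can be sketched with Python's random module; the numbered frame of 100 units below is hypothetical:

```python
import random

# Hypothetical sampling frame: a numbered list of 100 units
frame = list(range(1, 101))

random.seed(1)                      # fixed seed so the draw is reproducible
sample = random.sample(frame, 10)   # 'lottery' draw without replacement

print(len(sample), len(set(sample)))   # 10 10; no unit is drawn twice
```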
HOW TO USE RANDOM NUMBER TABLES
1. First, decide how large a number you need. Next, count whether it is a one-, two-, or larger-digit number.
For example, if your sampling frame consists of ten units, you must choose from the numbers 1-10 (inclusive). You must use two digits to ensure that 10 has an equal chance of being included.
2. You also use two digits for a sampling frame consisting of 0-99 units. If, however, your sampling frame has 0-999 units, then you obviously need to choose from three digits. In this case, you take an extra digit from the table to make up the required three digits. For example, the number in columns 10 and 11, row 27, is 43; this would become 431; going down, the next numbers would be 107, 365, etc.
You would do the same if you needed a four-digit number, for a sampling frame of 0-9999 units. In our example, the number in columns 10, 11, 12, row 27 of the table, 431, would now become 4316, the next down 1075, and so on.
3. Decide beforehand whether you are going to go across the page to the right, down the page, across the page to the left, or up the page.
4. Without looking at the table, and using a pencil, pen, stick, or even your finger, pinpoint a number.
5. If this number is within the range you need, take it. If not, continue to the next number in the direction you chose beforehand (across, up or down the page), until you find a number that is within the range you need.
For example, if you need a number between 0-50 and you began at columns 21, 22, row 21, you get 74, which is obviously too big. So you could go down (having decided beforehand to go down) to 97, also too big, then to 42, which is acceptable, and select it.
In systematic sampling, individuals are chosen at regular intervals from the sampling frame. Ideally, we randomly select a number to tell us where to start selecting individuals from the list.
A systematic sample is to be selected from 1200 students of a school. The sample size selected is
100. The sampling fraction is: 100(sample size)/1200(study population) = 1/12
The sampling interval is therefore, 12. The number of the first student to be included is chosen
randomly, for example by blindly picking one out of twelve pieces of paper, numbered 1 to 12. If
the number 6 is picked, then every twelfth student will be included in the sample, starting with the
student number 6, until 100 students are selected: the numbers selected would be 6,18,30,42, etc
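The systematic selection just described can be sketched as follows; the function wrapper is my own, but the figures are the school example from the text:

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Select every k-th unit, where k = population_size // sample_size,
    beginning at a random starting point in 1..k."""
    k = population_size // sample_size   # sampling interval
    if start is None:
        start = random.randint(1, k)     # random starting point
    return [start + i * k for i in range(sample_size)]

# School example: 1200 students, sample of 100, starting point 6
selected = systematic_sample(1200, 100, start=6)
print(selected[:4])   # [6, 18, 30, 42]
```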
3.1.1.3 Stratified Sampling
The simple random sampling method described above does not ensure that the proportion of
individuals with certain characteristics in the sample will be the same as those in the whole study
population.
If it is important that the sample includes representative groups of study units with specific characteristics (for example, residents from urban and rural areas, or different age groups), then the sampling frame must be divided into groups, or strata, according to these characteristics. Random or systematic samples of a predetermined size will then have to be obtained from each group (stratum). This is called stratified sampling (Anita 1995:209).
Case control studies use stratified sampling from subpopulations with or without a specific
disease.
The selection of groups of study units (clusters) instead of the selection of study units
individually is called Cluster sampling.
A multi-stage sampling procedure is carried out in phases and usually involves more than one
sampling method.
Probability sampling is most commonly associated with survey based research where you need to
make inferences from your sample about a population to answer your research question(s) or to
meet your objectives. The process of probability sampling can be divided into four stages:
n' = n/(1 + (n/N))
where n' is the adjusted minimum sample size
n is the minimum sample size (as calculated above)
N is the total population (Saunders 2003:158, 466-467).
3. Select the most appropriate sampling technique and select the sample
4. Check that the sample is representative of the population.
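The adjusted minimum sample size formula above can be sketched in Python; the figures (a minimum sample of 384 from a population of 1000) are hypothetical:

```python
def adjusted_sample_size(n, N):
    """Adjusted minimum sample size n' = n / (1 + n/N), used when the
    sample is a large fraction of a small total population N."""
    return n / (1 + n / N)

# Hypothetical figures: minimum sample of 384, total population 1000
n_adjusted = adjusted_sample_size(384, 1000)
print(round(n_adjusted))   # 277
```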
For populations of less than 50 cases Henry (1990) advises against probability sampling. He
argues that you should collect data on the entire population as the influence of a single extreme
case on subsequent statistical analysis is more pronounced than for larger samples.(Saunders
2003: 153)
Stratified sampling is probability sampling from subpopulations (Anita 1995:209).
If individuals are selected randomly and independently, with each individual having the same chance of being selected, then the probability of selecting an individual with a given property or characteristic equals the frequency with which that property appears in the population. Population frequency = probability.
Epidemiological concepts like risk, rate, prevalence, incidence etc may all be regarded as
probabilities.
Risks, rates, prevalence and incidence calculated for a random sample of individuals are to be regarded as estimates of probabilities and/or population proportions.
A range of non-probability sampling techniques is available that should not be discounted as they
can provide sensible alternatives.
At one end of this range is quota sampling, which, like probability samples, tries to represent the
total population. Quota sampling has similar requirements for sample size as probability sampling
techniques. At the other end of this range are techniques based on the need to obtain samples as
quickly as possible where you have little control over the content and there is no attempt to
obtain a representative sample. These include convenience, and self-selection sampling
techniques. Purposive sampling and snowball sampling techniques lie between these extremes.
For these techniques the issue of sample size is ambiguous. Unlike quota and probability samples
there are no rules. Rather it depends on your research questions and objectives – in particular
what you need to find out, what will be useful, what will have credibility and what can be done
within your available resources (Patton, 2002).
Quota sampling is entirely non-random and is usually used for interview surveys. It is based on
the premise that your sample will represent the population as the variability in your sample for
various quota variables is the same as that in the population. Quota sampling is therefore a type of
stratified sample in which selection of cases within strata is entirely non-random (Barnett,1991).
To select a quota sample you:
Purposive or judgmental sampling enables you to use your judgment to select cases that will best
enable you to answer your research question(s) and to meet your objectives. This form of sample
is often used when working with very small samples such as in case study research and when you
wish to select cases that are particularly informative. Such samples cannot, however, be considered statistically representative of the total population. The logic on which you base your strategy for selecting cases for purposive sampling should be dependent on your research question(s) and objectives.
Snowball sampling is commonly used when it is difficult to identify members of the desired
population, for example people who are working while claiming unemployment benefit. You
therefore need to:
Self-selection sampling occurs when you allow a case, usually an individual, to identify their desire to take part in the research. You therefore:
1. Publicise your need for cases, either by advertising through appropriate media or by
asking them to take part.
2. Collect data from those who respond
Cases that self-select often do so because of their feelings or opinions about the research
question(s) or stated objectives.
Convenience or haphazard sampling involves selecting haphazardly those cases that are easiest
to obtain for your sample, such as the person interviewed at random in a shopping centre for a
television programme. The sample selection process is continued until your required sample
size has been reached. Although this technique of sampling is widely used, it is prone to bias and influences that are beyond your control, as the cases only appear in the sample because of the ease of obtaining them. Often the sample is intended to represent the total population, for
example managers taking an MBA course as a surrogate for all managers! In such instances
the choice of sample is likely to have biased the sample, meaning that the subsequent
generalizations are likely to be at best flawed. These problems are less important where there
is little variation in the population, and such samples often serve as pilots to studies using
more structured samples.
Confidence interval
A p% confidence interval about an estimate is an interval that with probability p contains the true value.
The 95% confidence interval may in many cases (e.g. for means and proportions) be calculated as estimate ± 1.96 × se(estimate). For example, for an estimated proportion of 0.80 with standard error 0.025:
0.80 + (1.96 × 0.025)
= 0.80 + 0.049
1. Can the difference between estimated attack rates be due to random variation (Example 1)?
2. Can the difference between weights before and after treatment for the treated girls be due to random variation? Can the difference between the control and treatment groups be due to random variation (Example 2)?
The general test procedure is:
1. Formulate two hypotheses:
(a) the population parameters are the same, meaning that the difference between estimates must be due to random variation;
(b) the population parameters are different.
Designate (a) as the null hypothesis H0 and (b) as the alternative Ha. The null hypothesis always assumes that there is no difference between, e.g., exposure and non-exposure to a disease.
2. Define a test statistic measuring the discrepancy between H0 and the observed data. A small value of this statistic should indicate a good fit between H0 and the data, and therefore lead to acceptance of H0. A large value of the test statistic indicates a misfit between H0 and the data; H0 will therefore be rejected.
3. Assume that T is the test statistic and that t0 is the observed value. The test probability, or p-value, associated with T is the probability that T ≥ t0, calculated under the assumption that H0 is true:
P = P(T ≥ t0 | H0)
4. A small p-value means that we have observed something that is improbable under H0: something that is difficult to explain as a random event if H0 is true.
Two tests
H0 in both cases is that there is no association between the variables (no association between exposure and disease, or no association between treatment and outcome/weight).
Significance testing
It is increasingly recognised by medical researchers and statisticians that the most informative way to indicate the statistical significance of a given value is to also present the confidence interval.
In the examples for ORs and RRs in the previous chapters, two things are immediately obvious from a confidence interval:
1. If the 95% interval does not include 1 (the entire interval is either above or below 1), then we know that there is a good probability that the risk factor studied is really associated with disease, and that this is not just a chance finding.
2. The width of a confidence interval gives a feeling for how precisely the OR or RR was measured in the study. If the 95% confidence interval for an RR was found to be from 1.3 to 15, then we would not really know if this was a very important risk factor for the disease (high RR) or a relatively minor one.
The X²-Test
Observed infection counts by exposure:

Exposure       Infection: Yes   No
Mumps          140.3            77.7
Chicken pox    153.2            84.8
Measles        161.5            89.5

Residuals = Observed − Expected

Cell contributions (obs − exp)²/exp:

Exposure       Infection: Yes   No
Mumps          24.22            43.74
Chicken pox    2.30             4.17
Measles        9.66             17.43

X² = Σ (obs − exp)²/exp = 101.51, the sum of the six cell contributions above.
[Diagram: distribution of X² under H0; accept H0 below the critical value, reject H0 above it]
If the p-value, P(X² > X²obs), is small, then the observed X² value lies in a critical region that has small probability if H0 is correct, and H0 is rejected.
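The X² statistic for a 2x2 table can be sketched in Python; the table below is a hypothetical exposure-by-infection example, not the attack-rate data above:

```python
def chi_square_2x2(a, b, c, d):
    """X² = sum of (obs - exp)²/exp, with expected counts derived from
    the row and column totals of the 2x2 table [a b / c d]."""
    n = a + b + c + d
    obs = [a, b, c, d]
    exp = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
           (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

# Hypothetical 2x2 table: exposure (rows) by infection (columns)
x2 = chi_square_2x2(30, 20, 10, 40)
print(round(x2, 2))   # 16.67, well above the 5% critical value of 3.84
```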
Assume that the mean values x̄1 and x̄2 have been calculated for two independent groups from two different subpopulations.
x̄1 and x̄2 are estimates of the expected values (population means) for the two subpopulations.
If H0 is correct, then E(x̄1 − x̄2) = 0. The test statistic is:
t = (x̄1 − x̄2)/se0
where se0 = √(s1²/n1 + s2²/n2)
[Diagram: t distribution; accept H0 in the centre, reject H0 in either tail]
If the p-value P(|T| ≥ tobs) is small, then the observed t value lies in a critical area with small probability under H0; therefore reject H0.
1. A t-test assuming different variances in the two populations. Variances and standard deviations must be estimated separately for each group.
2. A t-test assuming equal variances. A common variance is estimated and used in the calculation of the denominator of the t-value.
A statistical test (Levene's test) exists for the hypothesis of equal variances.
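The unequal-variances version of the two-sample t statistic can be sketched in Python; the group summaries below are hypothetical (the anorexia study reports only t and p):

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t = (x̄1 - x̄2)/se0, with se0 = sqrt(s1²/n1 + s2²/n2)
    (the unequal-variances form)."""
    se0 = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se0

# Hypothetical group summaries
t = two_sample_t(80.0, 5.0, 25, 84.0, 6.0, 30)
print(round(t, 2))   # -2.7
```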
Family therapy and anorexic girls
Before treatment: t = −0.962, p = 0.342
No evidence against equality of means (H0) → the study does not seem to be biased.
After treatment: t = −4.157, p = 0.00039
The small p-value is strong evidence of a difference in means after treatment.
Some comments
P-values are in most cases approximate values, requiring large samples to be reliable.
P-values for t-tests are exact if the distribution of the variable is normal.
Many other test statistics are available; the test procedure is, however, always the same once the test statistic has been calculated.
There are two main reasons for using rates as opposed to whole numbers
1. To make comparisons between two different populations that may have different
numbers of people at risk, by standardizing for population size.
2. To calculate the number of expected cases. By using a known rate the approximate
number of cases that are expected to occur in the population can be calculated.
The basic epidemiological concept of comparing risks is introduced, some confusing definitions
are discussed, and we meet attack rates.
A not uncommon situation for a practicing physician is when several members of a family, a day
care group, or a school class fall ill at almost the same time. Often, the disease is some kind of
gastroenteritis, and the patients as well as their doctor wonder if it might have been something
they ate. The answer to that question is complicated by the fact that it is not always possible to single out the responsible meal, and even if one could, there are almost always several different food items served during a meal (Giesecke 1994:17).
The situation becomes simpler if a group of people who do not ordinarily eat together share a
meal, and some of them become ill afterwards. The following example shows how one could
analyse such a situation by calculating risk and relative risk.
Example
Fifteen people had New Year's dinner together. Within 24 hours, five of them fell ill with gastroenteritis. The dinner had consisted of several courses and food items, and the participants had not all eaten the same things. How could the cause of their disease be assessed?
All guests were sent a list of the food that had been served and asked what they had eaten. As the lists came back, the replies were recorded in a double table, with the ones who had been ill on the left, and the ones who remained well on the right (see table 3.1).
Table 3.1 Table filled out from questionnaires given to 15 people during an outbreak of gastroenteritis

Item             Gastroenteritis   No gastroenteritis
Cheesecake       IIII              I
Chocolate cake   I                 II
What we want to know is what one's chance of being ill was if one had eaten each of the foods. Rearranging the table, we have table 3.2.
Table 3.2 Number of subjects in table 3.1 who became ill out of the total who ate each item.

Item         Ill   Total who ate   Risk
Cheesecake   4     5               0.8
RISK
The Risk associated with some potentially harmful factor is defined as: The proportion who
become ill out of all those exposed to it.
For the risk refer to the risk column in table 3.2 above.
Now comes the centre of this chapter: we must also look at the risk of being ill in those who did
not eat the items on the list. We know that almost half of those who had Swiss roll were ill, but
what conclusion would we draw if we found the same proportion ill in those who skipped the
Swiss roll? The people at the dinner ate many different things, and for most of the items there will
be a mixture between those who happened to eat the infected food, and those who did not. If the
item was innocent of causing an illness, we would expect the same risk of being ill regardless of
whether one ate or not.
This way of thinking is basic to all epidemiology: if an exposure has nothing to do with a disease,
then the proportion who are ill after having had this exposure should be the same as in those who
had not had the exposure.
We thus proceed to list the outcome according to what people did not eat. (Table 3.3). Looking at
table 3.1, we can see that three of the ill people did not eat quiche, and that two of the well people
also did not eat quiche, and so on:
Table 3.3 Number of those subjects in table 3.1 who became ill out of the total who did not eat each item.

Item         Ill   Total who did not eat   Risk
Quiche       3     5                       0.6
Cheesecake   1     10                      0.1
RELATIVE RISK
A simple way of comparing the risk in those exposed versus those not exposed is to divide them (always putting the risk in the exposed on top). This gives the relative risk (also called the risk ratio), RR.
Table 3.4 Relative risks (RRs) of illness associated with each item during an outbreak of gastroenteritis.

Food     RR
Quiche   0.33
A relative risk around 1 means that the risk of disease was nearly equal in the exposed and unexposed, and that the item is unlikely to have caused disease. A high RR points to the item being associated with the disease, and an RR close to 0 would indicate that the item is in some way protective: the risk of disease is then much higher in those not exposed.
From our calculations we find that the RR of causing illness is clearly highest for cheesecake, and
we can conclude with some certainty that this was the item responsible for the gastrointestinal
illness.
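The risk and relative-risk arithmetic can be sketched in Python; the function name is my own, but the figures are the cheesecake counts from tables 3.2 and 3.3:

```python
def risk(ill, total):
    """Risk = proportion ill out of all who were exposed (or unexposed)."""
    return ill / total

# Cheesecake: 4 ill of the 5 who ate it; 1 ill of the 10 who did not
risk_exposed = risk(4, 5)        # 0.8
risk_unexposed = risk(1, 10)     # 0.1
rr = risk_exposed / risk_unexposed
print(round(rr, 2))   # 8.0
```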
Table 3.5 Number of infected out of siblings exposed to three childhood infections. Source: Hope-Simpson.
From these figures we can calculate the infectivity of each of these diseases within a family.
ATTACK RATE
The basic measure of infectivity is the attack rate. The definition is: the attack rate of a disease
is the number of cases, divided by the number of susceptible exposed, which is really the
same as the definition of risk above. The difference is that here we use the infectious disease definition of exposure, by counting only the people who really were exposed to the microbe.
Thus, four out of every five children exposed to measles in the family will themselves contract measles, etc. There are obviously appreciable differences in attack rate between these three diseases.
Example 1
se = √(P × (1 − P)/N)
= √(0.8 × 0.2/251)
= √0.00063745
= 0.025
The 95% confidence interval is:
0.80 + (1.96 × 0.025) = 0.80 + 0.049 = 0.849
0.80 − (1.96 × 0.025) = 0.80 − 0.049 = 0.751
Often one wants to estimate the proportion of the population that have some characteristics, such
as the proportion who have antibodies to disease A, or the proportion of the population who have
had a test for disease B. It is seldom possible to test or ask everyone, so one would probably do this by collecting a random sample.
Assuming that this sample is truly random, with no selection biases involved, one would want to
know how the proportion measured in the sample relates to the true prevalence in the population.
This is exactly the same reasoning as followed regarding the confidence interval for an OR: if one only wants to make a statement about the sample just studied, there is no need for confidence intervals. If, for example, 308 out of 1000 sampled were found to have antibodies,
Example 2
The studies we have been looking at in the previous examples have all analysed an
epidemiological pattern after the event has occurred. Sometimes one may have to plan an
epidemiological study a little better in advance. Such studies, where a defined group of people is followed over time, are probably more familiar to most clinicians than case control studies.
In a study of HIV infection and Tuberculosis in New York, 513 intravenous drug users were
initially tested for HIV antibody. Two hundred and fifteen were HIV positive and 298 HIV
negative. They were then followed for any signs of active tuberculosis during an average of two
years. The results of the study were :
                 HIV seropositive   HIV seronegative
                 initially          initially          Total
Developed TB     8                  0                  8
The risk of developing tuberculosis in this group was thus 8/215 =0.037 for those who were
seropositive at entry, and 0/298=0 for those who were seronegative.
This type of study, where one first defines and measures the risk factor one wants to evaluate (in this case HIV status) in a defined group, and then follows this group over time to see who develops disease (in this case TB), is called a cohort study.
Another example: there has been much discussion about whether sexually transmitted diseases, and especially genital ulcerative diseases, increase the risk of HIV transmission. In a study from Nairobi, Cameron et al. followed 291 men who presented at a sexually transmitted diseases (STD) clinic. They were reported to have had sexual intercourse with women from a group of prostitutes in which HIV infection was known to be common. About half the men presented with an ulcerative disease, the rest with urethritis. After the first visit, the men were tested repeatedly for three months, to see if they had seroconverted in an HIV antibody test (which may take several weeks to become positive after the actual transmission).
                 Presented with     Presented with
                 genital ulcers     another condition   Total
Seroconverted    21                 3                   24

Risk = number of ill persons among the exposed / total number of people exposed
= 21/149
= 0.14
Risk = number of ill persons among the unexposed / total number of people unexposed
= 3/144
= 0.02
RR = 0.14/0.02 = 7
The confidence interval for an RR is calculated almost as for an OR. We first calculate the error factor:
EF = e^(2 × √(1/21 + 1/3)) = e^(2 × 0.62) = e^1.24 ≈ 3.46
Divide and multiply the RR by this value (the error factor) to get the lower and upper bounds, respectively.
The confidence interval for a ratio is thus calculated as for an RR, i.e. by first getting the error factor; the lower and upper bounds are then found by dividing and multiplying the RR by the error factor, respectively. The confidence interval seems to be well above RR = 1, but since the formula above requires that both a and b are at least 10, this approximate interval cannot be trusted entirely.
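The RR and its error-factor bounds can be sketched in Python using the study's figures. Note that computing without rounding intermediate values gives an RR of 6.8 (rounded to 7 in the text) and an error factor of 3.44 rather than the text's 3.46, which comes from rounding the square root to 0.62 first:

```python
import math

def rr_with_bounds(ill_exp, n_exp, ill_unexp, n_unexp):
    """RR with approximate 95% bounds: divide and multiply the RR by
    the error factor e^(2*sqrt(1/a + 1/b)), where a and b are the
    case counts in the exposed and unexposed groups."""
    rr = (ill_exp / n_exp) / (ill_unexp / n_unexp)
    ef = math.exp(2 * math.sqrt(1 / ill_exp + 1 / ill_unexp))  # error factor
    return rr, ef, (rr / ef, rr * ef)

# Nairobi STD-clinic figures: 21/149 ill among exposed, 3/144 among unexposed
rr, ef, (low, high) = rr_with_bounds(21, 149, 3, 144)
print(round(rr), round(ef, 2))   # 7 3.44
```

The lower bound is above 1, matching the text's remark that the interval is well above RR = 1.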
In a cohort study, we can use risks and RRs for our comparisons, since we have started by defining the total group of people that we want to study.
In dealing with risk, relative risk and attack rates, we had knowledge of the entire population, i.e. we could count precisely how many were exposed and how many were infected.
In real life, and especially when studies are based in the community rather than in the clinic, we
will often only have information about some of all exposed and ill people. Such situations require
slightly different methods.
To calculate a risk, the denominator must include the total number of persons exposed (here, everyone who had eaten the item).
In this case we do not know how many people were exposed, nor do we know who was exposed
to which item. Also it is unlikely that we identified all the cases.
This type of epidemiological analysis is the basic form of the case control study, where risk factors for disease are ascertained by comparing different exposures (in this case, type of food eaten) between people who were ill (= the cases) and people who were not (= the controls). In contrast to the previous chapter, we do not have knowledge of all cases, nor of all controls; what we do have are two samples of people. Note also that we are doing the analysis 'backwards' in time, starting from a number of cases that we diagnosed, then identifying a number of controls, and after that looking at possible causes of the disease.
Table 4.1 Results from questionnaires to 37 cases and 58 controls in an outbreak of gastroenteritis in a large office block. Source: Salmon et al.

              Gastroenteritis        No gastroenteritis
Item          Eaten    Not eaten     Eaten    Not eaten
Lunch 22/1    6        31            9        48
Lunch 23/1    18       19            14       43
Salad         12       24            5        52
Sandwiches    16       21            14       44
Chicken       4        33            4        54
ODDS
Comparisons of risk factors in case control studies most often make use of the term odds. Odds build on a similar idea to risks, but instead of dividing the number of people who were ill by the total number exposed, which we do not know, we divide by something we do know, namely the number of people in our study who did not become ill. The odds associated with each item on our list are thus:
odds = number of ill who ate the item / number of well who ate the item
As an example, the odds for the chicken in the table above would be 4/4 = 1.
ODDS RATIO
Just as with risks, we want to compare the odds for those exposed with the odds for those not exposed. The odds for salmonella infection if one had not eaten chicken are 33/54 = 0.61, so the OR for chicken is 1/0.61 = 1.64.
The definitions of odds and OR are very similar to those for risk and RR (so similar, in fact, that they are easily confused, which happens not infrequently in the epidemiological literature). Their advantage is that they can be calculated in situations where one does not have knowledge of the entire population. The disadvantage is that they have less intuitive meaning than the words risk and relative risk: odds do not really mean anything in themselves; they can just be compared to see which ones are greater.
            Exposed   Unexposed   Total
Cases       a         b           a+b
Controls    c         d           c+d

For an infectious disease with very high infectivity, there would not be any people in cells b and c: all the ill people would be exposed (= a), and all the healthy would be unexposed (= d).
For the factor 'having had lunch in the canteen on 22nd January', our 2x2 table would be:

        Eaten   Not eaten   Total
Ill     6       31          37
Well    9       48          57
Total   15      79          94
From this table it is easy to calculate the odds and the OR associated with having lunch on the 22nd. The odds of illness in those who had lunch are 6/9 = 0.67, and the odds of illness in those who didn't eat in the canteen that day are 31/48 = 0.65. The OR for illness for the factor 'lunch in the canteen on 22nd' is 0.67/0.65 = 1.03.
An OR of 1 is equivalent to equal odds for disease in those exposed and not exposed to the factor, which is the same as saying that an OR of 1 suggests that this factor is not associated with disease.
For 'lunch on 23rd January' the corresponding table is:

        Eaten   Not eaten   Total
Ill     18      19          37
Well    14      43          57
Total   32      62          94
Here, the odds for illness are 18/14 = 1.29 and 19/43 = 0.44 for the exposed and unexposed respectively. That is: among the 94 people who answered this question, there was a considerably greater chance of having had lunch on the 23rd for those who were infected than for those who were well.
In the same way, ORs can be calculated for the three different menu items, and the total list becomes:
Table 4.2 Odds ratios (ORs) for illness associated with each risk factor for the study given in Table 4.1.

Item          OR
Lunch 22/1    1.03
Lunch 23/1    2.93
Salad         5.2
Sandwiches    2.39
Chicken       1.64
The formula for the odds ratio can be manipulated a little to give an easier calculation:
OR = (a/c)/(b/d) = ad/bc
or in words: 'multiply the upper left-hand number by the lower right-hand, and divide by the upper right multiplied by the lower left'.
The probable range of the true OR can be calculated rather easily from the 2x2 table. For each of our five ORs we first calculate something called the error factor, which is defined as

EF = e^(1.96 x sqrt(1/a + 1/b + 1/c + 1/d))

The formula might seem complicated, but it can be worked out on an ordinary calculator. As an example, for the exposure 'lunch on the 23rd' the calculation would be:
1. First divide 1 by each of the four numbers in the 2x2 table, adding the result to the memory of the calculator each time: 1/18 + 1/19 + 1/14 + 1/43 = 0.20
2. Then take the square root of this sum: sqrt(0.20) = 0.45
3. Multiply by 1.96 (rounded here to 2): 2 x 0.45 = 0.90
4. And finally raise e to this number: e^0.90 = 2.46, which is our error factor.
5. The lower bound of the probable range for the OR for 'lunch on the 23rd' is now given by dividing our calculated OR in the list above by the error factor: 2.93/2.46 = 1.19
6. The higher bound is given by multiplying our calculated OR by the error factor: 2.93 x 2.46 = 7.21
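The same steps can be sketched in Python (our own illustration; note that working in full precision, rather than rounding at every step as in the hand calculation, gives an error factor of about 2.42 and therefore slightly different bounds):

```python
import math

# Sketch of the error factor and probable range for an OR:
# EF = e^(1.96 x sqrt(1/a + 1/b + 1/c + 1/d)).
# Figures are for 'lunch on the 23rd'.

def or_with_range(a, b, c, d):
    OR = (a * d) / (b * c)                                  # cross-product OR
    ef = math.exp(1.96 * math.sqrt(1/a + 1/b + 1/c + 1/d))  # error factor
    return OR, OR / ef, OR * ef                             # OR, lower, upper

OR, low, high = or_with_range(18, 19, 14, 43)
print(round(OR, 2), round(low, 2), round(high, 2))
# prints: 2.91 1.2 7.03
```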
CONFOUNDING
Even if a factor is significantly associated to disease, this may just be a statistical finding, where
the division according to exposure also divides people into high risk and low-risk groups
according to some real factor. This is called confounding. The concept of confounding is closely
coupled to the concept of cause in epidemiology.
Odds ratios
Confounding
Stratification
Conditional independence
Effect modification
Mortality seemed to be higher in the western American states than in the eastern ones. Are there two different types, a western type and an eastern type, of typhus fever?
Assume that the disease is rare. A prospective or cross-sectional study will then contain very few cases with the disease. Case-control studies instead use stratified sampling: all (or many) cases with the disease, and some controls without the disease.
From a case-control study we can estimate

P(Exposure = yes | case) and P(Exposure = yes | control).

The problem is that what we really want is

P(case | Exposure = yes) and P(case | Exposure = no),

but we only have information on the exposure probabilities above. Bayes' theorem comes to the rescue. For either exposure status,

P(case | Exposure) / P(control | Exposure)
= [P(Exposure | case) x P(case)] / [P(Exposure | control) x P(control)]

since P(Exposure) cancels from numerator and denominator. Taking the ratio of this quantity for the exposed to the same quantity for the unexposed, the term P(case)/P(control) also cancels, and the prospective (disease) odds ratio reduces to

odds(exposure | case) / odds(exposure | control) = OR_exp
The retrospective information from case-control studies will therefore give us information on the prospective odds ratios that we are interested in.
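This identity can be checked numerically (a small sketch of our own; the counts are those of the 'lunch on the 23rd' table):

```python
# Numerical check: the retrospective (exposure) odds ratio equals
# the prospective (disease) odds ratio -- both reduce to ad/bc.

def prospective_or(a, b, c, d):
    # odds of disease among exposed / odds of disease among unexposed
    return (a / c) / (b / d)

def retrospective_or(a, b, c, d):
    # odds of exposure among cases / odds of exposure among controls
    return (a / b) / (c / d)

print(prospective_or(18, 19, 14, 43))    # ~2.91
print(retrospective_or(18, 19, 14, 43))  # the same value
```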
One conceptual difference between the two types of study has to do with time:
In a cohort study we start with a number of subjects who are free from disease, and follow them over time to see who becomes a case and who does not.
In a case control study the events have already happened before the study started, and we collect
the cases and try to find appropriate, disease free controls.
Cohort studies usually require carefully planned investigations, whilst a case control study can
quite often be performed quickly from a number of cases already collected.
For diseases with very low incidence, cohort studies may not be practical, or even feasible.
Another principal conceptual difference concerns the measures of strength of association in the
two types of studies; the ORs and the RRs. As earlier pointed out, an odds ratio does not have any
direct interpretable meaning; it just tells us how strongly an exposure and an outcome seem to be
related. However for rare diseases, the OR often provides a good approximation of the RR.
Activity
In the second of the two situations above we did not have continuous measurements of some
variable for the two groups, but instead numbers of people belonging to different categories. It
then becomes a bit strange to talk about the ‘average sex’ in a group of patients. The basic
situation is just our familiar friend the 2x2 table, which could be for example:

                Vaccinated      Not vaccinated      Total
Ill             10              40                  50
Well            80              20                  100
Total           90              60                  150
However, the 2x2 table could just as easily be extended to a table with more columns and/ or
rows if there were more categories of exposure or outcome or both.
In this situation, the subject would only belong to the two categories ‘vaccinated’ or ‘not
vaccinated’, and to either of the categories ‘ill’ or ‘well’. There is no meaningful way of giving an
average of health in the vaccinated group, or an average vaccination status in the well group.
This type of data is thus quite different from the t-test situation above, and is usually called categorical, as opposed to continuous, data.
In chapter 4 we saw how to calculate an OR for such a table, and also a confidence interval for
this value. If we now want to perform a significance test instead, the question to ask is: what is
the probability that the 150 subjects of the study will divide this way into ‘ill’ and ‘well’ just by
chance? A very low probability of such a chance would give increased weight to our hypothesis
that the vaccine has effect.
The way to reason is as follows: there are 50 people who fall ill and 100 who remain well. If the vaccine was totally inefficient, we would assume that it did not matter whether or not a subject was vaccinated. Since one third of the total group fell ill, this would be the expected proportion in each of the individual groups. In the vaccinated group of 90 people, we would expect 30 to fall ill, and in the unvaccinated group of 60, we would expect 20. The expected 2x2 table if the vaccine did not work at all would be:
                Vaccinated      Not vaccinated      Total
Ill             30              20                  50
Well            60              40                  100
Total           90              60                  150
The general way of calculating the expected value for a cell in a 2x2 (or 3x3, or 5x3, or ...) table is to multiply the column sum at the bottom of the corresponding column by the row sum to its right, and divide this product by the total in the lower right corner. For the first cell in our example this would be 90 x 50/150 = 30, just as above. (In these calculations one often gets fractions of people in the cells of the expected table, but that does not affect the analysis at all.)
We can now compare the numbers in the expected table to the actual ones to see if they are very different. One way is just to take the difference between the numbers in the corresponding cells (10 - 30 for the first cell, 40 - 20 for the second, and so on). If the vaccine had no effect we would expect these differences to be small; the larger they are, the more the result of our study deviates from what would be expected just by a chance distribution of the cases. The X2 test now consists of squaring all these differences, dividing each square by its expected value (from the table above), and then adding them. The higher this number, the smaller the chance that the distribution of ill and healthy subjects according to vaccination status would have occurred by chance. For a 2x2 table like this, an X2 value above 3.84 indicates that there is less than 5% probability that the result occurred by chance.
We can see that there is a very small chance indeed that this high X2 value would arise by chance, and we can state that there is statistical support for the vaccine having a protective effect. In fact, the probability that this would be a chance finding can be calculated to be p < 0.0001.
When you look up an X2 table, you will find that it mentions something called 'degrees of freedom'. For tables such as the one above, this has to do with the number of rows and columns (categories for exposure and outcome). The number of degrees of freedom is just (number of rows - 1) x (number of columns - 1), and thus for a 2x2 table (2-1) x (2-1) = 1. For a 3x4 table (three different outcomes, four different exposures), there would be (3-1) x (4-1) = 6 degrees of freedom, and you would have to refer to this row of the X2 table for the test. (Degrees of freedom is often abbreviated 'd.f.'.)
A quick way to calculate the X2 value from the general 2x2 table from chapter 4:

                Exposed     Not exposed     Total
Cases           a           b               a+b
Controls        c           d               c+d
Total           a+c         b+d             N

X2 = (ad - bc)^2 x N / [(a+c)(b+d)(a+b)(c+d)]

where the parentheses in the denominator are just the column and row sums.
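Both routes to X2 can be sketched in Python and checked against each other (our own illustration; the observed counts 10, 40, 80, 20 are those implied by the vaccination example above):

```python
# X2 for the vaccination example, computed two ways: cell by cell
# against the expected table, and with the shortcut formula
# (ad - bc)^2 x N / ((a+c)(b+d)(a+b)(c+d)).

def chi2_cells(a, b, c, d):
    N = a + b + c + d
    observed = [a, b, c, d]
    expected = [(a + b) * (a + c) / N, (a + b) * (b + d) / N,
                (c + d) * (a + c) / N, (c + d) * (b + d) / N]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_shortcut(a, b, c, d):
    N = a + b + c + d
    return (a * d - b * c) ** 2 * N / ((a + c) * (b + d) * (a + b) * (c + d))

print(chi2_cells(10, 40, 80, 20))     # ~50 -- far above 3.84
print(chi2_shortcut(10, 40, 80, 20))  # the same value
```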
Since the X2 test is so easy to perform, it can often be used for an initial check even for continuous data where one would otherwise use a t test. If, for example, one wants to compare temperatures in two groups of patients, one could just choose the value that seems to be in the middle of all temperature readings from both groups, and count the number of subjects in each group who have a temperature above or below this value. The four figures thus obtained are entered in a 2x2 table, and the X2 calculated. If this X2 figure yields a low p value, then you can be confident that the t test will also yield a low p value.
There are, however, some important restrictions on when the X2 test can be used. It is an approximate method that becomes more and more valid the larger the size of the study.
The usual conditions are that the total study size is reasonably large and that no expected cell count is too small (a common rule of thumb is that all expected values should be at least 5). If these conditions are not fulfilled, one must use Fisher's exact test, which is the subject of the next section.
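As a preview of that method, the one-sided Fisher's exact test can be sketched as follows (our own illustration, not from the text; the example table is hypothetical, with cell counts small enough that the X2 approximation would be dubious):

```python
from math import comb

# One-sided Fisher's exact test for a 2x2 table with fixed margins:
# sum the hypergeometric probabilities of the observed table and of
# all tables at least as extreme in the same direction.

def fisher_one_sided(a, b, c, d):
    row1, row2 = a + b, c + d        # cases, controls
    col1 = a + c                     # total exposed
    N = row1 + row2

    def p_table(x):
        # probability of x exposed cases, given the fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(N, col1)

    # tables at least as extreme: from the observed a upwards
    return sum(p_table(x) for x in range(a, min(row1, col1) + 1))

# hypothetical small table: 7 exposed cases, 1 unexposed case,
# 2 exposed controls, 8 unexposed controls
print(round(fisher_one_sided(7, 1, 2, 8), 4))   # 0.0076
```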
One of the perpetual dreams of mankind has always been to be able to predict the future.
The regular recurrence of epidemics and the similar shapes of consecutive epidemics of a
disease have for a long time tempted people with a mathematical inclination to make
some kind of model.
The potential for a contagious disease to spread from person to person in a population is called the reproductive rate. It depends not only on the risk of transmission in a contact, but also on how common contacts are: a person with measles who meets no one will not transmit the infection. In a similar way, the rate of acquisition of new sexual partners will influence the spread of sexually transmitted diseases. The principal determinants of the reproductive rate are given by the formula

R0 = β x k x D

where R0 is the average number of persons directly infected by an infectious case during his entire infectious period, when he enters a totally susceptible population.
Point 2 above is the most interesting from the epidemiological point of view, and also the most frequently overlooked. The spread of infectious diseases depends not only on the properties of the pathogen or the host, but to at least an equal degree on the contact patterns in society: who meets whom? How often? What kind of contact do they have?
If a new disease enters a population, what is its probability of spreading? From the above
it may happen that:
In order to prevent epidemics of a disease, the proportion of the population that must be vaccinated is higher than 1 minus the inverse of the basic reproductive rate.
Example: R0 for measles has been shown to be around 15, i.e. every case of measles will infect 15 other people, on average. The formula then predicts that if we want to prevent measles epidemics, more than 1 - 1/15 = 0.93, or 93%, of the population must be vaccinated.
The level of immunity in the population, which prevents epidemics, (even if some
transmission may still occur), is called herd immunity.
Where: β is the risk of transmission per contact (i.e. basically the attack rate), k is the number of potential infectious contacts that the average person in the population has per time unit, and D is the duration of the infectious period.
Many public health measures to prevent the spread of infections aim at decreasing β, such as using a condom, wearing a face mask, or washing one's hands.
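The formula R0 = β x k x D and the vaccination threshold 1 - 1/R0 can be put into a short sketch (the numerical values of β, k and D below are purely illustrative, not from the text):

```python
# Basic reproductive rate and herd immunity threshold.

def r0(beta, k, D):
    # risk per contact x contacts per time unit x duration of infectiousness
    return beta * k * D

def herd_immunity_threshold(R0):
    return 1 - 1 / R0

print(r0(0.5, 3, 10))                         # 15.0 -- a measles-like R0
print(round(herd_immunity_threshold(15), 2))  # 0.93, i.e. 93% coverage needed
```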
The simple model above rests on two unrealistic assumptions. The first is that every person in the population will meet every other with equal probability. The second is that only one kind of contact exists, with a given β.
(Giesecke 1994:108-123)
The three important factors to be characterized in an outbreak analysis are time, place
and person. In order to identify the cases for this analysis, one needs a case definition,
and this is used to actively search for more cases than the ones who present themselves.
The epidemic curve can give an indication of the type of exposure: point source, extended source, or person-to-person. In a point-source outbreak it is often possible to estimate the common time of exposure, if the disease and its incubation time are known, or conversely, to diagnose the disease if the time of exposure is known.
Plots should be made not only for time course of the outbreak but also for sex, and age
distribution, and often for the geographical location of the cases.
After the cases have been identified, the probable cause of the outbreak can be searched for by more analytical methods, most often starting with a case-control study. In many instances, the cause will be clear from the outset, since an unexpected increase in the rate of diagnosis of a certain pathogen in the microbiological laboratory will often be what triggered the investigation. However, if the pathogen was previously unknown, epidemiology and microbiology must often work hand in hand to reveal the cause.
Epidemic Threshold
An epidemic may be defined as an indisputable increase in the number of cases of a disease
compared to its usual rate. This definition should reflect the norms for individual disease
prevalence in a given geographical area.
The alarm signal
The alarm signal may be sounded by:
the population itself
the surveillance system
Rumors of unknown origin to the effect that people are dying.
The reliability of such an alarm depends on its origin. By definition, information furnished by the surveillance system is more credible than rumors. Whatever the source of the alarm, however, an investigation will have to be undertaken to confirm or disprove the initial reports.
Data analysis
Epidemiological data
Persons
What group or groups are affected? The rate of infection must be measured for
different population groups. Similarly, the rate of mortality specific to the disease
in question must be determined for each of the groups affected.
Place/Space
Where did the epidemic begin? Which regions are most affected? Both the rate of
infection and mortality rate must be determined by geographical area.
Time
When were the first cases identified? An epidemic curve should be constructed to
show the number of cases in relation to time.
Risk factors
Are these factors strong enough to contribute to the outbreak of the epidemic?
The strategy for action consists in reviewing all the stages of the communicable disease cycle and making a list of possible actions. Examples include:
Vector control
- destruction of vectors
- action to eliminate breeding grounds
Active protection
- immunisation
Passive protection
- chemoprophylaxis
Early screening for cases
- putting health facilities on alert
- promoting public awareness (through the media)
- actively seeking out cases
Treatment of diagnosed cases
- reinforcement of health care personnel's technical expertise
- provision of necessary equipment
Removal and cremation or burial of bodies
Next, the activity best suited to the situation must be selected, following two lines of action: one to flatten the epidemic curve (essentially through preventive measures) and the other to reduce mortality, by curative measures.
The local health –care services do not always have the necessary resources to cope with an
epidemic. This includes material resources, the technical expertise for making the initial
diagnosis, and logistic support. International aid may prove necessary.
Epidemics are a sensitive subject for health and political authorities. The political authorities must understand the health care personnel's proposals before they will agree to assist in instituting control measures. Political authorities tend to minimise or deny the existence of an epidemic because of the negative image that such news projects to the outside world, or because of its repercussions on tourism. Where refugee populations are concerned, an epidemic may serve as an argument for reinforcing coercive measures against them.
The set of measures already being implemented, adapted to the problem and the urgency
of the epidemic, constitutes a type of surveillance system in itself, particularly in terms of
organisation.
What is SPSS?
Descriptive methods
Graphical
Numerical
Statistical inferences
Analysis of two way and multiway tables
Comparison of mean values
Regression analysis
Non parametric tests, etc
6. A syntax editor, where one may write, save and execute SPSS-programmes for more
complex computations and analysis.
Why SPSS?
Disadvantages of SPSS
Variable name
Variable label
Variable type
Numerical
Text
SPSS –Data
Variables
Cases
Excel has only one data window (the data view), while SPSS has two: a data view and a variable view window.
In Excel you need to write formulas, whereas in SPSS the formulas are built in.
In Excel the columns are identified by alphabetical letters, whereas in SPSS the columns are identified by variables (var).
SPSS variable properties: Name, Type, Width, Decimals, Label, Values, Missing, Columns, Align, Measure
PART V DEMOGRAPHY
5.0 DEMOGRAPHY
It deals with five "demographic processes", namely (i) fertility, (ii) mortality, (iii) marriage, (iv) migration and (v) social mobility.
The main sources of demographic statistics are population censuses, national sample surveys,
registration of vital events, and ad-hoc demographic studies.
Table 3.1 Typical distribution of population by age group for a district of a developing country.

Age group (years)       Percentage      Number
1-4                     14              28 000
5-14                    26              52 000
15-44                   43              86 000
45+                     13              26 000
A population pyramid gives the history of population fertility. It is constructed by plotting the percentage of the total population in each age group on the X-axis and the age groups (in years) on the Y-axis.
Where fertility is high, the pyramid will have a broad base and a narrow tip. This is typical of developing countries.
Where fertility is low, the pyramid has a smaller base and a wider tip, comprising mainly adults and fewer young people. This is typical of developed countries.
5.4 Census
The census is an important source of health information. It is taken in most countries of the
world at regular intervals, usually of 10 years. A census is defined by the United Nations as “the
total process of collecting, compiling and publishing demographic, economic and social data
pertaining at a specified time or times, to all persons in a country or delimited territory”
The crude birth rate (CBR) is usually estimated from a census or special demographic surveys and is given by this formula:

CBR = total live births in one year/total mid-year population (same year) x 1000
The rates are usually available for each district, and by applying them to the district population
we can estimate the total number of births per year.
Example in a district of 200 000 people with a CBR of 45 births per 1000, there would be about
9000 births per year, or about 170 per week.
Total births = CBR/1000 x Population = 45/1000 x 200 000 = 9000 per year.
If the health information system reports that about 80 births per week are attended by trained health workers, the coverage can be estimated to be about 50% (i.e. 80/170 x 100 = 47%). How well is the district doing?
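The district arithmetic above can be sketched as follows (a minimal illustration; note that keeping the unrounded figure of about 173 births per week gives a coverage of 46%, versus the 47% the text obtains from the rounded 170):

```python
# District estimates from the CBR: expected births per year and per
# week, and coverage of births attended by trained health workers.

def expected_births(cbr_per_1000, population):
    return cbr_per_1000 / 1000 * population

births_year = expected_births(45, 200_000)  # 9000 births per year
births_week = births_year / 52              # ~173 per week
coverage = 80 / births_week * 100           # ~46% of births attended
print(births_year, round(births_week), round(coverage))
```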
By fertility is meant the actual bearing of children. Fertility depends upon a number of factors, including: age at marriage, duration of married life, spacing of children, education, economic status, caste and religion, nutrition, family planning, and other physical, social and cultural factors.
Fertility-related statistics:
The fertility rate (FR) is an age-sex specific rate usually derived from the census or special demographic surveys. This rate is a measure of how frequently women in the fertile age range (15-44 years) are having babies, so where the CBR is high the FR will also be high. Developing-country populations with average fertility might have a rate of about 100-150 births per 1000 women aged 15-44 years per year; in high-fertility populations it might be around 200 per 1000.
5.5.3 Crude Death Rate
CDR = total deaths in one year/total mid-year population (all ages, same year) x 1000
The CDR commonly ranges from 10 deaths per 1000 people per year in more developed areas to
more than 20 deaths per 1000 in poor populations.
The infant mortality rate (IMR), which is the proportion of live-born infants who die in the first twelve months of life, is commonly considered a good measure of health status. It is usually
calculated from the census or special demographic surveys. There are many technical problems
in calculating accurate IMRs and health workers should not rely on the accuracy of their
estimates unless there is a very good vital registration system in operation. The following formula
is commonly used:
IMR=total infant (aged <1year) deaths during one year/total births in same year x 1000
Most infant deaths occur during the first month of life; these deaths are called neonatal deaths. The total number of expected infant deaths can be calculated as follows: in a district with a population of 200 000, 9000 births per year and an IMR of 100, the estimated number of infant deaths would be 9000 x 100/1000 = 900 per year.
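The same estimate in sketch form (our own illustration, using the district example's figures):

```python
# Expected infant deaths from the IMR: 9000 births per year and an
# IMR of 100 per 1000 live births.

def expected_infant_deaths(imr_per_1000, births_per_year):
    return imr_per_1000 / 1000 * births_per_year

print(expected_infant_deaths(100, 9000))   # 900.0 infant deaths per year
```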
The child mortality rate (CMR) is based on the deaths between 1 and 4 years of age and is
important because malnutrition and infectious diseases are common in this age group. It is usually
calculated from a census or special surveys since it is not easily calculated with sufficient
accuracy from district health information.
A neglected death rate is the maternal mortality rate (MMR), partly because it is difficult to calculate accurately. An approximate rate for many developing countries is 1-5 maternal deaths per 1000 births per year, which means that a district with a population of 200 000 and a CBR of 40 per 1000 might expect between 8 and 40 maternal deaths per year. In this case it is more important to know the true numbers than the rate, since the actual numbers are so small. The use of births as a denominator, instead of the number of women of child-bearing age, may give the impression that the problem of maternal deaths in developing countries is less serious than it is in reality. For example, even the fact that the MMR may be 5 per 1000 in Africa compared to 5 per 100 000 in Europe does not adequately reflect the much greater risk of mothers dying from pregnancy-related causes in Africa. This is because the average number of births per woman is also much higher in Africa, and therefore the risk of a particular woman dying of pregnancy complications is today about 400 times greater in many developing countries than in developed areas.

MMR = maternal pregnancy-related deaths in one year/total births in same year x 1000
For a district of 200 000 people the following rates and totals might be expected:
Mortality indicators:
(i) Crude death rate: defined as the number of deaths per 1000 population per year in a given community.
(ii) Expectation of Life: Life expectancy at birth is the average number of years that
will be lived by those born alive into a population if the current age specific
mortality rates persist.
(iii) Infant Mortality rate: The ratio of deaths under one year of age in a given year to
the total number of live births in the same year; usually expressed as a rate per
1000live births. It is one of the most universally accepted indicators of health
status not only of infants but also of whole population and of the social economic
conditions under which they live.
(iv) Child mortality rate: another indicator related to overall health status is the number of deaths in early childhood (1-4 years) in a given year, per 1000 children in that age group at the midpoint of the year concerned. It thus excludes infant mortality.
(v) Under-5 mortality rate: It is the proportion of total deaths occurring in the under-
5 age group. This rate can be used to reflect both infant and child mortality rates.
(vi) Maternal mortality rate: Maternal (puerperal) mortality accounts for the greatest
proportion of deaths among women of reproductive age in most of the
developing world although its importance is not always evident from official
statistics.
(vii) Disease specific mortality: mortality rates can be computed for specific diseases.
(viii) Proportional mortality rate: one of the simplest measures for estimating the burden of a disease in the community is the proportional mortality rate, i.e. the proportion of all deaths currently attributed to that disease.
5.5.7 Migration
Urbanization is the movement of people from rural areas to urban areas, usually in search of a better life and employment.
Age pyramids
Such a representation is called an age pyramid. A vivid contrast may be seen in the age distribution of men and women in India and in the UK.
The age pyramid of India is typical of under-developed countries, with a broad base and a tapering top. In the developed countries, as in the UK, the pyramid generally shows a bulge in the middle, and has a narrower base.
POPULATION GROWTH
The population growth in a district depends on the balance between the number of births and people migrating into the district on the one hand, and the number of deaths and people migrating out on the other. Occasionally, a district's population may actually be declining, but this is usually due to migration away from the area, and not because deaths outnumber births.
The rate of natural increase, which excludes migration, is commonly between 1% and 3% per year in many developing countries and is calculated as follows:

rate of natural increase = (births - deaths in one year)/total mid-year population x 100
This rate largely determines how fast the district population will grow, as shown in Table 3.2 (figures calculated to the nearest 100 people, and percentages to the nearest whole number).
Example : if the population in an area was estimated to be 7830 on 31 March 1985 and 8450 on
30 September 1989, then the average increase per year is estimated to be (8450-7830) divided
by 4.5 =138 people. The estimated population on 30 September 1990 is therefore 8450 +
138=8588.
This method assumes that the increase in the number of people per year is constant. However, when used for projections over a longer period of time, this method tends progressively to underestimate the total population, as populations tend to grow at a constant rate of growth rather than by a constant absolute increase per year (as illustrated in Table 3.2). With a 3% natural growth rate the population will double in about 23-25 years.
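The two projection methods can be compared in a short sketch (our own illustration, starting from the 1989 population in the example above):

```python
# Constant absolute increase per year (linear) versus constant growth
# rate (exponential). Over 25 years at 3%, the exponential projection
# is well over double the starting population, while the linear one
# falls far behind.

def project_linear(pop, annual_increase, years):
    return pop + annual_increase * years

def project_exponential(pop, annual_rate, years):
    return pop * (1 + annual_rate) ** years

p0 = 8450                                        # population in 1989
print(project_linear(p0, 138, 25))               # 11900
print(round(project_exponential(p0, 0.03, 25)))  # ~17700 -- more than 2 x p0
```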
A knowledge of the number of people living in the district, with additional information on their age, sex and geographical distribution, is necessary for several aspects of planning and evaluation of health services. DHMTs will need population estimates for the district to provide:
Total population by age and sex groups and other relevant criteria
Total number of expected live births and deaths per year.
1. Reports on census
2. Reports from other studies
3. Other sources: authorities concerned with the provision of other services in the area, e.g. education, housing, law enforcement and public utilities.
To ask local leaders how many people they are responsible for, and to add all the responses to obtain a total. However, beware of being misled for various reasons, such as fear of taxation.
Ask for the total number of households, or count them, and multiply by the known average household size for your district (Vaughan and Morrow 1989:27-28).
Activity
Demography
Census
Write short notes on
a) Population growth
b) Crude birth rate
c) Crude death rate