Biostatistics Lecture Notes
1.0 BIOSTATISTICS
This topic serves as an introduction to the whole module. It describes the main areas covered
under the heading statistics in health and introduces the idea of statistical investigations.
A particular problem for health managers is that most decisions need to be taken in the light of
incomplete information. That is, not everything is known about current health processes, and
very little (if anything) will be known about future situations.
The techniques described in this module enable structures to be built up which help management
to alleviate this problem.
Statistics can be described as a collection of techniques and methods that may be used to solve
problems that arise when one wants to draw general conclusions from data from epidemiological
and other types of empirical studies.
Statistics is the science of collecting, summarising, presenting and interpreting data, and of using
them to test hypotheses. It involves:
(a) the selection, collection and organisation of basic facts into meaningful data, and then
(b) the summarising, presentation and analysis of data into useful information (Francis
2004:1)
The gap between facts as they are recorded anywhere in the health institution/environment and
information which is useful to management is usually a large one. (a) and (b) above describe the
processes that enable this gap to be bridged.
1.3 Probability can be thought of as the ability to attach limits to areas of uncertainty.
Health management decisions are based on numerous pieces of information obtained from
different sources. They may have used one, some or all of the techniques which have been
described as statistical methods or probability. What the decisions all have in common,
however, is that they are the final product of a general structure (or set of processes) known as
an investigation or survey.
Some significant factors are listed as follows:
a) Investigations can be fairly trivial affairs, such as looking at the patient attendance at
the health institution.
b) Investigations can be carried out in isolation or in conjunction with others.
c) Investigations can be regular (routine or ongoing) or ‘one off’.
d) Investigations are carried out on a population. A population is the entirety of people or
items (technically known as members) being considered.
Stages in an investigation
However small or large an investigation is, there are certain identifiable landmark
stages through which it should normally pass. These are listed as follows:
There are different methods used in choosing the subjects for an investigation and the different
ways of collecting data.
1.6 Data refers to facts, opinions and statistics that have been collected together and recorded
for reference or for analysis (refer to Francis, Introduction to Computers).
There are two main sources of data, namely primary data and secondary data.
1.7 Primary data is the name given to data that are used for the specific purpose for which
they were collected. (E.g. Censuses and samples)
1.8 Secondary data is the name given to data that are being used for some purpose other
than that for which they were originally collected.
1.9 Censuses
A census is the name given to a survey which examines every member of a population.
The Zambian government statistical office carries out many official censuses. Some of these are:
1. A population census, taken every ten years, obtaining information such as age, sex,
relationship to head of household, occupation, education, and number of rooms in the place of
dwelling for the whole population of Zambia
2. Demographic Health Surveys (DHS)
3. Economic surveys, etc.
A census has the advantage of completeness and of being accepted as representative, but has the
disadvantages of cost and delay: the collection and release of results can take up to two years.
1.10 Samples
In practice, due to various limitations including time and financial resources, most of the
information obtained by organisations about any population will come from examining a
small, representative subset of the population. This is called a sample.
The advantages of sampling are usually lower cost, time and resources; results can be available
within a short time.
A general disadvantage is a natural reluctance by the lay person to accept the results as
representative. Other disadvantages depend on the particular method of sampling used.
1.11 Bias
a) Selection bias. This can occur if the sample is not truly representative of the population.
b) Structure and wording bias. This can result from badly worded questions.
c) Interviewer bias. If the subjects of an investigation are personally interviewed, the
interviewer might project biased opinions or an attitude that does not gain the full
cooperation of the subjects.
d) Recording bias. This can result from badly recorded answers or clerical errors made by
an untrained workforce.
Certain sampling methods require each member of the population under consideration to be
known and identifiable. The structure which supports this identification is called a sampling
frame.
The sampling techniques most commonly used in health surveys can be split into three
categories.
a) Random sampling. This ensures that each and every member of the population under
consideration has an equal chance of being selected as part of the sample.
Two types of random sampling used are:
(i) Simple random
(ii) Stratified (random) sampling
b) Quasi-random sampling (‘quasi’ means ‘almost’ or ‘nearly’). This type of technique, while
not satisfying the criterion given in (a) above, is generally thought to be as representative
as random sampling under certain conditions. It is used when random sampling is either
not possible or too expensive to consider.
Two types that are commonly used are :
(i) Systematic sampling
(ii) Multistage sampling
c) Non-random sampling. This is used when neither of the above techniques is possible or
practical. Two well-used types are:
(i) Cluster sampling
(ii) Quota sampling.
The two categories above that involve randomness (random and quasi-random sampling)
normally require the use of random sampling numbers. These consist of the ten digits 0 to 9,
generated in a random fashion (normally by a computer; note that some scientific calculators
also perform this function) and arranged in groups for reading convenience. The term ‘generated
in a random fashion’ can be interpreted as ‘the chance of any one digit occurring in any position
in the table is no more or less than the chance of any other digit occurring’.
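A table of random sampling numbers of the kind described above can be sketched in Python. This is an illustration, not part of the original notes; the function name and the group size of 5 are assumptions chosen for readability.

```python
import random

def random_digit_groups(n_groups, group_size=5, seed=None):
    """Generate groups of random digits (0-9), as used in
    tables of random sampling numbers."""
    rng = random.Random(seed)
    return [
        "".join(str(rng.randint(0, 9)) for _ in range(group_size))
        for _ in range(n_groups)
    ]

# e.g. a small table of six 5-digit groups
table = random_digit_groups(6, seed=42)
print(" ".join(table))
```

Each digit is drawn independently and uniformly from 0 to 9, which matches the ‘generated in a random fashion’ interpretation given above.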
There is no universal formula for calculating the size of a sample. However, as a starting point,
there are two facts that are well known from statistical theory and should be remembered.
1. The larger the size of sample, the more precise will be the information given about the
population.
2. Above a certain size, little extra information is given by increasing the size. This entails
that a sample need only be large enough to be reasonably representative of the
population.
Data collection is the means by which information is obtained from the selected subjects of an
investigation. There are several methods, the common ones being:
(a) Individual (personal) interview. This method is probably the most expensive, but has the
advantage of completeness and accuracy.
Other factors involved are
i. interviewers need to be trained;
ii. interviews need arranging
iii. can be used to advantage for pilot surveys, since questions can be thoroughly
tested;
iv. uniformity of approach if only one interviewer is used;
v. an interviewer can see or sense if a question has not been fully understood and it
can be followed up on the spot
vi. this form of data can be used in conjunction with random or quasi random
sampling.
b) Postal questionnaire
c) Street(informal) interview
d) Telephone interview
e) Direct observation. This is normally considered to be the most accurate form of data
collection, but it is very labour-intensive and cannot be used in many situations.
Andre Francis (2004:16) states that “If a questionnaire is used in a statistical survey, its
design requires careful consideration. A badly designed questionnaire can cause many
administrative problems and may cause incorrect deductions to be made from statistical
analyses of the results. One major reason why pilot surveys are carried out is to check
typical responses to the questions. Some important factors in the design of questionnaires
include the following:
A useful check on the adequacy of the design of a questionnaire can be given by conducting a
pilot survey”.
1.18 The use of secondary data
Secondary data are typically used when:
a) the time, manpower and resources necessary for your own survey are not available, or
b) the data already exist and provide most, if not all, of the information required.
The advantages of using secondary data are saving on time, manpower and resources in sampling
and data collection. In other words, somebody else has done the “spade work” already.
Surname, First name (Year), Title of Article (2nd Edition) Publisher; Place of Publication
The disadvantages of using secondary data can be formidable, and careful examination of the
sources of data is essential. Problems include the following:
Internal
a) patient file
b) patient registers
c) information on raw materials for the institution
External
Without doubt the most important external secondary data sources are official statistics
supplied by the central statistical office and other government departments. These include:
This topic is concerned with the forms that data can take, how data are measured and the
errors and approximations that are often made in their description.
Data classification
Continuous data: data that cannot be measured precisely; their values can only be
approximated to, e.g. dimensions (lengths, heights), weights, areas, volumes, temperatures and times.
Unpredicted errors
An interval (range)
Rounding errors
Errors in expressions
Error rules: the range of error for the addition, subtraction, multiplication and division of two
numbers, each of which is subject to error, is given by the following.
Range of error
If the value of the number X lies in the range [a, b] and Y lies in the range [c, d] then:
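The rules themselves are the standard interval-arithmetic results, which can be sketched in Python as an illustration (the function names are assumptions; each rule is stated in a comment):

```python
def add_intervals(x, y):
    # If X is in [a, b] and Y is in [c, d], then X + Y is in [a + c, b + d]
    (a, b), (c, d) = x, y
    return (a + c, b + d)

def sub_intervals(x, y):
    # X - Y is in [a - d, b - c]
    (a, b), (c, d) = x, y
    return (a - d, b - c)

def mul_intervals(x, y):
    # X * Y: take the min and max over all four "corner" products
    (a, b), (c, d) = x, y
    products = [a * c, a * d, b * c, b * d]
    return (min(products), max(products))

def div_intervals(x, y):
    # X / Y (assuming 0 does not lie inside [c, d])
    (a, b), (c, d) = x, y
    quotients = [a / c, a / d, b / c, b / d]
    return (min(quotients), max(quotients))

# e.g. a length measured as 10 +/- 1 added to one measured as 5 +/- 0.5
print(add_intervals((9, 11), (4.5, 5.5)))   # (13.5, 16.5)
```

The result of each operation is the tightest interval guaranteed to contain the true value, given that each operand is only known to lie within its stated range.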
This topic deals with the organisation and presentation of numeric (univariate) data. It
describes how numeric data can be organised into frequency distributions of various types,
their graphical presentation and their interpretation.
Raw statistical data: Before the data obtained from a statistical survey or investigations have
been worked on they are referred to as raw data.
Data arrays. One way of extracting some information from data is by arranging them into
size order. This is known as a data array.
Simple frequency distributions: a simple frequency distribution consists of a list of data
values, each showing the number of items having that value (called the frequency).
Tally chart is a chart made of rows and columns for data value, tally marks and the total.
This information may then be recorded in a frequency distribution table.
A grouped frequency distribution summarises data into groups of values, each showing the
number of items having values in that group, e.g. hospital attendance by age group:

Age group   Attendance
0-1         200
2-5         100
6-16        50
20-25       20
Note: individual data values cannot be identified with this type of structure.
a) All data values being represented must be contained within exactly one class;
overlapping classes must not occur.
b) The classes of the distribution must be arranged in size order.
c) There should normally be 8 to 10 classes in total, with not fewer than 5 or more than 15.
d) Class descriptions should be easy to interpret, with ranges that naturally describe the data
being presented.
As a rule of thumb, divide the range by 10 and adjust the figure upward to obtain a standard class width.
Tally
Frequency table
Histograms
The types of diagrams described in this topic include pictograms, various types of bar charts, pie
charts, line and strata graphs, Z-charts, Gantt charts and semi-logarithmic graphs. Since all the
above graphs and charts display either non-numeric frequency distributions or time series, the
topic begins by describing these types of data structure.
Non-numeric frequency distributions take a similar form to their numeric counterparts, except
that the groups (or classes) describe qualitative (i.e. non-numeric) characteristics.
Time series
The name given to data describing the values of some variable over successive time periods is a
time series, e.g. by year or month.
Pictograms
Pie charts
A pie chart shows the totality of the data being represented using a single circle (a ‘pie’). The
circle is split into sectors, the size of each being drawn in proportion to the class frequency. Each
sector can be shaded or coloured differently if desired.
Line diagrams
A line diagram plots the values of a time series as a sequence of points joined by straight lines
Others
Strata charts
ACTIVITY 1
Define statistics and state the importance of statistics in the health sector.
Descriptive methods only describe variations in data. They do not solve problems that were raised
in connection with the examples. Descriptive statistics enable you to describe (and compare)
variables numerically. Statistics to describe a variable focus on two aspects:
a) Central tendency
The three ways of measuring central tendency most used in research are the mean, the
median and the mode.
b) Dispersion
As well as describing the central tendency of a variable, it is important to describe how the data
values are dispersed around the central tendency. This is only possible for quantifiable data. Two
of the most frequently used ways of describing the dispersion are:
- the extent to which values differ from the mean (the standard deviation) (Saunders, pp. 351-367)
From the foregoing we have the data before us; our next step, from the statistical point of view, is
to deal with basic analysis of univariate data (data obtained from measuring just one
attribute). ‘Statistical measures’ is the name given to this type of analysis, the measures
themselves being split into various groups.
Measures of location: commonly called averages, these are the most well-known measures of
numeric data.
Measures of dispersion: describe how spread out a set (or distribution) of values is.
Measures of skewness: show how evenly a set of items is distributed; these, together with
various measures of dispersion, are dealt with later.
That is, the arithmetic mean = the sum of all the values / the number of values.
Where n is the number of items in the set, the sum
x1 + x2 + x3 + x4 + ... + xn
is written as ∑x.
∑ is the Greek capital S (for ‘sum’), and ∑x can simply be translated as ‘add up all the x
values under consideration’. For the example above:
∑x = 4 + 5 + 6 + 12 + 2 = 29
so that the arithmetic mean is
x̄ = (x1 + x2 + ... + xn)/n = ∑x/n = 29/5 = 5.8
x     10   12   13    14   16   19
f      2    8   17     5    1    1
fx    20   96   221   70   16   19

n = ∑f = 34,  ∑fx = 442

The mean of a frequency distribution is sometimes called a weighted mean, the frequencies
acting as weights:
x̄ = ∑fx / ∑f = 442/34 = 13.0
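The calculation x̄ = ∑fx/∑f for the table above can be sketched in Python (an illustration, not part of the original notes; the x and f values are taken from the table):

```python
x = [10, 12, 13, 14, 16, 19]   # the distinct data values
f = [2, 8, 17, 5, 1, 1]        # their frequencies

sum_f = sum(f)                                  # n = sum of f = 34
sum_fx = sum(fi * xi for fi, xi in zip(f, x))   # sum of fx = 442
mean = sum_fx / sum_f
print(sum_fx, sum_f, mean)                      # 442 34 13.0
```

Each fx term weights a value by how many times it occurs, so dividing by ∑f gives the same answer as averaging the full list of 34 raw values.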
Summary
Activity
What is a measure of location?
How is arithmetic mean defined?
Why is the special notation x1 + x2 + ... + xn used?
What does ∑fx mean?
Why is the formula for the arithmetic mean of a frequency distribution different from that
for the mean of a set?
2.2 Median
The median of a set of data is the value of the item which lies exactly halfway along the set
(arranged into size order).
For a set with an odd number (n) of items, the median can be precisely identified as the value of
the (n+1)/2 th item. Thus in a size-ordered set of 15 items, the median would be the
(15+1)/2 th = 8th item along.
Median for a simple frequency distribution
x     10   12   13   14   16   19
f      2    8   17    5    1    1

Step 1  Form the cumulative frequency column, F.
Step 2  Calculate (∑f + 1)/2 = (34 + 1)/2 = 17.5

x     f     F
10    2     2
12    8     10
13    17    27
14    5     32
16    1     33
19    1     34
      34 = ∑f = n

Step 3  Find the first F value that exceeds (∑f + 1)/2 = 17.5; here this is F = 27.
Step 4  The median is the x-value corresponding to the F value identified in Step 3:
median = 13.
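The cumulative-frequency steps above can be sketched in Python (an illustration, not part of the original notes):

```python
x = [10, 12, 13, 14, 16, 19]
f = [2, 8, 17, 5, 1, 1]

n = sum(f)                 # 34
position = (n + 1) / 2     # 17.5

# accumulate frequencies (the F column) until the running total
# first reaches or exceeds the median position
F = 0
for xi, fi in zip(x, f):
    F += fi
    if F >= position:
        median = xi
        break
print(median)   # 13
```

The loop stops at x = 13, whose cumulative frequency (27) is the first to exceed 17.5, agreeing with Steps 3 and 4 above.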
Activity
2.3 Mode
The mode of a set of data is that value which occurs most often or, equivalently, has the largest
frequency.
The mode of the set 2, 1, 3, 3, 1, 1, 2, 4 is 1, since that value occurs most often (three times).
For a simple frequency distribution, the mode is the x-value with the largest frequency. For example:

x    4   5   6    7    8   9   10
f    2   5   21   18   9   2   1

Here the mode is 6, since it has the largest frequency (21).
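Both mode calculations can be sketched in Python (an illustration, not part of the original notes):

```python
from collections import Counter

# mode of a raw set: the value with the highest count
data = [2, 1, 3, 3, 1, 1, 2, 4]
mode_of_set = Counter(data).most_common(1)[0][0]
print(mode_of_set)   # 1

# mode of a simple frequency distribution: the x-value
# whose frequency is largest
x = [4, 5, 6, 7, 8, 9, 10]
f = [2, 5, 21, 18, 9, 2, 1]
mode_of_dist = x[f.index(max(f))]
print(mode_of_dist)  # 6
```

Counter does the tallying that a tally chart does by hand; for the frequency distribution the tallying has already been done, so the mode is simply read off against the largest f.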
Measurement of dispersion
Measures of dispersion describe how spread out or scattered a set or distribution of numeric
data is. There are different bases on which the spread of data can be measured:
a) Spread about the mean. This is concerned with measuring the distance between the
items and their common mean. Two measures of this type are used:
i. the mean deviation
ii. the standard deviation (a measure that is particularly important and widely used)
b) Central percentage spread of items. These measures have links with the median:
i. the 10-to-90 percentile range
ii. the quartile deviation
c) Overall spread of items. This measure is called the range.
2.4 Range
The range is defined as the numerical difference between the smallest and largest values of the
items in a set or distribution. For example, for a set whose smallest value is 2 and largest value is 9:
Range = 9 − 2 = 7
Characteristics of a range
The mean deviation is a measure of dispersion that gives the average absolute difference (i.e.
ignoring ‘minus’ signs) between each item and the mean.
The mean deviation (md) is calculated using the formula:
md = ∑|x − x̄| / n        (for a set)
md = ∑f|x − x̄| / ∑f      (for a frequency distribution)
NOTE: the modulus symbol |…| means ‘the absolute value of’ and simply ignores the sign of
the expression inside it,
e.g. |−6| = |6| = 6
Example The mean deviation for a set
Question
Calculate the mean deviation of 43, 75, 48, 39, 51, 47, 50, 47
Answer
The mean is x̄ = (43 + 75 + 48 + 39 + 51 + 47 + 50 + 47)/8 = 400/8 = 50.
md = ∑|x − x̄| / n = (7 + 25 + 2 + 11 + 1 + 3 + 0 + 3)/8 = 52/8
= 6.5
In other words, each value in the set is, on average, 6.5 units away from the common mean.
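The same calculation can be sketched in Python (an illustration, not part of the original notes):

```python
data = [43, 75, 48, 39, 51, 47, 50, 47]

mean = sum(data) / len(data)                 # 400 / 8 = 50.0
abs_devs = [abs(xi - mean) for xi in data]   # |x - mean| for each item
md = sum(abs_devs) / len(data)               # mean deviation
print(mean, md)                              # 50.0 6.5
```

The abs() call plays the role of the modulus signs: each deviation contributes its size, never its sign.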
Question
The data in table 1 following relates to the number of successful sales made by the salesmen
employed by a large microcomputer firm in a particular quarter.
Calculate the mean and the mean deviation of the number of sales.
Answer:
The standard layout and calculations are shown in Table 2. The mean is calculated first, then used
to find the mean deviation.

Sales      f     x     fx     |x − x̄|    f|x − x̄|
0 to 4     1     2     2      13.3       13.3
5 to 9     14    7     98     8.3        116.2
10 to 14   23    12    276    3.3        75.9
15 to 19   21    17    357    1.7        35.7
20 to 24   15    22    330    6.7        100.5
25 to 29   6     27    162    11.7       70.2
Totals     80          1225              411.8

x̄ = ∑fx / ∑f = 1225/80 = 15.3 (1D)

For example, for the 20-24 group, |x − x̄| = |22 − 15.3| = |6.7| = 6.7.

md = ∑f|x − x̄| / ∑f
= [1(13.3) + 14(8.3) + 23(3.3) + 21(1.7) + 15(6.7) + 6(11.7)] / 80
= 411.8/80
= 5.1 sales
a) The mean deviation can be regarded as a good representative measure of dispersion that
is not difficult to understand. It is useful for comparing the variability between
distributions of like nature.
b) Its practical disadvantage is that it can be complicated and awkward to calculate if the
mean is anything other than a whole number.
c) Because of the modulus sign, the mean deviation is virtually impossible to handle
theoretically and thus not used in more advanced analysis.
The standard deviation is defined as ‘the root of the mean of the squares of the deviations from the
common mean’ of the set of values.
A procedure for calculating the standard deviation is described and at the same time,
demonstrated using the set of values, 2, 4, 6 and 8.
It is calculated as follows:
STEP 1  Calculate the mean:
x̄ = (2 + 4 + 6 + 8)/4 = 20/4 = 5
STEP 2  Find the sum of the squares of the deviations of the items from the mean:
∑(x − x̄)² = (2−5)² + (4−5)² + (6−5)² + (8−5)² = 9 + 1 + 1 + 9 = 20
STEP 3  Divide this sum by the number of items and take the square root:
s = √(∑(x − x̄)²/n) = √(20/4) = √5 = 2.24 (2D)
(For a small sample, say of size less than 30, the sum is divided by n − 1 rather than n.)
An equivalent ‘computational’ formula, which is usually easier to work with, is:
s = √(∑x²/n − x̄²)
For the set above, ∑x² = 4 + 16 + 36 + 64 = 120, and 120/4 = 30, so
s = √(30 − 5²) = √5 = 2.24, as before.
For a frequency distribution the corresponding formula is:
s = √( ∑fx²/∑f − (∑fx/∑f)² )
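Both forms of the formula can be sketched in Python for the set 2, 4, 6, 8 (an illustration, not part of the original notes):

```python
data = [2, 4, 6, 8]
n = len(data)

# definition formula: s = sqrt( sum((x - mean)^2) / n )
mean = sum(data) / n                               # 5.0
sum_sq_dev = sum((xi - mean) ** 2 for xi in data)  # 9 + 1 + 1 + 9 = 20
s_def = (sum_sq_dev / n) ** 0.5

# computational formula: s = sqrt( sum(x^2)/n - mean^2 )
sum_x2 = sum(xi ** 2 for xi in data)               # 120
s_comp = (sum_x2 / n - mean ** 2) ** 0.5

print(round(s_def, 2), round(s_comp, 2))           # 2.24 2.24
```

The two formulas are algebraically identical; the computational form avoids subtracting the mean from every item, which is why it is preferred for hand or calculator work.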
Question
The data in Table 2 bellow relates to the number of successful sales made by the salesmen
employed by a large microcomputer firm in a particular quarter. Calculate the mean and standard
deviation of the number of sales.
Answer
The standard layout and calculations are shown in table 3 and the subsequent text.
Table 2 (the given data):

Sales      f
0 to 4     1
5 to 9     14
10 to 14   23
15 to 19   21
20 to 24   15
25 to 29   6

Table 3 (the working):

Sales      f     x     fx     fx²
0 to 4     1     2     2      4
5 to 9     14    7     98     686
10 to 14   23    12    276    3312
15 to 19   21    17    357    6069
20 to 24   15    22    330    7260
25 to 29   6     27    162    4374
Totals     80          1225   21705

x̄ = ∑fx / ∑f = 1225/80 = 15.3 (1D)

s = √( ∑fx²/∑f − (∑fx/∑f)² )
= √(21705/80 − (1225/80)²)
= √(271.31 − 234.47)
= √36.84
= 6.1 sales
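The grouped-data calculation can be sketched in Python using the class mid-points (an illustration, not part of the original notes):

```python
# class mid-points and frequencies from the sales table
x = [2, 7, 12, 17, 22, 27]
f = [1, 14, 23, 21, 15, 6]

sum_f = sum(f)                                        # 80
sum_fx = sum(fi * xi for fi, xi in zip(f, x))         # 1225
sum_fx2 = sum(fi * xi ** 2 for fi, xi in zip(f, x))   # 21705

mean = sum_fx / sum_f                                 # 15.3125
s = (sum_fx2 / sum_f - mean ** 2) ** 0.5              # standard deviation
print(round(mean, 1), round(s, 1))                    # 15.3 6.1
```

Note that, as for the mean, the result is only an estimate: each item is represented by the mid-point of its class.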
b) Use of a calculator. When calculating (and at the same time accumulating) the values of fx
and fx², use the type of technique described in the previous example. It will ensure the
least number of keystrokes for all the information required in the table. For example, in
the case of the fx values, the calculator procedure is:
1 × 2 M+ and write down the value (2)
14 × 7 M+ and write down the value (98)
… etc., down to:
6 × 27 M+ and write down the value (162)
RM gives the value of ∑fx (1225)
It is sometimes necessary to compare two different distributions with regard to variability. For
example, if two machines were engaged in the production of identical components, it would be of
considerable value to compare the variation of certain critical dimension of their output.
However, the standard deviation is used as a measure for comparison only when the units in the
distribution are the same and the respective means are roughly comparable.
In the majority of cases where distributions need to be compared with respect to variability, the
following measure, known as the coefficient of variation, is much more appropriate and is
considered the standard measure of relative variation.
Coefficient of variation: cv = (standard deviation / mean) × 100%
In words, the coefficient of variation calculates the standard deviation as a percentage of the
mean.
Since the standard deviation is being divided by the mean, the actual units of measurements
cancel each other out, leaving the measure unit free and thus very useful for relative
comparison.
Over a period of three months the daily number of components produced by two comparable
machines was measured, giving the following statistics:
Machine A: cv = 8.4%
Machine B: cv = 8.2%
Thus, although the standard deviation for machine B is higher in absolute terms, the dispersion
for machine A is higher in relative terms.
Skewness was described in the previous chapter, where it was shown that the degree of skewness
could be measured by the difference between the mean and the mode. However, for most
practical purposes it is usual to require a measure of skewness to be unit free (i.e. a coefficient),
and the following expression, known as Pearson’s measure of skewness (Psk), is of this type:
Psk = (mean − mode) / standard deviation
or, using the approximation mean − mode ≈ 3 × (mean − median),
Psk = 3 × (mean − median) / standard deviation
Thus the skewness of two different sets of employee remuneration can be compared if, perhaps,
one is given in terms of weekly wages and the other in terms of annual salary. Note that:
The greater the value of Psk (positive or negative), the more the distribution is skewed.
Question
The data in Table 2 bellow relates to the number of successful sales made by the salesmen
employed by a large microcomputer firm in a particular quarter. Calculate the mean and standard
deviation of the number of sales.
The mean and standard deviation (earlier calculated) are 15.3 and 6.1 respectively.
Estimate the value of the mode and thus calculate Pearson’s measure of skewness.
Answer
The modal class is 10-14 (since it has the largest frequency), with a lower class bound of
L = 9.5. The differences between the largest frequency and those on either side are
D1 = 23 − 14 = 9 and D2 = 23 − 21 = 2, and the modal class width is w = 5 (i.e. the values
10, 11, 12, 13 and 14). The mode can then be estimated as:
mode = L + D1/(D1 + D2) × w = 9.5 + 9/11 × 5 = 13.6 (1D)
Psk = (mean − mode) / standard deviation
= (15.3 − 13.6) / 6.1
= +0.28 (2D)
This value of Psk demonstrates a small degree of right skew, which can be confirmed by
inspecting the given frequency distribution table.
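The mode estimate and Pearson's measure can be sketched in Python (an illustration, not part of the original notes; the mean and standard deviation are taken from the earlier calculations):

```python
# grouped-data mode estimate: mode = L + D1 / (D1 + D2) * w
L = 9.5         # lower bound of the modal class (10-14)
D1 = 23 - 14    # modal frequency minus frequency of the class below
D2 = 23 - 21    # modal frequency minus frequency of the class above
w = 5           # modal class width

mode = L + D1 / (D1 + D2) * w            # about 13.6
mean, sd = 15.3, 6.1                     # from the earlier calculations

psk = (mean - mode) / sd                 # Pearson's measure of skewness
print(round(mode, 1), round(psk, 2))     # 13.6 0.28
```

A positive psk indicates right skew, agreeing with the conclusion above.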
Summary
a) The standard deviation of a set of values is defined as ‘the root of the mean of the
squared deviations from the mean’.
b) This most important measure of dispersion is always paired with the arithmetic mean
because:
i. it is defined in terms of the mean
ii. it is relatively easy to handle theoretically
iii. both measures are needed when dealing with normal distributions
c) The standard deviation is usually calculated using the ‘computational’ formula.
d) For distributions that are approximately normal the standard deviation should cover
approximately one-sixth of the range.
e) The coefficient of variation is an alternative (unit free) measure to the standard deviation
when comparing distributions. It is found by calculating the standard deviation as a
percentage of the mean.
A quartile is the name given to an item that lies a certain proportion of the way along a size-ordered set
or distribution. Quartiles are normally considered in groups, the members of which split the
data set into equal portions.
a) The median, as we already know, is the middle item of a size-ordered set. Thus, by
definition, it can be said to split a set (or frequency distribution) into two equal portions.
In other words, it is a particular example of a quartile.
b) Quartiles are best suited to types of business data that:
(i) Are particularly susceptible to extremes: wages of employees; turnover of
companies; value of customer orders; in fact, any distribution that has at least a
moderate amount of skew.
(ii) Have distributions that have either open-ended classes or data that are difficult,
expensive or impossible to obtain at extremes.
Remember that this is just the type of data that was described in chapter 7 as being best suited for
analysis using the median.
c) A particular problem at this stage is the fact that no measure of dispersion so far
introduced can be paired naturally with the median, in order to represent data of the type
described above.
Both the mean deviation and the standard deviation are defined in terms of the mean and thus
unsuitable.
Thus it is necessary to find measures of dispersion and skewness, based on the idea of splitting
size –ordered data into equal portions, that will naturally partner the median. Such a measure, the
quartile deviation, is developed in the following sections.
Quartiles
A (size-ordered) set of data can be split up into four equal portions. The three values that do this,
lying respectively one-quarter, one-half, and three-quarters of the way along the set, are known as
‘quartiles’.
For example the set 7,4,5,3,3,9,8 can be size-ordered and the quartiles identified as follows:
3,3, 4,5,7,8,9
For this particular set of data, the quartiles are easily identified as 3, 5 and 8 respectively and are
shown above. (Note that the middle quartile is the median.)
To summarise:
The three quartiles of an ordered set or frequency distribution are those values that lie one-
quarter, one-half and three-quarters of the way along the group, and are respectively called, the
lower, middle and upper quartiles (Q1, Q2 and Q3).
The median is the middle quartile (Q2).
For an ordered set of data, just as the median can be identified as the value of the (n+1)/2 th item,
the other two quartiles can be identified as the values of the (n+1)/4 th item (lower quartile, Q1)
and the 3(n+1)/4 th item (upper quartile, Q3).
Note that although the median is, by definition, the middle quartile, the term ‘quartiles’ is often
used to describe only the lower and the upper quartiles, Q1 and Q3 respectively.
The identification of the quartiles enables the measure of dispersion to be defined. This is known
as the quartile deviation and is defined as half the range of the middle 50% of items (i.e. the
difference between the lower and the upper quartiles divided by 2). It is thus sometimes referred
to as the ‘semi-interquartile range’.
To summarise:
qd = (Q3 − Q1) / 2
It was stressed earlier that, because of the importance of the median, there is a need for a measure
of dispersion to pair with it. This measure is the quartile deviation. Q3-Q1 gives what is called the
inter quartile range which is the range covered by the central 50% of items, and dividing by 2
gives (what can only be described as) the average distance between the median and the quartiles.
Thus, approximately, it can be considered that approximately 50% of all items lie within one
quartile deviation either side of the median. Generally, from now on the median and quartile
deviation will be calculated and referred to as a linked pair.
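The quartiles and the quartile deviation for the earlier set 7, 4, 5, 3, 3, 9, 8 can be sketched in Python (an illustration, not part of the original notes):

```python
data = sorted([7, 4, 5, 3, 3, 9, 8])   # -> [3, 3, 4, 5, 7, 8, 9]
n = len(data)

# for an ordered set, Q1, Q2 and Q3 lie at positions (n+1)/4,
# (n+1)/2 and 3(n+1)/4 (here n = 7, so positions 2, 4 and 6)
q1 = data[(n + 1) // 4 - 1]
q2 = data[(n + 1) // 2 - 1]      # the median
q3 = data[3 * (n + 1) // 4 - 1]

qd = (q3 - q1) / 2               # the quartile deviation
print(q1, q2, q3, qd)            # 3 5 8 2.5
```

For n = 7 the quartile positions are whole numbers; for other set sizes interpolation between adjacent items would be needed.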
a) The quartiles have the property that they split a distribution into four equal segments,
which means effectively that the area under the frequency curve is divided into four equal
parts.
[Figure: a frequency curve split by Q1, the median (Q2) and Q3 into four equal areas of 25% each]
b) When calculating the median and quartiles for a simple discrete frequency
distribution (consisting of discrete values which are not grouped), the technique used is
essentially the same as that for a set. However, the form of the data requires the same
approach as that used for the calculation of the median (chapter 7, sections 5 and 6).
Example 3, which follows, demonstrates this technique for the calculation of the median
and quartiles.
c) For a grouped frequency distribution, the quartiles can only be estimated (as was the case
for the mean, standard deviation and median). The technique normally used is graphical.
The procedure is described in section 11 and demonstrated in example 4.
The normal distribution is the name given to a type of distribution of continuous data that
occurs frequently in practice. It is a distribution of ‘natural phenomena’, such as heights and
weights. Its main features are:
a) It has a (frequency) curve that is symmetric about the mean of the distribution. In other words,
one half of the curve is a mirror image of the other.
b) The majority of the values tend to cluster about the mean, with the greatest frequency at
the mean itself.
c) The frequencies of the values taper away the further they are from the mean.
A confidence interval is an estimated range of values about a point estimate that indicates the
degree of statistical precision that describes the estimate. The level of confidence is set
arbitrarily, but for any given level of confidence the width of the interval expresses the
precision of the measurement: a wider interval implies less precision, and a narrower interval
implies more precision. The upper and lower boundaries of the interval are the confidence
limits. (K. J. Rothman, 1998:124)
The study of normal distributions helps in the process of estimating certain population
measures based only on the results of small samples. Two measures of interest (for the
syllabuses covered in this text) are the mean of a population and a population
proportion.
Confidence Limits specify a range of values within which some unknown population
value (mean or proportion) lies with a stated degree of confidence. They are always based
on the results of a sample.
The following statement is typical of what needs to be able to be calculated and
understood:
‘95% confidence limits for the lead time of a particular type of stock item order are 4.1 to
7.4 working days’
The above would have been calculated on the basis of a representative sample, and states
that there is a 95% probability that the true (unknown) mean lead time lies
between 4.1 and 7.4 working days.
The following gives a technique for evaluating a confidence interval for an unknown
population mean.
Confidence interval for the mean
Given a random sample from some population, a confidence interval (CI) for the unknown
population mean is:
x̄ ± z·s/√n
Where: x̄ = the sample mean
s = the sample standard deviation
n = the sample size
z = the confidence factor (1.64 for 90%: 1.96 for 95%: 2.58 for
99%).
Note: s/√n is known as the standard error (se) of the mean
As an example of the use of these confidence limits, suppose a sample of 100 invoices yielded a
mean gross value of $45.50 and a standard deviation of $3.24. The 95% confidence interval is:
45.50 ± (1.96)(3.24/√100)
=(44.865, 46.135)
In words, there is a 95% probability that the mean of the complete population of invoices (from
which the sample was taken) is between 44.9 and 46.1.
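The invoice example can be sketched in Python (an illustration, not part of the original notes; the function name is an assumption):

```python
import math

def mean_ci(xbar, s, n, z=1.96):
    """Confidence interval for the mean: xbar +/- z * s / sqrt(n)."""
    se = s / math.sqrt(n)        # the standard error of the mean
    return (xbar - z * se, xbar + z * se)

low, high = mean_ci(45.50, 3.24, 100)   # the invoice example, 95% level
print(round(low, 3), round(high, 3))    # 44.865 46.135
```

Swapping z = 1.96 for 1.64 or 2.58 would give the 90% or 99% limits, as in the table of confidence factors above.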
The following gives the technique for evaluating the confidence interval for unknown population
proportion.
Given a random sample from some population, a confidence interval (CI) for the unknown
population proportion is:
p ± z√(p(1 − p)/n)
where: p = the sample proportion
n = the sample size
z = the confidence factor (1.64 for 90%; 1.96 for 95%; 2.58 for 99%).
As an example of the use of these confidence limits, suppose 4 faulty components are discovered in
a random sample of 20 finished components taken from a production line. What statement can we
make about the defective rate of all finished components?
p = 4/20 = 0.2 and √(p(1 − p)/n) = √(0.2 × 0.8/20) = 0.089, so the 95% confidence interval is:
0.2 ± (1.96)(0.089)
= 0.2 ± 0.175 (3D)
= (0.025, 0.375)
Therefore we can state that there is a 95% chance that the defective rate of finished
components lies between 0.025 and 0.375.
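The faulty-components example can be sketched in Python (an illustration, not part of the original notes; the function name is an assumption):

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Confidence interval for a proportion: p +/- z * sqrt(p(1-p)/n)."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)   # standard error of the proportion
    return (p - z * se, p + z * se)

low, high = proportion_ci(4, 20)        # 4 faulty out of 20, 95% level
print(round(low, 3), round(high, 3))    # 0.025 0.375
```

The wide interval reflects the small sample: with only 20 components, the defective rate is estimated very imprecisely.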
Tests of significance are directly connected to confidence limits and are based on normal
distribution concepts.
Suppose we suspect that the value of type A customer monthly orders has changed from last year. Last year's type A customer average monthly order was $234.50. We now take a random sample of 20 customers and calculate a mean of $241.52 and standard deviation $13.92. Is the difference significant? That is, is the value 241.52 far enough from 234.50 to suspect that things have changed? The following gives a structure for answering this question.
To test whether a sample of size n, with mean x̄ and standard deviation s, could be considered as having been drawn from a population with mean µ, calculate the test statistic:
z = (x̄ − µ)/(s/√n)
In the test, we are looking for evidence of a difference between x̄ and µ. The evidence is found if z lies outside the limits given by the confidence factor (±1.96 at the 5% level). If z lies within the limits, we say there is 'no evidence' that the sample mean is different from the population mean.
For example, in the above situation, we have x̄ = 241.52; µ = 234.50; s = 13.92; n = 20:
z = (241.52 − 234.50)/(13.92/√20) = 7.02/3.11 = 2.26
Thus there is evidence of a difference, since z lies outside the range -1.96 to +1.96. This can now
be translated as ‘there is evidence that the value of type A customer monthly orders has
changed’.
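The significance test above can be sketched in Python (function name my own):

```python
import math

def z_statistic(sample_mean, pop_mean, sd, n):
    """z = (sample mean - population mean) / (s/sqrt(n)); |z| > 1.96
    is evidence of a difference at the 5% significance level."""
    return (sample_mean - pop_mean) / (sd / math.sqrt(n))

# Type A customer example: x̄ = 241.52, µ = 234.50, s = 13.92, n = 20
z = z_statistic(241.52, 234.50, 13.92, 20)
print(round(z, 2))      # 2.26
print(abs(z) > 1.96)    # True: evidence of a change
```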
ACTIVITY
Descriptive statistics look at three measures, namely measures of location, spread and skewness.
Write short notes on the three, giving the parameters used to measure each:
a) Mean
b) Mode
c) Median
d) Mean deviation
e) Standard deviation
f) Confidence interval for the mean
In part one we defined statistics as: a collection of techniques and methods that may be used to solve problems that arise when one wants to draw general conclusions from data from epidemiological and other types of empirical studies. We further stated that statistics is the science of collecting, summarising, presenting and interpreting data, and of using them to test hypotheses.
Analysis is defined as the ability to break down data and to clarify the nature of the component parts and the relationship between them (Saunders 2003:472).
Example 1: Spread of infectious diseases. Data on children exposed to siblings with measles, chicken pox or mumps.
1. To what degree are we permitted to generalise the information on attack rates to other exposed children? Are the results reliable?
2. Can we be sure that attack rates are highest for measles and lowest for mumps?
[Table: mean weight before and after treatment, with group size n, by treatment group; values not reproduced]
Statistical problems
The examples so far have illustrated some routine descriptive statistical techniques:
Histograms (Example 2)
Means (Example 2)
Descriptive methods only describe variations in data. They do not solve the problems that were raised in connection with the examples.
What is sampling?
Sampling involves the selection of a number of study units from a defined study population (Anita 1995:205). Whatever your research question(s) and objectives, you will need to collect data to answer them. If you collect and analyse data from every possible case or group member, this is termed a census. However, for many research questions and objectives it will be impossible for you either to collect or to analyse all the data available to you, owing to restrictions of time, money and often access. Sampling techniques provide a range of methods that enable you to reduce the amount of data you need to collect by considering only data from a subgroup rather than all possible cases or elements.
[Diagram: a random sample 'A' of cases or elements drawn from a population]
Sampling strategies
The sampling techniques available to you can be divided into two types: probability sampling and non-probability sampling.
Probability sampling involves random selection procedures to ensure that each unit of the sample is chosen on the basis of chance. All units of the study population should have an equal, or at least a known, chance of being included in the sample (Anita 1995:208).
With probability samples the chance, or probability, of each case being selected from the
population is known and is usually equal for all cases. This means that it is possible to answer
research questions and to achieve objectives that require you to estimate statistically the
characteristics of the population from the sample. (Saunders 2003:152)
This is the simplest form of probability sampling. To select a simple random sample you need to:
- make a numbered list of all the units in the population from which you want to draw a sample;
- select the required number of sampling units, using a 'lottery' method or a table of random numbers.
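The 'lottery' method can be sketched with Python's random module; the numbered frame of 100 units below is hypothetical:

```python
import random

# Hypothetical sampling frame: a numbered list of 100 units
frame = list(range(1, 101))

random.seed(1)                      # fixed seed so the draw is reproducible
sample = random.sample(frame, 10)   # 'lottery' draw without replacement

print(len(sample), len(set(sample)))   # 10 10; no unit is drawn twice
```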
HOW TO USE RANDOM NUMBER TABLES
1. First, decide how large a number you need. Next, count whether it is a one-, two-, or larger-digit number.
For example, if your sampling frame consists of ten units, you must choose from the numbers 1-10 (inclusive). You must use two digits to ensure that 10 has an equal chance of being included.
2. You also use two digits for a sampling frame consisting of 0-99 units. If, however, your sampling frame has 0-999 units, then you obviously need to choose from three digits. In this case, you take an extra digit from the table to make up the required three digits. For example, the number in columns 10 and 11, row 27, is 43; this would become 431; going down, the next numbers would be 107, 365, etc.
You would do the same if you needed a four-digit number, for a sampling frame of 0-9999 units. In our example, the number in columns 10, 11, 12, row 27 of the table, 431, would now become 4316, the next down 1075, and so on.
3. Decide beforehand whether you are going to go across the page to the right, down the page, across the page to the left, or up the page.
4. Without looking at the table, and using a pencil, pen, stick, or even your finger, pinpoint a number.
5. If this number is within the range you need, take it. If not, continue to the next number in the direction you chose beforehand (across, up or down the page), until you find a number that is within the range you need.
For example, if you need a number between 0-50 and you began at columns 21, 22, row 21, you get 74, which is obviously too big. So you could go down (having decided beforehand to go down) to 97, also too big, then to 42, which is acceptable, and select it.
In systematic sampling, individuals are chosen at regular intervals from the sampling frame. Ideally, we randomly select a number to tell us where to start selecting individuals from the list.
A systematic sample is to be selected from 1200 students of a school. The sample size selected is
100. The sampling fraction is: 100(sample size)/1200(study population) = 1/12
The sampling interval is therefore, 12. The number of the first student to be included is chosen
randomly, for example by blindly picking one out of twelve pieces of paper, numbered 1 to 12. If
the number 6 is picked, then every twelfth student will be included in the sample, starting with the
student number 6, until 100 students are selected: the numbers selected would be 6,18,30,42, etc
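The systematic selection just described can be sketched as follows; the function wrapper is my own, but the figures are the school example from the text:

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Select every k-th unit, where k = population_size // sample_size,
    beginning at a random starting point in 1..k."""
    k = population_size // sample_size   # sampling interval
    if start is None:
        start = random.randint(1, k)     # random starting point
    return [start + i * k for i in range(sample_size)]

# School example: 1200 students, sample of 100, starting point 6
selected = systematic_sample(1200, 100, start=6)
print(selected[:4])   # [6, 18, 30, 42]
```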
3.1.1.3 Stratified Sampling
The simple random sampling method described above does not ensure that the proportion of
individuals with certain characteristics in the sample will be the same as those in the whole study
population.
If it is important that the sample includes representative groups of study units with specific characteristics (for example, residents from urban and rural areas, or different age groups), then the sampling frame must be divided into groups, or strata, according to these characteristics. Random or systematic samples of a predetermined size will then have to be obtained from each group (stratum). This is called stratified sampling (Anita 1995:209).
Case control studies use stratified sampling from subpopulations with or without a specific
disease.
The selection of groups of study units (clusters) instead of the selection of study units
individually is called Cluster sampling.
A multi-stage sampling procedure is carried out in phases and usually involves more than one
sampling method.
Probability sampling is most commonly associated with survey based research where you need to
make inferences from your sample about a population to answer your research question(s) or to
meet your objectives. The process of probability sampling can be divided into four stages:
n' = n/(1 + (n/N))
where n' is the adjusted minimum sample size
n is the minimum sample size (as calculated above)
N is the total population (Saunders 2003:158, 466-467).
3. Select the most appropriate sampling technique and select the sample
4. Check that the sample is representative of the population.
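The adjusted minimum sample size formula above can be sketched in Python; the figures (a minimum sample of 384 from a population of 1000) are hypothetical:

```python
def adjusted_sample_size(n, N):
    """Adjusted minimum sample size n' = n / (1 + n/N), used when the
    sample is a large fraction of a small total population N."""
    return n / (1 + n / N)

# Hypothetical figures: minimum sample of 384, total population 1000
n_adjusted = adjusted_sample_size(384, 1000)
print(round(n_adjusted))   # 277
```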
For populations of less than 50 cases Henry (1990) advises against probability sampling. He
argues that you should collect data on the entire population as the influence of a single extreme
case on subsequent statistical analysis is more pronounced than for larger samples.(Saunders
2003: 153)
Stratified sampling is probability sampling from subpopulations (Anita 1995:209).
If individuals are selected randomly and independently, with each individual having the same chance of being selected, then the probability of selecting an individual with a given property or characteristic equals the frequency with which that property appears in the population. Population frequency = probability.
Epidemiological concepts like risk, rate, prevalence, incidence etc may all be regarded as
probabilities.
Risks, rates, prevalence and incidence calculated for a random sample of individuals are to be regarded as estimates of probabilities and/or population proportions.
A range of non-probability sampling techniques is available that should not be discounted as they
can provide sensible alternatives.
At one end of this range is quota sampling, which, like probability samples, tries to represent the
total population. Quota sampling has similar requirements for sample size as probability sampling
techniques. At the other end of this range are techniques based on the need to obtain samples as
quickly as possible where you have little control over the content and there is no attempt to
obtain a representative sample. These include convenience, and self-selection sampling
techniques. Purposive sampling and snowball sampling techniques lie between these extremes.
For these techniques the issue of sample size is ambiguous. Unlike quota and probability samples
there are no rules. Rather it depends on your research questions and objectives – in particular
what you need to find out, what will be useful, what will have credibility and what can be done
within your available resources (Patton, 2002).
Quota sampling is entirely non-random and is usually used for interview surveys. It is based on
the premise that your sample will represent the population as the variability in your sample for
various quota variables is the same as that in the population. Quota sampling is therefore a type of
stratified sample in which selection of cases within strata is entirely non-random (Barnett,1991).
To select a quota sample you:
Purposive or judgmental sampling enables you to use your judgment to select cases that will best
enable you to answer your research question(s) and to meet your objectives. This form of sample
is often used when working with very small samples such as in case study research and when you
wish to select cases that are particularly informative. Such samples cannot, however, be considered statistically representative of the total population. The logic on which you base your strategy for selecting cases for purposive sampling should be dependent on your research question(s) and objectives.
Snowball sampling is commonly used when it is difficult to identify members of the desired
population, for example people who are working while claiming unemployment benefit. You
therefore need to:
Self-selection sampling occurs when you allow a case, usually an individual, to identify their desire to take part in the research. You therefore:
1. Publicise your need for cases, either by advertising through appropriate media or by
asking them to take part.
2. Collect data from those who respond
Cases that self-select often do so because of their feelings or opinions about the research
question(s) or stated objectives.
Convenience or haphazard sampling involves selecting haphazardly those cases that are easiest
to obtain for your sample, such as the person interviewed at random in a shopping centre for a
television programme. The sample selection process is continued until your required sample
size has been reached. Although this technique of sampling is widely used, it is prone to bias and influences that are beyond your control, as the cases only appear in the sample because of the ease of obtaining them. Often the sample is intended to represent the total population, for
example managers taking an MBA course as a surrogate for all managers! In such instances
the choice of sample is likely to have biased the sample, meaning that the subsequent
generalizations are likely to be at best flawed. These problems are less important where there
is little variation in the population, and such samples often serve as pilots to studies using
more structured samples.
Confidence interval
A p% confidence interval about an estimate is an interval that with probability p contains the true value.
The 95% confidence interval may in many cases (e.g. for means and proportions) be calculated as estimate ± 1.96 × se(estimate). For example, for an estimated proportion of 0.80 with standard error 0.025:
0.80 + (1.96 × 0.025)
= 0.80 + 0.049
1. Can the difference between estimated attack rates be due to random variation (Example 1)?
2. Can the difference between weights before and after treatment for the treated girls be due to random variation? Can the difference between the control and treatment groups be due to random variation (Example 2)?
The general test procedure is:
1. Formulate two hypotheses:
(a) the population parameters are the same, meaning that the difference between estimates must be due to random variation;
(b) the population parameters are different.
Designate (a) as the null hypothesis H0 and (b) as the alternative Ha. The null hypothesis always assumes that there is no difference between, e.g., exposure and non-exposure to a disease.
2. Define a test statistic measuring the discrepancy between H0 and the observed data. A small value of this statistic should indicate a good fit between H0 and the data, and therefore lead to acceptance of H0. A large value of the test statistic indicates a misfit between H0 and the data; H0 will therefore be rejected.
3. Assume that T is the test statistic and that t0 is the observed value. The test probability, or p-value, associated with T is the probability that T ≥ t0, calculated under the assumption that H0 is true:
P = P(T ≥ t0 | H0)
4. A small p-value means that we have observed something that is improbable under H0: something that is difficult to explain as a random event if H0 is true.
Two tests
H0 in both cases is that there is no association between the variables (no association between exposure and disease, or no association between treatment and outcome/weight).
Significance testing
It is increasingly recognised by medical researchers and statisticians that the most informative way to indicate the statistical significance of a given value is to also present the confidence interval.
In the examples for ORs and RRs in the previous chapters, two things are immediately obvious from a confidence interval:
1. If the 95% interval does not include 1 (the entire interval is either above or below 1), then we know that there is a good probability that the risk factor studied is really associated with disease, and that this is not just a chance finding.
2. The width of a confidence interval gives a feeling for how precisely the OR or RR was measured in the study. If the 95% confidence interval for an RR was found to be from 1.3 to 15, then we would not really know if this was a very important risk factor for the disease (high RR) or a relatively minor one.
The X²-Test
Observed infection counts by exposure:

Exposure       Infection: Yes   No
Mumps          140.3            77.7
Chicken pox    153.2            84.8
Measles        161.5            89.5

Residuals = Observed − Expected

Cell contributions (obs − exp)²/exp:

Exposure       Infection: Yes   No
Mumps          24.22            43.74
Chicken pox    2.30             4.17
Measles        9.66             17.43

X² = Σ (obs − exp)²/exp = 101.51, the sum of the six cell contributions above.
[Diagram: distribution of X² under H0; accept H0 below the critical value, reject H0 above it]
If the p-value, P(X² > X²obs), is small, then the observed X² value lies in a critical region that has small probability if H0 is correct, and H0 is rejected.
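The X² statistic for a 2x2 table can be sketched in Python; the table below is a hypothetical exposure-by-infection example, not the attack-rate data above:

```python
def chi_square_2x2(a, b, c, d):
    """X² = sum of (obs - exp)²/exp, with expected counts derived from
    the row and column totals of the 2x2 table [a b / c d]."""
    n = a + b + c + d
    obs = [a, b, c, d]
    exp = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
           (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

# Hypothetical 2x2 table: exposure (rows) by infection (columns)
x2 = chi_square_2x2(30, 20, 10, 40)
print(round(x2, 2))   # 16.67, well above the 5% critical value of 3.84
```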
Assume that the mean values x̄1 and x̄2 have been calculated for two independent groups from two different subpopulations.
x̄1 and x̄2 are estimates of the expected values (population means) for the two subpopulations.
If H0 is correct, then E(x̄1 − x̄2) = 0. The test statistic is:
t = (x̄1 − x̄2)/se0
where se0 = √(s1²/n1 + s2²/n2)
[Diagram: t distribution; accept H0 in the centre, reject H0 in either tail]
If the p-value P(|T| ≥ tobs) is small, then the observed t value lies in a critical area with small probability under H0; therefore reject H0.
1. A t-test assuming different variances in the two populations. Variances and standard deviations must be estimated separately for each group.
2. A t-test assuming equal variances. A common variance is estimated and used in the calculation of the denominator of the t-value.
A statistical test (Levene's test) exists for the hypothesis of equal variances.
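The unequal-variances version of the two-sample t statistic can be sketched in Python; the group summaries below are hypothetical (the anorexia study reports only t and p):

```python
import math

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t = (x̄1 - x̄2)/se0, with se0 = sqrt(s1²/n1 + s2²/n2)
    (the unequal-variances form)."""
    se0 = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / se0

# Hypothetical group summaries
t = two_sample_t(80.0, 5.0, 25, 84.0, 6.0, 30)
print(round(t, 2))   # -2.7
```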
Family therapy and anorexic girls
Before treatment: t = −0.962, p = 0.342
No evidence against equality of means (H0) → the study does not seem to be biased.
After treatment: t = −4.157, p = 0.00039
The small p-value is strong evidence of a difference in means after treatment.
Some comments
P-values are in most cases approximate values, requiring large samples to be reliable.
P-values for t-tests are exact if the distribution of the variable is normal.
Many other test statistics are available; the test procedure is, however, always the same once the test statistic has been calculated.
There are two main reasons for using rates as opposed to whole numbers
1. To make comparisons between two different populations that may have different
numbers of people at risk, by standardizing for population size.
2. To calculate the number of expected cases. By using a known rate the approximate
number of cases that are expected to occur in the population can be calculated.
The basic epidemiological concept of comparing risks is introduced, some confusing definitions
are discussed, and we meet attack rates.
A not uncommon situation for a practicing physician is when several members of a family, a day
care group, or a school class fall ill at almost the same time. Often, the disease is some kind of
gastroenteritis, and the patients as well as their doctor wonder if it might have been something
they ate. The answer to that question is complicated by the fact that it is not always possible to single out the responsible meal, and even if one could, there are almost always several different food items served during a meal (Giesecke 1994:17).
The situation becomes simpler if a group of people who do not ordinarily eat together share a
meal, and some of them become ill afterwards. The following example shows how one could
analyse such a situation by calculating risk and relative risk.
Example
Fifteen people had New Year's dinner together. Within 24 hours, five of them fell ill with gastroenteritis. The dinner had consisted of several courses and food items, and the participants had not all eaten the same things. How could the cause of their disease be assessed?
All guests were sent a list of the food that had been served and asked what they had eaten. As the lists came back, the replies were recorded in a double table, with the ones who had been ill on the left, and the ones who remained well on the right (see table 3.1).
Table 3.1 Table filled out from questionnaires given to 15 people during an outbreak of gastroenteritis

Item             Gastroenteritis   No gastroenteritis
Cheesecake       IIII              I
Chocolate cake   I                 II
What we want to know is what one's chance of being ill was if one had eaten each of the foods. Rearranging the table, we have table 3.2.
Table 3.2 Number of subjects in table 3.1 who became ill out of the total who ate each item.

Item         Ill   Total who ate   Risk
Cheesecake   4     5               0.8
RISK
The Risk associated with some potentially harmful factor is defined as: The proportion who
become ill out of all those exposed to it.
For the risk refer to the risk column in table 3.2 above.
Now comes the centre of this chapter: we must also look at the risk of being ill in those who did
not eat the items on the list. We know that almost half of those who had Swiss roll were ill, but
what conclusion would we draw if we found the same proportion ill in those who skipped the
Swiss roll? The people at the dinner ate many different things, and for most of the items there will
be a mixture between those who happened to eat the infected food, and those who did not. If the
item was innocent of causing an illness, we would expect the same risk of being ill regardless of
whether one ate or not.
This way of thinking is basic to all epidemiology: if an exposure has nothing to do with a disease,
then the proportion who are ill after having had this exposure should be the same as in those who
had not had the exposure.
We thus proceed to list the outcome according to what people did not eat. (Table 3.3). Looking at
table 3.1, we can see that three of the ill people did not eat quiche, and that two of the well people
also did not eat quiche, and so on:
Table 3.3 Number of those subjects in table 3.1 who became ill out of the total who did not eat each item.

Item         Ill   Total who did not eat   Risk
Quiche       3     5                       0.6
Cheesecake   1     10                      0.1
RELATIVE RISK
A simple way of comparing the risk in those exposed versus those not exposed is to divide them (always putting the risk in the exposed on top). This gives the relative risk (also called the risk ratio), RR.
Table 3.4 Relative risks (RRs) of illness associated with each item during an outbreak of gastroenteritis.

Food     RR
Quiche   0.33
A relative risk around 1 means that the risk of disease was nearly equal in the exposed and unexposed, and that the item is unlikely to have caused disease. A high RR points to the item being associated with the disease, and an RR close to 0 would indicate that the item is in some way protective: the risk of disease is then much higher in those not exposed.
From our calculations we find that the RR of causing illness is clearly highest for cheesecake, and
we can conclude with some certainty that this was the item responsible for the gastrointestinal
illness.
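The risk and relative-risk arithmetic can be sketched in Python; the function name is my own, but the figures are the cheesecake counts from tables 3.2 and 3.3:

```python
def risk(ill, total):
    """Risk = proportion ill out of all who were exposed (or unexposed)."""
    return ill / total

# Cheesecake: 4 ill of the 5 who ate it; 1 ill of the 10 who did not
risk_exposed = risk(4, 5)        # 0.8
risk_unexposed = risk(1, 10)     # 0.1
rr = risk_exposed / risk_unexposed
print(round(rr, 2))   # 8.0
```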
Table 3.5 Number of infected out of siblings exposed to three childhood infections. Source: Hope-Simpson.
From these figures we can calculate the infectivity of each of these diseases within a family.
ATTACK RATE
The basic measure of infectivity is the attack rate. The definition is: the attack rate of a disease
is the number of cases, divided by the number of susceptible exposed, which is really the
same as the definition of risk above. The difference is that here we use the infectious disease definition of exposure, by counting only the people who really were exposed to the microbe.
Thus, four out of every five children exposed to measles in the family will themselves contract measles, etc. There are obviously appreciable differences in attack rate between these three diseases.
Example 1
se = √(P × (1 − P)/N)
= √(0.8 × 0.2/251)
= √0.00063745
= 0.025
The 95% confidence interval is:
0.80 + (1.96 × 0.025) = 0.80 + 0.049 = 0.849
0.80 − (1.96 × 0.025) = 0.80 − 0.049 = 0.751
Often one wants to estimate the proportion of the population that have some characteristics, such
as the proportion who have antibodies to disease A, or the proportion of the population who have
had a test for disease B. It is seldom possible to test or ask everyone, so one would probably do this by collecting a random sample.
Assuming that this sample is truly random, with no selection biases involved, one would want to
know how the proportion measured in the sample relates to the true prevalence in the population.
This is exactly the same reasoning as followed regarding the confidence interval for an OR: if one only wants to make a statement about the sample just studied, there is no need for confidence intervals. If, for example, 308 out of 1000 sampled were found to have antibodies,
Example 2
The studies we have been looking at in the previous examples have all analysed an
epidemiological pattern after the event has occurred. Sometimes one may have to plan an
epidemiological study a little better in advance. Such studies, where a defined group of people is followed over time, are probably more familiar to most clinicians than case control studies.
In a study of HIV infection and Tuberculosis in New York, 513 intravenous drug users were
initially tested for HIV antibody. Two hundred and fifteen were HIV positive and 298 HIV
negative. They were then followed for any signs of active tuberculosis during an average of two
years. The results of the study were :
                 HIV seropositive   HIV seronegative
                 initially          initially          Total
Developed TB     8                  0                  8
The risk of developing tuberculosis in this group was thus 8/215 =0.037 for those who were
seropositive at entry, and 0/298=0 for those who were seronegative.
This type of study, where one first defines and measures the risk factor one wants to evaluate (in this case HIV status) in a defined group, and then follows this group over time to see who develops disease (in this case TB), is called a cohort study.
Another example: there has been much discussion about whether sexually transmitted diseases, and especially genital ulcerative diseases, increase the risk of HIV transmission. In a study from Nairobi, Cameron et al. followed 291 men who presented at a sexually transmitted diseases (STD) clinic. They were reported to have had sexual intercourse with women from a group of prostitutes in which HIV infection was known to be common. About half the men presented with an ulcerative disease, the rest with urethritis. After the first visit, the men were tested repeatedly for three months, to see if they had seroconverted in an HIV antibody test (which may take several weeks to become positive after the actual transmission).
                 Presented with     Presented with
                 genital ulcers     another condition   Total
Seroconverted    21                 3                   24

Risk = number of ill persons among the exposed / total number of people exposed
= 21/149
= 0.14
Risk = number of ill persons among the unexposed / total number of people unexposed
= 3/144
= 0.02
RR = 0.14/0.02 = 7
The confidence interval for an RR is calculated almost as for an OR. We first calculate the error factor:
EF = e^(2 × √(1/21 + 1/3)) = e^(2 × 0.62) = e^1.24 ≈ 3.46
Divide and multiply the RR by this value (the error factor) to get the lower and upper bounds, respectively.
The confidence interval for a ratio is thus calculated as for an RR, i.e. by first getting the error factor; the lower and upper bounds are then found by dividing and multiplying the RR by the error factor, respectively. The confidence interval seems to be well above RR = 1, but since the formula above requires that both a and b are at least 10, this approximate interval cannot be trusted entirely.
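The RR and its error-factor bounds can be sketched in Python using the study's figures. Note that computing without rounding intermediate values gives an RR of 6.8 (rounded to 7 in the text) and an error factor of 3.44 rather than the text's 3.46, which comes from rounding the square root to 0.62 first:

```python
import math

def rr_with_bounds(ill_exp, n_exp, ill_unexp, n_unexp):
    """RR with approximate 95% bounds: divide and multiply the RR by
    the error factor e^(2*sqrt(1/a + 1/b)), where a and b are the
    case counts in the exposed and unexposed groups."""
    rr = (ill_exp / n_exp) / (ill_unexp / n_unexp)
    ef = math.exp(2 * math.sqrt(1 / ill_exp + 1 / ill_unexp))  # error factor
    return rr, ef, (rr / ef, rr * ef)

# Nairobi STD-clinic figures: 21/149 ill among exposed, 3/144 among unexposed
rr, ef, (low, high) = rr_with_bounds(21, 149, 3, 144)
print(round(rr), round(ef, 2))   # 7 3.44
```

The lower bound is above 1, matching the text's remark that the interval is well above RR = 1.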
In a cohort study, we can use risks and RRs for our comparisons, since we have started by defining the total group of people that we want to study.
In dealing with risk, relative risk and attack rates, we had knowledge of the entire population, i.e. we could count precisely how many were exposed and how many were infected.
In real life, and especially when studies are based in the community rather than in the clinic, we
will often only have information about some of all exposed and ill people. Such situations require
slightly different methods.
To calculate a risk, the denominator must include the total number of persons exposed (here, everyone who had eaten the item).
In this case we do not know how many people were exposed, nor do we know who was exposed
to which item. Also it is unlikely that we identified all the cases.
This type of epidemiological analysis is the basic form of the case control study, where risk factors for disease are ascertained by comparing different exposures (in this case, type of food eaten) between people who were ill (= the cases) and people who were not (= the controls). In contrast to the previous chapter, we do not have knowledge of all cases, nor of all controls; what we do have are two samples of people. Note also that we are doing the analysis 'backwards' in time, starting from a number of cases that we diagnosed, then identifying a number of controls, and after that looking at possible causes of the disease.
Table 4.1 Results from questionnaires to 37 cases and 58 controls in an outbreak of gastroenteritis in a large office block. Source: Salmon et al.

              Gastroenteritis        No gastroenteritis
Item          Eaten    Not eaten     Eaten    Not eaten
Lunch 22/1    6        31            9        48
Lunch 23/1    18       19            14       43
Salad         12       24            5        52
Sandwiches    16       21            14       44
Chicken       4        33            4        54
ODDS
Comparisons of risk factors in case control studies most often make use of the term odds. Odds build on a similar idea to risks, but instead of dividing the number of people who were ill by the total number exposed, which we do not know, we divide by something we do know, namely the number of people in our study who did not become ill. The odds associated with each item on our list are thus:
odds = number of ill who ate the item / number of well who ate the item
As an example, the odds for the chicken in the table above would be 4/4 = 1.
ODDS RATIO
Just as with risks, we want to compare the odds for those exposed with the odds for those not exposed. The odds for salmonella infection if one had not eaten chicken are 33/54 = 0.61, so the OR for chicken is 1/0.61 = 1.64.
The definitions of odds and OR are very similar to those for risk and RR (so similar, in fact, that they are easily confused, which happens not infrequently in the epidemiological literature). Their advantage is that they can be calculated in situations where one does not have knowledge of the entire population. The disadvantage is that they have less intuitive meaning than the words risk and relative risk: odds do not really mean anything in themselves; they can just be compared to see which ones are greater.
            Exposed   Unexposed   Total
Cases       a         b           a+b
Controls    c         d           c+d

For an infectious disease with very high infectivity, there would not be any people in cells b and c: all the ill people would be exposed (= a), and all the healthy would be unexposed (= d).
For the factor 'having had lunch in the canteen on 22nd January', our 2x2 table would be:

        Eaten   Not eaten   Total
Ill     6       31          37
Well    9       48          57
Total   15      79          94
From this table it is easy to calculate the odds and the OR associated with having lunch on the 22nd. The odds of illness in those who had lunch are 6/9 = 0.67, and the odds of illness in those who didn't eat in the canteen that day are 31/48 = 0.65. The OR for illness for the factor 'lunch in the canteen on 22nd' is 0.67/0.65 = 1.03.
An OR of 1 is equivalent to equal odds for disease in those exposed and not exposed to the factor, which is the same as saying that an OR of 1 suggests that this factor is not associated with disease.
For 'lunch on 23rd January' the corresponding table is:

        Eaten   Not eaten   Total
Ill     18      19          37
Well    14      43          57
Total   32      62          94
Here, the odds for illness are 18/14 = 1.29 and 19/43 = 0.44 for the exposed and unexposed respectively. That is: among the 94 people who answered this question, there was a considerably greater chance of having had lunch on the 23rd for those who were infected than for those who were well.
In the same way, ORs can be calculated for the three different menu items, and the total list becomes:
Table 4.2 Odds ratios (ORs) for illness associated with each risk factor for the study given in Table 4.1.

Item          OR
Lunch 22/1    1.03
Lunch 23/1    2.93
Salad         5.2
Sandwiches    2.39
Chicken       1.64
The formula for the odds ratio can be manipulated a little to give an easier calculation:
OR = (a/c)/(b/d) = ad/bc
or in words: 'multiply the upper left-hand number by the lower right-hand, and divide by the upper right multiplied by the lower left'.
The probable range of the true OR can be calculated rather easily from the 2x2 table. For each of our five ORs we first calculate something called the error factor, which is defined as

EF = e^(1.96 x sqrt(1/a + 1/b + 1/c + 1/d))

The formula might seem complicated, but it can be worked out on an ordinary calculator. As an example, for the exposure 'lunch on the 23rd' the calculation would be:
1. First divide 1 by each of the four numbers in the 2x2 table, adding the result to the memory of the calculator each time: 1/18 + 1/19 + 1/14 + 1/43 = 0.20
2. Then take the square root of this sum: sqrt(0.20) = 0.45
3. Multiply by 1.96 (rounded here to 2): 2 x 0.45 = 0.90
4. And finally raise e to this number: e^0.90 = 2.46, which is our error factor.
5. The lower bound of the probable range for the OR for 'lunch on the 23rd' is now given by dividing our calculated OR in the list above by the error factor: 2.93/2.46 = 1.19
6. The higher bound is given by multiplying our calculated OR by the error factor: 2.93 x 2.46 = 7.21
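The same steps can be sketched in Python (our own illustration; note that working in full precision, rather than rounding at every step as in the hand calculation, gives an error factor of about 2.42 and therefore slightly different bounds):

```python
import math

# Sketch of the error factor and probable range for an OR:
# EF = e^(1.96 x sqrt(1/a + 1/b + 1/c + 1/d)).
# Figures are for 'lunch on the 23rd'.

def or_with_range(a, b, c, d):
    OR = (a * d) / (b * c)                                  # cross-product OR
    ef = math.exp(1.96 * math.sqrt(1/a + 1/b + 1/c + 1/d))  # error factor
    return OR, OR / ef, OR * ef                             # OR, lower, upper

OR, low, high = or_with_range(18, 19, 14, 43)
print(round(OR, 2), round(low, 2), round(high, 2))
# prints: 2.91 1.2 7.03
```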
CONFOUNDING
Even if a factor is significantly associated to disease, this may just be a statistical finding, where
the division according to exposure also divides people into high risk and low-risk groups
according to some real factor. This is called confounding. The concept of confounding is closely
coupled to the concept of cause in epidemiology.
Odds ratios
Confounding
Stratification
Conditional independence
Effect modification
Mortality seemed to be higher in the western American states than in the eastern ones. Are there two different types, a western type and an eastern type, of typhus fever?
Assume that the disease is rare. A prospective or cross-sectional study will then contain very few cases with the disease. Case-control studies instead use stratified sampling: all (or many) cases with the disease, and some controls without the disease.
From a case-control study we can estimate

P(Exposure = yes | case) and P(Exposure = yes | control).

The problem is that what we really want is

P(case | Exposure = yes) and P(case | Exposure = no),

but we only have information on the exposure probabilities above. Bayes' theorem comes to the rescue. For either exposure status,

P(case | Exposure) / P(control | Exposure)
= [P(Exposure | case) x P(case)] / [P(Exposure | control) x P(control)]

since P(Exposure) cancels from numerator and denominator. Taking the ratio of this quantity for the exposed to the same quantity for the unexposed, the term P(case)/P(control) also cancels, and the prospective (disease) odds ratio reduces to

odds(exposure | case) / odds(exposure | control) = OR_exp
The retrospective information from case-control studies will therefore give us information on the prospective odds ratios that we are interested in.
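This identity can be checked numerically (a small sketch of our own; the counts are those of the 'lunch on the 23rd' table):

```python
# Numerical check: the retrospective (exposure) odds ratio equals
# the prospective (disease) odds ratio -- both reduce to ad/bc.

def prospective_or(a, b, c, d):
    # odds of disease among exposed / odds of disease among unexposed
    return (a / c) / (b / d)

def retrospective_or(a, b, c, d):
    # odds of exposure among cases / odds of exposure among controls
    return (a / b) / (c / d)

print(prospective_or(18, 19, 14, 43))    # ~2.91
print(retrospective_or(18, 19, 14, 43))  # the same value
```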
One conceptual difference between the two types of study has to do with time:
In a cohort study we start with a number of subjects who are free from disease, and follow them over time to see who becomes a case and who does not.
In a case control study the events have already happened before the study started, and we collect
the cases and try to find appropriate, disease free controls.
Cohort studies usually require carefully planned investigations, whilst a case control study can
quite often be performed quickly from a number of cases already collected.
For diseases with very low incidence, cohort studies may not be practical, or even feasible.
Another principal conceptual difference concerns the measures of strength of association in the
two types of studies; the ORs and the RRs. As earlier pointed out, an odds ratio does not have any
direct interpretable meaning; it just tells us how strongly an exposure and an outcome seem to be
related. However for rare diseases, the OR often provides a good approximation of the RR.
Activity
In the second of the two situations above we did not have continuous measurements of some
variable for the two groups, but instead numbers of people belonging to different categories. It
then becomes a bit strange to talk about the ‘average sex’ in a group of patients. The basic
situation is just our familiar friend the 2x2 table, which could be for example:

                Vaccinated      Not vaccinated      Total
Ill             10              40                  50
Well            80              20                  100
Total           90              60                  150
However, the 2x2 table could just as easily be extended to a table with more columns and/ or
rows if there were more categories of exposure or outcome or both.
In this situation, the subject would only belong to the two categories ‘vaccinated’ or ‘not
vaccinated’, and to either of the categories ‘ill’ or ‘well’. There is no meaningful way of giving an
average of health in the vaccinated group, or an average vaccination status in the well group.
This type of data is thus quite different from the t-test situation above, and is usually called categorical, as opposed to continuous, data.
In chapter 4 we saw how to calculate an OR for such a table, and also a confidence interval for
this value. If we now want to perform a significance test instead, the question to ask is: what is
the probability that the 150 subjects of the study will divide this way into ‘ill’ and ‘well’ just by
chance? A very low probability of such a chance would give increased weight to our hypothesis
that the vaccine has effect.
The way to reason is as follows: there are 50 people who fall ill and 100 who remain well. If the vaccine was totally inefficient, we would assume that it did not matter whether or not a subject was vaccinated. Since one third of the total group fell ill, this would be the expected proportion in each of the individual groups. In the vaccinated group of 90 people, we would expect 30 to fall ill, and in the unvaccinated group of 60, we would expect 20. The expected 2x2 table if the vaccine did not work at all would be:
                Vaccinated      Not vaccinated      Total
Ill             30              20                  50
Well            60              40                  100
Total           90              60                  150
The general way of calculating the expected value for a cell in a 2x2 (or 3x3, or 5x3, or ...) table is to multiply the column sum at the bottom of the corresponding column by the row sum to its right, and divide this product by the total in the lower right corner. For the first cell in our example this would be 90 x 50/150 = 30, just as above. (In these calculations one often gets fractions of people in the cells of the expected table, but that does not affect the analysis at all.)
We can now compare the numbers in the expected table to the actual ones to see if they are very different. One way is just to take the difference between the numbers in the corresponding cells (10 - 30 for the first cell, 40 - 20 for the second, and so on). If the vaccine had no effect we would expect these differences to be small; the larger they are, the more the result of our study deviates from what would be expected just by a chance distribution of the cases. The X2 test now consists of squaring all these differences, dividing each square by its expected value (from the table above), and then adding them. The higher this number, the smaller the chance that the distribution of ill and healthy subjects according to vaccination status would have occurred by chance. For a 2x2 table like this, an X2 value above 3.84 indicates that there is less than 5% probability that the result occurred by chance.
We can see that there is a very small chance indeed that this high X2 value would arise by chance, and we can state that there is statistical support for the vaccine having a protective effect. In fact, the probability that this would be a chance finding can be calculated to be p < 0.0001.
When you look up an X2 table, you will find that it mentions something called 'degrees of freedom'. For tables such as the one above, this has to do with the number of rows and columns (categories for exposure and outcome). The number of degrees of freedom is just (number of rows - 1) x (number of columns - 1), and thus for a 2x2 table (2-1) x (2-1) = 1. For a 3x4 table (three different outcomes, four different exposures), there would be (3-1) x (4-1) = 6 degrees of freedom, and you would have to refer to this row of the X2 table for the test. (Degrees of freedom is often abbreviated 'd.f.'.)
A quick way to calculate the X2 value from the general 2x2 table from chapter 4:

                Exposed     Not exposed     Total
Cases           a           b               a+b
Controls        c           d               c+d
Total           a+c         b+d             N

X2 = (ad - bc)^2 x N / [(a+c)(b+d)(a+b)(c+d)]

where the parentheses in the denominator are just the column and row sums.
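Both routes to X2 can be sketched in Python and checked against each other (our own illustration; the observed counts 10, 40, 80, 20 are those implied by the vaccination example above):

```python
# X2 for the vaccination example, computed two ways: cell by cell
# against the expected table, and with the shortcut formula
# (ad - bc)^2 x N / ((a+c)(b+d)(a+b)(c+d)).

def chi2_cells(a, b, c, d):
    N = a + b + c + d
    observed = [a, b, c, d]
    expected = [(a + b) * (a + c) / N, (a + b) * (b + d) / N,
                (c + d) * (a + c) / N, (c + d) * (b + d) / N]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi2_shortcut(a, b, c, d):
    N = a + b + c + d
    return (a * d - b * c) ** 2 * N / ((a + c) * (b + d) * (a + b) * (c + d))

print(chi2_cells(10, 40, 80, 20))     # ~50 -- far above 3.84
print(chi2_shortcut(10, 40, 80, 20))  # the same value
```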
Since the X2 test is so easy to perform, it can often be used for an initial check even for continuous data where one would otherwise use a t test. If, for example, one wants to compare temperatures in two groups of patients, one could just choose the value that seems to be in the middle of all temperature readings from both groups, and count the number of subjects in each group who have a temperature above or below this value. The four figures thus obtained are entered in a 2x2 table, and the X2 calculated. If this X2 figure yields a low p value, then you can be confident that the t test will also yield a low p value.
There are, however, some important restrictions on when the X2 test can be used. It is an approximate method that becomes more and more valid the larger the size of the study.
The usual conditions are that the total study size is reasonably large and that no expected cell count is too small (a common rule of thumb is that all expected values should be at least 5). If these conditions are not fulfilled, one must use Fisher's exact test, which is the subject of the next section.
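As a preview of that method, the one-sided Fisher's exact test can be sketched as follows (our own illustration, not from the text; the example table is hypothetical, with cell counts small enough that the X2 approximation would be dubious):

```python
from math import comb

# One-sided Fisher's exact test for a 2x2 table with fixed margins:
# sum the hypergeometric probabilities of the observed table and of
# all tables at least as extreme in the same direction.

def fisher_one_sided(a, b, c, d):
    row1, row2 = a + b, c + d        # cases, controls
    col1 = a + c                     # total exposed
    N = row1 + row2

    def p_table(x):
        # probability of x exposed cases, given the fixed margins
        return comb(row1, x) * comb(row2, col1 - x) / comb(N, col1)

    # tables at least as extreme: from the observed a upwards
    return sum(p_table(x) for x in range(a, min(row1, col1) + 1))

# hypothetical small table: 7 exposed cases, 1 unexposed case,
# 2 exposed controls, 8 unexposed controls
print(round(fisher_one_sided(7, 1, 2, 8), 4))   # 0.0076
```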
One of the perpetual dreams of mankind has always been to be able to predict the future.
The regular recurrence of epidemics and the similar shapes of consecutive epidemics of a
disease have for a long time tempted people with a mathematical inclination to make
some kind of model.
The potential for a contagious disease to spread from person to person in a population is called the reproductive rate. It depends not only on the risk of transmission in a contact, but also on how common contacts are: a person with measles who meets no one will not transmit the infection. In a similar way, the rate of acquisition of new sexual partners will influence the spread of sexually transmitted diseases. The principal determinants of the reproductive rate are given by the formula

R0 = β x k x D

where R0 is the average number of persons directly infected by an infectious case during his entire infectious period, when he enters a totally susceptible population.
Point 2 above is the most interesting from the epidemiological point of view, and also the most frequently overlooked. The spread of infectious diseases depends not only on the properties of the pathogen or the host, but to at least an equal degree on the contact patterns in society: who meets whom? How often? What kind of contact do they have?
If a new disease enters a population, what is its probability of spreading? From the above
it may happen that:
In order to prevent epidemics of a disease, the proportion of the population that must be vaccinated is higher than 1 minus the inverse of the basic reproductive rate.
Example: R0 for measles has been shown to be around 15, i.e. every case of measles will infect 15 other people, on average. The formula then predicts that if we want to prevent measles epidemics, more than 1 - 1/15 = 0.93, or 93%, of the population must be vaccinated.
The level of immunity in the population, which prevents epidemics, (even if some
transmission may still occur), is called herd immunity.
Where: β is the risk of transmission per contact (i.e. basically the attack rate), k is the number of potential infectious contacts that the average person in the population has per time unit, and D is the duration of the infectious period.
Many public health measures to prevent the spread of infections aim at decreasing β, such as using a condom, wearing a face mask, or washing one's hands.
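The formula R0 = β x k x D and the vaccination threshold 1 - 1/R0 can be put into a short sketch (the numerical values of β, k and D below are purely illustrative, not from the text):

```python
# Basic reproductive rate and herd immunity threshold.

def r0(beta, k, D):
    # risk per contact x contacts per time unit x duration of infectiousness
    return beta * k * D

def herd_immunity_threshold(R0):
    return 1 - 1 / R0

print(r0(0.5, 3, 10))                         # 15.0 -- a measles-like R0
print(round(herd_immunity_threshold(15), 2))  # 0.93, i.e. 93% coverage needed
```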
The simple model above rests on two unrealistic assumptions. The first is that every person in the population will meet every other with equal probability. The second is that only one kind of contact exists, with a given β.
(Giesecke 1994:108-123)
The three important factors to be characterized in an outbreak analysis are time, place
and person. In order to identify the cases for this analysis, one needs a case definition,
and this is used to actively search for more cases than the ones who present themselves.
The epidemic curve can give an indication of the type of exposure: point source, extended source, or person-to-person. In a point-source outbreak it is often possible to estimate the common time of exposure, if the disease and its incubation time are known, or conversely, to diagnose the disease if the time of exposure is known.
Plots should be made not only for time course of the outbreak but also for sex, and age
distribution, and often for the geographical location of the cases.
After the cases have been identified, the probable cause of the outbreak can be searched for by more analytical methods, most often starting with a case-control study. In many instances, the cause will be clear from the outset, since an unexpected increase in the rate of diagnosis of a certain pathogen in the microbiological laboratory will often be what triggered the investigation. However, if the pathogen was previously unknown, epidemiology and microbiology must often work hand in hand to reveal the cause.
Epidemic Threshold
An epidemic may be defined as an indisputable increase in the number of cases of a disease
compared to its usual rate. This definition should reflect the norms for individual disease
prevalence in a given geographical area.
The alarm signal
The alarm signal may be sounded by:
the population itself
the surveillance system
Rumors of unknown origin to the effect that people are dying.
The reliability of such an alarm depends on its origin. By definition, information furnished by the surveillance system is more credible than rumors. Whatever the source of the alarm, however, an investigation will have to be undertaken to confirm or disprove the initial reports.
Data analysis
Epidemiological data
Persons
What group or groups are affected? The rate of infection must be measured for
different population groups. Similarly, the rate of mortality specific to the disease
in question must be determined for each of the groups affected.
Place/Space
Where did the epidemic begin? Which regions are most affected? Both the rate of
infection and mortality rate must be determined by geographical area.
Time
When were the first cases identified? An epidemic curve should be constructed to
show the number of cases in relation to time.
Risk factors
Are these factors strong enough to contribute to the outbreak of the epidemic?
The strategy for action consists in reviewing all the stages of the communicable disease cycle and making a list of possible actions. Examples include:
Vector control
- destruction of vectors
- action to eliminate breeding grounds
Active protection
- immunisation
Passive protection
- chemoprophylaxis
Early screening for cases
- putting health facilities on alert
- promoting public awareness (through the media)
- actively seeking out cases
Treatment of diagnosed cases
- reinforcement of health care personnel's technical expertise
- provision of necessary equipment
Removal and cremation or burial of bodies
Next, the activity best suited to the situation must be selected, following two lines of action: one to flatten the epidemic curve (essentially through preventive measures) and the other to reduce mortality, by curative measures.
The local health –care services do not always have the necessary resources to cope with an
epidemic. This includes material resources, the technical expertise for making the initial
diagnosis, and logistic support. International aid may prove necessary.
Epidemics are a sensitive subject for health and political authorities. The political authorities must understand the health care personnel's proposals before they will agree to assist in instituting control measures. Political authorities tend to minimise or deny the existence of an epidemic because of the negative image that such news projects to the outside world, or because of its repercussions on tourism. Where refugee populations are concerned, an epidemic may serve as an argument for reinforcing coercive measures against them.
The set of measures already being implemented, adapted to the problem and the urgency
of the epidemic, constitutes a type of surveillance system in itself, particularly in terms of
organisation.
What is SPSS?
Descriptive methods
Graphical
Numerical
Statistical inferences
Analysis of two way and multiway tables
Comparison of mean values
Regression analysis
Non parametric tests, etc
6. A syntax editor, where one may write, save and execute SPSS-programmes for more
complex computations and analysis.
Why SPSS?
Disadvantages of SPSS
Variable name
Variable label
Variable type
Numerical
Text
SPSS –Data
Variables
Cases
Excel has only one data window (the data view), while SPSS has two: a data view and a variable view window.
In Excel you need to write formulas, whereas in SPSS the formulas are built in.
In Excel the columns are identified by alphabetical letters, whereas in SPSS the columns are identified by variables (var).
SPSS variable properties: Name, Type, Width, Decimals, Label, Values, Missing, Columns, Align, Measure
PART V DEMOGRAPHY
5.0 DEMOGRAPHY
It deals with five "demographic processes", namely (i) fertility, (ii) mortality, (iii) marriage, (iv) migration and (v) social mobility.
The main sources of demographic statistics are population censuses, national sample surveys,
registration of vital events, and ad-hoc demographic studies.
Table 3.1 Typical distribution of population by age group for a district of a developing country.

Age group (years)       Percentage      Number
1-4                     14              28 000
5-14                    26              52 000
15-44                   43              86 000
45+                     13              26 000
A population pyramid gives the history of population fertility. It is constructed by plotting the percentage of the total population in each age group on the X-axis and the age groups (in years) on the Y-axis.
Where fertility is high, the pyramid will have a broad base and a narrow tip. This is typical of developing countries.
Where fertility is low, the pyramid has a smaller base and a wider tip, comprising mainly adults and fewer young people. This is typical of developed countries.
5.4 Census
The census is an important source of health information. It is taken in most countries of the
world at regular intervals, usually of 10 years. A census is defined by the United Nations as “the
total process of collecting, compiling and publishing demographic, economic and social data
pertaining at a specified time or times, to all persons in a country or delimited territory”
The crude birth rate (CBR) is usually estimated from a census or special demographic surveys and is given by this formula:

CBR = total live births in one year/total mid-year population (same year) x 1000
The rates are usually available for each district, and by applying them to the district population
we can estimate the total number of births per year.
Example in a district of 200 000 people with a CBR of 45 births per 1000, there would be about
9000 births per year, or about 170 per week.
Total births = CBR/1000 x Population = 45/1000 x 200 000 = 9000 per year.
If the health information system reports that about 80 births per week are attended by trained health workers, the coverage can be estimated to be about 50% (i.e. 80/170 x 100 = 47%). How well is the district doing?
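The district arithmetic above can be sketched as follows (a minimal illustration; note that keeping the unrounded figure of about 173 births per week gives a coverage of 46%, versus the 47% the text obtains from the rounded 170):

```python
# District estimates from the CBR: expected births per year and per
# week, and coverage of births attended by trained health workers.

def expected_births(cbr_per_1000, population):
    return cbr_per_1000 / 1000 * population

births_year = expected_births(45, 200_000)  # 9000 births per year
births_week = births_year / 52              # ~173 per week
coverage = 80 / births_week * 100           # ~46% of births attended
print(births_year, round(births_week), round(coverage))
```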
By fertility is meant the actual bearing of children. Fertility depends upon a number of factors, including: age at marriage, duration of married life, spacing of children, education, economic status, caste and religion, nutrition, family planning, and other physical, social and cultural factors.
Fertility-related statistics:
The fertility rate (FR) is an age-sex specific rate usually derived from the census or special demographic surveys. This rate is a measure of how frequently women in the fertile age range (15-44 years) are having babies, so where the CBR is high the FR will also be high. Developing-country populations with average fertility might have a rate of about 100-150 births per 1000 women aged 15-44 years per year; in high-fertility populations it might be around 200 per 1000.
5.5.3 Crude Death Rate
CDR = total deaths in one year/total mid-year population (all ages, same year) x 1000
The CDR commonly ranges from 10 deaths per 1000 people per year in more developed areas to
more than 20 deaths per 1000 in poor populations.
The infant mortality rate (IMR), which is the proportion of live-born infants who die in the first twelve months of life, is commonly considered a good measure of health status. It is usually
calculated from the census or special demographic surveys. There are many technical problems
in calculating accurate IMRs and health workers should not rely on the accuracy of their
estimates unless there is a very good vital registration system in operation. The following formula
is commonly used:
IMR=total infant (aged <1year) deaths during one year/total births in same year x 1000
Most infant deaths occur during the first month of life; these deaths are called neonatal deaths. The total number of expected infant deaths can be calculated as follows: in a district with a population of 200 000, 9000 births per year and an IMR of 100, the estimated number of infant deaths would be 9000 x 100/1000 = 900 per year.
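The same estimate in sketch form (our own illustration, using the district example's figures):

```python
# Expected infant deaths from the IMR: 9000 births per year and an
# IMR of 100 per 1000 live births.

def expected_infant_deaths(imr_per_1000, births_per_year):
    return imr_per_1000 / 1000 * births_per_year

print(expected_infant_deaths(100, 9000))   # 900.0 infant deaths per year
```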
The child mortality rate (CMR) is based on the deaths between 1 and 4 years of age and is
important because malnutrition and infectious diseases are common in this age group. It is usually
calculated from a census or special surveys since it is not easily calculated with sufficient
accuracy from district health information.
A neglected death rate is the maternal mortality rate (MMR), partly because it is difficult to calculate accurately. An approximate rate for many developing countries is 1-5 maternal deaths per 1000 births per year, which means that a district with a population of 200 000 and a CBR of 40 per 1000 might expect between 8 and 40 maternal deaths per year. In this case it is more important to know the true numbers than the rate, since the actual numbers are so small. The use of births as a denominator, instead of the number of women of child-bearing age, may give the impression that the problem of maternal deaths in developing countries is less serious than it is in reality. For example, even the fact that the MMR may be 5 per 1000 in Africa compared to 5 per 100 000 in Europe does not adequately reflect the much greater risk of mothers dying from pregnancy-related causes in Africa. This is because the average number of births per woman is also much higher in Africa, and therefore the risk of a particular woman dying of pregnancy complications is today about 400 times greater in many developing countries than in developed areas.

MMR = maternal pregnancy-related deaths in one year/total births in same year x 1000
For a district of 200 000 people the following rates and totals might be expected:
Mortality indicators:
(i) Crude death rate: defined as the number of deaths per 1000 population per year in a given community.
(ii) Expectation of Life: Life expectancy at birth is the average number of years that
will be lived by those born alive into a population if the current age specific
mortality rates persist.
(iii) Infant Mortality rate: The ratio of deaths under one year of age in a given year to
the total number of live births in the same year; usually expressed as a rate per
1000live births. It is one of the most universally accepted indicators of health
status not only of infants but also of whole population and of the social economic
conditions under which they live.
(iv) Child mortality rate: another indicator related to overall health status is the number of deaths in early childhood (1-4 years) in a given year, per 1000 children in that age group at the midpoint of the year concerned. It thus excludes infant mortality.
(v) Under-5 mortality rate: It is the proportion of total deaths occurring in the under-
5 age group. This rate can be used to reflect both infant and child mortality rates.
(vi) Maternal mortality rate: Maternal (puerperal) mortality accounts for the greatest
proportion of deaths among women of reproductive age in most of the
developing world although its importance is not always evident from official
statistics.
(vii) Disease specific mortality: mortality rates can be computed for specific diseases.
(viii) Proportional mortality rate: one of the simplest measures for estimating the burden of a disease in the community is the proportional mortality rate, i.e. the proportion of all deaths currently attributed to that disease.
5.5.7 Migration
Urbanization is the movement of people from rural areas to urban areas, usually in search of a better life and employment.
Age pyramids
Such a representation is called an age pyramid. A vivid contrast may be seen in the age distribution of men and women in India and in the UK.
The age pyramid of India is typical of under-developed countries, with a broad base and a tapering top. In the developed countries, as in the UK, the pyramid generally shows a bulge in the middle, and has a narrower base.
POPULATION GROWTH
The population growth in a district depends on the balance between the number of births and people migrating into the district on the one hand, and the number of deaths and people migrating out on the other. Occasionally, a district's population may actually be declining, but this is usually due to migration away from the area, and not because deaths outnumber births.
The rate of natural increase, which excludes migration, is commonly between 1% and 3% per year in many developing countries and is calculated as follows:

rate of natural increase = (births - deaths in one year)/total mid-year population x 100
This rate largely determines how fast the district population will grow, as shown in Table 3.2 (figures calculated to the nearest 100 people, and percentages to the nearest whole number).
Example : if the population in an area was estimated to be 7830 on 31 March 1985 and 8450 on
30 September 1989, then the average increase per year is estimated to be (8450-7830) divided
by 4.5 =138 people. The estimated population on 30 September 1990 is therefore 8450 +
138=8588.
This method assumes that the increase in the number of people per year is constant. However, when used for projections over a longer period of time, this method tends progressively to underestimate the total population, as populations tend to grow at a constant rate of growth rather than by a constant absolute increase per year (as illustrated in Table 3.2). With a 3% natural growth rate the population will double in about 23-25 years.
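The two projection methods can be compared in a short sketch (our own illustration, starting from the 1989 population in the example above):

```python
# Constant absolute increase per year (linear) versus constant growth
# rate (exponential). Over 25 years at 3%, the exponential projection
# is well over double the starting population, while the linear one
# falls far behind.

def project_linear(pop, annual_increase, years):
    return pop + annual_increase * years

def project_exponential(pop, annual_rate, years):
    return pop * (1 + annual_rate) ** years

p0 = 8450                                        # population in 1989
print(project_linear(p0, 138, 25))               # 11900
print(round(project_exponential(p0, 0.03, 25)))  # ~17700 -- more than 2 x p0
```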
A knowledge of the number of people living in the district, with additional information on their age, sex and geographical distribution, is necessary for several aspects of planning and evaluation of health services. DHMTs will need population estimates for the district to provide:
Total population by age and sex groups and other relevant criteria
Total number of expected live births and deaths per year.
1. Reports on census
2. Reports from other studies
3. Other sources: authorities concerned with the provision of other services in the area, e.g. education, housing, law enforcement and public utilities.
To ask local leaders how many people they are responsible for, and to add all the responses to obtain a total. However, beware of being misled for various reasons, such as fear of taxation.
Ask for the total number of households, or count them, and multiply by the known average household size for your district (Vaughan and Morrow 1989:27-28).
Activity
Demography
Census
Write short notes on
a) Population growth
b) Crude birth rate
c) Crude death rate