Professional Documents
Culture Documents
LN Biostat HEW
LN Biostat HEW
Biostatistics
For Health Extension Workers
Getu Degu
University of Gondar
In collaboration with the Ethiopia Public Health Training Initiative, The Carter Center,
the Ethiopia Ministry of Health, and the Ethiopia Ministry of Education
November 2004
Funded under USAID Cooperative Agreement No. 663-A-00-00-0358-00.
Produced in collaboration with the Ethiopia Public Health Training Initiative, The Carter
Center, the Ethiopia Ministry of Health, and the Ethiopia Ministry of Education.
All rights reserved. Except as expressly provided above, no part of this publication may
be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage and retrieval system,
without written permission of the author or authors.
This material is intended for educational use only by practicing health care workers or
students and faculty in a health care field.
Acknowledgments
Recognizing the importance of and the need for the preparation of the
lecture note for the Training of Health Extension workers THE
CARTER CENTER (TCC) ETHIOPIA PUBLIC HEALTH TRAINING
INITIATIVE (EPHTI) facilitated the task for Gondar University to write
the lecture note in consultation with the Health Extension
Coordinating Office of the Federal Ministry of Health.
i
Table of Contents
Acknowledgements .........................................................................i
Table of contents ........................................................................... ii
Introduction ....................................................................................1
ii
UNIT FOUR: Demographic Methods .......................................53
References ..................................................................................82
iii
Biostatistics
Introduction
While preparing this lecture note, care was taken not to go beyond
the level of the health extension students. However, by taking
account of the shortage of reference material of this sort, a few topics
which may seem a little bit difficult for the students were added to
assist their biostatistics instructors. To make things easy and to
facilitate the teaching/learning process, a detailed explanation is
given throughout the lecture note.
The first unit deals with the basic concepts and definitions of statistics
in general and health service statistics in particular. Units two and
three cover the methods of data collection and presentation. Unit four
focuses on demographic methods with greater emphasis given to
censuses and vital statistics. A relatively comprehensive description
of measures of central tendency and variation is given in unit five.
1
Biostatistics
2
Biostatistics
UNIT ONE
Introduction to Statistics
1.2. Introduction
This section deals with the basic concepts and definitions of statistics
and related terms.
Definition: The term statistics is used to mean either statistical data or
statistical methods.
3
Biostatistics
5
Biostatistics
The facts given in part ‘b’ are definite and more convincing.
6
Biostatistics
c) etc.
7
Biostatistics
8
Biostatistics
10
Biostatistics
11
Biostatistics
Exercises
12
Biostatistics
UNIT TWO
Collection of Statistical Data
Among the various nominal data, the simplest types consist of only two
possible categories. These types of data are called dichotomous. That
is, either the patient lives or the patient dies, either he/she has some
particular attributes or he/she does not. An attribute is a characteristic
14
Biostatistics
Dead 7 17
Alive 38 29
Total 45 46
The above table presents data from a clinical trial of the drug
propranolol in the treatment of myocardial infarction. There were two
group of patients with MI. One group received propranolol; the other did
not and was the control. For each patient the response was
dichotomous; either he/she survived the first 28 days after hospital
15
Biostatistics
With nominal scale data the most useful descriptive summary measure
is the proportion or percentage of subjects who show the attribute.
Thus, we can see from the above table that 84 percent of the patients
treated with propranolol survived, in contrast with only 63% of the
control group.
16
Biostatistics
Ratio Data:- The data values in ratio data do have meaningful ratios,
for example, age is a ratio data, someone who is 40 is twice as old as
someone who is 20.
Both interval and ratio data involve measurement. Most data analysis
techniques that apply to ratio data also apply to interval data. Therefore,
in most practical aspects, these types of data (interval and ratio) are
grouped under metric data. In some other instances, these type of data
are also known as numerical discrete and numerical continuous.
17
Biostatistics
a) Sources of data
The statistical data may be classified under two categories,
depending upon the sources.
1) Primary data and Secondary data
Primary Data: are those data, which are collected by the investigator
himself/herself for the purpose of a specific inquiry or study. Such
data are original in character and are mostly generated by surveys
conducted by individuals or research institutions.
18
Biostatistics
On the other hand, such data must be used with great care, because
such data may also be full of errors due to the fact that the purpose
of the collection of the data by the primary agency may have been
different from the purpose of the user of these secondary data.
Moreover, there may have been bias introduced, the size of the
sample may have been inadequate, or there may have been
arithmetic or definition errors, hence, it is necessary to critically
investigate the validity of the secondary data.
The most important thing that should be noted is that the work of
collecting facts should be undertaken in a planned manner.
19
Biostatistics
20
Biostatistics
21
Biostatistics
22
Biostatistics
the questionnaire, ask them questions and record their replies. This
can be done using telephone or face-to-face interviews.
Questions may take two general forms: they may be “open ended”
questions, which the subject answers in his/her own words, or
“closed” questions, which are answered by choosing from a number
of fixed alternative responses.
b) self-administered questionnaire: The respondent reads the
questions and fills in the answers by himself/herself
(sometimes in the presence of an interviewer who “stands by”
to give assistance if necessary).
23
Biostatistics
24
Biostatistics
Exercises
1) Identify the type of data (nominal, ordinal, interval and ratio)
represented by each of the following. Confirm your answers by giving
your own examples.
a) Blood group
b) Temperature (Celsius)
c) Ethnic group
d) Job satisfaction index (1-5)
e) Number of heart attacks
f) Serum uric acid (mg/100ml)
g) Number of accidents in 3 - year period
h) Number of cases of each reportable disease reported by a health
worker
25
Biostatistics
26
Biostatistics
UNIT THREE
Methods Of Data Presentation
3.2. Introduction
The data collected in a survey is called raw data. In most cases,
useful information is not immediately evident from the mass of these
raw data. Collected data need to be organized in such a way as to
condense the information they contain in a way that will show
patterns of variation clearly. Therefore, for the raw data to be more
easily appreciated and to draw quick comparisons, it is often useful to
present the data in the form of:
Parts of a table
a) Title : it explains
• what the data are about
• from where the data are collected
• time period when the data are collected
• how the data are classified
28
Biostatistics
b) Captions
The headings of the columns are given in captions. In case there
is a sub-division of any column, there would be sub-caption
headings too.
N.B. The title, body and captions/stubs are present in all tables while
the presence of the other parts depends upon the type of data and
specific purpose.
29
Biostatistics
31
Biostatistics
Example: The following data (age records in years) were taken from
the population of a rural village X who was called for a meeting on
one of the health days. Construct a grouped frequency distribution of
the given data.
33
Biostatistics
29 30 35 37 40 47 50 65 78 87
71 75 64 51 47 41 38 35 37 26
63 60 52 53 48 49 42 43 31 39
44 33 36 38 27 25 36 32 36 35
54 55 56 69 57 47 47 45 46 31
47 48 59 58 21 23 22 32 31 30
48 49 32 33 34 39 33 34 23 22
40 41 49 48 42 43 47 46 43 44
Total 80 100.0
34
Biostatistics
Note that the unit of measure is 1. You can find this value by taking
the difference of two consecutive numbers (i.e., consider any two
numbers with the smallest difference, like 42-41, 22-21, etc.).
In the case of large size quantitative variables like weight, height, etc.
measurements, the groups are formed by amalgamating continuous
values into classes of intervals. There are, however, variables which
have frequently used standard classes. One of such variables, which
have wider applications in demographic surveys, is age. The age
distribution of a population is described based on the following
intervals:
<1 20-24 45-49
1-4 25-29 50-54
5-9 30-34 55-59
10-14 35-39 60-64
15-19 40-44 65+
Based on the purpose for which the table is designed and the
complexity of the relationship, a table could be either of simple
frequency table or cross tabulation.
35
Biostatistics
Species distribution
Year Total mixed
positives P.f. P.v. P.m. (P.f.+P.v.)
1973 21969 15214 6519 23 213
36
Biostatistics
Construction of tables
Although there are no hard and fast rules to follow, the following
general principles should be addressed in constructing tables.
• totals should be shown either in the top row and the first
column or in the last row and last column.
3. If data are not original, their source should be given in a foot note.
4. Overall, tables should be clearly labelled and the reader should be
able to determine without difficulty precisely what is tabulated.
Figures are not always interesting, and as their size and number
increase they become confusing and uninteresting to such an extent
that no one (unless he/she is specifically interested) would care to
study them.
38
Biostatistics
The choice of the particular graph among the different possibilities will
depend largely on the type of the data.
• Bar chart and pie chart are commonly used for qualitative or
quantitative discrete data.
• Histograms are used for quantitative continuous data.
There are, however, general rules that are commonly accepted about
construction of graphs.
39
Biostatistics
1. Bar graph
Bar diagrams are used to represent and compare the frequency
distribution of discrete variables and attributes of categorical series.
40
Biostatistics
When we represent data using bar diagram, all the bars must have
equal width and the distance between bars must be equal.
41
Biostatistics
Example 1
The most common causes of morbidity as reported by health centers in the Amhara region,
1997
70000
61974
Number of sick persons who visited the health centers
60000
50000 48750
42433 42389
40000
32185
30000 28625
25799
21455
20000
15095
11452
10000
0
1 2 3 4 5 6 7 8 9 10
Type of disease
42
Biostatistics
Key
N.B. You can also write the type of disease in place of the number if
enough space is available.
43
Biostatistics
25000
21969
20000
10859
10000
5343
4988
5000 4318
498
0
1973 1974 1975 1976 1977 1978
Year
44
Biostatistics
b) Multiple bar chart: In this type of chart the component figures are
shown as separate bars adjoining each other. The height of each bar
represents the actual value of the component figure. It depicts
distributional pattern of more than one variable
Example
14000
Nu m b er o f p erso n s w ith m alaria p arasites
12000
P.f. Malaria
others
8000
6000
4000
2000
0
1973 1974 1975 1976 1977 1978
Year
45
Biostatistics
The line graph is especially useful for the study of some variables
according to the passage of time. The time, in weeks, months or years
is marked along the horizontal axis; and the value of the quantity that is
being studied is marked on the vertical axis. The distance of each
plotted point above the base-line indicates its numerical value and these
points are joined by a line. The line graph is suitable for depicting a
consecutive trend of a series over a long period.
5.5
5.0
4.5
4.0
3.5
Rate (% )
3.0
2.5
2.0
1.5
1.0
0.5
0.0
1967 1969 1971 1973 1975 1977 1979
Year
46
Biostatistics
681, 1%
P.f. malaria
15633,
P.v. malaria
33%
others
31661,
66%
47
Biostatistics
48
Biostatistics
20 30 30 27
31 33 55 29
32 38 33 29
49 46 35 59
49 42 23 75
84 29 35 58
49 35 40 64
21 25 24 70
22 35 40 67
47 33 29 35
20 21 22 23 24 25 27 29 29 29
30 30 31 32 33 33 33 35 35 35
35 35 38 40 40 42 42 46 47 49
49 49 55 58 59 64 67 70 75 84
49
Biostatistics
10
2
Std. Dev = 15.70
Mean = 40.5
0 N = 40.00
20.0 30.0 40.0 50.0 60.0 70.0 80.0
25.0 35.0 45.0 55.0 65.0 75.0 85.0
With 40 scores, the median is the mean of the 20th and 21st scores.
50
Biostatistics
Exercises
51
Biostatistics
A 10,000 200
B 6,000 180
C 8,000 224
D 2,000 100
E 5,000 175
numberofdeaths
Death rate = X 1000 (this calculation is done for
populationsize
each village).
5. Visit the nearby health facility and show the top ten diseases of the
area graphically.
52
Biostatistics
UNIT FOUR
Demographic Methods
4.2. Introduction
Demography is a science that studies human population with respect to
size, distribution, composition, social mobility and its variation with
respect to all the above features and the causes of such variation and
the effect of all these on health, social, ethical, and economic
conditions.
54
Biostatistics
a) The Census
In modern usage, the term “census” refers to a nation-wide counting of
population. It is obtained by interviewing each person or household. A
census is a large and complicated undertaking. There are two main
different schemes for enumerating a population in a census.
De jure: The enumeration ( or count) is done according to the usual or
legal place of residence.
De facto: The enumeration is done according to the actual place of
residence on the day of the census.
55
Biostatistics
de facto:
advantage:
• offers less chance of double counting
disadvantage:
• Population figures may be inflated or deflated by tourists, travelling
salesmen, and other transients.
Information to be collected
Sex, age, marital status, educational status, economic characteristics,
place of birth, language, fertility, mortality , citizenship ( nationality),
living conditions (e.g. house-ownership, type of housing and the like),
religion, etc..
57
Biostatistics
Census Operation
The entire census operation has 3 parts (stages)
Uses of a census
gives complete and valid picture of the population composition
and characteristics
serves as a sampling frame ( a sampling frame is the list of
individuals, households, kebeles, etc)
provides with vital statistics of the population in terms of fertility
and mortality.
Census data are utilized in a number of ways for planning the
welfare of the people
Eg. To ascertain food requirements, to plan social welfare schemes like
schools, hospitals, houses, orphanages, pensions, etc.
b) Surveys
A survey is a technique based on sampling methods by means of
which we get specific information from part of the population which is
considered to be representative of the whole. Surveys are made at a
given moment, in a specific territory; sporadically and without periodicity
for the deep study of a problem.
58
Biostatistics
Exercises
1. What is demography ?
2. What are the sources of demographic data?
59
Biostatistics
60
Biostatistics
UNIT FIVE
Measures Of Central Tendency And Variation
5.2. Introduction
61
Biostatistics
i) ∑xi = 2+5+1+4+10-5+8 = 25
ii) (∑xI)2 = (25)2 = 625
iii) ∑xI2 = 4 + 25 + 1 + 16 + 100 + 25 + 64 = 231
1) ∑(xi +yi) = ∑xi + ∑yi , where the number of x values = the number
of y values.
2) ∑Kxi = k×∑xi , where K is a constant.
3) ∑K = n×K, where K is a constant.
62
Biostatistics
denoted by X .
63
Biostatistics
3.0 1.9 4.0 3.6 2.9 3.5 3.0 2.8 3.2 2.9 2.2 3.0
70 22 66 69 and 73
mean = (70+22+66+69+73) /5 = 60
64
Biostatistics
b) Disadvantages
It may be greatly affected by extreme items and its
usefulness as a “summary of the whole” may be
considerably reduced.
When the distribution has open-end classes, its
computation would be based on assumption, and therefore
may not be valid.
2. Median
65
Biostatistics
1.9 2.2 2.8 2.9 2.9 3.0 3.0 3.0 3.2 3.5 3.6 4.0
66
Biostatistics
22 66 69 70 73
Hence, the median is 69. In this case, the median is a better measure
of central tendency.
a) Advantages
It is easily calculated and is not much disturbed by extreme
values.
The median may be located even when the data are incomplete,
e.g, when the class intervals are irregular and the final classes
have open ends.
67
Biostatistics
b) Disadvantages
The median is not so well suited to algebraic treatment as the
arithmetic and geometric means.
It is not so generally familiar as the arithmetic mean.
3. Mode
Find the modal values for the following data ( part ‘a’ is given in
years while ‘b’ is given in kg)
b) 3.0, 1.9, 4.0, 3.6, 2.9, 3.5, 3.0, 2.8, 3.2, 2.9, 2.2, 3.0 ( modal
value = 3.0 kg)
a) Advantages
Since it is the most typical value it is the most descriptive average
Since the mode is usually an “actual value”, it indicates the
precise value of an important part of the series.
68
Biostatistics
b) Disadvantages:-
Unless the number of items is fairly large and the distribution
reveals a distinct central tendency, the mode has no significance
In a small number of items the mode may not exist.
15, 1 7, 41, 42, 43, 43, 44, 44, 45, 45, 46, 46, 46, 48, 50.
69
Biostatistics
19, 20, 20, 22, 22, 22, 23, 24, 25, 25, 26, 26, 27, 72, 77.
A few extreme large values ( i.e., 72 and 77) are scattered at the right
end and the mean will shift towards these scores. Therefore, the
mean is not an appropriate measure of central tendency. You can see
that the mean value is 30 which is greater than the values of 13 of the
observations. This shows that the mean is not a good summary
measure of the central tendency of the above distribution. We say
that this distribution is skewed to the right (i.e., positively skewed). In
this type of distribution the median is a preferable measure of central
tendency. The median (middle value) is 24.
70
Biostatistics
19, 21, 22, 22, 23, 23, 24, 24, 24, 25, 25, 26, 26, 27, 29.
71
Biostatistics
42
GM = 7 x8 x3x...x90 x1x 237
log Gm = 1/42 (log 7+log8+log3+..+log 237)
= 1/42 (.8451+.9031+.4771 +…2.3747)
= 1/42 (41.9985)
= 0.9999 ≈ 1.0000
72
Biostatistics
The anti-log of 0.9999 is 9.9992 ≈10 and this is the required geometric
mean. By contrast, the arithmetic mean, which is inflated by the high
values of 440, 237 and 200 is 39.8 ≈ 40.
a) Advantages
since it is less affected by extreme values it is a more preferable
average than the arithmetic mean.
It is based on all values given in the distribution.
b) Disadvantages
Its computation is relatively difficult.
It cannot be determined if there is any negative value in the
distribution, or where one of the items has a zero value.
73
Biostatistics
The two data sets given above have a mean of 50, but obviously set 1
is more “spread out” than set 2. How do we express this numerically?
The object of measuring this scatter or dispersion is to obtain a single
summary figure which adequately exhibits whether the distribution is
compact or spread out.
1. Range
The range is defined as the difference between the highest and smallest
observation in the data. It is the crudest measure of dispersion. The
range is a measure of absolute dispersion and as such cannot be
usefully employed for comparing the variability of two distributions
expressed in different units.
74
Biostatistics
2. Variance
The variance is a summary figure which measures the average
variability from the arithmetic mean. It is a very useful measure of
variability because it uses the information provided by every observation
in the sample and also it is very easy to handle mathematically. Its main
disadvantage is that the units of variance are the square of the units of
the original observations. Thus, if the original observations were, for
example, heights in cm then the units of variance of the heights are cm2.
The easiest way around this difficulty is to use the square root of the
variance (i.e., standard deviation) as a measure of variability.
75
Biostatistics
= 2502/14
76
Biostatistics
3. Standard Deviation
Σ ( X i − Χ) 2
n
S = i =1
= sample variance = sample standard
n −1
deviation
σ= ∑(X − μ)2
i
= population standard deviation (Note that , σ²
N
= Population variance)
77
Biostatistics
= 2502/14
S
Definition: The coefficient of variation (CV) is defined by: CV =
X
x 100%
78
Biostatistics
Example: a) Compute the CV for the birth weight data when they
are expressed in either grams or ounces.
If the data were expressed in ounces, Χ =111.71 oz, S=15.7 oz, then
S 15.7
CV = 100% × = 100% × = 14.1%
Χ 111.71
79
Biostatistics
8.67
CV (pulse rate) = x 100% = 12.6%
68.7
1.03
CV (serum uric acid level) = x 100 % = 19%
5.41
Hence, the serum uric acid levels are relatively more spread out than
pulse rates.
Exercises
1. Define arithmetic mean, median, mode and geometric mean
and explain their strengths and limitations.
2. Define range, variance and standard deviation and explain
their advantages and disadvantages.
3. When values of the three measures of central tendency
(mean, median and mode) of a certain distribution are equal,
this distribution is said to be __________ .
80
Biostatistics
81
Biostatistics
References
82