

• At the end of this session, students will be able to:
– Define Statistics and Biostatistics
– Define variables and identify their categories
– Identify the four scales of measurement
– Identify data collection methods and

• What is statistics?
• Statistics: A field of study concerned with:
– collection, organization, analysis,
summarization and interpretation of numerical
data, &
– the drawing of inferences about a body of data
when only a small part of the data is observed.

• Statistics helps us use numbers to communicate ideas
• Biostatistics: The application of statistical methods to the fields of biological and medical sciences.
– Concerned with interpretation of biological data & the communication of information derived from these data
– Has a central role in medical investigations

• The numbers must be presented in such a
way that valid interpretations are possible

• Statistics are everywhere – just look at any newspaper or the current medical and public health literature.

Uses of biostatistics
• Provide methods of organizing information
• Assessment of health status
• Health program evaluation
• Resource allocation
• Magnitude of association
– Strong vs weak association between
exposure and outcome

• Assessing risk factors
– Cause & effect relationship
• Evaluation of a new vaccine or drug
– What can be concluded if the proportion of
people free from the disease is greater among
the vaccinated than the unvaccinated?
– How effective is the vaccine (drug)?
– Is the effect due to chance or some bias?
• Drawing of inferences
– Information from sample to population
What does biostatistics cover?
• Biostatistical thinking contributes to every step in a research:
– Research planning
– Design
– Execution (data collection)
– Data processing
– Data analysis
• The best way to learn about biostatistics is to follow the flow of a research from inception to the final publication


Types of Statistics
1. Descriptive statistics:
• Ways of organizing and summarizing data
• Helps to identify the general features and
trends in a set of data and extracting
useful information
• Also very important in conveying the final
results of a study
• Example: tables, graphs, numerical
summary measures
2. Inferential statistics:
• Methods used for drawing conclusions
about a population based on the
information obtained from a sample of
observations drawn from that population
• Example: Principles of probability,
estimation, confidence interval,
comparison of two or more means or
proportions, hypothesis testing, etc.
Population and Sample
• Population:
– Refers to any collection of objects/ persons
• Target population:
– A collection of items that have something in
common for which we wish to draw conclusions at
a particular time.
• E.g., All hospitals in Ethiopia
– The whole group of interest

Study (Sampled) Population:
• The subset of the target population that has at
least some chance of being sampled
• The specific population group from which
samples are drawn and data are collected

Sample:
• A subset of a study population, about which information is actually obtained.
• The individuals who are actually measured and comprise the actual data.

• Role of statistics: using information from a sample to make inferences about the population



E.g.: In a study of the prevalence of HIV among adolescents in Ethiopia, a random sample of adolescents in Lideta Kifle Ketema of AA was included.
• Target population: All adolescents in Ethiopia
• Study population: All adolescents in Addis Ababa
• Sample: Adolescents in Lideta Kifle Ketema who were included in the study
[Figure: Nested diagram with the Sample inside the Study Population inside the Target Population.]

• Generalization is a two-stage procedure:
• We need to be able to generalize from:
– the sample to the study population, &
– then from the study population to the target population
• If the sample is not representative of the population, the conclusions are restricted to the sample & do not generalize.
• Collect information from a relatively SMALL sample → draw conclusions about a rather LARGE population

Parameter and Statistic
• Parameter: A descriptive measure
computed from the data of a population.
– E.g., the mean (µ) age of the target population
• Statistic: A descriptive measure computed
from the data of a sample.
– E.g., sample mean age (X̅)

• To each sample statistic there corresponds a
population parameter.
• We use X̅, S², S, p, etc. to estimate μ, σ², σ, P (or π), etc., respectively.
Sample Statistics are Estimators of Population Parameters

Sample statistic                        Population parameter
Sample mean, X̅                          µ
Sample variance, S²                     σ²
Sample proportion, p                    P or π
Sample SD, S                            σ
Sample odds ratio, OR̂                   OR
Sample relative risk, RR̂                RR
Sample correlation coefficient, r       ρ
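A minimal Python sketch of how sample statistics are computed as estimates of the population parameters listed above. The data values below are hypothetical and serve only to illustrate the calculation; they do not come from any study in these slides.

```python
import statistics

# Hypothetical sample of 8 ages (years); illustrative values only.
ages = [23, 31, 27, 45, 38, 29, 34, 41]
# Hypothetical binary outcome for the same 8 people (1 = has the condition).
has_condition = [0, 1, 0, 1, 0, 0, 1, 0]

x_bar = statistics.mean(ages)      # sample mean, estimates mu
s2 = statistics.variance(ages)     # sample variance (n - 1 denominator), estimates sigma^2
s = statistics.stdev(ages)         # sample SD, estimates sigma
p = sum(has_condition) / len(has_condition)  # sample proportion, estimates P (or pi)

print(f"Sample mean       = {x_bar:.2f}")
print(f"Sample variance   = {s2:.2f}")
print(f"Sample SD         = {s:.2f}")
print(f"Sample proportion = {p:.3f}")
```

A different random sample from the same population would generally give different values of these statistics, while the population parameters stay fixed.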
02/08/24 19
• Variable: A characteristic which takes
different values in different persons, places,
or things.
• Any aspect of an individual or object that is
measured (e.g., BP) or recorded (e.g., age,
sex) and takes any value.
• There may be one variable in a study or many.
• E.g., A study of treatment outcome of TB
• Variables can be broadly classified as:
– Categorical (or qualitative), or
– Quantitative (or numerical).
• Categorical variable: A variable or characteristic which cannot be measured in quantitative form but can only be sorted by name or categories
• Not able to be measured as we measure height or weight
• The notion of magnitude is absent or implicit.

• Quantitative variable: A variable that can be measured (or counted) and expressed numerically
• E.g., height, weight, number of children, etc.
• Has the notion of magnitude.

Quantitative variable is divided into two:
1. Discrete: It can only have a limited number of
discrete values (usually whole numbers).
– E.g., the number of episodes of diarrhoea a child has
had in a year. You can’t have 12.5 episodes of diarrhoea
• Characterized by gaps or interruptions in the
values (integers).
• Both the order and magnitude of the values matter.
• The values aren’t just labels, but are actual
measurable quantities.
2. Continuous variable: It can have an
infinite number of possible values in any
given interval.
• Both the magnitude and the order of the
values matter
• Does not possess the gaps or interruptions
• Weight is continuous since it can take on
any number of values (e.g., 34.575 Kg).


Types of variables
• Qualitative or categorical
– Nominal (not ordered), e.g. ethnic group
– Ordinal (ordered), e.g. response to treatment
• Quantitative measurement
– Discrete (count data), e.g. # of admissions
– Continuous (real-valued), e.g. height

Scales of measurement
• Not all measurements are the same.
• Measuring weight, e.g., 40 kg
• Measuring the status of a patient on a scale, e.g., “improved”, “stable”, “not improved”.
• There are four types of scales of measurement:
1. Nominal scale:
• The simplest type of data, in which the values
fall into unordered categories or classes
• Consists of “naming” observations or classifying them into various mutually exclusive and collectively exhaustive categories
• Uses names, labels, or symbols to assign each measurement to a category
– Examples: Blood type, sex, race, marital status, etc.
Example of nominal scale:
1. Black
2. White
3. Latino
4. Other
• The numbers have NO meaning
• They are labels only
• If nominal data can take on only two
possible values, they are called
dichotomous or binary.
• So sex is not just nominal, it is
dichotomous (male or female).
• Yes/no questions
– E.g., cured from TB at 6 months of Rx
2. Ordinal scale:
• Assigns each measurement to one of a
limited number of categories that are
ranked in terms of order.
• Although non-numerical, can be
considered to have a natural ordering
– Examples: Patient status, cancer stages,
social class, etc.
Example of ordinal scale:
• Pain level:
1. None
2. Mild
3. Moderate
4. Severe
• The numbers have LIMITED meaning: 4>3>2>1 is all we know apart from their utility as labels
3. Interval scale:
- Measured on a continuum, and differences between any two numbers on the scale are of known size.
Example: Temp. in °F on 4 consecutive days
Days:      A   B   C   D
Temp. °F:  50  55  60  65
For these data, not only is day A with 50° cooler than day D with 65°, but it is 15° cooler.
- It has no true zero point. “0” is arbitrarily chosen and doesn’t reflect the absence of temperature.
4. Ratio scale:
- Measurement begins at a true zero point and the scale has equal intervals.
- Examples: Height, age, weight, BP, etc.
• Note on meaningfulness of “ratio”:
– Someone who weighs 80 kg is two times as heavy as someone else who weighs 40 kg. This is true even if weight had been measured in other units.
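The difference between the interval and ratio scales can be checked with a little arithmetic. The Python sketch below reuses the weights (80 kg, 40 kg) and temperatures (50 °F, 65 °F) from the examples above. The kilogram-to-pound factor is an approximate value, and the helper function name is chosen only for this illustration.

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

KG_TO_LB = 2.20462  # approximate conversion factor (assumption for illustration)

# Ratio scale: the ratio of two weights does not depend on the unit.
w1_kg, w2_kg = 80, 40
ratio_kg = w1_kg / w2_kg
ratio_lb = (w1_kg * KG_TO_LB) / (w2_kg * KG_TO_LB)

# Interval scale: the ratio of two temperatures changes with the unit,
# because the zero point of each temperature scale is arbitrary.
day_a_f, day_d_f = 50, 65
ratio_f = day_d_f / day_a_f
ratio_c = fahrenheit_to_celsius(day_d_f) / fahrenheit_to_celsius(day_a_f)

# Equal differences remain equal after conversion: each 5 °F step is the same step in °C.
temps_c = [fahrenheit_to_celsius(t) for t in (50, 55, 60, 65)]
steps_c = [b - a for a, b in zip(temps_c, temps_c[1:])]

print(f"Weight ratio in kg: {ratio_kg:.2f}, in lb: {ratio_lb:.2f}")
print(f"Temperature ratio in °F: {ratio_f:.2f}, in °C: {ratio_c:.2f}")
print("Steps in °C:", [round(step, 2) for step in steps_c])
```

The weight ratio is the same in both units, while the temperature ratio is not. Only the differences between temperatures carry over from one unit to the other.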

Data collection methods
• Data collection techniques allow us to systematically collect data about our objects of study (people, objects, and phenomena) and about the setting in which they occur.

• If data are collected haphazardly, it will be difficult to answer our research questions in a conclusive way.
Various data collection techniques can
be used such as:
• Using available information (record review)
• Observing
• Interviewing
• Administering written questionnaires
• Sending an e-mail, telegraph, or fax

• Focus group discussions
• Other data collection techniques, such as measuring height, length, weight, BMI, MUAC, chest circumference, head circumference, blood pressure, Hgb, Hct, etc.

1. Using available information (record review)
• There is a large amount of data that has already been collected by others.
• Locating these sources and retrieving the information is a good starting point in any data collection effort.
2. Observation
• Observation is a technique which involves systematically selecting, watching, and recording behaviors and characteristics of living beings, objects, or phenomena.
• Observation of human behavior is a much-used data collection technique. It can be undertaken in two different ways:
1. Participant observation: the observer takes part in the situation he or she observes.
2. Non-participant observation: the observer watches the situation, openly or concealed, but does not participate.
3. Interviewing
• An interview is a data collection technique that involves oral questioning of respondents, either individually or as a group.
• Interviews can be conducted with varying
degrees of flexibility. The two extremes,
high and low degree of flexibility, are
described below:
1.High degree of flexibility
2.Low degree of flexibility

1. High degree of flexibility
• An unstructured or loosely structured method of asking questions can be used for interviewing individuals as well as groups of key informants.
• A flexible method of interviewing is useful if a researcher has as yet little understanding of the problem or situation under investigation.
• The interviewer may ask additional questions on the spot in order to gain as much useful information as possible.
• Questions are open-ended: the respondent is unrestricted in what and how he or she answers.


2. Low degree of flexibility

• Less flexible methods of interviewing are
useful when the researcher is relatively
knowledgeable about expected answers or
when the number of respondents being
interviewed is relatively large.

• Example: Interviews using a
questionnaire with a fixed list of questions
in a standard sequence, which have
mainly fixed or pre-categorized answers.

4. Self-administered written questionnaire: a data collection tool in which written questions are presented that are to be answered by the respondents in written form.

• Before examining the steps in designing a
questionnaire, we need to review the types
of questions used in questionnaires.
• Depending on how questions are asked and recorded, we can distinguish two major possibilities: open-ended questions and closed questions.


1. Open-ended questions
• Open-ended questions permit free
responses that should be recorded in the
respondent’s own words.
• The respondent is not given any possible
answers to choose from.

• Such questions are useful to obtain information on:
– Facts with which the researcher is not very familiar,
– Opinions, attitudes, and suggestions of informants, or
– Sensitive issues.

For example:
• “Can you describe exactly what the traditional birth attendant did when your labour started?”
• “What do you think are the reasons for a high drop-out rate of village health committee members?”

Closed Questions
• Closed questions offer a list of possible
options or answers from which the
respondents must choose.
• When designing closed questions one should try to:
– Offer a list of options that are exhaustive and mutually exclusive
– Keep the number of options as few as possible
• Closed questions are useful if the range of
possible responses is known.

For example:
“What is your marital status?”
1. Single
2. Married/living together
3. Separated/divorced/widowed

“Have you ever gone to the local village health worker for treatment?”
1. Yes
2. No

5. Focus Group Discussion
• Used to collect information from a group through guided discussions of the study topic
• Eight to ten individuals with similar backgrounds are brought together to discuss their problems
• One moderator and one timekeeper are needed to facilitate and record the discussion
• The discussion stops when idea saturation is reached

Problems in gathering data
• Language barriers
• Lack of adequate time
• Expense
• Inadequately trained and experienced staff
• Invasion of privacy

Selecting a data collection method depends on:
• The nature of the investigation: whether the study is qualitative or quantitative
• The resources available and the relevance of the information
• The acceptability and accuracy of the method
• The research interest to focus on and cover
• Familiarity with the procedure
• The characteristics of the study population are also among the influencing factors.
Methods of Data Organization
and Presentation
Frequency Distributions (Tables)
• Ordered array: A simple arrangement of individual observations in the
order of magnitude.
• Very difficult with large sample size

12 19 27 36 42 59
15 22 31 39 43 61
17 23 31 41 44 65
18 26 34 41 54 67
• The actual summarization and organization
of data starts from frequency distribution.

• Frequency distribution: A table which has a list of each of the possible values that the data can assume along with the number of times each value occurs.
• For nominal and ordinal data, frequency
distributions are often used as a summary.
• Example:

• The % of times that each value occurs, or the relative frequency, is often listed
• Tables make it easier to see how the data are distributed
• For both discrete and continuous data, the values are grouped into non-overlapping intervals, usually of equal width
a) Qualitative variable: Count the number of
cases in each category.

- Example 1: The intensive care unit type of 25 patients entering ICU at a given hospital:
1. Medical
2. Surgical
3. Cardiac
4. Other

ICU Type    Frequency      Relative frequency
            (how often)    (proportionately often)
Medical     12             0.48
Surgical    6              0.24
Cardiac     5              0.20
Other       2              0.08
Total       25             1.00
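The counting step for a qualitative variable can be sketched in Python as follows. The 25 individual patient records are not listed in the slides, so the raw list below is rebuilt from the counts in the table purely for illustration.

```python
from collections import Counter

# Raw records rebuilt from the table counts (assumption: the original
# 25 individual records are not available in the slides).
icu_types = ["Medical"] * 12 + ["Surgical"] * 6 + ["Cardiac"] * 5 + ["Other"] * 2

counts = Counter(icu_types)   # frequency of each category
n = len(icu_types)            # total number of patients

# Each row holds: category, frequency, relative frequency.
table = [(category, freq, freq / n) for category, freq in counts.items()]

print(f"{'ICU Type':<10}{'Freq':>6}{'Rel. freq':>11}")
for category, freq, rel in table:
    print(f"{category:<10}{freq:>6}{rel:>11.2f}")
print(f"{'Total':<10}{n:>6}{sum(rel for _, _, rel in table):>11.2f}")
```

The relative frequencies always add up to 1, which is a quick check on the table.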
Example 2:
A study was conducted to assess the
characteristics of a group of 234 smokers by
collecting data on gender and other variables.
Gender, 1 = male, 2 = female

Gender Frequency (n) Relative Frequency

Male (1) 110 47.0%
Female (2) 124 53.0%
Total 234 100%
b) Quantitative variable:
- Select a set of continuous, non-overlapping
intervals such that each value can be placed
in one, and only one, of the intervals.
- The first consideration is how many intervals
to include
- For a continuous variable (e.g., age), the frequency distribution of the individual ages is not so interesting.
• We “see more” in frequencies of age values in “groupings”. Here, 10-year groupings make sense.
• Grouped data frequency distribution
To determine the number of class intervals and the corresponding width, we may use:

Sturges’ rule:
K = 1 + 3.322 (log n), with the logarithm taken to base 10
W = (L − S) / K
where K = number of class intervals, n = no. of observations, W = width of the class interval, L = the largest value, S = the smallest value
– Example: Leisure time (hours) per week for 40 college students:
23 24 18 14 20 36 24 26 23 21 16 15 19 20
22 14 13 10 19 27 29 22 38 28 34 32 23 19
21 31 16 28 19 18 12 27 15 21 25 16
K = 1 + 3.322 (log 40) = 6.32 ≈ 6
Maximum value = 38, Minimum value = 10
Width = (38 − 10)/6 = 4.67 ≈ 5
Time (hours)   Frequency   Relative frequency   Cumulative relative frequency
10-14          5           0.125                0.125
15-19          11          0.275                0.400
20-24          12          0.300                0.700
25-29          7           0.175                0.875
30-34          3           0.075                0.950
35-39          2           0.050                1.000
Total          40          1.000
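The calculation above can be reproduced with a short Python sketch. It applies Sturges' rule to the 40 leisure-time values and then builds the grouped table. Rounding the width up to the next whole number and starting the first class at the smallest value are choices made here to match the table. Other conventions are possible.

```python
import math

leisure_hours = [
    23, 24, 18, 14, 20, 36, 24, 26, 23, 21, 16, 15, 19, 20,
    22, 14, 13, 10, 19, 27, 29, 22, 38, 28, 34, 32, 23, 19,
    21, 31, 16, 28, 19, 18, 12, 27, 15, 21, 25, 16,
]

n = len(leisure_hours)
k = round(1 + 3.322 * math.log10(n))          # Sturges' rule: 6.32 rounds to 6
largest, smallest = max(leisure_hours), min(leisure_hours)
width = math.ceil((largest - smallest) / k)   # 4.67 rounded up to 5

rows = []
cumulative = 0
for i in range(k):
    lower = smallest + i * width
    upper = lower + width - 1                 # whole-number class limits, e.g. 10-14
    freq = sum(lower <= x <= upper for x in leisure_hours)
    cumulative += freq
    rows.append((lower, upper, freq, freq / n, cumulative / n))

print("Class   Freq  Rel.freq  Cum.rel.freq")
for lower, upper, freq, rel, cum_rel in rows:
    print(f"{lower}-{upper}  {freq:>5}  {rel:>8.3f}  {cum_rel:>12.3f}")
```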
• Cumulative frequency: The sum of the frequencies of a class and of all the classes below it.
• Cumulative relative frequency: The proportion (or percentage) of the total number of observations that have a value either in that interval or below it.
• Mid-point: The value of the interval which lies midway between the lower and the upper limits of a class.
• True limits: Those limits that make an interval of a continuous variable continuous in both directions
• Used for smoothing of the class intervals
• Subtract 0.5 from the lower limit and add it to the upper limit
Time (hours)   True limits    Mid-point   Frequency
10-14          9.5 – 14.5     12          5
15-19          14.5 – 19.5    17          11
20-24          19.5 – 24.5    22          12
25-29          24.5 – 29.5    27          7
30-34          29.5 – 34.5    32          3
35-39          34.5 – 39.5    37          2
Total                                     40
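A small Python sketch of how the true limits and mid-points in the table are obtained from the class limits. It assumes whole-number class limits, so the adjustment is 0.5 as stated above.

```python
# Class limits (hours) from the leisure-time table.
classes = [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]

results = []
for lower, upper in classes:
    true_lower = lower - 0.5         # subtract 0.5 from the lower limit
    true_upper = upper + 0.5         # add 0.5 to the upper limit
    midpoint = (lower + upper) / 2   # midway between the lower and upper limits
    results.append((true_lower, true_upper, midpoint))

for (lower, upper), (tl, tu, mid) in zip(classes, results):
    print(f"{lower}-{upper}: true limits {tl} to {tu}, mid-point {mid}")
```

The upper true limit of each class equals the lower true limit of the next, so the intervals leave no gaps.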
Simple Frequency Distribution
• Primary and secondary cases of syphilis morbidity by age, 1989

Age group (years)    Number of cases    Percent
0-14                 230                0.5
15-19                4378               10.0
20-24                10405              23.6
25-29                9610               21.8
30-34                8648               19.6
35-44                6901               15.7
45-54                2631               6.0
>54                  1278               2.9
Total                44081              100
Two Variable Table
• Primary and secondary cases of syphilis morbidity by age and sex, 1989

                     Number of cases
Age group (years)    Male     Female    Total
0-14                 40       190       230
15-19                1710     2668      4378
20-24                5120     5285      10405
25-29                5304     4306      9610
30-34                5537     3111      8648
35-44                5004     1897      6901
45-54                2144     487       2631
>54                  1147     131       1278
Total                26006    18075     44081
Tables can also be used to present three or more variables.
Variable Frequency (n) Percent
Age (yrs)
Guidelines for constructing tables
• Keep them simple,
• Limit the number of variables to three or fewer,
• All tables should be self-explanatory,
• Include clear title telling what, when and where,
• Clearly label the rows and columns,
• State clearly the unit of measurement used,
• Explain codes and abbreviations in the foot-note,
• Show totals,
• If data is not original, indicate the source in foot-note.
Diagrammatic Representation

• Pictorial representations of numerical data

Importance of diagrammatic representation:

1. Diagrams have greater attraction than mere figures.
2. They give a quick overall impression of the data.
3. They have greater memorizing value than mere figures.
4. They facilitate comparison
5. Used to understand patterns and trends
• Well designed graphs can be a powerful means of communicating a great deal of information.
• When graphs are poorly designed, they not only convey the message ineffectively, but are often misleading.
Specific types of graphs include:
• For nominal and ordinal data:
– Bar graph
– Pie chart
• For quantitative data:
– Histogram
– Stem-and-leaf plot
– Box plot
– Scatter plot
– Line graph
• Others
1. Bar charts (or graphs)
• Categories are listed on the horizontal axis
• Frequencies or relative frequencies are
represented on the Y-axis (ordinate)
• The height of each bar is proportional to
the frequency or relative frequency of
observations in that category
Bar chart for the type of ICU for 25 patients
Method of constructing bar chart
• All the bars must have equal width
• The bars are not joined together (leave
space between bars)
• The different bars should be separated
by equal distances
• All the bars should rest on the same line
called the base
• Label both axes clearly
2. Sub-divided bar chart
• If there are different quantities forming
the sub-divisions of the totals, simple
bars may be sub-divided in the ratio of
the various sub-divisions to exhibit the
relationship of the parts to the whole.
• The order in which the components are
shown in a “bar” is followed in all bars
used in the diagram.
– Example: Stacked and 100% Component
bar charts
Example: Plasmodium species distribution for confirmed malaria cases, Zeway, 2003

[Figure: Sub-divided bar chart with one bar each for August, October, and December; components: P. falciparum, P. vivax, and Mixed.]
3. Multiple bar graph
• Bar charts can be used to represent the relationships among more than two variables
• The following figure shows the relationship between children’s reports of breathlessness and cigarette smoking by themselves and their parents.
[Figure: Prevalence of self-reported breathlessness among school children, 1998. Multiple bar chart; vertical axis: breathlessness, per cent; horizontal axis: parents smoking (neither, one, both); bars: child never smoked, child smoked occasionally, child smoked one/week or more.]

We can see from the graph quickly that the prevalence of the symptoms
increases both with the child’s smoking and with that of their parents.
There’s no reason why the bar chart can’t be plotted horizontally instead of vertically.

[Figure 1. Source of information on the complications of FGM and participation in RH programs, Jijiga, 2004*. Horizontal bar chart; vertical axis: type of source; horizontal axis: 0 to 50.]
* FGMC = female genital mutilation committee; CAT = community action team; HC = health centre; CHA = community health agent
4. Pie chart
• Shows the relative frequency for each
category by dividing a circle into sectors, the
angles of which are proportional to the
relative frequency.
• Used for a single categorical variable
• Use percentage distributions
Steps to construct a pie-chart
• Construct a frequency table
• Change the frequency into percentage (P)
• Change the percentages into degrees, where: degree = (Percentage / 100) × 360°
• Draw a circle and divide it accordingly

Example: Distribution of deaths for females, in England
and Wales, 1989.

Cause of death No. of deaths

Circulatory system 100 000
Neoplasm 70 000
Respiratory system 30 000
Injury and poisoning 6 000
Digestive system 10 000
Others 20 000
Total 236 000
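The steps for a pie chart can be sketched in Python using the death counts from the table above. This only computes the percentage and the sector angle for each cause. Drawing the circle itself is left out.

```python
# Number of deaths by cause, females, England and Wales, 1989 (from the table above).
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury and poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())

percentages = {}
angles = {}
for cause, count in deaths.items():
    percentages[cause] = count / total * 100          # frequency -> percentage
    angles[cause] = percentages[cause] / 100 * 360    # percentage -> degrees

for cause in deaths:
    print(f"{cause:<22}{percentages[cause]:>6.1f}%{angles[cause]:>8.1f} degrees")
print(f"{'Total':<22}{sum(percentages.values()):>6.1f}%{sum(angles.values()):>8.1f} degrees")
```

The sector angles should add up to 360 degrees, which gives a simple check before drawing.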
[Figure: Pie chart of the distribution of cause of death for females, in England and Wales, 1989; labelled sectors include circulatory system, respiratory system, digestive system, and injury and poisoning.]


5. Histogram
• Histograms are frequency distributions with
continuous class intervals that have been
turned into graphs.
• To construct a histogram, we draw the interval
boundaries on a horizontal line and the
frequencies on a vertical line.
• Non-overlapping intervals that cover all of the
data values must be used.
• Bars are drawn over the intervals in such a way that the area of each bar is proportional to the frequency of observations in the interval.
Example: Distribution of the age of women at the time of marriage

Age (years)    15-19  20-24  25-29  30-34  35-39  40-44  45-49
Number         11     36     28     13     7      3      2

[Figure: Histogram of the age of women at the time of marriage. Horizontal axis: age group, with true class limits 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5; vertical axis: number of women.]
[Figure: Histogram for the ages of 2087 mothers with <5 children, Adami Tulu, 2003. Horizontal axis: age, 15.0 to 55.0. Mean = 27.6, Std. Dev = 6.13, N = 2087.]


Two problems with histograms
1. They are somewhat difficult to construct
2. The actual values within the respective groups are lost and difficult to reconstruct

• The other graphic display (stem-and-leaf plot) overcomes these problems
Other Methods of Data Organization
and Presentation
• Stem-and-Leaf Plot
• Frequency polygon
• Percentiles (Quartiles)
• Box and Whisker Plot

