Statistical analysis

HEALTH (EPIDEMIOLOGY)
Introduction: Why Biostatistics
• Knowledge of biostatistics helps you to critically
appraise a scientific publication and to know which test to
run for your research.
• To do this you should know whether the right test
has been used and how to interpret the results.
Learning outcome
• Understand the meaning of biostatistics

• Define variables and the types/categories

• Understand visual and numerical summaries of data

• Understand which statistical methods are suitable to a given
type of data
What is biostatistics
• Statistics is the science that deals with collecting, organizing,
summarizing, analyzing and interpreting data.

• When the data being analyzed are derived from the biological
sciences and medicine, we use the term biostatistics.

• Biostatistics is therefore the branch of statistics applied to
the biological, medical and health sciences.
Descriptive Statistics
Statistics which describe a given set of data are generally referred to as
Descriptive Statistics. These include percentages, mean, median,
mode and standard deviation.

Descriptive statistics are the first tools used to explore the data, giving
some important indications of what the data set “looks like” [measures
of central tendency, measures of dispersion, summarization and
presentation of data].
Inferential statistics
• Inferential statistics are used to make inferences or draw
conclusions from our data. They allow us to generalize findings
from a smaller group (sample) to a larger group (population)

• Statistical software packages : Excel, SPSS, EPI INFO,

[Diagram: variables are either numerical (quantitative) or categorical (qualitative)]
• Variable – a general term for any feature/characteristic of the unit or subject
which is observed or measured and may vary from subject to subject
• Numerical variables/data include counts, such as the number of children of a specific
age, and measurements, such as height and weight. They are divided into:
• (1) Continuous variable (also called quantitative or measurement variable)/data is
measured and can take any value within a range, so the number of possible values is
infinite eg height, time, weight, SBP, CD4 cell count, age
• (2) Discrete variable/data is counted and can only take certain values,
usually whole numbers (integers) eg number of students in a class (you can’t have half a
student); number of episodes of diarrhea a child has had in one year.
[Diagram: numerical variables are either discrete or continuous]
Categorical variables/ Data
• Categorical variables/data are the result of classifying. For instance, individuals can
be classified into categories according to their blood group: A, B, O or AB. Nominal and
ordinal variables are types of categorical variable.

• Nominal variables have distinct levels that have no inherent ordering eg hair colour,
sex, tribe, occupation, race, religion, LGA, state of origin, country etc
• Ordinal variables have levels that do follow a distinct ordering eg position in the office,
year of study etc
• A categorical variable can be dichotomous (binary), where there are only two categories
into which an observation can fall eg sex (male or female), dead or alive

• Or polychotomous, where there are more than two categories into
which the observations can fall eg ethnic group, occupation, religion
[Diagram: categorical variables, including nominal and dichotomous]
Classification of Variables
[Diagram: independent (exposure) variables, dependent (outcome) variables, confounding variables]
Independent and dependent variables
• Independent (exposure) variable – a factor that is measured or manipulated
by the researcher to determine its relationship to an observed phenomenon (ie the
dependent variable). It is presumed to cause a change in the dependent (outcome)
variable eg age, sex, occupation, marital status, educational level, SES etc.

• Dependent (outcome) variable – the factor which is observed and measured to
determine the effect of the independent variable.

• Eg use of bednet (exposure) / number of malaria episodes (outcome/dependent)

• Rate of smoking (exposure) / lung cancer (outcome/dependent); use of
contraceptive (independent/exposure) / occurrence of STD (dependent/outcome);
age (independent) / heart disease (dependent/outcome)
Confounding variables
• A confounding variable is an extraneous variable that is independently associated with
both the disease (outcome) and the risk factor (exposure). Confounding occurs when the
effects of two exposures (risk factors) have not been separated and the analysis
concludes that the effect is due to one variable rather than the other. It therefore
influences the relationship between the independent and dependent variables.

• For a variable to be a confounder, it must, in its own right, be a determinant of the
occurrence of disease (i.e. a risk factor) and be associated with the exposure under
investigation. Eg in the relationship between coffee drinking (exposure) and heart disease
(outcome), tobacco use is a third, confounding variable. Other examples: age as a
confounding factor for sex and exercise; alcohol as a confounder for smoking and lung cancer.
• Percentages are mainly used in the tabulation of data in order to
give the reader a scale on which to assess or compare the data.

• To calculate a percentage, divide the number of items or patients in
the category by the total number in the group and multiply by 100.
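The calculation above can be sketched in a few lines of Python. The counts are frequencies taken from the respondents table later in this lecture (n = 400); the function name `percentage` is only an illustrative choice.

```python
def percentage(count, total):
    """Return count as a percentage of total."""
    return 100 * count / total


total_respondents = 400
counts = [134, 84, 54]  # three frequencies from the respondents table

# Divide each count by the group total, multiply by 100, round to 1 decimal.
percents = [round(percentage(c, total_respondents), 1) for c in counts]
print(percents)  # [33.5, 21.0, 13.5]
```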

• The Mean is used when the spread of the data is
fairly similar on each side of the mid-point, that is,
when the data are “normally distributed”.

• The mean, otherwise known as the arithmetic mean
or average, is the sum of all the values, divided by
the number of values.

• The Median is used to represent the average when the data
are not symmetrical or normally distributed, that is, when the
data are “skewed”.

• The Median is the point which has half the values above,
and half below.

• Consider a set of six women patients aged 52, 55, 56, 58, 59 and 92
years. There are two “middle” ages, 56 and 58. The median is halfway
between these, i.e. 57 years.

• This gives a better idea of the mid-point of these skewed data than the
mean of 62 years.

• The median value of 57 years indicates that half the women are older
than 57 years, while half the women are younger than 57 years.
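The worked example above can be checked with Python's standard `statistics` module. This is a minimal sketch using the six ages from the slide.

```python
import statistics

ages = [52, 55, 56, 58, 59, 92]  # the six women patients from the example

mean_age = statistics.mean(ages)      # sum of values / number of values
median_age = statistics.median(ages)  # halfway between the two middle ages

print(mean_age)    # 62
print(median_age)  # 57.0
```

The single extreme age (92) pulls the mean up to 62, while the median stays at 57, which is why the median is preferred for skewed data.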
Standard Deviation
• Standard deviation (SD) is used for data which are “normally
distributed” to provide information on how much the data vary or
cluster around their mean.

• SD indicates how much a set of values is spread around the average.

• A range of one (1) SD above and below the mean (abbreviated to ± 1
SD) includes about 68.3% of the values, ±2 SD includes 95.4% of the data
and ±3 SD includes 99.7%.
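As a rough illustration, the sketch below computes the SD of a small sample and then reproduces the ±1, ±2 and ±3 SD coverage figures from the standard normal distribution. The measurements in `values` are hypothetical and serve only to show the call.

```python
import statistics
from statistics import NormalDist

# Hypothetical measurements (e.g. haemoglobin-like values), for illustration only.
values = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
print(f"SD = {sd:.2f}")

# Share of a normal distribution lying within k SDs of the mean.
std_normal = NormalDist()  # mean 0, SD 1
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within ±{k} SD: {share:.1%}")
```

The coverage figures hold only for normally distributed data; for skewed data the same ±SD ranges can contain quite different shares of the values.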
Summarization and presentation of data
• Histogram

• Bar chart

• Pie chart

• Line graphs

[Figure: example histogram with class intervals [1, 5], (5, 9], (9, 13], (13, 17], (17, 21], (21, 25]]
• A histogram is the appropriate graphical display for continuous
(numerical) data grouped into intervals.

• It is a graphical display of data using vertical bars or columns of
different heights. The bars have equal widths, with no space
between them. Histograms are a great way
to show results of continuous data such as weight, height, time etc

• The Y-axis (vertical axis) generally represents the frequency count,
while the X-axis (horizontal axis) represents the values of the variable,
grouped into intervals.
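To make the idea of class intervals concrete, here is a minimal sketch that counts how many observations fall in each interval and prints a text histogram. The interval edges follow the example figure ([1, 5], (5, 9], ... (21, 25]); the observations in `data` are hypothetical.

```python
from bisect import bisect_left

edges = [1, 5, 9, 13, 17, 21, 25]  # interval boundaries from the example figure
data = [2, 3, 5, 6, 6, 8, 9, 10, 12, 14, 15, 18, 22, 25]  # hypothetical values

counts = [0] * (len(edges) - 1)
for x in data:
    if x < edges[0] or x > edges[-1]:
        continue  # ignore values outside the plotted range
    # Intervals are closed on the right: find the first upper edge >= x.
    i = bisect_left(edges, x, lo=1) - 1
    counts[i] += 1

for i, count in enumerate(counts):
    left = "[" if i == 0 else "("
    print(f"{left}{edges[i]}, {edges[i + 1]}]  {'#' * count}  ({count})")
```

Each row corresponds to one bar of the histogram: the interval is the position on the X-axis and the count is the height on the Y-axis.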
Bar chart
[Figure: example bar chart showing Series 1 for Category 1 to Category 4]
• A bar chart is the appropriate graphical display for
categorical variables.

• It is made up of columns plotted on a graph. The columns
are separated and each represents an individual category or group.

• The height of the column, if vertical, represents the size of the
group or the frequency of each category.
Pie chart
[Figure: example pie chart with segments 1st Qtr, 2nd Qtr, 3rd Qtr, 4th Qtr]

• A pie chart is a way of summarizing a set of categorical data. It is a
circular chart divided into segments, and each segment shows the
relative size of each value. A pie chart uses percentages to
compare information.
Line graph
[Figure: example line graph showing Series 1, Series 2 and Series 3 for Category 1 to Category 4]
• This is one of the most common tools used to present data.
It shows related information by drawing a continuous line
between all the points on a grid.

• It compares two variables: one is plotted along the X-axis
and the other along the Y-axis. The Y-axis usually indicates quantity eg
dollars, liters, percentage, while the X-axis often measures
units of time.
Socio-demographic characteristics of respondents

Variables               Frequency (n = 400)   Percent (%)
18 – 24
25 – 34                 59                    14.8
35 – 44                 134                   33.5
45 – 54                 84                    21.0
55 – 64                 54                    13.5
65 and older            29                    7.3
                        40                    10.0
Female                  184                   46.0
                        216                   54.0
Marital status
Married                 113                   28.3
Divorced                179                   44.8
Widowed                 28                    7.0
Separated               31                    7.8
Co-habiting             13                    3.3
                        36                    9.0
Civil/Public servant
Farmer                  97                    24.3
Trading/business        77                    19.3
Student                 86                    21.5
Normal distribution
• The normal distribution is the classical symmetrical bell-shaped curve. The mean, median
and mode coincide at the central peak, and the area under the curve is used to determine
measures of spread and confidence intervals.

Skewed distributions
• A distribution that has its central location to the left and a tail off to the right is
said to be positively skewed, or skewed to the right.

• A distribution that has its central location to the right and a tail off to the left is
said to be negatively skewed, or skewed to the left.
Characteristics of the normal curve
• Bell-shaped curve

• Mean = median = mode

• Symmetrical

• Coefficient of skewness = (mean − median)/SD = 0

• Limits are called confidence limits and the range between the two is called the
confidence interval

• It is unimodal and is described by its mean and SD

• It has points of inflection on both sides of the mean

• Degree of kurtosis (peaked or flattened) = 3
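The skewness coefficient listed above, (mean − median)/SD, can be tried on the six ages from the median example. This is a minimal sketch; note that some textbooks multiply this quantity by 3 (Pearson's second skewness coefficient), which changes the size but not the sign.

```python
import statistics

ages = [52, 55, 56, 58, 59, 92]  # ages from the median example

mean_age = statistics.mean(ages)
median_age = statistics.median(ages)
sd_age = statistics.stdev(ages)  # sample standard deviation

# (mean - median) / SD: 0 for a symmetrical curve, > 0 when the tail is to
# the right, < 0 when the tail is to the left.
skewness = (mean_age - median_age) / sd_age
print(f"skewness coefficient = {skewness:.2f}")
```

A positive value is expected here because the age of 92 creates a tail to the right, so the mean exceeds the median.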

• Statistics which test confidence help a
researcher to make confidence statements about
estimated parameters.

• Two important statistics used to test confidence in the medical and
health sciences are confidence intervals and P values.
Confidence Intervals
• Confidence intervals (CI) are typically used when, instead of simply wanting the
mean value of a sample, we want a range that is likely to contain the true
population value.
• The CI gives the range in which the true value (say, the mean change in BP if we
treated an infinite number of patients) is likely to be.

• Notice that the Standard Deviation (SD) tells us about the variability or
spread of data values around the mean value in a sample. However, the
Confidence Interval (CI) tells us the range in which the true value (say, the
mean if the sample were infinitely large) is likely to be.
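The difference between the SD and the CI can be illustrated with a short sketch. It assumes a normal approximation (z ≈ 1.96 for a 95% interval), which is reasonable for large samples; for small samples a t distribution would normally be used instead. The blood pressure changes below are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical changes in systolic BP (mmHg) for 20 treated patients.
bp_changes = [-12, -8, -15, -5, -10, -7, -9, -14, -6, -11,
              -13, -4, -9, -10, -8, -12, -7, -11, -6, -9]

n = len(bp_changes)
sample_mean = statistics.mean(bp_changes)
sample_sd = statistics.stdev(bp_changes)   # spread of the individual values
standard_error = sample_sd / math.sqrt(n)  # uncertainty of the mean itself

z = NormalDist().inv_cdf(0.975)            # about 1.96 for a 95% interval
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print(f"mean = {sample_mean:.2f}, SD = {sample_sd:.2f}")
print(f"95% CI for the mean: {ci_low:.2f} to {ci_high:.2f}")
```

The SD describes how much individual patients vary, while the CI narrows as the sample grows because the standard error is divided by the square root of n.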
P values
• The P (probability) value is used to judge how compatible the observed data
are with a hypothesis. The hypothesis, known as the “null hypothesis”, is
usually that there is no difference between two treatments.

• The P value gives the probability of observing a difference at least as large
as the one found if the null hypothesis were true, that is, if chance alone
were at work.

• The lower the P value, the less likely it is that the difference happened
by chance and so the higher the statistical significance of the finding.
P Value
• In most cases, an event is considered “unlikely” to
occur if the chance of occurrence is less than 0.05 (or 1 in 20).

• Consequently, the null hypothesis is rejected if the
calculated P value is less than 0.05 (i.e., P < 0.05), which corresponds
to a 95% confidence level.
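One way to see what "happened by chance" means is an exact permutation test, sketched below. This technique is swapped in purely for illustration and is not one of the tests named in this lecture. The two groups of systolic BP readings are hypothetical. Under the null hypothesis the group labels are interchangeable, so the P value is the share of all possible regroupings that give a difference at least as large as the one observed.

```python
from itertools import combinations
from statistics import mean

# Hypothetical systolic BP readings (mmHg) under two treatments.
group_a = [120, 122, 125, 128, 130]
group_b = [135, 138, 140, 142, 145]

observed = abs(mean(group_a) - mean(group_b))
pooled = group_a + group_b
n_a = len(group_a)

extreme = 0
total = 0
# Try every way of splitting the pooled values into two groups of this size.
for idx in combinations(range(len(pooled)), n_a):
    chosen = set(idx)
    a = [pooled[i] for i in chosen]
    b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
    total += 1
    if abs(mean(a) - mean(b)) >= observed - 1e-12:
        extreme += 1

p_value = extreme / total
print(f"P = {p_value:.4f}")
print("reject null hypothesis" if p_value < 0.05 else "do not reject null hypothesis")
```

With these readings only 2 of the 252 possible splits are as extreme as the observed one, so P is well below 0.05 and the null hypothesis of no difference would be rejected.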
Test statistics



Parametric and non-parametric tests

Parametric test                  Non-parametric test
Student's independent t test     Mann-Whitney U test
Paired t test                    Wilcoxon signed-rank test
ANOVA                            Kruskal-Wallis test
Regression                       Chi-square test
Correlation (Pearson)            Spearman correlation
• Non-parametric statistics are used when the data are not normally
distributed, and so are not appropriate for “parametric” tests, or when
the sample size is small.
Parametric test statistics
• The t tests (also known as Student’s t tests) are typically used to compare just two samples. They
test the probability that the samples come from a population with the same mean value.

• Analysis of variance (ANOVA) is a group of statistical techniques used to compare the means of
two or more samples to see whether they come from the same population.

• The chi-squared test (a non-parametric test) is used when the researcher wants to test whether two
categorical variables are associated, by comparing the observed frequencies with the frequencies
expected if the variables were independent.

• Correlation analysis is used when the researcher wants to know if there is a linear relationship
between two variables that are not necessarily dependent on one another.
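As an illustration of comparing observed with expected frequencies, the sketch below computes the chi-squared statistic for a 2 × 2 table by hand. The counts (bednet use against malaria) are hypothetical. The critical value 3.841 is the standard 5% cut-off for 1 degree of freedom.

```python
# Hypothetical 2 x 2 table: rows = bednet use (yes, no),
# columns = malaria episode (yes, no).
observed = [[20, 80],
            [40, 60]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_squared = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Frequency expected if exposure and outcome were independent.
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_squared += (obs - expected) ** 2 / expected

critical_value = 3.841  # chi-squared, 1 degree of freedom, P = 0.05
print(f"chi-squared = {chi_squared:.2f}")
print("association at P < 0.05" if chi_squared > critical_value
      else "no evidence of association at P < 0.05")
```

In practice a statistical package would also report the exact P value; the hand calculation is shown only to make the observed-versus-expected comparison visible.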
Thank you for your attention
Quick assessment
• Characterize the following variables and classify them as
qualitative/categorical or quantitative/numerical. If qualitative
/categorical, can the variable be ordered? If quantitative /numerical, is
the variable discrete or continuous? In each case define the values of
the variable: (1) race, (2) date of birth, (3) systolic blood pressure, (4)
intelligence quotient, (5) Apgar score, (6) white blood count, (7) weight,
and (8) quality of medical care.

• What statistics could be used to summarize such a sample?

Quick assessment 2

• Give three examples of frequency distributions from areas
of your own research interest. Be sure to specify (1) what
constitutes the sample, (2) the variable of interest
