


Department of Epidemiology and
Biostatistics, SPH, KNUST
Learning outcomes
After finishing this course, you should be able to:
•Recognize and give examples of different types of data arising in clinical studies
•Demonstrate a solid understanding of confidence interval estimation and
hypothesis testing; interpret and explain a p-value
•Choose and apply appropriate statistical methods for analyzing data
•Use technology (SPSS/STATA) to perform descriptive and inferential data
analysis for one or two variables
•Interpret results of commonly used statistical analyses in written summaries
•Demonstrate statistical reasoning skills correctly and contextually
Course Outline for Biostatistics
1. Article Review example
2. Introduction
• Types of data distribution
• Measures of central tendency
• Measurement of the spread of data
3. Probability Distributions (Discrete & Continuous)
• Sampling distribution
• Normal distribution
• Standard normal distribution
4. Estimation of confidence interval (CI)
• P-value (level of significance)*
• Standard error
• CI for mean
• CI for proportion
5. Hypothesis testing
6. Parametric and Non-parametric statistics
7. Introduction to statistical software (STATA)
Article for Review

Introduction to Biostatistics
• It is the science which deals with the development and application of the
most appropriate methods for the:

Collection of data.
Analysis and interpretation of the results.
Making decisions on the basis of such analysis.

Application of statistical methods (summarizing data and drawing valid
inferences based on limited information) to biological systems, more
particularly, to humans and their health problems.

The tools of statistics are employed in many fields:

business, education, psychology, agriculture, economics, …

When the data analyzed are derived from the biological
sciences and medicine, we use the term biostatistics to
distinguish this particular application of statistical tools and
concepts.
Two areas of statistics:

Descriptive Statistics: collection, presentation, and description of
sample data.

Inferential Statistics: making decisions and drawing conclusions about
populations. The goal of inferential statistics is to draw conclusions
from a sample and generalize them to a population.
Role of statisticians
• To guide the design of an experiment or survey prior to data
collection

• To analyze data using proper statistical procedures and techniques

• To present and interpret the results to researchers and other
decision makers
Sources of Data:

Records
Surveys (Comprehensive, Sample)
Experiments
Sources of Data

1- Routinely kept records:

For example:
- Hospital medical records contain immense amounts of information on patients.

2- External sources:
The data needed to answer a question may already exist in the form of published
reports, commercially available data banks (e.g. GenBank), or the research literature, i.e.
someone else has already asked the same question.

3- Surveys:
The source may be a survey when the data needed are answers to specific questions.

4- Experiments:
Frequently the data needed to answer a question are available only as the result of an
experiment.
Methods of presentation of data
1. Numerical presentation

2. Graphical presentation

3. Mathematical presentation
1- Numerical presentation
Tabular presentation (simple – complex)

Simple frequency distribution Table

Name of variable
(Units of variable) Frequency %

- Categories

Table: Distribution of 50 patients at the surgical department of KATH
in May 2010 according to their ABO blood groups

Blood group    Frequency    %
A              12           24
B              18           36
AB             5            10
O              15           30
Total          50           100
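
The percentage column of a simple frequency table can be recomputed from the frequencies alone. The Python sketch below does this for the blood group table; the variable names are illustrative and are not part of the original slides.

```python
# Frequencies copied from the ABO blood group table.
counts = {"A": 12, "B": 18, "AB": 5, "O": 15}

total = sum(counts.values())  # number of patients in the table

# Relative frequency (%) of each category = frequency / total * 100
percentages = {group: 100 * n / total for group, n in counts.items()}

for group, n in counts.items():
    print(f"{group:<8}{n:>4}{percentages[group]:>8.1f}")
print(f"{'Total':<8}{total:>4}{sum(percentages.values()):>8.1f}")
```

Checking that the percentages add up to 100 is a quick way to catch a miscounted category.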
Complex frequency distribution Table

Table: Distribution of 20 lung cancer patients at the chest department of KATH and 40 controls in May 2***

              Lung cancer
Smoking       Cases           Control         Total
              No.    %        No.    %        No.    %
Smoker        15     75       8      20       23     38.33
Non-smoker    5      25       32     80       37     61.67

Total         20     100      40     100      60     100
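
The percentages in a cross-tabulation like this one depend on which total is used as the denominator. The Python sketch below recomputes the column percentages (within cases and within controls) and the row totals from the counts. It assumes the second row of the table is the non-smoker group; all variable names are illustrative.

```python
# Counts from the smoking / lung cancer table.
# Assumption: the second row is the non-smoker group.
counts = {
    "Smoker": {"Cases": 15, "Control": 8},
    "Non-smoker": {"Cases": 5, "Control": 32},
}
groups = ["Cases", "Control"]

# Column totals: number of cases and number of controls.
column_totals = {g: sum(row[g] for row in counts.values()) for g in groups}
grand_total = sum(column_totals.values())

# Column percentages: share of each smoking status within each group.
column_pct = {
    status: {g: 100 * row[g] / column_totals[g] for g in groups}
    for status, row in counts.items()
}

# Row totals and their share of all subjects.
row_totals = {status: sum(row.values()) for status, row in counts.items()}
row_pct = {s: round(100 * n / grand_total, 2) for s, n in row_totals.items()}

for status in counts:
    print(status, column_pct[status], row_totals[status], row_pct[status])
```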

2- Graphical presentation
• Line graph
• Frequency polygon
• Frequency curve
• Histogram
• Bar graph
• Scatter plot
• Pie chart
Line Graph

Figure (1): Maternal mortality rate of (country), 1960-2000

Frequency polygon

              Sex
Age           Males        Females      Mid-point of interval
20 -          3 (12%)      2 (10%)      (20+30) / 2 = 25
30 -          9 (36%)      6 (30%)      (30+40) / 2 = 35
40 -          7 (28%)      5 (25%)      (40+50) / 2 = 45
50 -          4 (16%)      3 (15%)      (50+60) / 2 = 55
60 - 70       2 (8%)       4 (20%)      (60+70) / 2 = 65
Total         25 (100%)    20 (100%)
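
A frequency polygon plots each class percentage against the mid-point of its interval. The Python sketch below recomputes both from the counts in the table (3, 9, 7, 4, 2 males and 2, 6, 5, 3, 4 females); the helper name `to_percent` is an illustrative choice.

```python
# Class intervals (lower, upper) and counts from the age-by-sex table.
intervals = [(20, 30), (30, 40), (40, 50), (50, 60), (60, 70)]
males = [3, 9, 7, 4, 2]
females = [2, 6, 5, 3, 4]

# Mid-point of each interval = (lower limit + upper limit) / 2
midpoints = [(lower + upper) / 2 for lower, upper in intervals]

def to_percent(counts):
    """Convert class counts to percentages of their own total."""
    total = sum(counts)
    return [100 * c / total for c in counts]

male_pct = to_percent(males)
female_pct = to_percent(females)

# One (mid-point, % males, % females) row per class: the points to plot.
for mid, m, f in zip(midpoints, male_pct, female_pct):
    print(f"{mid:>5.0f}{m:>8.0f}%{f:>8.0f}%")
```

Recomputing from the counts gives 7 / 25 = 28% for males aged 40 to 50, so each column of percentages sums to 100.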
Frequency polygon

Age       Males    Females    Mid-point
20-       (12%)    (10%)      25
30-       (36%)    (30%)      35
40-       (28%)    (25%)      45
50-       (16%)    (15%)      55
60-70     (8%)     (20%)      65

Figure (2): Distribution of 45 patients at (place), in (time) by age and sex
Frequency curve

Figure (2): Distribution of 100 cholera patients at (place), in (time) by age
Bar chart
Pie chart
Doughnut chart
3-Mathematical presentation
Summary statistics
• Measures of location
1- Measures of central tendency
2- Measures of non central locations
(Quartiles, Percentiles)
• Measures of dispersion
Introduction to Basic Terms
• Population: A collection, or set, of individuals or objects or events whose properties are to be
analyzed.

• Sample: A subset of the population.

• Variable: A characteristic about each individual element of a population or sample.

• Data (singular): The value of the variable associated with one element of a population or sample.
This value may be a number, a word, or a symbol.

• Data (plural): The set of values collected for the variable from each of the elements belonging to
the sample.

• Parameter: A numerical value summarizing all the data of an entire population.

• Statistic: A numerical value summarizing the sample data.

Example: A college dean is interested in learning about the average age of
faculty. Identify the basic terms in this situation.

The population is the age of all faculty members at the college.

A sample is any subset of that population. For example, we might select 10
faculty members and determine their age.
The variable is the “age” of each faculty member.
One data would be the age of a specific faculty member.
The data would be the set of values in the sample.
The experiment would be the method used to select the ages forming the
sample and determining the actual age of each faculty member in the sample.
The parameter of interest is the “average” age of all faculty at the college.
The statistic is the “average” age for all faculty in the sample.
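
The difference between a parameter and a statistic can be made concrete with a small simulation. The Python sketch below is only an illustration: the 40 faculty ages are hypothetical values generated at random, not data from the slides, and the seed and age limits are arbitrary choices.

```python
import random

# Hypothetical ages of all 40 faculty members (the population).
rng = random.Random(2024)
population = [rng.randint(28, 68) for _ in range(40)]

# Parameter: the mean age of the whole population.
parameter = sum(population) / len(population)

# Sample: 10 faculty members drawn at random without replacement.
sample = rng.sample(population, 10)

# Statistic: the mean age in the sample, used to estimate the parameter.
statistic = sum(sample) / len(sample)

print(f"parameter (population mean) = {parameter:.1f}")
print(f"statistic (sample mean)     = {statistic:.1f}")
```

In practice the parameter is unknown because the whole population is not measured; the statistic is what can actually be computed.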

Two kinds of variables:

Qualitative, or Attribute, or Categorical Variable: A variable that categorizes or
describes an element of a population.
Note: Arithmetic operations, such as addition and averaging, are not
meaningful for data resulting from a qualitative variable.

Quantitative, or Numerical Variable: A variable that quantifies an element of a
population.
Note: Arithmetic operations, such as addition and averaging, are meaningful for
data resulting from a quantitative variable.
Example: Identify each of the following examples as attribute (qualitative) or
numerical (quantitative) variables.

1. The residence hall for each student in a statistics class. (Attribute)

2. The amount of gasoline pumped by the next 10 customers at the local
Unimart. (Numerical)
3. The amount of radon in the basement of each of 25 homes in a new
development. (Numerical)
4. The color of the baseball cap worn by each of 20 students. (Attribute)
5. The length of time to complete a mathematics homework assignment.(Numerical)
6. The state in which each truck is registered when stopped and inspected at
a weigh station. (Attribute)
Qualitative and Quantitative variables may be further subdivided:

Qualitative Binary
Nominal Variable: A qualitative variable that categorizes (or describes, or
names) an element of a population.

Ordinal Variable: A qualitative variable that incorporates an ordered position,
or ranking.

Discrete Variable: A quantitative variable that can assume a countable number
of values. Intuitively, a discrete variable can assume values corresponding to
isolated points along a line interval. That is, there is a gap between any two
values.

Continuous Variable: A quantitative variable that can assume an uncountable
number of values. Intuitively, a continuous variable can assume any value
along a line interval, including every possible value between any two values.
Example: Identify each of the following as examples of (1) nominal, (2)
ordinal, (3) discrete, or (4) continuous variables:

1. The length of time until a pain reliever begins to work.

2. The number of colors used in a statistics textbook.
3. The brand of refrigerator in a pharmacy shop.
4. The overall satisfaction rating of clients that visited a chemical shop.
5. The number of files on a computer’s hard disk.
6. The number of staples in a stapler.
Descriptive Statistics

Measures of Central Tendency

key words:

Descriptive Statistic, measure of central tendency, statistic,

parameter, mean (μ), median, mode.
Descriptive statistic

• A summary statistic that quantitatively describes or summarizes features
of a collection of information or data.

• Helps determine whether the sample is normally distributed (bell curve).

• Is displayed as tables, charts, percentages, frequency distributions and
measures of central tendency.

• Includes information about a sample based on measures of variability,
measures of central tendency, skewness, kurtosis, etc.
Statistic and Parameter
• A Statistic:
It is a descriptive measure computed from the data of a sample.

• A Parameter:
It is a descriptive measure computed from the data of a population.
Since it is difficult to measure a parameter from the
population, a sample of size n is drawn, whose values are
$x_1, x_2, \ldots, x_n$. From these data, we measure the statistic.
Measures of Central Tendency

A measure of central tendency is a measure which indicates where the
middle of the data is.
The three most commonly used measures of central tendency are:

The Mean (average), the Median (midpoint), and the Mode (most frequently occurring
value).
The Mean:
It is the average of the data.
The Population Mean:

$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$, which is usually unknown, so we use the
sample mean to estimate or approximate it.

The Sample Mean:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

Here is a random sample of size 10 of ages, where
$x_1 = 42$, $x_2 = 28$, $x_3 = 28$, $x_4 = 61$, $x_5 = 31$,
$x_6 = 23$, $x_7 = 50$, $x_8 = 34$, $x_9 = 32$, $x_{10} = 37$.

$\bar{x} = (42 + 28 + \dots + 37) / 10 = 36.6$
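
The sample mean of the ten ages can be checked with a few lines of Python. This is a minimal sketch; the variable names are illustrative.

```python
# The random sample of 10 ages from the slide.
ages = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]

n = len(ages)                # sample size
sample_mean = sum(ages) / n  # sum of the values divided by n

print(f"n = {n}, sum = {sum(ages)}, mean = {sample_mean}")
```

The sum of the ten values is 366, so the mean is 366 / 10 = 36.6.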

Properties of the Mean:
• Uniqueness: For a given set of data there is one and only one
mean.
• Simplicity: It is easy to understand and to compute.
• Affected by extreme values: Since all values enter into the
computation, extreme values pull the mean toward them.

Example: Assume the values are 115, 110, 119, 117, 121 and 126.
The mean = 118.
But assume that the values are 75, 75, 80, 80 and 280. The mean =
118, a value that is not representative of the set of data as a
whole.
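
The effect of the extreme value 280 is easy to see by computing the mean of both data sets side by side. The Python sketch below uses the standard library `statistics` module and, for contrast, also reports the median (the middle value of the ordered data); the names `set_a` and `set_b` are illustrative.

```python
from statistics import mean, median

set_a = [115, 110, 119, 117, 121, 126]  # values close together
set_b = [75, 75, 80, 80, 280]           # one extreme value (280)

for name, values in (("set_a", set_a), ("set_b", set_b)):
    print(f"{name}: mean = {mean(values)}, median = {median(values)}")
```

Both sets have a mean of 118, but in the second set four of the five values lie at or below 80, so the mean describes none of them well.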
The Median:

When the data are ordered, it is the observation that divides the set of observations into
two equal parts such that half of the data are before it and the other half are after it.
* If n is odd, the median is the middle observation, that is, the (n+1)/2 th
ordered observation.
When n = 11, the median is the 6th observation.
* If n is even, there are two middle observations, the (n/2) th and the (n/2 + 1) th. The median is the mean of
these two middle observations, which corresponds to position (n+1)/2 in the ordered data.
When n = 12, the median is the "6.5th" observation, that is, the value
halfway between the 6th and 7th ordered observations.

Properties of the Median:

• Uniqueness: For a given set of data there is one and only one median.
• Simplicity: It is easy to calculate.
• It is not affected by extreme values as is the mean.
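
The odd and even cases of the median rule can be written as a short Python function. This is a sketch of the rule described above, not a replacement for a statistics package; the function name `median_of` is an illustrative choice, and the ages are the sample of size 10 used for the mean.

```python
def median_of(values):
    """Median by the ordered-position rule for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the (n+1)/2 th ordered observation (index n // 2 from 0).
        return ordered[middle]
    # Even n: mean of the (n/2) th and (n/2 + 1) th ordered observations.
    return (ordered[middle - 1] + ordered[middle]) / 2

ages = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]
print(sorted(ages))
print(median_of(ages))
```

For the ten ages, the 5th and 6th ordered values are 32 and 34, so the median is 33.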
The Mode:
It is the value which occurs most frequently.
If all values are different there is no mode.
Sometimes, there is more than one mode.

For the same random sample of ages, the value 28 occurs twice and every
other value occurs once, so 28 is the mode.

Properties of the Mode:

• Sometimes, it is not unique.
• It may be used for describing qualitative data.
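
The three situations described for the mode (one mode, several modes, no mode) can be covered by a small Python function. This is a minimal sketch that follows the slide's convention of reporting no mode when all values differ; the function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return all most frequent values, or [] if every value occurs once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # all values are different: no mode
    return sorted(v for v, c in counts.items() if c == highest)

ages = [42, 28, 28, 61, 31, 23, 50, 34, 32, 37]
print(modes(ages))                  # the sample of ages
print(modes([1, 1, 2, 2, 3]))       # two modes
print(modes(["A", "B", "B", "O"]))  # qualitative data
```

Because it only counts occurrences, the same function works for qualitative data such as blood groups.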
Measures of Dispersion
key words:

measure of dispersion, range, variance, coefficient of variation
Measures of Dispersion
A measure of dispersion conveys information regarding the amount of
variability present in a set of data.
• Note:
1. If all the values are the same
→ There is no dispersion.
2. If the values are different
→ There is dispersion:
a) If the values are close to each other
→ The amount of dispersion is small.
b) If the values are widely scattered
→ The dispersion is greater.
• Measures of dispersion are:
1. Range (R).
2. Variance.
3. Standard deviation.
4. Coefficient of Variation (C.V).
1. The Range (R)
• Range = Largest value - Smallest value = $x_{\max} - x_{\min}$

• The range depends on only two values (the largest and the smallest).
• Data:
• 43, 66, 61, 64, 65, 38, 59, 57, 57, 50.
• Find the range.
• Range = 66 - 38 = 28
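
The range calculation can be reproduced in Python as a quick check. The variable names are illustrative.

```python
data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]

largest = max(data)
smallest = min(data)
data_range = largest - smallest  # only these two values matter

print(f"range = {largest} - {smallest} = {data_range}")
```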
The Variance (how far the values of x are spread out)

• It measures dispersion relative to the scatter of the values about the mean.
a) Sample Variance ($S^2$):
• $S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean
b) Population Variance ($\sigma^2$):
• $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where µ is the population mean

The Standard Deviation:

• It is the square root of the variance.
a) Sample Standard Deviation = $S = \sqrt{S^2}$
b) Population Standard Deviation = $\sigma = \sqrt{\sigma^2}$
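
As a worked illustration, the Python sketch below applies both variance formulas to the ten values used in the range example. The sample formula divides the sum of squared deviations by n - 1 and the population formula divides it by N. Treating the same ten values as a whole population is done here only to show the difference between the two divisors; it is an assumption for illustration, not something stated in the slides.

```python
import math

data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations about the mean.
sum_sq = sum((x - mean) ** 2 for x in data)

sample_variance = sum_sq / (n - 1)  # divisor n - 1
population_variance = sum_sq / n    # divisor N (data treated as a population)

sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(population_variance)

print(f"mean = {mean}, sum of squared deviations = {sum_sq}")
print(f"sample variance = {sample_variance}, sample SD = {sample_sd:.2f}")
print(f"population variance = {population_variance}, population SD = {population_sd:.2f}")
```

Here the mean is 56 and the squared deviations sum to 810, giving a sample variance of 810 / 9 = 90 (SD about 9.49) and a population variance of 810 / 10 = 81 (SD 9).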
Normal and Skewed distributions

• If mean = median = mode, the sample shows a symmetric distribution, as in a perfectly normal distribution

• If mean < median < mode, the sample shows a negatively skewed distribution (longer left tail)

• If mean > median > mode, the sample shows a positively skewed distribution (longer right tail)
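
These comparisons are a rule of thumb and can be written as a small Python function. The sketch below applies it to the sample of ten ages used earlier, whose mean is 36.6; the median (33) and mode (28) were computed from the same sample. The function name and the returned labels are illustrative.

```python
def skew_direction(mean, median, mode):
    """Rule-of-thumb skewness direction from the three averages."""
    if mean == median == mode:
        return "symmetric"
    if mean < median < mode:
        return "negatively skewed"
    if mean > median > mode:
        return "positively skewed"
    return "no clear pattern"

# Sample of ages: mean 36.6, median 33, mode 28.
print(skew_direction(36.6, 33, 28))
```

For the ages, mean > median > mode, which suggests a positively skewed sample, consistent with the single large value 61.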
Standard Normal Distribution

Mean +/- 1 SD → encompasses about 68% of observations

Mean +/- 2 SD → encompasses about 95% of observations
Mean +/- 3 SD → encompasses about 99.7% of observations
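
These percentages follow from the normal distribution itself: the share of observations within k standard deviations of the mean equals erf(k / √2). The Python sketch below evaluates this with the standard library function `math.erf`; the helper name `share_within` is illustrative.

```python
import math

def share_within(k):
    """Share of a normal distribution within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"mean +/- {k} SD: {100 * share_within(k):.1f}%")
```

The computed shares are about 68.3%, 95.4% and 99.7%, which the slide rounds to 68%, 95% and 99.7%.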
• Kurtosis is a measure of whether the data are heavy-tailed or light-tailed
relative to a normal distribution.

• Data sets with high kurtosis tend to have heavy tails, or outliers. Data sets
with low kurtosis tend to have light tails, or lack of outliers.
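
One common way to quantify this is the moment-based excess kurtosis, $m_4 / m_2^2 - 3$, where $m_2$ and $m_4$ are the second and fourth central moments; it is 0 for a normal distribution. The Python sketch below uses this definition (statistical packages may apply different small-sample corrections), and the two small data sets are hypothetical values chosen only to show the effect of an outlier.

```python
def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2**2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

no_outlier = [1, 2, 3, 4, 5]     # hypothetical, evenly spread values
with_outlier = [1, 2, 3, 4, 50]  # hypothetical, one extreme value

print(f"without outlier: {excess_kurtosis(no_outlier):.2f}")
print(f"with outlier:    {excess_kurtosis(with_outlier):.2f}")
```

The data set containing the outlier has the larger kurtosis, in line with the description of heavy tails.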
4. Coefficient of Variation (C.V)
• It is a measure used to compare the dispersion in two sets of data,
and it is independent of the unit of measurement.

• $C.V = \frac{S}{\bar{x}} \times 100\%$, where S is the sample standard deviation
and $\bar{x}$ is the sample mean.
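
The coefficient of variation is the sample standard deviation expressed as a percentage of the sample mean. The Python sketch below computes it for the ten values from the range example and then for the same values multiplied by 1000, a hypothetical change of unit, to show that the result does not depend on the unit of measurement. The function name is illustrative.

```python
import math

def coefficient_of_variation(values):
    """C.V = (sample standard deviation / sample mean) * 100."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return 100 * s / mean

data = [43, 66, 61, 64, 65, 38, 59, 57, 57, 50]
cv = coefficient_of_variation(data)

# Hypothetical unit change: every value multiplied by 1000.
rescaled = [1000 * x for x in data]
cv_rescaled = coefficient_of_variation(rescaled)

print(f"C.V = {cv:.2f}%")
print(f"C.V after rescaling = {cv_rescaled:.2f}%")
```

Both calls give about 16.94%, because the standard deviation and the mean are scaled by the same factor.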
