Foundations of Biostatistics

Module 1: Introduction to statistics and presenting data

Timothy Dobbins
Hybrid learning
This class includes students in the lecture theatre and online

So all students can participate, please use the classroom microphones or your laptop microphone so everyone can hear all discussions

If you are an online student, please turn on your camera when asking
questions, presenting or joining the discussion.

This session will be recorded

Aim of this course
Introduce statistical techniques relevant to population health (while minimising
maths!) so you can:
• Identify the most appropriate statistical technique based on the research question and the
type of data
• Use Stata to conduct the statistical analysis
• Interpret and report results to a non-statistical audience
Learning objectives
By the end of this module, you will be able to:
1. Understand the difference between descriptive and inferential statistics
2. Distinguish between different types of variables
3. Present and report data numerically
4. Present and interpret graphical summaries of data using a variety of graphs
5. Compute summary statistics to describe the centre and spread of data
Pre-course survey
What is statistics?
the science of collecting, summarising, presenting and interpreting data ... to
estimate the magnitude of associations and test hypotheses (Kirkwood & Sterne)
the discipline concerned with the treatment of numerical data derived from
groups of individuals (Armitage, Berry and Mathews)
the act of summarising quantitative data to obtain usable information
Applied statistics
• Biostatistics or medical statistics: the application of statistics to health and medical data
Statistical literacy
1. Data awareness
• relevance and appropriateness; data source and quality; fit for purpose
2. The ability to understand statistical concepts
3. The ability to analyse, interpret and evaluate statistical information
4. The ability to communicate statistical information and understandings
Types of variables
Observations and variables
Kirkwood and Sterne:
• The raw data of an investigation consist of observations made on individuals
• Any aspect of an individual that is measured ... is called a variable
Types of variables
Numeric (or quantitative)
• continuous (e.g. weight, haemoglobin)
• discrete (e.g. number of children)
Categorical (or qualitative)
• binary (e.g. previous heart disease: yes/no)
• categorical (e.g. status: alive, died from cancer, died from other causes)
• ordered categorical (ordinal) (e.g. cancer stage, grade)
Observations and variables

The variable type will help determine the appropriate analysis

What is your sex?
o Male
o Female
Sex vs gender
Sex
• understood in relation to sex characteristics
• sex recorded at birth refers to what was determined by sex characteristics observed at birth or infancy
Gender
• social and cultural differences in identity, expression and experience
Suggested questions
What was your sex recorded at birth?
o Male
o Female
o Another term (please specify)
Suggested questions
How do you describe your gender?
Gender refers to current gender, which may be different to sex recorded at birth
and may be different to what is indicated on legal documents.

Please [tick/mark/select] one box:

o Man or male
o Woman or female
o Non-binary
o I use a different term (please specify)
o Prefer not to answer
Presenting numerical data
Research as storytelling
Tell the reader a simple story
Keep the reader engaged
Constructing tables
One-way frequency tables
Summarises a single characteristic (e.g. age)
Frequency: the number of individuals with a certain characteristic
Relative frequency: the frequency expressed as a percentage or a proportion of
the total frequency
One-way frequency tables

Table 1.2: Frequency distribution of ages of students visiting a gym

Age     Frequency   Relative frequency (%)
17          1                 3
18          5                17
19          5                17
20          7                23
21          5                17
22          2                 7
23          4                13
24          1                 3
Total      30               100
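
As an illustration of how frequencies and relative frequencies are obtained, here is a minimal sketch in Python (the course itself uses Stata; Python is used here only for illustration). The list `ages` is rebuilt from the frequencies in Table 1.2, and the variable names are assumptions of this sketch.

```python
from collections import Counter

# Ages of the 30 students, rebuilt from the frequencies in Table 1.2
ages = [17] + [18] * 5 + [19] * 5 + [20] * 7 + [21] * 5 + [22] * 2 + [23] * 4 + [24]

frequency = Counter(ages)          # number of students at each age
total = sum(frequency.values())    # total frequency

# Relative frequency: each frequency as a percentage of the total,
# rounded to whole percentages as in the table
relative_frequency = {
    age: round(100 * count / total) for age, count in sorted(frequency.items())
}

for age, pct in relative_frequency.items():
    print(f"{age}  {frequency[age]:>2}  {pct:>3}%")
```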
One-way frequency tables
Cumulative frequency: the number of individuals in a category or below
Cumulative relative frequency: the cumulative frequency expressed as a
percentage or a proportion of the total frequency
One-way frequency tables

Table 1.3: Frequency distribution of ages of students visiting a gym

Age     Frequency   Relative        Cumulative   Cumulative relative
                    frequency (%)   frequency    frequency (%)
17          1            3              1               3
18          5           17              6              20
19          5           17             11              37
20          7           23             18              60
21          5           17             23              77
22          2            7             25              83
23          4           13             29              97
24          1            3             30             100
Total      30          100
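
The cumulative columns are running totals of the frequency column. A minimal sketch in Python (used here only for illustration, not the course's Stata workflow), with the frequencies taken from Table 1.3 and variable names that are assumptions of this sketch:

```python
from itertools import accumulate

ages = [17, 18, 19, 20, 21, 22, 23, 24]
frequency = [1, 5, 5, 7, 5, 2, 4, 1]    # frequencies from Table 1.3

total = sum(frequency)

# Cumulative frequency: number of students in a category or below
cumulative = list(accumulate(frequency))

# Cumulative relative frequency: cumulative frequency as a percentage of the total
cumulative_pct = [round(100 * c / total) for c in cumulative]

for age, c, pct in zip(ages, cumulative, cumulative_pct):
    print(f"{age}  {c:>2}  {pct:>3}%")
```

For example, 18 of the 30 students (60%) are aged 20 or younger.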
Two-way frequency tables
Summarises two characteristics

Table 1.5: BMI status of students visiting a gym by sex

                          BMI status
Sex      Normal             Overweight              Obese              Total
         (BMI < 25 kg/m²)   (25 ≤ BMI < 30 kg/m²)   (BMI ≥ 30 kg/m²)
Male           1                    9                      2             12
Female        11                    6                      0             17
Total         12                   15                      2             29
Note: 1 value of BMI was missing
Two-way frequency tables:
column percentages

Calculates relative frequencies within columns

Table 1.5: BMI status of students visiting a gym by sex

                          BMI status
Sex      Normal             Overweight              Obese              Total
         (BMI < 25 kg/m²)   (25 ≤ BMI < 30 kg/m²)   (BMI ≥ 30 kg/m²)
           n      %           n      %                n      %          n      %
Male       1      8           9     60                2    100         12     41
Female    11     92           6     40                0      0         17     59
Total     12    100          15    100                2    100         29    100
Note: 1 value of BMI was missing
Two-way frequency tables:
row percentages

Calculates relative frequencies within rows

Table 1.5: BMI status of students visiting a gym by sex

                             BMI status
Sex           Normal             Overweight              Obese              Total
              (BMI < 25 kg/m²)   (25 ≤ BMI < 30 kg/m²)   (BMI ≥ 30 kg/m²)
Male     n          1                    9                      2             12
         %          8                   75                     17            100
Female   n         11                    6                      0             17
         %         65                   35                      0            100
Total    n         12                   15                      2             29
         %         41                   52                      7            100
Note: 1 value of BMI was missing
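
Row and column percentages use the same counts but different denominators. A minimal sketch in Python (used only for illustration; the course uses Stata), with the counts taken from Table 1.5 and all names being assumptions of this sketch:

```python
# Counts from Table 1.5 (the student with missing BMI is excluded)
counts = {
    "Male":   {"Normal": 1,  "Overweight": 9, "Obese": 2},
    "Female": {"Normal": 11, "Overweight": 6, "Obese": 0},
}
categories = ["Normal", "Overweight", "Obese"]

row_totals = {sex: sum(row.values()) for sex, row in counts.items()}
col_totals = {c: sum(counts[sex][c] for sex in counts) for c in categories}

# Row percentages: each cell as a percentage of its row (sex) total
row_pct = {
    sex: {c: round(100 * counts[sex][c] / row_totals[sex]) for c in categories}
    for sex in counts
}

# Column percentages: each cell as a percentage of its column (BMI status) total
col_pct = {
    sex: {c: round(100 * counts[sex][c] / col_totals[c]) for c in categories}
    for sex in counts
}

print("Row %:   ", row_pct["Male"])
print("Column %:", col_pct["Male"])
```

Row percentages answer "what percentage of males are overweight?" (75%), while column percentages answer "what percentage of overweight students are male?" (60%).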
Multi-way frequency tables

Australian Institute of Health and Welfare 2015. The health of Australia’s prisoners 2015. Cat. no. PHE 207. Canberra: AIHW
Table presentation guidelines
1. Each table (and figure) should be self-explanatory, i.e. the reader should be
able to understand it without reference to the text in the body of the report
2. Units of the variables should be given and missing records should be noted
3. A table should be visually uncluttered

4. The rows and columns of each table should be arranged in a natural order to
help interpretation
5. Tables should have a consistent appearance throughout the report
6. Consider if there is a particular table orientation that makes a table easier to read

From Woodward. Epidemiology: Study Design and Data Analysis, Third Edition; 2013
Presenting data graphically
Bar chart
Simple way to plot frequencies
Bars represent frequency
Horizontal (x) axis is categorical

Source: Australian Institute of Health and Welfare 2019. Cancer in Australia 2019. Cancer series no.119. Cat. no. CAN 123. Canberra: AIHW.
Clustered bar chart

Stacked bar chart

Source: Australian Institute of Health and Welfare 2017. Australia’s welfare 2017. Australia’s welfare series no. 13. AUS 214. Canberra: AIHW.
Stacked bar chart

Source: State of Australian University Research 2015–16: Volume 1 ERA National Report (ERA 2015)
Line chart

Effective way to show changes over time

Pie charts

• Area of pie piece represents relative frequency
• Useful for broad
• Can be difficult to compare within and between pies

Source: Australian Institute of Health and Welfare 2016. Australia’s health 2016. Australia’s health series no. 15. Cat. no. AUS 199. Canberra: AIHW.
Pie charts

Pie charts
Many authors categorically reject pie charts ... Others
defend the use of pie charts in some applications. My
own opinion is that none of these visualizations is
consistently superior over any other. Depending on
the features of the dataset and the specific story you
want to tell, you may want to favor one or the other
Pie charts
Pie charts are evil
I have a well‐documented disdain for pie charts. In short, they
are evil. To understand how I arrived at this conclusion, let’s
look at an example.
Graphical presentation guidelines

1. Figures should be self-explanatory and have a consistent appearance throughout the report
2. A title should give complete information
3. Axes should be labelled appropriately
4. Units of the variables should be given in the labelling of the axes. Use
footnotes to indicate any calculation or derivation of variables and to indicate
missing values

5. If the Y-axis has a natural origin, it should be included, or emphasised if it is
not included.
6. If graphs are being compared, the Y-axis should be the same across the
graphs to enable fair comparison
7. Columns of bar charts should be separated by a space
8. Three dimensional graphs should be avoided unless the third dimension
adds additional information

From Woodward. Epidemiology: Study Design and Data Analysis, Third Edition; 2013
Computing summary statistics
What do we want to know?
What is the average value?
What is the spread, or variability?
Example data
Weight (in kg) of 30 people:

60.0 62.5 62.5 62.5 65.0
65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0
70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0
75.0 75.0 77.5 77.5 80.0
Measures of central tendency
1: Mean

$\bar{x} = \frac{\sum x}{n}$

Weights example:
$\sum x = 2100$
$\bar{x} = 2100 / 30 = 70.0$ kg
• uses all data points
• nice mathematical properties
• affected by unusually small or large observations
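
The mean of the weights example can be checked with a few lines of Python (used here only for illustration; the course uses Stata). The list `weights` reproduces the 30 example values, and the variable names are assumptions of this sketch.

```python
# The 30 example weights (kg), written as value * count
weights = (
    [60.0] + [62.5] * 3 + [65.0] * 3 + [67.5] * 5 + [70.0] * 6
    + [72.5] * 4 + [75.0] * 5 + [77.5] * 2 + [80.0]
)

n = len(weights)       # number of observations
total = sum(weights)   # sum of all observations
mean = total / n       # mean = sum of observations / n

print(f"n = {n}, sum = {total}, mean = {mean:.1f} kg")
```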
Measures of central tendency
2: Median

Order data from smallest to largest

Median: the middle observation (if n is odd)
or the mean of the two middle observations (if n is even)
Measures of central tendency
2: Median

Example (n=30):
60.0 62.5 62.5 62.5 65.0
65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0
70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0
75.0 75.0 77.5 77.5 80.0

Median = (70.0 + 70.0) / 2 = 70.0 kg
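
The median rule (middle observation for odd n, mean of the two middle observations for even n) can be sketched in Python as follows. This is an illustration only; the function name `median` is an assumption of this sketch, and the result is compared with the standard library's `statistics.median`.

```python
import statistics

def median(values):
    """Middle observation (odd n) or mean of the two middle observations (even n)."""
    ordered = sorted(values)          # order data from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

weights = (
    [60.0] + [62.5] * 3 + [65.0] * 3 + [67.5] * 5 + [70.0] * 6
    + [72.5] * 4 + [75.0] * 5 + [77.5] * 2 + [80.0]
)

# n = 30 is even, so the median is the mean of the 15th and 16th values
print(median(weights), statistics.median(weights))
```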
Measures of central tendency
2: Median

• not unduly affected by unusually small or large observations
• difficult mathematical properties
• ‘wastes’ information
Measures of central tendency
Mean vs Median

Mean: affected by unusually small or large observations

Median: not affected by unusually small or large observations
Example (n=30):
60.0 62.5 62.5 62.5 65.0 65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0 70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0 75.0 75.0 77.5 77.5 80.0
Mean: 70.0 kg
Median: 70.0 kg
Measures of central tendency
Mean vs Median

Mean: affected by unusually small or large observations

Median: not affected by unusually small or large observations
Example (n=31):
60.0 62.5 62.5 62.5 65.0 65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0 70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0 75.0 75.0 77.5 77.5 80.0
Mean: 72.4 kg
Median: 70.0 kg
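
The effect of one unusually large observation can be sketched in Python (illustration only). The extra value of 145.0 kg is a hypothetical assumption of this sketch, chosen because it is consistent with the reported mean of 72.4 kg; the 31st observation is not listed in the data above.

```python
import statistics

weights = (
    [60.0] + [62.5] * 3 + [65.0] * 3 + [67.5] * 5 + [70.0] * 6
    + [72.5] * 4 + [75.0] * 5 + [77.5] * 2 + [80.0]
)

# Hypothetical 31st observation (assumed value, for illustration only)
with_outlier = weights + [145.0]

for label, data in [("n=30", weights), ("n=31", with_outlier)]:
    mean = statistics.mean(data)
    med = statistics.median(data)
    print(f"{label}: mean = {mean:.1f} kg, median = {med:.1f} kg")
```

The mean moves from 70.0 to about 72.4 kg, while the median stays at 70.0 kg.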
Measures of central tendency
3: Mode

Most common observation

May not be unique
Seldom used

Weights example: mode = 70.0 kg
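
A short Python sketch (illustration only) of finding the mode, using the standard library's `statistics.multimode`, which returns every most-common value and so also shows the case where the mode is not unique. The second data set is a made-up example.

```python
import statistics

weights = (
    [60.0] + [62.5] * 3 + [65.0] * 3 + [67.5] * 5 + [70.0] * 6
    + [72.5] * 4 + [75.0] * 5 + [77.5] * 2 + [80.0]
)

# 70.0 kg occurs six times, more often than any other value
modes = statistics.multimode(weights)
print(modes)

# A made-up data set with two modes: the mode may not be unique
two_modes = statistics.multimode([1, 1, 2, 2, 3])
print(two_modes)
```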

Measures of variability
1: Range

Minimum to maximum
Maximum − minimum

• wasteful: based on only two observations
• based on the two most unusual observations
Measures of variability
1: Range

Weight (in kg) of 30 people:

60.0 62.5 62.5 62.5 65.0
65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0
70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0
75.0 75.0 77.5 77.5 80.0

Range: 60.0 to 80.0 kg, or
range = 80.0 − 60.0 = 20.0 kg
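
Both ways of reporting the range can be sketched in Python (illustration only; variable names are assumptions of this sketch):

```python
weights = (
    [60.0] + [62.5] * 3 + [65.0] * 3 + [67.5] * 5 + [70.0] * 6
    + [72.5] * 4 + [75.0] * 5 + [77.5] * 2 + [80.0]
)

minimum = min(weights)
maximum = max(weights)
range_width = maximum - minimum   # maximum − minimum

print(f"Range: {minimum} to {maximum} kg (= {range_width} kg)")
```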
Measures of variability
2: Standard deviation

Based on the difference between each observation and the mean

Deviation: d = x – x̅

x    60.0   62.5   62.5   62.5   65.0   ...   75.0   75.0   77.5   77.5   80.0
x̅    70.0   70.0   70.0   70.0   70.0   ...   70.0   70.0   70.0   70.0   70.0
d   –10.0   –7.5   –7.5   –7.5   –5.0   ...    5.0    5.0    7.5    7.5   10.0

Could we take the average deviation?

In this example: d̅ = 0
Can be shown that d̅ = 0 for any data set
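
The result that the average deviation is zero for any data set follows directly from the definition of the mean. A short derivation, for n observations with mean x̅:

```latex
\bar{d} = \frac{1}{n}\sum (x - \bar{x})
        = \frac{1}{n}\sum x \;-\; \frac{1}{n}\, n\,\bar{x}
        = \bar{x} - \bar{x}
        = 0
```

The negative and positive deviations always cancel, so the average deviation cannot measure spread.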
Measures of variability
2: Standard deviation

Solution: square the deviations before adding them:

d² = (x – x̅)²

x     60.0   62.5   62.5   62.5   65.0   ...   75.0   75.0   77.5   77.5   80.0
x̅     70.0   70.0   70.0   70.0   70.0   ...   70.0   70.0   70.0   70.0   70.0
d    –10.0   –7.5   –7.5   –7.5   –5.0   ...    5.0    5.0    7.5    7.5   10.0
d²   100.0   56.3   56.3   56.3   25.0   ...   25.0   25.0   56.3   56.3  100.0
Measures of variability
2: Standard deviation

The sample variance is defined as

$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

with original units squared, e.g. kg²

And the sample standard deviation is

$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$

with original units, e.g. kg
Measures of variability
2: Standard deviation

Example weight data

• Variance = 25.43 kg²
• Standard deviation = 5.04 kg
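
These values can be reproduced step by step: square each deviation from the mean, add them up, divide by n − 1 to get the sample variance, and take the square root for the standard deviation. A minimal Python sketch (illustration only; variable names are assumptions of this sketch), cross-checked against the standard library's `statistics.variance`:

```python
import math
import statistics

weights = (
    [60.0] + [62.5] * 3 + [65.0] * 3 + [67.5] * 5 + [70.0] * 6
    + [72.5] * 4 + [75.0] * 5 + [77.5] * 2 + [80.0]
)

n = len(weights)
mean = sum(weights) / n

# Squared deviations from the mean
squared_deviations = [(x - mean) ** 2 for x in weights]

variance = sum(squared_deviations) / (n - 1)   # sample variance, units kg²
sd = math.sqrt(variance)                       # sample standard deviation, units kg

print(f"Variance = {variance:.2f} kg², SD = {sd:.2f} kg")
```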
Measures of variability
2: Standard deviation

Low standard deviation → low variability

• observations are close to the mean
High standard deviation → high variability
• observations are more spread out
1. every observation is used
2. units are the same as the observations
3. can be used to measure precision (to come in Module 3)
Measures of variability
3: Interquartile range

The median splits the data into two

The quartiles split the data into four
The interquartile range represents the range within which the middle 50% of
observations lie
Measures of variability
3: Interquartile range

The median splits the data into two

The quartiles split the data into four
Weight data:
60.0 62.5 62.5 62.5 65.0
65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0
70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0
75.0 75.0 77.5 77.5 80.0

Quartile 1 = 67.5 kg
Quartile 2 = 70.0 kg = Median
Quartile 3 = 75.0 kg
Measures of variability
3: Interquartile range

The interquartile range (IQR) represents the range within which the middle 50%
of observations lie: i.e. Q1 to Q3
Weights data: interquartile range is 67.5 to 75.0 kg

Note: quartiles are often presented as percentiles where:

• 1st quartile = 25th percentile
• 2nd quartile = 50th percentile (= median)
• 3rd quartile = 75th percentile
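
Software packages compute percentiles using slightly different conventions, so quartiles can differ a little between programs. The Python sketch below (illustration only) uses one simple convention, which is an assumption of this sketch: with position P = n × p / 100, take the mean of the P-th and (P+1)-th ordered values when P is a whole number, and otherwise the next ordered value above P. The function name `percentile` is hypothetical. Under this convention the sketch reproduces the quartiles quoted for the weights data.

```python
import math

def percentile(values, p):
    """p-th percentile (0 < p < 100) under the simple convention described above."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    position = n * p / 100
    if position.is_integer():
        k = int(position)
        # Whole-number position: mean of the k-th and (k+1)-th ordered values
        return (ordered[k - 1] + ordered[k]) / 2
    # Otherwise: the next ordered value above the position
    return ordered[math.ceil(position) - 1]

weights = (
    [60.0] + [62.5] * 3 + [65.0] * 3 + [67.5] * 5 + [70.0] * 6
    + [72.5] * 4 + [75.0] * 5 + [77.5] * 2 + [80.0]
)

q1 = percentile(weights, 25)
q2 = percentile(weights, 50)
q3 = percentile(weights, 75)
print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {q1} to {q3} kg")
```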
Population vs sample statistics
All our measures have been based on a sample of data (n)
Things change slightly if we consider the entire population (N)

Population mean: $\mu = \frac{\sum x}{N}$
Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$

We almost never have the entire population

These theoretical quantities can be used to:
• characterise probability distributions (e.g. Normal distribution)
• inform sample size calculations
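
The practical difference between the sample and population formulas is the divisor of the variance: n − 1 for a sample, N for a population. A minimal Python sketch (illustration only), which treats the 30 example weights as if they were an entire population purely to show the arithmetic:

```python
import math
import statistics

weights = (
    [60.0] + [62.5] * 3 + [65.0] * 3 + [67.5] * 5 + [70.0] * 6
    + [72.5] * 4 + [75.0] * 5 + [77.5] * 2 + [80.0]
)

sample_variance = statistics.variance(weights)        # divides by n - 1
population_variance = statistics.pvariance(weights)   # divides by N

print(f"Sample variance     = {sample_variance:.2f} kg²")
print(f"Population variance = {population_variance:.2f} kg²")
print(f"Population SD       = {math.sqrt(population_variance):.2f} kg")
```

Dividing by the smaller number (n − 1) makes the sample variance slightly larger than the population formula would give for the same data.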
Graphing continuous data
Graphing continuous data:
Frequency histogram

Similar to a bar chart

• used for continuous data
Area of each rectangle represents frequency
Rectangles are (almost) touching
Can assess skewness (see Module 2)
Figure 1.9: Histograms
Skewness is determined by the tail and not the peak
• Symmetric distribution
• Skewed distribution (skewed to the right)
Graphing continuous data: Box-plot
Figure: Box-plot of weight of 30 people
Labelled features:
• Q3 or 75th percentile
• Median (Q2 or 50th percentile)
• Q1 or 25th percentile
• Interquartile range
* as long as there are no "outliers"

Graphing continuous data: Box-plot

Figure: Box-plot of weight of 32 people
Labelled features:
• Largest "non-outlier"
• Q3 or 75th percentile
• Median (Q2 or 50th percentile)
• Q1 or 25th percentile
• Smallest "non-outlier"
• Interquartile range

Graphing continuous data: Box-plot
Stata defines an outlier as:
• any observation larger than Q3 + 1.5 × IQR or
• any observation smaller than Q1 – 1.5 × IQR
Do not automatically assume these are incorrect
Check for biological plausibility
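
The outlier rule above can be sketched in Python (illustration only). The quartiles are those quoted for the weights data (Q1 = 67.5 kg, Q3 = 75.0 kg); the function name `outlier_fences` and the extra value of 145.0 kg are hypothetical assumptions of this sketch.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) limits beyond which observations are flagged."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

weights = (
    [60.0] + [62.5] * 3 + [65.0] * 3 + [67.5] * 5 + [70.0] * 6
    + [72.5] * 4 + [75.0] * 5 + [77.5] * 2 + [80.0]
)

lower, upper = outlier_fences(67.5, 75.0)

# No value in the 30 example weights falls outside the fences
flagged = [x for x in weights if x < lower or x > upper]

# A hypothetical extra observation of 145.0 kg would be flagged
flagged_with_extra = [x for x in weights + [145.0] if x < lower or x > upper]

print(lower, upper, flagged, flagged_with_extra)
```

A flagged value is only a prompt to check the data, not evidence that the value is wrong.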
Figure 1.10: Box-plots
Reporting results
Summary statistics
Report units of measurement
Don’t forget to report the unit of measure for all your summary statistics
Example weight data:
• Mean = 70.0 kg
• Median = 70.0 kg
• Mode = 70.0 kg
• Range = 60.0 to 80.0 kg (= 20.0 kg)
• IQR = 67.5 to 75.0 kg (= 7.5 kg)
• Standard deviation = 5.04 kg
• Variance = 25.43 kg²
Reporting results: decimal places
In the presentation of results:
• Range, median and interquartile range are based on observed data points, so quote to the
same number of decimal places as the original data
• Mean may be quoted with one more decimal place than the original data
• Variance, standard deviations or standard errors may be quoted to one extra decimal place
than the mean

Do not give greater precision than can be measured by the instrument used to
collect the information
Reporting results: decimal places
The precision of percentages depends on the number of observations in your sample
For samples of fewer than 100, present no decimal places
For samples of 100 or more, present no more than one decimal place
Rounding decimal places
All decimal places should be retained in intermediate calculations
Rounding should only be carried out at the end of the analysis
• Use the memory function on your calculator
Numbers should always be rounded, not truncated.
E.g. 0.015782 expressed to:
2 decimal places is 0.02: the first digit after the second decimal place is 5 (≥ 5), so the 1 gets rounded up
3 decimal places is 0.016
4 decimal places is 0.0158
5 decimal places is 0.01578
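
The difference between rounding and truncating can be sketched in Python (illustration only). The sketch uses the standard library's `decimal` module with `ROUND_HALF_UP`, because Python's built-in `round` works on binary floating-point numbers and rounds ties to the nearest even digit. The helper names are assumptions of this sketch.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

value = Decimal("0.015782")

def round_half_up(x, places):
    """Round x to the given number of decimal places (digits >= 5 round up)."""
    return x.quantize(Decimal(10) ** -places, rounding=ROUND_HALF_UP)

def truncate(x, places):
    """Cut x off after the given number of decimal places (not recommended)."""
    return x.quantize(Decimal(10) ** -places, rounding=ROUND_DOWN)

for places in (2, 3, 4, 5):
    print(places, round_half_up(value, places), truncate(value, places))
```

To 2 decimal places, rounding gives 0.02 whereas truncating gives 0.01.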
Research as storytelling
Tell the reader a simple story
Keep the reader engaged
Present evidence to support your story
Evidence is in the form of tables and figures
Easy to read tables and figures keep readers engaged
• Descriptive vs inferential statistics
• Important to identify the type of a variable
• Appropriate presentation and analysis
• Present and report data numerically
• Introduced different types of graphical summaries
• Compute summary statistics to describe the centre and spread of data
Always happy to answer questions
In or after lectures
On Moodle boards
• Please do not use Moodle messages!
Via email:

