Module 1 - Intro and Presenting Data - 1pp

PHCM97975
Foundations of Biostatistics
Module 1: Introduction to statistics and presenting data
Timothy Dobbins
Hybrid learning
This class includes students in the lecture theatre and online
So all students can participate, please use the classroom microphones
or your laptop microphone so everyone can hear all discussions
If you are an online student, please turn on your camera when asking
questions, presenting or joining the discussion.
This session will be recorded

Aim of this course
Introduce statistical techniques relevant to population health (while minimising
maths!) so you can:
• Identify the most appropriate statistical technique based on the research question and the
type of data
• Use Stata to conduct the statistical analysis
• Interpret and report results to a non-statistical audience
Learning objectives
By the end of this module, you will be able to:
1. Understand the difference between descriptive and inferential statistics
2. Distinguish between different types of variables
3. Present and report data numerically
4. Present and interpret graphical summaries of data using a variety of graphs
5. Compute summary statistics to describe the centre and spread of data
Pre-course survey
What is statistics?
Statistics
the science of collecting, summarising, presenting and interpreting data ... to
estimate the magnitude of associations and test hypotheses (Kirkwood &
Sterne)
the discipline concerned with the treatment of numerical data derived from
groups of individuals (Armitage, Berry and Mathews)
the act of summarising quantitative data to obtain usable information
Fields of statistics
Mathematical or theoretical statistics
Applied statistics
• Biostatistics or medical statistics: the application of statistics to health and medical data
Statistical literacy
1. Data awareness
• relevance and appropriateness; data source and quality; fit for purpose
2. The ability to understand statistical concepts
3. The ability to analyse, interpret and evaluate statistical information
4. The ability to communicate statistical information and understandings
https://www.abs.gov.au/ausstats/abs@.nsf/lookup/1307.6feature+article1mar+2009
Scope of statistics: descriptive vs inferential
Descriptive: describe characteristics of a population
https://www.aihw.gov.au/reports/mothers-babies/australias-mothers-babies/contents/demographics-of-mothers-and-babies/maternal-age
https://www.aihw.gov.au/reports/mothers-babies/australias-mothers-babies/contents/demographics-of-mothers-and-babies/key-demographics-and-statistics
Inferential: use a sample of data from the population to make inferences about
the whole population
http://www.healthstats.nsw.gov.au
Types of variables
Observations and variables
Kirkwood and Sterne:
• The raw data of an investigation consist of observations made on individuals
• Any aspect of an individual that is measured ... is called a variable
Types of variables
Numeric (or quantitative)
• continuous (e.g. weight, haemoglobin)
• discrete (e.g. number of children)
Categorical (or qualitative)
• binary (e.g. previous heart disease: yes/no)
• categorical (e.g. status: alive, died from cancer, died from other causes)
• ordered categorical (ordinal) (e.g. cancer stage, grade)
Observations and variables
The variable type will help determine the appropriate analysis

Historically
What is your sex?
o Male
o Female
Sex vs gender
Sex:
• understood in relation to sex characteristics
• Sex recorded at birth refers to what was determined by sex characteristics
observed at birth or infancy
Gender:
• social and cultural differences in identity, expression and experience
https://www.abs.gov.au/statistics/standards/standard-sex-gender-variations-sex-characteristics-and-sexual-orientation-variables/latest-release
Suggested questions
What was your sex recorded at birth?
o Male
o Female
o Another term (please specify)
Suggested questions
How do you describe your gender?
Gender refers to current gender, which may be different to sex recorded at birth
and may be different to what is indicated on legal documents.
Please [tick/mark/select] one box:

o Man or male
o Woman or female
o Non-binary
o I use a different term (please specify)
o Prefer not to answer
Presenting numerical data
Research as storytelling
Tell the reader a simple story
Keep the reader engaged
https://www.amazon.com/Gruffalo-Julia-Donaldson/dp/0803730470
https://www.amazon.com/Matilda-Roald-Dahl/dp/0142410373
https://www.amazon.com.au/Testaments-Handmaids-Tale-Book-ebook/dp/B07KRMV57
Present evidence to support your story
Evidence is in the form of tables and figures
Easy to read tables and figures keep readers engaged
Constructing tables
One-way frequency tables
Summarises a single characteristic (e.g. age)
Frequency: the number of individuals with a certain characteristic
Relative frequency: the frequency expressed as a percentage or a proportion of
the total frequency
Table 1.2: Frequency distribution of ages of students visiting a gym
Age Frequency Relative frequency (%)
17 1 3
18 5 17
19 5 17
20 7 23
21 5 17
22 2 7
23 4 13
24 1 3
Total 30 100
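As an illustration of how the frequencies and relative frequencies in Table 1.2 can be computed, here is a minimal Python sketch. The course itself uses Stata; the variable names below are hypothetical, and the list of ages is rebuilt from the table's frequencies rather than taken from the original data file.

```python
from collections import Counter

# Ages rebuilt from the frequencies in Table 1.2 (age -> number of students).
age_counts = {17: 1, 18: 5, 19: 5, 20: 7, 21: 5, 22: 2, 23: 4, 24: 1}
ages = [age for age, count in age_counts.items() for _ in range(count)]

# Frequency: the number of individuals with each age.
frequency = Counter(ages)
total = sum(frequency.values())

# Relative frequency: the frequency as a percentage of the total frequency.
relative_frequency = {age: 100 * n / total for age, n in frequency.items()}

for age in sorted(frequency):
    print(f"{age:>5} {frequency[age]:>3} {relative_frequency[age]:>4.0f}")
print(f"Total {total:>3}  100")
```

Percentages are rounded only when printed, so the stored values keep full precision.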
Cumulative frequency: the number of individuals in a category or below
Cumulative relative frequency: the cumulative frequency expressed as a
percentage or a proportion of the total frequency
Table 1.3: Frequency distribution of ages of students visiting a gym
Age     Frequency   Relative        Cumulative   Cumulative relative
                    frequency (%)   frequency    frequency (%)
17      1           3               1            3
18      5           17              6            20
19      5           17              11           37
20      7           23              18           60
21      5           17              23           77
22      2           7               25           83
23      4           13              29           97
24      1           3               30           100
Total   30          100
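The cumulative columns of Table 1.3 are running totals of the frequency column. A minimal Python sketch, assuming the frequencies are typed in directly from the table (the variable names are illustrative only):

```python
from itertools import accumulate

ages = [17, 18, 19, 20, 21, 22, 23, 24]
frequency = [1, 5, 5, 7, 5, 2, 4, 1]   # frequency column of Table 1.3
total = sum(frequency)

# Cumulative frequency: number of individuals in a category or below.
cumulative = list(accumulate(frequency))

# Cumulative relative frequency: cumulative frequency as a percentage of the total.
cumulative_pct = [100 * c / total for c in cumulative]

for age, c, pct in zip(ages, cumulative, cumulative_pct):
    print(f"{age:>3} {c:>3} {pct:>4.0f}")
```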
Two-way frequency tables
Summarises two characteristics
Table 1.5: BMI status of students visiting a gym by sex
         BMI status
Sex      Normal             Overweight              Obese              Total
         (BMI < 25 kg/m²)   (25 ≤ BMI < 30 kg/m²)   (BMI ≥ 30 kg/m²)
Male     1                  9                       2                  12
Female   11                 6                       0                  17
Total    12                 15                      2                  29
Note: 1 value of BMI was missing
Two-way frequency tables: column percentages
Calculates relative frequencies within columns
         BMI status
Sex      Normal       Overweight   Obese        Total
         n     %      n     %      n     %      n     %
Male     1     8      9     60     2     100    12    41
Female   11    92     6     40     0     0      17    59
Total    12    100    15    100    2     100    29    100
Two-way frequency tables: row percentages
Calculates relative frequencies within rows
              BMI status
Sex           Normal   Overweight   Obese   Total
Male     n    1        9            2       12
         %    8        75           17      100
Female   n    11       6            0       17
         %    65       35           0       100
Total    n    12       15           2       29
         %    41       52           7       100
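The only difference between column and row percentages is the denominator: the column total or the row total. A minimal Python sketch using the counts from the BMI-by-sex table (the dictionary layout and names are assumptions made for illustration):

```python
# Counts from the two-way table of BMI status by sex.
counts = {
    "Male":   {"Normal": 1,  "Overweight": 9, "Obese": 2},
    "Female": {"Normal": 11, "Overweight": 6, "Obese": 0},
}
categories = ["Normal", "Overweight", "Obese"]

row_totals = {sex: sum(row.values()) for sex, row in counts.items()}
col_totals = {c: sum(counts[sex][c] for sex in counts) for c in categories}

# Row percentages: each cell divided by the total for its row (sex).
row_pct = {
    sex: {c: 100 * counts[sex][c] / row_totals[sex] for c in categories}
    for sex in counts
}

# Column percentages: each cell divided by the total for its column (BMI status).
col_pct = {
    sex: {c: 100 * counts[sex][c] / col_totals[c] for c in categories}
    for sex in counts
}

print(round(row_pct["Male"]["Overweight"]))  # 75: share of males who are overweight
print(round(col_pct["Male"]["Overweight"]))  # 60: share of overweight students who are male
```

The same cell (9 overweight males) gives 75% or 60% depending on the denominator, so a table should always state which percentage it reports.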
Multi-way frequency tables
Australian Institute of Health and Welfare 2015. The health of Australia’s prisoners 2015. Cat. no. PHE 207. Canberra: AIHW
Table presentation guidelines
1. Each table (and figure) should be self-explanatory, i.e. the reader should be
able to understand it without reference to the text in the body of the report
2. Units of the variables should be given and missing records should be noted
3. A table should be visually uncluttered
From Woodward. Epidemiology: Study Design and Data Analysis, Third Edition; 2013
Table presentation guidelines
4. The rows and columns of each table should be arranged in a natural order to
help interpretation
5. Tables should have a consistent appearance throughout the report
6. Consider if there is a particular table orientation that makes a table easier to
read
Presenting data graphically
Bar chart
Simple way to plot frequencies
Bars represent frequency
Horizontal (x) axis is categorical
Source: Australian Institute of Health and Welfare 2019. Cancer in Australia 2019. Cancer series no.119. Cat. no. CAN 123. Canberra: AIHW.
Clustered bar chart
Source: https://www.aihw.gov.au/reports/burden-of-disease/australian-burden-of-disease-study-impact-and-causes-of-illness-and-death-in-australia-2011/contents/highlights
Stacked bar chart
Source: Australian Institute of Health and Welfare 2017. Australia’s welfare 2017. Australia’s welfare series no. 13. AUS 214. Canberra: AIHW.
Stacked bar chart
Source: State of Australian University Research 2015–16: Volume 1 ERA National Report (ERA 2015)
Line chart
Effective way to show changes over time

Pie charts
• Area of pie piece represents relative frequency
• Useful for broad statements
• Can be difficult to compare within and between pies
Source: Australian Institute of Health and Welfare 2016. Australia’s health 2016. Australia’s health series no. 15. Cat. no. AUS 199. Canberra: AIHW.
Pie charts
Source: https://budget.gov.au/2019-20/content/overview.htm
Pie charts
Many authors categorically reject pie charts ... Others
defend the use of pie charts in some applications. My
own opinion is that none of these visualizations is
consistently superior over any other. Depending on
the features of the dataset and the specific story you
want to tell, you may want to favor one or the other
approach.
https://serialmentor.com/dataviz/visualizing-proportions.html#a-case-for-pie-charts
Pie charts
Pie charts are evil
I have a well‐documented disdain for pie charts. In short, they
are evil. To understand how I arrived at this conclusion, let’s
look at an example.
Graphical presentation guidelines
1. Figures should be self-explanatory and have a consistent appearance throughout
the report
2. A title should give complete information
3. Axes should be labelled appropriately
4. Units of the variables should be given in the labelling of the axes. Use
footnotes to indicate any calculation or derivation of variables and to indicate
missing values
Graphical presentation guidelines
5. If the Y-axis has a natural origin, it should be included, or emphasised if it is
not included.
6. If graphs are being compared, the Y-axis should be the same across the
graphs to enable fair comparison
7. Columns of bar charts should be separated by a space
8. Three dimensional graphs should be avoided unless the third dimension
adds additional information
Computing summary statistics
What do we want to know?
What is the average value?
What is the spread, or variability?
Example data
Weight (in kg) of 30 people:
60.0 62.5 62.5 62.5 65.0
65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0
70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0
75.0 75.0 77.5 77.5 80.0
Measures of central tendency
1: Mean
$\bar{x} = \frac{\sum x}{n}$
Weights example:
$\sum x = 2100$
$\bar{x} = 2100 / 30 = 70.0$ kg
Advantages
• uses all data points
• nice mathematical properties
Disadvantage
• affected by unusually small or large observations
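The mean calculation for the weights example can be reproduced in a few lines. This is a minimal Python sketch (the course software is Stata; the variable names here are illustrative):

```python
# Weight (kg) of the 30 people in the example data.
weights = [
    60.0, 62.5, 62.5, 62.5, 65.0, 65.0, 65.0, 67.5, 67.5, 67.5,
    67.5, 67.5, 70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 72.5, 72.5,
    72.5, 72.5, 75.0, 75.0, 75.0, 75.0, 75.0, 77.5, 77.5, 80.0,
]

total = sum(weights)   # sum of all observations
n = len(weights)       # number of observations
mean = total / n       # mean = sum of x divided by n

print(f"Sum = {total}, n = {n}, mean = {mean} kg")
```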
2: Median
Order data from smallest to largest

Median: the middle observation (if n is odd)
or the mean of the two middle observations (if n is even)
2: Median
Example (n=30):
60.0 62.5 62.5 62.5 65.0
65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0
70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0
75.0 75.0 77.5 77.5 80.0
Median = (70.0 + 70.0) / 2 = 70.0 kg
2: Median
Advantage
• not unduly affected by unusually small or large observations
Disadvantages
• difficult mathematical properties
• ‘wastes’ information
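The median rule (middle observation if n is odd, mean of the two middle observations if n is even) can be sketched as a short Python function. The function name is hypothetical; Python's built-in `statistics.median` applies the same rule.

```python
def median(values):
    """Middle observation (odd n) or mean of the two middle observations (even n)."""
    ordered = sorted(values)   # order data from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


weights = [
    60.0, 62.5, 62.5, 62.5, 65.0, 65.0, 65.0, 67.5, 67.5, 67.5,
    67.5, 67.5, 70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 72.5, 72.5,
    72.5, 72.5, 75.0, 75.0, 75.0, 75.0, 75.0, 77.5, 77.5, 80.0,
]

# n = 30 is even, so the median is the mean of the 15th and 16th observations.
print(median(weights))
```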
Mean vs Median
Mean: affected by unusually small or large observations

Median: not affected by unusually small or large observations
Example (n=30):
60.0 62.5 62.5 62.5 65.0 65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0 70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0 75.0 75.0 77.5 77.5 80.0
Mean: 70.0 kg
Median: 70.0 kg
Example with one unusually large observation added (n=31):
60.0 62.5 62.5 62.5 65.0 65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0 70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0 75.0 75.0 77.5 77.5 80.0
145.0
Mean: 72.4 kg
Median: 70.0 kg
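The effect of the single large observation (145.0 kg) on the mean and median can be checked directly. A minimal Python sketch using the standard-library `statistics` module (variable names are illustrative):

```python
from statistics import mean, median

weights = [
    60.0, 62.5, 62.5, 62.5, 65.0, 65.0, 65.0, 67.5, 67.5, 67.5,
    67.5, 67.5, 70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 72.5, 72.5,
    72.5, 72.5, 75.0, 75.0, 75.0, 75.0, 75.0, 77.5, 77.5, 80.0,
]
with_outlier = weights + [145.0]   # the n=31 example

mean_30, median_30 = mean(weights), median(weights)
mean_31, median_31 = mean(with_outlier), median(with_outlier)

print(f"n=30: mean = {mean_30:.1f} kg, median = {median_30:.1f} kg")
print(f"n=31: mean = {mean_31:.1f} kg, median = {median_31:.1f} kg")
```

The mean moves from 70.0 to 72.4 kg while the median stays at 70.0 kg.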
3: Mode
Most common observation

May not be unique
Seldom used
Weights example: mode = 70.0 kg
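Because the mode may not be unique, it helps to use a function that returns every most-common value. A minimal Python sketch using `statistics.multimode` (available from Python 3.8); the second data set is made up purely to show a tie:

```python
from statistics import multimode

weights = [
    60.0, 62.5, 62.5, 62.5, 65.0, 65.0, 65.0, 67.5, 67.5, 67.5,
    67.5, 67.5, 70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 72.5, 72.5,
    72.5, 72.5, 75.0, 75.0, 75.0, 75.0, 75.0, 77.5, 77.5, 80.0,
]

modes = multimode(weights)               # list of all most-common values
tied_modes = multimode([1, 1, 2, 2, 3])  # hypothetical data with two modes

print(modes)       # [70.0]
print(tied_modes)  # [1, 2]
```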

Measures of variability
1: Range
Minimum to maximum
or
Maximum − minimum
Disadvantages
• wasteful: based on only two observations
• based on the two most unusual observations
1: Range
Weight (in kg) of 30 people:
60.0 62.5 62.5 62.5 65.0
65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0
70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0
75.0 75.0 77.5 77.5 80.0
Range: 60.0 to 80.0 kg, or
range = 80.0 − 60.0 = 20.0 kg
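Both ways of reporting the range come from the same two observations. A minimal Python sketch (variable names are illustrative):

```python
weights = [
    60.0, 62.5, 62.5, 62.5, 65.0, 65.0, 65.0, 67.5, 67.5, 67.5,
    67.5, 67.5, 70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 72.5, 72.5,
    72.5, 72.5, 75.0, 75.0, 75.0, 75.0, 75.0, 77.5, 77.5, 80.0,
]

minimum, maximum = min(weights), max(weights)
range_width = maximum - minimum   # maximum minus minimum

print(f"Range: {minimum} to {maximum} kg (= {range_width} kg)")
```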
2: Standard deviation
Based on the difference between each observation and the mean

Deviation: d = x − x̅
x    60.0   62.5   62.5   62.5   65.0  ...  75.0  75.0  77.5  77.5  80.0
x̅    70.0   70.0   70.0   70.0   70.0  ...  70.0  70.0  70.0  70.0  70.0
d   −10.0   −7.5   −7.5   −7.5   −5.0  ...   5.0   5.0   7.5   7.5  10.0
Could we take the average deviation?

In this example: d̅ = 0
Can be shown that d̅ = 0 for any data set
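A short derivation of why the average deviation is always zero, using only the definition of the mean (so that ∑x = n x̅):

```latex
\bar{d} = \frac{1}{n}\sum (x - \bar{x})
        = \frac{1}{n}\left(\sum x - n\bar{x}\right)
        = \frac{1}{n}\left(n\bar{x} - n\bar{x}\right)
        = 0
```

The positive and negative deviations cancel exactly, which is why the average deviation cannot be used as a measure of spread.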
Solution: square the deviations before adding them:
d² = (x − x̅)²
x    60.0   62.5   62.5   62.5   65.0  ...  75.0  75.0  77.5  77.5   80.0
x̅    70.0   70.0   70.0   70.0   70.0  ...  70.0  70.0  70.0  70.0   70.0
d   −10.0   −7.5   −7.5   −7.5   −5.0  ...   5.0   5.0   7.5   7.5   10.0
d²  100.0   56.3   56.3   56.3   25.0  ...  25.0  25.0  56.3  56.3  100.0
The sample variance is defined as
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
with original units squared, e.g. kg²
And the sample standard deviation is
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
with original units, e.g. kg
Example weight data

• Variance = 25.43 kg²
• Standard deviation = 5.04 kg
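The variance and standard deviation for the weight data can be reproduced step by step from the definitions. A minimal Python sketch (variable names are illustrative; the standard-library functions `statistics.variance` and `statistics.stdev` use the same n − 1 divisor):

```python
import math

weights = [
    60.0, 62.5, 62.5, 62.5, 65.0, 65.0, 65.0, 67.5, 67.5, 67.5,
    67.5, 67.5, 70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 72.5, 72.5,
    72.5, 72.5, 75.0, 75.0, 75.0, 75.0, 75.0, 77.5, 77.5, 80.0,
]

n = len(weights)
mean = sum(weights) / n

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in weights]
sum_of_squares = sum(squared_deviations)

variance = sum_of_squares / (n - 1)   # sample variance, in kg squared
std_dev = math.sqrt(variance)         # sample standard deviation, in kg

print(f"Variance = {variance:.2f} kg^2, standard deviation = {std_dev:.2f} kg")
```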
Low standard deviation → low variability

• observations are close to the mean
High standard deviation → high variability
• observations are more spread out
Characteristics:
1. every observation is used
2. units are the same as the observations
3. can be used to measure precision (to come in Module 3)
3: Interquartile range
The median splits the data into two

The quartiles split the data into four
The interquartile range represents the range within which the middle 50% of
observations lie
Weight data:
60.0 62.5 62.5 62.5 65.0
65.0 65.0 67.5 67.5 67.5
67.5 67.5 70.0 70.0 70.0
70.0 70.0 70.0 72.5 72.5
72.5 72.5 75.0 75.0 75.0
75.0 75.0 77.5 77.5 80.0
Quartile 1 = 67.5 kg
Quartile 2 = 70.0 kg = Median
Quartile 3 = 75.0 kg
The interquartile range (IQR) represents the range within which the middle 50%
of observations lie: i.e. Q1 to Q3
Weights data: interquartile range is 67.5 to 75.0 kg
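Software packages use slightly different rules for quartiles, so results can differ a little in small samples. The Python sketch below assumes one simple rule, which reproduces the quartiles quoted for the weight data: compute the position n × p / 100 in the ordered data; if it is a whole number, average that observation and the next one, otherwise round the position up. The function name and the choice of rule are assumptions for illustration, not necessarily what any particular package does.

```python
import math


def percentile(values, p):
    """p-th percentile (0 < p < 100) using the simple position rule described above."""
    ordered = sorted(values)
    n = len(ordered)
    position = n * p / 100                 # 1-based position in the ordered data
    if position.is_integer():
        k = int(position)
        return (ordered[k - 1] + ordered[k]) / 2   # average the k-th and (k+1)-th
    return ordered[math.ceil(position) - 1]        # otherwise round the position up


weights = [
    60.0, 62.5, 62.5, 62.5, 65.0, 65.0, 65.0, 67.5, 67.5, 67.5,
    67.5, 67.5, 70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 72.5, 72.5,
    72.5, 72.5, 75.0, 75.0, 75.0, 75.0, 75.0, 77.5, 77.5, 80.0,
]

q1 = percentile(weights, 25)
q2 = percentile(weights, 50)
q3 = percentile(weights, 75)

print(f"Q1 = {q1} kg, median = {q2} kg, Q3 = {q3} kg")
print(f"IQR: {q1} to {q3} kg (= {q3 - q1} kg)")
```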
Note: quartiles are often presented as percentiles where:

• 1st quartile = 25th percentile
• 2nd quartile = 50th percentile (= median)
• 3rd quartile = 75th percentile
Population vs sample statistics
All our measures have been based on a sample of data (n)
Things change slightly if we consider the entire population (N)
Population mean: $\mu = \frac{\sum x}{N}$
Population variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
We almost never have the entire population

These theoretical quantities can be used to:
• characterise probability distributions (e.g. Normal distribution)
• inform sample size calculations
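The practical difference between the sample and population formulas is the divisor: n − 1 for a sample, N for a population. A minimal Python sketch using the standard-library `statistics` module; treating the 30 weights as if they were an entire population is an assumption made purely to show the arithmetic.

```python
from statistics import pvariance, variance

weights = [
    60.0, 62.5, 62.5, 62.5, 65.0, 65.0, 65.0, 67.5, 67.5, 67.5,
    67.5, 67.5, 70.0, 70.0, 70.0, 70.0, 70.0, 70.0, 72.5, 72.5,
    72.5, 72.5, 75.0, 75.0, 75.0, 75.0, 75.0, 77.5, 77.5, 80.0,
]

sample_var = variance(weights)        # divides the sum of squares by n - 1
population_var = pvariance(weights)   # divides the sum of squares by N

print(f"Sample variance     = {sample_var:.2f} kg^2")
print(f"Population variance = {population_var:.2f} kg^2")
```

The population formula always gives the smaller value, and the gap shrinks as the number of observations grows.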
Graphing continuous data
Graphing continuous data: Frequency histogram
Similar to a bar chart

• used for continuous data
Area of each rectangle represents frequency
Rectangles are (almost) touching
Can assess skewness (see Module 2)
Figure 1.9: Histograms (panels: symmetric distribution; skewed distribution, skewed to the right)
Skewness is determined by the tail and not the peak
Graphing continuous data: Box-plot
Figure: Box-plot of weight of 30 people
Labels, top to bottom: Maximum*; Q3 or 75th percentile; Median (Q2 or 50th percentile); Minimum*. The box spans the interquartile range.
* as long as there are no "outliers"
Figure: Box-plot of weight of 32 people
Labels, top to bottom: "Outlier"; Largest "non-outlier"; Median (Q2 or 50th percentile); Smallest "non-outlier"; "Outlier". The box spans the interquartile range.
Stata defines an outlier as:
• any observation larger than Q3 + 1.5 × IQR or
• any observation smaller than Q1 – 1.5 × IQR
Do not automatically assume these are incorrect
Check for biological plausibility
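The outlier rule above can be turned into two cut-off values. A minimal Python sketch, assuming the quartiles of the 30-person weight data (Q1 = 67.5 kg, Q3 = 75.0 kg); the function name is hypothetical, and the 145.0 kg value is the unusually large observation from the mean vs median example:

```python
q1, q3 = 67.5, 75.0            # quartiles of the example weight data (kg)
iqr = q3 - q1                  # interquartile range as a single number

lower_fence = q1 - 1.5 * iqr   # observations below this are flagged
upper_fence = q3 + 1.5 * iqr   # observations above this are flagged


def is_outlier(x):
    """Flag an observation that falls outside the 1.5 x IQR fences."""
    return x < lower_fence or x > upper_fence


print(f"Fences: {lower_fence} to {upper_fence} kg")
for value in (60.0, 80.0, 145.0):
    print(value, "outlier" if is_outlier(value) else "not an outlier")
```

A flagged value is only a prompt to check the record; it is not evidence that the value is wrong.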
Figure 1.10: Box-plots
Reporting results
Summary statistics
Report units of measurement
Don’t forget to report the unit of measure for all your summary statistics
Example weight data:
• Mean = 70.0 kg
• Median = 70.0 kg
• Mode = 70.0 kg
• Range = 60.0 to 80.0 kg (= 20.0 kg)
• IQR = 67.5 to 75.0 kg (= 7.5 kg)
• Standard deviation = 5.04 kg
• Variance = 25.43 kg²
Reporting results: decimal places
In the presentation of results:
• Range, median and interquartile range are based on observed data points, so quote to the
same number of decimal places as the original data
• Mean may be quoted with one more decimal place than the original data
• Variance, standard deviations or standard errors may be quoted to one extra decimal place
than the mean
Do not give greater precision than can be measured by the instrument used to
collect the information
Reporting results: decimal places
The precision of percentages depends on the number of observations in your
sample
For samples of fewer than 100, present no decimal places
For samples of 100 or more, present no more than one decimal place
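The percentage rule can be written as a small helper. A minimal Python sketch; the function name is hypothetical, and the second example (41 out of 120) uses made-up numbers to show the larger-sample case:

```python
def format_percentage(count, n):
    """Format count/n as a percentage: 0 decimals if n < 100, otherwise 1 decimal."""
    pct = 100 * count / n
    decimals = 0 if n < 100 else 1
    return f"{pct:.{decimals}f}%"


print(format_percentage(12, 29))    # 12 of the 29 gym students -> "41%"
print(format_percentage(41, 120))   # hypothetical larger sample -> "34.2%"
```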
Rounding decimal places
All decimal places should be retained in intermediate calculations
Rounding should only be carried out at the end of the analysis
• Use the memory function on your calculator
Numbers should always be rounded, not truncated.
E.g. 0.015782 expressed to 2 decimal places is 0.02:
truncating would give 0.01, but the first digit after the second decimal place is 5,
which is ≥ 5, so the 1 gets rounded up to 2
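The difference between rounding and truncating can be demonstrated with Python's `decimal` module, which lets the rounding rule be stated explicitly. This is a minimal sketch; `ROUND_HALF_UP` is chosen here because it matches the "5 or more rounds up" rule described above.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

value = Decimal("0.015782")
two_places = Decimal("0.01")   # template for 2 decimal places

rounded = value.quantize(two_places, rounding=ROUND_HALF_UP)   # round to nearest
truncated = value.quantize(two_places, rounding=ROUND_DOWN)    # simply drop digits

print(f"Rounded:   {rounded}")     # 0.02
print(f"Truncated: {truncated}")   # 0.01
```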
Present evidence to support your story
Evidence is in the form of tables and figures
Easy to read tables and figures keep readers engaged
Summary
• Descriptive vs inferential statistics
• Important to identify the type of a variable
• Appropriate presentation and analysis
• Present and report data numerically
• Introduced different types of graphical summaries
• Compute summary statistics to describe the centre and spread of data
Questions?
Always happy to answer questions
In or after lectures
On Moodle boards
• Please do not use Moodle messages!
Via email: t.dobbins@unsw.edu.au
