
Measures of spread / measures of distribution of data within the data set

Normal Distribution – Gaussian, bell-shaped curve:

±1 SD contains 68.2% of the values
±2 SD contains 95.4% of the values
±3 SD contains 99.7% of the values
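As a quick check of these percentages, here is a minimal sketch (assuming the scipy library is available) that computes the coverage of ±1, ±2 and ±3 SD for a normal distribution:

```python
from scipy.stats import norm

# Proportion of a normal distribution lying within +/- k standard deviations
# of the mean: P(-k < Z < k) = cdf(k) - cdf(-k).
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"+/-{k} SD: {coverage:.1%}")   # ~68.3%, 95.4%, 99.7%
```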


Measure of spread

Skewed data distribution (non-Gaussian, non-normal distribution):
Range
Median
Interquartile Range

Not skewed (normal, Gaussian distribution):
Of the measurement process: Mean, Standard Deviation, Confidence Limits of the Range
Of the mean: Mean, Standard Error of the Mean, Confidence Limits of the Mean
Range (minimum to maximum)

Data (as collected): 2.29, 2.35, 2.24, 2.25, 2.41, 2.28, 2.39, 2.32, 2.47, 2.36
Data (sorted):       2.24, 2.25, 2.28, 2.29, 2.32, 2.35, 2.36, 2.39, 2.41, 2.47

Minimum = 2.24, Maximum = 2.47

Range = 2.47 - 2.24 = 0.23
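A minimal sketch of the same calculation in Python (the variable names are my own):

```python
data = [2.29, 2.35, 2.24, 2.25, 2.41, 2.28, 2.39, 2.32, 2.47, 2.36]

minimum = min(data)
maximum = max(data)
data_range = maximum - minimum   # "range" is a Python built-in, so use another name

print(minimum, maximum, round(data_range, 2))   # 2.24 2.47 0.23
```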
Interquartile Range (the middle 50% of the data)
Order the data low to high

Find the value at 25% - quartile 1 (Q1)

Find the value at 50% - quartile 2 (Q2). This is also the median value for the data set.
Find the value at 75% - quartile 3 (Q3)
Interquartile Range
Interquartile range = (quartile 3 – quartile 1) = (Q3 – Q1)

25% → Q1      50% → Q2      75% → Q3
2.24 2.25 2.28 2.29 2.32 2.35 2.36 2.39 2.41 2.47

Median = (2.32 + 2.35) / 2 = 2.335, so the median = quartile 2 = Q2 = 2.335

2.24 2.25 2.28 2.29 2.32 2.35 2.36 2.39 2.41 2.47

As we have an even number of data points, we split the data into two halves:

2.24  2.25  2.28  2.29  2.32   |   2.35  2.36  2.39  2.41  2.47

The median of the lower half of the data is quartile 1, Q1 = 2.28. The median of the upper half of the data is quartile 3, Q3 = 2.39.

Interquartile range is quartile 3 – quartile 1 = (Q3 – Q1) = (2.39 – 2.28) = 0.11


Summary for this data set:

minimum = 2.24
Q1 = 2.28
Q2 (median) = 2.335
Q3 = 2.39
maximum = 2.47
Interquartile range (IQR) = Q3 - Q1 = 0.11

Outliers – values lying more than 1.5 × IQR below Q1 or above Q3, often marked with an asterisk (*)


If we had an odd number of data points, e.g. 11, 12, 13, 15, 16, 17, 18:

11, 12, 13, 15, 16, 17, 18 Median = Q2 = 15

Remove the median and split the data

11, 12, 13   |   16, 17, 18

The median of the lower half is quartile 1, Q1 = 12. The median of the upper half is quartile 3, Q3 = 17.

Interquartile range is quartile 3 – quartile 1 = (Q3 – Q1) = (17 – 12) = 5
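Putting both cases together, here is a sketch of the quartile method used on these slides (median of the lower and upper halves, dropping the middle value when the count is odd). Note that spreadsheets and statistics libraries often use slightly different quartile conventions, so their Q1 and Q3 may differ a little from these values. The function name `quartiles` is my own:

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 using the slides' method: Q2 is the median; Q1 and Q3 are the
    medians of the lower and upper halves (the median itself is excluded from
    the halves when the number of values is odd)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

q1, q2, q3 = quartiles([2.29, 2.35, 2.24, 2.25, 2.41, 2.28, 2.39, 2.32, 2.47, 2.36])
print(q1, round(q2, 3), q3, round(q3 - q1, 2))   # 2.28 2.335 2.39 0.11

q1, q2, q3 = quartiles([11, 12, 13, 15, 16, 17, 18])
print(q1, q2, q3, q3 - q1)                       # 12 15 17 5
```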


Standard Deviation – captures how much, on average, each value deviates from the mean of the data.

Variance – a measure of dispersion relative to the scatter of the values about the mean.

Sample variance:  s² = Σ(x − x̄)² / (n − 1)
Why two formulae?
N vs n

Population (N) vs sample (n-1) data

(n – 1) is called Bessel’s correction and takes into account that:

the sample standard deviation (s) is calculated using the sample mean, which is only an estimate;

the population standard deviation (σ) is calculated using the true population mean, μ.
Variance – calculate the differences from the mean, then square the differences.

Let us look at our set of data (mean x̄ = 2.34):

x       (x − x̄)    (x − x̄)²
2.29    -0.050     0.0025
2.35     0.010     0.0001
2.24    -0.100     0.0100
2.25    -0.090     0.0081
2.41     0.070     0.0049
2.28    -0.060     0.0036
2.39     0.050     0.0025
2.32    -0.020     0.0004
2.47     0.130     0.0169
2.36     0.020     0.0004

Sample number n = 10, sum of the squares Σ(x − x̄)² = 0.0494

s² = Σ(x − x̄)² / (n − 1)
Variance – if the data were sample data:

s² = Σ(x − x̄)² / (n − 1) = 0.0494 / (10 − 1) = 0.0494 / 9 = 0.0055

Variance – if the data were population data:

σ² = Σ(x − μ)² / N = 0.0494 / 10 = 0.0049
Sample data:  s = √0.0055 = 0.075

Population data:  σ = √0.0049 = 0.070
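A short sketch of both calculations using Python's statistics module (the slides round the mean to 2.34, which is why they quote 0.075 rather than 0.074 for the sample SD):

```python
from statistics import stdev, pstdev

data = [2.29, 2.35, 2.24, 2.25, 2.41, 2.28, 2.39, 2.32, 2.47, 2.36]

# Sample SD divides by (n - 1), Bessel's correction; population SD divides by n.
print(f"{stdev(data):.3f}")    # ~0.074 (sample standard deviation, s)
print(f"{pstdev(data):.3f}")   # ~0.070 (population standard deviation, sigma)
```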
What is the impact of using estimated mean values from samples rather than population means, i.e., the impact of Bessel’s correction on the standard deviation?

This will have an impact in determining our Confidence Limits (Confidence Levels).

Outliers are determined by these limits.


Which measure of spread should you use?

Data type            Best measure of central tendency                 Measure of spread
Nominal              Mode                                             –
Ordinal              Mode, Median                                     Range, Interquartile Range
Interval and Ratio   Median (skewed, not normally distributed data)   Interquartile Range (skewed, not normally distributed data)
                     Mean (symmetrical, normally distributed data)    Standard Deviation (symmetrical, normally distributed data)
Remember this?

CONFIDENCE INTERVALS – % of values within the range:

68 %   Mean ± 1 SD
95 %   Mean ± 1.96 SD
99 %   Mean ± 2.58 SD

95 % Confidence Limits for Population Data

95 % Confidence Limits for Sample Data


So if we go back to our data:

Population data: μ = 2.34, σ = 0.070
95 % Confidence Limits = 2.34 ± (1.96 × 0.070)
2.20 – lower limit
2.48 – upper limit

Sample data: x̄ = 2.34, s = 0.075
95 % Confidence Limits = 2.34 ± (1.96 × 0.075)
2.19 – lower limit
2.49 – upper limit

Mean ± 1.96 × Standard Deviation gives the 95% Confidence Limits of the measurement process (lower limit – mean – upper limit).

There is more spread in the sample data than in the population data because there is more uncertainty in sample data.
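The same limits, as a minimal sketch:

```python
mean = 2.34

# 95% confidence limits of the measurement process: mean +/- 1.96 * SD
for label, sd in [("population (sigma = 0.070)", 0.070),
                  ("sample     (s     = 0.075)", 0.075)]:
    print(f"{label}: {mean - 1.96 * sd:.2f} to {mean + 1.96 * sd:.2f}")
# population (sigma = 0.070): 2.20 to 2.48
# sample     (s     = 0.075): 2.19 to 2.49
```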
And now the thinking person’s part……

Remember, the mean value of a sample data set is calculated by adding all the values together and dividing by the number of values:

x̄ = sum of values / number of values = Σxᵢ / n

The sample mean x̄ is only an estimate, as we do not have the full, clear picture of the true mean value, and this then has an impact when determining the measure of spread within the data.

But what happens if the same sample was measured five times in an experiment, and then
the experiment itself was repeated three times and the following data sets were returned?

Experiment 1 – 20.2, 20.4, 20.3, 20.5, 20.1
Experiment 2 – 20.8, 20.9, 20.5, 20.6, 20.7
Experiment 3 – 20.5, 20.4, 20.3, 20.6, 20.7


            Experiment 1   Experiment 2   Experiment 3   Combined
Data        20.2           20.8           20.5
            20.4           20.9           20.4
            20.3           20.5           20.3
            20.5           20.6           20.6
            20.1           20.7           20.7
n           5              5              5              15
Min         20.1           20.5           20.3           20.1
Max         20.5           20.9           20.7           20.9
Range       0.4            0.4            0.4            0.8
Mean        20.3           20.7           20.5           20.5
Median      20.3           20.7           20.5           20.5
Q1          20.15          20.55          20.35          20.3
Q3          20.45          20.85          20.65          20.7
IQR         0.3            0.3            0.3            0.4
Std Dev     0.16           0.16           0.16           0.22
Lower C.L.  19.99          20.39          20.19          20.06
Upper C.L.  20.61          21.01          20.81          20.94
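A quick sketch that reproduces the means and standard deviations in this table, showing that each experiment has the same spread (0.16) while the combined data are more spread out (0.22):

```python
from statistics import mean, stdev

exp1 = [20.2, 20.4, 20.3, 20.5, 20.1]
exp2 = [20.8, 20.9, 20.5, 20.6, 20.7]
exp3 = [20.5, 20.4, 20.3, 20.6, 20.7]
combined = exp1 + exp2 + exp3

for name, values in [("Exp 1", exp1), ("Exp 2", exp2),
                     ("Exp 3", exp3), ("Combined", combined)]:
    print(f"{name}: mean = {mean(values):.2f}, SD = {stdev(values):.2f}")
# Exp 1: mean = 20.30, SD = 0.16
# Exp 2: mean = 20.70, SD = 0.16
# Exp 3: mean = 20.50, SD = 0.16
# Combined: mean = 20.50, SD = 0.22
```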
Standard Error of the Mean (SEM) – a measure of the reliability of the mean.

If we take the combined set of data: x̄ = 20.5, s = 0.22, n = 15

SEM = s / √n = 0.22 / √15

SEM = 0.058

What does this mean?


95% Confidence Limits of the Mean

x̄ = 20.5, SEM = 0.058

Mean ± (1.96 × SEM) = 20.5 ± (1.96 × 0.058)

20.39 – lower limit
20.61 – upper limit

While the sample data returned a mean value of 20.5, the data is telling us that the true mean value lies somewhere between 20.39 and 20.61.
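A minimal sketch of the SEM and the 95% confidence limits of the mean for the combined data:

```python
from math import sqrt
from statistics import mean, stdev

combined = [20.2, 20.4, 20.3, 20.5, 20.1,
            20.8, 20.9, 20.5, 20.6, 20.7,
            20.5, 20.4, 20.3, 20.6, 20.7]

m = mean(combined)
sem = stdev(combined) / sqrt(len(combined))   # SEM = s / sqrt(n)

print(f"SEM = {sem:.3f}")                                                  # 0.058
print(f"95% CL of the mean: {m - 1.96 * sem:.2f} to {m + 1.96 * sem:.2f}")
# 95% CL of the mean: 20.39 to 20.61
```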
Comparing the two intervals:

± 1.96 × SEM gives the 95% Confidence Limits of the mean.

It gives us a clue to where the true mean value really lies.

± 1.96 × Standard Deviation gives the 95% Confidence Limits of the measurement process.

It gives us confidence in the true range of the data set.
It is easy to be confused about the difference between the standard deviation (SD) and the
standard error of the mean (SEM). The main differences are:

The SD quantifies scatter — how much the values vary from one another.

The SEM quantifies how precisely you know the true mean of the population. It takes into
account both the value of the SD and the sample size.

Both SD and SEM are in the same units - the units of the data.

The SEM, by definition, is always smaller than the SD.


The SEM gets smaller as your sample sizes get larger.
This makes sense, because the mean of a large sample size is likely to be closer to the true
population mean than is the mean of a small sample. With a huge sample, you'll know the
value of the mean with a lot of precision even if the data are very scattered.
The SD does not change predictably as you acquire more data. The SD you compute from a
sample is the best possible estimate of the SD of the overall population.
As you collect more data, you'll assess the SD of the population with more precision.
But you can't predict whether the SD from a larger sample will be bigger or smaller than the
SD from a small sample.
(This is not strictly true. It is the variance -- the SD squared -- that doesn't change
predictably, but the change in SD is trivial and much much smaller than the change in the
SEM.)
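This behaviour is easy to see with a small simulation. The sketch below (assuming numpy is available; the population parameters are made up for illustration) draws larger and larger samples from the same normal population: the SD fluctuates around the population value without shrinking, while the SEM keeps getting smaller:

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (5, 15, 100, 1000):
    sample = rng.normal(loc=20.5, scale=0.22, size=n)  # hypothetical population
    sd = sample.std(ddof=1)      # sample SD (with Bessel's correction)
    sem = sd / np.sqrt(n)        # standard error of the mean
    print(f"n = {n:4d}: SD ~ {sd:.3f}, SEM ~ {sem:.3f}")
# The SD stays in the region of 0.22 at every n; the SEM shrinks as n grows.
```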
So if we go back to our data:

Sample data: x̄ = 2.34, s = 0.075
95 % Confidence Limits = 2.34 ± (1.96 × 0.075)
2.19 – lower limit
2.49 – upper limit

x̄ = 2.34, s = 0.075, n = 10, SEM = 0.024
95 % Confidence Limits of the mean = 2.34 ± (1.96 × 0.024)
2.29 – lower limit
2.39 – upper limit

We are 95% confident that all further data, if measured the same way and belonging to the same data set, will fall within the range 2.19 to 2.49, with a sample mean value currently at 2.34, but 95% confident that the true mean value lies between 2.29 and 2.39.
Another way to express spread in the data in a general (relative) sense is to quote the Coefficient of Variation (CoV).

The CoV measures ‘relative variation’, i.e. it expresses the SD as a percentage of the mean:

CoV = (SD / Mean) × 100

Our first data set (x̄ = 2.34, s = 0.075) has a CoV of (0.075 / 2.34) × 100 = 3.2%.

The combined experiment data (x̄ = 20.5, s = 0.22) has a CoV of (0.22 / 20.5) × 100 = 1.1%.

It is hard to compare 2.34 with 20.5 as a measurement, but it is easy to compare relative spreads of 3.2% and 1.1%.
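As a sketch (the helper name `cov_percent` is my own):

```python
def cov_percent(sd, mean):
    """Coefficient of variation: the SD expressed as a percentage of the mean."""
    return sd / mean * 100

print(round(cov_percent(0.075, 2.34), 1))   # 3.2
print(round(cov_percent(0.22, 20.5), 1))    # 1.1
```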
95% Confidence Interval based on Normal (Z) distribution is approximately equal to
Sample Mean +/- (1.96 * SD)

95% Confidence Interval based on t-distribution has a wider range.


The t-distribution allows for the extra uncertainty of small samples and so gives a wider margin for error.

As the sample size increases, the t-distribution and normal distribution curves approach each other.

When the sample size is small, the t-distribution curve is lower around the mean and has wider, heavier tails.

When the sample size is small, the uncertainty of measurement is larger, values close to the true value are obtained less often, and the chance of greater spread is higher.

When measuring in the lab, the uncertainty of measurement is based on error: the fewer replicates you have, the greater the margin of error.
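To see how much wider the t-based interval is for small samples, here is a sketch (assuming scipy is available) comparing the two-sided 95% multipliers:

```python
from scipy.stats import norm, t

z = norm.ppf(0.975)                      # two-sided 95% multiplier for the normal
print(f"z: {z:.2f}")                     # 1.96

for n in (5, 10, 15, 30, 100):
    t_crit = t.ppf(0.975, df=n - 1)      # n - 1 degrees of freedom
    print(f"n = {n:3d}: t = {t_crit:.2f}")
# t falls from ~2.78 at n = 5 towards 1.96 as the sample size grows,
# so small samples get wider confidence intervals.
```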

Remember – Controlled Experience


Where do we go from here?

Significance testing – is the data real or did it just happen by chance?

Various Statistical Tests – Parametric and Non-Parametric.

Power calculations – how large should my sample size be?

Correlation – is there any relationship?

Regression – can I make a prediction? Within limits?

(Probability p = (1 − α), with α = 0.05 at the 95% confidence interval.)
