
Measures of spread / measures of distribution of data within the data set

Normal Distribution – Gaussian, bell-shaped curve:

±1 SD contains 68.2% of the values
±2 SD contains 95.4% of the values
±3 SD contains 99.7% of the values
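As a quick check of these percentages, here is a minimal sketch (assuming the scipy library is available) that computes the coverage of ±1, ±2 and ±3 SD for a normal distribution:

```python
from scipy.stats import norm

# Proportion of a normal distribution lying within +/- k standard deviations
# of the mean: P(-k < Z < k) = cdf(k) - cdf(-k).
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"+/-{k} SD: {coverage:.1%}")   # ~68.3%, 95.4%, 99.7%
```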


Measure of spread

Skewed data distribution (non-Gaussian, non-normal distribution):
Range
Median
Interquartile Range

Not skewed (normal, Gaussian distribution):
Of the measurement process: Mean, Standard Deviation, Confidence Limits of the Range
Of the mean: Mean, Standard Error of the Mean, Confidence Limits of the Mean
Range (minimum to maximum)

Data (as collected): 2.29, 2.35, 2.24, 2.25, 2.41, 2.28, 2.39, 2.32, 2.47, 2.36
Data (sorted):       2.24, 2.25, 2.28, 2.29, 2.32, 2.35, 2.36, 2.39, 2.41, 2.47

Minimum = 2.24, Maximum = 2.47

Range = 2.47 - 2.24 = 0.23
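A minimal sketch of the same calculation in Python (the variable names are my own):

```python
data = [2.29, 2.35, 2.24, 2.25, 2.41, 2.28, 2.39, 2.32, 2.47, 2.36]

minimum = min(data)
maximum = max(data)
data_range = maximum - minimum   # "range" is a Python built-in, so use another name

print(minimum, maximum, round(data_range, 2))   # 2.24 2.47 0.23
```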
Interquartile Range (the middle 50% of the data)
Order the data low to high

Find the value at 25% - quartile 1 (Q1)

Find the value at 50% - quartile 2 (Q2). This is also the median value for the data set.
Find the value at 75% - quartile 3 (Q3)
Interquartile Range
Interquartile range = (quartile 3 – quartile 1) = (Q3 – Q1)

25% → Q1      50% → Q2      75% → Q3
2.24 2.25 2.28 2.29 2.32 2.35 2.36 2.39 2.41 2.47

Median = (2.32 + 2.35) / 2 = 2.335, so the median = quartile 2 = Q2 = 2.335

2.24 2.25 2.28 2.29 2.32 2.35 2.36 2.39 2.41 2.47

As we have an even number of data points, we split the data into two halves:

2.24  2.25  2.28  2.29  2.32   |   2.35  2.36  2.39  2.41  2.47

The median of the lower half of the data is quartile 1, Q1 = 2.28. The median of the upper half of the data is quartile 3, Q3 = 2.39.

Interquartile range is quartile 3 – quartile 1 = (Q3 – Q1) = (2.39 – 2.28) = 0.11


Summary for this data set:

minimum = 2.24
Q1 = 2.28
Q2 (median) = 2.335
Q3 = 2.39
maximum = 2.47
Interquartile range (IQR) = Q3 - Q1 = 0.11

Outliers – values lying more than 1.5 × IQR below Q1 or above Q3, often marked with an asterisk (*)


If we had an odd number of data points, e.g. 11, 12, 13, 15, 16, 17, 18:

11, 12, 13, 15, 16, 17, 18 Median = Q2 = 15

Remove the median and split the data

11, 12, 13   |   16, 17, 18

The median of the lower half is quartile 1, Q1 = 12. The median of the upper half is quartile 3, Q3 = 17.

Interquartile range is quartile 3 – quartile 1 = (Q3 – Q1) = (17 – 12) = 5
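Putting both cases together, here is a sketch of the quartile method used on these slides (median of the lower and upper halves, dropping the middle value when the count is odd). Note that spreadsheets and statistics libraries often use slightly different quartile conventions, so their Q1 and Q3 may differ a little from these values. The function name `quartiles` is my own:

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 using the slides' method: Q2 is the median; Q1 and Q3 are the
    medians of the lower and upper halves (the median itself is excluded from
    the halves when the number of values is odd)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

q1, q2, q3 = quartiles([2.29, 2.35, 2.24, 2.25, 2.41, 2.28, 2.39, 2.32, 2.47, 2.36])
print(q1, round(q2, 3), q3, round(q3 - q1, 2))   # 2.28 2.335 2.39 0.11

q1, q2, q3 = quartiles([11, 12, 13, 15, 16, 17, 18])
print(q1, q2, q3, q3 - q1)                       # 12 15 17 5
```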


Standard Deviation – captures how much, on average, each value deviates from the mean of the data.

Variance – a measure of dispersion relative to the scatter of the values about the mean.

Sample variance:  s² = Σ(x − x̄)² / (n − 1)
Why two formulae?
N vs n

Population (N) vs sample (n-1) data

(n – 1) is called Bessel’s correction and takes into account that:

the sample standard deviation (s) is calculated using the sample mean, which is only an estimate;

the population standard deviation (σ) is calculated using the true population mean, μ.
Variance – calculate the differences from the mean, then square the differences.

Let us look at our set of data (mean x̄ = 2.34):

x       (x − x̄)    (x − x̄)²
2.29    -0.050     0.0025
2.35     0.010     0.0001
2.24    -0.100     0.0100
2.25    -0.090     0.0081
2.41     0.070     0.0049
2.28    -0.060     0.0036
2.39     0.050     0.0025
2.32    -0.020     0.0004
2.47     0.130     0.0169
2.36     0.020     0.0004

Sample number n = 10, sum of the squares Σ(x − x̄)² = 0.0494

s² = Σ(x − x̄)² / (n − 1)
Variance – if the data were sample data:

s² = Σ(x − x̄)² / (n − 1) = 0.0494 / (10 − 1) = 0.0494 / 9 = 0.0055

Variance – if the data were population data:

σ² = Σ(x − μ)² / N = 0.0494 / 10 = 0.0049
Sample data:  s = √0.0055 = 0.075

Population data:  σ = √0.0049 = 0.070
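A short sketch of both calculations using Python's statistics module (the slides round the mean to 2.34, which is why they quote 0.075 rather than 0.074 for the sample SD):

```python
from statistics import stdev, pstdev

data = [2.29, 2.35, 2.24, 2.25, 2.41, 2.28, 2.39, 2.32, 2.47, 2.36]

# Sample SD divides by (n - 1), Bessel's correction; population SD divides by n.
print(f"{stdev(data):.3f}")    # ~0.074 (sample standard deviation, s)
print(f"{pstdev(data):.3f}")   # ~0.070 (population standard deviation, sigma)
```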
What is the impact of using estimated mean values from samples rather than population means, i.e., the impact of Bessel’s correction on the standard deviation?

This will have an impact in determining our Confidence Limits (Confidence Levels).

Outliers are determined by these limits.


Which measure of spread should you use?

Data type            Best measure of central tendency                 Measure of spread
Nominal              Mode                                             –
Ordinal              Mode, Median                                     Range, Interquartile Range
Interval and Ratio   Median (skewed, not normally distributed data)   Interquartile Range (skewed, not normally distributed data)
                     Mean (symmetrical, normally distributed data)    Standard Deviation (symmetrical, normally distributed data)
Remember this?

CONFIDENCE INTERVALS – % of values within the range:

68 %   Mean ± 1 SD
95 %   Mean ± 1.96 SD
99 %   Mean ± 2.58 SD

95 % Confidence Limits for Population Data

95 % Confidence Limits for Sample Data


So if we go back to our data:

Population data: μ = 2.34, σ = 0.070
95 % Confidence Limits = 2.34 ± (1.96 × 0.070)
2.20 – lower limit
2.48 – upper limit

Sample data: x̄ = 2.34, s = 0.075
95 % Confidence Limits = 2.34 ± (1.96 × 0.075)
2.19 – lower limit
2.49 – upper limit

Mean ± 1.96 × Standard Deviation gives the 95% Confidence Limits of the measurement process (lower limit – mean – upper limit).

There is more spread in the sample data than in the population data because there is more uncertainty in sample data.
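The same limits, as a minimal sketch:

```python
mean = 2.34

# 95% confidence limits of the measurement process: mean +/- 1.96 * SD
for label, sd in [("population (sigma = 0.070)", 0.070),
                  ("sample     (s     = 0.075)", 0.075)]:
    print(f"{label}: {mean - 1.96 * sd:.2f} to {mean + 1.96 * sd:.2f}")
# population (sigma = 0.070): 2.20 to 2.48
# sample     (s     = 0.075): 2.19 to 2.49
```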
And now the thinking person’s part……

Remember, the mean value of a sample data set is calculated by adding all the values together and dividing by the number of values:

x̄ = sum of values / number of values = Σxᵢ / n

The sample mean x̄ is only an estimate, as we do not have the full, clear picture of the true mean value, and this then has an impact when determining the measure of spread within the data.

But what happens if the same sample was measured five times in an experiment, and then
the experiment itself was repeated three times and the following data sets were returned?

Experiment 1 – 20.2, 20.4, 20.3, 20.5, 20.1
Experiment 2 – 20.8, 20.9, 20.5, 20.6, 20.7
Experiment 3 – 20.5, 20.4, 20.3, 20.6, 20.7


            Experiment 1   Experiment 2   Experiment 3   Combined
Data        20.2           20.8           20.5
            20.4           20.9           20.4
            20.3           20.5           20.3
            20.5           20.6           20.6
            20.1           20.7           20.7
n           5              5              5              15
Min         20.1           20.5           20.3           20.1
Max         20.5           20.9           20.7           20.9
Range       0.4            0.4            0.4            0.8
Mean        20.3           20.7           20.5           20.5
Median      20.3           20.7           20.5           20.5
Q1          20.15          20.55          20.35          20.3
Q3          20.45          20.85          20.65          20.7
IQR         0.3            0.3            0.3            0.4
Std Dev     0.16           0.16           0.16           0.22
Lower C.L.  19.99          20.39          20.19          20.06
Upper C.L.  20.61          21.01          20.81          20.94
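A quick sketch that reproduces the means and standard deviations in this table, showing that each experiment has the same spread (0.16) while the combined data are more spread out (0.22):

```python
from statistics import mean, stdev

exp1 = [20.2, 20.4, 20.3, 20.5, 20.1]
exp2 = [20.8, 20.9, 20.5, 20.6, 20.7]
exp3 = [20.5, 20.4, 20.3, 20.6, 20.7]
combined = exp1 + exp2 + exp3

for name, values in [("Exp 1", exp1), ("Exp 2", exp2),
                     ("Exp 3", exp3), ("Combined", combined)]:
    print(f"{name}: mean = {mean(values):.2f}, SD = {stdev(values):.2f}")
# Exp 1: mean = 20.30, SD = 0.16
# Exp 2: mean = 20.70, SD = 0.16
# Exp 3: mean = 20.50, SD = 0.16
# Combined: mean = 20.50, SD = 0.22
```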
Standard Error of the Mean (SEM) – a measure of the reliability of the mean.

If we take the combined set of data: x̄ = 20.5, s = 0.22, n = 15

SEM = s / √n = 0.22 / √15

SEM = 0.058

What does this mean?


95% Confidence Limits of the Mean

x̄ = 20.5, SEM = 0.058

Mean ± (1.96 × SEM) = 20.5 ± (1.96 × 0.058)

20.39 – lower limit
20.61 – upper limit

While the sample data returned a mean value of 20.5, the data is telling us that the true mean value lies somewhere between 20.39 and 20.61.
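A minimal sketch of the SEM and the 95% confidence limits of the mean for the combined data:

```python
from math import sqrt
from statistics import mean, stdev

combined = [20.2, 20.4, 20.3, 20.5, 20.1,
            20.8, 20.9, 20.5, 20.6, 20.7,
            20.5, 20.4, 20.3, 20.6, 20.7]

m = mean(combined)
sem = stdev(combined) / sqrt(len(combined))   # SEM = s / sqrt(n)

print(f"SEM = {sem:.3f}")                                                  # 0.058
print(f"95% CL of the mean: {m - 1.96 * sem:.2f} to {m + 1.96 * sem:.2f}")
# 95% CL of the mean: 20.39 to 20.61
```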
Comparing the two intervals:

± 1.96 × SEM gives the 95% Confidence Limits of the mean.

It gives us a clue to where the true mean value really lies.

± 1.96 × Standard Deviation gives the 95% Confidence Limits of the measurement process.

It gives us confidence in the true range of the data set.
It is easy to be confused about the difference between the standard deviation (SD) and the
standard error of the mean (SEM). The main differences are:

The SD quantifies scatter — how much the values vary from one another.

The SEM quantifies how precisely you know the true mean of the population. It takes into
account both the value of the SD and the sample size.

Both SD and SEM are in the same units - the units of the data.

The SEM, by definition, is always smaller than the SD.


The SEM gets smaller as your sample sizes get larger.
This makes sense, because the mean of a large sample size is likely to be closer to the true
population mean than is the mean of a small sample. With a huge sample, you'll know the
value of the mean with a lot of precision even if the data are very scattered.
The SD does not change predictably as you acquire more data. The SD you compute from a
sample is the best possible estimate of the SD of the overall population.
As you collect more data, you'll assess the SD of the population with more precision.
But you can't predict whether the SD from a larger sample will be bigger or smaller than the
SD from a small sample.
(This is not strictly true. It is the variance -- the SD squared -- that doesn't change
predictably, but the change in SD is trivial and much much smaller than the change in the
SEM.)
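This behaviour is easy to see with a small simulation. The sketch below (assuming numpy is available; the population parameters are made up for illustration) draws larger and larger samples from the same normal population: the SD fluctuates around the population value without shrinking, while the SEM keeps getting smaller:

```python
import numpy as np

rng = np.random.default_rng(0)

for n in (5, 15, 100, 1000):
    sample = rng.normal(loc=20.5, scale=0.22, size=n)  # hypothetical population
    sd = sample.std(ddof=1)      # sample SD (with Bessel's correction)
    sem = sd / np.sqrt(n)        # standard error of the mean
    print(f"n = {n:4d}: SD ~ {sd:.3f}, SEM ~ {sem:.3f}")
# The SD stays in the region of 0.22 at every n; the SEM shrinks as n grows.
```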
So if we go back to our data:

Sample data: x̄ = 2.34, s = 0.075
95 % Confidence Limits = 2.34 ± (1.96 × 0.075)
2.19 – lower limit
2.49 – upper limit

x̄ = 2.34, s = 0.075, n = 10, SEM = 0.024
95 % Confidence Limits of the mean = 2.34 ± (1.96 × 0.024)
2.29 – lower limit
2.39 – upper limit

We are 95% confident that all further data, if measured the same way and belonging to the same data set, will fall within the range 2.19 to 2.49, with a sample mean value currently at 2.34, but 95% confident that the true mean value lies between 2.29 and 2.39.
Another way to express spread in the data in a general (relative) sense is to quote the Coefficient of Variation (CoV).

The CoV measures ‘relative variation’, i.e. it expresses the SD as a percentage of the mean:

CoV = (SD / Mean) × 100

Our first data set (x̄ = 2.34, s = 0.075) has a CoV of (0.075 / 2.34) × 100 = 3.2%.

The combined experiment data (x̄ = 20.5, s = 0.22) has a CoV of (0.22 / 20.5) × 100 = 1.1%.

It is hard to compare 2.34 with 20.5 as a measurement, but it is easy to compare relative spreads of 3.2% and 1.1%.
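As a sketch (the helper name `cov_percent` is my own):

```python
def cov_percent(sd, mean):
    """Coefficient of variation: the SD expressed as a percentage of the mean."""
    return sd / mean * 100

print(round(cov_percent(0.075, 2.34), 1))   # 3.2
print(round(cov_percent(0.22, 20.5), 1))    # 1.1
```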
95% Confidence Interval based on Normal (Z) distribution is approximately equal to
Sample Mean +/- (1.96 * SD)

95% Confidence Interval based on t-distribution has a wider range.


The t-distribution allows for the extra uncertainty of small samples and so gives a wider margin for error.

As the sample size increases, the t-distribution and normal distribution curves approach each other.

When the sample size is small, the t-distribution curve is lower around the mean and has wider, heavier tails.

When the sample size is small, the uncertainty of measurement is larger, values close to the true value are obtained less often, and the chance of greater spread is higher.

When measuring in the lab, the uncertainty of measurement is based on error: the fewer replicates you have, the greater the margin of error.
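To see how much wider the t-based interval is for small samples, here is a sketch (assuming scipy is available) comparing the two-sided 95% multipliers:

```python
from scipy.stats import norm, t

z = norm.ppf(0.975)                      # two-sided 95% multiplier for the normal
print(f"z: {z:.2f}")                     # 1.96

for n in (5, 10, 15, 30, 100):
    t_crit = t.ppf(0.975, df=n - 1)      # n - 1 degrees of freedom
    print(f"n = {n:3d}: t = {t_crit:.2f}")
# t falls from ~2.78 at n = 5 towards 1.96 as the sample size grows,
# so small samples get wider confidence intervals.
```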

Remember – Controlled Experience


Where do we go from here?

Significance testing – is the data real or did it just happen by chance?

Various Statistical Tests – Parametric and Non-Parametric.

Power calculations – how large should my sample size be?

Correlation – is there any relationship?

Regression – can I make a prediction? Within limits?

(Probability p = (1 − α), with α = 0.05 at the 95% confidence interval.)
