Data Analysis - Calculation of Spread
Normal distribution curve: ±1 SD (68.2%), ±2 SD (95.4%), ±3 SD (99.7%)
Measures of spread:
Range
Mean, Standard Deviation, Confidence Limits of the Range
Mean, Standard Error of the Mean, Confidence Limits of the Mean
Not skewed, normal distribution, Gaussian distribution
Range = (maximum – minimum)
Find the value at 25% - quartile 1 (Q1)
Find the value at 50% - quartile 2 (Q2). This is also the median value for the data set
Find the value at 75% - quartile 3 (Q3)
Interquartile Range
Interquartile range = (quartile 3 – quartile 1) = (Q3 – Q1)
Ordered data set, with Q1, Q2 and Q3 marked:
2.24 2.25 2.28 2.29 2.32 2.35 2.36 2.39 2.41 2.47
The median of the lower part of the data is quartile 1, Q1.
The median of the upper part of the data is quartile 3, Q3.
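The median-of-halves rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `quartiles_by_halves` is made up for this example, and for an odd number of values it assumes the middle value is left out of both halves, which is one of several conventions in use.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two values. For an odd count the middle value
    is excluded from both halves (an assumption, not the only convention).
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]     # lower part of the ordered data
    upper = data[-half:]    # upper part of the ordered data
    return median(lower), median(data), median(upper)

data = [2.24, 2.25, 2.28, 2.29, 2.32, 2.35, 2.36, 2.39, 2.41, 2.47]
q1, q2, q3 = quartiles_by_halves(data)
iqr = q3 - q1               # interquartile range = Q3 - Q1
print(f"Q1={q1:.3f}  Q2={q2:.3f}  Q3={q3:.3f}  IQR={iqr:.3f}")
```

Spreadsheet and library quartile functions often interpolate differently, so small differences from this method are to be expected.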
Sample variance:
$$ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} $$
Why two formulae?
N vs n
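Sample standard deviation (s) is calculated using the sample mean x̄, which is only an estimate of the true mean, so the sum of the squares is divided by n – 1.
Population standard deviation (σ) is calculated using the true population mean μ, so the sum of the squares is divided by N.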
Variance
Calculate the differences from the mean, then the square of the differences. Let us look at our set of data (mean x̄ = 2.34):
x       x − x̄     (x − x̄)²
2.29    -0.050    0.0025
2.35     0.010    0.0001
2.24    -0.100    0.0100
2.25    -0.090    0.0081
2.41     0.070    0.0049
2.28    -0.060    0.0036
2.39     0.050    0.0025
2.32    -0.020    0.0004
2.47     0.130    0.0169
2.36     0.020    0.0004
Sample number: n = 10
Sum of the squares: $$ \sum (x - \bar{x})^2 = 0.0494 $$
$$ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} $$
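As a check on the arithmetic, the table of differences and squared differences can be rebuilt with a short script. This is a minimal sketch that assumes, as the table appears to, that the mean is rounded to two decimal places (2.34) before the differences are taken.

```python
data = [2.29, 2.35, 2.24, 2.25, 2.41, 2.28, 2.39, 2.32, 2.47, 2.36]

n = len(data)
mean = round(sum(data) / n, 2)   # 2.336 rounded to 2.34, as in the table

sum_of_squares = 0.0
for x in data:
    diff = x - mean              # difference from the (rounded) mean
    sum_of_squares += diff ** 2  # accumulate the squared differences
    print(f"{x:.2f}  {diff:+.3f}  {diff ** 2:.4f}")

print(f"n = {n}, sum of the squares = {sum_of_squares:.4f}")
```

Using the unrounded mean (2.336) gives a slightly smaller sum of squares, so later decimals can differ a little from the hand calculation.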
Variance – if the data was sample data
$$ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{0.0494}{10 - 1} = \frac{0.0494}{9} = 0.0055 $$
s = √0.0055 = 0.074
Population data
$$ \sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{0.0494}{10} = 0.0049 $$
σ = √0.0049 = 0.070
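The two formulae can be compared directly with Python's standard `statistics` module, where `stdev` divides by n – 1 and `pstdev` divides by N. The sketch below uses the unrounded mean, so its results may differ from the hand calculation in the later decimals. The variable name `bessel_factor` is just a label chosen for this example.

```python
import math
import statistics

data = [2.24, 2.25, 2.28, 2.29, 2.32, 2.35, 2.36, 2.39, 2.41, 2.47]
n = len(data)

s = statistics.stdev(data)        # sample SD: sum of squares / (n - 1)
sigma = statistics.pstdev(data)   # population SD: sum of squares / N

# The two results always differ by the same factor, sqrt(n / (n - 1)),
# which shrinks towards 1 as the sample size grows.
bessel_factor = math.sqrt(n / (n - 1))

print(f"sample SD     = {s:.3f}")
print(f"population SD = {sigma:.3f}")
print(f"ratio         = {s / sigma:.4f} (sqrt(n/(n-1)) = {bessel_factor:.4f})")
```

For n = 10 the sample value is about 5% larger than the population value; for n = 100 the difference is about 0.5%.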
What is the impact of using estimated mean values from samples rather than population
means, i.e. what is the impact of Bessel’s correction on the standard deviation?
This will have an impact in determining our Confidence Limits or Confidence Levels
68 % Mean ± 1 SD
95 % Mean ± 1.96 SD
99 % Mean ± 2.58 SD
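The link between a multiplier such as 1.96 and a confidence level can be checked against the standard normal distribution. A minimal sketch, assuming normally distributed data and using `statistics.NormalDist` from the Python standard library (the helper name `coverage` is made up for this example):

```python
from statistics import NormalDist

def coverage(z):
    """Fraction of a normal distribution lying within mean ± z SD."""
    std_normal = NormalDist()   # mean 0, SD 1
    return std_normal.cdf(z) - std_normal.cdf(-z)

for z in (1.0, 1.96, 2.58):
    print(f"mean ± {z} SD covers {coverage(z) * 100:.1f}% of values")
```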
More spread in sample data than there is in population data because there is more uncertainty in sample data
And now the thinking person’s part……
Remember, the mean value of a sample data set is calculated by adding all the values
together and dividing by the number of values:
$$ \bar{x} = \frac{\text{sum of values}}{\text{number of values}} = \frac{\sum x_i}{n} $$
The sample mean is only an estimate, as we do not have the full clear picture of the true
mean value, and this then has an impact when determining the measure of spread within the data.
But what happens if the same sample was measured five times in an experiment, and then
the experiment itself was repeated three times and the following data sets were returned?
Experiment 1 – 20.2, 20.4, 20.3, 20.5, 20.1
Experiment 2 – 20.8, 20.9, 20.5, 20.6, 20.7
Experiment 3 – 20.5, 20.4, 20.3, 20.6, 20.7
Standard Error of the Mean - SEM
Measure of reliability of the mean: SEM = SD / √n
For the combined data (n = 15, SD = 0.22, or 0.2236 before rounding):
SEM = 0.2236 / √15
SEM = 0.058
While the sample data returned a mean value of 20.5, the data is telling us, with 95% confidence, that the true mean value lies
somewhere between 20.39 (lower limit) and 20.61 (upper limit)
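The SEM and the limits for the mean can be reproduced from the fifteen combined measurements. This is a sketch under the assumption that the limits are taken as mean ± 1.96 SEM (the normal approximation); the function name `mean_confidence_limits` is made up for this example.

```python
import math
import statistics

experiments = [
    [20.2, 20.4, 20.3, 20.5, 20.1],   # Experiment 1
    [20.8, 20.9, 20.5, 20.6, 20.7],   # Experiment 2
    [20.5, 20.4, 20.3, 20.6, 20.7],   # Experiment 3
]
combined = [x for run in experiments for x in run]

def mean_confidence_limits(values, z=1.96):
    """Return (mean, SEM, lower limit, upper limit) for the mean."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))  # SD / sqrt(n)
    return mean, sem, mean - z * sem, mean + z * sem

mean, sem, lower, upper = mean_confidence_limits(combined)
print(f"mean = {mean:.2f}, SEM = {sem:.3f}")
print(f"95% limits of the mean: {lower:.2f} to {upper:.2f}")
```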
              Experiment 1   Experiment 2   Experiment 3   Combined
Data          20.2           20.8           20.5
              20.4           20.9           20.4
              20.3           20.5           20.3
              20.5           20.6           20.6
              20.1           20.7           20.7
n             5              5              5              15
Min           20.1           20.5           20.3           20.1
Max           20.5           20.9           20.7           20.9
Range         0.4            0.4            0.4            0.8
Mean          20.3           20.7           20.5           20.5
Median        20.3           20.7           20.5           20.5
Q1            20.15          20.55          20.35          20.3
Q3            20.45          20.85          20.65          20.7
IQR           0.3            0.3            0.3            0.4
Std Dev       0.16           0.16           0.16           0.22
Lower C.L.    19.99          20.39          20.19          20.06
Upper C.L.    20.61          21.01          20.81          20.94
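The summary rows can be regenerated with the standard library. Two things in this sketch are inferred from the numbers rather than stated in the slides: the quartiles match the "exclusive" interpolation method of `statistics.quantiles`, and the confidence limit rows match mean ± 1.96 SD. The function name `summarise` is made up for this example.

```python
import statistics

experiments = {
    "Experiment 1": [20.2, 20.4, 20.3, 20.5, 20.1],
    "Experiment 2": [20.8, 20.9, 20.5, 20.6, 20.7],
    "Experiment 3": [20.5, 20.4, 20.3, 20.6, 20.7],
}
experiments["Combined"] = [x for run in list(experiments.values()) for x in run]

def summarise(values, z=1.96):
    """Descriptive statistics for one column of the table."""
    # Assumption: exclusive quartile method, which reproduces the table values.
    q1, median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)            # sample SD (n - 1)
    return {
        "n": len(values), "min": min(values), "max": max(values),
        "range": max(values) - min(values),
        "mean": mean, "median": median, "q1": q1, "q3": q3, "iqr": q3 - q1,
        "sd": sd, "lower": mean - z * sd, "upper": mean + z * sd,
    }

summaries = {name: summarise(values) for name, values in experiments.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

Note how the three experiments each have the same spread (SD 0.16), while the combined data has a larger spread (SD 0.22) because the experiment means differ.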
Confidence limits of the mean = mean ± 1.96 SEM
The SD quantifies scatter — how much the values vary from one another.
The SEM quantifies how precisely you know the true mean of the population. It takes into
account both the value of the SD and the sample size.
Both SD and SEM are in the same units - the units of the data.
For the first data set (sample mean currently 2.34), we can be 95% confident that further data, if measured the same way
and belonging to the same data set, will fall within the range 2.19 to 2.49 (mean ± 1.96 SD), and 95% confident that the true
mean value lies between 2.29 and 2.39 (mean ± 1.96 SEM)
Another way to express spread in the data in a general (relative) sense is to quote
the Coefficient of Variation (CoV)
CoV = (SD / Mean) x 100
The data set 2.24, 2.25, 2.28, 2.29, 2.32, 2.35, 2.36, 2.39, 2.41, 2.47 has a CoV of (0.074/2.34)*100 = 3.2%
The combined experiment data has a CoV of (0.22/20.5)*100 = 1.1%
It is hard to compare 2.34 with 20.5 as a measurement but it is easy to compare relative spreads of 3.2% and 1.1%
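The CoV calculation is short enough to wrap in a helper. A minimal sketch, assuming the sample standard deviation (n – 1) is the one intended; the function name `coefficient_of_variation` is made up for this example.

```python
import statistics

def coefficient_of_variation(values):
    """Sample SD expressed as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

first_set = [2.24, 2.25, 2.28, 2.29, 2.32, 2.35, 2.36, 2.39, 2.41, 2.47]
combined = [
    20.2, 20.4, 20.3, 20.5, 20.1,   # Experiment 1
    20.8, 20.9, 20.5, 20.6, 20.7,   # Experiment 2
    20.5, 20.4, 20.3, 20.6, 20.7,   # Experiment 3
]

cov_first = coefficient_of_variation(first_set)
cov_combined = coefficient_of_variation(combined)
print(f"first data set: CoV = {cov_first:.1f}%")
print(f"combined data:  CoV = {cov_combined:.1f}%")
```

Because the CoV is a ratio, it only makes sense for data measured on a scale with a true zero and a mean well away from zero.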
The 95% confidence limits based on the Normal (Z) distribution are approximately equal to
Sample Mean +/- (1.96 * SD) for the range of the data, and Sample Mean +/- (1.96 * SEM) for the mean
As the sample size increases, the t-distribution and normal distribution curves approach
each other.
When the sample size is small, the t-distribution curve has a lower peak at the mean and
wider, heavier tails.
When the sample size is small, the uncertainty of measurement is larger: values close to the
true value are less likely and a wider spread is more likely.
When measuring in the lab, the uncertainty of measurement is based on error; the fewer
replicates you have, the greater the margin of error.
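The effect of a small sample can be seen by comparing the t multiplier with the normal multiplier of 1.96. The sketch below is illustrative only: the t values are two-tailed 95% critical values copied from a standard t table (the Python standard library has no t-distribution), and the dictionary name `T_CRIT_95` is made up for this example.

```python
import math
import statistics
from statistics import NormalDist

# Two-tailed 95% critical values of Student's t, keyed by degrees of
# freedom (n - 1). Copied from a standard t table, not computed here.
T_CRIT_95 = {2: 4.303, 4: 2.776, 9: 2.262, 14: 2.145, 29: 2.045}

alpha = 0.05                                   # 95% confidence level
z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96

for df, t_crit in T_CRIT_95.items():
    print(f"n = {df + 1:2d}: t = {t_crit:.3f}  vs  z = {z_crit:.3f}")

# Effect on the limits of the mean for the ten-value data set
data = [2.24, 2.25, 2.28, 2.29, 2.32, 2.35, 2.36, 2.39, 2.41, 2.47]
n = len(data)
sem = statistics.stdev(data) / math.sqrt(n)
z_half_width = z_crit * sem
t_half_width = T_CRIT_95[n - 1] * sem
print(f"mean ± {z_half_width:.3f} (z)  vs  mean ± {t_half_width:.3f} (t)")
```

With only ten values the t-based limits are noticeably wider; by thirty values the two multipliers are already close.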
probability p = (1 - α)
α = 0.05 at the 95% confidence level
Correlation – is there any relationship?