Unit 4 Basic Statistics
Topic Outline:
⚫ Gaussian distribution
⚫ Mean
⚫ Standard deviation
⚫ Confidence intervals
⚫ F-test
⚫ Student’s t-test
⚫ Grubbs’ test
⚫ Calibration curve
Learning Objectives
Statistical laws have been derived for populations, but they can be used for samples
after suitable modification. Such modifications are needed for small samples
because a few data points may not represent the entire population.
Figure 1.0 shows two Gaussian curves in which we plot the relative frequency y of
various deviations from the mean versus the deviation from the mean. As shown in
the margin, curves such as these can be described by an equation that contains just
two parameters, the population mean μ and the population standard deviation σ.
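The margin equation does not survive in this copy; the standard form of the Gaussian (normal error) curve, in the notation used here, is

```latex
y = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]
```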
The term parameter refers to quantities such as μ and σ that define a population
or distribution. Data values such as x are variables.
The term statistic refers to an estimate of a parameter that is made from a sample of
data.
The sample mean and the sample standard deviation are examples of statistics that
estimate the parameters μ and σ, respectively.
Figure 1.0 Gaussian Curve adapted from Fundamentals of Analytical Chemistry (9th
ed.) (page 99), by D. A. Skoog et al., 2014, Brooks/Cole
The sample mean x is the arithmetic average of a limited sample drawn from a
population of data. The sample mean is defined as the sum of the measurement
values divided by the number of measurements as given by Equation
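The referenced equation is missing from this copy; in the usual notation, the sample mean of N measurements is

```latex
\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}
```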
The population mean μ, in contrast, is the true mean for the population. It is also
defined by the equation above with the added provision that N represents the total
number of measurements in the population. In the absence of systematic error, the
population mean is also the true value for the measured quantity.
The two curves in Figure 1.0a are for two populations of data that differ only in their
standard deviations. The standard deviation for the data set yielding the broader
but lower curve B is twice that for the measurements yielding curve A.
The breadth of these curves is a measure of the precision of the two sets of data.
Thus, the precision of the data set leading to curve A is twice as good as that of the
data set represented by curve B.
Figure 1.0b shows another type of normal error curve in which the x axis is now a
new variable z, which is defined as
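The definition of z does not survive in this copy; its standard form is

```latex
z = \frac{x - \mu}{\sigma}
```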
Note that z is the relative deviation of a data point from the mean, that is, the
deviation relative to the standard deviation. Hence, when x − μ = σ, z is equal to one;
when x − μ = 2σ, z is equal to two; and so forth. Since z is the deviation from the
mean relative to the standard deviation, a plot of relative frequency versus z yields a
single Gaussian curve that describes all populations of data regardless of standard
deviation.
Thus, Figure 1.0b is the normal error curve for both sets of data used to plot curves
A and B in Figure 1.0a.
Because it appears in the Gaussian error curve expression, the square of the
standard deviation, σ², is also important. This quantity is called the variance.
The equation for computing a pooled standard deviation from several sets of data
takes the form
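The equation itself is missing here; the standard form for pooling t data sets (as in Skoog et al.) is

```latex
s_{\text{pooled}} = \sqrt{\frac{\sum_{i=1}^{N_1}(x_i - \bar{x}_1)^2 + \sum_{j=1}^{N_2}(x_j - \bar{x}_2)^2 + \cdots}{N_1 + N_2 + \cdots - N_t}}
```

where N₁, N₂, … are the numbers of measurements in each set and Nₜ is the number of data sets pooled.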
The variance is just the square of the standard deviation. The sample variance s² is
an estimate of the population variance σ² and is given by
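The equation does not survive in this copy; the usual expression for the sample variance is

```latex
s^2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N - 1}
```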
Frequently standard deviations are given in relative rather than absolute terms. We
calculate the relative standard deviation by dividing the standard deviation by the
mean value of the data set. The relative standard deviation, RSD, is sometimes given
the symbol sr.
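As a minimal sketch, the mean, sample standard deviation, and RSD can be computed with Python's standard library (the data values here are illustrative, not from the text):

```python
import statistics

# Illustrative replicate measurements (e.g., % K in a sample)
data = [7.23, 7.28, 7.21, 7.30, 7.25]

mean = statistics.mean(data)   # sample mean, x-bar
s = statistics.stdev(data)     # sample standard deviation (N - 1 in the denominator)
rsd = s / mean                 # relative standard deviation, s_r

print(f"mean = {mean:.3f}, s = {s:.4f}, RSD = {100 * rsd:.2f}%")
```

Note that `statistics.stdev` uses the N − 1 (sample) denominator; `statistics.pstdev` would give the population form.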
The spread, or range, w, is another term that is sometimes used to describe the
precision of a set of replicate results. It is the difference between the largest value in
the set and the smallest.
Often we must estimate the standard deviation of a result that has been calculated
from two or more experimental data points, each of which has a known sample
standard deviation.
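For a result y = f(a, b, …) computed from independent measured quantities, the usual propagation-of-uncertainty expression is

```latex
s_y = \sqrt{\left(\frac{\partial y}{\partial a}\right)^2 s_a^2 + \left(\frac{\partial y}{\partial b}\right)^2 s_b^2 + \cdots}
```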
Confidence Intervals
In most quantitative chemical analyses, the true value of the mean μ cannot be
determined because a huge number of measurements (approaching infinity) would
be required. With statistics, however, we can establish an interval surrounding the
experimentally determined mean x͞ within which the population mean μ is expected
to lie with a certain degree of probability. This interval is known as the confidence
interval.
Sometimes the limits of the interval are called confidence limits. For example, we
might say that it is 99% probable that the true population mean for a set of
potassium measurements lies in the interval 7.25 ± 0.15% K. Thus, the probability
that the mean lies in the interval from 7.10 to 7.40% K is 99%.
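A minimal sketch of such a confidence-interval calculation from replicate data, using a critical t value read from a table like Table 2.0 (the data values are illustrative):

```python
import math
import statistics

# Illustrative replicate potassium measurements (% K)
data = [7.20, 7.28, 7.23, 7.29]
n = len(data)

mean = statistics.mean(data)
s = statistics.stdev(data)

# Critical t for 99% confidence and n - 1 = 3 degrees of freedom,
# read from a t table (5.841); use the value for your own df in practice.
t = 5.841
half_width = t * s / math.sqrt(n)

print(f"{mean:.2f} ± {half_width:.2f} % K")
```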
Table 2.0 adapted from Fundamentals of Analytical Chemistry (9th ed.) (page 126),
by D. A. Skoog et al., 2014, Brooks/Cole
Hypothesis testing is the basis for many decisions made in science and engineering.
To explain an observation, a hypothetical model is advanced and tested
experimentally to determine its validity. The hypothesis tests that we describe are
used to determine if the results from these experiments support the model. If they
do not support our model, we reject the hypothesis and seek a new one.
Tests of this kind use a null hypothesis, which assumes that the numerical quantities
being compared are, in fact, the same. We then use a probability distribution to
calculate the probability that the observed differences are a result of random error.
Usually, if the observed difference is greater than or equal to the difference that
would occur 5 times in 100 by random chance (a significance level of 0.05), the null
hypothesis is considered questionable, and the difference is judged to be significant.
If a large number of results are available so that s is a good estimate of σ, the z test is
appropriate. The procedure that is used is summarized below:
1. State the null hypothesis: H0: μ = μ0
2. Form the test statistic: z = (x̄ − μ0)/(σ/√N)
3. State the alternative hypothesis Ha and determine the rejection region at the chosen significance level
In the t test, we are interested in the difference between the means, x̄1 − x̄2. The
variance sd² of the difference between the means is given by
The standard deviation of the difference between the means is found by taking the
square root after substituting the values of sm1² and sm2² from above.
Now, if we make the further assumption that the pooled standard deviation spooled
is a better estimate of σ than s1 or s2, we can write
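A sketch of this pooled two-sample t test, with illustrative data (not from the text):

```python
import math
import statistics

set1 = [14.1, 14.3, 14.2, 14.0]   # illustrative results, method 1
set2 = [14.4, 14.6, 14.5]         # illustrative results, method 2

n1, n2 = len(set1), len(set2)
x1, x2 = statistics.mean(set1), statistics.mean(set2)
s1, s2 = statistics.stdev(set1), statistics.stdev(set2)

# Pooled standard deviation from the two sets
s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Test statistic for H0: mu1 = mu2
t = (x1 - x2) / (s_pooled * math.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(f"t = {t:.2f} with {df} degrees of freedom")
```

The computed t is then compared with the tabulated critical value for n1 + n2 − 2 degrees of freedom at the chosen confidence level.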
Paired Data
The paired t test uses the same type of procedure as the normal t test except that we
analyze pairs of data and compute the differences, di. The standard deviation is now
the standard deviation of the mean difference. Our null hypothesis is H0: μd = Δ0,
where Δ0 is a specific value of the difference to be tested, often zero. The test
statistic value is
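The test statistic does not survive in this copy; its standard form is

```latex
t = \frac{\bar{d} - \Delta_0}{s_d / \sqrt{N}}
```

where d̄ is the mean of the differences, s_d is their standard deviation, and N is the number of pairs.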
The F test tells us whether two standard deviations are “significantly” different from
each other. F is the quotient of the squares of the standard deviations:
We always put the larger standard deviation in the numerator so that F ≥ 1. If the
calculated F exceeds the critical value in Table 3.0, then the difference is significant.
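A sketch of the F calculation, with illustrative data; the result would be compared with the tabulated critical value for the corresponding degrees of freedom:

```python
import statistics

method_a = [0.52, 0.58, 0.55, 0.49, 0.54]   # illustrative replicates
method_b = [0.53, 0.54, 0.55, 0.54]

s_a = statistics.stdev(method_a)
s_b = statistics.stdev(method_b)

# Larger standard deviation goes in the numerator so that F >= 1
s_big, s_small = max(s_a, s_b), min(s_a, s_b)
F = s_big**2 / s_small**2
print(f"F = {F:.2f}")
```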
Table 3.0 adapted from Quantitative Chemical Analysis (7th ed.) (page), by Daniel
Harris, 2007, W. H. Freeman and Company
Grubbs’ Test
Grubbs' test is used to detect a single outlier in a univariate data set that follows an
approximately normal distribution.
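The test statistic does not survive in this copy; in its standard form it is

```latex
G = \frac{\max_i \lvert y_i - \bar{y} \rvert}{s}
```

and the suspect value is rejected as an outlier if G exceeds the tabulated critical value for the sample size and significance level.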
where ȳ is the sample mean and s is the sample standard deviation.
Calibration Curves