
UNIT 4: BASIC STATISTICS

Topic Outline:
⚫ Gaussian distribution
⚫ Mean
⚫ Standard deviation
⚫ Confidence intervals
⚫ F-test
⚫ Student’s t-test
⚫ Grubbs’ test
⚫ Calibration curve

Learning Objectives

At the end of the topic, the students should be able to:

• Describe different statistical tools in estimating the reliability of data and measurements.
• Perform calculations involving different statistical tools.
• Explain the concept of data rejection (or elimination) and comparison of measurements.
• Demonstrate the use of computer applications and statistical software in statistical calculations.
• Define relevant terms.

Samples and Populations

Typically in a scientific study, we infer information about a population or universe from observations made on a subset or sample. The population is the collection of all measurements of interest and must be carefully defined by the experimenter.

Statistical laws have been derived for populations, but they can be used for samples
after suitable modification. Such modifications are needed for small samples
because a few data points may not represent the entire population.

Properties of Gaussian Curves

Figure 1.0 shows two Gaussian curves in which we plot the relative frequency y of various deviations from the mean versus the deviation from the mean. Curves such as these can be described by an equation that contains just two parameters, the population mean μ and the population standard deviation σ.

The term parameter refers to quantities such as μ and σ that define a population or distribution. Data values such as x are variables. The term statistic refers to an estimate of a parameter that is made from a sample of data.

The sample mean and the sample standard deviation are examples of statistics that estimate the parameters μ and σ, respectively.

Figure 1.0 Gaussian Curve adapted from Fundamentals of Analytical Chemistry (9th
ed.) (page 99), by D. A. Skoog et al., 2014, Brooks/Cole

The Population Mean μ and the Sample Mean x̅

The sample mean x̅ is the arithmetic average of a limited sample drawn from a population of data. The sample mean is defined as the sum of the measurement values divided by the number of measurements:

x̅ = Σxi / N

where xi represents the individual values of x making up the set of N replicate measurements, and N represents the number of measurements in the sample set.

The population mean μ, in contrast, is the true mean for the population. It is also defined by the equation above, with the added provision that N represents the total number of measurements in the population. In the absence of systematic error, the population mean is also the true value for the measured quantity.

The Population Standard Deviation σ

The population standard deviation σ, which is a measure of the precision of the population, is given by the equation

σ = √[ Σ(xi − μ)² / N ]

where N is the number of data points making up the population.

The two curves in Figure 1.0a are for two populations of data that differ only in their
standard deviations. The standard deviation for the data set yielding the broader
but lower curve B is twice that for the measurements yielding curve A.

The breadth of these curves is a measure of the precision of the two sets of data.
Thus, the precision of the data set leading to curve A is twice as good as that of the
data set represented by curve B.
Figure 1.0b shows another type of normal error curve in which the x axis is now a new variable z, which is defined as

z = (x − μ) / σ

Note that z is the relative deviation of a data point from the mean, that is, the deviation relative to the standard deviation. Hence, when x − μ = σ, z is equal to one; when x − μ = 2σ, z is equal to two; and so forth. Since z is the deviation from the mean relative to the standard deviation, a plot of relative frequency versus z yields a single Gaussian curve that describes all populations of data regardless of standard deviation.

Thus, Figure 1.0b is the normal error curve for both sets of data used to plot curves
A and B in Figure 1.0a.

The equation for the Gaussian error curve is

y = e^(−(x − μ)² / 2σ²) / (σ√(2π))

Because it appears in the Gaussian error curve expression, the square of the standard deviation σ² is also important. This quantity is called the variance.
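As a minimal Python sketch (the values of μ, σ, and x are hypothetical), the snippet below shows how the z transformation collapses two populations with different standard deviations onto the same standard normal curve:

```python
import numpy as np
from scipy import stats

# Two populations with the same mean but different spreads,
# as in curves A and B of Figure 1.0a (hypothetical values).
mu = 50.0
sigma_a, sigma_b = 1.0, 2.0     # sigma for curve B is twice that for curve A

x = 52.0                        # a single measurement
z_a = (x - mu) / sigma_a        # z = (x - mu) / sigma
z_b = (x - mu) / sigma_b

# On the z scale, both populations follow the same standard normal curve.
print(z_a, stats.norm.pdf(z_a))   # z = 2.0
print(z_b, stats.norm.pdf(z_b))   # z = 1.0
```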

A normal error curve has several general properties:

⚫ The mean occurs at the central point of maximum frequency.
⚫ There is a symmetrical distribution of positive and negative deviations about the maximum.
⚫ There is an exponential decrease in frequency as the magnitude of the deviations increases. Thus, small uncertainties are observed much more often than very large ones.

The Sample Standard Deviation: A Measure of Precision

The sample standard deviation s is given by the equation

s = √[ Σ(xi − x̅)² / (N − 1) ]

where the quantity (xi − x̅) represents the deviation di of value xi from the mean x̅, and N − 1 is the number of degrees of freedom. s is said to be an unbiased estimator of the population standard deviation σ.

The sample variance s² is also of importance in statistical calculations. It is an estimate of the population variance σ².
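A minimal sketch of these calculations in Python (the replicate values are hypothetical); note that ddof=1 gives the N − 1 degrees of freedom used for the sample statistics:

```python
import numpy as np

# Hypothetical replicate measurements
x = np.array([24.29, 24.33, 24.36, 24.31, 24.38])

mean = x.mean()          # sample mean: sum of values divided by N
s = x.std(ddof=1)        # sample standard deviation, N - 1 degrees of freedom
variance = s**2          # sample variance

print(f"mean = {mean:.3f}, s = {s:.3f}, s^2 = {variance:.5f}")
```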

Equation for Calculating the Pooled Standard Deviation

The equation for computing a pooled standard deviation from several sets of data takes the form

spooled = √[ ( Σ(xi − x̅1)² + Σ(xj − x̅2)² + Σ(xk − x̅3)² + ⋯ ) / (N1 + N2 + N3 + ⋯ − Nt) ]

where N1, N2, N3, … are the numbers of measurements in each set and Nt is the number of data sets pooled.
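A short Python sketch of the pooled calculation (the helper name pooled_std and the data sets are hypothetical): squared deviations from each set's own mean are summed, then divided by the total number of measurements minus the number of sets:

```python
import numpy as np

def pooled_std(*datasets):
    """Pooled standard deviation: total squared deviations divided by
    (total measurements - number of data sets), then square-rooted."""
    ss = sum(((np.asarray(d) - np.mean(d)) ** 2).sum() for d in datasets)
    dof = sum(len(d) for d in datasets) - len(datasets)
    return np.sqrt(ss / dof)

# Hypothetical replicate sets from three analysis days
print(pooled_std([10.1, 10.3, 10.2], [9.9, 10.0, 10.1, 10.2], [10.4, 10.2]))
```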

Variance and Other Measures of Precision

The variance is just the square of the standard deviation. The sample variance s² is an estimate of the population variance σ² and is given by

s² = Σ(xi − x̅)² / (N − 1)

Relative Standard Deviation (RSD) and Coefficient of Variation (CV)

Frequently, standard deviations are given in relative rather than absolute terms. We calculate the relative standard deviation by dividing the standard deviation by the mean value of the data set. The relative standard deviation, RSD, is sometimes given the symbol sr:

RSD = sr = s / x̅

The result is often expressed in parts per thousand (ppt) or in percent by multiplying this ratio by 1000 ppt or by 100%. For example,

RSD in ppt = (s / x̅) × 1000 ppt

The relative standard deviation multiplied by 100% is called the coefficient of variation (CV):

CV = (s / x̅) × 100%
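In Python, continuing with hypothetical replicates:

```python
import numpy as np

x = np.array([24.29, 24.33, 24.36, 24.31, 24.38])   # hypothetical replicates
s, mean = x.std(ddof=1), x.mean()

rsd_ppt = s / mean * 1000    # relative standard deviation, parts per thousand
cv = s / mean * 100          # coefficient of variation, percent
print(f"RSD = {rsd_ppt:.1f} ppt, CV = {cv:.2f}%")
```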
Spread or Range (w)

The spread, or range, w, is another term that is sometimes used to describe the
precision of a set of replicate results. It is the difference between the largest value in
the set and the smallest.

Standard Deviation of Calculated Results

Often we must estimate the standard deviation of a result that has been calculated
from two or more experimental data points, each of which has a known sample
standard deviation.

Confidence Intervals

In most quantitative chemical analyses, the true value of the mean μ cannot be determined because a huge number of measurements (approaching infinity) would be required. With statistics, however, we can establish an interval surrounding the experimentally determined mean x̅ within which the population mean μ is expected to lie with a certain degree of probability. This interval is known as the confidence interval.

Sometimes the limits of the interval are called confidence limits. For example, we might say that it is 99% probable that the true population mean for a set of potassium measurements lies in the interval 7.25 ± 0.15% K. Thus, the probability that the mean lies in the interval from 7.10 to 7.40% K is 99%.

Finding the Confidence Interval When σ Is Known

For the mean of N measurements, the confidence interval for μ is

CI for μ = x̅ ± zσ/√N

Table 1.0 adapted from Fundamentals of Analytical Chemistry (9th ed.) (page 125), by D. A. Skoog et al., 2014, Brooks/Cole

Finding the Confidence Interval When σ Is Unknown

Table 2.0 adapted from Fundamentals of Analytical Chemistry (9th ed.) (page 126), by D. A. Skoog et al., 2014, Brooks/Cole

For the mean of N measurements, with s estimating σ,

CI for μ = x̅ ± ts/√N
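A sketch of the σ-unknown case in Python, using scipy's Student's t distribution (the potassium replicates and the 95% level are hypothetical choices):

```python
import numpy as np
from scipy import stats

x = np.array([7.20, 7.28, 7.25, 7.31, 7.26])   # hypothetical % K replicates
n, mean, s = len(x), x.mean(), x.std(ddof=1)

# 95% confidence interval: x-bar +/- t * s / sqrt(N), with N - 1 dof
t_crit = stats.t.ppf(0.975, df=n - 1)          # two-tailed critical t
half_width = t_crit * s / np.sqrt(n)
print(f"CI for mu: {mean:.3f} +/- {half_width:.3f} % K")
```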


Statistical Aids to Hypothesis Testing

Hypothesis testing is the basis for many decisions made in science and engineering.
To explain an observation, a hypothetical model is advanced and tested
experimentally to determine its validity. The hypothesis tests that we describe are
used to determine if the results from these experiments support the model. If they
do not support our model, we reject the hypothesis and seek a new one.

Tests of this kind use a null hypothesis, which assumes that the numerical quantities
being compared are, in fact, the same. We then use a probability distribution to
calculate the probability that the observed differences are a result of random error.
Usually, if the observed difference is greater than or equal to the difference that
would occur 5 times in 100 by random chance (a significance level of 0.05), the null
hypothesis is considered questionable, and the difference is judged to be significant.

Large Sample z Test

If a large number of results are available so that s is a good estimate of σ, the z test is appropriate. The procedure that is used is summarized below:

1. State the null hypothesis: H0: μ = μ0

2. Form the test statistic:

z = (x̅ − μ0) / (σ/√N)

3. State the alternative hypothesis Ha and determine the rejection region:
For Ha: μ ≠ μ0, reject H0 if z ≥ zcrit or if z ≤ −zcrit (two-tailed test)
For Ha: μ > μ0, reject H0 if z ≥ zcrit (one-tailed test)
For Ha: μ < μ0, reject H0 if z ≤ −zcrit (one-tailed test)
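A sketch of this procedure in Python (the helper z_test and the data are hypothetical; scipy supplies the critical values):

```python
import numpy as np
from scipy import stats

def z_test(x, mu0, sigma, alpha=0.05, tail="two"):
    """Large-sample z test of H0: mu = mu0."""
    x = np.asarray(x)
    z = (x.mean() - mu0) / (sigma / np.sqrt(len(x)))
    if tail == "two":
        z_crit = stats.norm.ppf(1 - alpha / 2)
        reject = abs(z) >= z_crit
    elif tail == "greater":                 # Ha: mu > mu0
        z_crit = stats.norm.ppf(1 - alpha)
        reject = z >= z_crit
    else:                                   # Ha: mu < mu0
        z_crit = stats.norm.ppf(1 - alpha)
        reject = z <= -z_crit
    return z, z_crit, reject

# Hypothetical: 30 results, claimed mu0 = 5.00, known sigma = 0.12
data = np.random.default_rng(1).normal(5.06, 0.12, 30)
print(z_test(data, mu0=5.00, sigma=0.12))
```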

Comparison of Two Experimental Means

The t Test for Differences in Means

The variance of the mean of analyst 1 is

s²m1 = s1²/N1

Likewise, the variance of the mean of analyst 2 is

s²m2 = s2²/N2

In the t test, we are interested in the difference between the means, x̅1 − x̅2. The variance s²d of the difference between the means is given by

s²d = s²m1 + s²m2

The standard deviation of the difference between the means is found by taking the square root after substituting the values of s²m1 and s²m2 from above.

Now, if we make the further assumption that the pooled standard deviation spooled is a better estimate of σ than s1 or s2, the test statistic t is found from

t = (x̅1 − x̅2) / ( spooled √(1/N1 + 1/N2) )
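scipy implements this pooled two-sample test directly; a sketch with hypothetical replicates from two analysts:

```python
import numpy as np
from scipy import stats

analyst1 = np.array([14.5, 14.8, 14.6, 14.9, 14.7])   # hypothetical results
analyst2 = np.array([14.2, 14.4, 14.5, 14.3])

# Pooled two-sample t test (equal variances assumed, as in the text)
t_stat, p_value = stats.ttest_ind(analyst1, analyst2, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# If p < 0.05, the difference between the means is judged significant.
```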

Paired Data

The paired t test uses the same type of procedure as the normal t test except that we analyze pairs of data and compute the differences, di. The standard deviation is now the standard deviation of the mean difference. Our null hypothesis is H0: μd = Δ0, where Δ0 is a specific value of the difference to be tested, often zero. The test statistic value is

t = (d̅ − Δ0) / (sd/√N)

where d̅ = Σdi/N is the average difference and sd is the standard deviation of the differences. The alternative hypothesis could be μd ≠ Δ0, μd > Δ0, or μd < Δ0.
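A sketch of the paired test with hypothetical data (scipy's ttest_rel computes the differences internally and tests Δ0 = 0):

```python
import numpy as np
from scipy import stats

# Hypothetical paired results: the same samples analyzed by two methods
method_a = np.array([10.2, 12.1, 9.8, 11.5, 10.9])
method_b = np.array([10.0, 11.9, 9.9, 11.2, 10.7])

t_stat, p_value = stats.ttest_rel(method_a, method_b)  # H0: mean difference = 0
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```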

Comparison of Standard Deviations with the F Test

The F test tells us whether two standard deviations are “significantly” different from each other. F is the quotient of the squares of the standard deviations:

F = s1² / s2²

We always put the larger standard deviation in the numerator so that F ≥ 1. If the calculated F exceeds the critical value in Table 3.0, then the difference is significant.
Table 3.0 adapted from Quantitative Chemical Analysis (7th ed.), by Daniel Harris, 2007, W. H. Freeman and Company
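Rather than looking up Table 3.0, the critical value can be computed from the F distribution; a sketch with hypothetical standard deviations and replicate counts:

```python
from scipy import stats

s1, n1 = 0.047, 6    # larger standard deviation goes in the numerator
s2, n2 = 0.031, 5    # hypothetical precision values and replicate counts

F = s1**2 / s2**2
F_crit = stats.f.ppf(0.95, dfn=n1 - 1, dfd=n2 - 1)   # 95% critical value
print(f"F = {F:.2f}, F_crit = {F_crit:.2f}, significant: {F > F_crit}")
```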

Grubbs’ Test

Grubbs' test is used to detect a single outlier in a univariate data set that follows an
approximately normal distribution.

Step 1: Order the data points from smallest to largest.

Step 2: Find the mean (Ȳ) and standard deviation (s) of the data set.

The Grubbs’ test statistic for a two-tailed test is:

G = max |Yi − Ȳ| / s

where:
Ȳ is the sample mean,
s is the sample standard deviation.

A left-tailed test uses the test statistic:

G = (Ȳ − Ymin) / s

where Ymin is the minimum value.

For a right-tailed test, use:

G = (Ymax − Ȳ) / s

where Ymax is the maximum value.

Step 3: Find the G Critical Value.


Step 4: Decide whether to accept or reject the outlier.

Compare your G test statistic to the G critical value:


Gtest < Gcritical: keep the point in the data set; it is not an outlier.
Gtest > Gcritical: reject the point as an outlier.
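A sketch of the two-tailed procedure in Python (the helper grubbs_statistic and the data are hypothetical; the critical value is the tabulated G for N = 5 at the 95% level):

```python
import numpy as np

def grubbs_statistic(y):
    """Two-tailed Grubbs' statistic: the largest deviation from the mean
    divided by the sample standard deviation."""
    y = np.asarray(y, dtype=float)
    return np.max(np.abs(y - y.mean())) / y.std(ddof=1)

data = [5.1, 5.0, 5.2, 5.1, 5.9]   # hypothetical set with one suspect value
G = grubbs_statistic(data)
G_crit = 1.715                     # tabulated G, N = 5, 95% confidence

verdict = "reject the point as an outlier" if G > G_crit else "keep the point"
print(f"G = {G:.3f}, G_crit = {G_crit}: {verdict}")
```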

Calibration Curves

A calibration curve shows the response of an analytical method to known quantities of analyte. Solutions containing known concentrations of analyte are called standard solutions. Solutions containing all the reagents and solvents used in the analysis, but no deliberately added analyte, are called blank solutions. Blanks measure the response of the analytical procedure to impurities or interfering species in the reagents.
Constructing a Calibration Curve

Step 1: Prepare known samples of analyte covering a range of concentrations expected for unknowns.
Step 2: Subtract the average absorbance of the blank samples from each measured absorbance to obtain corrected absorbance.
Step 3: Make a graph of corrected absorbance versus quantity of analyte analyzed (a code sketch follows this list).
Step 4: If you analyze an unknown solution at a future time, run a blank at the same time. Subtract the new blank absorbance from the unknown absorbance to obtain the corrected absorbance.
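A sketch of Steps 2-4 in Python (all absorbance values are hypothetical), fitting an ordinary least-squares line through the corrected points:

```python
import numpy as np

# Hypothetical standards: quantity of analyte (ug) vs. measured absorbance
quantity = np.array([0.0, 5.0, 10.0, 15.0, 20.0])
absorbance = np.array([0.099, 0.185, 0.282, 0.345, 0.425])
blank = 0.099                                  # average blank absorbance

corrected = absorbance - blank                 # Step 2: blank correction

# Step 3: least-squares calibration line through the corrected points
slope, intercept = np.polyfit(quantity, corrected, 1)
print(f"corrected A = {slope:.4f} * ug + {intercept:.4f}")

# Step 4: convert a future blank-corrected unknown reading to a quantity
unknown_corrected = 0.250 - 0.100              # new blank reading = 0.100
print(f"unknown ~ {(unknown_corrected - intercept) / slope:.1f} ug")
```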
