Notes: Chemometrics


Chapter 1: Chemometrics Analytical Chemistry (Chem-2202)

CHEMOMETRICS
SYLLABUS OUTLINES:

1. What is Analytical Chemistry?


2. Fundamental and Derived SI Units
3. Concentration and its units
4. Replicate/Mean/Average/Median
5. Error and types of error
a. Systematic Errors
b. Random Errors
6. Samples and Populations (Variance)
7. The Sample Standard Deviation (A Measure of Precision)
8. Reporting Computed Data (Significant Figures & Rounding Off Data)
9. Confidence Intervals
10. Analysis of Variance (ANOVA)
11. Sampling

What is Analytical Chemistry?

Analytical chemistry is often described as the area of chemistry responsible for characterizing the
composition of matter, both qualitatively (what is present) and quantitatively (how much is present). A
more appropriate description of analytical chemistry is “the science of inventing and applying the
concepts, principles, and strategies for measuring the characteristics of chemical systems and species.”

Chemometrics is the science of extracting information from chemical systems by data-driven means.

Fundamental Units

The fundamental units are the units of the fundamental quantities, as defined by the International
System of Units. They are not dependent upon any other units, and all other units are derived from
them. In the International System of Units, the fundamental units are:

1. The meter (symbol: m), used to measure length.


2. The kilogram (symbol: kg), used to measure mass.
3. The second (symbol: s), used to measure time.
4. The ampere (symbol: A), used to measure electric current.
5. The kelvin (symbol: K), used to measure temperature.
6. The mole (symbol: mol), used to measure amount of substance.
7. The candela (symbol: cd), used to measure light intensity.

Derived Units

The SI derived units are either dimensionless or can be expressed as products of one or more of the
base units, each raised to an appropriate power. Many derived units do not have

3rd Revision, Jan‘18 Page 1/20

special names. For example, the SI derived unit of area is the square meter (m²) and the SI derived unit
of density is the kilogram per cubic meter (kg/m³ or kg·m⁻³). Around 22 derived units are recognized by
the SI with special names, which are written in lowercase. However, the symbols for units named after
persons are always written with an uppercase initial letter. For example, the symbol for the hertz is
"Hz", but the symbol for the meter is "m".

Concentration and its units

Concentration is the abundance of a constituent divided by the total volume of a mixture. Several types
of mathematical description can be distinguished: mass concentration, molar concentration, number
concentration, and volume concentration. The term concentration can be applied to any kind of
chemical mixture, but most frequently it refers to solutes and solvents in solutions. The molar (amount)
concentration has variants such as normal concentration.

NAME                  SYMBOL   UNITS
molarity              M        moles solute / liters solution
formality             F        number FW solute / liters solution
normality             N        number EW solute / liters solution
molality              m        moles solute / kg solvent
weight %              % w/w    g solute / 100 g solution
volume %              % v/v    mL solute / 100 mL solution
weight-to-volume %    % w/v    g solute / 100 mL solution
parts per million     ppm      g solute / 10⁶ g solution
parts per billion     ppb      g solute / 10⁹ g solution

FW = formula weight; EW = equivalent weight.
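The table above can be put to work in a short conversion sketch. The function name and the example solute are illustrative, and the conversion assumes a dilute aqueous solution whose density is about 1.00 g/mL, so that 1 L of solution weighs about 1000 g:

```python
# Sketch: molarity -> ppm for a dilute aqueous solution.
# Assumes density ~1.00 g/mL, so 1 L of solution weighs ~1000 g.

def molarity_to_ppm(molarity, molar_mass, density_g_per_ml=1.00):
    """ppm = g solute per 10^6 g solution."""
    grams_solute_per_liter = molarity * molar_mass
    grams_solution_per_liter = density_g_per_ml * 1000.0
    return grams_solute_per_liter / grams_solution_per_liter * 1e6

# 1.00 x 10^-3 M NaCl (molar mass 58.44 g/mol)
ppm_nacl = molarity_to_ppm(1.00e-3, 58.44)
print(round(ppm_nacl, 2))  # 58.44
```

The density assumption is what makes the mass-based ppm definition interchangeable with a per-liter calculation; for concentrated solutions the actual density must be used.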

Replicates: Replicates are samples of about the same size that are carried through an analysis in exactly
the same way. In order to improve the reliability and to obtain information about the variability of
results, two to five portions (replicates) of a sample are usually carried through an entire analytical
procedure. Individual results from a set of measurements are seldom the same, so we usually consider
the “best” estimate to be the central value for the set.

We justify the extra effort required to analyze replicates in two ways. First, the central value of a set
should be more reliable than any of the individual results. Usually, the mean or the median is used as
the central value for a set of replicate measurements. Second, an analysis of the variation in the data
allows us to estimate the uncertainty associated with the central value.

The Mean: The mean of two or more measurements is their average value. The most widely used
measure of central value is the mean, x̄. The mean, also called the arithmetic mean or the average, is
obtained by dividing the sum of replicate measurements by the number of measurements in the set
(xᵢ represents the individual values of x making up the set of N replicate measurements):

x̄ = (∑ xᵢ) / N

The Median: The median is the middle result when replicate data are arranged in increasing or
decreasing numerical order. There are equal numbers of results that are larger and smaller than the
median.

1. For an odd number of results, the median can be found by arranging the results in order and locating the
middle result.
2. For an even number, the average value of the middle pair is used.

The median is used advantageously when a set of data contain an outlier (a result that differs
significantly from others in the set). An outlier can have a significant effect on the mean of the set but
has no effect on the median. In ideal cases, the mean and median are identical. However, when the
number of measurements in the set is small, the values often differ e.g.

Mean = x̄ = 19.78 ≈ 19.8        Median = 19.7
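The mean/median definitions above can be checked directly. The replicate values below are hypothetical, chosen for illustration rather than taken from the text's example:

```python
# Mean vs. median on hypothetical replicate data, plus the effect of an
# outlier: the mean shifts noticeably, the median barely moves.
from statistics import mean, median

replicates = [19.4, 19.5, 19.6, 19.8, 20.1, 20.3]
x_bar = mean(replicates)   # arithmetic mean
mid = median(replicates)   # even N: average of the middle pair

with_outlier = replicates + [24.0]
print(round(x_bar, 2), mid, round(mean(with_outlier), 2), median(with_outlier))
```

This is the practical argument for the median: one gross error leaves it almost unchanged.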

ERRORS: The term error has two slightly different meanings.

1. First, error refers to the difference between a measured value and the “true” or “known” value.
2. Second, error often denotes the estimated uncertainty in a measurement or experiment.

Precision: Precision is the closeness of results to others obtained in exactly the same way. Precision
describes the reproducibility of measurements—in other words, the closeness of results that have been
obtained in exactly the same way. Generally, the precision of a measurement is readily determined by
simply repeating the measurement on replicate samples.

Three terms are widely used to describe the precision of a set of replicate data: standard
deviation, variance, and coefficient of variation. These three are functions of how much an individual
result differs from the mean, called the deviation from the mean, di.

dᵢ = |xᵢ − x̄|

Accuracy: Accuracy is the closeness of a measured value to the true or accepted value. Note that
accuracy measures agreement between a result and the accepted value. Accuracy is often more difficult
to determine because the true value is usually unknown. An accepted value must be used instead.
Accuracy is expressed in terms of either absolute or relative error.

Absolute Error: The absolute error of a measurement is the difference between the measured value and
the true value. The sign of the absolute error tells you whether the value in question is high or low.

1. If the measurement result is low, the sign is negative;


2. If the measurement result is high, the sign is positive.

The absolute error E in the measurement of a quantity x is given by the following equation, where xₜ is
the true or accepted value of the quantity.

E = xᵢ − xₜ

Relative Error: The relative error of a measurement is the absolute error divided by the true value.
Relative error may be expressed in percent, parts per thousand, or parts per million, depending on the
magnitude of the result.

The relative error Er is often a more useful quantity than the absolute error. The percent relative error is
given by the following expression. A relative error of 1% corresponds to 10 ppt (0.01 × 1000) and to
10,000 ppm (0.01 × 10⁶).

Er = (xᵢ − xₜ) / xₜ × 100%
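The two error definitions translate directly into code. The measured and true values below are hypothetical:

```python
# Absolute error E = xi - xt and percent relative error
# Er = (xi - xt) / xt * 100, following the definitions above.

def absolute_error(xi, xt):
    return xi - xt

def percent_relative_error(xi, xt):
    return (xi - xt) / xt * 100.0

# Hypothetical: measured 19.8 ppm Fe against a true value of 20.0 ppm.
E = absolute_error(19.8, 20.0)           # negative: the result is low
Er = percent_relative_error(19.8, 20.0)  # -1.0 %, i.e. -10 ppt
print(E, Er)
```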

Types of errors in experimental data:

Random (or indeterminate) error: One type, called random (or indeterminate) error, causes data to be
scattered more or less symmetrically around a mean value. In general, the random error in a
measurement is reflected by its precision. Random or indeterminate errors affect measurement
precision.

Systematic (or determinate) error: A second type of error, called systematic (or determinate) error,
causes the mean of a data set to differ from the accepted value. In general, a systematic error in a series
of replicate measurements causes all the results to be too high or too low. Systematic, or determinate,
errors affect the accuracy of results. An example of a systematic error is the loss of a volatile analyte
while heating a sample.

Gross Error: A third type of error is gross error. Gross errors usually occur only occasionally, are often
large, and may cause a result to be either high or low. They are often the product of human error. For example, if
part of a precipitate is lost before weighing, analytical results will be low. Touching a weighing bottle
with your fingers after its empty mass is determined will cause a high mass reading for a solid weighed
in the contaminated bottle.

Gross errors lead to outliers, results that appear to differ markedly from all other data in a set of
replicate measurements. An outlier is an occasional result in replicate measurements that differs
significantly from the other results. Various statistical tests can be performed to determine if a result is
an outlier.

SYSTEMATIC ERRORS

Systematic (or determinate) error causes the mean of a data set to differ from the accepted value.
Systematic errors have a definite value, an assignable cause, and are of the same magnitude for
replicate measurements made in the same way. They lead to bias in measurement results. Note that
bias affects all of the data in a set in the same way and that it bears a sign. Bias measures the systematic
error associated with an analysis. It has a negative sign if it causes the results to be low and a positive
sign otherwise.

Sources of Systematic Errors: There are three types of systematic errors:

1. Instrumental errors are caused by non-ideal instrument behavior, by faulty calibrations, or by


use under inappropriate conditions.
2. Method errors arise from non-ideal chemical or physical behavior of analytical systems.
3. Personal errors result from the carelessness, inattention, or personal limitations of the
experimenter.

Instrumental Errors: All measuring devices are potential sources of systematic errors.

For example, pipets, burets, and volumetric flasks may hold or deliver volumes slightly different from
those indicated by their graduations. These differences arise from using glassware at a temperature that
differs significantly from the calibration temperature, from distortions in container walls due to heating
while drying, from errors in the original calibration, or from contaminants on the inner surfaces of the
containers. Calibration eliminates most systematic errors of this type.

Electronic instruments are also subject to systematic errors. These can arise from several sources. For
example, errors may emerge as the voltage of a battery-operated power supply decreases with use.
Errors can also occur if instruments are not calibrated frequently or if they are calibrated incorrectly.
The experimenter may also use an instrument under conditions where errors are large. For example, a
pH meter used in strongly acidic media is prone to an acid error.

Temperature changes cause variation in many electronic components, which can lead to drifts and
errors. Some instruments are susceptible to noise induced from the alternating current (ac) power lines,
and this noise may influence precision and accuracy. In many cases, errors of these types are detectable
and correctable.

Method Errors: The non-ideal chemical or physical behavior of the reagents and reactions on which an
analysis is based often introduce systematic method errors. Such sources of non-ideality include the
slowness of some reactions, the incompleteness of others, the instability of some species, the lack of
specificity of most reagents, and the possible occurrence of side reactions that interfere with the
measurement process. As an example, a common method error in volumetric analysis results from the
small excess of reagent required to cause an indicator to change color and signal the equivalence point.

The accuracy of such an analysis is thus limited by the very phenomenon that makes the titration
possible. Errors inherent in a method are often difficult to detect and are thus the most serious of the
three types of systematic error.

Personal Errors: Many measurements require personal judgments.

Examples include estimating the position of a pointer between two scale divisions, the color of a
solution at the end point in a titration, or the level of a liquid with respect to a graduation in a pipet or
burette. Judgments of this type are often subject to systematic, unidirectional errors. For example, one
person may read a pointer consistently high, while another may be slightly slow in activating a timer.
Yet, a third may be less sensitive to color changes, with an analyst who is insensitive to color changes
tending to use excess reagent in a volumetric analysis.

Analytical procedures should always be adjusted so that any known physical limitations of the analyst
cause negligibly small errors. Automation of analytical procedures can eliminate many errors of this
type.

A universal source of personal error is prejudice, or bias. Most of us, no matter how honest, have a
natural, subconscious tendency to estimate scale readings in a direction that improves the precision in a
set of results. Alternatively, we may have a preconceived notion of the true value for the measurement.
We then subconsciously cause the results to fall close to this value.

Number bias is another source of personal error that varies considerably from person to person. The
most frequent number bias encountered in estimating the position of a needle on a scale involves a
preference for the digits 0 and 5. Also common is a prejudice favoring small digits over large and even
numbers over odd. Again, automated and computerized instruments can eliminate this form of bias.

The Effect of Systematic Errors on Analytical Results:

Systematic errors may be either constant or proportional.

1. The magnitude of a constant error stays essentially the same as the size of the quantity
measured is varied. With constant errors, the absolute error is constant with sample size, but
the relative error varies when the sample size is changed.
2. Proportional errors increase or decrease according to the size of the sample taken for analysis.
With proportional errors, the absolute error varies with sample size, but the relative error stays
constant when the sample size is changed.
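The constant-vs-proportional distinction can be illustrated numerically; the error magnitude and sample sizes below are hypothetical:

```python
# A constant absolute error of 0.5 mg produces a relative error that
# shrinks as the sample grows -- which is exactly why varying the sample
# size reveals constant errors (see "Variation in Sample Size" below).

constant_error_mg = 0.5
sample_sizes_mg = (100, 250, 500)
relative_errors_pct = [constant_error_mg / m * 100 for m in sample_sizes_mg]
print(relative_errors_pct)  # decreasing with increasing sample size
```

A proportional error would behave the other way around: the absolute error grows with the sample, but the relative error stays fixed.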

Removal of Systematic Instrument and Personal Errors: Some systematic instrument errors can be
corrected by calibration. Periodic calibration of equipment is always desirable because the response of
most instruments changes with time as a result of component aging, corrosion, or mistreatment.

Most personal errors can be minimized by careful, disciplined laboratory work. It is a good habit to
check instrument readings, notebook entries, and calculations systematically.

Errors due to limitations of the experimenter can usually be avoided by carefully choosing the analytical
method or using an automated procedure.

Removal of Systematic Method Errors:

Bias in an analytical method is particularly difficult to detect. One or more of the following steps can be
taken to recognize and adjust for a systematic error in an analytical method.

(i) Analysis of Standard Samples: The best way to estimate the bias of an analytical method is by analyzing
standard reference materials (SRMs), i.e., materials that contain analytes at known concentration levels.

(ii) Independent Analysis: If standard samples are not available, a second independent and reliable analytical
method can be used in parallel with the method being evaluated. The independent method should differ as
much as possible from the one under study. This practice minimizes the possibility that some common factor
in the sample has the same effect on both methods.

(iii) Blank Determinations: A blank contains the reagents and solvents used in a determination, but no
analyte. A blank solution contains the solvent and all of the reagents in an analysis. Whenever feasible, blanks
may also contain added constituents to simulate the sample matrix.

The term matrix refers to the collection of all the constituents in the sample. In a blank determination, all
steps of the analysis are performed on the blank material. The results are then applied as a correction to the
sample measurements.

(iv) Variation in Sample Size: As the size of a measurement increases, the effect of a constant error decreases.
Thus, constant errors can often be detected by varying the sample size.

RANDOM ERRORS

Random (or indeterminate) error causes data to be scattered more or less symmetrically around a
mean value. Random, or indeterminate, errors can never be totally eliminated and are often the major
source of uncertainty in a determination.

Random errors are caused by the many uncontrollable variables that accompany every measurement.
The accumulated effect of the individual uncertainties, however, causes replicate results to fluctuate
randomly around the mean of the set.

Random Error Sources: We can get a qualitative idea of the way small undetectable uncertainties
produce a detectable random error in the following way.

Imagine a situation in which just four small random errors combine to give an overall error. We will
assume that each error has an equal probability of occurring and that each can cause the final result to
be high or low by a fixed amount ±U.

A plot of relative frequency (y-axis) against deviation from the mean (x-axis) gives a Gaussian curve,
also called a normal error curve. A Gaussian, or normal error, curve shows the symmetrical
distribution of data around the mean of an infinite set of data.

A histogram is a bar graph of a set of data. The Gaussian curve has the same mean, the same precision,
and the same area under the curve as the histogram.

For example, in an experiment the results vary from a low of 9.969 mL to a high of 9.994 mL. This 0.025-
mL spread of data results directly from an accumulation of all random uncertainties in the experiment.
The spread in a set of replicate measurements is the difference between the highest and lowest result.

Sources of random uncertainties in the calibration of a pipet include: (1) visual judgments, such as the
level of the water with respect to the marking on the pipet and the mercury level in the thermometer;
(2) variations in the drainage time and in the angle of the pipet as it drains; (3) temperature fluctuations,
which affect the volume of the pipet, the viscosity of the liquid, and the performance of the balance;
and (4) vibrations and drafts that cause small variations in the balance readings.

Undoubtedly, there are many other sources of random uncertainty in this calibration process that we
have not listed. Even the simple process of calibrating a pipet is affected by many small and
uncontrollable variables. The cumulative influence of these variables is responsible for the observed
scatter of results around the mean.

Statistical Treatment of Random Errors:

Statistical analysis only reveals information that is already present in a data set. No new information is
created by statistical treatments. Statistical methods do, however, allow us to categorize and characterize
data in different ways and to make objective and intelligent decisions about data quality and interpretation.
Statistical methods can be used to evaluate the random errors discussed in the preceding section.

SAMPLES AND POPULATIONS:

Typically in a scientific study, we infer information about a population or universe from observations
made on a subset or sample. A population is the collection of all measurements of interest to the
experimenter, while a sample is a subset of measurements selected from the population. The
population is the collection of all measurements of interest and must be carefully defined by the
experimenter. In some cases, the population is finite and real, while in others, the population is
hypothetical or conceptual in nature.

In many of the cases in analytical chemistry, the population is conceptual. Consider, for example, the
determination of calcium in a community water supply to determine water hardness. In this example,
the population is the very large, nearly infinite, number of measurements that could be made if we
analyzed the entire water supply. Similarly, in determining glucose in the blood of a patient, we could
hypothetically make an extremely large number of measurements if we used the entire blood supply.

The subset of the population analyzed in both these cases is the sample. Again, we infer characteristics
of the population from those obtained with the sample.

The Population Mean (µ) and the Sample Mean (x̄):

The sample mean x̄ is the arithmetic average of a limited sample drawn from a population of data. The
sample mean is defined as the sum of the measurement values divided by the number of measurements:

x̄ = (∑ xᵢ) / N

where xᵢ represents the individual values of x making up the set of N replicate measurements. In this
equation, N represents the number of measurements in the sample set.

The population mean µ, in contrast, is the true mean for the population. It is also defined by the same
equation with the added provision that N represents the total number of measurements in the
population:

µ = (∑ xᵢ) / N,  where N → ∞

In the absence of systematic error, the population mean is also the true value for the measured
quantity.

The Population Standard Deviation σ:

The population standard deviation σ, which is a measure of the precision of the population, is given by
the equation

σ = √[ ∑(xᵢ − µ)² / N ]

where N is the number of data points making up the population.

The figures show normal error curves. The standard deviation for curve B is twice that for curve A,
that is, σB = 2σA. In (a) the abscissa is the deviation from the mean (xᵢ − µ) in the units of measurement.
In (b) the abscissa is the deviation from the mean in units of σ. For this plot, the two curves A and B are
identical.

The two curves in Figure ‘a’ are for two populations of data that differ only in their standard deviations.
The standard deviation for the data set yielding the broader but lower curve B is twice that for the
measurements yielding curve A. The breadth of these curves is a measure of the precision of the two
sets of data. Thus, the precision of the data set leading to curve A is twice as good as that of the data set
represented by curve B.

THE SAMPLE STANDARD DEVIATION (A MEASURE OF PRECISION)

The equation for the standard deviation of a large data set (the population standard deviation) is

σ = √[ ∑(xᵢ − µ)² / N ]

This equation must be modified when it is applied to a small sample of data. Thus, the sample standard
deviation s is given by the equation

s = √[ ∑(xᵢ − x̄)² / (N − 1) ]

When you make statistical calculations, remember that, because of the uncertainty in x̄, a sample
standard deviation s may differ significantly from the population standard deviation σ. As N becomes
larger, x̄ and s become better estimators of µ and σ. (As N → ∞, x̄ → µ and s → σ.)

Standard Error of the Mean (SEM):

The standard error of a mean, sm, is the standard deviation of a set of data divided by the square root of
the number of data points in the set.

The probability figures for the Gaussian distribution, calculated as areas under the curve, refer to the
probable error for a single measurement. Thus, 68.3% of the area beneath a Gaussian curve for a
population lies within ±1σ of the mean, so it is 68.3% probable that a single result drawn from the
population will lie within ±1σ of the mean µ. If a series of replicate sets of results, each containing N
measurements, are taken randomly from a population of results, the mean of each set will show less
and less scatter as N increases. The standard deviation of each mean is known as the standard error of
the mean and is given the symbol sm. The standard error is inversely proportional to the square root of
the number of data points N used to calculate the mean:

sm = s / √N

This equation tells us that the mean of four measurements is more precise by a factor of √4 = 2 than the
individual measurements in the data set. For this reason, averaging results is often used to improve precision.
However, the improvement to be gained by averaging is somewhat limited because of the square root
dependence on N. For example, to increase the precision by a factor of 10 requires 100 times as many
measurements. It is better, if possible, to decrease s than to keep averaging more results since sm is
directly proportional to s but only inversely proportional to the square root of N. The standard deviation
can sometimes be decreased by being more precise in individual operations, by changing the procedure,
and by using more precise measurement tools.
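The square-root improvement is easy to verify. The N = 4 data set below is hypothetical:

```python
# Standard error of the mean, sm = s / sqrt(N): the mean of N replicates
# is more precise than a single measurement by a factor of sqrt(N).
from math import sqrt
from statistics import stdev

def standard_error(data):
    return stdev(data) / sqrt(len(data))

data4 = [10.1, 9.9, 10.0, 10.2]  # hypothetical, N = 4
sm = standard_error(data4)
print(sm)  # exactly half of stdev(data4), since sqrt(4) = 2
```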

Reliability of s as a Measure of Precision:

Several statistical tests are used to test hypotheses, to produce confidence intervals for results, and to
reject outlying data points. Most of these tests are based on sample standard deviations, s. The
probability that these statistical tests provide correct results increases as the reliability of s becomes
greater.

Variance and Other Measures of Precision:

Although the sample standard deviation, s, is usually used in reporting the precision of analytical data,
we often find three other terms.

Variance (s2):

The variance is just the square of the standard deviation. The sample variance s2 is an estimate of the
population variance σ2 and is given by,

s² = ∑(xᵢ − x̄)² / (N − 1)
Note that the standard deviation has the same units as the data, while the variance has the units of the
data squared. Scientists tend to use standard deviation rather than variance because it is easier to relate
a measurement and its precision if they both have the same units. The advantage of using variance is
that variances are additive in many situations.

Relative Standard Deviation (RSD) and Coefficient of Variation (CV):

Frequently standard deviations are given in relative rather than absolute terms. We calculate the
relative standard deviation by dividing the standard deviation by the mean value of the data set. The
relative standard deviation, RSD, is sometimes given the symbol sr.

sr = s / x̄
The result is often expressed in parts per thousand (ppt) by multiplying this ratio by 1000 ppt.

sr = (s / x̄) × 1000 ppt

The result is also expressed in percent by multiplying this ratio by 100%. This is also called the coefficient
of variation (CV).

CV = percent sr = (s / x̄) × 100%
The coefficient of variation, CV, is the percent relative standard deviation.
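Both relative measures come from the same ratio s/x̄, differing only in the multiplier. The replicate data below are hypothetical:

```python
# RSD (in ppt) and CV (in %) from the ratio s / x_bar, per the
# definitions above; by construction RSD in ppt is 10x the CV in %.
from statistics import mean, stdev

def rsd_ppt(data):
    return stdev(data) / mean(data) * 1000.0

def cv_percent(data):
    return stdev(data) / mean(data) * 100.0

data = [5.01, 5.03, 4.97, 5.00, 4.99]
print(rsd_ppt(data), cv_percent(data))
```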

Spread or Range (w):

The spread, or range, w, is another term that is sometimes used to describe the precision of a set of
replicate results. It is the difference between the largest value in the set and the smallest.

Significant Figures:

The significant figures in a number are all of the certain digits plus the first uncertain digit. Rules for
determining the number of significant figures:

1. Disregard all initial zeros.


2. Disregard all final zeros unless they follow a decimal point.
3. All remaining digits including zeros between nonzero digits are significant.

For example, 30.24 mL, 0.03024 L, 8.000, 8.000 × 10³, and 800.0 each have four significant figures.

 In addition or subtraction, the answer should keep the same number of digits after the decimal point
as the quantity with the fewest, e.g., 3.4 + 0.020 + 7.31 = 10.7. We can generalize and say that, for
addition and subtraction, the result should have the same number of decimal places as the number
with the smallest number of decimal places.
 It is suggested for multiplication and division that the answer should be rounded so that it
contains the same number of significant digits as the original number with the smallest number
of significant digits.

Rounding Off Data:

In rounding, when the digit to be dropped is greater than 5, the last retained digit is increased by one.
When the digit to be dropped is less than 5, the last retained digit is unchanged. When the digit to be
dropped is exactly 5, always round so that the result ends with an even number. Thus, 0.635 rounds to
0.64 and 0.625 rounds to 0.62.

CONFIDENCE INTERVALS

The confidence interval for the mean is the range of values within which the population mean µ is
expected to lie with a certain probability.

In most quantitative chemical analyses, the true value of the mean µ cannot be determined because a
huge number of measurements (approaching infinity) would be required. With statistics, however, we
can establish an interval surrounding the experimentally determined mean within which the population
mean µ is expected to lie with a certain degree of probability. This interval is known as the confidence
interval.

Sometimes the limits of the interval are called confidence limits. For example, we might say that it is
99% probable that the true population mean for a set of potassium measurements lies in the interval
7.25 ± 0.15% K. Thus, the probability that the mean lies in the interval from 7.10 to 7.40% K is 99%.

The size of the confidence interval, which is computed from the sample standard deviation, depends on
how well the sample standard deviation s estimates the population standard deviation σ. If s is a good
estimate of σ, the confidence interval can be significantly narrower than if the estimate of σ is based on
only a few measurement values. The confidence level is the probability that the true mean lies within a
certain interval and is often expressed as a percentage.
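A minimal sketch of the interval calculation, assuming s is a good estimate of σ (the large-N case, where the z statistic applies); for small N the t distribution should be used. The 7.25% K figures are hypothetical:

```python
# z-based confidence interval for the mean: x_bar +/- z * sigma / sqrt(N).
from math import sqrt
from statistics import NormalDist

def confidence_interval(xbar, sigma, n, level=0.95):
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # two-sided critical value
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical: mean 7.25 % K, sigma estimate 0.10, N = 10, 99% confidence.
low, high = confidence_interval(7.25, 0.10, 10, level=0.99)
print(round(low, 2), round(high, 2))  # 7.17 7.33
```

Raising the confidence level widens the interval; increasing N narrows it by √N.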

Comparison of experimental mean and known value:

In most cases, we use a statistical hypothesis test to draw conclusions about the population mean
µ and its nearness to the known value, which we call µ0. There are two contradictory outcomes that we
consider in any hypothesis test.

1. The first, the null hypothesis H0, states that µ = µ0.


2. The second, the alternative hypothesis Ha can be stated in several ways.
(a) We might reject the null hypothesis in favor of Ha if µ is different from µ0 (µ ≠ µ0).
(b) Other alternative hypotheses are µ > µ0 or µ < µ0.

For tests concerning one or two means, the test statistic might be the z statistic if we have a
large number of measurements or if we know σ.

Quite often, however, we use the t statistic for small numbers of measurements with unknown
σ. When in doubt, the t statistic should be used.

Small Sample t Test:

For a small number of results, we use a procedure similar to the z test except that the test statistic is the
t statistic. Again we test the null hypothesis H0: µ = µ0, where µ0 is a specific value of µ such as an
accepted value, a theoretical value, or a threshold value. The procedure is:

1. State the null hypothesis: H0:  = 0


̅
2. Form the test statistic: t =

3. State the alternative hypothesis Ha and determine the rejection region.

3rd Revision, Jan‘18 Page 13/20



As an illustration, consider the testing for systematic error in an analytical method. In this case, a sample
of accurately known composition, such as a standard reference material, is analyzed. Determination of
the analyte in the material gives an experimental mean that is an estimate of the population mean. If
the analytical method had no systematic error, or bias, random errors would give the frequency
distribution shown by curve A in Figure.

Method B has some systematic error so that x̄B, which estimates µB, differs from the accepted value µ0. The bias is given by

bias = µB − µ0

In testing for bias, we do not know initially whether the difference between the experimental mean and
the accepted value is due to random error or to an actual systematic error. The t test is used to
determine the significance of the difference.
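A minimal sketch of this one-sample t test, using made-up replicate results and a certified value µ0 chosen purely for illustration:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical replicate results for an analyte in a standard reference material
mu0 = 0.123                                 # accepted (certified) value, assumed
data = [0.112, 0.118, 0.115, 0.119, 0.114]

N = len(data)
x_bar = mean(data)
s = stdev(data)

# Test statistic: t = (x_bar - mu0) / (s / sqrt(N))
t = (x_bar - mu0) / (s / sqrt(N))

# Two-tailed critical value at the 95% confidence level, N - 1 = 4 degrees of freedom
t_crit = 2.776

print(f"t = {t:.2f}; reject H0: {abs(t) > t_crit}")   # → t = -5.74; reject H0: True
```

Here |t| exceeds the critical value, so the difference between x̄ and µ0 would be judged significant, i.e., evidence of bias rather than random error alone.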

ANALYSIS OF VARIANCE (ANOVA)

The methods used for multiple comparisons fall under the general category of analysis of variance, often
known by the acronym ANOVA. The basic principle of ANOVA is to compare the variations between the
different factor levels (groups) to that within factor levels. These methods use a single test to determine whether or not there is a difference among the population means, rather than making pairwise comparisons as is done with the t test.

After ANOVA indicates a potential difference, multiple comparison procedures can be used to identify
which specific population means differ from the others. Experimental design methods take advantage of
ANOVA in planning and performing experiments.

ANOVA Concepts / definitions:

In ANOVA procedures, we detect differences among several population means by comparing variances. For comparing I population means, µ1, µ2, µ3, …, µI, the null hypothesis H0 is of the form

H0: µ1 = µ2 = µ3 = … = µI
and the alternative hypothesis Ha is


Ha: at least two of the µi's are different


The following are typical applications of ANOVA:

1. Is there a difference in the results of five analysts determining calcium by a volumetric method?
2. Will four different solvent compositions have differing influences on the yield of a chemical
synthesis?
3. Are the results of manganese determinations by three different analytical methods different?
4. Is there any difference in the fluorescence of a complex ion at six different values of pH?

In each of these situations, the populations have differing values of a common characteristic called a
factor or sometimes a treatment. In the case of determining calcium by a volumetric method, the factor
of interest is the analyst. The different values of the factor of interest are called levels. For the calcium
example, there are five levels corresponding to analyst 1, analyst 2, analyst 3, analyst 4, and analyst 5.

The comparisons among the various populations are made by measuring a response for each
item sampled. In the case of the calcium determination, the response is the amount of Ca (in millimoles)
determined by each analyst. For the four examples above, the factors, levels, and responses are summarized in the accompanying table.

The factor can be considered the independent variable, while the response is the dependent variable.
Figure illustrates how to visualize ANOVA data for the five analysts determining Ca in triplicate. The type
of ANOVA shown in Figure is known as a single-factor, or one-way, ANOVA.


Often, several factors may be involved, such as in an experiment to determine whether pH and
temperature influence the rate of a chemical reaction. In such a case, the type of ANOVA is known as a
two-way ANOVA. We consider only single-factor ANOVA. Take the triplicate results for each analyst in
above Figure to be random samples. In ANOVA, the factor levels are often called groups.

The basic principle of ANOVA is to compare the between-groups variation to the within-groups
variation. In our specific case, the groups (factor levels) are the different analysts, and this case is a
comparison of the variation between analysts to the within-analyst variation.

When H0 is true, the variation between the group means is close to the variation within groups. When
H0 is false, the variation between group means is large compared to the variation within groups.

The basic statistical test used for ANOVA is the F test. A large value of F compared to the critical value
from the tables may give us reason to reject H0 in favor of the alternative hypothesis.
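The between-groups versus within-groups comparison can be sketched in a few lines of plain Python. The triplicate values below are invented for illustration; in a real problem they would be the five analysts' Ca results:

```python
from statistics import mean

# Triplicate Ca results (mmol) for five analysts; the numbers are invented
groups = [
    [10.3, 9.8, 11.4],   # analyst 1
    [9.5, 8.6, 8.9],     # analyst 2
    [12.1, 13.0, 12.5],  # analyst 3
    [9.6, 8.3, 8.2],     # analyst 4
    [11.6, 12.5, 11.4],  # analyst 5
]

I = len(groups)                       # number of factor levels (groups)
N = sum(len(g) for g in groups)       # total number of measurements
grand_mean = sum(sum(g) for g in groups) / N
group_means = [mean(g) for g in groups]

# Between-groups and within-groups sums of squares
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

msb = ssb / (I - 1)                   # between-groups mean square
msw = ssw / (N - I)                   # within-groups mean square
F = msb / msw                         # compare to critical F(I - 1, N - I)
print(f"F = {F:.1f}")                 # → F = 21.0
```

For these data F is far above the tabulated critical value F(4, 10) ≈ 3.48 at the 95% confidence level, so H0 would be rejected: at least two analysts differ.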

RECOMMENDATIONS FOR TREATING OUTLIERS:

There are a number of recommendations for the treatment of a small set of results that contains a
suspect value:

1. Re-examine carefully all data relating to the outlying result to see if a gross error could have
affected its value.
2. If possible, estimate the precision that can be reasonably expected from the procedure to be
sure that the outlying result actually is questionable.
3. Repeat the analysis if sufficient sample and time are available. Agreement between the newly
acquired data and those of the original set that appear to be valid will lend weight to the notion
that the outlying result should be rejected. Furthermore, if retention is still indicated, the
questionable result will have a small effect on the mean of the larger set of data.


4. If more data cannot be secured, apply the Q test to the existing set to see if the doubtful result
should be retained or rejected on statistical grounds.
5. If the Q test indicates retention, consider reporting the median of the set rather than the mean.
The median has the great virtue of allowing inclusion of all data in a set without undue influence
from an outlying value. In addition, the median of a normally distributed set containing three
measurements provides a better estimate of the correct value than the mean of the set after
the outlying value has been discarded.
Qexp = |questionable value − nearest value| / (largest value − smallest value)
     = (XN − XN−1) / (XN − X1), when the largest value XN is the suspect one
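A small sketch of the Q test; the data set and the quoted Qcrit value are illustrative assumptions, not taken from the notes:

```python
def q_exp(data):
    """Qexp for the largest value of the set; use (xs[1] - xs[0]) in the
    numerator instead when the smallest value is the suspect one."""
    xs = sorted(data)
    return (xs[-1] - xs[-2]) / (xs[-1] - xs[0])

# Four replicate results with one suspect high value (made-up numbers)
data = [55.95, 56.00, 56.04, 56.23]
print(round(q_exp(data), 2))   # → 0.68
```

Since 0.68 is below Qcrit ≈ 0.829 for N = 4 at the 95% confidence level, the suspect result would be retained, and reporting the median could then be considered as recommended above.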

SAMPLING

Analytical Samples and Methods:


Many factors are involved in the choice of a specific analytical method. Among the most important
factors are the amount of sample and the concentration of the analyte.
Types of Samples and Methods: Often, we distinguish a method of identifying chemical species, a
qualitative analysis, from one to determine the amount of a constituent, a quantitative analysis.
Quantitative methods are traditionally classified as gravimetric methods, volumetric methods, or
instrumental methods. Another way to distinguish methods is based on the size of the sample and the
level of the constituents.
Sample Size: The term macro analysis is used for samples whose masses are greater than 0.1 g. A semimicro analysis is performed on samples in the range of 0.01 to 0.1 g, and samples for a micro analysis are in the range 10⁻⁴ to 10⁻² g. For samples whose mass is less than 10⁻⁴ g, the term ultramicro analysis is sometimes used.

Thus the analysis of a 1-g sample of soil for a suspected pollutant would be called a macro analysis
whereas that of a 5-mg sample of a powder suspected to be an illicit drug would be a micro analysis. A
typical analytical laboratory processes samples ranging from the macro to the micro and even to the
ultramicro range. Techniques for handling very small samples are quite different from those for treating
macro samples.
Constituent Types: The constituents determined in an analytical procedure can cover a huge range in
concentration. In some cases, analytical methods are used to determine major constituents, which are
those present in the range of 1 to 100% by mass. Many of the gravimetric and some of the volumetric


procedures are examples of major constituent determinations. Constituents in the range of 0.01 to 1% are usually termed minor constituents, while those present in amounts between 100 ppm (0.01%) and 1 ppb are
called trace constituents. Components present in amounts lower than 1 ppb are usually considered to
be ultratrace constituents.
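The four concentration ranges can be captured in a small helper function; the function name and example values are illustrative, not part of the notes:

```python
def constituent_type(mass_fraction):
    """Classify an analyte by mass fraction (1% = 1e-2, 1 ppm = 1e-6, 1 ppb = 1e-9)."""
    if mass_fraction >= 1e-2:      # 1 to 100%: major constituent
        return "major"
    elif mass_fraction >= 1e-4:    # 0.01 to 1%: minor constituent
        return "minor"
    elif mass_fraction >= 1e-9:    # 1 ppb to 100 ppm (0.01%): trace constituent
        return "trace"
    else:                          # below 1 ppb: ultratrace constituent
        return "ultratrace"

print(constituent_type(0.35))   # 35% by mass → major
print(constituent_type(5e-6))   # 5 ppm → trace
```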

Sampling in Chemical Analysis: A chemical analysis is most often performed on only a small fraction of
the material of interest, for example a few milliliters of water from a polluted lake. The composition of
this fraction must reflect as closely as possible the average composition of the bulk of the material if the
results are to be meaningful. The process by which a representative fraction is acquired is termed
sampling. Often, sampling is the most difficult step in the entire analytical process and the step that
limits the accuracy of the procedure.
Sampling for a chemical analysis necessarily requires the use of statistics because conclusions will be
drawn about a much larger amount of material from the analysis of a small laboratory sample. From the
observation of the sample, we use statistics, such as the mean and standard deviation, to draw
conclusions about the population.
Obtaining a Representative Sample: The sampling process must ensure that
the items chosen are representative of the bulk of material or population. The
items chosen for analysis are often called sampling units or sampling
increments.
In the statistical sense, the sample corresponds to several small parts taken
from different parts of the bulk material. To avoid confusion, chemists usually
call the collection of sampling units or increments, the gross sample. For
analysis in the laboratory, the gross sample is usually reduced in size and
homogenized to create the laboratory sample.
In some cases, e.g. powders, liquids, and gases, materials may not be homogeneous because they consist of microscopic particles of different compositions or, in the case of fluids, zones with different concentrations of the analyte. In such cases, we can prepare a representative sample by taking our sample increments from different regions of the bulk material.
Goals of Sampling: In sampling, a sample population is reduced in size to an amount of homogeneous
material that can be conveniently handled in the laboratory and whose composition is representative of
the population.
Statistically, the goals of the sampling process are:


1. To obtain a mean analyte concentration that is an unbiased estimate of the population mean.
This goal can be realized only if all members of the population have an equal probability of being
included in the sample.
2. To obtain a variance in the measured analyte concentration that is an unbiased estimate of the
population variance so that valid confidence limits can be found for the mean, and various
hypothesis tests can be applied. This goal can be reached only if every possible sample is equally
likely to be drawn.
Both goals require obtaining a random sample. This does not mean that the samples are chosen in a haphazard manner; rather, a randomization procedure is applied to obtain such a sample.
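A randomization procedure can be as simple as drawing sampling units with Python's random module; the population of 50 numbered units below is hypothetical:

```python
import random

# Hypothetical population: 50 numbered sampling units (e.g. containers in a lot)
population = list(range(1, 51))

random.seed(42)                               # fixed seed so the sketch is reproducible
increments = random.sample(population, k=5)   # each unit equally likely, no repeats
print(sorted(increments))
```

Because every unit has the same probability of selection, the mean of the resulting measurements is an unbiased estimate of the population mean, which is exactly goal 1 above.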
The Gross Sample:
The gross sample is the collection of individual sampling units. It must be representative of the whole in
composition and in particle-size distribution. Ideally, the gross sample is a miniature replica of the entire
mass of material to be analyzed. It should correspond to the bulk material in chemical composition and
in particle-size distribution if the sample is composed of particles.
Size of the Gross Sample
Gross sample size is determined by
1. the uncertainty that can be tolerated between the composition of the gross sample and that
of the whole.
2. the degree of heterogeneity of the whole
3. the level of particle size at which heterogeneity begins.
To obtain a truly representative gross sample, a certain number N of particles must be taken. The
magnitude of this number depends on the uncertainty that can be tolerated (point 1 above) and how
heterogeneous the material is (point 2 above).
Heterogeneity develops in particles that may have dimensions on the order of a centimeter or more and
may be several grams in mass. Intermediate between these extremes are colloidal materials and
solidified metals. With colloidal materials, heterogeneity is first encountered in the range of 10⁻⁵ cm or
less. In an alloy, heterogeneity first occurs in the crystal grains.
The number may range from a few particles to as many as 10¹² particles. The need for large numbers of
particles is of no great concern for homogeneous gases and liquids since heterogeneity among particles
first occurs at the molecular level. However, the individual particles of a particulate solid may have a
mass of a gram or more, which sometimes leads to a gross sample of several tons. In such materials, the
individual pieces of solid differ from each other in composition.
Sampling Homogeneous Solutions of Liquids and Gases:
For solutions of liquids or gases, the gross sample can be relatively small because they are homogeneous
down to the molecular level. Therefore, even small volumes of sample contain many more particles than
the number computed from Equation.


Whenever possible, the liquid or gas to be analyzed should be stirred well prior to sampling to make
sure that the gross sample is homogeneous. With large volumes of solutions, mixing may be impossible;
it is then best to sample several portions of the container with a “sample thief,” a bottle that can be
opened and filled at any desired location in the solution.
With the advent of portable sensors, it has become common in recent years to bring the laboratory to
the sample instead of bringing the sample back to the laboratory. Most sensors, however, measure only
local concentrations and do not average or sense remote concentrations.
Gases can be sampled by several methods. In some cases, a sampling bag is simply opened and filled
with the gas. In others, gases can be trapped in a liquid or adsorbed onto the surface of a solid.
Sampling Particulate Solids:
It is often difficult to obtain a random sample from a bulky particulate material. Random sampling can
best be accomplished while the material is being transferred. Mechanical devices have been developed
for handling many types of particulate matter.
Sampling Metals and Alloys:
Samples of metals and alloys are obtained by sawing, milling, or
drilling. In general, it is not safe to assume that chips of the
metal removed from the surface are representative of the
entire bulk, so solid from the interior must be sampled as well.
With some materials, a representative sample can be obtained
by sawing across the piece at random intervals and collecting
the “sawdust” as the sample. Alternatively, the specimen may
be drilled, again at various randomly spaced intervals, and the
drillings collected as the sample; the drill should pass entirely
through the block or halfway through from opposite sides. The
drillings can then be broken up and mixed or melted together in
a special graphite crucible. A granular sample can often then be
produced by pouring the melt into distilled water.
Preparing a Laboratory Sample:
For heterogeneous solids, the mass of the gross sample may
range from hundreds of grams to kilograms or more, and so
reduction of the gross sample to a finely ground and
homogeneous laboratory sample, of at most a few hundred
grams, is necessary.
This process involves a cycle of operations that includes
crushing and grinding, sieving, mixing, and dividing the sample
(often into halves) to reduce its mass.

Figure: Sampling of Analytical Sample

