NIR - Multivariate Calibration - 3rd Edition 2014 PDF
Multivariate Calibration
A practical guide
for developing methods in
quantitative analytical chemistry
TABLE OF SYMBOLS
Mb regression coefficient resulting from the combination of spectral data and
concentration data of the master measurement
Ci concentration of sample i
YAnalysis concentration values of analysed samples
Yi concentration value of sample i
MY Concentration data matrix resulting from the regression of the extinction and
concentration data measured with the master instrument
SY Concentration data matrix resulting from the combination of the spectral data
measured with the slave and the regression coefficient Mb
Ym mean concentration value
Yimeas measured concentration value of sample i
Yipred predicted concentration value of sample i
CHAPTER 1
INTRODUCTION
sical univariate calibration. Chapter 3 shows how unknown
samples can be analyzed using a chemometric model, and
how the quality of the analysis can be evaluated using simple
routines. These first chapters only deal with the theoretical
aspects of multivariate calibration. They do not give practical
advice on how to optimize a PLS model yet.
The practical procedure of setting up chemometric methods
is described in chapters 4, 5 and 6. They are very important
for the analyst as they give hints on how to optimize all
relevant parameters. Studying these chapters should
enable even a beginner to quickly set up a chemometric model
that provides optimum results for a given measurement
situation. These chapters mainly address the practical analyst.
These three chapters build the core of this tutorial. The most
important task is to enable the user to achieve an easy and
successful application of multivariate calibration techniques.
With the help of these chapters, an effective and systematic
method is demonstrated. A detailed knowledge of the
theoretical background of the statistical methods (i.e. an in-depth
knowledge of chapters 2 and 3) is not necessary.
Chapter 7 is a glossary of all relevant statistical terms that are
used in the specialized literature. It only serves as a quick
reference to help you to understand new terms. Once again,
learning the mathematical terms is not necessary to develop
chemometric methods successfully. The last chapter is a
summary and an outlook, in which the importance of
multivariate calibration is discussed in comparison with classical
univariate calibration.
This manual is by no means a comprehensive summary on
chemometric methods in analytical chemistry. It is designed
as a tutorial which aims at giving practical advice for the
daily laboratory routine. It should enable the user to set up
optimum chemometric models systematically, without
deeper knowledge of the theory of multivariate calibrations.
Nevertheless, this should not give the impression that a quick
and careless use of chemometric software is advisable. It
is a careful and attentive approach that influences the quality
of an analysis considerably.
CHAPTER 2
MULTIVARIATE CALIBRATION METHODS IN
ANALYTICAL CHEMISTRY
A. Theoretical Considerations
This correlation is described in the calibration model [1]:
$Y = X \cdot b$
with the calibration function b [1], which is often called
“regression coefficient” or “b-coefficient”:
$b = (X^{T} \cdot X)^{-1} \cdot X^{T} \cdot Y$
In this representation, the individual parameters X and Y are
written in matrix form. If they were to represent a spectro-
scopic measurement, for example, the spectral intensities
would be written into the X-matrix in rows, point by point.
Each additional sample would thus correspond to an addi-
tional row in the matrix. The corresponding component val-
ues would then be written into the rows of the Y-matrix. T
denotes the transpose of the associated matrices.
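The two steps can be sketched numerically as follows. This is only a minimal illustration under stated assumptions: the absorbance and component values are invented toy numbers, and the names `X`, `Y`, `b`, `x_new` and `y_new` are chosen here for illustration. Samples are written into the rows of X and Y, as described above.

```python
import numpy as np

# Hypothetical toy data: 5 calibration samples, absorbance at 2 spectral points.
# Each row of X is one sample (spectral intensities written in rows).
X = np.array([[0.10, 0.52],
              [0.21, 0.40],
              [0.29, 0.31],
              [0.41, 0.19],
              [0.50, 0.11]])
# One component value per sample, written into the rows of Y.
Y = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Calibration: b = (X^T X)^-1 X^T Y, solved without forming the inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ Y)

# Analysis: connect the calibration function b to the spectrum of a new sample.
x_new = np.array([[0.35, 0.25]])
y_new = x_new @ b
print(b.ravel(), y_new.ravel())
```

Solving the linear system instead of inverting X^T X gives the same b but is numerically more robust, which matters once many strongly correlated spectral points are used.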
After the calibration, the analysis is performed. By connect-
ing the calibration model to the measured parameter X, the
system property Y of an unknown sample is determined. This
is depicted schematically in the following figure:
Step 1: Calibration
Step 2: Analysis
Figure 2.2 Calibration of absorbance spectra
to be analyzed [2]. In this way, multi-component systems can be
analyzed such that for each additional component the
absorbance value at one additional, suitable wavelength is
added to the calibration set.
In many cases, a univariate calibration leads to an insufficient
prediction capability, as this method has several shortcomings:
The concentration of the analyte is only correlated to a single
point in the spectrum, and therefore, when evaluating new,
uncalibrated samples, neither outliers nor the presence of
unknown interfering components can be recognized. In
other words: from the peak height or peak area, one cannot
assume that the structure of the measured spectra fits the
calibration data.
Statistical variation of the signal, such as detector noise, is
directly incorporated into the concentration data. The
resulting uncertainty usually has to be minimized by
multiple sample measurements and subsequent averaging of
the results.
A satisfactory calibration of multi-component systems requires
sufficient separation of the peak maxima. However, in many
cases this is simply not possible, especially in near infrared
spectroscopy.
In the analysis of multi-component systems, a linear additivity
of the absorbance values of all analytes at the measured
wavelength is assumed (Beer-Lambert Law). Plotting the
absorbance values against the concentration leads to a linear
calibration function b (see figure). In many cases, this is
not valid for real systems. Intermolecular forces or
temperature effects can lead to distortions of the respective
analyte bands. Furthermore, there are several techniques for
which the Beer-Lambert law is not valid, such as diffuse
reflectance measurements, a technique often used in IR
spectroscopy.
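For reference, the linear additivity assumed above can be written as follows. This is the textbook form of the Beer-Lambert law for a mixture; the symbols for the molar absorptivity (epsilon), the optical pathlength (d) and the wavenumber are assumptions introduced here and do not appear in this tutorial.

```latex
A(\tilde{\nu}) = \sum_{k=1}^{L} \varepsilon_k(\tilde{\nu}) \, c_k \, d
```

Here A is the absorbance at one wavenumber and c_k is the concentration of component k out of L components. A univariate calibration relies on this sum being dominated by a single analyte at the chosen wavenumber.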
Thus an analysis using classical univariate calibration meth-
ods may often lead to useless results for multi-component
systems. One method for solving such problems is the use of
multivariate calibration methods, such as Multiple Linear
Regression (MLR), Principal Component Regression
(PCR), or Partial Least Squares (PLS)-Regression.
B. PLS-Regression
data sets is compressed into factors and can then be used for
the calibration.
In case of a PLS calibration, the eigenvectors are sorted in
descending order. The first factor characterizes the main
changes of the observed spectrum. It has the largest impor-
tance for the calibration model. With an increasing number
of factors, ever smaller changes in the data structure are
characterized. This has an important consequence for the
evaluation of the substance spectra: The lower factors mostly
characterize the important changes in spectral structures,
whereas the higher factors mainly represent the disturbing
part of the spectral noise.
Figure 2.4 Encoding the spectral and concentration data in matrix form. In this example M
calibration samples were measured and - in a second step - all N wavelengths of
the resulting spectra are written in rows into a (M, N)-matrix. This matrix is
equivalent to the spectral data matrix X. In the same way, all L component values
are written into a (M, L)-concentration data matrix.
to a deterioration of the analysis, as too many parts of the
disturbing spectral noise are incorporated (“overfitting”).
In a PLS regression, the spectral data matrix X and the con-
centration data matrix Y are reduced to only a few factors.
The original matrices are then represented as the sum of A
products of a so-called scores vector ti with the loadings
vector pi, or qi respectively [3]:
spectral data:
$X = t_1 p_1^{T} + t_2 p_2^{T} + t_3 p_3^{T} + \dots + t_R p_R^{T}$
concentration data:
$Y = t_1 q_1^{T} + t_2 q_2^{T} + t_3 q_3^{T} + \dots + t_R q_R^{T}$
In all cases, the scores and loading values are displayed as
vectors. This becomes more apparent in the schematic representation
of the equation for the spectral data [3]:
Figure 2.5 Schematic diagram for the factorization of the spectral data matrix X
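The factorization into scores and loadings vectors can be illustrated with a few lines of NumPy. Note the assumptions: the data are random toy numbers, and a singular value decomposition is used as a stand-in to obtain scores and loadings (it is not the PLS factorization itself, which also takes the concentration data into account). The names `T`, `P` and `X_rebuilt` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: M = 6 samples, N = 20 spectral points, built from R = 2 structures.
M, N, R = 6, 20, 2
X = rng.random((M, R)) @ rng.random((R, N))

# SVD-based factorization (stand-in for the PLS factorization): X = U S V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :R] * s[:R]   # scores vectors t_1..t_R as columns, shape (M, R)
P = Vt[:R].T           # loadings vectors p_1..p_R as columns, shape (N, R)

# X as the sum of R products of a scores vector with a loadings vector.
X_rebuilt = sum(np.outer(T[:, i], P[:, i]) for i in range(R))
print(np.allclose(X, X_rebuilt))
```

Because the toy matrix contains exactly two underlying structures, two factors reproduce it completely. With real spectra, the higher factors would mainly carry noise.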
calibration. Furthermore, it is possible to determine outliers
during the analysis, and one can decide whether unknown
disturbance components which do not correlate with the data
set have led to spectral changes. As it is possible - in contrast
to univariate calibration - to take the spectroscopic
information of the peak flanks into account, spectra can also
be evaluated by means of their structure. For this reason,
strongly overlapping bands can be distinguished in the
spectra, as long as there is a small variance in their shape [8]. In
a similar way, spectral structures can be recognized in very
noisy regions, which leads to a corresponding improvement
in predictive accuracy of the concentration data.
In a PLS regression, the data sets are first decomposed into their
principal components. Then a fitting of the scores vectors from
spectral data and concentration data is carried out.
Consequently, the method is more robust against inaccuracies in
the reference and sample measurements.

The special importance of PLS regression in analytical
chemistry arises from its simultaneous and mutually dependent
factorization of the X- and Y-data. When evaluating
absorption spectra, one can assume that changes of the spectral
data have their origin in variations of the corresponding
analyte concentration. This means that a variation in the
concentration data should lead to a corresponding change in the
spectrum. Therefore, the scores vectors of the concentration data
and spectral data matrices should be identical. However, in the
case of real samples, errors in sample preparation and in the
reference methods used to determine the concentration
values as well as instrument drifts and spectral noise will
lead to different scores vectors, if the matrices were reduced
purely by mathematical methods (i.e. independently).
Therefore, in the PLS method, identical scores vectors for
both data sets at the given factor numbers are assumed. They
are chosen so that they have the smallest possible deviation
from the original values. This is a compromise between the
factors’ suitability for describing the samples and the
increase in correlation between the data sets.
The so-called PLS1 algorithm only takes the concentration
values of one analyte into account. All other data are interpreted
as disturbances, i.e. the Y-matrix of the concentration
data is a vector. In the PLS2 algorithm, the concentrations of
all components in the system are taken into account for the
calibration. For the prediction of new samples, the model
delivers a simultaneous analysis of all calibrated substances.
As - in contrast to the PLS1 calibration - all data of the
concentration matrix must be correlated with the data of the
spectral matrix, the PLS2 prediction typically returns poorer
results than the PLS1 prediction [6]. For this reason, it is
generally advisable to perform a calibration with the PLS1
algorithm. To carry out an analysis of a multi-component
system, this algorithm is applied successively to all calibrated
components, so that a model for all desired
components is established, as with the PLS2 algorithm.
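The idea of applying a one-analyte algorithm successively to every component can be sketched as follows. This is a minimal textbook-style PLS1 (NIPALS) implementation written for illustration, not the implementation of any particular software package; the function names `pls1_fit` and `pls1_predict` and the toy data are assumptions.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """Fit a PLS1 model for one analyte; returns (x_mean, y_mean, b)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean            # mean-centred working copies
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xr.T @ yr
        w = w / np.linalg.norm(w)              # weight vector
        t = Xr @ w                             # scores vector shared by X and y
        tt = t @ t
        p = Xr.T @ t / tt                      # loadings of the spectral data
        qa = yr @ t / tt                       # loading of the concentration data
        Xr = Xr - np.outer(t, p)               # remove the part already described
        yr = yr - qa * t
        W.append(w); P.append(p); q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)        # regression coefficients
    return x_mean, y_mean, b

def pls1_predict(model, X):
    x_mean, y_mean, b = model
    return (X - x_mean) @ b + y_mean

rng = np.random.default_rng(1)
X = rng.random((8, 3))                                       # hypothetical spectra
Y = np.column_stack([X @ np.array([1.0, 2.0, 0.5]) + 2.0,    # component 1
                     X @ np.array([0.3, -1.0, 4.0]) + 1.0])  # component 2

# PLS1 applied successively: one separate model per calibrated component.
models = [pls1_fit(X, Y[:, k], n_factors=3) for k in range(Y.shape[1])]
Y_pred = np.column_stack([pls1_predict(m, X) for m in models])
print(np.abs(Y_pred - Y).max())
```

In this noise-free toy case with as many factors as spectral points, each model reproduces its component exactly. In practice the number of factors is chosen by validation, separately for each component.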
CHAPTER 3
VALIDATION OF CHEMOMETRIC MODELS
AND ANALYSIS OF UNKNOWN SAMPLES
and concentration data is of major importance for the quality
of an analysis. With a good correlation, the analysis is
relatively exact. In contrast, a bad correlation can never lead
to good analysis results. Therefore, it is necessary to find the
function b which yields the best correlation (and hence the
best possible analysis results).
Generally, a validation is possible only with “independent”
samples, i.e. the spectra of these samples must not be
included in the calibration data set.

For this purpose, the acquired model must be validated, i.e.
evaluated. Such an evaluation is performed by predicting a
certain number of samples with known analyte concentration
with the chemometric model. A comparison of the predicted
values with the actual values shows the precision of the
model. This is carried out with a large number of different
model parameters. That parameter set which leads to the
smallest error in prediction characterizes the best method.
Validation of the different chemometric methods permits the
recognition of outliers and of the most suitable frequency
ranges and, in particular, allows the optimum number of
factors to be determined. Two types of validation are possible:
internal validation (cross validation) and external validation
(test set validation).
In the case of an internal validation, individual samples
defined by the user are taken from the calibration set. Using
the remaining samples, a chemometric model is established
and used to analyze the previously extracted samples. A
comparison of the results with the actual concentration
values shows how precisely the model predicts the samples.
By extracting the samples beforehand, it is guaranteed that
they are not known to the calibration model and are thus
independent. An independent data set is very important:
only in this way can the actual prediction accuracy be
assessed realistically.
To assess the complete data set, the samples analyzed previ-
ously are returned to the data set, and a second set of test
spectra is removed for analysis. This procedure of removing
samples, analysing them, and returning them to the calibra-
tion data set is continued successively until all samples have
been analyzed once. A comparison of the resulting analysis
values with the original raw data allows the calculation of the
predictive error of the complete data system, the RMSECV
(“Root Mean Square Error of Cross Validation”). This is a
quantitative measure for the mean accuracy of the predictive
capability of the chemometric model. The smaller this error,
the better the quality of the model.
For a cross validation it is important to remove only a few
samples from the data set, as the model built from the
remaining data set must be very similar to the model created
from the original data. For data sets with fewer than 50 samples
it is strongly recommended to remove no more than one
sample at a time for the cross validation.
The whole process is schematically illustrated in the following
figure:
4 Return the removed sample to the data set and remove a new
sample. Calculate new model and predict new sample: Y2meas
– Y2pred.
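The cycle of removing one sample, building a model from the rest, predicting the removed sample and returning it can be sketched as follows. Two assumptions apply: an ordinary least squares model with intercept is used as a simple stand-in for the chemometric model, and the RMSECV is computed in its usual textbook form as the root of the mean squared difference between measured and predicted values. All names and data are illustrative.

```python
import numpy as np

def fit(X, y):
    # Stand-in model: least squares with intercept (not a PLS model).
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rmsecv_leave_one_out(X, y):
    y_pred = np.empty(len(y), dtype=float)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i                # remove sample i
        coef = fit(X[keep], y[keep])                 # model from the remaining samples
        y_pred[i] = predict(coef, X[i:i + 1])[0]     # predict the removed sample
    return np.sqrt(np.mean((y - y_pred) ** 2))

rng = np.random.default_rng(5)
X = rng.random((10, 2))                              # hypothetical data
y_exact = 1.0 + X @ np.array([2.0, 3.0])
y_noisy = y_exact + rng.normal(scale=0.1, size=10)

print(rmsecv_leave_one_out(X, y_exact), rmsecv_leave_one_out(X, y_noisy))
```

Each prediction is made by a model that has never seen the predicted sample, which is what makes the resulting error a realistic estimate.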
(3-2)
The steps of a test set validation (external validation):
1 Set up the model, using all calibration spectra.
(3-3)
CHAPTER 4
turbing baseline drifts. In practice, subtraction of a straight
line, vector normalization or taking the first derivative of a
spectrum often leads to optimized PLS models.
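The three preprocessing options named above can be sketched as follows. These are simple textbook variants written for illustration and are assumptions, not the exact routines of any software package: the straight line is fitted over the whole spectrum, vector normalization is taken as mean-centring followed by scaling to unit length, and the derivative is a plain finite difference without smoothing.

```python
import numpy as np

def subtract_straight_line(spectrum):
    # Fit a straight line through the spectrum and subtract it.
    x = np.arange(spectrum.size)
    slope, offset = np.polyfit(x, spectrum, 1)
    return spectrum - (slope * x + offset)

def vector_normalize(spectrum):
    # Centre the spectrum, then scale it to unit vector length.
    centered = spectrum - spectrum.mean()
    return centered / np.linalg.norm(centered)

def first_derivative(spectrum):
    # Finite-difference derivative; constant offsets vanish.
    return np.gradient(spectrum)

# Hypothetical spectrum: one band on a drifting baseline.
x = np.arange(100, dtype=float)
band = np.exp(-(x - 50.0) ** 2 / 50.0)
drifting = band + 0.01 * x + 0.5

flat = subtract_straight_line(drifting)
normalized = vector_normalize(drifting)
derivative = first_derivative(drifting)
```

Which of these helps depends on the disturbance: a derivative removes offsets, while normalization compensates for differences in overall signal level.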
3. Step: Definition of an Appropriate Frequency Range
The selection of an appropriate frequency range is of crucial
importance for the quality of a PLS model. When setting up a
model, one should use the frequency range of the spectrum
where a good correlation between the changes in the spectral
and the concentration data can be found. The extent of
correlation can be judged easily by the coefficient of
determination R2 (see 4. Step).
4. Step: Validation and Optimization of the Method
The suitability of the chosen data preprocessing methods and
of the frequency range for the given measurement task is
evaluated during the validation. In this step, important
parameters like the coefficient of determination R2 and the
mean errors of prediction RMSECV or RMSEP are calculated.
In addition, an automatic outlier recognition is carried
out (see Chapter 5). The results are summarized in a report.
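For orientation, the two kinds of quantities named above are commonly defined as shown below. These are the usual textbook forms and are stated here as an assumption, not as the definition used by a specific software package. M is the number of validated samples; the other symbols follow the Table of Symbols.

```latex
\mathrm{RMSEP} = \sqrt{\frac{1}{M}\sum_{i=1}^{M}\left(Y_i^{\mathrm{meas}} - Y_i^{\mathrm{pred}}\right)^{2}}
\qquad
R^{2} = \left(1 - \frac{\sum_{i=1}^{M}\left(Y_i^{\mathrm{meas}} - Y_i^{\mathrm{pred}}\right)^{2}}{\sum_{i=1}^{M}\left(Y_i^{\mathrm{meas}} - Y_m\right)^{2}}\right)\cdot 100\,\%
```

The RMSECV has the same form as the RMSEP, with the predicted values taken from the cross validation instead of a separate test set.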
It is very important to take the coefficient of determination into
account, as a high value for R2 indicates a good correlation
between the spectral data and the concentration data.

To achieve an optimum method, the coefficient of determination
R2 and the respective mean error of prediction are
summed up in a table for all “sensible” combinations of data
preprocessing and frequency windows (it is the task of the
application specialist to find meaningful combinations;
a priori no general recommendation can be given). Table 4.1
shows a way of summarizing the validation results.
Table 4.1 Form for Comparing the Quality of Validation Results
No.  Data preprocessing  Frequency ranges [cm-1]  Optimum rank  Coeff. of determination R2 [%]  Mean error of prediction  Remarks
1    1st Derivative      7,835-8,905              9             99.78                           0.035                     high rank
2    MSC                 4,755-5,235              5             99.80                           0.031                     optimum
3    1st Der. + VN       4,755-5,825              7             99.23                           0.041                     outliers
4    Vector Norm.        5,745-6,105              6             99.05                           0.052
The settings which lead to a high R2 value and a corre-
sponding low mean error of prediction should be used for the
calibration. Moreover, in many cases, it is sensible to choose
a setting which delivers a smaller number of factors with
equally good validation results.
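The selection rule can be expressed as a short sketch using the values of Table 4.1. The tolerance that decides when two errors count as “equally good” is a hypothetical value chosen for illustration; in practice this judgment is up to the analyst.

```python
# Validation results transcribed from Table 4.1.
results = [
    {"no": 1, "preprocessing": "1st Derivative", "rank": 9, "r2": 99.78, "error": 0.035},
    {"no": 2, "preprocessing": "MSC",            "rank": 5, "r2": 99.80, "error": 0.031},
    {"no": 3, "preprocessing": "1st Der. + VN",  "rank": 7, "r2": 99.23, "error": 0.041},
    {"no": 4, "preprocessing": "Vector Norm.",   "rank": 6, "r2": 99.05, "error": 0.052},
]

# Hypothetical tolerance: errors within this margin of the best count as equally good.
TOLERANCE = 0.005

best_error = min(r["error"] for r in results)
candidates = [r for r in results if r["error"] <= best_error + TOLERANCE]

# Among equally good settings, prefer the smaller number of factors (rank).
chosen = min(candidates, key=lambda r: (r["rank"], r["error"]))
print(chosen["no"], chosen["preprocessing"], chosen["rank"])
```

With these numbers the rule picks row 2, the setting marked “optimum” in the table, which combines the lowest error with the lowest rank.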
During the validation, potential outliers can be detected eas-
ily. They are distinguished, for example, by exceptionally
high F-values or FProb-values. If an independent check of
these samples confirms that the values were caused by an
erroneous measurement, they should be removed from the
data set.
5. Step: The Calibration
Only after all outliers have been removed from the calibration
data set, and after the optimum system parameters have
been found, is the final version of the model constructed.
During the calibration, the scores and loading vectors are
calculated and thus the calibration function b is determined
(see Chapter 2). These values are stored internally, and they
are now available for the analysis of new samples.
6. Step: The Analysis
For the analyte samples, the true concentration values
are not known. Therefore, it is necessary to check by
using statistical parameters whether the samples “fit” the
calibration spectra.

In this last step, the optimized chemometric model is used to
analyze new samples. Simultaneously, the credibility of the
analysis is checked by using characteristic parameters. One
option is the calculation of the so-called “Mahalanobis” distance.
Here the spectral structures of the complete calibration
data set are compared to the structure of the analyte
spectrum. If the spectrum contains structures which do not
“fit”, or if the component values of the analyte are outside the
calibration range, an increase of the Mahalanobis distance
can be observed (see Chapter 7).
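As an illustration of this check, the following sketch computes the Mahalanobis distance of a new sample from a set of calibration scores. It uses the general textbook definition (distance from the mean, scaled by the covariance of the calibration scores), which is an assumption and may differ in detail from the definition in Chapter 7. The scores are random toy numbers.

```python
import numpy as np

def mahalanobis(scores_cal, score_new):
    """Distance of one scores vector from the cloud of calibration scores."""
    mean = scores_cal.mean(axis=0)
    cov = np.cov(scores_cal, rowvar=False)
    d = score_new - mean
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

rng = np.random.default_rng(2)
scores_cal = rng.normal(size=(50, 2))        # hypothetical scores of 50 calibration samples

d_center = mahalanobis(scores_cal, scores_cal.mean(axis=0))
d_far = mahalanobis(scores_cal, np.array([6.0, -6.0]))
print(d_center, d_far)
```

A sample lying in the middle of the calibration data has a distance near zero, while a sample whose scores lie far outside the calibrated range produces a large value and should be treated with suspicion.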
An additional method, which is frequently used to determine
outliers, is the calculation of the spectral residuals. In this
method, the difference is calculated between the measured
spectrum and the spectrum which is theoretically expected
from the factor analysis of the calibration spectra. The
smaller this difference (i.e. the smaller the residual), the
more credible the analysis result (see Chapter 7). The value
of the spectral residual and the Mahalanobis distance are a
quantitative measure for the quality of the analysis result. In
addition, there are a number of further statistical parameters
for determining outliers, which are not discussed at this
point.
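The spectral residual can be sketched as follows: a spectrum is projected onto the loadings obtained from the calibration spectra, and the part that the factors cannot reproduce is summed up. The assumptions are that loadings from a singular value decomposition stand in for the factors of the calibration model, and that the residual is taken as the sum of squared differences. Band shapes and names are invented for illustration.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)

def band(center):
    # Hypothetical Gaussian absorption band.
    return np.exp(-(x - center) ** 2 / 0.01)

rng = np.random.default_rng(4)
conc = rng.random((10, 2))
cal = conc @ np.vstack([band(0.3), band(0.7)])   # 10 calibration spectra, 2 components

mean_spectrum = cal.mean(axis=0)
_, _, Vt = np.linalg.svd(cal - mean_spectrum, full_matrices=False)
loadings = Vt[:2].T                              # two orthonormal loadings vectors

def spectral_residual(spectrum, mean_spectrum, loadings):
    centered = spectrum - mean_spectrum
    scores = centered @ loadings
    expected = scores @ loadings.T               # spectrum expected from the factors
    return float(np.sum((centered - expected) ** 2))

fitting = 0.4 * band(0.3) + 0.6 * band(0.7)      # contains only calibrated structures
disturbed = fitting + 0.3 * band(0.5)            # contains an unknown extra band

res_fit = spectral_residual(fitting, mean_spectrum, loadings)
res_bad = spectral_residual(disturbed, mean_spectrum, loadings)
print(res_fit, res_bad)
```

The spectrum with the uncalibrated extra band leaves a clearly larger residual, which is the signal used to flag it as a possible outlier.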
Hence, the analysis delivers two relevant pieces of informa-
tion: It indicates the analysis value of the sample, and it pro-
vides an outlier determination. This ensures that the user is
alerted if, by mistake, an erroneous measurement causes in-
correct analysis results.
CHAPTER 5
SELECTION CRITERIA
the concentration range used to calibrate must be sufficiently
large in order to permit reliable analysis of these samples.
Figure 5.1 shows a comparison of reference values and
analysis values.
B. Creating Representative
Calibration Samples
Another important point is the consideration of external dis-
turbances, which can occur during the sample measurement
and which cannot be avoided by a suitable sample prepara-
tion. An example is the spectroscopic in-line measurement of
chemical processes. Frequently, preparation or thermostatting
of the sample is not possible. In the same way, modern quality
control often does not allow extensive sample preparation,
which would be technically possible, but is too expensive.
Here, the methods of chemometrics are an important tool to
solve these tasks. An overlapping or deformation of the analyte
signals - caused by contamination or by temperature
fluctuations of the sample - can be accounted for in the factorization
of the spectra. Since all relevant system parameters
are stored as independent information blocks (i.e. as factors),
disturbances that overlap the spectrum can easily be elimi-
nated. For this purpose, all potential disturbances are simu-
lated during the calibration and stored as independent factors.
When comparing these values with the reference values of the
analyte, the algorithm will “detect” that this information is not
from the analyte itself. The corresponding structures will thus
not be used for the prediction of new, unknown samples. In
other words: The PLS algorithm can distinguish between
analytically relevant and analytically useless structures in the
spectrum. Disturbances are detected during calibration and
eliminated in further analysis.
The task of the method developer is to measure the samples in
realistic conditions. Consequently, all potential disturbances
that interfere with the system should be taken into account in
order to make the algorithm “learn” to recognize and eliminate
them. All fluctuations that can occur in reality should be
accounted for in the calibration sample set. Only this will
ensure that the sample set is representative and suitable for
developing a method. Therefore, it is not advisable to carry
out a calibration under the most ideal conditions. Measuring
very pure, well thermostatted samples leads to a very small error
in analysis, but the model is not robust enough to work reliably
in practice.
The influence of the sample temperature is demonstrated by
the following example:
Because water forms hydrogen bonds, its spectrum is
very sensitive to temperature changes. With increasing
temperature, the H2O molecules move faster, which leads to
a breaking of the hydrogen bonds. The electron
density moves back into the O-H bond, which
leads to an increase in the force constant. The molecule
oscillates faster, and the respective absorption band in the
NIR spectrum is shifted to higher frequencies (wavenumbers).
The more the water is heated, the further the band
moves to higher wavenumbers. Moreover, the water expands
with increasing temperature. The resulting reduction of
density leads to a decrease in signal height. Thus, with
increasing temperature, the water spectrum shows a decrease
in signal intensity and a simultaneous shift of the peak
maxima to higher wavenumbers (see Figure 5.2).
Figure 5.2 The influence of the temperature on the 1st overtone of water
(optical pathlength 1 mm, reference: air)
For even larger temperature differences, a broadening and
distortion of the band can be observed, resulting from the
different population density of the single rotation-vibration
levels. For the sake of simplicity, a graphical description of
these deformations is left out here.
It is obvious that a calibration with thermostated aqueous
solutions can never supply reliable results if the actual anal-
ysis is done at different temperatures later.
If the calibration is carried out under strictly defined conditions,
it is necessary to do the same during analysis. Additionally,
it has to be taken into account that chemometric
methods must be adapted to changes in product composition
or product quality. This is also true if the measurement
parameters change. A method generally only works under
those conditions under which it was set up, and only those
disturbances are recognized which were accounted for in the
calibration. For this reason, a maintenance of the method is
often necessary at a later point.
Setting up and maintaining a method is quite easy, as it is not
necessary to correlate the reference values to the respective
disturbances or temperature values. This is also a result from
the factorization. To compensate for unwanted disturbances
in a multivariate calibration, the calculated analysis values
are not - as one could assume - corrected by an adjustment
function set up internally. Rather, the PLS1 algorithm tries to
find a relation between the component and the respective
spectral structure. Any other information is not accounted
for. There is no difference whether it is caused by an impurity,
a further analyte, or spectral noise (see Chapter 2).
Therefore, it is sufficient for a calibration to have a data set
which was measured under authentic conditions. The reference
values of the disturbing components are of no importance.
Referring to the above example, for a calibration of an
aqueous solution, it is entirely sufficient to measure the
samples at a representative range of temperatures.
Recording the temperature itself is not necessary. Also, if a
change in product quality requires maintenance, a
further representative set of samples is added to the existing
model. Afterwards, the model should work reliably again.
C. Ambient Influences on the Instrument
Figure 5.3 Absorbance spectra before (above) and after (below) the diffusion of humid
air into a NIR-spectrometer (MATRIX-F, Bruker Optik GmbH). The
spectra are shifted in y-direction for a better overview.
In the case of a collinear data set, the PLS algorithm cannot
assign the individual spectral bands clearly to the respective
component values. A method set up in this way is useless for
the analysis of non-collinear data sets.
This is obviously not possible if the component values can
change independently of each other. In this case, care should be
taken that the values are statistically distributed across
all samples. It is therefore not recommended to produce the
standards for the calibration from a dilution series, as not
only the analyte but all other components are diluted in the
same way. Again, the algorithm cannot differentiate between
the individual values, and a validation will lead to completely
useless results.
In this context, it is recommended not to measure a set of
calibration standards in ascending or descending order of the
concentration values. Systematic changes in the samples, for
example a rise in temperature during a series of
measurements, can simulate a correlation with a system
property. Measuring the samples again at a later stage and
comparing them with the spectra measured previously guar-
antees that no systematic changes are included in the cali-
bration.
It is useful to keep the problem of collinearity in mind right
from the beginning when setting up calibration data sets.
Afterwards, it may not be easy to recognize whether single
component values change in a collinear way. This is illus-
trated by the following example:
The numbers in each column can be generated by a simple
conversion from the numbers in the column “X”. They are
thus collinear. This applies not only with regard to the first
column. The numbers of each column behave collinearly to
the numbers of every other column. The relative changes of
individual component values are the same for each sample.
Nevertheless, the collinearity may not be recognizable at
first sight. Therefore, it is useful to pay attention to an independent
distribution of the individual component values
right from the beginning.
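One simple way to look for collinear component values before measuring is to compute the correlation coefficients between the columns of the planned concentration table. The sketch below compares a dilution series with independently chosen values. All numbers are hypothetical, and the correlation check is offered only as an illustration of the problem, not as a procedure prescribed by this tutorial.

```python
import numpy as np

# Hypothetical dilution series: one stock mixture of three components, diluted stepwise.
dilution = np.array([1.0, 0.8, 0.6, 0.4, 0.2])
stock = np.array([10.0, 20.0, 5.0])
C_dilution = np.outer(dilution, stock)           # rows: samples, columns: components

# Hypothetical independent design: component values chosen at random per sample.
rng = np.random.default_rng(3)
C_independent = rng.uniform(0.0, 20.0, size=(20, 3))

corr_dilution = np.corrcoef(C_dilution, rowvar=False)
corr_independent = np.corrcoef(C_independent, rowvar=False)
print(np.round(corr_dilution, 3))
print(np.round(corr_independent, 3))
```

In the dilution series every pair of components has a correlation coefficient of 1, so no algorithm can tell the components apart. In the independent design the off-diagonal values are clearly smaller.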
Gaseous measurements:
MIR: vacuum or dry nitrogen.
NIR: ambient air, rarely vacuum or dry nitrogen
Liquid measurements:
MIR: ambient air, in some cases pure solvent
(only at constant temperatures; see below).
NIR: ambient air.
Solid measurements:
MIR: dry potassium bromide or caesium iodide.
NIR: Teflon or rough metal surfaces which scatter
light (mainly gold).
In Raman spectroscopy, normally no background measure-
ment is carried out.
Figure 5.5 Spectral structure which emerges from the shift of the 1st overtone of water with
rising temperature. For the background and sample measurement pure water was
used. The background was taken at 30°C. The samples were measured with rising
temperature up to 60°C.
matrix for a background measurement. The equation mostly
valid in chemical analysis
tion. It is clear that, for example, fluctuations in sample
temperature during measurement or investigation of inho-
mogeneous or impure sample material can also lead to seri-
ous errors in the analysis, if they were not taken into account
in the method. The measurement error of the spectrometer or
the inaccuracy associated with the PLS model itself are
usually of negligible importance. Hence, it is primarily a
careful analytical technique as well as a reliable reference
method that determine the quality of the analysis.
If the quality of the reference data is difficult or impossible to
improve, it is advisable to make several repeat measurements
of the same sample and to calculate an average. Statistical
variations are then averaged out, and possible outliers have a
much smaller effect on the quality of the analysis. This is also
true for spectrometric analysis, where, in critical cases, errors
can be minimized by repeat sample measurements and
averaging.
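The benefit of averaging can be quantified under a simple assumption that is not stated in this tutorial: if the n repeat measurements have independent random errors with the same standard deviation, the standard deviation of their mean is

```latex
\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}
```

Four repeat measurements therefore halve the statistical scatter of the averaged value. Systematic errors of the reference method are not reduced in this way.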
other to validate the corresponding model. Both sets should
consist of about the same number of samples, and each set
should cover the whole concentration range of the system
evenly.
Cross Validation: If the number of available samples is lim-
ited (50 samples or less), one should do without a separate
test set. Here, only a cross validation can make reliable pre-
dictions about the quality of a model.
In practice, a cross validation is performed first to assess the
method. Only if it is guaranteed that a representative selection
of spectra has been measured is the definition of a separate
test set sensible. In contrast, the test set method has
advantages where large amounts of data are available, as the
calculation time is considerably shorter than with cross validation.
Most software packages offer both possibilities: internal and
external validation. This enables the analyst to estimate the
robustness of the method. This is carried out as follows:
In the first step, as described above, the best model parameters
are found by performing a cross validation. The resulting
values of R2 and RMSECV are written down. In a second
step, the data set is split into two equally sized parts. The first
“packet” is defined as the calibration data set, and the second
“packet” is the test set. Again, a method is set up with the
parameters defined previously, validated by a test set
validation and documented by writing down the R2 and
RMSEP values. In a third step, the test set and the calibration
data set are exchanged and validated with the same model
parameters (2nd test set validation). The resulting values of R2
and RMSEP are compared to the previously calculated values
of the first test set validation as well as of the cross validation.
Table 5.2 shows this for the error of analysis.
Table 5.2 Checking the Stability of a Calibration
                      Analysis Error of   Analysis Error of        Analysis Error of
                      Cross Validation    1. Test Set Validation   2. Test Set Validation
Stable Calibration    0.10                0.11                     0.10
Unstable Calibration  0.10                0.25                     0.33
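The exchange of calibration set and test set can be sketched as follows. As before, a least squares model with intercept is only a stand-in for the chemometric model, and the data are simulated; the names `part_a`, `part_b`, `rmsep_1` and `rmsep_2` are illustrative.

```python
import numpy as np

def fit(X, y):
    # Stand-in model: least squares with intercept (not a PLS model).
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def rmse(coef, X, y):
    y_pred = np.column_stack([np.ones(len(X)), X]) @ coef
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))

rng = np.random.default_rng(6)
X = rng.random((40, 3))                                   # hypothetical data, 40 samples
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.05, size=40)

# Split into two equally sized parts.
order = rng.permutation(40)
part_a, part_b = order[:20], order[20:]

# 1st test set validation: calibrate on part A, test on part B.
rmsep_1 = rmse(fit(X[part_a], y[part_a]), X[part_b], y[part_b])
# 2nd test set validation: exchange the two parts.
rmsep_2 = rmse(fit(X[part_b], y[part_b]), X[part_a], y[part_a])
print(rmsep_1, rmsep_2)
```

If both errors are close to each other and to the cross validation error, the calibration behaves like the stable case in Table 5.2. Strongly diverging values indicate an unstable calibration.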
I. Selecting Spectral Ranges
Therefore, it is often useful to measure all samples over the
whole spectral range and then look for spectral features that
enhance the model. In this context, absorbance bands with
absorbance values between 0.7 and 1.0 generally lead to the
best results. When using modern FT-spectrometers, typically
absorbance values of up to 2.5 can be used for the calibration.
However, it is necessary that the instrument is equipped with
detectors which work linearly in the whole range (e.g.
Peltier-cooled InAs or InGaAs detectors). Larger signals should
not be taken into account, as they will lead to a higher
uncertainty of the measurement.
Anyway, it is advisable to remove whole signal groups
successively. As, in IR, NIR and Raman spectroscopy, most
substances (with very few exceptions) have signals in a very
large spectral range, the search for a dedicated structure is
often not necessary. Table 5.3 and Figure 5.7 show a brief
overview of the frequency ranges which are normally
observed in near-infrared spectroscopy (more detailed
information is given in the appendix):
Table 5.3 Absorbance Signals in the Near Infrared
Group                Frequency Range [cm-1]   Name
Aliphatic            6,300 - 5,500            1st overtone CH-stretching
hydrocarbons         9,100 - 7,800            2nd overtone CH-stretching
                     5,000 - 4,100            combination
                     7,700 - 6,900            combination
Aromatic             ca. 6,000                1st overtone CH-stretching
hydrocarbons         ca. 9,000                2nd overtone CH-stretching
                     4,700 - 4,000            combination
                     7,300 - 6,900            combination
Carboxylic acid      ca. 6,900                1st overtone OH-stretching
                     ca. 5,250                2nd overtone CO-stretching
                     4,900 - 4,600            combination
Amines               7,000 - 6,500            1st overtone NH-stretching
                     5,200 - 4,500            combination
Water (very strong   7,500 - 6,400            1st overtone OH-stretching
absorptions)         5,400 - 4,900            combination
Figure 5.7 Absorbance bands of various functional groups in the NIR region.
J. Selecting the Data Processing Method
Multiplicative Scatter Correction: First, a mean spectrum is
calculated from all spectra of the calibration data set. Then,
each spectrum X(i) is transformed according to
$$ X(i)' = u + v \cdot X(i) $$
The coefficients u and v are chosen such that the difference
between the transformed spectrum X(i)' and the mean
spectrum has a minimum.
Application: This method is often used for measurements in
diffuse reflection.
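A minimal sketch of the multiplicative scatter correction as described above, assuming that "difference has a minimum" means a least-squares fit; the function name `msc` is hypothetical. Each spectrum is mapped onto the mean spectrum by its own pair of coefficients u and v.

```python
import numpy as np

def msc(spectra):
    """Multiplicative scatter correction: X(i)' = u + v * X(i),
    with u and v fitted per spectrum so that X(i)' is as close as
    possible (least squares) to the mean spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    mean_spectrum = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        # straight-line fit of the mean spectrum against spectrum i:
        # slope v, intercept u
        v, u = np.polyfit(x, mean_spectrum, deg=1)
        corrected[i] = u + v * x
    return corrected
```

Many implementations instead regress each spectrum on the mean spectrum and invert that relation; for spectra that differ only by offset and scaling both variants give the same result.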
First Derivative: calculates the first derivative of the spectrum
(Figure 5.8).
Application: By calculating the first derivative, the signals
with steep edges get more emphasis than relatively flat
bands. This method is used to emphasize pronounced, but
small features compared to huge broad-banded structures.
Another important application is the evaluation of broad
bands. This is often done in NIR technology. By calculating
the derivative, these structures get a steeper shape which can
be evaluated more easily.
When using the derivative as a data preprocessing method,
it has to be taken into account that the spectral noise is
enhanced as well. This superimposes the spectrum as an
additional disturbance and can deteriorate the analyte signal.
Second Derivative: calculates the second derivative of the
respective spectrum.
Application: Compared to the first derivative, even
extremely flat structures can be evaluated. The disturbing
influence of the spectral noise is generally so strong that
spectra can only be evaluated in a very restricted spectral
range.
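The effect of the derivatives can be illustrated with a short sketch. It uses simple central differences (`numpy.gradient`); commercial packages typically combine the derivative with smoothing, so this is an assumption made for brevity. The test spectrum (two Gaussian bands of equal height) and the noise level are invented for the illustration.

```python
import numpy as np

def derivative(absorbance, wavenumbers, order=1):
    """Numerical derivative of a spectrum by repeated central differences."""
    result = np.asarray(absorbance, dtype=float)
    for _ in range(order):
        result = np.gradient(result, wavenumbers)
    return result

# Hypothetical test spectrum: a sharp and a broad band of equal height.
wn = np.arange(4000.0, 10000.0, 2.0)
sharp = np.exp(-0.5 * ((wn - 6000.0) / 20.0) ** 2)
broad = np.exp(-0.5 * ((wn - 8000.0) / 200.0) ** 2)

d_sharp = np.abs(derivative(sharp, wn)).max()
d_broad = np.abs(derivative(broad, wn)).max()
emphasis = d_sharp / d_broad  # weight of the sharp band relative to the broad one

# Noise is enhanced relative to the signal: signal-to-noise before and after.
rng = np.random.default_rng(0)
noise = rng.normal(scale=0.001, size=wn.size)
snr_raw = broad.max() / noise.std()
snr_deriv = d_broad / derivative(noise, wn).std()
```

In this example the sharp band dominates the first derivative, while the signal-to-noise ratio of the broad band is clearly lower after differentiation than before.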
Figure 5.8 shows the influence of various data preprocessing
methods on the appearance of an NIR spectrum (measurement
of a human hand with a fiber optic probe). The
original spectrum shows a slight offset in the baseline as well
as a slight drift. This drift can be eliminated by the
subtraction of a straight line (broken line) and the offset by a
Min-Max Normalization (dotted line). The first derivative of
the original spectrum (line-dotted line) has been expanded
for reasons of display and shifted to higher absorbance
values. A relative enhancement of the sharp structures
compared to the original spectrum can be observed.
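The two baseline treatments mentioned here can be sketched as follows. The function names are hypothetical, the straight line is fitted by least squares over the whole selected range, and the normalization target range of 0 to 1 is an assumption; software packages may scale to a different maximum.

```python
import numpy as np

def subtract_straight_line(absorbance, wavenumbers):
    """Fit a straight line to the spectrum (least squares) and subtract it."""
    absorbance = np.asarray(absorbance, dtype=float)
    slope, intercept = np.polyfit(wavenumbers, absorbance, deg=1)
    return absorbance - (slope * wavenumbers + intercept)

def min_max_normalize(absorbance):
    """Shift the minimum to 0 and scale the maximum to 1."""
    a = np.asarray(absorbance, dtype=float)
    return (a - a.min()) / (a.max() - a.min())

# Hypothetical spectrum: one band on top of an offset and a linear drift.
wn = np.linspace(4000.0, 10000.0, 301)
band = np.exp(-0.5 * ((wn - 7000.0) / 150.0) ** 2)
drifting = band + 0.2 + 3e-5 * wn
flattened = subtract_straight_line(drifting, wn)
normalized = min_max_normalize(drifting)
```

A pure offset plus drift is removed completely by the straight-line subtraction, and a constant offset has no influence on the min-max normalized spectrum.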
Figure 5.8 NIR spectrum of a human hand measured in diffuse reflection.
K. Selecting the
Appropriate Number of Factors
In PLS regression, the spectral data and the concentration
data are first encoded in a matrix form and then reduced to
only a few factors. The number of factors of a chemometric
model is called "rank". The selection of this rank is of major
importance for the quality of the analysis. Choosing too few
factors will lead to an insufficient explanation of the changes
in the spectral and concentration data ("underfitting"). There
is only little correlation between the two data sets, and the
analysis results from this model are insufficient. If too many
factors are chosen, the model tries to account for even the
smallest changes in the data set, such as spectral noise
("overfitting"). This way, spectral information unspecific for
the sample is included in the model. A deterioration of the
analysis results is also to be expected from these models.
Thus, every PLS model has an optimum number of factors
which guarantees the smallest possible error of analysis.
There are numerous hints that lead to the optimum number of
factors for a certain model: The values of the mean error of
prediction (RMSECV for cross validation, and/or RMSEP
for test set validation) go through a minimum for the optimal
rank. In contrast, the value of the coefficient of determination
R2 possesses a maximum. Thus, the optimum rank for a
certain calibration model can be found easily: First, the R2
values and the mean error of prediction are computed. Then,
these values are plotted as a function of the rank. The rank is
to be regarded as optimal when the characteristics mentioned
go through an optimum value, and/or do not change
significantly for higher factor numbers. If several ranks lead
to comparably good results, it is recommended to select the
model with the smallest number of factors (see Chapter 6).
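The rule "smallest rank among comparably good results" can be written down as a short sketch. The function name `choose_rank` and the relative tolerance of 5% that defines "comparably good" are assumptions for the illustration; the error values in the usage line are invented.

```python
import numpy as np

def choose_rank(errors, tolerance=0.05):
    """Return the smallest rank whose error of prediction lies within
    `tolerance` (relative) of the minimum. errors[0] belongs to rank 1."""
    errors = np.asarray(errors, dtype=float)
    limit = errors.min() * (1.0 + tolerance)
    # argmax on a boolean array returns the first True entry
    return int(np.argmax(errors <= limit)) + 1

# Hypothetical RMSECV values for ranks 1 to 8
rank = choose_rank([0.90, 0.45, 0.21, 0.12, 0.08, 0.07, 0.09, 0.11])
```

A larger tolerance favors models with fewer factors at the cost of a slightly higher error of prediction.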
Caution: The validation of a method is only possible using
independent test spectra, i.e. the spectra must by no means
be part of the calibration data set. This applies if - in a cross
validation - all measurements of one sample are declared as
“leave-out spectra”. In case of a test set validation, new
samples should be measured for the test set.
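For the cross validation, the requirement above means that replicate spectra of one sample must leave the calibration set together. A minimal sketch, assuming that each spectrum carries a sample identifier (the name `sample_ids` is hypothetical):

```python
import numpy as np

def leave_one_sample_out(sample_ids):
    """Yield (calibration_idx, test_idx) pairs so that all replicate
    spectra of one sample are declared leave-out spectra together."""
    sample_ids = np.asarray(sample_ids)
    for sample in np.unique(sample_ids):
        test = np.flatnonzero(sample_ids == sample)
        calibration = np.flatnonzero(sample_ids != sample)
        yield calibration, test

# Six spectra measured on three samples
folds = list(leave_one_sample_out(["A", "A", "B", "B", "B", "C"]))
```

Splitting by spectrum index instead would leave replicates of the test sample in the calibration set, which makes the validation a dependent one.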
L. Selecting Suitable Calibration Samples;
Recognizing and Eliminating Outliers
ously, only by considering these values without any
knowledge of the corresponding application it is not possible
to judge if a particular spectrum really is an outlier. Therefore,
only the analyst himself can decide, on the basis of an
independent examination of the spectrum, whether this value
must actually be rejected for the particular application.
Figure 5.9 Presentation of two calibrations with insufficient precision.
in the spectrum properly to the respective components. Here,
an analysis of new unknown samples would lead to a precise,
but wrong analysis. The model must therefore be extended
by further calibration spectra.
In many cases, the causes for a bad result of an analysis can
be found easily, if the individual measured values are pre-
cisely illustrated with the analysis. It is far more critical, if
the individual points show a large statistic dispersion (right
picture of Figure 5.9). Here, the calculated values are not
only wrong, but also not reproducible. Usually, an unsuitable
sample preparation or measurement setup is responsible. An
example is the investigation of heterogeneous samples. If the
material is not homogenized, or if the measurement spot used
is too small, no accurate values can be found despite a precise
local measurement. This must be considered during the
evaluation of the results. Even if the measurement values
cannot be improved by suitable measures, it is frequently
advisable to add further samples to the calibration model.
If even this does not lead to a considerable improvement of
the predictive accuracy, it is to be feared that the selected
analytical method is not suitable.
N. Implementation and
Maintenance of Methods
It is generally recommended to examine the reliability of a
chemometric method from time to time. This can simply be
done by checking the validity of the method with an
independent test set at regular intervals (e.g. every "n" months,
every "n" batches, after "n" measurements, etc.). In some
industries, for example in the pharmaceutical industry,
appropriate procedure guidelines force the user to carry out
such examinations regularly (a medicine receives the official
release for sale only if the unrestricted usefulness of the
release analytics is documented in written form).
But even without appropriate regulations, "observation" and
maintenance of the method is advisable. It can often be
observed that a method, although it was useful at first,
gradually yields worse results. This can have a number of
reasons: First, a creeping change in the instrument (and changes
in cuvettes, probes and fiber optics) can be the cause.
Furthermore, changing site conditions or even different raw
material qualities affect the quality of the analysis.
Particularly the calibration of natural materials or of
petrochemical raw materials can make method maintenance
necessary over a long period.
For many applications, it is important that slow degradation
in the quality of analysis is recognized early. The demands of
modern release analytics and process technology leave no room
for repairing a method only after it has lost its usefulness for
the given task. The maintenance of a method should be a
continuous process in the background of daily routine
analytics.
If the method developer realizes that a number of test
samples were not analyzed correctly, it is important first to
review them and to compare their spectra with the earlier
calibration spectra. In most cases, the reason for outliers can
be found relatively quickly. Once the reason is understood,
suitable strategies can be developed to extend the sample set
with suitable samples. It should be mentioned here that it
does not make sense to add new spectra to the data set
continuously without prior examination. There is the danger
that the information necessary for a stable method is not
included in these samples. In this way, very large data sets are
produced which are by no means representative.
A further problem which we often encounter in process ana-
lytics is the impossibility to imitate process conditions in the
laboratory. Also, taking a sample and analyzing it with a
reference method is often not possible. In other words: it is
not possible to collect representative calibration samples.
Here, it is often useful to collect the samples first and then to
draw conclusions for the respective values of the process
from the results of the established routine analytics. If this is
not possible, one has to be content with a strongly limited
method that can only work as a process monitor stating
“Process OK” or “Process not OK”. Sometimes a provision-
ary method set up in the laboratory must be used which
delivers imprecise results but mirrors the relative progres-
sion of the reaction correctly.
Again, the judgment of a qualified analyst is necessary to
determine if the specified method satisfies the requirements of
the process.
CHAPTER 6
A PRACTICAL EXAMPLE
physical parameters in petrochemical products demand a far
higher effort.
To set up a method, the calibration spectra and their
respective reference values are read into the PLS software.
Appropriate frequency windows as well as data processing methods
are defined and a calibration is carried out. The quality of the
calibration is evaluated by means of a validation (see
Chapter 4). It is left to the user to decide whether to carry
out the evaluation using an external (test set) validation
or an internal (cross) validation. Very often, cross validation
offers advantages: All spectra are used for the calibration and
the consecutive validation. No part of the measurements is
lost by defining an external test data set (see Chapter …).
The quality of a calibration can easily be assessed by the
value of the coefficient of determination R2 and the error of
analysis (RMSECV or RMSEP). This is illustrated in the
following example with an NIR-spectroscopic analysis of a
mixture of Methanol CH3OH, Ethanol C2H5OH and
Propanol C3H7OH. The measurement was carried out using the
Bruker near-infrared spectrometer MATRIX-F with a
spectral range from 10,000 cm-1 to 4,000 cm-1 (optical path length
of the cuvettes: 2 mm; spectral resolution: 8 cm-1; connected
via … m light fibers). A total of 30 mixtures were measured
with a concentration range of 0 - 100%. Figure 6.1 shows a
selection of the corresponding spectra.
Note: NIR spectra in particular show strongly overlapping
bands. Multivariate calibration has distinct advantages over
the univariate calibration methods.
A strong overlapping of the analyte signals can be observed
even with this three-component mixture. There are mainly
four major signal groups: the COH-combination vibrations
(at 4,800 cm-1), the first overtones of the CH2- and CH3-groups
(6,000 cm-1 - 5,500 cm-1) and of the COH-group
(7,300 cm-1 - 6,000 cm-1), as well as the second overtones of
the CH2- and CH3-groups (8,800 cm-1 - 7,800 cm-1). The
spectrum shows no relevant signals above approx. 9,000 cm-1.
Below 4,400 cm-1 large parts of spectral noise can be found,
which can be explained by strong light loss in the light fiber
used.
Figure 6.1 Near-Infrared Spectra of mixtures from methanol, ethanol and propanol (cuvette
connected via … m light fibers; path length 2 mm)
If one performs a validation for an increasing number of
factors, one will usually see an improvement of the results of
analysis at first. The higher the selected rank, the more
spectral information is processed, and the better the results.
However, this cannot be continued arbitrarily. From a certain
critical number of factors, ever more portions of spectral
noise are added to the analysis model, and the quality of the
results deteriorates (overfitting).
This is shown in Figure 6.2. Here, the mean errors of
prediction for the PLS regression of the methanol concentration
of the 30 mixtures from CH3OH, C2H5OH and C3H7OH are
shown as an example for a rising number of factors. The
method evaluation was carried out using a cross validation.
First, one recognizes an improvement of the results of
analysis for a rising number of factors. For seven or more
factors the results deteriorate again, since the model becomes
overfitted. Therefore, six factors should be best suited to
obtain optimum analysis results for the example
shown. A mean error of prediction of 0.07% is found. This is
a realistic value for the NIR spectroscopic investigation of
pure, liquid multi-component mixtures.
Figure 6.2 Mean Error of Prediction for methanol plotted against the
rank of a PLS regression for methanol/ethanol/propanol
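A curve of this kind can be computed with a few lines of code. The following sketch implements a basic PLS1 regression (NIPALS algorithm) with leave-one-out cross validation; it is not the routine of any particular software package. The spectra are synthetic (three invented Gaussian bands plus noise), not the measured alcohol mixtures, so the resulting numbers are not comparable with Figure 6.2.

```python
import numpy as np

def pls1_fit(X, y, rank):
    """PLS1 via NIPALS. Returns (x_mean, y_mean, b) so that
    y_pred = y_mean + (X - x_mean) @ b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(rank):
        w = E.T @ f
        w /= np.linalg.norm(w)       # weight vector
        t = E @ w                    # scores
        tt = t @ t
        p = E.T @ t / tt             # spectral loadings
        c = f @ t / tt               # concentration loading
        E = E - np.outer(t, p)       # deflate spectra
        f = f - c * t                # deflate concentrations
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, b

def rmsecv(X, y, rank):
    """Leave-one-out cross validation error for a given rank."""
    n = len(y)
    residuals = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        x_mean, y_mean, b = pls1_fit(X[keep], y[keep], rank)
        residuals[i] = y[i] - (y_mean + (X[i] - x_mean) @ b)
    return float(np.sqrt(np.mean(residuals ** 2)))

# Synthetic three-component mixtures (assumption, for illustration only)
rng = np.random.default_rng(0)
axis = np.arange(200.0)

def band(center, width, height):
    return height * np.exp(-0.5 * ((axis - center) / width) ** 2)

pure = np.vstack([band(60, 15, 1.0), band(80, 20, 0.6), band(100, 25, 1.5)])
conc = rng.dirichlet([1.0, 1.0, 1.0], size=30) * 100.0   # percent
X = (conc / 100.0) @ pure + rng.normal(scale=0.002, size=(30, 200))
y = conc[:, 0]

errors = [rmsecv(X, y, rank) for rank in range(1, 7)]
```

Plotting `errors` against the rank gives the kind of curve described in the text; the rank at which the curve levels off or turns upward depends on the data set.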
A comparison of the analysis values and the corresponding
reference data is shown for the 6-factor-model in Figure 6.3.
One recognizes that a good match between both values is
found generally. The analysis was always performed using
independent test spectra, i.e. the query spectrum was not
contained in the calibration data set. Hence, one can assume
similarly good results to be obtained with the future analysis
of new alcohol mixtures.
Figure 6.3 Comparison of the reference and analysis values for a PLS
regression determining the methanol concentration from a
mixture of methanol/ethanol/propanol (RMSECVmin = …;
frequency range: … cm-1 - … cm-1; rank: …;
data preprocessing: "Subtraction of a straight line")
So, the optimum rank for a certain defined calibration model
can be found easily. Therefore, the only question to be
clarified is which is the most suitable method for a given
task, i.e. which combination of data preprocessing and
frequency ranges. Since this question cannot be answered
generally, the optimum frequency ranges and data preprocessing
methods must be determined empirically by "trial and error".
For this purpose these settings are changed systematically and
the model is calculated in each case for a rising number of
factors. The setting which shows the largest value of the
coefficient of determination R2, and/or a minimum error of
prediction, characterizes the best analysis model. Thus all
meaningful variants of frequency ranges and data preprocessing
methods are tested successively until the optimum model is
found.
For the selection of suitable frequency ranges, in most cases
it is sufficient to group together appropriately large
frequency ranges (see "Selecting Spectral Ranges" in Chapter
5). Finding individual spectral data points is typically not
necessary.
A deeper understanding of the basic mathematics is not
necessary for the selection of the different data preprocessing
methods and frequency ranges. Good values for R2 are larger
than 90% for solids and larger than 99% for liquid
measurements. Noticeably worse values point to a calibration
model of insufficient quality and should usually not be taken
into account.
In order to ensure an uncomplicated comparison of the
individual models, it is advisable to sum up the most important
parameters in a table. This is shown below for the
NIR-spectroscopic analysis of the Methanol/Ethanol/Propanol
mixture discussed previously. (For reasons of a better
overview, the results for only five validations are shown here;
in the case of a real method optimization, all conceivable
meaningful combinations of data preprocessing and
frequency ranges would have to be tried one after the other. This
often results in tables with 30 and more documented
methods.)
Table 6.1 Method optimization for the NIR spectroscopic analysis of the methanol
concentration in methanol/ethanol/propanol mixtures
No.  Data            Frequency       Optimum  Coeff. of       Mean error of  Remarks
     preprocessing   ranges [cm-1]   rank     determination   prediction
                                              R2 [%]
1    none            9,000-5,200     9        99.8            0.16%          total spec.
2    SSL             9,000-5,200     6        >99.9           0.07%          total spec.
3    Vec. Norm       9,000-5,200     8        99.6            0.42%          total spec.
4    SSL             7,000-5,200     8        >99.9           0.07%          1st overtones
5    SSL             6,000-5,200     7        >99.9           0.07%          no OH
...  ...             ...             ...      ...             ...            ...
In the first three lines, calibrations over the entire spectral
region of 9,000 cm-1 - 5,200 cm-1 for different data
preprocessing methods are shown. One recognizes that the
subtraction of a straight line (SSL) is the most suitable data
preprocessing option (Method No. 2 in Table 6.1) in order to
obtain a good result of analysis. The coefficient of
determination R2 is larger, and/or the mean error of analysis is
smaller than in the two other models (method No. 1 and 3).
Looking at further frequency ranges (method No. 4 and 5),
the model cannot be improved any further.
Similarly, neglecting the second overtone of the CH2- and
CH3-vibrations between 8,800 cm-1 and 7,800 cm-1 does not
lead to any measurable losses in the analysis quality, neither
does neglecting the strong OH absorption at approx. 6,900
cm-1. Several models result in a mean error of analysis of
0.07%. Therefore, at first sight, all three models are
equally well suited to determine the methanol
concentration. Nevertheless, it is advisable to use the model
with the lowest number of factors. Methods which work with
lower ranks usually have a higher stability. Therefore, in the
example shown here, it would be useful to perform a
calibration in the spectral region between 9,000 cm-1 and
5,200 cm-1 with "SSL" as data preprocessing option (method
No. 2).
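The selection rule used here (smallest error of prediction first, then the lowest rank among the equally good models) can be applied to the rows of Table 6.1 in a few lines. The function name `select_model` and the optional `tolerance` parameter are hypothetical; the numbers are those of the table.

```python
# (method no., data preprocessing, frequency range [cm-1], rank, mean error [%])
models = [
    (1, "none", "9,000-5,200", 9, 0.16),
    (2, "SSL", "9,000-5,200", 6, 0.07),
    (3, "Vec. Norm", "9,000-5,200", 8, 0.42),
    (4, "SSL", "7,000-5,200", 8, 0.07),
    (5, "SSL", "6,000-5,200", 7, 0.07),
]

def select_model(models, tolerance=0.0):
    """Among the models whose error is within `tolerance` of the best
    error, return the one with the lowest rank."""
    best_error = min(m[4] for m in models)
    candidates = [m for m in models if m[4] <= best_error + tolerance]
    return min(candidates, key=lambda m: m[3])

best = select_model(models)
```

For the five documented methods this reproduces the choice made in the text: methods 2, 4 and 5 share the lowest error, and method 2 needs the fewest factors.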
With these settings, a method can be compiled which is suit-
able for the analysis of new unknown samples. The most
important results should always be written down. Fig. 6.4
shows an example for a validation report for the application
shown here.
Figure 6.4 Validation report
It may surprise some analysts that, in the example shown
here, several very different analytical models lead to almost
identical results. Therefore, this point will be clarified
further: The equivalence of several chemometric models can be
explained on the basis of the factorization of the spectra. To a
certain extent, the individual factors represent "information
units" which represent certain properties (and/or the
combination of certain properties) of the samples. The
concentration of the analyte, for example, is such a system
property. In the case of a successful factorization, the PLS
algorithm recognizes the factors that are relevant for analysis
and correlates these with the appropriate system properties
(e.g. the analyte concentration). Generally, this succeeds for
a larger number of spectral regions, since most substances
possess analytically evaluable signals in more than one
frequency window of the spectrum. Since each of these
ranges consists of a multitude of data points (i.e. has an
accordingly high analytic information content), the
system is often statistically well determined for all of these
ranges. Thus, in most cases there is a selection of calibration
models of comparable quality, which lead to similarly good
results of analysis.
A further important point results from the factorizing of the
spectra. In the case of a univariate calibration, the analysis of
multicomponent mixtures requires a sufficient separation of
the individual analyte signals. Each component is assigned
to a certain wavelength or a certain area. This is not
necessary in a multivariate calibration. Here, an evaluation
for several components can often be accomplished from
identical spectral structures and data preprocessing methods.
Since with factorization the spectra are separated into
independent information units, it is not necessary to separate
spectral structures manually. Particularly if the single signals
overlap strongly, this is an advantage over the classical
univariate evaluation.
B. Analysis and Determination of Outliers
Figure 6.5 Analysis report
C. PLS regression - a Method Providing
Infinite Accuracy?
So, using a number of factors which is large enough would
be sufficient to obtain a "perfect" match between a test
sample and the corresponding spectrum in the calibration set,
i.e. the very analysis value is returned which was initially fed
to the model during calibration. In this case, spectral noise
cannot reduce the quality of the results because the noise
amplitudes of the test and calibration spectra are identical.
This is referred to as a validation using "dependent samples",
because the samples are already known to the model.
Obviously, the analysis of dependent samples to validate the
method yields results which are completely useless10. This
will be shown in the following example using the 30 spectra
obtained from the mixtures described earlier, containing
methanol, ethanol and propanol. This time, instead of using
the accurate concentrations of the mixtures, random num-
bers between 0 and 100% are chosen, numbers which are
completely unrelated to the actual component values.
Figure 6.6 Validation of a PLS regression for the determination of
Methanol from a mixture of Methanol/Ethanol/Propanol
with independent samples for a 13 factor model. As
reference data, arbitrary nonsense concentration values
between 0 and 100% were chosen. The validation shows
that an analysis is not possible in this case.
An analytically correct validation of this model shows that it
is obviously not suitable for predicting the stated nonsense
concentration values. This is shown in Figure 6.6. Reference
value and analysis value show no observable connection.
For example, a sample with a nominal methanol content of 5%
yields an analysis value of 102%. Another sample with a content
of 96% is predicted as 29%. As expected, in this example, an
analysis (or in better words, a reconstruction of the
meaningless input values) is not possible.
Figure 6.7 Validation of a PLS regression for the determination
of Methanol from a mixture of Methanol/Ethanol/Propanol
with dependent samples for a 7 factor model. As reference
data, arbitrary nonsense concentration values between
0 and 100% were chosen (RMSEE = …).
Figure 6.8 The same validation as Fig. 6.7 for a 13 factor model
(RMSEE = …).
Figure 6.9 The same validation as Fig. 6.7 for a 16 factor model
(RMSEE = …).
dependent samples. Even for a 7-factor model, a rough
correlation between the "real" (nonsense) values and the
predicted "analysis" values can be found (see Figure 6.7).
The accuracy can even be "improved" by using 13 or 16
factors (Figure 6.8 and Figure 6.9). The corresponding mean
errors of prediction are obtained as 17.5% for the 7-factor
model, 0.42% for the 13-factor model and 0.04% for the
16-factor model. Therefore, a calibration using 16 factors
seems to yield even better results than the model which is
analytically correct (see Figure 6.3).
This shows impressively that even with a relatively small
number of factors the (completely nonsensical) values can be
well reproduced. Thus, it is possible to obtain arbitrarily
"good" analysis results by selecting inadmissible test
spectra. However, the method cannot withstand a validation with
real samples.
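The effect can be reproduced with random numbers. The sketch below swaps in principal component regression (PCR) for PLS because it is shorter to write down; the "spectra" are random stand-ins and the reference values are nonsense numbers between 0 and 100, so none of the figures of this chapter are reproduced. It only illustrates the principle that a validation with dependent samples improves steadily with the rank, while a cross validation does not.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 200))       # stand-in "spectra" (assumption)
y = rng.uniform(0, 100, size=30)     # nonsense reference values

def pcr_predict(X_cal, y_cal, X_new, rank):
    """Principal component regression with `rank` factors."""
    x_mean, y_mean = X_cal.mean(axis=0), y_cal.mean()
    U, s, Vt = np.linalg.svd(X_cal - x_mean, full_matrices=False)
    V = Vt[:rank].T
    scores = (X_cal - x_mean) @ V
    # the score vectors are orthogonal, so each coefficient is independent
    coef = (scores.T @ (y_cal - y_mean)) / (s[:rank] ** 2)
    return y_mean + (X_new - x_mean) @ V @ coef

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Dependent samples: the calibration spectra themselves are "analyzed"
rmsee = [rmse(y, pcr_predict(X, y, X, r)) for r in range(1, 30)]

# Independent samples: leave-one-out cross validation
rmsecv = []
for r in range(1, 29):
    pred = np.empty(30)
    for i in range(30):
        keep = np.arange(30) != i
        pred[i] = pcr_predict(X[keep], y[keep], X[i:i + 1], r)[0]
    rmsecv.append(rmse(y, pred))
```

The list `rmsee` never increases with the rank and ends at practically zero, whereas `rmsecv` stays at the level of the scatter of the nonsense values.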
The credibility of a validation can be assessed easily in
practice. On the one hand, this is possible by checking
characteristic parameters, such as the mean error of analysis.
As already mentioned, the error of prediction must go
through an optimum value for an increasing number of fac-
tors. If, however, a steady improvement is achieved for a
growing number of factors, dependent test spectra have been
used for the validation. This is shown in Figure 6.10 for the
example described above. With a growing number of factors,
the error of analysis decreases steadily and drops close to 0%
at 16 factors. In contrast, the correctly validated model shows
its minimum at 6 factors with a mean error of analysis of
0.07% which cannot be decreased any more (Figure 6.2).
On the other hand, the legitimacy of a model can be checked
by a simple sample measurement. For this purpose, a small
number of samples is measured and analyzed. The resulting
error of analysis must be in the same range as the error of
prediction (RMSECV or RMSEP), which was determined
beforehand. If the mean analysis error of the validation data
set is well below the error of the measured samples, the
model might have been validated using inadequate samples.
Figure 6.10 Mean error of estimation plotted against the rank for a
mixture of Methanol/Ethanol/Propanol. As reference
values, arbitrary nonsense concentration values between
0 and 100% were chosen and validated using dependent
test spectra. This inadmissible validation can be recognized
by an error of analysis which decreases steadily with higher
ranks.
CHAPTER 7
DEFINITION OF IMPORTANT ABBREVIATIONS
(7-1)
deviations can be found. In case of a site change, this bias is
due to the systematic deviations in the reference values of
the different laboratories. When changing from one
instrument to another, the slightest constructional changes in
the spectrometers can lead to different spectra. To adapt
the original analysis values to the expected new values,
the difference (i.e. the bias) is subtracted from the
predicted value. This way, existing calibrations can be
used at different sites or on different instruments.
The bias correction should not be confused with the
Offset and Slope correction. Here the predicted analysis
values Yipred are plotted against the true values Yimeas and
the regression line is calculated. The correction now
adjusts this regression line towards the bisector, so that on
an average the predicted, new values and the original
values match (for more details see Offset and Slope
correction).
Typically the slope of the regression line is “1” or close to
“1”. Thus the ordinate (i.e. the offset) is equal to the mean
systematic deviation of the predicted from the original
values (i.e. the bias). Therefore, offset and bias are
normally almost identical. In the literature, it is often
referred to as “Bias and Slope correction”, although
actually the offset of the regression line is taken for the
correction. If there is however no linear correlation
between the original and the predicted value, the slope of
the corresponding regression line will be different from
“1”. Bias and offset will be different and a bias correction
will lead to different values compared to an offset
correction.
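The difference between the two corrections can be made concrete with a short sketch. The function names are hypothetical, the bias is taken as the mean of predicted minus measured values, and the regression line is fitted by ordinary least squares; these are assumptions consistent with the description above, not the formulas of a particular software.

```python
import numpy as np

def bias_correction(y_pred, y_meas):
    """Subtract the mean systematic deviation from the predicted values."""
    y_pred, y_meas = np.asarray(y_pred, float), np.asarray(y_meas, float)
    bias = float(np.mean(y_pred - y_meas))
    return bias, y_pred - bias

def offset_slope_correction(y_pred, y_meas):
    """Fit the regression line y_pred = offset + slope * y_meas and
    map it onto the bisector."""
    y_pred, y_meas = np.asarray(y_pred, float), np.asarray(y_meas, float)
    slope, offset = np.polyfit(y_meas, y_pred, deg=1)
    return float(offset), float(slope), (y_pred - offset) / slope

y_meas = np.arange(10.0, 100.0, 10.0)       # hypothetical reference values

# Case 1: slope 1, constant shift -> offset and bias coincide
bias1, _ = bias_correction(y_meas + 2.0, y_meas)
offset1, slope1, _ = offset_slope_correction(y_meas + 2.0, y_meas)

# Case 2: slope differs from 1 -> offset and bias differ
bias2, _ = bias_correction(1.1 * y_meas + 1.0, y_meas)
offset2, slope2, corrected2 = offset_slope_correction(1.1 * y_meas + 1.0, y_meas)
```

In the first case both corrections lead to the same values; in the second case only the offset and slope correction brings the predictions back onto the reference values.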
Calibration Function: During the calibration, a number
of samples of known concentration is measured. The
calibration function b correlates a property Y (e.g. the
concentration of the analyte) of a system with an
experimentally observable X (e.g. the spectrum):
$$ b = (X^T \cdot X)^{-1} \cdot X^T \cdot Y $$
X and Y are written in matrix form. In expert literature,
the function b is often called b-coefficient or regression
coefficient. With this function, the analyte concentration
can be calculated directly from the spectrum (the
procedure is described in Chapters 2 and 3):
$$ Y = X \cdot b $$
Experienced method developers often inspect the
b-coefficient of a calibration to find suitable spectral
regions for method development. Those ranges of a
spectrum which contain important information on the
analyte often contribute the largest portions of the
regression coefficient.
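The two formulas can be checked numerically with a small example. The data are invented (20 "spectra" with only 5 data points each, noise-free concentrations), so that the matrix to be inverted is well conditioned; with real spectra, which have far more data points than samples, this direct inversion is usually not possible, which is one reason for the factorization used in PLS.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))              # 20 spectra, 5 data points each
b_true = np.array([0.5, -1.0, 2.0, 0.0, 0.3])
Y = X @ b_true                            # noise-free "concentrations"

# b = (X^T X)^-1 X^T Y, solved without forming the inverse explicitly
b = np.linalg.solve(X.T @ X, X.T @ Y)

# Y = X b : analysis of the spectra with the calibration function
Y_pred = X @ b

# Data points with the largest |b| contribute most to the analysis value
most_important = int(np.argmax(np.abs(b)))
```

Here `most_important` points to the third data point, whose coefficient has the largest magnitude in the invented `b_true`.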
Coefficient of Determination: see R2.
Correlogram: The correlogram shows the amount of
consistence between spectral and concentration data for a
given rank. Values close to +1 or -1 mark those spectral
regions where a good correlation occurs. Values close to
zero show an insufficient correlation. These ranges
should not be used for the calibration:
(7-4)
F-Value and F-Prob: F-values are used for recognizing
outliers in the calibration data set. They can generally be
derived from the spectral and concentration values of the
measured sample. There are two kinds of F-values: those
which are calculated directly from the spectral residuals,
and those which result from the difference of the true and
the predicted values (predicted by the chemometrical
model). The larger the F-value of the analyzed sample,
the more likely it is an outlier.
F-value calculation for the determination of spectral
outliers:
(7-6)
(7-7)
(7-8)
Leverage: The leverage is a measure for the influence of a
sample on the PLS model. Mathematically, it is the
Mahalanobis distance of the single calibration samples
(see Mahalanobis distance). The leverage value of
outliers is unnaturally high, compared to other samples.
Mahalanobis Distance: During factorization, the
measured spectra are decomposed into different factors
and a spectral residuum (see equation 2-3). This applies to
the calibration samples as well as to the analysis samples.
If an analysis of unknown samples is to be performed
with the PLS algorithm, it should be checked whether the
spectra can be analyzed reliably with this method. This is
possible with the calculation of the Mahalanobis distance.
Here, it is checked how well the spectrum of the analyte
"fits" the spectra of the calibration data set.
The Mahalanobis distance is defined as the difference of
the measured spectrum of the analyte to the mean value of
all spectra in the calibration data set, which was used
when reconstructing the spectral data matrix for the given
number of samples. The larger this difference, the larger
the value of the Mahalanobis distance. There are various
possible reasons. External disturbances, such as the
contamination of the samples or disturbing temperature
drifts, lead to a distortion of the peak symmetry, which
results in a growing of the Mahalanobis distance. In the
same way, this value grows if samples are analyzed which
lie outside the concentration range. The Mahalanobis
distance is a quantitative measure for the reliability of an
analysis, because outliers as well as samples with
reference values outside the valid concentration range are
detected.
During the PLS regression, the Mahalanobis distances of
all calibration spectra are calculated. From these results,
the maximum tolerable value of this distance is
determined, for which the spectrum can be analyzed
safely with the given calibration function. Samples which
lie above this threshold cannot be determined reliably.
These samples may be potential outliers.
If the spectral data are factorized according to equation
(2-3), the Mahalanobis distance hi is determined as:
(7-9)
where the calculation is performed for R factors. If the
single scores vectors ti were not calculated from an
unknown sample but from a calibration spectrum, it is
also called “Leverage”. The leverage values of the
calibration samples are a measure for their influence on
the PLS model.
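As the body of equation (7-9) is not reproduced here, the following Python sketch assumes one common chemometric form of the score-based Mahalanobis distance, h = t (T'T)^-1 t', where T holds the calibration scores for R factors. The function name, the random example scores and the threshold factor of 2 are hypothetical and only serve as an illustration.

```python
import numpy as np

def mahalanobis_from_scores(T_cal, t_new):
    """Score-based distance h = t (T'T)^-1 t' (assumed form, see lead-in)."""
    T_cal = np.asarray(T_cal, dtype=float)                  # calibration scores, shape (M, R)
    t_new = np.atleast_2d(np.asarray(t_new, dtype=float))   # scores to check, shape (n, R)
    inv_cov = np.linalg.inv(T_cal.T @ T_cal)
    # Row-wise quadratic form t_i (T'T)^-1 t_i'
    return np.einsum("ij,jk,ik->i", t_new, inv_cov, t_new)

rng = np.random.default_rng(0)
T_cal = rng.normal(size=(30, 3))          # made-up scores: 30 calibration samples, R = 3

# Applied to the calibration scores themselves, the distance is the leverage.
leverage = mahalanobis_from_scores(T_cal, T_cal)

# Hypothetical tolerance rule: twice the largest calibration leverage.
threshold = 2.0 * leverage.max()

# A score vector far away from the calibration cloud exceeds the threshold.
h_outlier = mahalanobis_from_scores(T_cal, [20.0, -20.0, 20.0])
is_outlier = h_outlier[0] > threshold
```

In this form the leverages of the calibration samples add up to the number of factors R, which is a convenient plausibility check.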
MDL (Minimum Description Length): An empirical
number to determine the optimum number of factors:
(7-10)
$${}^{M}Y = {}^{M}X \cdot {}^{M}b \tag{7-11}$$
For the measurements with the slave, no extra calibration
is performed. To calculate the concentration data matrix
for the slave measurements, they are linked to the
coefficient ${}^{M}b$, which was generated from the master
measurement:
$${}^{S}Y = {}^{S}X \cdot {}^{M}b \tag{7-12}$$
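Equation (7-12) amounts to a single matrix product. The short sketch below only illustrates this step; the array names and all numbers are made up, and the regression coefficient is assumed to have been obtained beforehand from the master calibration.

```python
import numpy as np

# Hypothetical regression coefficient from the master calibration (2 data points).
b_master = np.array([0.5, 0.25])

# Hypothetical spectra measured with the slave instrument (2 samples x 2 data points).
X_slave = np.array([[1.0, 2.0],
                    [3.0, 4.0]])

# Concentration values for the slave measurements, without a separate calibration.
Y_slave = X_slave @ b_master
```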
(7-15)
mark those spectral ranges, where not only a good
correlation of the data sets can be found (like the
correlogram), but also a sufficient band intensity can be
observed. To set up a calibration method, spectral ranges
with high correlogram and high PWS values should be
selected:
(7-16)
(7-17)
(7-18)
The larger the value, the smaller the part of the spectral
structures which can be explained by the factorization.
In the same way, the residual matrix of the concentration
data G describes that part of the components of the
calibration data set which cannot be explained by
factorization:
(7-19)
(7-20)
(7-22)
samples. Caution: The RMSEE is not suitable for a
method validation, as no independent test spectra are
analyzed, see Chapter 6-C:
(7-23)
(7-24)
(7-25)
(7-26)
with
(7-27)
and
(7-28)
To evaluate the quality of a validation, calculating the
RPD is more meaningful than only looking at the error of
prediction. For example, validating a calibration with a
relatively small range will most likely lead to a small
error of prediction. The model “knows” only little
variation in the reference data set and will therefore
predict the analysis values with a small variance (i.e. an
apparently small error), even if the model itself is not
good. To avoid an over-optimistic interpretation of the
result, the RPD value is calculated: This way it can be
seen that the standard deviation of the reference values is
relatively small (low SD values) and that even a minor
deterioration of the predictive error (growing values for
SEPbias) will lead to a clearly worse RPD value. So even
small prediction errors can belong to a bad calibration
model - clearly indicated by the RPD value.
Looking at the RPD has another advantage: It was
already said that calibration samples should span the
whole calibration range homogeneously.
Methods which have a large number of samples in the
center, but only a few at the ends of the range (as in Figure
5.1), are less robust. This is accounted for in the
RPD value. A more homogeneous distribution of the
reference values leads to a higher standard deviation SD
than clustering at one position. The resulting RPD is
accordingly larger and the calibration model is rated
better.
For historical reasons, the RPD value is often used in the
agro market, for example in the NIR spectroscopic
analysis of wheat. The following rule of thumb is valid
for assessing the quality of a calibration:
RPD between 2.5 and 3: method OK for rough screening
RPD > 3: method OK for screening
RPD > 5: method OK for quality control
RPD > 8: method excellent for all analytical tasks.
These values are very dependent on the kind of sample
and the calibration range. The rule of thumb above should
be valid for most applications in the food industry (for
natural, heterogeneous and solid samples). For chemical
analysis (synthetic, homogeneous liquid samples), the
calibrations should show higher RPD values.
The RPD value is, by the way, of the same significance as
the so-called “explained variance” R2. The R2 also allows
a qualitative evaluation of the error of prediction during a
validation. For a non-bias-corrected calibration (bias = 0)
the following correlation applies:
(7-29)
(7-30)
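The following Python sketch illustrates how RPD and R2 are connected. As the equation bodies are not reproduced here, it assumes the common definitions RPD = SD / SEP (standard deviation of the reference values divided by the bias-corrected standard error of prediction) and, for bias = 0, R2 = 1 - 1/RPD^2. The function names and the example values are hypothetical; the rating simply encodes the rule of thumb given above.

```python
import numpy as np

def rpd_and_r2(y_ref, y_pred):
    """Return (RPD, R2, bias) under the assumed definitions from the lead-in."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_pred - y_ref
    bias = diff.mean()
    n = y_ref.size
    sd = np.sqrt(np.sum((y_ref - y_ref.mean()) ** 2) / (n - 1))
    sep = np.sqrt(np.sum((diff - bias) ** 2) / (n - 1))   # bias-corrected error
    rpd = sd / sep
    r2 = 1.0 - 1.0 / rpd ** 2                             # assumed relation for bias = 0
    return rpd, r2, bias

def rate_rpd(rpd):
    """Rule of thumb from the text; the last branch is an added catch-all."""
    if rpd > 8:
        return "method excellent for all analytical tasks"
    if rpd > 5:
        return "method OK for quality control"
    if rpd > 3:
        return "method OK for screening"
    if rpd >= 2.5:
        return "method OK for rough screening"
    return "below the rule-of-thumb range"

# Made-up reference and predicted values with zero bias.
y_ref = [10.0, 12.0, 14.0, 16.0, 18.0]
y_pred = [10.5, 11.5, 14.5, 15.5, 18.0]
rpd, r2, bias = rpd_and_r2(y_ref, y_pred)
rating = rate_rpd(rpd)
```

With a narrow calibration range, SD shrinks and the same SEP produces a markedly lower RPD, which is exactly the effect described in the text.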
CHAPTER 8
Another advantage of multivariate calibration results from
the use of large spectral ranges when setting up a calibration
model: In the region of (near-) infrared spectroscopy or
Raman technology, there are only few substances that have a
spectrum which can be observed in a small frequency win-
dow. Normally, the analyte bands are found in a wide fre-
quency range, which can be accounted for in the factorization.
Therefore, it is highly unlikely that the spectrum contains a
hidden signal which alone leads to a successful analysis, and
which the analyst needs to find. The opposite is more likely:
It is often entirely sufficient to take larger signal groups into
account when developing a method. Furthermore, the result
of a PLS-regression usually improves if a higher number of
analytically relevant data points is used.
Therefore, the approach of method development used in
multivariate calibration is entirely different from the
approach used in univariate calibration. In the latter method,
one normally tries to find a signal that can be evaluated
quantitatively without any overlapping from other spectral
structures. In multivariate calibration, larger spectral areas
are combined intentionally. It does not matter whether the
signals of various other system properties overlap in these
areas. The combination which is best suited for the analytical
method can be found reliably by testing all sensible
combinations and performing a consecutive validation.
Usually, no expert knowledge is required regarding the
spectra of the pure substances which are to be analyzed.
The obvious conclusion would be to automate the entire
process of method development and optimization. As the
positions of most functional groups in (N)IR and Raman
technology are well known, it should be possible to automate
the “testing” of all possible parameters with an appropriate
software routine.
Figure 8.1 shows the result of such an automated optimiza-
tion for the application sample in Chapter 6.
Figure 8.1 The result of an automatic method optimization with the OPUS/QUANT software
(Bruker Optik GmbH, Ettlingen, Vers. …). The figure shows a list of the
calibration models with the smallest analysis error RMSECV.
APPENDIX
2 Why can’t I calibrate with only a few samples?
PLS is a statistical procedure. With only a few samples, no
reliable statistics can be achieved. A simple thought
experiment can illustrate this: If one assumes that an NIR
spectrum consists of about 1,000 data points, there will be
several spectral structures which fit the given concentration
values accidentally if only a few measurements are available.
Only when a substantial number of samples has been collected
can real (i.e. physically justified) correlations be found between
the spectral data and the concentration values. With sample
sets larger than 20 spectra, the statistical probability of
accidental correlations tends to zero. Therefore, calibrations
or feasibility studies with only 5 or 6 samples are in my
opinion not legitimate.
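The thought experiment can be tried out numerically. The sketch below is only an illustration under simple assumptions: the "spectra" are pure random noise with 1,000 data points, the "concentration values" are unrelated random numbers, and the function name is hypothetical. It reports the largest accidental correlation found between any single data point and the concentration values.

```python
import numpy as np

def max_chance_correlation(n_samples, n_points=1000, seed=0):
    """Largest |correlation| between pure-noise data points and unrelated values."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_points))   # noise "spectra"
    y = rng.normal(size=n_samples)               # unrelated "concentrations"
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of every data point (column) with y
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return float(np.abs(r).max())

few_samples = max_chance_correlation(5)     # only 5 "calibration samples"
many_samples = max_chance_correlation(30)   # 30 "calibration samples"
```

With only 5 samples, some data point correlates almost perfectly with the random values purely by chance; with 30 samples the largest accidental correlation is clearly lower.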
5 Can I work with stock standards or dilution series?
Unfortunately not. Stock standards only vary in the concen-
tration of the analyte and do not take the variance of the other
components or the matrix into account. Stock standards can
only be used if there are enough samples which represent all
system variances. They are then only used to fill the
calibration range homogeneously with measurement values.
Working solely with stock standards is only possible if the
samples contain all possible variation of all components,
which is hardly ever the case.
Dilution series are completely unsuitable. By diluting, all
concentration values change collinearly (i.e. in the same
way). The PLS algorithm can therefore not distinguish
between changes of the component and changes of the
matrix. The resulting calibrations are completely useless.
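The collinearity of a dilution series can be made visible with a few lines of Python. The stock composition and the dilution factors below are made-up numbers; the sketch only shows that all component values then carry the same information.

```python
import numpy as np

# Hypothetical stock solution with three components (analyte plus two matrix components).
stock = np.array([12.0, 3.0, 0.5])
dilution = np.array([1.0, 0.8, 0.6, 0.4, 0.2])

# Each row is one diluted sample, each column one component.
C = np.outer(dilution, stock)

# Pairwise correlation between the component columns and rank of the data matrix.
corr = np.corrcoef(C, rowvar=False)
rank = np.linalg.matrix_rank(C)
```

All correlation coefficients are 1 and the concentration matrix has rank 1, so no algorithm can separate the analyte from the matrix components on the basis of such data.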
7 Are there frequency ranges or data preprocessing
methods which are generally favorable?
Different users will surely give different answers. There are
even well-known manufacturers of NIR spectrometers who
use the full spectrum range and the first derivative as a
standard procedure. This way, all spectral information is used
and, with the first derivative, disturbing baseline shifts are
reduced. However, this way, disturbing water bands cannot
be eliminated and a derivative often enhances spectral noise
or instrument drifts along the wavelength scale. I personally
tend to think of a PLS calibration as a 'black box', i.e. I
choose a set of models with pre-optimized parameters. Over
the following weeks, the models are tested with real samples.
The final decision for a model is made after I have developed
a feeling for the influence of the natural disturbances on the
precision of the calibration and I can choose the most
robust model. I have stopped relying on my analytical
expertise when it comes to estimating the robustness of PLS
models from the outset (see also question …).
inside and another where ambient humidity can diffuse
inside. It can be seen that the second instrument shows strong
absorbance bands of water vapor (7,500-6,700 cm-1 and
5,800 - 5,000 cm-1) a couple of days after taking reference.
That in itself is not a problem. Nearly every NIR instrument in
the world has got humidity inside and there is no need to
work with sealed and dry spectrometers.
But due to the exchange of gaseous water vapor with the
surrounding atmosphere, small sharp-banded peaks are
added to the spectra. Obviously, the effect is very low at the
beginning, if samples are measured immediately after taking
reference. But even though the effect seems to be minor (and
cannot be clearly seen in the spectra shortly after taking
reference), the influence on the calibration is relatively high.
The PLS algorithm is mainly looking for changes in spectra,
i.e. sharp-banded structures like vapor spectra influence the
analysis relatively strongly. And the better the spectral
resolution of the collected spectra, the worse this effect
will be.
Therefore, it is highly recommended to remove from the
calibration all frequencies in the regions
where water vapor absorbs. This is true even if the analytes
themselves show nice bands in the corresponding regions.
This mainly occurs, e.g., in agricultural applications where
water or protein need to be determined. Especially for the
calibration of the humidity itself (i.e. the content of liquid
water inside of the samples), it sounds a priori not logical to
neglect the water vapor bands.
Nevertheless, for protein there are enough frequencies left to
guarantee that all relevant spectral information is inside of
the calibration. Protein frequencies in the water vapor region
offer only redundant information. The same is true for the
calibration of water inside of the samples. Here also the
edges of the big water bands contain enough information to
guarantee a disturbance-free analysis. It has been shown in
numerous studies that calibrations which avoid the
incorporation of water vapor bands mostly outperformed the other
ones in terms of stability, even if smaller frequency ranges
had been chosen. So, the frequency range of the water vapor
bands seems to be a “sleeping risk” for any calibration and it
is pretty useful to neglect the corresponding areas inside the
calibration curves.
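Excluding the water vapor regions is a simple masking step on the wavenumber axis. The sketch below uses the two regions named above (7,500-6,700 cm-1 and 5,800-5,000 cm-1); the wavenumber grid, the random spectra and the function name are hypothetical.

```python
import numpy as np

# Water vapor regions in cm-1 as (low, high), taken from the text above.
WATER_VAPOR_REGIONS = [(6700.0, 7500.0), (5000.0, 5800.0)]

def keep_mask(wavenumbers, excluded=WATER_VAPOR_REGIONS):
    """Boolean mask that is False inside every excluded region."""
    wn = np.asarray(wavenumbers, dtype=float)
    mask = np.ones(wn.shape, dtype=bool)
    for low, high in excluded:
        mask &= ~((wn >= low) & (wn <= high))
    return mask

# Made-up example: 3 spectra on a coarse grid from 4,000 to 10,000 cm-1.
wavenumbers = np.arange(4000.0, 10001.0, 100.0)
spectra = np.random.default_rng(0).normal(size=(3, wavenumbers.size))

mask = keep_mask(wavenumbers)
spectra_reduced = spectra[:, mask]        # data points passed on to the calibration
wavenumbers_reduced = wavenumbers[mask]
```

The same mask has to be applied to the calibration spectra and to every spectrum analyzed later, so that both always contain identical data points.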
Figure A.1 Example of the prediction error of several chemometric models during method
development and by using a separate sample set
The results are also plotted in Figure A.1. Now different
models show enormous differences in their performance.
Model No. 1 gives RMSEP values comparable to the
validation results. Model No. 10 is approximately one order
of magnitude worse. Even though a representative number of
samples and batches has been used to set up the calibration,
this model is so sensitive to minor differences which
occurred in the separate test set that it gives very poor results.
In contrast, model No. 1 is robust, although it uses the same
calibration spectra and reference values as model No. 10,
just due to a better setting.
This example shows how important it is to check the model
stability continuously. Just looking at the R2 and RMSEP
values during the method development is not enough,
especially if the model which gives the lowest
prediction errors during the validation routine is chosen
blindly. It can be seen
in Figure A.1 that model No. 9 gives the lowest RMSEP for
the validation. But this model is enormously unstable and
gives unacceptably bad RMSEP results for the separate
sample set. Model No. 9 should in no case be chosen.
Often it turns out that the optimal parameters for a robust
model cannot be recognized a priori, since during the method
optimization the influences of all future disturbances are not
known. Also, analytically plausible measures (like calculating
the derivative for the compensation of baseline drifts) do not
always lead to the development of the most stable model.
This can in my opinion only be achieved by the additional
validation of the calibration from time to time. However, only
sufficiently large sample sets which reflect all characteristics
of the system representatively guarantee an exact estimation
of analysis errors during the optimization of the method.
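The selection strategy described here can be sketched in a few lines of Python. The sketch assumes the usual root-mean-square form of the RMSEP and a simple "smallest worst case" rule. The function names, the model names and all error values are made up for illustration; they are not the values from Figure A.1.

```python
import numpy as np

def rmsep(y_ref, y_pred):
    """Root mean square error of prediction (assumed standard form)."""
    d = np.asarray(y_pred, dtype=float) - np.asarray(y_ref, dtype=float)
    return float(np.sqrt(np.mean(d ** 2)))

def pick_robust_model(errors):
    """errors maps a model name to (RMSEP validation, RMSEP separate sample set).

    The model with the smallest worst-case error is returned, instead of the
    model with the smallest validation error alone.
    """
    return min(errors, key=lambda name: max(errors[name]))

# Made-up numbers: model B looks best in the validation but fails on new samples.
errors = {
    "model A": (0.12, 0.15),
    "model B": (0.08, 1.10),
    "model C": (0.20, 0.22),
}
best_by_validation = min(errors, key=lambda name: errors[name][0])
best_robust = pick_robust_model(errors)
example_rmsep = rmsep([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```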
12 Is it helpful to enter the values of the disturbing
parameter, or the sample temperature into the
software when calibrating the system, to make sure
that they are taken into account?
Most of the chemometric software packages use the PLS1
algorithm, as it generally leads to good results. With this
algorithm, only the component values of the analyte are
taken into account and are correlated to the corresponding
spectral features. Anything else will be detected as distur-
bances, i.e. the concentration data are treated as a vector and
the other component values or temperature values do not
lead to a change of the results. The additional values are not
taken into account during the analysis. If you want to analyze
a multiple-component system with PLS1, it is applied
successively to all calibrated components. Specifying the
sample temperature only makes sense if you want to calcu-
late the temperature of the unknown samples, i.e. use the
spectrometer as a thermometer.
from the other signals. Here, chromatographic methods (like
HPLC) have an advantage as they physically separate the
different components, thus making them more clearly
visible.
These disturbances can normally not be avoided. However,
by choosing a good sample preparation and by intelligent
measurement configurations, these influences can be
minimized. Some examples:
Heterogeneous samples: Milling or using an integrat-
ing sphere with sample rotator.
Temperature drifts: Measurement in thermostatted
cuvettes or vials.
Incidence of extraneous light: Shading of the
measurement window or using an FT instrument.
Filming or adhesion of the sample (e.g. with food
samples): Measurement in transmission.
Strong scattering (e.g. of a tablet): Measurement in
transmission with focussed beam.
Coated tablets: Measurement in transmission with
focussed beam.
Measurement through packaging material with probes.
But also the mathematical treatment of the raw spectra can
minimize perturbations. An MSC (Multiplicative Scatter
Correction) is, for example, useful for strongly scattering samples,
and a derivative helps against baseline drifts.
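Both preprocessing steps can be sketched with NumPy. The sketch assumes the usual formulation of the MSC, in which every spectrum is regressed against the mean spectrum and then corrected by the fitted offset and slope, and a simple numerical first derivative without smoothing. Function names and the synthetic spectra are hypothetical; commercial software typically uses smoothed derivatives instead.

```python
import numpy as np

def msc(spectra, reference=None):
    """Scatter correction: fit x = offset + slope * reference, then undo both."""
    X = np.asarray(spectra, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, x in enumerate(X):
        slope, offset = np.polyfit(ref, x, 1)   # degree-1 fit returns [slope, offset]
        corrected[i] = (x - offset) / slope
    return corrected

def first_derivative(spectra):
    """Plain numerical first derivative along the spectral axis (no smoothing)."""
    return np.gradient(np.asarray(spectra, dtype=float), axis=1)

# Synthetic example: one band shape with different scaling and offsets per sample.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
spectra = np.array([1.0 * base + 0.0,
                    1.3 * base + 0.2,
                    0.7 * base - 0.1])

corrected = msc(spectra)                  # all three spectra now coincide
deriv = first_derivative(spectra)
deriv_shifted = first_derivative(spectra + 5.0)   # a constant baseline shift drops out
```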
15 When do environmental influences on the instru-
ment itself need to be compensated?
In general, the instruments are stable, at least for a few hours,
so that environmental influences should not matter if a
background is taken from time to time. However, this does
not apply to many in-line applications. For these kinds of
measurements, it is often not possible to easily perform a
background measurement as the measurement probe is
integrated directly into the process. Long-term changes in
temperature and humidity can impair the measurement results
considerably (see Figure …). In this case, the spectrometer
should be thermostatted in an air-conditioned cabinet. The
influence of humidity can be eliminated by using
encapsulated instruments, purging the spectrometer with dry air or
excluding water bands within the calibration.
dilution series) and if it has no natural reasons. Many system
properties, however, have such a 'natural' relation and
consequently the reference values are generally collinear. As
in this case the component values are always related to
each other, there is no need to distinguish between them.
From one component value you can deduce all other
component values (which are in a collinear relation to the
component value in question). In this case, collinear data sets
can be used for the calibration without any problems.
An example is the spectroscopic determination of the
octane number. At first sight, it seems to be mysterious that
the knock behavior of SI engines for different fuel mixtures
can be predicted by an NIR measurement. Nevertheless, there
is no doubt about the validity of the calibration. The
knock behavior is directly (collinearly) related to the
individual fuel components. As these components can be
measured spectroscopically, the octane
number can be deduced directly from their values.
19 Is it allowed to improve the analysis results by
performing multiple measurements and averaging
the spectra afterwards?
Yes. This procedure is especially recommendable in case of
heterogeneous samples.
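The benefit of averaging can be illustrated with simulated data. The sketch below assumes purely random, independent noise on every repeat measurement, which is a simplification: sample heterogeneity is not modelled. All numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up "true" spectrum and 16 repeat measurements with random noise.
true_spectrum = np.linspace(0.2, 1.0, 200)
repeats = true_spectrum + rng.normal(scale=0.05, size=(16, 200))

# Deviation from the true spectrum for one measurement and for the average.
single_error = float(np.std(repeats[0] - true_spectrum))
averaged_error = float(np.std(repeats.mean(axis=0) - true_spectrum))
```

For independent noise, the error of the averaged spectrum decreases roughly with the square root of the number of measurements, here by a factor of about 4 for 16 repeats.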
This adaptation can be avoided if you use an FT spectrome-
ter with a measurement accuracy about 20 times better
(0.1 cm-1). So a high resolution capability of a spectrometer
is indirectly decisive for the usefulness of the spectrometer
during its routine operation, even if most applications do not
require a high resolution.
31 How can I find out the optimal measurement spot
size, e.g. for DRIFT measurements?
The necessary measurement spot size depends mainly on the
particle size and the heterogeneity of the sample. Ultimately, the
measurement spot must cover a representative sample cross-
section. Often a measurement spot diameter of several
centimeters is required (e.g. in the food and animal feed
industry, polymer industry, etc.). A sample rotator will
improve the result further as it rotates the sample
off-center over the measurement spot.
32 Is it recommendable to find the optimal spectral
ranges for the calibration on the basis of spectra
tables (see Table … or Figure …), or with the aid
of the spectra of the corresponding pure substances
(by measuring the exact positions of the main
absorption bands and entering these values into the
chemometric software)?
This procedure is not recommendable because the spectra
tables do not reveal for a specific case which bands are
overlapped by interferences and to what degree. The person in charge
of developing a chemometric method should invest time in
providing representative calibration data sets and critically
checking the robustness of the method from time to time.
34 The maintenance of a chemometric method is rec-
ommended, i.e. the quality should be checked from
time to time. But how am I to proceed if I detect a
gradual worsening of the method over a longer
time period?
Add further samples with new properties to the existing
method and revalidate the method. The revalidation may
lead to a new parameter set (spectral range, data
preprocessing methods and rank) which takes the new system
properties better into consideration.
I wish to thank my colleagues from Bruker Optik GmbH: Mr.
Tim Stadelmann for a number of practical hints, Dr. Markus
Arnold for the expert review of the formulas in Chapter 7,
and Ms. Dagmar Behmer for proofreading and laying out the
manuscript.