

04/2016:52100

5.21. CHEMOMETRIC METHODS APPLIED TO ANALYTICAL DATA

The following chapter is published for information only. It is an introduction to the use of chemometric techniques for processing analytical data sets. The objective is to provide indications on good chemometric practice and requirements.

1. GENERAL ASPECTS
1-1. INTRODUCTION
1-1-1. Scope of the chapter
This chapter is an introduction to the use of chemometric techniques for the processing of analytical data sets, which is an area of interest for research, quality control and manufacturing in the pharmaceutical industry. The objective is to provide information on the requirements for good chemometric practice and to also present a selection of established chemometric methods, but not an exhaustive review of these techniques, as refinements and innovations are constantly being introduced. The principles of the proposed methods will be briefly described along with their critical aspects and limitations. Mathematical details and algorithms are mostly omitted and a glossary is provided at the end of the chapter.
1-1-2. Definition
The actual definition of chemometrics is "the chemical discipline that uses mathematical and statistical methods, (a) to design or select optimal measurement procedures and experiments, and (b) to provide maximum chemical information by analysing chemical data".
From a more general point of view, chemometrics is not limited to chemical data and can contribute greatly to system understanding by analysing data, when limited knowledge and theory do not sufficiently explain observations and behaviour. Chemometric methods consist mainly of multivariate data-driven modelling techniques that result in empirical mathematical models that are subsequently used for the indirect prediction of properties of interest.
1-1-3. Background
Applications of chemometrics can be qualitative or quantitative, and it can help the analyst to structure the data set and to recognise hidden variable relationships within the system. However, it should be stressed that although such data-driven methods may be powerful, they would not replace a verified or established theory if available.
Chemometric methods have revolutionised near infrared spectroscopy (NIR) and such techniques are now integral components of process analytical technology (PAT) and quality by design (QbD) for use in improved process monitoring and quality control in a variety of fields. Chemometric methods can be found throughout the scientific and technological community, with a principal but non-exclusive focus on life and health sciences such as agriculture, food, pharmacy, chemistry, biochemistry and genomics, but also other industries such as oil, textiles, sensorics and cosmetics, with the potential to expand even further into other domains.
The associated mathematical principles have been understood since the early twentieth century, but chemometrics came of age with the development of digital technology and the related progress in the elaboration of mathematical algorithms. Many techniques and methods are based on geometric data representations, transformations and modelling. Later, mathematical and theoretical developments were also consolidated.
1-1-4. Introducing chemometrics
In chemometrics, a property of interest is evaluated solely in terms of the information contained in the measured samples. Algorithms are applied directly to the data set, and information of interest is extracted with models (modelling or calibration step). Chemometrics is associated with multivariate data analysis, which usually depends less on assumptions about the distribution of the data than many other statistical methods since it rarely involves hypothesis testing. During modelling the most sensitive changes in properties of interest can be amplified, while the less relevant changes in disturbing factors, whatever their origin, i.e. physical, chemical, experimental or instrumental variation, are minimised to the level of noise.
A model in chemometrics is a prediction method and not a formal or simplified representation of a phenomenon from physics, chemistry, etc. The ability of a model to predict properties has to be assessed with regard to its performance. The best model or calibration will provide the best estimations of properties of interest. A useful model is one that can be trusted and used for decision-making, for example. Adoption of a model in decision-making must be based on acceptable, reliable and well-understood assessment procedures.
In univariate analysis, identified variables in a system are analysed individually. However, in reality, systems tend to be more complex, where interactions and combination effects occur between sample variables and cannot be separated. Multivariate data analysis handles many variables simultaneously and the relationship within or between data sets (typically matrices) has to be rearranged to reveal the relevant information. In multivariate methods, the original data is often combined linearly to account as much as possible for the explainable part of the data and ideally, only noise will remain unmodelled. The model, when properly validated, can be used in place of costly and time-consuming measurements in order to predict new values.
Generally, projection techniques such as principal components analysis (PCA), principal components regression (PCR) or partial least squares regression (PLS) are recommended. However, the approach will be different depending on whether the data has been generated using experimental design (i.e. designed data) or has been collected at random from a given population (i.e. non-designed data). With designed data matrices, the variables are orthogonal by construction and traditional multilinear statistical methods are therefore well suited to describing the data within. However, in non-designed data matrices, the variables are seldom orthogonal, but are more or less collinear, which favours the use of multivariate data analysis.
1-1-5. Qualitative and quantitative data analysis
Qualitative data analysis can be divided into exploration, an unsupervised analysis where data from a new system is to be analysed, and classification, a supervised analysis where class-labels are predicted.
Unsupervised analysis
In exploratory data analysis, multivariate tools are used to gather an overview of the data in order to build hypotheses, select suitable analytical methods and sampling schemes, and to determine how multivariate analysis of current and future data of similar type can be performed. When the first exploratory treatment is finalised, classification can be subsequently carried out in the form of a secondary treatment, where samples are organised into specific groups or classes.
Supervised analysis
Classification is the process of determining whether or not samples belong to the same class as those used to build the model. If an unknown sample fits a particular model well, it is said to be a member of that class. Many analytical tasks fall into this category, e.g. materials may be sorted according to quality, physical grade and so on. Identity testing is a special


situation where unknown samples are compared with suitable reference materials, either by direct comparison or indirect estimation, e.g. using a chemometric model.
Quantitative data analysis, on the other hand, mainly consists of calibration, followed by direct application to new and unknown samples. Calibration consists of predicting the mathematical relationship between the property to be evaluated (e.g. concentration) and the variables measured.
1-2. GOOD CHEMOMETRIC PRACTICE
The following notation will be used in the chapter:
X, Y    data sets
X       independent variable
Y       dependent variable
X, Y    matrices
x, y    vectors
x, y    scalar values
i, j    indices, points
xi      ith value of vector x
xi,j    ith and jth value of matrix X
XT      transpose of matrix X
X-1     inverse (if it exists) of matrix X
X̄       mean centre of matrix X
X̂       estimate of matrix X
|X|     determinant of (square) matrix X
‖x‖     norm of vector x
b       regression equation coefficient
e       residuals of X
f       residuals of Y
1-2-1. Figures of merit for regression
In quantitative analysis, building a regression model involves fitting a mathematical relationship to the corresponding independent data (X) and dependent data (Y). The independent data may represent a collection of signals, i.e. responses from a number of calibration samples, while the dependent data may correspond to the values of an attribute, i.e. the property of interest in the calibration samples. It is advisable to test the regression model with internal and external test sets. The internal test set consists of samples that are used to build the model (or achieve calibration) by applying resampling within the calibration data and samples that are initially left out of the calibration in order to validate the model. Use of the internal test set is part of model optimisation and model selection. The external independent test set represents data that normally is available after the model has been fixed, thus the external test set challenges the model and tests its robustness for the analysis of future data.
1-2-1-1. Root mean square error of prediction
The link between X and Y is explored through a common set of samples (calibration set) from which both x- and y-values have been collected and are clearly known. For a second set of samples (validation set) the predicted y-values are then compared to the reference y-values, resulting in a prediction residual that can be used to compute a validation residual variance, i.e. a measure of the uncertainty of future predictions, which is referred to as root mean square error of prediction (RMSEP). This value estimates the uncertainty that can be expected when predicting y-values for new samples. Since no assumptions concerning statistical error distribution are made during modelling, prediction error cannot be used to report a valuable statistical interval for the predicted values. Nevertheless, RMSEP is a good error estimate in cases where both calibration and validation sample sets are representative of future samples.
A confidence interval for predicted y-values would be ± n × RMSEP, with n fixed by the operator. A common choice is n = 2. This choice should be dependent on the requirements of the specific analytical method.
Chemometric models can end up with better precision than the reference methods used to acquire calibration and testing data. This is typically observed for water content determinations by NIR and PLS, where semi-micro determination of water (2.5.12) is the reference method.
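By way of illustration, the short sketch below computes RMSEP for a validation set and the indicative ± n × RMSEP interval with n = 2. It assumes Python with NumPy and uses invented reference and predicted values; it is one possible implementation, not a prescribed procedure.

```python
import numpy as np

def rmsep(y_ref, y_pred):
    """Root mean square error of prediction over a validation set."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_ref) ** 2))

# Invented reference values (e.g. water content, per cent) and model predictions
y_ref = [1.02, 1.55, 2.10, 2.48, 3.01, 3.52]
y_pred = [1.08, 1.49, 2.17, 2.40, 3.10, 3.47]

error = rmsep(y_ref, y_pred)
n = 2  # operator-defined factor for the indicative interval
print(f"RMSEP = {error:.3f}")
print(f"Indicative interval for a new prediction: value ± {n * error:.3f}")
```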
1-2-1-2. Standard error of calibration and coefficient of determination
Figures of merit can be calculated to help assess how well the calibration fits the data. Two examples of such statistical expressions are the standard error of calibration (SEC) and the coefficient of determination (R2).
SEC has the same units as the dependent variables and reflects the degree of modelling error, but cannot be used to estimate future prediction errors. It is an indication of whether the calculation using the calibration equation will be sufficiently accurate for its intended purpose. In practice SEC has to be compared with the error of the reference method (SEL, Standard Error of Laboratory, see Glossary). Usually SEC is larger than SEL, in particular if modelling does not account for all interferences in the samples or if other physical phenomena are present.
The coefficient of determination (R2) is a dimensionless measure of how well the calibration fits the data. R2 can have values between 0 and 1. A value close to 0 indicates that the calibration fails to relate the data to the reference values and, as the coefficient of determination increases, the X-data becomes an increasingly more accurate predictor of the reference values. Where there is more than 1 independent variable, adjusted R2 should be used rather than R2, since the number of independent variables in the model inflates the latter even if the fraction of variance explained by the model is not increased.
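The following sketch illustrates one way of computing SEC, R2 and adjusted R2 for a calibration set. It assumes Python with NumPy, invented data, and the convention of n − p − 1 degrees of freedom for SEC (p being the number of independent variables with an intercept included); other conventions exist.

```python
import numpy as np

def calibration_figures_of_merit(y_ref, y_fit, n_variables):
    """SEC, R2 and adjusted R2 for a calibration set (n - p - 1 degrees of freedom)."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    n = y_ref.size
    residuals = y_ref - y_fit
    sec = np.sqrt(np.sum(residuals ** 2) / (n - n_variables - 1))
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_ref - y_ref.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_variables - 1)
    return sec, r2, r2_adj

# Invented calibration data: reference values and values fitted by a 2-variable model
y_ref = [0.52, 0.98, 1.51, 2.03, 2.47, 3.05, 3.49, 4.02]
y_fit = [0.55, 0.93, 1.55, 1.98, 2.52, 3.01, 3.55, 3.96]
sec, r2, r2_adj = calibration_figures_of_merit(y_ref, y_fit, n_variables=2)
print(f"SEC = {sec:.3f}, R2 = {r2:.3f}, adjusted R2 = {r2_adj:.3f}")
```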
1-2-2. Implementation steps
The implementation of chemometric methods varies case by case depending on the specific requirements of the system to be analysed. The following generic approach can be followed when analysing non-designed data sets:
– in formulating the study problem, define the precise objective of data collection and the expected analysis results;
– investigate the origin and availability of the data. The data set should cover the variation of the explored variable(s) or attribute(s);
– if the available data does not cover the expected variation, prepare and measure samples that fill the gap;
– variable selection: sometimes selecting the right variables can give more robustness and also enhance model accuracy;
– raw data may have to be transformed and mathematical pre-treatments performed;
– elaborate the model through calibration and validation;
– challenge the model and check its performance on new samples or data;
– validate the method according to current pharmaceutical usage and requirements.
1-2-3. Data considerations
1-2-3-1. Sample quality
Careful sample selection increases the likelihood of extracting useful information from the analytical data. Whenever it is possible to actively adjust selected variables or parameters according to an experimental design, the quality of the results is increased. Experimental design (also referred to as design of experiments, DoE) can be used to introduce


systematic and controlled changes between samples, not only for analytes, but also for interferences. When modelling, common considerations include the determination of which variables are necessary to adequately describe the samples, which samples are similar to each other and whether the data set contains related sub-groups.
1-2-3-2. Data tables, geometrical representations
Sample responses result in a group of numerical values relating to signal intensities (X-data), i.e. the independent variables. However, it should be recognised that these variables are not necessarily linearly independent (i.e. orthogonal) according to mathematical definitions. These values are best represented in data tables and by convention each sample is associated with a specific row of data. A collection of such rows constitutes a matrix, where the columns are the variables. Samples can then be associated with certain features reflecting their characteristics, i.e. the value of a physical or chemical property or attribute, and these data are usually referred to as the Y-data, i.e. the dependent variables. It is possible to add this column of values to the sample response matrix, thereby combining both the response and the attribute of each sample.
When n objects are described by m variables the data table corresponds to an n×m matrix. Each of the m variables represents a vector containing n data values corresponding to the objects. Each object therefore appears as a point in an m-dimensional space described by its m coordinate values (1 value for each variable in the m axes).
1-2-3-3. First assessment of data
Before performing multivariate data analysis, the quality of the sample response can be optionally assessed using statistical tools. Graphical tools are recommended for the 1st visual assessment of the data, e.g. histograms and/or boxplots for variables for evaluation of the data distribution, and scatter plots for detection of correlations. Descriptive statistics are useful for obtaining a rapid evaluation of each variable, taken separately, before starting multivariate analysis. For example, mean, standard deviation, variance, median, minimum, maximum and lower/upper quartile can be used to assess the data and detect out-of-range values and outliers, abnormal spread or asymmetry. These statistics reveal anomalies in a data table and indicate whether a transformation might be useful or not. Two-way statistics, e.g. correlation, show how variations in 2 variables may correlate in a data table. Verification of these statistics is also useful when reducing the size of the data table, as they help in avoiding redundancies.
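As an illustration of such a first assessment, the sketch below computes descriptive statistics and a correlation matrix for an invented data table, assuming Python with NumPy and pandas; the variable names and specification limits are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Invented data table: 30 samples (rows) and 4 variables (columns)
data = pd.DataFrame(
    rng.normal(loc=[10.0, 5.0, 1.2, 100.0], scale=[0.5, 0.8, 0.1, 4.0], size=(30, 4)),
    columns=["assay", "moisture", "impurity", "particle_size"],
)

# One-variable descriptive statistics (mean, standard deviation, quartiles, min/max)
print(data.describe())

# Two-way statistics: correlation matrix to detect related variables
print(data.corr())

# Simple out-of-range check against invented limits
limits = {"moisture": (3.0, 7.0)}
for column, (low, high) in limits.items():
    outside = data[(data[column] < low) | (data[column] > high)]
    print(f"{column}: {len(outside)} value(s) outside [{low}, {high}]")
```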
1-2-3-4. Outliers
An outlier is a sample that is not well described by the model. Outliers can be X or Y in origin. They reflect unexpected interference in the original data or measurement error. Predicted data that is very different from the expected value calls into question the suitability of the modelling procedure and the range spanned by the original data. In prediction mode, outliers can be caused by changes in the interaction between the instrument and samples or if samples are outside the model's scope. If this new source of variability is confirmed and is relevant, the corresponding data constitutes a valuable source of information. Investigation is recommended to decide whether the existing calibration requires strengthening (updating) or whether the outliers should be ignored as uncritical or unrelated to the process (i.e. operator error).
In the case of classification an outlier test should be performed on each class separately.
1-2-3-5. Data error
Types of data error include random error in the reference values of the attributes, random error in the collected response data and systematic error in the relationship between the two. Sources of calibration error are problem specific, for example, reference method errors and errors due to either sample non-homogeneity or the presence of non-representative samples in the calibration set. Model selection during calibration usually accounts for only a fraction of the variance or error attributable to the modelled analytical technique. However, it is difficult to assess if this error is more significant than the reference method error or vice versa.
1-2-3-6. Pre-processing and variable selection
The raw data may not be optimal for analysis and are generally pre-processed before performing chemometric calculations to improve the extraction of physical and chemical information. Interferences, for example background effects, baseline shifts and measurements in different conditions, can impede the extraction of information when using multivariate methods. It is therefore important to minimise the noise introduced by such effects by carrying out pre-processing operations.
A wide range of transformations (scaling, smoothing, normalisation, derivatives, etc.) can be applied to X-data as well as Y-data for pre-processing prior to multivariate data analysis in order to enhance the modelling. The main purpose of these transformations is focussing the data analysis on the pertinent variability within the data set. For example, pre-processing may involve mean centering of variables so that the mean does not influence the model and thus reduce the model rank.
The selection of the pre-processing is mostly driven by parameters such as type of data, instrument or sample, the purpose of the model and user experience. Pre-processing methods can be combined, for example standard normal variate (SNV) with 1st derivative, as an empirical choice.
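The sketch below illustrates one such empirical combination, standard normal variate followed by a Savitzky-Golay 1st derivative, assuming Python with NumPy and SciPy; the simulated spectra and the filter settings (window length, polynomial order) are arbitrary choices for demonstration.

```python
import numpy as np
from scipy.signal import savgol_filter

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

rng = np.random.default_rng(1)
# Invented raw "spectra": 5 samples x 200 data points with additive baseline differences
x_axis = np.linspace(0, 1, 200)
X_raw = (np.exp(-((x_axis - 0.5) ** 2) / 0.01)
         + rng.normal(0, 0.01, (5, 200))
         + rng.uniform(0.0, 0.5, (5, 1)))

# SNV followed by a Savitzky-Golay 1st derivative (settings chosen empirically)
X_snv = snv(X_raw)
X_pretreated = savgol_filter(X_snv, window_length=11, polyorder=2, deriv=1, axis=1)
print(X_pretreated.shape)
```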
1-2-4. Maintenance of chemometric models
Chemometric methods should be reassessed regularly to demonstrate a consistent level of acceptable performance. In addition to this periodical task, an assessment should be carried out for critical parameters when changes are made to application conditions of the chemometric model (process, sample sources, measurement conditions, analytical equipment, software, etc.).
The aim of maintaining chemometric models up-to-date is to provide applications that are reliable over a longer period of use. The extent of the validation required, including the choice of the necessary parameters, should be based on risk analysis, taking into account the analytical method used and the chemometric model.
1-3. ASSESSMENT AND VALIDATION OF CHEMOMETRIC METHODS
1-3-1. Introduction
Current use of the term 'validation' refers to the regulatory context as applied to analytical methods, but the term is also used to characterise a typical computation step in chemometrics. Assessment of a chemometric model consists of evaluating the performance of the selected model in order to design the best model possible with a given set of data and prerequisites. Provided sufficient data are available, a distribution into 3 subsets should always be considered: 1) a learning set to elaborate models, 2) a validation set to select the best model, i.e. the model that enables the best predictions to be made, 3) an independent test set to estimate objectively the performance of the selected final model. Introducing a 3rd set for objective model performance evaluation is necessary to estimate the model error, among other performance indicators. An outline is given below on how to perform a typical assessment of a chemometric model, starting with the initial validation, followed by an independent test validation and finally association/correlation with regulatory requirements.
1-3-2. Assessment of chemometric models
1-3-2-1. Validation during modelling
Typically, algorithms are iterative and perform self-optimisation during modelling through an on-going evaluation of performance criteria and figures of merit. This step is called validation. The performance criteria are specific to the chemometric technique used and to the


nature of the analytical data, as well as the purpose of the overall method, which includes both the analytical side and the chemometric model. The objective of the validation is to evaluate the model and provide help to select the best performing model. Selected samples are either specifically assigned for this purpose or are selected dynamically through reselection/re-use of data from a previous data set (sometimes called resampling – for clarification, see Glossary). A typical example of data reselection is cross-validation with specific segments, for example 'leave-one-out' cross-validation when samples are only a few, or 'leave-subset-out' cross-validation (Figure 5.21.-1). Another type of resampling is bootstrapping.
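The following sketch illustrates a 'leave-subset-out' cross-validation of the kind shown in Figure 5.21.-1, assuming Python with scikit-learn, a PLS model with an arbitrary number of components and invented data; the cumulated cross-validation error is reported as RMSECV.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(2)
# Invented calibration data: 30 samples x 50 variables and a correlated property
X = rng.normal(size=(30, 50))
y = X[:, :5].sum(axis=1) + rng.normal(0, 0.1, 30)

# 'Leave-subset-out' cross-validation: the calibration set is split into 5 segments,
# each segment is left out in turn and predicted by a model built on the remainder.
model = PLSRegression(n_components=3)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
y_cv = cross_val_predict(model, X, y, cv=cv).ravel()

rmsecv = np.sqrt(np.mean((y_cv - y) ** 2))
print(f"Cumulated cross-validation error (RMSECV) = {rmsecv:.3f}")
```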
1-3-2-2. Assessment of the model
Once the model matches the optimisation requirements, fitness for purpose is assessed. Independent samples not used for modelling or model optimisation are introduced at this stage as an independent test-set in order to evaluate the performance of the model. Ideally, when sufficient data are available, the sample set can be split into 3 subsets comprising 1 learning set for model computation, 1 validation set for optimisation of the model, and 1 test set for evaluation of the prediction ability, i.e. whether the model is fit for purpose. The 3 subsets are treated independently and their separation should be performed in such a way that model computation is not biased. The aim is to obtain a representative distribution of the samples within the 3 subsets with regard to their properties and expected values.
1-3-2-3. Size and partitioning of data sets
The size of the data set needed for building the calibration is dependent on the number of analytes and interfering properties that need to be handled in the model. The size of the learning data set for calibration usually needs to be larger when the interfering variations are acquired randomly than when all major interferences are known and they can be varied according to a statistical experimental design. The lowest possible number of samples needed to cover the calibration range can be estimated from the corresponding design. The size of the independent test set should be in the order of 20-40 per cent of the samples used for the calibration model. However, when representative samples are abundant, the larger the test data set (above 40 per cent), the more reliably the prediction performance can be estimated. It is common practice to mix learning and model validation sets and, as a result, the definitive assessment of the model relies on the separate test set.
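One possible way of partitioning a data set into the 3 subsets described above is sketched here, assuming Python with scikit-learn and invented data; the split proportions are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))   # invented response data (samples x variables)
y = rng.normal(size=100)         # invented attribute values

# First split off an independent test set (here 30 per cent, within the 20-40 per cent guide)
X_work, X_test, y_work, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Then split the remainder into a learning set and a validation set for model selection
X_learn, X_val, y_learn, y_val = train_test_split(X_work, y_work, test_size=0.25, random_state=0)

print(len(y_learn), "learning,", len(y_val), "validation,", len(y_test), "test samples")
```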
1-3-3. Validation according to the regulatory framework
Validation principles and considerations are described in established international guidelines and apply to the validation of analytical methods. However, due to the special nature of data treatment and evaluation, as carried out in most chemometric methods, additional aspects have to be taken into account when validating analytical procedures. In this context, validation comprises both the assessment of the analytical method performance and the evaluation of the model. In some special cases, it might only be necessary to validate the chemometric model (see section 1-2-4.).
1-3-3-1. Qualitative models
For validation of qualitative models, the most critical parameters are specificity and robustness. When not applicable, scientific justification is required.
Specificity
During validation it has to be shown that the model possesses sufficient discriminatory capability. Therefore, a suitable set of materials that pose a risk of mix-up must be defined and justified. If, in addition to chemical identification, other parameters (such as polymorphism, particle size, moisture content, etc.) are relevant, a justification for these parameters should also be included. The selection of materials to be included when validating specificity should be based on logistic considerations (e.g. materials handled close to the process under review, especially those with similar appearance), chemical considerations (e.g. materials with similar structure) and also physical considerations where relevant (e.g. materials with different physical properties). After definition of this set of materials, the discriminatory ability of the chemometric method to reject them must be proven. Therefore, for each material a representative set of samples covering typical variance within the material has to be analysed and evaluated. If the specificity of the chemometric model is insufficient, the parameters of the model should be optimised accordingly and the method revalidated.
Whenever new elements that may potentially affect identification are introduced, e.g. new materials that are handled at the same site and represent a risk of mix-up, a revalidation of specificity should be carried out. This revalidation can be limited to the new element and does not necessarily need to encompass the complete set of elements, whose constituents may not all be affected by the change. If properties of materials change over time (e.g. batches of materials with lower or higher particle size, lower or higher moisture content, etc.) and these changes become relevant, they should also be included as part of the validation. This can be achieved, for example, by an amendment to the validation protocol and does not necessarily require a complete revalidation of the chemometric model.
To assess specificity, the number of false-positive and false-negative errors can be evaluated by classification of the test set.
Robustness
For validation of robustness, a comprehensive set of critical parameters (e.g. process parameters such as temperature, humidity, instrumental performance of the analytical equipment) should be considered. The reliability of the analytical method should be challenged by variation of these parameters. It can be advantageous to use experimental design (DoE) to evaluate the method.
Figure 5.21.-1. – Cross-validation with leave-subset-of-3-out applied to linear regression. Regression model data = ●. Subset used for test = ○. The errors of fit (interrupted lines) are collected to form the cumulated cross-validation error.


To assess robustness, the number of correct classifications, correct rejections, false-positive and false-negative errors can be evaluated by classification of samples under robustness conditions.
1-3-3-2. Quantitative models
The following parameters should be addressed unless otherwise justified: specificity, linearity, range, accuracy, precision and robustness.
Specificity
It is important to detect that the sample that is quantified is not an outlier with respect to the calibration space. This can be done using the correlation coefficient between the sample and the calibration mean, as well as Hotelling T2, among others.
Linearity
Linearity should be validated by correlating results from the chemometric model with those from an analytical reference method. It should cover the entire range of the method and should involve a specifically selected set of samples that is not part of the calibration set. For orientation purposes, a 'leave-subset-out' cross-validation based on the calibration set may be sufficient, but should not replace assessment using an independent test set. Linearity can be evaluated through the correlation coefficient, slope and intercept.
Range
The range of analyte reference values defines the range of the chemometric model, and its lower limit determines the limits of detection and quantification of the analytical method. Controls must be in place to ensure that results outside this range are recognised as such and identified. Within the range of the model, acceptance criteria for accuracy and precision have to be fulfilled.
Accuracy
The accuracy of the chemometric model can be determined by comparison of analytical results obtained from the chemometric model with those obtained using a reference method. The evaluation of accuracy should be carried out over the defined range of the chemometric model using an independent test set. It may also be helpful to assess the accuracy of the model using a 'leave-subset-out' cross-validation, although this should not replace assessment using an independent test set.
Precision
The precision of the analytical method should be validated by assessing the standard deviation of the measurements performed through the chemometric model. Precision covers repeatability (replicate measurements of the same sample by the same person on the same day) and intermediate precision (replicate measurements of the same sample by another person on different days). Precision should be assessed at different analyte values covering the range of the chemometric model, or at least at a target value.
Robustness
For validation of robustness, the same principles as described for qualitative methods apply. Extra care should be taken to investigate the effects of any parameters relevant for robustness on the accuracy and precision of the chemometric model. It can be an advantage to evaluate these parameters using experimental design.
The chemometric model can also be investigated using challenge samples, which may be samples with analyte concentrations outside the range of the method or samples of different identity. During the validation, it must be shown that these samples are clearly recognised as outliers.
2. CHEMOMETRIC TECHNIQUES
A non-exhaustive selection of chemometric methods is discussed below. A map of the selected methods is given in Figure 5.21.-2.
2-1. PRINCIPAL COMPONENTS ANALYSIS
2-1-1. Introduction
The complexity of large data sets or tables makes human interpretation difficult without supplementary methods to aid in the process. Principal components analysis (PCA) is a projection method used to visualise the main variation in the data. PCA can show in what respect 1 sample differs from another, which variables contribute most to this difference and whether these variables contribute in the same way and are correlated or are independent of each other. It also reveals sample set patterns or groupings within the data set. In addition, PCA can be used to estimate the amount of useful information contained in the data table, as opposed to noise or meaningless variations.
2-1-2. Principle
PCA is a linear data projection method that compresses data by decomposing it to so-called latent variables. The procedure yields columns of orthogonal vectors (scores), and rows of orthonormal vectors (loadings). The principal components (PCs), or latent variables, are a linear combination of the original variable axes. Individual latent variables can be interpreted via their connection to the original variables. In essence, the same data is shown but in a new coordinate system. The relationships between samples are revealed by their projections (scores) on the PCs. Similar samples group together in respect to PCs. The distance between samples is a measure of similarity/dissimilarity.
Figure 5.21.-2. – Map of chemometric methods discussed in the chapter


The original data table is transformed into a new, rearranged matrix whose structure reveals the relationships between rows and columns that may be hidden in the original matrix (Figure 5.21.-3). The new structure constitutes the explained part of the original data. The procedure models the original data down to a residual error, which is considered the unexplained part of the data and is minimised during the decomposition step.
The underlying idea is to replace a complex data table with a simpler counterpart version having fewer dimensions, but still fitting the original data closely enough to be considered a good approximation (Figure 5.21.-4). Extraction of information from a data table consists of exploring variations between samples, i.e. finding out what makes a sample different from or similar to another. Two samples can be described as similar if they have similar values for most variables. From a geometric perspective, the combination of measurements for 1 sample defines a point in a multidimensional space with as many dimensions as there are variables. In the case of close coordinates the 2 points are located in the same area or volume. With PCA, the number of dimensions can be reduced while keeping similar samples close to each other and dissimilar samples further apart in the same way as in the multidimensional space, but compressed into an alternate lower dimensional coordinate system.
The principle of PCA is to find the directions in the data space that describe the largest variation of the data set, i.e. where the data points are furthest apart. Each direction is a linear combination of the initial variables that contribute most to the actual variation between samples. By construction, principal components (PCs) are orthogonal to each other and are also ranked so that each carries more information than any of those that follow. Priority is therefore given to the interpretation of these PCs, starting with the 1st, which incorporates the greatest variation and thereby constitutes an alternative less complex system that is more suitable for interpreting the data structure. Normally, only the 1st PCs contain pertinent information, with later PCs being more likely to describe noise. In practice, a specific criterion is used to ensure that noise is not mistaken for information and this criterion should be used in conjunction with a method such as cross-validation or evaluation of loadings in order to determine the number of PCs to be used for the analysis. The relationships between samples can then be subsequently viewed in 1 or a series of score plots. Residuals Ê keep the variation that is not included in the model, as a measure of how well samples or variables fit that model. If all PCs were retained, there would be no approximation at all and the gain in simplicity would consist only of ordering the variation of the PCs themselves by size. Deciding on the number of components to retain in a PCA model is a compromise between simplicity, robustness and goodness of fit/performance.
2-1-3. Assessment of model
Total explained variance R2 is a measure of how much of the original variation in the data is described by the model. It expresses the proportion of structure found in the data by the model. Total residual and explained variances show how well the model fits the data. Models with small total residual variance (close to 0 per cent) or large total explained variance (close to 100 per cent) can explain most of the variation in the data. With simple models consisting of only a few components, residual variance falls to 0; otherwise, it usually means that the data contains a large amount of noise. Alternatively, it can also mean that the data structure is too complex to be explained using only a few components. Variables with small residual variance and large explained variance for a particular component are well defined by the model. Variables with large residual variance for all or the 1st components have a small or moderate relationship with other variables. If some variables have much larger residual variance than others for all or the 1st components, they may be excluded in a new calculation and this may produce a model that is more suitable for its purpose. Independent test set variance is determined by testing the model using data that was not used in the actual building of the model itself.
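A minimal PCA sketch is given below for illustration, assuming Python with NumPy and scikit-learn and simulated data with 2 underlying sources of variation; it shows scores, loadings and the explained and remaining (residual) variance per component.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Invented data table: 25 samples x 10 variables driven by 2 underlying factors plus noise
scores_true = rng.normal(size=(25, 2))
loadings_true = rng.normal(size=(2, 10))
X = scores_true @ loadings_true + rng.normal(0, 0.05, (25, 10))

# Mean-centring is included in the scikit-learn PCA model
pca = PCA(n_components=3).fit(X)
T = pca.transform(X)        # scores: projections of the samples on the PCs
P = pca.components_         # loadings: contribution of the original variables to each PC

explained = pca.explained_variance_ratio_ * 100
residual = 100 - explained.cumsum()
for i, (e, r) in enumerate(zip(explained, residual), start=1):
    print(f"PC{i}: explained {e:5.1f} per cent, remaining variance {r:5.1f} per cent")
```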

2-1-4. Critical aspects
PCA catches the main variation within a data set. Thus comparatively smaller variations may not be distinguished.
2-1-5. Potential use
PCA is an unsupervised method, making it a useful tool for exploratory data analysis. It can be used for visualisation, data compression, checking groups and trends in the data, detecting outliers, etc.
For exploratory data analysis, PCA modelling can be applied to the entire data table once. However, for a more detailed overview of where a new variation occurs, evolving factor analysis (EFA) can be used and, in this case, PCA is applied in an expanding or fixed window, where it is possible to identify, for example, the manifestation of a new component from a series of consecutive samples.
PCA also forms the basis for classification techniques such as SIMCA and regression methods such as PCR. The property of PCA to capture the largest variations in the 1st principal components allows subsequent regression to be based on fewer latent variables. Examples of utilising components as independent data in regression are PCR, MCR and ANN.
PCA is used in multivariate statistical process control (MSPC) to combine all available data into a single trace and to apply a signature for each unit operation or even an entire manufacturing process based on, for example, Hotelling T2 statistics, PCA model residuals or individual scores. In addition to univariate control charts, 1 significant advantage with PCA is that it can be used to detect multivariate outliers, i.e. process conditions or process output that has a different correlation structure than the one present in the previously modelled data.
Figure 5.21.-3. – Geometrical representation of 3 different X-data sets. On the left, objects are plotted in the multivariate space, and the following examples reveal a hidden structure, i.e. a plane and a line respectively


X = original data matrix of n rows and m columns; T̂ = score matrix with n rows and p columns; P̂T = loadings matrix with p rows and m columns; Ê = residual matrix (same size as matrix X); n = number of measurements (samples); m = number of data points (variables); p = number of factors; xu = data of unknown sample; t̂u = score values for unknown sample.
Figure 5.21.-4. – Decomposition of the X-matrix for principal components analysis (PCA)
2-2. MEASURES BETWEEN OBJECTS
The primary use of the following algorithms is to measure the degree of similarity between an object and a group or the centre of the data.
2-2-1. Similarity measures
Calculation of the correlation is the simplest statistical tool used to compare data and to determine the degree of similarity, provided the data sets have the same dimension, e.g. spectral data. It is a measure of the linear association between a pair of vectors. A correlation score between -1 and +1 is calculated for the match, based on the system below, where a perfect match (mirror image) would have a score of +1 and 2 lines that are complete opposites would have a score of -1 (Figure 5.21.-5).
Correlation is used to compare data sets in any of the following ways:
– comparison of 2 selected samples;
– comparison of 1 or more selected samples represented by vectors with a reference data library (from a group or class).
The reference data can be the average of a group of typical characteristics.
Correlation r between 2 vectors x and y of the same dimension can be calculated using the following equation:
r = Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)2 × Σ(yi − ȳ)2]
where the sums run over the elements of the vectors and x̄ and ȳ denote the respective mean values.
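For illustration, the sketch below computes correlation scores between an invented reference shape and two test samples of the same dimension, assuming Python with NumPy; a shape similar to the reference scores close to +1 and an opposite shape close to -1.

```python
import numpy as np

rng = np.random.default_rng(5)
# Invented "spectra": a reference (class average) and two test samples of the same dimension
reference = np.sin(np.linspace(0, 3 * np.pi, 120))
sample_a = reference + rng.normal(0, 0.05, 120)    # similar shape
sample_b = -reference + rng.normal(0, 0.05, 120)   # opposite shape

r_a = np.corrcoef(reference, sample_a)[0, 1]
r_b = np.corrcoef(reference, sample_b)[0, 1]
print(f"correlation with sample A = {r_a:+.3f} (close to +1)")
print(f"correlation with sample B = {r_b:+.3f} (close to -1)")
```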
2-2-2. Distance measures
In the object space, a collection of objects will be seen as points that are more or less close to each other and will gather into groups or clusters. Measuring the distance between points will express the degree of similarity between objects. In the same way, measuring the distance of a point to the centre of a group will give information about the group membership of this object. The following algorithms are given to illustrate the way objects can be compared.
2-2-2-1. Euclidean distance
The Euclidean distance edi,j between 2 points i and j can be calculated as:
edi,j = √[Σ(xi,k − xj,k)2], with the sum taken over the m variables (k = 1 to m).
Similarly, the Euclidean distance edi,c between the point i and the centre c of the data can be calculated as the square root of the sum of the squared differences of the coordinates of point i to the mean value of the x-coordinates for each of the m axes, which can be expressed by the following matrix notation:
edi,c = √[(xi − x̄)(xi − x̄)T]
where xi denotes the m values of coordinates describing the point i and x̄ denotes the mean coordinates calculated for the m variables. The superscript T indicates that the 2nd term of the equation is transposed.
Figure 5.21.-5. – Examples of correlation scores illustrated by matching of the shapes


2-2-2-2. Mahalanobis distance
The Mahalanobis distance (md) takes into account the correlation between variables by using the inverse of the variance-covariance matrix:
mdi,j = √[(xi − xj) Cx-1 (xi − xj)T]
The variance-covariance matrix Cx is calculated using the following equation:
Cx = (1/(n − 1)) XcT Xc
where Xc is the n×m data matrix centred over the mean of each column. Thus Cx is a square matrix that contains the variance of each variable over its diagonal and the covariance between variables on both sides of the diagonal. The Mahalanobis distance of point i to the centre c of the data is given by the following equation:
mdi,c = √[(xi − x̄) Cx-1 (xi − x̄)T]
The Cx-1 matrix is the inverse of the variance-covariance matrix and CxCx-1 = I, where I is the identity matrix. The number of variables or principal components involved in calculating the distance is designated p, and n is the number of objects in the group or in the data set. Under the assumption that the data is normally distributed, the random variable n × md2/(n − 1)2 is beta distributed with degrees of freedom u = p/2 and v = (n − p − 1)/2. Thus, if for a point xi this expression exceeds the (1 − α)-quantile of the beta distribution then the point can be classified as an outlier with the significance level α (i.e. α is the probability of the type-I error of classifying the point as an outlier although it is not).
In the same way, the leverage effect (h) of a data point located at the extremity of the X-space on the regression parameters of a multivariate model can be calculated using the following equation:
hi = 1/n + mdi,c2/(n − 1)
Data points with high leverage have a large influence on the model.
2-2-2-3. Critical aspects
Euclidean distances only express the similarities or differences between data points when the variables are strictly uncorrelated. If correlations between variables exist they contain at least partially the same information and the dimensionality of data space is in fact smaller than the number of variables. Mahalanobis distances allow for the correction of correlations but their calculation supposes the variance-covariance matrix to be invertible. In some instances where there is high collinearity in the data set, this matrix is singular and cannot be inverted. This is especially the case with spectroscopic data where the high resolution of spectrometers introduces redundancy by essentially describing the same signal through measurements at several consecutive wavelengths. Another constraint in variance-covariance matrix inversion is that the number of variables has to be smaller than the number of objects (n > m).
Distances can be computed in the PC space, thus providing the benefits of reduced dimensionality, orthogonality between PCs and also PC ordering. As the 1st PCs carry the maximum amount of information, data reduction without loss of information can be achieved by eliminating the later insignificant PCs.
If a sufficient number of PCs are used to closely model the data, the Euclidean distance of data points to the centre of the data set will be identical when calculated from PC scores and from the coordinates on the original variables. This can be understood by considering that PCA calculation does not transform data but only extracts latent variables to describe the data space without distorting it. The same applies when using Mahalanobis distances, where the values are identical whether the original data space or that of the PCs is used. The only difference is the simplification of the calculation of Mahalanobis distances. Due to the orthogonality between PCs, Mahalanobis distances can be computed as the Euclidean distances calculated over the range of the normalised scores using the appropriate multiplication factor.
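The sketch below illustrates the Mahalanobis distance to the data centre and a beta-distribution outlier test of the kind described above, assuming Python with NumPy and SciPy and an invented data set containing one deliberately extreme object.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(6)
n, p = 40, 3
X = rng.normal(size=(n, p))      # invented data set: n objects, p variables
X[-1] = [4.0, 4.0, 4.0]          # one deliberately extreme object

centre = X.mean(axis=0)
C = np.cov(X, rowvar=False)      # variance-covariance matrix (1/(n-1) convention)
C_inv = np.linalg.inv(C)

diff = X - centre
md2 = np.einsum("ij,jk,ik->i", diff, C_inv, diff)   # squared Mahalanobis distances

# Outlier test: compare n*md^2/(n-1)^2 with the (1 - alpha) quantile of the beta distribution
alpha = 0.05
limit = beta.ppf(1 - alpha, p / 2, (n - p - 1) / 2)
outliers = np.where(n * md2 / (n - 1) ** 2 > limit)[0]
print("suspected outliers (indices):", outliers)
```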
2-2-3. Linear and quadratic discriminant analysis
2-2-3-1. Principle
In Linear Discriminant Analysis and Quadratic Discriminant Analysis (LDA, QDA), the assignment of a test object xi to one of K predefined groups (or classes) identified in the data set is determined by the classification score:
d(xi, K) = (xi − x̄K)T CK-1 (xi − x̄K) + ln|CK| − 2 ln(πK)
where πK is the prior probability of group K and is equal to the number of objects contained in group K divided by the total number of objects in the training set. C is the variance-covariance matrix and |C| is its determinant.
2-2-3-2. Critical aspects
LDA assumes that the variance-covariance matrix for all classes is identical, while QDA estimates a variance-covariance matrix for each class. Hence in QDA far more parameters need to be estimated, which should only be done if sufficient data are available.
2-2-3-3. Potential use
It can be used in the case of straightforward classification schemes.
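For illustration, a minimal LDA/QDA sketch is given below, assuming Python with scikit-learn and invented training data for 2 groups; LDA pools the variance-covariance matrix across classes while QDA estimates one per class.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(7)
# Invented training set: 2 groups with different means, 4 variables
X_class1 = rng.normal(loc=0.0, scale=1.0, size=(30, 4))
X_class2 = rng.normal(loc=2.0, scale=1.2, size=(30, 4))
X_train = np.vstack([X_class1, X_class2])
y_train = np.array([1] * 30 + [2] * 30)

x_test = np.array([[1.8, 2.1, 1.9, 2.2]])   # unknown object to assign

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)       # pooled covariance
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)    # per-class covariances
print("LDA assignment:", lda.predict(x_test)[0], "- QDA assignment:", qda.predict(x_test)[0])
```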
2-3. SOFT INDEPENDENT MODELLING OF CLASS ANALOGY
2-3-1. Introduction
Soft independent modelling of class analogy (SIMCA) is a method for supervised classification of data. The method requires a training set, which consists of samples with known attributes that have been pre-assigned to different classes. SIMCA classes can be overlapping and share common elements. Therefore, a sample can belong to 1, multiple or none of the classes.
2-3-2. Principle
PCA models are 1st established for individual classes. The samples of the training set have to be analysed by PCA (see the section on PCA) and for each class a distinct principal components model is generated. The number of relevant principal components can be adjusted for each class of objects separately. According to this procedure the data sets of each class can be reduced to the relevant principal components models.
New objects are then classified based on the individual PCA models. A new object is projected into each of these models and assigned to a certain class when its residual distance from this model is below the limit for this class (Figure 5.21.-6). Distances of objects to the respective classes can be calculated by procedures such as either Euclidean or Mahalanobis distance. Consequently an object may belong to either 1 or multiple classes if the corresponding distances are within the required threshold. If the distance of an object to all of the SIMCA classes is above the threshold, then it will be classified as an outlier.


Figure 5.21.-6. – Plot representing the 4 possible classifications of test objects in a two-class SIMCA analysis (□ = unknown sample to classify, Δ = class 1 sample, ○ = class 2 sample)
2-3-3. Critical aspects
Since SIMCA is mainly based on PCA principles, the validation of the method should follow that of PCA. In addition to this, the overlap of different classes must also be taken into account. For example, a molecule can have several chemical groups that appear in its spectroscopic profile. Thus, grouping such data into chemical subgroups results in overlap since separation is not possible.
2-3-4. Potential use
SIMCA is often used for the classification of analytical data from techniques such as near-infrared (NIR) or mass spectroscopy, and other analytical techniques such as chromatography and chemical imaging. SIMCA is more suitable than PCA for discriminating between classes that are difficult to separate.
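A much simplified SIMCA-like sketch is given below for illustration, assuming Python with NumPy and scikit-learn; one PCA model is fitted per class and a new object is compared with an empirical residual-distance limit for each class. The limit used here (mean + 3 standard deviations of the training residuals) is an arbitrary choice; actual SIMCA implementations derive class limits in other ways.

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_class_model(X_class, n_components):
    """Fit one PCA model per class and record an empirical residual-distance limit."""
    model = PCA(n_components=n_components).fit(X_class)
    reconstructed = model.inverse_transform(model.transform(X_class))
    residuals = np.sqrt(((X_class - reconstructed) ** 2).sum(axis=1))
    limit = residuals.mean() + 3 * residuals.std(ddof=1)
    return model, limit

def residual_distance(model, x):
    x = x.reshape(1, -1)
    reconstructed = model.inverse_transform(model.transform(x))
    return float(np.sqrt(((x - reconstructed) ** 2).sum()))

rng = np.random.default_rng(8)
# Invented training data for 2 classes (20 samples x 15 variables each)
class_data = {"class 1": rng.normal(0.0, 1.0, (20, 15)),
              "class 2": rng.normal(3.0, 1.0, (20, 15))}
models = {name: fit_class_model(X, n_components=2) for name, X in class_data.items()}

x_new = rng.normal(3.0, 1.0, 15)   # unknown sample, generated like class 2
for name, (model, limit) in models.items():
    d = residual_distance(model, x_new)
    verdict = "member" if d <= limit else "not a member"
    print(f"{name}: residual distance {d:.2f} (limit {limit:.2f}) -> {verdict}")
```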
2-4. CLUSTERING
2-4-1. Introduction
A cluster consists of a group of objects or data points similar to each other. Clustering tools can be used to visualise how data points 'self-organise' into distinct groups or to highlight the degree of similarity between data objects. Data points from a particular cluster share some common characteristics that differentiate them from those gathered in other clusters. Clusters are characterised using 3 main properties: size, shape and distance to the nearest cluster. Clustering is an unsupervised method of data analysis and is used either for exploratory or confirmatory analysis. It differs from discriminant analysis, which is a supervised classification technique, where an unlabelled object is assigned to a group of pre-classified objects.
2-4-2. Principle
Numerous data clustering approaches are available and are typically classed as either hierarchical or non-hierarchical. Hierarchical clustering leads to the classical dendrogram graphical representation of the data, whereas non-hierarchical clustering finds clusters without imposing a hierarchical structure. Numerous algorithms are described in the literature, where data is partitioned either in a specific way or by optimising a particular clustering criterion. This simple and exclusive distinction is incomplete since mixed algorithms have similarities to both approaches. Hierarchical clustering recursively finds clusters either in agglomerative (bottom-up) or divisive (top-down) mode to form a tree-shaped structure. Agglomerative mode starts with defining each data point as its own cluster and then merging similar clusters in pairs before repeating this step until the complete data set is classified (Figure 5.21.-7). Divisive mode starts by considering the entire data set as a single cluster, which is then recursively divided until only clusters containing a unique data point are obtained.
Algorithms differ in the way they calculate the similarity between clusters. Complete link and single link algorithms calculate the distance between all pairs of objects that belong to different clusters in order to evaluate the similarity between them. In the single link method, this distance corresponds to the minimum distance separating 2 objects originating from 2 different clusters, whereas in the complete link algorithm this distance corresponds to the largest distance between 2 objects from 2 different clusters. Ward's algorithm, also called the minimum variance algorithm, calculates the similarity between clusters by means of decreasing cluster variance when the 2 most similar clusters are merged.
Figure 5.21.-7. – Dendrogram for agglomerative hierarchical clustering until clusters containing a unique data point are obtained
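For illustration, the sketch below performs agglomerative hierarchical clustering of an invented data set with single link, complete link and Ward's methods, assuming Python with SciPy; the linkage matrix encodes the dendrogram, and here the tree is simply cut into 2 clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(9)
# Invented objects forming 2 loose groups in a 3-variable space
X = np.vstack([rng.normal(0.0, 0.3, (5, 3)), rng.normal(2.0, 0.3, (5, 3))])

# Agglomerative clustering with different definitions of similarity between clusters
for method in ("single", "complete", "ward"):
    Z = linkage(X, method=method)                     # linkage matrix = dendrogram structure
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```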
Non-hierarchical clustering cannot be described and categorised as easily as hierarchical clustering. Different algorithms exist, which give rise to different classification schemes. An overview of the different categories of algorithms is given below, ranging from simple distance-based methods such as the minimum spanning tree and the nearest neighbour algorithms, to more sophisticated methods such as the K-means algorithm (often cited as a classical partition


method), the expectation-maximisation algorithm (for 'model-based' methods) and DBSCAN for 'density-based' algorithms and, also, the 'grid-based' methods which are exemplified by the statistical information grid (STING) algorithm.
Minimum spanning tree clustering, such as Kruskal's algorithm, is similar to the graph theory algorithm as all the data points are first of all connected by drawing a line between the closest points. When all data points are linked, the lines of largest length are broken, leaving clusters of closely connected points. For nearest neighbour clustering, an iterative procedure is used to assign a data point to a cluster when the distance between this point and its immediate neighbour (that belongs to a cluster) is below a pre-defined threshold value.
The K-means algorithm is one of the most popular and, as with partition algorithms, the number of clusters must be chosen a priori, together with the initial position of the cluster centres. A squared error criterion measures the sum of the squared distance between each object and the centroid of its corresponding cluster. The K-means algorithm starts with a random initial partition and progresses by reassigning objects to clusters until the desired criteria reach a minimum. Some variants of the K-means algorithm allow the splitting or merging of clusters in order to find the optimum number of clusters, even when starting from an arbitrary initial clustering.
Model-based clustering attempts to find the best fit for the data using a preconceived model. An example of this is the EM or expectation-maximisation algorithm, which assigns each object to a particular cluster according to the probability of membership for that object. In the EM algorithm, the probability function is a multivariate Gaussian distribution that is iteratively adjusted to the data by use of maximum-likelihood estimation. The EM algorithm is considered as an extension of the K-means algorithm since the residual sum of squares used for K-means convergence is similar to the maximum-likelihood criterion.
Density-based (DB) clustering, such as the DBSCAN algorithm, assimilates clusters to regions of high density separated by regions of low or no density. The neighbourhood of each object is examined to determine the number of other objects that fit within a specified radius and a cluster is defined when a sufficient number of objects inhabit this neighbourhood.
Grid-based algorithms, such as STING, divide the data space into a finite number of cells. The distribution of objects within each cell is then computed in terms of mean, variance, minimum, maximum and type of distribution. There are several levels of cells, providing different levels of resolution and each cell of a particular level corresponds to the union of 4 child cells from the lower level.
4 child cells from the lower level. factor analysis (EFA) or fixed-size moving window EFA.
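By way of illustration only, the partitioning approach can be sketched in a few lines of Python; the scikit-learn library, the simulated 2-variable data and the parameter values below are assumptions chosen for demonstration and are not prescribed by this chapter.

```python
# Minimal K-means partitioning sketch (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=1)
# Simulated data: 2 groups of observations in a 2-variable space.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
group_b = rng.normal(loc=[2.0, 2.0], scale=0.3, size=(20, 2))
X = np.vstack([group_a, group_b])

# The number of clusters and the initial centres must be chosen a priori.
model = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

print("cluster labels :", model.labels_)
print("cluster centres:", model.cluster_centers_)
print("sum of squared distances to centroids:", model.inertia_)
```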
2-4-3. Critical aspects

Algorithms are sensitive to the starting conditions used to initialise the clustering of data. For example, K-means needs a pre-set number of clusters and the resultant partitioning will vary according to the chosen number of clusters. The metrics used in distance calculation will also influence data clustering. For Euclidean distances, the K-means algorithm will define spherical clusters whereas they could be ellipsoidal when using Mahalanobis distances. The cluster shape can be modified by data pre-treatments prior to cluster analysis. DB algorithms can deal with arbitrarily shaped clusters, but their weakness is their limitation in handling high-dimensional data, where objects are sparsely distributed among dimensions.

When an object is considered to belong to a cluster with a certain probability, algorithms such as density-based clustering allow a soft or fuzzy clustering. In this case, the border region of 2 adjacent clusters can house some objects belonging to both clusters.

2-4-4. Potential use

Clustering is an exploratory method of analysis that helps in the understanding of data structure by grouping objects that share the same characteristics and, in addition, hierarchical clustering allows for classification within data objects. Clustering is used in a vast variety of fields, in particular for information retrieval from large databases. For the latter, the term ‘data mining’ is frequently used, where the objective is to extract hidden and unexploited information from a large volume of raw data in search of associations, trends and relationships between variables.

2-5. MULTIVARIATE CURVE RESOLUTION

2-5-1. Introduction

Multivariate curve resolution (MCR) is related to principal components analysis (PCA) but, where PCA looks for directions that represent maximum variance and are mutually orthogonal, MCR strives to find contribution profiles (i.e. MCR scores) and pure component profiles (i.e. MCR loadings). MCR is also known as self-modelling curve resolution (SMCR) or end-member extraction. When optimising MCR parameters, the alternating least squares (ALS) algorithm is commonly used.

2-5-2. Principle

MCR-ALS estimates the contribution profiles C and the pure component profiles S from the data matrix X, i.e. X = C∙ST + E, just as in classical least squares (CLS). The difference between CLS and ALS is that ALS is an iterative procedure that can incorporate information that is known about the physicochemical system studied and use this information to constrain the components/factors. For example, neither contribution nor absorbance can be negative by definition. This fact can be used to extract pure component profiles and contributions from a well-behaved data set. There are also other types of constraints that may be used, such as equality, unimodality, closure and mass balance.

It is often possible to obtain an accurate estimation of the pure component spectra or the contribution profiles and these estimates can then be used as initial values in the constrained ALS optimisation. New estimates of the profile matrix S and of the contribution profile C are obtained during each iteration. In addition, the physical and chemical knowledge of the system can be used to verify the result, and the resolved pure component contribution profiles should be explainable using existing knowledge. If the MCR results do not match the known system information, then other constraints may be needed.

2-5-3. Critical aspects

Selection of the correct number of components for the ALS calculations is important for a robust solution and a good estimate can be obtained using, for example, evolving factor analysis (EFA) or fixed-size moving window EFA. Furthermore, the constraints can be set as either ‘hard’ or ‘soft’, where hard constraints are strictly enforced while soft constraints leave room for deviations from the restricted value. Generally, due to inherent ambiguities in the solution obtained, the MCR scores will need to be translated into, for example, the concentration of the active pharmaceutical ingredient, using a simple linear regression step. This means that the actual content must be known for at least 1 sample.

When variations of 2 or more chemical entities are in some way correlated, rank deficiency occurs, for example 1 entity is formed while the other is consumed, or 2 entities are consumed at the same rate to yield a third. As a result, the variation of the individual substance is essentially masked and in such cases, simultaneous analysis of data from independent experiments using varied conditions or combined measurements from 2 measurement techniques generally results in better strategies than analysing the experiments separately one by one.

2-5-4. Potential use

MCR can be applied when the analytical method produces multivariate data for which the response is either linear or linearisable. This has the advantage that only 1 standard is needed per analyte, which is particularly beneficial when the measurements are at least partly selective between analytes.
When linearity and selectivity is an issue, more standards per analyte may be required for calibration. When there is no pure analytical response for an analyte, it is also possible to estimate starting vectors by applying PCA to analyte mixtures together with varimax rotation of the PCA coordinate system. ALS implementations of MCR may also allow analyte profiles that are freely varied by the algorithm, which can then be used to model a profile that is difficult to estimate separately, for example a baseline.
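For information only, the alternating least squares idea behind MCR can be sketched as follows; the simulated mixture data, the crude non-negativity handling by clipping and the fixed number of iterations are illustrative assumptions, and dedicated MCR-ALS software with proper constraint handling would be used in practice.

```python
# Illustrative MCR-ALS sketch: X = C·Sᵀ + E with non-negativity imposed by clipping.
import numpy as np

rng = np.random.default_rng(0)
# Simulated 2-component mixture data: 15 mixtures measured at 50 variables.
S_true = np.abs(rng.normal(size=(50, 2)))          # pure component profiles
C_true = np.abs(rng.normal(size=(15, 2)))          # contributions
X = C_true @ S_true.T + 0.01 * rng.normal(size=(15, 50))

C = np.abs(rng.normal(size=(15, 2)))               # initial estimate of the contributions
for _ in range(200):
    # Given C, estimate S by least squares, then clip to keep the profiles non-negative.
    S = np.linalg.lstsq(C, X, rcond=None)[0].T
    S = np.clip(S, 0, None)
    # Given S, estimate C, again keeping the contributions non-negative.
    C = np.linalg.lstsq(S, X.T, rcond=None)[0].T
    C = np.clip(C, 0, None)

residual = X - C @ S.T
print("lack of fit (%):", 100 * np.linalg.norm(residual) / np.linalg.norm(X))
```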
2-6. MULTIPLE LINEAR REGRESSION

2-6-1. Introduction

Multiple linear regression (MLR) is a classical multivariate method that uses a combined set of x-vectors (X-data matrix) in linear combinations that are fitted as closely as possible to the corresponding single y-vector. MLR extends linear regression to more than 1 selected variable in order to perform a calibration using a least squares fit.

2-6-2. Principle

In MLR, a direct least squares regression is performed between the X- and the Y-data. For the sake of simplicity, the regression of only 1 column vector y will be addressed here, but the method can be readily extended to a Y-matrix, as is common when MLR is applied to data from experimental design (DoE), with multiple responses. In this case, single independent MLR models for each y-variable can be applied to the same X-matrix.

The following MLR model equation is an extension of the normal univariate straight line equation ; it may also contain cross and square terms :

y = b0 + b1x1 + b2x2 + … + bmxm + f

This can be compressed into the convenient matrix form :

y = Xb + f

The objective is to find the vector of regression coefficients b that best minimises the error term f. This is where the least squares criterion is applied to the squared error terms, i.e. to find b-values so that the y-residuals f are minimised. MLR estimates the model coefficients using the following equation :

b = (XTX)-1XTy

This operation involves the matrix inversion of the variance-covariance matrix (XTX)-1. If any of the X-variables show any collinearity with each other, i.e. if the variables are not linearly independent, then the MLR solution will not be robust or a solution may not even be possible.
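The least squares estimation described above may be illustrated with a short numerical sketch; the simulated data and all values are assumptions for demonstration only.

```python
# Sketch of the MLR least squares estimate b = (XᵀX)⁻¹Xᵀy (illustrative values only).
import numpy as np

rng = np.random.default_rng(2)
n, m = 12, 3                             # n samples, m independent X-variables (n > m)
X = rng.normal(size=(n, m))
X = np.column_stack([np.ones(n), X])     # column of ones for the intercept b0
b_true = np.array([0.5, 1.0, -2.0, 0.3])
y = X @ b_true + 0.05 * rng.normal(size=n)

# Normal equations; this inversion fails or becomes unstable for collinear X-variables.
b = np.linalg.inv(X.T @ X) @ X.T @ y
residuals = y - X @ b
print("estimated coefficients:", b)
print("residual sum of squares:", residuals @ residuals)
```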
2-6-3. Critical aspects

MLR requires independent variables in order to adequately explain the data set, but as pharmaceutical samples are comprised of a complex matrix in which components interact to various degrees, the selection of appropriate variables is not straightforward. For example, in ultra-violet spectroscopy, observed absorbance values are linked because they may describe related behaviours in the spectroscopic data set. When observing the spectra of mixtures, collinearity is commonly found among the wavelengths, and consequently, MLR will struggle to perform a usable linear calibration.

The ability to vary the x-variables independently of each other is a crucial requirement when using variables as predictors with this method. This is why in DoE the initial design matrix is generated in such a way as to establish this independence (i.e. orthogonality) from the start. MLR has the following constraints and characteristics :
– the number of X-variables must be smaller than the number of samples (n > m), otherwise the matrix cannot be inverted ;
– in case of collinearity among X-variables, the b-coefficients are not reliable and the model may be unstable ;
– MLR tends to over-fit.

To avoid overfitting, MLR is often used with variable selection. The selection of the optimal number of X-variables can be based on their residual variance, but also on the prediction error.

2-6-4. Potential use

MLR is typically suited to simple matrices/data sets, where there is a high degree of specificity and full rank. As matrices become more complex, more suitable methods such as PLS may be required to provide more accurate and/or robust calibration. In these cases, MLR may be used as a screening technique prior to the application of more advanced calibration methodologies.

2-7. PRINCIPAL COMPONENTS REGRESSION

2-7-1. Introduction

Principal components regression (PCR) is an expansion of principal components analysis (PCA) for use in quantitative applications. It is a two-step procedure whereby the calibration matrix X is first of all transformed by PCA into the scores and loadings matrices T̂ and P̂T respectively. In the following step, the score matrix for the principal components is used as the input for an MLR model to establish the relationship between the X- and the Y-data.

2-7-2. Principle

As in PCA, the calibration matrix is decomposed into scores and loadings matrices in such a way as to minimise the residual matrix that ideally consists only of random errors, i.e. noise. For quantitative calibration, an additional matrix Y with the reference analytical data of the calibration samples is necessary. As the concentration information is contained in the orthogonal score vectors of the T̂-matrix, it can be optimally correlated by multiple linear regression using the actual concentrations in the Y-matrix via the Q̂-matrix (Figure 5.21.-8), while minimising the entries in the residual matrix F̂.

2-7-3. Critical aspects

A crucial point in the development of a model is the selection of the optimal number of principal components. In this respect, the plot of the number of principal components versus the residual Y-variance is an extremely useful diagnostic tool when defining the optimal number of PCs, i.e. when the minimum of the residual Y-variance observed during model assessment has been reached. In most cases, additional PCs beyond this point do not improve the prediction performance but the calibration model falls into overfitting.

Despite its value as an important tool when dealing with collinear X-data, the weakness of PCR lies in its independent decomposition of the X and Y matrices. This approach may take into account variations of the X-data that are not necessarily relevant for an optimal regression with the Y-data. Also, Y-correlated information may even get lost in higher order principal components that are neglected in the above-mentioned selection process of the optimal number of PCs.

A stepwise principal component selection (e.g. selection of PC2 instead of PC1) may be useful to improve the performance of the calibration model.

2-7-4. Potential use

PCR is a multivariate technique with many diagnostic tools for the optimisation of the quantitative calibration models and the detection of erroneous measurements. In spectroscopy for example, PCR provides stable solutions when dealing with the calibration data of either complete spectra or large spectral regions. However, it generally requires more principal components than PLS and in view of the limitations and disadvantages discussed above, PLS regression has become the preferred alternative for quantitative modelling of spectroscopic data.


X = original data matrix of n rows and m columns
T̂ = scores matrix with n rows and p columns
P̂T = loadings matrix with p rows and m columns
Ê = residual matrix (same size as X-matrix)
Y = property matrix of n rows and j columns
Q̂T = correlation matrix of p rows and j columns
F̂ = residual matrix (same size as Y-matrix)
B̂ = matrix of regression coefficients
m = number of data points (variables)
n = number of measurements (samples)
j = number of property values per sample
p = number of principal components (factors)
x = data of unknown sample
ŷ = predicted property values of unknown sample
t̂ = score values for unknown sample

Figure 5.21.-8. – Decomposition of the matrices for principal components regression (PCR)
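As an illustration of the two-step PCR procedure, the following sketch first compresses the X-data by PCA and then regresses the reference values on the resulting scores; the scikit-learn library, the simulated data and the choice of 3 principal components are assumptions made for demonstration only.

```python
# Two-step PCR sketch: PCA compression of X followed by regression of y on the scores.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 100))                          # 30 calibration samples, 100 collinear variables
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=30)    # simulated reference values

pca = PCA(n_components=3)                               # number of PCs normally chosen from the residual Y-variance
T = pca.fit_transform(X)                                # scores matrix (30 x 3)
mlr = LinearRegression().fit(T, y)                      # MLR on the orthogonal scores

# Prediction of an unknown sample: project onto the loadings, then apply the MLR model.
x_unknown = rng.normal(size=(1, 100))
t_unknown = pca.transform(x_unknown)
print("predicted property value:", mlr.predict(t_unknown))
```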

2-8. PARTIAL LEAST SQUARES REGRESSION

2-8-1. Introduction

Partial least squares regression (PLSR, generally known as PLS and alternatively named projection on latent structures) has developed into the most popular algorithm for multivariate regression.

PLS relates 2 data sets (X and Y) irrespective of collinearity. PLS finds latent variables from the X and Y data blocks simultaneously, while maximising the covariance structure between these blocks. In a simple approximation PLS can be viewed as 2 simultaneous PCA analyses applied to the X and Y-data in such a way that the structure of the Y-data is used for the search of the principal components in the X-data. The amount of variance modelled, i.e. the explained part of the data, is maximised for each component. The non-explained part of the data set is made up of residuals, which function as a measure of the modelling quality.

2-8-2. Principle

The major difference between PCR and PLS regression is that the latter is based on the simultaneous decomposition of the X and Y-matrices for the derivation of the components (preferably denoted as PLS factors, factors, or latent variables). Consequently, for the important factors, the information that describes a maximum variation in X, while correlating as much as possible with Y, is collected. This is precisely the information that is most relevant for the prediction of the Y-values of unknown samples. In practice PLS can be applied to either 1 Y-variable only (PLS1), or to the simultaneous calibration of several Y-variables (PLS2 model).

As the detailed PLS algorithms are beyond the scope of this chapter, a simplified overview is instead given (Figure 5.21.-9). Arrows have been included between the T̂ and Û scores matrices in order to symbolise the interaction of their elements in the process of this iteration. While the Y-matrix is decomposed into the loadings and scores matrices Q̂ and Û respectively, the decomposition of the X-matrix produces not only the loadings and scores matrices P̂ and T̂, but also a loading weights matrix Ŵ, which represents the relationship between the X and Y-data.

To connect the Y-matrix with the X-matrix decomposition for the first estimation of the score values, the Y-data are used as a guide for the decomposition of the X-matrix. By interchanging the score values of the T̂ and Û matrices, an interdependent modelling of the X and Y data is achieved, thereby reducing the influence of large X-variations that do not correlate with Y. Furthermore, simpler calibration models with fewer PLS-factors can also be developed where, as is the case for PCR, residual variances are used during validation to determine the optimal number of factors that model useful information and consequently, avoid overfitting.


X = original data matrix of n rows and m columns
Y = property matrix of n rows and j columns
T̂ = scores matrix with n rows and p columns
P̂T = loadings matrix with p rows and m columns
ŴT = loading weights matrix with p rows and m columns
Ê = residual matrix (same size as X-matrix)
Q̂ = loadings matrix of Y-data
Û = scores matrix of Y-data
F̂ = residual matrix (same size as Y-matrix)
B̂ = matrix of regression coefficients
m = number of data points (variables)
n = number of measurements (samples)
j = number of property values per sample
p = number of factors
x = data of unknown sample
ŷ = predicted property values of unknown sample
t̂ = score values for unknown sample

Figure 5.21.-9. – Decomposition of the data matrices for PLS regression
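For information only, the selection of the number of PLS factors from the cross-validated prediction error can be sketched as follows; the scikit-learn implementation, the simulated spectra and the 5-fold cross-validation are illustrative assumptions.

```python
# PLS regression sketch: the prediction error is followed as factors are added.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 200))                          # e.g. 40 spectra, 200 wavelengths
y = 2.0 * X[:, 10] - 1.0 * X[:, 50] + 0.05 * rng.normal(size=40)

# The optimal number of factors is normally chosen where the cross-validated
# prediction error (e.g. RMSEP) reaches its minimum or stops decreasing significantly.
for n_factors in range(1, 6):
    pls = PLSRegression(n_components=n_factors)
    y_cv = cross_val_predict(pls, X, y, cv=5).ravel()
    rmsep = np.sqrt(np.mean((y_cv - y) ** 2))
    print(f"{n_factors} factor(s): RMSEP = {rmsep:.3f}")
```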

2-8-3. Critical aspects

A critical step in PLS is the selection of the number of factors. Selecting too few factors will inadequately explain variability in the training data set, while too many factors will cause overfitting and instability in the resulting calibration (Figure 5.21.-10). The optimal number of factors is estimated during validation of the calibration. Figure 5.21.-10 shows the changes in the calibration error (A) of a model and 2 cases of prediction errors (B, C) according to the number of factors used in the model. The calibration error decreases continuously as the number of factors increases. In case B the prediction error reveals that no minimum can be observed ; however, a minimum is observed in case C. In the absence of a minimum, the number of components can be chosen based on where no significant decrease of the error is observed.

As far as the decision between PLS1 or PLS2 models is concerned, PLS1 modelling is chosen if there is only 1 Y-variable of interest. In cases where there is more than 1 Y-variable of interest, either one PLS2 model or individual PLS1 models for each Y-variable can be calculated. In general, PLS2 is the preferred approach for screening purposes and in cases of highly correlated Y-variables of interest ; otherwise separate PLS1 models for the different Y-variables will yield more satisfactory prediction results.

2-8-4. Potential use

PLS has emerged as a preferable alternative to PCR for quantitative calibration because it incorporates the intervention of the Y-data structure for the decomposition of the calibration X-matrix. Consequently, information from the most important factors is collected and is capable of describing maximal variation in the X-data, while also correlating as closely as possible with the Y-data. In general, this yields simpler models with fewer factors compared to PCR and also provides superior interpretation possibilities and visualisation diagnostics for the optimisation of the calibration performance. In addition, PLS can handle the presence of noise in both X- and Y-data.

PLS discriminant analysis (PLS-DA) is a special case of PLS where the X-matrix is regressed onto a dummy Y-matrix consisting of ones and zeroes, which indicate whether or not a sample belongs to a given class. PLS-DA is used as a semi-quantitative method in, for example, chemical imaging for estimating the pixel components.
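A minimal PLS-DA sketch, in which the X-matrix is regressed onto a dummy Y-matrix of ones and zeroes, is given below for illustration; the library, the simulated classes and the class-assignment rule are assumptions, not requirements.

```python
# PLS-DA sketch: PLS regression of X onto a dummy Y-matrix encoding class membership.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(5)
class_a = rng.normal(loc=0.0, size=(15, 50))
class_b = rng.normal(loc=0.8, size=(15, 50))
X = np.vstack([class_a, class_b])
Y_dummy = np.repeat([[1, 0], [0, 1]], 15, axis=0)     # 1/0 columns encode the 2 classes

pls_da = PLSRegression(n_components=2).fit(X, Y_dummy)
predicted = pls_da.predict(X)
assigned_class = predicted.argmax(axis=1)             # assign each sample to the largest response
print("assigned classes:", assigned_class)
```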


Figure 5.21.-10. – Effects of adding factors to a model; the calibration becomes more accurate, i.e. the residual error diminishes, while the predictive performance of the model could deteriorate as for prediction C. The optimum number of factors reflects a compromise.

2-9. SUPPORT VECTOR MACHINES

2-9-1. Introduction

To achieve classification, multivariate techniques reduce the dimensionality and complexity of the data set. Kernel methods project data into higher dimensional feature spaces.

2-9-2. Principle

Support vector machines (SVMs) project X-data of the training set into a feature space of usually much higher dimension than the original data space. In the feature space a hyperplane (also called decision plane) is computed that separates individual points of known group membership (Figure 5.21.-11). The best discriminating separation is achieved by maximising the margin between groups. The margin is defined by 2 parallel hyperplanes at an equal distance from the decision plane. The optimum position of the decision plane is obtained if the margin is maximal. Points in the feature space that define the margin are called support vectors.

For each training point the distance to the decision plane is computed. In the case of a two-class separation for example, the sign of the distance gives the group membership and the value corresponds to the certainty of classification. During modelling, the distance between the training points and the hyperplane contributes to the weight attributed to the point. Very distant points will have a lesser weight, and to avoid overfitting, distances smaller than a trade-off parameter will not be considered.

For non-separable object groups, overlapping is allowed to a certain extent. So-called slack variables are added to objects, with value 0 if correctly classified and a positive value otherwise. The optimal hyperplane is found by maximising the margin while at the same time allowing for a minimum number of training points to be misclassified (Figure 5.21.-12). The proportion of misclassified points becomes a control parameter during margin maximisation.

In practice SVM computation is extremely complex and would be infeasible without simplification of the optimisation problem. To project X-data into the feature space, the original data is expanded by a set of basis functions. Selecting particular basis functions makes it possible to reformulate the whole optimisation procedure. Only products of the expanded variables remain part of the optimisation procedure and they can be advantageously replaced by a kernel function.

2-9-3. Critical aspects

Numerous algorithms and different types of software can be used to compute SVMs, which may lead to differing results. Optimisation will vary depending on the algorithm used. Control criteria may differ, thus leading either to divergence during the iterations, or to unstable computations sensitive to redundant and uninformative data.

During SVM computation, training points that are well inside their class boundaries have little or even no effect on the position of the decision plane. The latter focusses mainly on points that are difficult to separate and not on objects that are clearly distinct. Thus, SVMs are sensitive to redundant values and atypical points like outliers, for example. As a consequence it may be pertinent to select or screen out specific variables prior to performing SVM. The data should be normalised and standardised to avoid having input data of different scales, which may lead to poor conditions for the boundary optimisation.

The best performing model must be adequately validated and test data that is completely untouched during iterations is required for this purpose. It should be ensured that this data is well balanced in the sense that both easy and difficult samples are equally represented in the training and validation sets.

Figure 5.21.-11. – The object space, where separation of the 2 classes is not possible, is mapped into a feature space where
separation is possible


Figure 5.21.-12. – In the feature space, the separation of classes 1 and 2 was achieved with toleration of certain misclassified
samples
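For illustration only, a soft-margin SVM with a kernel function can be sketched as follows; the radial basis kernel, the scaling step, the value of the trade-off parameter C and the simulated two-class data are assumptions chosen for demonstration.

```python
# SVM classification sketch with a radial basis (kernel) function.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(6)
# Two classes that are not linearly separable in the original space:
# class 1 lies near the origin, class 2 on a surrounding ring.
r1 = 0.5 * rng.random(30); a1 = 2 * np.pi * rng.random(30)
r2 = 1.0 + 0.3 * rng.random(30); a2 = 2 * np.pi * rng.random(30)
X = np.vstack([np.column_stack([r1 * np.cos(a1), r1 * np.sin(a1)]),
               np.column_stack([r2 * np.cos(a2), r2 * np.sin(a2)])])
y = np.array([1] * 30 + [2] * 30)

# Normalisation followed by an SVM with an RBF kernel; C is the trade-off parameter
# controlling the tolerance of misclassified training points (soft margin).
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale")).fit(X, y)
print("number of support vectors per class:", model.named_steps["svc"].n_support_)
print("training accuracy:", model.score(X, y))
```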

2-9-4. Potential use

SVMs are mainly used for binary supervised classification. They can be generalised to multiclass classification or extended to regression problems, though these applications are not considered within the scope of this chapter. Objects that are difficult to classify, rather than those that are clearly distinct, drive the optimisation process in SVMs. SVMs can be used for the separation of classes of objects, but not for the identification of these objects. They operate well on large data sets that are obtained, for example, by NIR spectroscopy, magnetic resonance, chemical imaging or process data mining, where PCA and related methods fail. Their strength mainly lies in the separation of samples featuring highly correlated signals, i.e. polymorphs, excipients, tracing of adulterated substances, counterfeits etc.

2-10. ARTIFICIAL NEURAL NETWORKS

2-10-1. Introduction

Artificial neural networks (ANNs) are general computational tools, whose initial development was inspired by the need for further understanding of biological neural networks and which have since been widely used in various areas that require data processing with computers or machines. The methods for building ANN models and their subsequent applications can be dramatically different depending on the architecture of the neural networks themselves. In the field of chemometrics, ANNs are generally used for multivariate calibration and unsupervised classification, which is achieved by using multi-layer feed-forward (MLFF) neural networks or self-organising maps (SOM) respectively. As a multivariate calibration tool, ANNs are more generally associated with the mapping of non-linear relationships.

2-10-2. Principle

2-10-2-1. General

The basic data processing element in an artificial neural network is the artificial neuron, which can be understood as a mathematical function that uses the sum of a weighted vector and a bias as the input. The vector is the ‘input’ of the neuron and is obtained either directly from a sample in the data set or calculated from previous neurons. The user chooses the form of the function (called the transfer function). The weights and bias are the coefficients of the ANN model and are determined through a learning process using known examples. An ANN often contains many neurons arranged in layers, where the neurons in each layer are arranged in parallel. They are connected to neurons in the preceding layer from which they receive inputs and also to neurons in the following layer where the outputs are sent (Figure 5.21.-13). The output of 1 neuron is therefore used as the input for neurons in the following layer. The input layer is a special layer that receives data directly from the user and sends this information directly to the next layer without applying a transfer function. The output layer is similar in that its output is also directly used as the model output without any additional processing. The unlimited possibilities when connecting different numbers and layers of neurons, often called an ANN architecture, provide the potential for ANNs to meet any complicated data modelling requirements.

Figure 5.21.-13. – Typical arrangements of neuron layers and their inter-connections


2-10-2-2. Multi-layer feed-forward artificial neural network

A multi-layer feed-forward network (MLFF ANN) contains an input layer, an output layer and 1 or more layers of neurons in-between called hidden layers. Even though there is no limit on how many hidden layers may be included, an MLFF ANN with only 1 hidden layer is sufficiently capable of handling most multivariate calibration tasks in chemometrics. In an MLFF ANN, each neuron is fully connected to all the neurons in the neighbouring layers. A hyperbolic tangent sigmoid transfer function is usually used in an MLFF ANN, but other transfer functions, including linear functions, can also be used. The initial weights and biases can be set as small random numbers, but can also be initialised using other algorithms. The most popular training algorithm for determining the final weights and biases is the back-propagation (BP) algorithm or its related variants. In the BP algorithm, the prediction error, calculated as the difference between the ANN output and the actual value, is propagated backward to calculate the changes needed to adjust the weights and biases in order to minimise the prediction error.

An MLFF ANN must be optimised in order to achieve acceptable performance. This often involves a number of considerations including the number of layers, the number of neurons in each layer, transfer functions for each layer or neuron, initialisation of weights, learning rate, etc.

2-10-2-3. Self-organising map

The aim of the self-organising map (SOM) is to create a map where observations that are close to each other have more similar properties than more distant observations. The neurons in the output layer are usually arranged in a two-dimensional map, where each neuron can be represented as a square or a hexagon. SOMs are trained using competitive learning, which is different from the above-described method using BP. The final trained SOM is represented as a two-dimensional map of properties.

2-10-3. Critical aspects

The 2 most common pitfalls of using ANNs are over-training and under-training. Over-training means that an ANN model can predict the training set very well but ultimately fails to make good predictions. Under-training means that the ANN training ended too soon and therefore the resultant ANN model underperforms when making predictions. Both of these pitfalls should be avoided when using ANNs for calibration. A representative data set with a proper size, i.e. more observations or samples than variables, is required before a good ANN model can be trained. Generally, since the models are non-linear, more observations are needed than for a comparable data set subjected to linear modelling. As for other multivariate calibration methods, the input may need pre-processing to balance the relative influence of variables. One advantage of pre-processing is the reduction in the number of degrees of freedom of input to the ANN, for example by compression of the X-data to scores by PCA and then using the resulting scores for the observations as input.

2-10-4. Potential use

The advantage of the MLFF ANN in multivariate calibration lies in its ability to model non-linear relationships. Since the neurons are fully connected, all the interactions between variables are automatically considered. It has been proven that an MLFF ANN with sufficient hidden neurons can map any complicated relationship between the inputs and outputs.

SOMs can be used to visualise high-dimensional data while preserving the topology in the original data. They are based on unsupervised learning, and are mainly useful as tools to explore features in data sets where no prior knowledge of the patterns and relationships of the samples exists.

ANNs often have a large number of coefficients (weights and biases) that give the ANN the potential to model any complicated relationships in the data set but, as a result, can also make the interpretation of the coefficients more difficult. However, when linear modelling methods are not flexible enough to provide the required prediction or classification accuracy, ANNs may be a good alternative.
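For information only, a 1-hidden-layer feed-forward network for a non-linear calibration can be sketched as follows; the scikit-learn implementation, the network size, the transfer function and the training settings are illustrative assumptions.

```python
# Sketch of a 1-hidden-layer feed-forward network for non-linear calibration.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(200, 5))                                       # more samples than variables
y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + 0.05 * rng.normal(size=200)    # non-linear response

X_scaled = StandardScaler().fit_transform(X)            # balance the influence of the variables
ann = MLPRegressor(hidden_layer_sizes=(8,), activation="tanh",
                   max_iter=5000, random_state=7).fit(X_scaled, y)
print("R2 on training data:", ann.score(X_scaled, y))
```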
3. GLOSSARY

β-distribution : continuous probability distribution with density function

f(x) = x^(u−1)·(1−x)^(v−1)/B(u,v), for 0 ≤ x ≤ 1,

where u > 0, v > 0 are shape parameters (degrees of freedom) and B is the beta function,

B(u,v) = ∫0^1 t^(u−1)·(1−t)^(v−1) dt

The γ-quantile of the Β(u,v)-distribution is denoted by βu,v ;γ and it is the value q such that the value of the distribution function F is γ :

F(q) = γ

Bootstrapping : a number of sample sets of size n produced from an original sample set of the same size n by means of a random selection of samples with replacement.

Centring : a data set is mean centred by calculating the mean value of each variable and subtracting the variable mean values from each column of variables, in order to make the comparison and interpretation of the data easier.

Collinear/non-collinear : a family of vectors is collinear if at least 1 of the vectors can be represented as a linear combination of the other vectors. Hence a family of vectors is non-collinear if none of the vectors can be represented as a linear combination of the others.

Component (or factor, latent variable) : in chemometrics, an underlying, non-observed, non-measured, hypothetical variable that contributes to the variance of a collection of measured variables. The variables are linear combinations of the factors and these factors are assumed to be uncorrelated with each other.

Data mining : process of exploration, extraction and modelling of large collections of data in order to discover a priori unknown relationships or patterns.

Dependent variable : also called a response or regressand ; a variable that is related by a formal (explicit) or empirical mathematical relationship to 1 or more other variables (typically the Y-data).

Empirical model : a data-driven model established without assuming an explicit mathematical relationship, or without a description of the behaviour of a system based on accepted laws of physics.

Exploratory data analysis : the process for uncovering unexpected or latent patterns in order to build future hypotheses.

Factor : see component.

Hotelling T2 statistic : multivariate version of the t-statistic. In general, this statistic can be used to test if the mean vector of a multivariate data set has a certain value or to compare the means of the variables. The T2 statistic is also used for detection of outliers in multivariate data sets. A multivariate statistical test using the Hotelling T2 statistic can be performed. A confidence ellipse can be included in score plots to reveal points outside the ellipse as potential outliers.

Independent variable : input variable on which other variables are dependent through a mathematical relationship (typically the X-data).


Indirect prediction : process for estimating the value of a response on the basis of a multivariate model and observed data.

Interference : effect of substances, physical phenomena or instrument artefacts, separate from the target analyte, that can be measured by the chosen analytical method. There is then a risk of confusion between the analyte and the interference if the interference is not varied independently or at least randomly in relation to the analyte.

Latent variable : see component.

Leave-one-out : in a ‘leave-one-out’ procedure only 1 sample at a time is removed from the data set in order to create a new data set.

Leave-subset-out : in a ‘leave-subset-out’ procedure a subset of samples is removed from the data set in order to create a new data set.

Leverage : a measure of how extreme a data point or a variable is compared to the majority. Points or variables with a high leverage are likely to have a large influence on the model.

Loadings : loadings are estimated when information carried by several variables is focussed onto a few components. Each variable has a loading alongside each model component. The loadings show how well a variable is taken into account by the model components.

Orthogonal : 2 vectors are orthogonal if their scalar product is 0.

Orthonormal vectors : orthogonal and normalised (unit-length) vectors.

Outlier : for a numerical data set, a value statistically different from the rest. Also refers to the sample associated with that value. Specific statistical testing for outliers may be used.

Overfitting : for a model, overfitting is a tendency to describe too much of the variation in the data, so that in addition to the consistent underlying structure, some noise or non-informative variation is also taken into account and unreliable predictions will be obtained.

Property : see variable.

Resampling : the process of impartial rearrangement and sub-sampling of the original data set. This occurs during optimisation/validation procedures that repeatedly calculate a property and the error associated with it. Typical examples are cross-validation and bootstrapping, which create successive evaluation data sets by repeated sub-sampling.

Reselection : reuse of samples (see resampling).

Residuals : a measure of the variation that is not taken into account by the model, or a deviation between predicted and reference values.

Root mean square error of prediction : a function of the predictive residual sum of squares to estimate the accuracy :

RMSEP = √( Σi (ŷi − yi)² / n )

where ŷi is the predicted response for the ith sample of the test data set and yi the observed response of the ith sample, and n is the number of samples.

Sample : object, observation, or individual from which data values are collected.

Sample attribute : qualitative or quantitative property of the sample.

Sample selection : the process of drawing a subset or a collection from a population in order to estimate the properties of the population.

Scores or factor score coefficients : coordinates of the samples in the new coordinate system defined by the principal components. Scores represent how samples are related to each other given the measurement variables.

Score (normalised) : jth score value ti,j of the ith sample divided by the norm of the scores matrix, where p is the number of parameters in the model.

Standard error of calibration : a function of the predictive residual sum of squares to estimate the accuracy considering the number of parameters :

SEC = √( Σi (ŷi − yi)² / (n − p) )

where n is the number of samples of the learning set, p the number of parameters in the model to be estimated by using the sample data, ŷi the ith fitted value in the calibration, and yi the ith reference value. In multiple regression with m variables p = m + 1 (1 coefficient for each of the m variables and 1 intercept).

Standard error of laboratory : refers to the intermediate precision or reproducibility, whichever is applicable.

Supervised : refers to modelling data labelled by classes or values.

Unsupervised (non-supervised) : refers to exploring data without prior assumptions.

Underfitting : the reverse of overfitting.

Variable : property of a sample that can be assessed (attribute, descriptor, feature, property, characteristics).

Varimax rotation : orthogonal analytical rotation of factors that maximises the variance of squared factor loadings, thereby increasing the large factor loadings and large eigenvalues and decreasing the small ones in each factor.

4. ABBREVIATIONS

ALS alternating least squares
ANN artificial neural network
BP back-propagation
CLS classical least squares
DB density-based
DBSCAN density-based spatial clustering of applications with noise
DoE design of experiments
EFA evolving factor analysis
EM expectation maximisation
LDA linear discriminant analysis
MCR multivariate curve resolution
MLFF multi-layer feed-forward
MLR multiple linear regression
MSPC multivariate statistical process control
NIR near infrared


PAT process analytical technology
PC principal component
PCA principal components analysis
PCR principal components regression
PLS partial least squares regression
PLS-DA partial least squares discriminant analysis
PLSR partial least squares regression
QbD quality by design
QDA quadratic discriminant analysis
RMSEP root mean square error of prediction
SEC standard error of calibration
SEL standard error of laboratory
SIMCA soft independent modelling of class analogy
SMCR self-modelling curve resolution
SNV standard normal variate
SOM self-organising map
STING statistical information grid
SVM support vector machine

