Quality Analysis With PCA and PLS - MTK337

B: PCA/PLS quality analysis

Tutorial B: Quality analysis with PCA and PLS

Description
Main learning outcomes
Data table
Preparing the data
Insert category variables
Check column (variable) sets
Define sample sets from category variable column
Objective 1: Find the main sensory qualities
Make a PCA model
Interpret the variance plot in the PCA overview
Interpretation of the scores plot for the PCA
Interpretation of the correlation loadings plot
Interpretation of scores and loadings
Interpretation of the influence plot
Objective 2: Explore the relationships between instrumental/chemical data (X) and sensory data (Y)
Make a PLS regression model
Interpretation of the variance plot
Interpretation of the scores plot
Interpretation of the loadings and loading weights plot
Interpretation of the predicted vs. reference plot
Objective 3: Predict user preference from sensory measurements
Make a PLS regression model for preference
Interpretation of the regression overview
Interpretation of the regression coefficients
Open result matrices in the Editor
Predict preference for new samples
Interpretation of Predicted with Deviation
Check the error in original units – RMSE

Description
This tutorial aims to use multivariate techniques to analyze the quality of raspberry jam in order to
determine which sensory attributes are relevant to “perceived quality”. The analysis will cover three
aspects as follows.

1. A trained tasting panel has provided scores for a number of different variables using descriptive
sensory analysis. In this tutorial the first objective is to find the main sensory quality properties
relevant for raspberry jam.

2. The second objective is to find a way of rationalizing quality control, since the use of taste
panels is very costly. In this application a number of laboratory instrumental measurements
were investigated to potentially replace the sensory testing panel.

3. The third and final objective of this application is to be able to predict consumer preference for
raspberry jam from descriptive sensory analysis. PLS regression modeling was investigated in
order to find a possible relationship between sensory data and preference.

Main learning outcomes


This tutorial contains the following parts and learning objectives:

• Explore methods for inserting category variables.
• Define ranges in data sets.
• Investigate the relationships existing in a single data table by the use of PCA.
• Interpret scores and loadings of the PCA and draw relevant conclusions.
• Run a PLS regression for understanding the relationships between two data tables.
• Export models developed within The Unscrambler® to use with other applications.
• Predict response values from new samples.
• Estimate regression coefficients and interpret them.
• Find optimal number of components or factors in multivariate models.

References:

• Basic principles in using The Unscrambler®
• PCA Analysis
• About Regression methods
• Exporting data from The Unscrambler®
• Prediction

Data table
Click the following link to import the Tutorial B data set used in this tutorial.

The analysis is based on 12 samples of jam (objects), selected to span the expected, normal quality
variations inherent in such products. Several observations and measurements were made on the
samples.

Agronomic production variables

The samples were taken from four different cultivars, at three different harvesting times. The table
below describes the sampling plan for this analysis.

Sample description

No  Name   Cultivar  Harvest time     No  Name   Cultivar  Harvest time
1   C1-H1  1         1                7   C3-H1  3         1
2   C1-H2  1         2                8   C3-H2  3         2
3   C1-H3  1         3                9   C3-H3  3         3
4   C2-H1  2         1                10  C4-H1  4         1
5   C2-H2  2         2                11  C4-H2  4         2
6   C2-H3  2         3                12  C4-H3  4         3

Note that the agronomic production variables are not used as input variables in any of the matrices.
These represent known information which may be extremely valuable for the interpretation of the
results of the data analysis. They will be utilized as category variables in the analyses performed in
this tutorial.

Column (variable) set Instrumental

Three chemical variables and three instrumental (APHA colorimetry) variables were also measured on
the samples tested by the sensory panel. These are described in the table below.

Instrumental variables

No  Name        Method
1   L           Lightness
2   a           Green-red axis
3   b           Blue-yellow axis
4   Absorbance  Absorbance
5   Soluble     Soluble solids (%)
6   Acidity     Titratable acidity (%)

Column (variable) set “Sensory”

A trained sensory panel evaluated 12 different sensory attributes of the raspberries used to make the
jam, using a 1-9 point intensity scale. The entries in the data matrix are the average ratings over all
judges. The observed variables are listed in the table below.

Sensory variables

No  Name        Type
1   Redness     Redness
2   Colour      Color intensity
3   Shininess   Shininess
4   R.Smell     Raspberry smell
5   R.Flav      Raspberry flavor
6   Sweetness   Sweetness
7   Sourness    Sourness
8   Bitterness  Bitterness
9   Off-flav    Off-flavor
10  Juiciness   Juiciness
11  Thickness   Viscosity/thickness
12  Chew.res    Chewing resistance

Column (variable) set Preference


114 representative consumers were invited to taste the 12 jam samples used in this application. They
each provided an individual preference score on a scale from 1-9. The average over all consumers for
each sample is provided in the data table.

Row (sample) sets

The data table, “JAMdemo”, consists of 20 samples. The first twelve samples will be used to develop
the models in this application and are hereafter referred to as training samples.

Eight new jam samples were assessed by the trained panel and given a sensory rating. These are the
last eight samples in the table and are referred to as Prediction samples. The preference and
instrumental values are missing for these samples because those measurements were not performed.
The calibration model will be used to predict the preference for these eight samples.

Preparing the data


Insert category variables
Category variables are useful for interpreting patterns in data sets. Here, the raspberries used to make
the jam samples originated from different cultivars and were harvested at different times. These
parameters represent excellent candidates for using category variables in an analysis.

Task

Insert two category variables, Cultivar and Harvest Time.

How to do it

The data table, opened by following the link above, is already organized into two row sets for
training and prediction. The different types of variables have been defined in the column sets
as Instrumental, Sensory and Preference, based on the definitions in the data tables above. These
defined sets can be seen by expanding the folders in the project navigator.

Jam data organization


Some additional information about the cultivar and harvest time now needs to be added to this data
as two new columns.

To select a column, click on the header cell containing the column number. Activate the first column
of the table, right mouse click and select Insert - Category Variable or use the menu options and
select Edit - Insert - Category variable.

Highlight column to activate insert options

In the dialog box, enter the category variable name “Harvest Time”. Keep the default option Specify
the level manually selected.


Enter the level names: “H1”, “H2” and “H3” followed by a click on Add.

Click OK.

In the new column, click in each cell and select the appropriate value for each sample as given in the
sample names.

Note: Category variable cells are orange in the editor to distinguish them
from ordinary variables.

Add a second column in the same way, after highlighting the first column: Edit - Insert - Category
Variable. In the dialog box, enter the category variable name “Cultivar”.

Keep the default option Specify the level manually selected.

Enter the level names: “C1”, “C2”, “C3”, and “C4” followed by a click on Add.


Click OK.

In the new column, double click in each cell and select the appropriate value for each sample as given
in the sample names. Alternatively, select all cells of each cultivar in sequence and fill in the category
level using the right-click Fill function.

The Tutorial_b data table displayed in the Editor (after insertion of Cultivar and Harvest Time)

Check column (variable) sets


In The Unscrambler®, matrices are defined by Row and Column (Sample and Variable) Sets. It is good
practice to define all sets before any analyses are performed. The information entered to organize
the data can later be used to color-code graphics according to these sample groups.

Task

Check that the three column (Variable) Sets: “Instrumental”, “Sensory” and “Preference” have been
defined.

Verify the existence of two sample sets “Training” samples and “Prediction” samples. These sets can
be visualized in the project navigator.

How to do it

To create column and row ranges, select Edit - Define Range to open the Define Range dialog.

Three sets have been predefined in the project Tutorial_B data set.

1. Column name: Instrumental, Interval: 3-8
2. Column name: Preference, Interval: 14
3. Column name: Sensory, Interval: 9-13, 15-21

To verify these definitions use the Edit - Define range and inspect the information in this dialog.

The Define range dialog with three column sets


Verify also the row sets:

1. Row Name: Training, Interval: 1-12


2. Row Name: Prediction, Interval: 13-20

Exit from the Define Range dialog box by clicking Cancel.

Define sample sets from category variable column


Task

Additional row sets will be added for the various levels of the category variables harvest time and
cultivar.

How to do it

Begin by selecting the column “Cultivar” in the data editor, and select Edit - Group Rows…, which will
open the Create row ranges from column dialog.

Edit - Group rows…

The column that was selected, “Cultivar”, is already in the Cols field.


There is no need to specify the Number of Groups as it is based on a category variable.

Create row ranges from column

Click OK.

Four row ranges have been added automatically. Look in the Row folder to see them:

New row ranges

Do the same for the variable “Harvest time”.
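
As a rough illustration of what grouping rows by a category column produces, the sketch below derives
row sets from the level of each training sample. It is plain Python, not The Unscrambler's own code;
the function name group_rows is hypothetical, and the sample names are the ones listed in the sample
description table.

```python
from collections import defaultdict

# Sample names of the 12 training rows, in table order (C1-H1 ... C4-H3).
names = [f"C{c}-H{h}" for c in range(1, 5) for h in range(1, 4)]

def group_rows(levels):
    """Map each category level to the 1-based row numbers that carry it."""
    groups = defaultdict(list)
    for row, level in enumerate(levels, start=1):
        groups[level].append(row)
    return dict(groups)

# The category levels are encoded in the sample names: "C1-H2" -> C1, H2.
cultivar = [n.split("-")[0] for n in names]
harvest = [n.split("-")[1] for n in names]

cultivar_sets = group_rows(cultivar)   # four row sets, one per cultivar
harvest_sets = group_rows(harvest)     # three row sets, one per harvest time
print(cultivar_sets)
print(harvest_sets)
```

Each cultivar set holds three consecutive rows, while each harvest time set holds every third row.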

Objective 1: Find the main sensory qualities


The main variations in the sensory measurements may be found by decomposing them by Principal
Component Analysis (PCA). This data decomposition results in valuable graphical diagnostic tools
including scores, loadings and residuals. The results will be interpreted in order to establish whether
sensory measurements made on the jam samples have any practical meaning.

Make a PCA model


Task

Make a PCA model using the column set “Sensory” as the variable set.


How to do it

Select Tasks – Analyze - Principal Component Analysis… Specify the following parameters in the
dialog box:

Model inputs
• Data matrix: “JAMdemo” (20x21)
• Rows: Training (12)
• Cols: Sensory (12)
• Maximum components: 6
Check the Identify outliers and Mean center data boxes if they are not already selected.

Principal Component Analysis dialog: Model inputs

Weights
From the Weights tab verify that the weights are all 1.0 (constant).
No weighting is used in this model as the sensory panel is known to be well trained.
However, sensory variables are often weighted when there is evidence that the panel is
not well trained, or when investigating relationships with other variables. The most
common weighting to use is 1/SDev.
Weights tab dialog


Validation
From the Validation tab select the option Cross Validation and press Setup which
opens the Cross Validation Setup dialog. Here select Full from the drop-down list for
cross validation method.
Validation Dialog


This validation method is more time consuming than other options, but the estimate of the residual
variance is more reliable.

Ensure that NIPALS is selected in the algorithm pane. Otherwise, the second principal component
might change direction. This has no impact on the drawn conclusions, but might be confusing for
beginners.
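
For readers curious about what the NIPALS option does, the following is a minimal sketch of the
algorithm in NumPy. It is not The Unscrambler's implementation: the function name, tolerance and
starting vector are assumptions, and the data are synthetic. The sign of each component is arbitrary,
since (t, p) and (-t, -p) describe the same model, which is why a different algorithm may return a
component pointing in the opposite direction.

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-12, max_iter=1000):
    """Extract principal components one at a time with NIPALS."""
    E = X - X.mean(axis=0)              # mean center the data
    scores, loadings = [], []
    for _ in range(n_components):
        t = E[:, np.argmax(E.var(axis=0))].copy()   # start from the widest column
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)       # project columns onto the scores
            p /= np.linalg.norm(p)      # loadings have unit length
            t_new = E @ p               # project rows onto the loadings
            converged = np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        E = E - np.outer(t, p)          # deflate: remove the explained part
        scores.append(t)
        loadings.append(p)
    return np.column_stack(scores), np.column_stack(loadings)

# Synthetic stand-in for a 12 x 12 sensory table (not the tutorial data).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3)) @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(12, 12))
T, P = nipals_pca(X, n_components=3)
print(T.shape, P.shape)
```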

Click OK to start the PCA. When the PCA is completed, the program asks, “Do you want to view plots
of model PCA now?”. Click Yes to see the PCA Overview plots. A new node has been added to the
project navigator containing all the PCA result matrices and plots.

Interpret the variance plot in the PCA overview


Task

Determine the optimal number of PCs.

How to do it

The PCA Overview contains the most commonly used plots for interpreting PCA models, including:

• Scores plot.
• Loadings plot.
• Influence plot.
• Explained/Residual Variance plot.

PCA Overview plots

The scores plot is a map of the samples, and shows how they are distributed. It can be used to isolate
samples that are similar, or dissimilar to one another. In this analysis, the plot labels show that PC-1
explains 58% and PC-2 28% of the total variance in the data. The explained variance curve (in the
lower right corner) is an excellent tool for selecting the optimal number of components in the model.

The explained variance increases until PC 5 is reached. The software does suggest the optimal
number of PCs for a model, but it is up to the user to analyze the data and confirm the optimal
number of PCs in this model, usually based on this plot.

The highest explained variance is found with 5 PCs, but a model with 3 PCs explains nearly as much
of the variation. A simple (parsimonious) model is usually more robust than a complex one, and
easier to interpret, so it is advisable to work with as few PCs as possible. The info box in the
lower left corner of the main workspace indicates that 3 PCs are considered optimal for this model.

Info Box
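
The idea of preferring the smallest model that explains nearly as much as the best one can be sketched
as follows. This is only an illustration on synthetic data: it uses calibration variance from an SVD,
and the 5-point tolerance in fewest_components is a hypothetical rule of thumb, not the criterion the
software applies when it suggests the optimal number of PCs.

```python
import numpy as np

def explained_variance(X):
    """Cumulative explained calibration variance (%) per number of PCs."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values
    return 100.0 * np.cumsum(s**2) / np.sum(s**2)

def fewest_components(cumulative, tolerance=5.0):
    """Smallest number of PCs within `tolerance` points of the maximum."""
    target = cumulative.max() - tolerance
    return int(np.argmax(cumulative >= target)) + 1

# Synthetic 12 x 12 table with three underlying factors (not the tutorial data).
rng = np.random.default_rng(1)
X = rng.normal(size=(12, 3)) @ rng.normal(size=(3, 12)) + 0.05 * rng.normal(size=(12, 12))
cum = explained_variance(X)
print(np.round(cum, 1), fewest_components(cum))
```

The residual variance is simply 100 minus these percentages, so both views carry the same information.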


Task

Change the explained variance plot to a residual variance plot.

How to do it

Activate the lower right plot by clicking in it. Toggle between the Explained / Residual buttons in
the toolbar shortcuts.

The explained variance is now converted to residual variance. The information is the same, but
presented in another way. The residual variance is well suited to finding the optimal number of PCs to
use in a model, while the explained variance is a better measure for explaining how much of the
variation is described by the model. The plot layout can be changed to a bar chart by using the plot
layout shortcut.

The PCA Explained Variance Bar plot


The model with 3 PCs describes 92% of the total validation variance in the data; for calibration it is
96%. These values may be obtained by clicking on the specific data point in the plot.

Use the toolbar buttons to change between having only the calibration or validation variance
curve plotted, or both.

Interpretation of the scores plot for the PCA


The scores plot, which is a map of samples, displays information about the sample relationships for a
particular data set.

Task

Interpret Scores plot. Use different plot options for ease of interpretation.

How to do it

The scores plot shows the projected locations of the samples onto the calculated PCs. By studying
patterns in the samples a meaningful interpretation of the PCs may be possible.

PCA Scores plot


The scores plot for this analysis indicates that the 12 samples are not arranged in a random way. By
moving from left to right along this plot, a pattern can be observed where samples harvested at time
H1 are mainly found on the left. These then change to H2 and finally H3. Moreover, moving from the
top to the bottom, C1 samples occupy the top region, followed by C2, then C3, and finally C4.

The row sets based on the category variables that were inserted into the data table can be used to
better visualize these trends.

In the scores plot, right-click and select Sample Grouping to open the dialog where different
row sets can be used for grouping and color-coding the plot.

Select the Value of variable and the Cultivar category variable. In Labels, select Name to display the
real name of each sample.

The marker color, shape and size can be customized here for optimized viewing of the data.

Sample Grouping Dialog


When the desired settings have been defined, click OK to complete the operation.

PCA Scores with Sample Grouping

Repeat the above sample grouping process, this time using the category variable Harvest Time.


Interpretation of the correlation loadings plot


The loadings plot, which is a map of the variables, displays information about the variables analyzed
in the PCA model. Correlation Loadings provide a scale independent assessment of the variables and
may, in some cases, provide a clearer indication of variable correlations.

Task

Interpret variable relationships in the correlation loadings plot.

How to do it

Activate the X-Loadings plot by clicking in it, then use the corresponding shortcut button to make
it the correlation loadings plot.

The Correlation Loadings plot may be used to study the variable correlations that exist in a particular
data set.

Correlation Loadings plot

The plot shows that two variables (Redness and Colour) lie at the far right of the plot along PC1.
They are close to each other (i.e. they are highly positively correlated), far from the center, and
very close to the edge of the 100% explained variance ellipse. This also means that samples lying to
the right of the scores plot have higher values for those two variables.

Along the vertical axis (PC2), two variables, R.Smell and R.Flav, have high negative values. They lie
opposite the variable Off-flav, which has high positive values for this PC. This indicates that
raspberry smell and flavor correlate positively with each other, and negatively with off-flavor.
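
Correlation loadings can be understood as the correlation between each variable and each score
vector. The sketch below computes them with NumPy on synthetic data; the function name is
hypothetical and an SVD stands in for the software's PCA. The squared distance of a variable from
the origin equals the fraction of its variance explained by the plotted PCs, which is what the
explained variance ellipses mark.

```python
import numpy as np

def correlation_loadings(X, n_components=2):
    """Correlation between each variable and each PC score vector."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]        # scores
    # Centered columns and scores both have zero mean, so the correlation
    # is the normalized inner product.
    num = Xc.T @ T
    den = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(T, axis=0))
    return num / den                                  # shape: variables x PCs

# Synthetic data (not the tutorial data) with two nearly identical variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(12, 5))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=12)
R = correlation_loadings(X)
explained = (R**2).sum(axis=1)    # fraction of each variable explained by PC1 + PC2
print(np.round(R, 2))
print(np.round(explained, 2))
```

Variables that are highly correlated with each other, like the first two columns here, end up close
together in this plot.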

Interpretation of scores and loadings


Task

Relate Scores (samples) information to Loadings (variables) information.

How to do it

The Scores plot and Correlation Loadings plot show that samples C2-H3 and C1-H3 have high color and
redness intensities, while sample C1-H2 is more likely to have an off-flavor character. In general,
samples located in a given region of a two-vector scores plot have high values for the variables
located in the same region of the corresponding loadings plot, provided that the plotted PCs describe
a large proportion of the variance.

PC 3 describes the variation in sweetness, bitterness and chewing resistance. Confirm this by
activating the loadings plot (upper right quadrant) and selecting Plot - Loadings. Display PC 1 vs. PC 3
by changing Vector 2 using the arrows in the toolbar.

PCA Loadings 1 vs. 3

In this new plot, the horizontal axis is unchanged (PC1) and the vertical axis now shows PC3.

Interpretation of the influence plot


Task

Interpret the influence plot, which is used for the detection of outliers.

How to do it

The influence plot is displayed in the lower left quadrant of the PCA Overview. The strongest outliers
are placed in the upper right corner of the plot, i.e. they have a large leverage and a high residual
variance. In the current analysis, there is no evidence of outliers.
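
The two axes of an influence plot can be sketched as follows. The leverage formula is the standard one
for a model with scores T; dividing the squared residuals by the number of variables is a
simplification (the software may apply degrees-of-freedom corrections), and the "twice the mean"
screening rule is a hypothetical threshold used only for illustration. The data are synthetic.

```python
import numpy as np

def influence(X, n_components):
    """Leverage and residual variance per sample for a PCA model."""
    n, k = X.shape
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]        # scores
    P = Vt[:n_components].T                           # loadings
    # Leverage: distance to the model center measured inside the model space.
    leverage = 1.0 / n + np.sum(T**2 / np.sum(T**2, axis=0), axis=1)
    # Residual variance: what the model leaves unexplained for each sample.
    E = Xc - T @ P.T
    residual_variance = np.sum(E**2, axis=1) / k
    return leverage, residual_variance

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 12))
X[0] += 6.0 * rng.normal(size=12)     # make the first sample deliberately unusual
lev, res = influence(X, n_components=3)
# Candidate outliers combine high leverage with high residual variance.
suspects = np.where((lev > 2 * lev.mean()) & (res > 2 * res.mean()))[0]
print(np.round(lev, 2), np.round(res, 2), suspects)
```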


PCA Influence plot

All of the results for the PCA are now part of the project Tutorial_B. Save the project to capture the
PCA results. The next steps in this tutorial will make use of the sensory, instrumental and preference
data.

Close the PCA overview by selecting its name in the navigation bar at the bottom of the viewer and
right clicking to select Close.

Objective 2: Explore the relationships between instrumental/chemical data (X) and sensory data (Y)

Is it possible to predict the quality variations observed in the jam data by using instrumental
measurements only? Training and employing a sensory panel is costly and time consuming. Producers
of jam would find it most convenient if they could predict quality variations by measuring some
properties by instrumental means. The next task in this tutorial is to make a regression model
between the sensory and instrumental data and analyze the results for a possible solution.

Make a PLS regression model


In The Unscrambler® the regression between two matrices can be performed using a number of
common multivariate methods. Partial Least Squares (PLS) regression is used in this case in order to
maximize the information obtained from both X and Y.

Task

Make a PLS regression model that predicts the variations in sensory variables from instrumental and
chemical variables.

How to do it

Select Tasks - Analyze - Partial Least Squares Regression…. Specify the following parameters in the
Regression dialog:

Partial Least Squares Model Inputs

Model inputs tab
Predictors
• Rows: Training (12)
• Cols/X-variables: Instrumental (6)
Responses
• Rows: Training (12)
• Cols/Y-variables: Sensory (12)
Maximum components: 6
X and Y weights tabs
Select the X and Y Weights tabs to access their dialogs. Weighting will be applied to all
the X and Y variables for regression purposes.
X Weights Dialog


Press All to change the weighting of all variables at the same time. Variables can also
be selected by clicking on them in the list. Remember to hold the Ctrl key down while
selecting several variables. Choose the A / (SDev +B) button with the constants A = 1
and B = 0. Ensure that the weights change in the list.
All variables are weighted by dividing them by their own standard deviations. This
allows all variables to contribute to the model, regardless of whether they have a small
or large standard deviation from the outset; only the systematic variation is of interest
here.
Now go to the Y Weights tab and do the same. Do not click Finish, but go to the
Validation tab.
Validation tab
Select Cross validation from the Validation tab.
Press the Setup button to access the Cross Validation Setup dialog and choose Full
from the drop-down list. It is always recommended to use test set or cross validation
to develop final models. Ensure that NIPALS is selected in the algorithm pane.
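
The A / (SDev + B) weighting chosen in the X and Y Weights tabs can be sketched in a few lines of
NumPy. The function name is hypothetical, the numbers are made up for illustration, and the use of
the sample standard deviation (n - 1 in the denominator) is an assumption.

```python
import numpy as np

def weight_columns(X, A=1.0, B=0.0):
    """Scale every column by A / (SDev + B); A=1, B=0 gives 1/SDev."""
    sdev = X.std(axis=0, ddof=1)        # sample standard deviation per column
    weights = A / (sdev + B)
    return X * weights, weights

# Hypothetical columns on very different scales (not the tutorial data).
X = np.array([[30.1, 0.52, 9.0],
              [28.4, 0.61, 11.5],
              [33.0, 0.47, 8.2],
              [31.7, 0.58, 10.1]])
Xw, w = weight_columns(X)
print(Xw.std(axis=0, ddof=1))           # every weighted column has unit SDev
```

After weighting, a variable measured in large units no longer dominates the model simply because of
its scale.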

Click Finish in the regression dialog when all parameters have been set up. The computation of the
model will begin. After PLS analysis is completed, the system will ask “Do you want to view the plots of
model PLS now?”.

Click Yes to see the PLS Overview plots. A new node, PLS, has been added to the project navigator.

PLS Regression Overview


This overview provides the most useful and common predefined result plots for PLS, including loading
weights and residuals. The model can be reviewed at any time during the analysis by selecting
any of the result plots under the PLS - Plots node in the project navigator. In this exercise, several
Y responses were modeled together, so the overview results for each response are available by
choosing the Y variable of interest in the toolbar. With multiple responses, the non-significant
variables can be determined for each response, and the analysis also shows which sensory responses
are best predicted from the instrumental measurements, without making a separate PLS model for each
response. When the Predicted vs. reference plot (lower right quadrant) is active, the name of the Y
variable being analyzed appears in the toolbar. Another Y response can be chosen from the drop-down
menu, or one can scroll through the responses using the arrow tool on the right.

Interpretation of the variance plot


Task

Interpret the explained variance curve, which can be shown as residual variance, or as explained
variance. The two different views are useful for different tasks.

How to do it

The Y-explained variance plot is in the lower left quadrant. This plot can be changed to the residual
variance plot by using the toolbar, and to the X-explained variance by clicking on the X button.

A local maximum is achieved for five PLS factors. The next task is to determine why the validation
curve does not follow the general trend. This can be done by looking at the explained variance for the
variables individually.

Y-explained variance plot

From the plot menu select Variances and RMSEP - X- and Y-Variance… Make sure the bottom plot
shows the Explained Variance for the 12 individual Y variables. If not, change it by using the toolbar
shortcut. Also do not select Total, but select Cal from the toolbar shortcuts.

Add a legend to the plot by right clicking and selecting Properties. Select legend, and check the box
visible to add the legend to the plot.

PLS, Explained Validation Variance Plot displayed for the 12 individual Y-variables

The conclusion reached from the residual variance curve was that two PLS factors were optimal. The
variables that are well described are reflected in the information conveyed by these factors.

About 85% of the color variation (variables 1 and 2), and 80% of the variation in sweetness (variable 6)
can be explained by a combination of the chemical and instrumental variables.


Note that only 23% of the total Y-variance is explained by the model using two factors.

Interpretation of the scores plot


The scores plot shows how the samples are related to each other.

Task

Interpret the scores plot.

How to do it

Return to the Regression Overview Plot (by selecting it from the Plots node in the project navigator).
The Scores plot is always found in the upper left quadrant of the overview. The scores plot shows
patterns in the samples, which are often difficult to see without additional visual aids. Use the
category variables as markers in the same way as in the “Interpretation of the scores plot”
section for the PCA model: highlight the scores plot and right-click to select Sample Grouping.
The category variable Harvest Time will be used for the sample grouping.

PLS factor 1 describes the harvesting time. Harvest time 3 is found on the right in the plot and harvest
time 1 to the left. The scores plot does not reveal information about the cultivars.

A comparison with the loadings plot provides more information. Interpret the two plots (Scores and
Loadings) by analyzing them together.

Interpretation of the loadings and loading weights plot


Study the loading weights plot to find correlating variables.

Task

Interpret the loadings and the loading weights plots.

How to do it

The loadings plot is located in the upper right quadrant of the Regression Overview. Activate it (if it is
present), or choose it from the project navigator under the PLS - Plots node. Make sure both X and Y
loadings are plotted.

To interpret variable relationships, visualize straight lines between the variables through the origin.
Variables along the same line, far from the origin, may be correlated. (Negatively correlated when
situated on opposite sides of the origin.)

PLS, X-Loading Weights and Y-Loadings Plot


The spectrophotometric color measurements (L, a, and b) appear to be strongly negatively correlated
with color intensity and redness. Sweetness is, as expected, strongly negatively correlated with
measured Acidity. Raspberry flavor (R.Flav), however, shows only a weak correlation to the PLS factors
(near origin = low PLS loadings).

The regression coefficients may also be analyzed to understand which X variables are important in
describing each of the Y responses. These can be selected from the project navigator, or from the
menu Plot- Regression Coefficients - Raw coefficients (B)- Line. The coefficients for each of the Y
responses can be displayed by selecting them from the drop-down list in the toolbar.

From Objective 1 it was concluded that the jam quality varied with respect to color, flavor, and
sweetness. But the results so far in Objective 2 show that the chemical and instrumental variables
mainly predict variations in color and sweetness (which is indicated by the low explained Y-variance
of Flavor). This indicates that the Y-variable Flavor cannot be replaced with the present set of
X-variables, i.e. there is no information in the chemical and instrumental measurements related to
the Flavor of the jam samples.

Other instrumental X-variables, e.g. gas chromatographic data, might improve the ability to predict
the flavor of the raspberry jam samples.

Interpretation of the predicted vs. reference plot


The predicted vs. reference plot displays the predictive ability of the developed model.

Task

Interpret the predicted vs. reference plot.

How to do it

The predicted vs. reference plot in the regression overview currently displays the results for the first
Y-variable, in this case, “Redness”.


PLS, Predicted vs. Reference Plot for variable “Redness”, model with two factors

Use the drop-down list in the toolbar to observe the prediction quality for other variables measured in
this analysis. Make sure these plots are displayed for two PLS factors, as this is the correct number for
this model. Note that for several of the properties, including raspberry flavor, raspberry smell, and
off-flavor, the instrumental values do not provide any real information. This analysis shows that the
chosen instrumental measurements are not a good substitute for the sensory analysis of these jams.

Objective 3: Predict user preference from sensory measurements

Is it possible to develop a model for predicting consumer preference data from new sensory data? If
so, expensive consumer tests can be replaced by cheaper sensory tests. The PLS model previously
developed was used for interpretation purposes. The focus is now on prediction. A new model will be
built relating the sensory data to consumer preference data, and this model will be applied to
unknown samples to predict their preference.

Make a PLS regression model for preference


First, develop a model relating sensory data to preference, and interpret it. PLS regression will be used
as the regression method.

Task

Make a PLS regression model for describing the relationships between sensory data and preference.

How to do it

From the Main Menu, select Tasks - Analyze - Partial Least Squares Regression…, and specify the
following parameters in the PLS Regression dialog:

Model Inputs

Predictors
• X data set: “JAMdemo”
• Rows/Samples: Training (12)
• Cols/X-variables: Sensory (12)
Responses
• Y data set: “JAMdemo”
• Rows/Samples: Training (12)
• Cols/Y-variables: Preference (1)
Maximum components: 6

PLS Regression Dialog

Weights in X and Y
All variables must be standardized with the option 1/SDev.
Select the X Weights tab and weight all the X variables with 1/SDev so that each
variable will contribute equally in the modeling step. Also weight the Preference values
(Y) by 1/SDev in the Y Weights tab.
Validation
Select Cross validation, then press Setup to access the Cross Validation Setup dialog and
choose Full as the cross validation method.

Press OK.
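The effect of the 1/SDev weighting can be illustrated outside the software. The following is a minimal Python sketch, not part of The Unscrambler®; the function name `weight_by_inverse_sdev` and the data values are made up for illustration, and the sample standard deviation (n - 1 denominator) is an assumption about how SDev is computed.

```python
from statistics import stdev


def weight_by_inverse_sdev(columns):
    """Scale every column by 1/SDev so each variable gets unit spread.

    `columns` maps a variable name to its list of values. The weights are
    returned as well, so the same weights can be applied to new samples.
    """
    weights = {name: 1.0 / stdev(values) for name, values in columns.items()}
    scaled = {
        name: [value * weights[name] for value in values]
        for name, values in columns.items()
    }
    return scaled, weights


# Hypothetical values for illustration only; not taken from the JAMdemo table.
sensory = {
    "Redness": [3.0, 5.0, 7.0, 9.0],
    "Sweetness": [4.0, 4.5, 5.0, 5.5],
}
scaled, weights = weight_by_inverse_sdev(sensory)

for name, values in scaled.items():
    print(name, "SDev after weighting:", round(stdev(values), 6))
```

After weighting, every variable has a standard deviation of 1, so a variable measured on a wide scale can no longer dominate the model simply because of its units.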

Interpretation of the regression overview

Task

A new PLS node has been added to the project navigator. Rename this to “PLS Sensory” by highlighting
it, then right clicking and selecting the Rename option. Interpret the model using the regression
overview plots and other diagnostic tools available.

How to do it

It is of primary interest to determine how well the model can predict new values. Therefore the
residual variance and the Predicted vs. reference plots are the most informative.

The residual variance

Activate the explained variance plot in the lower left quadrant, and change it to the residual Y variance
plot by using the toolbar shortcuts. The prediction error levels off after two PLS factors, so two
factors are taken as the optimal model size.

Residual Y Validation Variance Plot

Predicted vs. reference

Activate the predicted vs. reference plot and specify to display it for 2 PLS factors, using the arrows in
the toolbar.

Turn on the regression line and the target line with the toolbar shortcuts.

Predicted vs. reference Plot with Trend Lines

It can be observed that the predictions are of good quality. Some samples are not so well predicted,
but the overall correlation is satisfactory.
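To make the two trend lines concrete: the target line is the ideal case where predicted equals reference (slope 1, offset 0), while the regression line is a least-squares fit through the plotted points. The sketch below computes the slope, offset and correlation of such a fit in plain Python. The function name `trend_line` and the numbers are hypothetical and are not results from the jam model.

```python
from math import sqrt


def trend_line(reference, predicted):
    """Fit predicted = offset + slope * reference and return (slope, offset, r)."""
    n = len(reference)
    mean_ref = sum(reference) / n
    mean_pred = sum(predicted) / n
    sxx = sum((r - mean_ref) ** 2 for r in reference)
    syy = sum((p - mean_pred) ** 2 for p in predicted)
    sxy = sum(
        (r - mean_ref) * (p - mean_pred) for r, p in zip(reference, predicted)
    )
    slope = sxy / sxx
    offset = mean_pred - slope * mean_ref
    correlation = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, offset, correlation


# Hypothetical reference and predicted values, for illustration only.
reference = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 4.0, 5.5, 7.5]
slope, offset, r = trend_line(reference, predicted)
print(f"slope={slope:.3f} offset={offset:.3f} r={r:.3f}")
```

The closer the slope is to 1, the offset to 0 and the correlation to 1, the closer the regression line lies to the target line.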

Interpretation of the regression coefficients


The regression coefficients are used to calculate the response value from the X-measurements. The
size of the coefficients provides an indication of which variables have an important impact on the
response variables.

There are two kinds of regression coefficients, Bw and B. The Bw coefficients are calculated from the
weighted data table and are used for interpretation. The B coefficients (raw) are calculated from the
raw data table and are used for predictions.
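The link between the two kinds of coefficients can be sketched as follows, assuming the model was built on mean-centered data with 1/SDev weights on both X and Y. This is a generic illustration in Python, not the software's internal code; the function names and all numbers are hypothetical.

```python
def raw_from_weighted(bw, x_means, x_sdevs, y_mean, y_sdev):
    """Convert weighted coefficients (Bw) to raw coefficients (B) and offset b0.

    Assumes mean centering and 1/SDev weighting of X and Y, so that
    (y - y_mean) / y_sdev = sum_j bw_j * (x_j - x_mean_j) / x_sdev_j.
    """
    b = [bw_j * y_sdev / sdev_j for bw_j, sdev_j in zip(bw, x_sdevs)]
    b0 = y_mean - sum(b_j * mean_j for b_j, mean_j in zip(b, x_means))
    return b0, b


def predict(b0, b, x):
    """Predict the response for one raw (unweighted) sample x."""
    return b0 + sum(b_j * x_j for b_j, x_j in zip(b, x))


# Hypothetical two-variable model, for illustration only.
bw = [0.5, -0.25]
x_means, x_sdevs = [5.0, 4.0], [2.0, 0.5]
y_mean, y_sdev = 6.0, 1.5

b0, b = raw_from_weighted(bw, x_means, x_sdevs, y_mean, y_sdev)
print("b0 =", b0, "B =", b)
print("prediction for [7.0, 4.5]:", predict(b0, b, [7.0, 4.5]))
```

Because the weighted coefficients refer to variables on a common scale, their sizes can be compared directly, whereas the raw coefficients absorb the units of each variable and can be applied to new measurements without any preprocessing.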

Task

Find which variables are important for predicting the Y-variable Preference.

How to do it

The estimated regression coefficients indicate the cumulative importance of each of the sensory
variables to the consumer preference.

Select Plot - Regression Coefficients. Choose the Weighted coefficients (Bw) option. Using the arrows
in the toolbar, change the plot to show regression coefficients for 2 PLS factors, and change the plot
layout to a bar chart.

Regression Coefficients Plot

Redness, Color and Sweetness (B1, B2 and B6) are significant in predicting Preference. Raspberry Smell
(B4) is also significant, but contributes negatively to the Preference. Thickness (B11) also seems to be
important, as it has a large (negative) coefficient.

Save the project file with the name “Tutorial_B”. It may also be saved as the model file itself, providing
a smaller file with just the model information that can be used for predicting new samples in real time
using The Unscrambler® Prediction Engine and The Unscrambler® X Process Pulse products. To save
the model only, right click on the model node in the project navigator and select the option Save
Model. In the dialog choose what size model to save. Models other than the full model do not include
all the results matrices, and therefore provide fewer results in addition to the predicted values when
used.

Save Model

Rename the model if desired and click on Save.

Open result matrices in the Editor


The result matrices may also be observed numerically. Comparison of results may be easier in tables
and the Editor is a good starting point for exporting data into other programs.

The plot Raw regression Coefficients (B) is available as a predefined plot from the Plot menu in the
regression results viewer. However, for this exercise the B coefficients will be viewed from the list of
numerous available matrices.

Task

View the regression coefficients in the editor.

How to do it

Open the Results folder under the PLS node in the project navigator and select the Beta Coefficients
(raw) matrix. Any of the other validation matrices may be selected from the validation folder of the PLS
model. The beta coefficients can then be treated like any other data in the Editor. They may be plotted
from the Plot menu, etc.

Predict preference for new samples


Regression models are mainly used to predict the response value for new samples. Models are
developed to allow the prediction of these values rather than performing reference measurements,
which often are time consuming and expensive.

The purpose of the model previously developed was to predict the jam preference for some
consumers based on sensory values that were measured for the samples.

Task

Predict the Preference for the jam samples.

Interpret the prediction results to see whether the predictions can be trusted.

How to do it

Activate the “JAMdemo” data matrix. Select Tasks - Predict - Regression… and specify the following
parameters in the Prediction dialog:

• Select model: PLS Sensory
• Data matrix: “JAMdemo”
• Rows/Samples: Prediction (8)
• Cols/X-variables: Sensory (12)
• Prediction type: Full Prediction
• Y-reference: Not included
• Number of Components: 2

Check the boxes for Inlier statistics and Sample Inlier dist (Mahalanobis distance) to provide valuable
statistical measures of the similarity of the prediction samples to the calibration samples.

Click OK to perform the prediction.

The Prediction dialog

Interpretation of Predicted with Deviation


There were no reference measurements available for the new samples in the “Prediction” Set. This
makes it impossible to check predicted vs. reference values. Since a model has been developed based
on projection, the only option available is to check the reliability of the predictions from the
deviations. There are also some statistical measurements of the similarity of predicted samples to
those used in developing the calibration model that can be used: inlier statistics and Mahalanobis
distance.

Task

Interpret the Predicted with Deviation plot, and other plots related to prediction results.

How to do it

Click OK in the Prediction dialog to display the predicted with deviation plot, and the tabulated
prediction results.

Prediction results

The predicted preferences for the “unknown” new jams come with uncertainty limits, i.e. the accuracy of
new predictions is limited. However, this model can still be used to predict the preference of new
jam samples, providing an indication of which ones will be accepted or not by consumers.

View the Inlier vs. Hotelling’s T² plot by selecting Plot – Residuals and Influence - Inlier vs Hotelling’s
T². This plot shows how similar the new samples are to those used in developing the calibration
model. For a prediction to be trusted the predicted sample must not be too far from a calibration
sample. This is checked by the Inlier distance. The projection of the new sample onto the model also
should not be too far from the center. This may be checked using the Hotelling’s T² distance.
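Both distances can be illustrated in score space. The sketch below uses common textbook forms: Hotelling's T² as the sum over factors of the squared score divided by the calibration score variance, and an inlier distance taken as the variance-scaled distance to the nearest calibration sample. These formulas, the function names and the score values are assumptions for illustration; the exact definitions and limits used by the software may differ.

```python
from math import sqrt
from statistics import variance


def score_variances(calibration_scores):
    """Variance of each factor's scores over the calibration samples."""
    n_factors = len(calibration_scores[0])
    return [
        variance([row[a] for row in calibration_scores])
        for a in range(n_factors)
    ]


def hotelling_t2(new_scores, calibration_scores):
    """Distance of a projected sample from the model center (scores centered)."""
    variances = score_variances(calibration_scores)
    return sum(t ** 2 / v for t, v in zip(new_scores, variances))


def inlier_distance(new_scores, calibration_scores):
    """Variance-scaled distance from a new sample to its nearest calibration sample."""
    variances = score_variances(calibration_scores)
    return min(
        sqrt(sum((t - c) ** 2 / v for t, c, v in zip(new_scores, row, variances)))
        for row in calibration_scores
    )


# Hypothetical scores on 2 PLS factors; rows are calibration samples.
calibration_scores = [[-2.0, 1.0], [0.0, -1.0], [2.0, 0.0]]
new_sample = [2.0, 1.0]

t2 = hotelling_t2(new_sample, calibration_scores)
inlier = inlier_distance(new_sample, calibration_scores)
print("Hotelling T2:", t2, "inlier distance:", inlier)
```

A new sample can pass one check and fail the other: it may sit close to the model center yet in an empty region between calibration samples, or lie next to a calibration sample that is itself extreme.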

Save the project file under the name “Tutorial B_complete”. This now includes all the data, three
models, and the predicted results for preference.

Check the error in original units – RMSE


Finally, observe how large the expected error is in predicted preference results, i.e. determine what an
approximate RMSEP is for such an analysis.

Task

Plot the RMSE.

How to do it

Return to the PLS Sensory node in the project navigator. In the plots folder select Regression
Overview, then select Plot - Variances and RMSEP - RMSE.

Two curves are plotted: one for calibration (RMSEC) and one for validation, which in this particular
case is the cross-validation error (RMSECV).

PLS, Root Mean Square Error Plot

To gain a better approximation of what to expect in future predictions, the RMSECV should be
analyzed.

The RMSECV may be studied for Preference for all PLS factors. With two factors, the RMSECV is 0.83. This
means that the preference predicted for a new sample, on the scale from 1 to 9, can be expected to have
an error of around 0.8. This is an acceptable error level in sensory analysis, where all measurements
carry considerable uncertainty.
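The difference between RMSEC and RMSECV can be sketched in a few lines of Python. To keep the example short, a one-variable least-squares line stands in for the PLS model, which is a deliberate simplification. The data values and function names are hypothetical, and the RMSE here divides by the number of samples, which may differ from the degrees-of-freedom correction the software applies.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares fit y = offset + slope * x (stand-in for the PLS model)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(reference, predicted):
    """Root mean square error, in the original units of the response."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return sqrt(sum(squared) / len(squared))


def rmsec_and_rmsecv(x, y):
    """RMSEC from the full fit; RMSECV from full (leave-one-out) cross validation."""
    offset, slope = fit_line(x, y)
    fitted = [offset + slope * xi for xi in x]

    cv_predictions = []
    for i in range(len(x)):
        # Refit without sample i, then predict the left-out sample.
        cv_offset, cv_slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        cv_predictions.append(cv_offset + cv_slope * x[i])

    return rmse(y, fitted), rmse(y, cv_predictions)


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]
rmsec, rmsecv = rmsec_and_rmsecv(x, y)
print(f"RMSEC={rmsec:.3f} RMSECV={rmsecv:.3f}")
```

RMSEC is computed on samples the model has already seen and is therefore optimistic; RMSECV predicts each sample from a model built without it, which is why it is the better guide to future prediction error.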
