Tutorials For Multivariate Analysis On NIR and Raman Spectrum R2
I. PCA analysis on NIR spectrum
1. Data set
The data set consists of raw, unpreprocessed NIR spectra of diesel fuels. A total of 784 diesel samples were measured in absorbance mode, and the spectra cover the region from 750 to 1550 nm in 2 nm increments.
[Figure: NIR spectra of the diesel data set; absorbance vs. wavelength, axis range 700 to 1600 nm]
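As a rough illustration of the data layout described above, the sketch below builds the 750 to 1550 nm wavelength axis and reads a spectra matrix from CSV text. The file layout (one spectrum per row, one column per wavelength, no header), the `obs_N` IDs, and the random stand-in values are assumptions made for the example; they are not taken from the provided file.

```python
import io
import numpy as np

# Wavelength axis described in the text: 750 to 1550 nm in 2 nm steps.
wavelengths = np.arange(750, 1550 + 2, 2)

# Stand-in for the provided CSV file (hypothetical layout: one spectrum per
# row, one column per wavelength, no header). Real data would come from disk.
rng = np.random.default_rng(0)
fake_spectra = rng.normal(0.5, 0.1, size=(5, wavelengths.size))
buffer = io.StringIO()
np.savetxt(buffer, fake_spectra, delimiter=",")
buffer.seek(0)

X = np.loadtxt(buffer, delimiter=",")                     # (n_samples, n_channels)
sample_ids = [f"obs_{i + 1}" for i in range(X.shape[0])]  # observation IDs
var_labels = [f"{w} nm" for w in wavelengths]             # variable labels

print(X.shape, var_labels[0], var_labels[-1])
```

With the real file, the same `np.loadtxt` call would take a path instead of the in-memory buffer, and the matrix would have 784 rows if the file follows the layout assumed here.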
2. Analysis routine for PCA
The analysis routine is explained using the example data set. A PCA model is usually constructed in the following order.
A. Load data
Data should be loaded from Excel, CSV, text, or another file format. The provided file is in *.csv format.
B. Edit or define data
- Assign an observation ID and a variable label to the imported data set.
- Define the category of each variable (e.g., X (independent) vs. Y (dependent)). In PCA, there is only an X data set.
- Exclude or include each variable to be used in constructing the PCA model.
C. Preview data
- Data should be previewed using a histogram or time-series plot in order to check the statistical properties of each variable visually.
[Figure: preview plot for a single variable; histogram of absorbance (frequency vs. absorbance)]
- Based on the preview results, the data can be edited, or abnormal data can be excluded from the analysis.
D. Preprocessing (this is only an example; other methods can be used depending on the characteristics of the data set, noise level, etc.)
- First, baseline subtraction is performed using a polynomial fitting method.
- After baseline subtraction, first-derivative spectra are computed using the Savitzky-Golay algorithm.
- Finally, each variable is scaled by subtracting its mean value (this is called mean-centering). In other cases, auto-scaling (subtracting the mean value and then dividing by the standard deviation) can be used.
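The three preprocessing steps listed above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the tutorial's implementation: the baseline is a low-order polynomial fitted to the whole spectrum, the Savitzky-Golay weights are derived by hand from a local least-squares fit (in practice `scipy.signal.savgol_filter` is the usual choice), edge channels are simply dropped, and the spectra are synthetic. The polynomial degree, window width, and function names are illustrative choices.

```python
from math import factorial
import numpy as np

def subtract_poly_baseline(X, x_axis, degree=2):
    """Fit a polynomial to each spectrum (row) and subtract it."""
    # Rescale the axis to [-1, 1] so the fit is well conditioned.
    u = 2 * (x_axis - x_axis.min()) / (x_axis.max() - x_axis.min()) - 1
    coefs = np.polyfit(u, X.T, degree)             # (degree + 1, n_samples)
    baseline = np.vander(u, degree + 1) @ coefs    # (n_channels, n_samples)
    return X - baseline.T

def savgol_coeffs(window, polyorder, deriv=0, delta=1.0):
    """Savitzky-Golay weights from a local least-squares polynomial fit."""
    half = window // 2                             # window is assumed odd
    k = np.arange(-half, half + 1)
    A = np.vander(k, polyorder + 1, increasing=True)   # columns k^0, k^1, ...
    # Row `deriv` of the pseudo-inverse yields the fitted coefficient of k^deriv.
    return np.linalg.pinv(A)[deriv] * factorial(deriv) / delta ** deriv

def savgol_rows(X, window=11, polyorder=2, deriv=1, delta=1.0):
    """Filter each row; the window // 2 channels at each edge are dropped."""
    c = savgol_coeffs(window, polyorder, deriv, delta)
    return np.array([np.correlate(row, c, mode="valid") for row in X])

def mean_center(X):
    """Subtract the column (variable) means."""
    mean = X.mean(axis=0)
    return X - mean, mean

# Synthetic spectra: one Gaussian band on top of a sloping baseline.
wl = np.arange(750, 1552, 2.0)
rng = np.random.default_rng(0)
band = np.exp(-0.5 * ((wl - 1200) / 30) ** 2)
X_raw = np.array([a * band + b + 1e-4 * s * (wl - 750)
                  for a, b, s in rng.uniform(0.5, 1.5, size=(20, 3))])

X_bl = subtract_poly_baseline(X_raw, wl, degree=1)
X_d1 = savgol_rows(X_bl, window=11, polyorder=2, deriv=1, delta=2.0)
X_mc, x_mean = mean_center(X_d1)
print(X_mc.shape)
```

Fitting the baseline to the entire spectrum also removes part of the band itself; fitting only to band-free regions is a common refinement that is not shown here.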
[Figure: baseline-subtracted spectra (absorbance vs. wavelength, nm); histogram for variable #1; first-derivative spectra (absorbance vs. wavelength, nm)]
E. Compute PCA model
- The PCA model is constructed from the preprocessed data set.
- The optimal number of principal components (PCs) is chosen automatically using cross-validation.
- In this example, 7 PCs are optimal because RMSECV is at its minimum at 7 PCs (alternatively, Q2 is at its maximum at 7 PCs).
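Choosing the number of PCs by cross-validation could look like the sketch below. The tutorial does not say which cross-validation scheme the software uses, so the scheme here is an assumption: rows are split into folds, loadings are fitted on the training rows, and for each held-out row the scores are estimated from a subset of the columns and used to predict the remaining columns. The fold counts, function names, and synthetic rank-3 data are illustrative.

```python
import numpy as np

def pca_loadings(Xc, n_comp):
    """Loadings (n_vars x n_comp) of a centered matrix via SVD."""
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_comp].T

def pca_rmsecv(X, max_comp, n_row_folds=5, n_col_blocks=7):
    """RMSECV for 1..max_comp PCs using row folds and left-out column blocks."""
    n, m = X.shape
    press = np.zeros(max_comp)
    row_fold = np.arange(n) % n_row_folds
    col_block = np.arange(m) % n_col_blocks
    for f in range(n_row_folds):
        train, test = X[row_fold != f], X[row_fold == f]
        mean = train.mean(axis=0)
        P = pca_loadings(train - mean, max_comp)
        Xt = test - mean
        for a in range(1, max_comp + 1):
            Pa = P[:, :a]
            for b in range(n_col_blocks):
                out = col_block == b
                # Estimate scores from the retained columns only ...
                T, *_ = np.linalg.lstsq(Pa[~out], Xt[:, ~out].T, rcond=None)
                # ... then predict the left-out columns from those scores.
                resid = Xt[:, out] - (Pa[out] @ T).T
                press[a - 1] += np.sum(resid ** 2)
    return np.sqrt(press / (n * m))

# Synthetic data with three underlying components plus noise.
rng = np.random.default_rng(1)
scores = rng.normal(size=(60, 3)) * np.array([5.0, 3.0, 2.0])
X = scores @ rng.normal(size=(3, 40)) + 0.05 * rng.normal(size=(60, 40))

rmsecv = pca_rmsecv(X, max_comp=6)
best_n = int(np.argmin(rmsecv)) + 1
print(np.round(rmsecv, 4), "PCs at minimum RMSECV:", best_n)
```

Leaving out whole rows and reconstructing them from all of their own columns would make the error fall monotonically with the number of PCs, which is why the columns are also partitioned in this sketch.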
[Figure: Eigenvalues and Cross-validation Results for X; RMSECV vs. number of principal components]
F. Analysis of the computed PCA model
- The following plots can be used to examine the constructed PCA model:
  • Score plot (1D/2D/3D)
  • Loading plot (1D/2D/3D)
  • Hotelling's T2 plot
  • SPE or DModX plot
  • Contribution plot
    o Double-clicking a contribution in the contribution plot brings up the spectrum of the corresponding single sample.
- If any outlying samples are identified using the above plots, the corresponding samples are excluded from the data set and the PCA model is reconstructed.
G. Prediction using new data set
- The prediction set should be defined as either part of the training data or a secondary data set. The program should be able to handle either case. The following plots should be available:
1. Score prediction
2. Hotelling's T2 prediction
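The diagnostics named above (scores, Hotelling's T2, SPE, and contributions) can be computed for both training and new samples along the following lines. This is a sketch under assumptions: the model is a plain SVD-based PCA, SPE contributions are taken as the per-variable residuals, control limits are not computed, and the data and the simulated fault on variable 7 are synthetic. The function and key names are illustrative.

```python
import numpy as np

def fit_pca(X, n_comp):
    """Fit PCA by SVD; keep what is needed to project new samples."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return {
        "mean": mean,
        "P": Vt[:n_comp].T,                              # loadings
        "score_var": s[:n_comp] ** 2 / (X.shape[0] - 1), # variance of each score
    }

def project(model, X_new):
    """Scores, Hotelling's T2, SPE and residuals for new samples."""
    Xc = X_new - model["mean"]
    T = Xc @ model["P"]                                  # predicted scores
    E = Xc - T @ model["P"].T                            # part not explained
    t2 = np.sum(T ** 2 / model["score_var"], axis=1)     # Hotelling's T2
    spe = np.sum(E ** 2, axis=1)                         # squared prediction error
    return T, t2, spe, E

rng = np.random.default_rng(3)
load = rng.normal(size=(2, 30))
X_train = rng.normal(size=(100, 2)) @ load + 0.05 * rng.normal(size=(100, 30))
model = fit_pca(X_train, n_comp=2)
_, t2_train, spe_train, _ = project(model, X_train)

X_new = rng.normal(size=(5, 2)) @ load + 0.05 * rng.normal(size=(5, 30))
X_new[0, 7] += 5.0          # simulated fault on variable 7 of the first sample
T_new, t2_new, spe_new, E_new = project(model, X_new)

suspect = int(np.argmax(spe_new))                 # sample with the largest SPE
worst_var = int(np.argmax(np.abs(E_new[suspect])))  # largest SPE contribution
print("suspect sample:", suspect, "variable:", worst_var)
```

A complete tool would add confidence limits for T2 and SPE so that samples can be flagged automatically instead of by ranking.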
II. PLS analysis on Raman spectrum
1. Data set
This data set consists of 80 corn samples measured with an NIR spectrometer (for the purposes of this tutorial, treat it as Raman data). The wavelength range is 1100-2498 nm at 2 nm intervals (700 channels). The protein value of each sample is also included.
[Figure: corn spectra (absorbance vs. wavelength, nm); protein concentration vs. observation ID]
2. Analysis routine for PLS
The analysis routine is explained using the example data set. In this case, X (independent) is the NIR spectrum and Y (dependent) is the protein concentration in corn. A PLS model is usually constructed in the following order.
A. Load data
Data should be loaded from Excel, CSV, text, or another file format. The provided files are in *.xls format.
B. Edit or define data
- Assign an observation ID and a variable label to the imported data set.
- Define the category of each variable (e.g., X (independent) vs. Y (dependent)).
- Exclude or include each variable to be used in constructing the PLS model.
C. Preview data
- Data should be previewed using a histogram or time-series plot in order to check the statistical properties of each variable visually.
- Based on the preview results, the data can be edited, or abnormal data can be excluded from the analysis.
D. Preprocessing (this is only an example; other methods can be used depending on the characteristics of the data set, noise level, etc.)
- First, each spectrum in X is smoothed using the Savitzky-Golay algorithm.
- After smoothing, the spectra are preprocessed using SNV (standard normal variate) to remove baseline drift and scattering effects.
- Finally, each variable in X and Y is auto-scaled for normalization.
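The smoothing, SNV, and auto-scaling steps listed above might be sketched as follows. The assumptions are loud ones: the spectra and protein values are synthetic stand-ins for the corn data, the Savitzky-Golay weights come from a hand-written local polynomial fit, the spectrum edges are padded by repeating the end values, and the window width and polynomial order are arbitrary illustrative choices.

```python
import numpy as np

def savgol_smooth(X, window=9, polyorder=2):
    """Savitzky-Golay smoothing of each row (edges padded with end values)."""
    half = window // 2                                   # window is assumed odd
    k = np.arange(-half, half + 1)
    A = np.vander(k, polyorder + 1, increasing=True)
    c = np.linalg.pinv(A)[0]            # weights for the fitted value at k = 0
    Xp = np.pad(X, ((0, 0), (half, half)), mode="edge")
    return np.array([np.correlate(row, c, mode="valid") for row in Xp])

def snv(X):
    """Standard normal variate: center and scale each spectrum (row)."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

def autoscale(A):
    """Center each column and divide by its standard deviation."""
    mean, std = A.mean(axis=0), A.std(axis=0, ddof=1)
    return (A - mean) / std, mean, std

# Synthetic stand-in: 80 spectra, 1100-2498 nm in 2 nm steps (700 channels),
# with random multiplicative scatter (gain) and additive offset per sample.
wl = np.arange(1100, 2500, 2)
rng = np.random.default_rng(4)
base = 0.5 + 0.3 * np.sin(wl / 200.0)
band = np.exp(-0.5 * ((wl - 2100) / 40.0) ** 2)
protein = rng.uniform(7.5, 10.0, size=80)
gain = rng.uniform(0.8, 1.2, size=(80, 1))
offset = rng.uniform(-0.1, 0.1, size=(80, 1))
X_raw = gain * (base + 0.02 * protein[:, None] * band) + offset
X_raw += 0.002 * rng.normal(size=X_raw.shape)

X_smooth = savgol_smooth(X_raw)
X_snv = snv(X_smooth)
X_auto, x_mean, x_std = autoscale(X_snv)
Y_auto, y_mean, y_std = autoscale(protein[:, None])
print(X_auto.shape, Y_auto.shape)
```

The stored means and standard deviations are what a prediction step would reuse to scale new samples in the same way as the training data.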
[Figure: preview plots; histogram for a selected variable and time-series plot for a selected variable; axes include absorbance and protein]
E. Compute PLS model
- The PLS model is constructed from the preprocessed X and Y data sets.
- The optimal number of latent variables (LVs) is chosen automatically using cross-validation.
- In this example, 6 LVs are optimal because the RMSECV plot shows a knee at 6 LVs (alternatively, Q2 can be used).
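A cross-validated choice of the number of LVs can be sketched as below. Note the swapped-in technique: this uses a hand-written NIPALS PLS1 for a single response, not the SIMPLS algorithm named in the figure title, and only mean-centering is applied inside each fold. The fold layout, function names, and synthetic data are assumptions for illustration.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """NIPALS PLS1 on centered X (n x m) and centered y (n,)."""
    X, y = X.copy(), y.copy()
    m = X.shape[1]
    W, P, q = np.zeros((m, n_lv)), np.zeros((m, n_lv)), np.zeros(n_lv)
    for a in range(n_lv):
        w = X.T @ y
        w /= np.linalg.norm(w)          # weight vector
        t = X @ w                       # scores
        tt = t @ t
        P[:, a] = X.T @ t / tt          # X loadings
        q[a] = y @ t / tt               # y loading
        W[:, a] = w
        X -= np.outer(t, P[:, a])       # deflate X
        y -= q[a] * t                   # deflate y
    return W, P, q

def pls1_coef(W, P, q, a):
    """Regression vector that uses the first `a` latent variables."""
    Wa, Pa = W[:, :a], P[:, :a]
    return Wa @ np.linalg.solve(Pa.T @ Wa, q[:a])

def pls1_rmsecv(X, y, max_lv, n_folds=5):
    """K-fold RMSECV for 1..max_lv latent variables."""
    n = X.shape[0]
    fold = np.arange(n) % n_folds
    press = np.zeros(max_lv)
    for f in range(n_folds):
        tr, te = fold != f, fold == f
        x_mean, y_mean = X[tr].mean(axis=0), y[tr].mean()
        W, P, q = pls1_fit(X[tr] - x_mean, y[tr] - y_mean, max_lv)
        for a in range(1, max_lv + 1):
            y_hat = (X[te] - x_mean) @ pls1_coef(W, P, q, a) + y_mean
            press[a - 1] += np.sum((y[te] - y_hat) ** 2)
    return np.sqrt(press / n)

# Synthetic data: three latent factors drive both X and y.
rng = np.random.default_rng(2)
T_true = rng.normal(size=(80, 3))
X = (T_true * np.array([3.0, 2.0, 1.0])) @ rng.normal(size=(3, 50))
X += 0.05 * rng.normal(size=X.shape)
y = T_true @ np.array([0.5, 1.0, 1.0]) + 0.05 * rng.normal(size=80)

rmsecv = pls1_rmsecv(X, y, max_lv=8)
q2 = 1 - (rmsecv ** 2 * y.size) / np.sum((y - y.mean()) ** 2)
best_lv = int(np.argmin(rmsecv)) + 1
print(np.round(rmsecv, 3), "LVs at minimum RMSECV:", best_lv)
```

Picking the minimum is only one rule; as the text notes, a knee in the RMSECV curve or the Q2 curve can be used instead, which often gives a smaller model.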
[Figure: SIMPLS Variance Captured and Statistics; RMSECV vs. Latent Variable Number]
[Figure: preprocessing results; panels: smoothed spectra, SNV-treated spectra, auto-scaled X, auto-scaled Y]
F. Analysis of the computed PLS model
- The following plots can be used to examine the constructed PLS model:
  • Score plot (1D/2D/3D)
  • Loading plot (1D/2D/3D)
  • Weight plot (1D/2D/3D)
  • Hotelling's T2 plot
  • SPE or DModX plot
  • Contribution plot
    o Double-clicking a contribution in the contribution plot brings up the spectrum of the corresponding single sample.
  • Regression coefficient plot
  • VIP score plot
  • Y measured vs. predicted plot
  • Residual plot
- If any outlying samples are identified using the above plots, the corresponding samples are excluded from the data set and the PLS model is reconstructed.
- Alternative options:
  • To evaluate the computed PLS model, a permutation analysis can be performed by randomly permuting a given percentage of the rows and then comparing the resulting Q2 value with that of the original PLS model.
  • To select informative regions in the original spectra, variable selection methods (e.g., VIP, UVE, GA) can be applied to the data set. This feature is not supported by other commercial software packages such as SIMCA-P and Unscrambler.
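As one example of the variable selection methods mentioned above, VIP (variable importance in projection) scores can be computed from a fitted PLS model. The sketch below uses a minimal NIPALS PLS1 written for the example, synthetic data in which only the first five variables carry signal, and the common "VIP > 1" rule of thumb as the selection threshold; all three are assumptions and not part of the tutorial.

```python
import numpy as np

def pls1_nipals(X, y, n_lv):
    """Minimal NIPALS PLS1 on centered data; returns W, T and q."""
    X, y = X.copy(), y.copy()
    n, m = X.shape
    W, T, q = np.zeros((m, n_lv)), np.zeros((n, n_lv)), np.zeros(n_lv)
    for a in range(n_lv):
        w = X.T @ y
        w /= np.linalg.norm(w)
        t = X @ w
        p = X.T @ t / (t @ t)
        q[a] = y @ t / (t @ t)
        W[:, a], T[:, a] = w, t
        X -= np.outer(t, p)             # deflate X
        y -= q[a] * t                   # deflate y
    return W, T, q

def vip_scores(W, T, q):
    """VIP per variable from weights W, scores T and y-loadings q."""
    m = W.shape[0]
    ss = q ** 2 * np.sum(T ** 2, axis=0)            # y sum of squares per LV
    w_norm2 = (W / np.linalg.norm(W, axis=0)) ** 2  # squared normalized weights
    return np.sqrt(m * (w_norm2 @ ss) / ss.sum())

rng = np.random.default_rng(5)
latent = rng.normal(size=80)
X = rng.normal(size=(80, 40))
X[:, :5] += 3.0 * latent[:, None]       # only the first 5 variables carry signal
y = latent + 0.1 * rng.normal(size=80)

W, T, q = pls1_nipals(X - X.mean(axis=0), y - y.mean(), n_lv=2)
vip = vip_scores(W, T, q)
selected = np.flatnonzero(vip > 1.0)    # rule-of-thumb threshold (assumption)
print("selected variables:", selected)
```

By construction the squared VIP scores average to one across variables, which is where the threshold of 1 comes from.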
G. Prediction using new data set
1. Score prediction
2. Hotelling's T2 prediction
3. SPE or DModX
4. Contribution plot
5. Y measured vs. predicted plot
6. Residual plot
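The numbers behind a "Y measured vs. predicted" plot and a residual plot can be summarized as in the sketch below. The statistics chosen (RMSEP, bias, R2) and the function name are assumptions, and the protein values are made up for illustration; they are not taken from the corn data set.

```python
import numpy as np

def prediction_summary(y_measured, y_predicted):
    """Residuals and summary statistics for a prediction set."""
    y_measured = np.asarray(y_measured, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    residuals = y_measured - y_predicted
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_measured - y_measured.mean()) ** 2)
    return {
        "residuals": residuals,                      # data for the residual plot
        "rmsep": np.sqrt(np.mean(residuals ** 2)),   # root mean squared error
        "bias": residuals.mean(),                    # mean residual
        "r2": 1 - ss_res / ss_tot,                   # agreement with measured Y
    }

# Made-up protein values purely for illustration.
y_meas = [8.0, 8.5, 9.0, 9.5, 10.0]
y_pred = [8.1, 8.4, 9.1, 9.4, 10.1]
summary = prediction_summary(y_meas, y_pred)
print({k: v for k, v in summary.items() if k != "residuals"})
```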
Revision 1.0 Haewoo Lee (Draft)
Revision 2.0 Seongkyu Yoon (edited a few sentences)