
Tutorials for multivariate analysis of NIR and Raman spectra

I. PCA analysis on NIR spectrum

1. Data set

The data set consists of raw, unpreprocessed NIR spectra of diesel fuels. A total of 784 diesel samples were measured in absorbance mode, and the spectra cover the region from 750 to 1550 nm in 2 nm increments.
[Figure: raw NIR spectra of the diesel samples; x-axis: wavelength (nm), y-axis: absorbance]
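As a quick check of the dimensions implied by the description above, the wavelength axis and the expected shape of the data matrix can be sketched as follows. The variable names are illustrative and do not come from the tutorial.

```python
import numpy as np

# Wavelength axis: 750 to 1550 nm inclusive, in 2 nm steps
wavelengths = np.arange(750, 1551, 2)
n_samples = 784                      # number of diesel samples stated above
n_channels = wavelengths.size        # (1550 - 750) / 2 + 1 channels

# Expected shape of X: one row per sample, one column per wavelength
expected_shape = (n_samples, n_channels)
print(n_channels, expected_shape)    # 401 (784, 401)
```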

2. Analysis routine for PCA

The analysis routine is explained using the example data set. A PCA model is usually constructed in the following order.

A. Load data
Data should be loaded from Excel, CSV, text, or another file format. The provided file is in *.csv format.

B. Edit or define data
- Assign an observation ID and a variable label to the imported data set.
- Define the category of each variable (e.g., X (independent) vs. Y (dependent)). In PCA, there is only an X data set.
- Exclude or include each variable to be used in constructing the PCA model.

C. Preview data
- Data should be previewed using a histogram or time-series plot in order to check the statistical properties of each variable visually.

[Figure: histogram for variable #1 and time-series plot for variable #1; axis labels: absorbance, frequency, observation ID]

- Based on the preview results, the data can be edited, or abnormal data can be excluded from the analysis.

D. Preprocessing (this is only an example; other methods can be used depending on the characteristics of the data set, noise level, etc.)
- First, baseline subtraction is conducted using the polynomial fitting method.
- After baseline subtraction, first-derivative spectra are computed using the Savitzky-Golay algorithm.
- Finally, each variable is scaled by subtracting its mean value (this is called mean-centering). In other cases, auto-scaling (subtracting the mean value and then dividing by the standard deviation) can be used.

[Figure: preprocessing results; panels: baseline-subtracted spectra, first-derivative spectra, mean-centered data; x-axis: wavelength (nm), y-axis: absorbance]
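The preprocessing chain in step D can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the tutorial does not state: the baseline is a degree-1 polynomial fitted over the whole spectrum, the Savitzky-Golay window is 11 points with a second-order polynomial, edge channels are simply dropped, and the spectra are synthetic. All function names are hypothetical.

```python
import math
import numpy as np

def subtract_polynomial_baseline(spectra, degree=1):
    """Fit one low-order polynomial per spectrum (row) and subtract it."""
    x = np.linspace(-1.0, 1.0, spectra.shape[1])   # scaled axis keeps the fit well conditioned
    coefs = np.polyfit(x, spectra.T, degree)       # shape: (degree + 1, n_samples)
    baseline = (np.vander(x, degree + 1) @ coefs).T
    return spectra - baseline

def savgol_coefficients(window, polyorder, deriv=0, delta=1.0):
    """Savitzky-Golay weights from a local least-squares polynomial fit (odd window)."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Row `deriv` of the pseudo-inverse gives the fitted coefficient of offset**deriv
    return np.linalg.pinv(design)[deriv] * math.factorial(deriv) / delta ** deriv

def savgol_rows(spectra, window=11, polyorder=2, deriv=1, delta=1.0):
    """Filter every row; window // 2 channels are dropped at each edge."""
    kernel = savgol_coefficients(window, polyorder, deriv, delta)[::-1]
    return np.apply_along_axis(np.convolve, 1, spectra, kernel, mode="valid")

def mean_center(X):
    """Subtract the column mean from every variable."""
    mean = X.mean(axis=0)
    return X - mean, mean

# Synthetic spectra (assumption): one Gaussian band on a sloping baseline
rng = np.random.default_rng(0)
wavelengths = np.arange(750.0, 1551.0, 2.0)
band = np.exp(-0.5 * ((wavelengths - 1200.0) / 30.0) ** 2)
heights = rng.uniform(0.5, 1.0, size=(20, 1))
offsets = rng.uniform(0.0, 0.2, size=(20, 1))
slopes = rng.uniform(0.0, 2e-4, size=(20, 1))
raw = heights * band + offsets + slopes * (wavelengths - 750.0)

corrected = subtract_polynomial_baseline(raw, degree=1)
derivative = savgol_rows(corrected, window=11, polyorder=2, deriv=1, delta=2.0)
centered, column_means = mean_center(derivative)
print(derivative.shape)   # (20, 391)
```

Fitting the baseline over the whole spectrum also removes part of the band itself, so in practice the fit is often restricted to band-free regions. For auto-scaling, the centered columns would additionally be divided by their standard deviations.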

E. Compute PCA model
- The PCA model is constructed using the preprocessed data set.
- The optimal number of principal components (PCs) is chosen automatically using cross-validation.
- In this example, 7 PCs are optimal because RMSECV is at its minimum at 7 PCs (alternatively, Q2 is at its maximum at 7 PCs).
[Figure: eigenvalues and cross-validation results for X; x-axis: principal component number, y-axis: RMSECV]
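The tutorial does not say which cross-validation scheme produces the RMSECV curve, so the following is only one possible sketch. It assumes random row-wise folds combined with a leave-one-variable-out estimate for each held-out sample: each variable is predicted from the remaining ones through the loadings, which for orthonormal loadings P reduces to dividing the ordinary residual e_j by 1 - h_jj, where h_jj is the j-th diagonal element of P P^T. Plain reprojection of held-out rows would give an error that only decreases with more PCs, which is why the leverage correction is used here. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def pca_rmsecv(X, max_pcs, n_splits=5, seed=0):
    """RMSECV for 1..max_pcs components using row folds and
    leave-one-variable-out residuals (assumed scheme)."""
    rng = np.random.default_rng(seed)
    n_samples, n_vars = X.shape
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    press = np.zeros(max_pcs)
    for test_idx in folds:
        train = np.ones(n_samples, dtype=bool)
        train[test_idx] = False
        mean = X[train].mean(axis=0)
        _, _, Vt = np.linalg.svd(X[train] - mean, full_matrices=False)
        X_test = X[test_idx] - mean
        for k in range(1, max_pcs + 1):
            P = Vt[:k].T                      # loadings, n_vars x k
            hat = P @ P.T
            residual = X_test - X_test @ hat
            # Residual of each variable when it is left out of the score estimate
            press[k - 1] += np.sum((residual / (1.0 - np.diag(hat))) ** 2)
    return np.sqrt(press / (n_samples * n_vars))

# Synthetic rank-3 data plus noise (assumption, not the diesel data)
rng = np.random.default_rng(1)
scores = rng.normal(size=(200, 3)) * np.array([3.0, 2.0, 1.0])
loadings = np.linalg.qr(rng.normal(size=(15, 3)))[0].T
X = scores @ loadings + 0.05 * rng.normal(size=(200, 15))

rmsecv = pca_rmsecv(X, max_pcs=8)
best_n_pcs = int(np.argmin(rmsecv)) + 1
print(best_n_pcs, np.round(rmsecv, 4))
```

The number of PCs is then taken at the minimum of the curve, as in the 7-PC choice described above.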

F. Analysis of the computed PCA model
- The following plots can be used to examine the constructed PCA model:
  - Score plot (1D/2D/3D)
  - Loading plot (1D/2D/3D)
  - Hotelling's T2 plot
  - SPE or DModX plot
  - Contribution plot
    o The contribution plot should allow single-sample spectral data to be brought up by double-clicking a contribution.

- If any outlying samples are identified using the above plots, the corresponding samples are excluded from the data set, and the PCA model is then reconstructed.

G. Prediction using new data set
- The prediction set should be defined as either part of the training data or a secondary data set. The program should be able to handle either case. The following plots should be available:
  1. Score prediction
  2. Hotelling's T2 prediction
  3. SPE or DModX
  4. Contribution plot
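The quantities behind the score, Hotelling's T2, SPE, and contribution plots in steps F and G can be sketched as below. The definitions used are common textbook ones and are assumptions here, since the tutorial does not give formulas: T2 is the sum of squared scores divided by their training variances, SPE is the squared residual summed over variables, and the SPE contribution of a variable is its squared residual. Function names and data are hypothetical.

```python
import numpy as np

def fit_pca(X, n_pcs):
    """Return the training mean, loadings, and score variances."""
    mean = X.mean(axis=0)
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    loadings = Vt[:n_pcs].T
    score_var = s[:n_pcs] ** 2 / (X.shape[0] - 1)
    return mean, loadings, score_var

def pca_diagnostics(X_new, mean, loadings, score_var):
    """Project (new) samples onto an existing model and compute diagnostics."""
    centered = X_new - mean
    scores = centered @ loadings
    residual = centered - scores @ loadings.T
    t2 = np.sum(scores ** 2 / score_var, axis=1)   # Hotelling's T2 per sample
    spe = np.sum(residual ** 2, axis=1)            # squared prediction error per sample
    return scores, t2, spe, residual ** 2          # last item: SPE contributions

# Synthetic two-component data (assumption)
rng = np.random.default_rng(0)
true_loadings = np.vstack([np.ones(10), np.tile([1.0, -1.0], 5)]) / np.sqrt(10)

def simulate(n):
    latent = rng.normal(size=(n, 2)) * np.array([3.0, 1.0])
    return latent @ true_loadings + 0.05 * rng.normal(size=(n, 10))

X_train = simulate(100)
X_new = simulate(5)
X_new[3, 5] += 2.0                                 # disturb one variable of one new sample

model = fit_pca(X_train, n_pcs=2)
scores_new, t2_new, spe_new, spe_contrib = pca_diagnostics(X_new, *model)
suspect = int(np.argmax(spe_new))
print(suspect, int(np.argmax(spe_contrib[suspect])))
```

Calling `pca_diagnostics` on the training data itself gives the values for step F; calling it on a separate set gives the prediction plots of step G.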

II. PLS analysis on Raman spectrum

1. Data set

This data set consists of 80 corn samples measured with an NIR spectrometer (for this tutorial, treat it as Raman data). The wavelength range is 1100-2498 nm at 2 nm intervals (700 channels). The protein value of each sample is also included.
[Figure: corn spectra (x-axis: wavelength (nm), y-axis: absorbance) and protein concentration (x-axis: observation ID, y-axis: protein concentration)]

2. Analysis routine for PLS

The analysis routine is explained using the example data set. In this case, X (independent) is the NIR spectrum and Y (dependent) is the protein concentration in corn. A PLS model is usually constructed in the following order.

A. Load data
Data should be loaded from Excel, CSV, text, or another file format. The provided files are in *.xls format.

B. Edit or define data
- Assign an observation ID and a variable label to the imported data set.
- Define the category of each variable (e.g., X (independent) vs. Y (dependent)).
- Exclude or include each variable to be used in constructing the PLS model.

C. Preview data
- Data should be previewed using a histogram or time-series plot in order to check the statistical properties of each variable visually.


[Figure: histogram and time-series plot for a selected variable; axis labels: frequency, absorbance, protein, observation ID]

- Based on the preview results, the data can be edited, or abnormal data can be excluded from the analysis.

D. Preprocessing (this is only an example; other methods can be used depending on the characteristics of the data set, noise level, etc.)
- First, each spectrum in X is smoothed using the Savitzky-Golay algorithm.
- After smoothing, the spectra are preprocessed using SNV to remove baseline drift and scattering effects.
- Finally, each variable in X and Y is auto-scaled for normalization.
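The SNV and auto-scaling steps of D can be sketched as follows (the Savitzky-Golay smoothing step is not shown). SNV works row-wise, on each spectrum, while auto-scaling works column-wise, on each variable. The spectra and protein values below are synthetic and made up for illustration, and the function names are hypothetical.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) by its own mean and std."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def autoscale(data):
    """Column-wise auto-scaling; the mean and std are returned for reuse on new data."""
    mean = data.mean(axis=0)
    std = data.std(axis=0, ddof=1)
    return (data - mean) / std, mean, std

# Synthetic data (assumption): one band whose height follows a made-up protein value
rng = np.random.default_rng(0)
wavelengths = np.arange(1100.0, 2499.0, 2.0)            # 700 channels
band = np.exp(-0.5 * ((wavelengths - 1900.0) / 80.0) ** 2)
protein = rng.uniform(7.5, 10.0, size=80)
pure = 0.3 + 0.02 * protein[:, None] * band + 0.001 * rng.normal(size=(80, 700))

# Multiplicative scatter and additive offset that SNV is meant to remove
gain = rng.uniform(0.8, 1.2, size=(80, 1))
offset = rng.uniform(-0.1, 0.1, size=(80, 1))
raw = gain * pure + offset

X_snv = snv(raw)
X_scaled, x_mean, x_std = autoscale(X_snv)
y_scaled, y_mean, y_std = autoscale(protein)
print(X_scaled.shape, y_scaled.shape)                   # (80, 700) (80,)
```

New samples should be scaled with the stored training `x_mean` and `x_std` rather than with their own statistics.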

E. Compute PLS model
- The PLS model is constructed using the preprocessed X and Y data sets.
- The optimal number of latent variables (LVs) is chosen automatically using cross-validation.
- In this example, 6 LVs are optimal because the RMSECV plot shows a knee at 6 LVs (alternatively, Q2 can be used).
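A minimal sketch of step E for a single response (protein) is given below. It swaps in a NIPALS-style PLS1 algorithm written in NumPy; the tutorial's software may use a different algorithm, and the fold layout (5 random folds) is also an assumption. X and y are auto-scaled inside each fold using the training statistics, and predictions are transformed back to the original y units before RMSECV is computed. Function names and data are hypothetical.

```python
import numpy as np

def pls1_coefficients(X, y, max_lvs):
    """NIPALS-style PLS1 on scaled data; returns one regression vector per LV count."""
    Xk, yk = X.copy(), y.copy()
    W, P, q, coefs = [], [], [], []
    for _ in range(max_lvs):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)            # weight vector
        t = Xk @ w                        # scores
        tt = t @ t
        p = Xk.T @ t / tt                 # X loadings
        q_a = (yk @ t) / tt               # y loading
        Xk = Xk - np.outer(t, p)          # deflate X
        yk = yk - q_a * t                 # deflate y
        W.append(w); P.append(p); q.append(q_a)
        Wm, Pm = np.column_stack(W), np.column_stack(P)
        coefs.append(Wm @ np.linalg.solve(Pm.T @ Wm, np.array(q)))
    return coefs

def pls1_rmsecv(X, y, max_lvs, n_splits=5, seed=0):
    """RMSECV in original y units for 1..max_lvs latent variables."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    folds = np.array_split(rng.permutation(n), n_splits)
    press = np.zeros(max_lvs)
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        x_mean, x_std = X[train].mean(axis=0), X[train].std(axis=0, ddof=1)
        y_mean, y_std = y[train].mean(), y[train].std(ddof=1)
        X_train = (X[train] - x_mean) / x_std
        y_train = (y[train] - y_mean) / y_std
        X_test = (X[test_idx] - x_mean) / x_std
        for k, b in enumerate(pls1_coefficients(X_train, y_train, max_lvs)):
            y_pred = y_mean + y_std * (X_test @ b)
            press[k] += np.sum((y[test_idx] - y_pred) ** 2)
    return np.sqrt(press / n)

# Synthetic data with three underlying factors (assumption, not the corn data)
rng = np.random.default_rng(2)
T = rng.normal(size=(80, 3)) * np.array([3.0, 1.0, 0.5])
X = T @ rng.normal(size=(3, 20)) + 0.01 * rng.normal(size=(80, 20))
y = T @ np.array([0.2, 1.0, 2.0]) + 0.1 * rng.normal(size=80)

rmsecv = pls1_rmsecv(X, y, max_lvs=8)
print(np.round(rmsecv, 3))
```

The number of LVs is then read from the knee or minimum of the printed curve, in the same way as the 6-LV choice described above.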
[Figure: SIMPLS variance captured and statistics; x-axis: latent variable number, y-axis: RMSECV]

[Figure: preprocessing results; panels: smoothed spectra, SNV-treated spectra, auto-scaled X, auto-scaled Y; axis labels: wavelength (nm), observation ID]

F. Analysis of computed PLS model
- The following plots can be used to examine the constructed PLS model:
  - Score plot (1D/2D/3D)
  - Loading plot (1D/2D/3D)
  - Weight plot (1D/2D/3D)
  - Hotelling's T2 plot
  - SPE or DModX plot
  - Contribution plot
    o The contribution plot should allow single-sample spectral data to be brought up by double-clicking a contribution.
  - Regression coefficient plot
  - VIP score plot
  - Y measured vs. predicted plot
  - Residual plot

- If any outlying samples are identified using the above plots, the corresponding samples are excluded from the data set, and the PLS model is then reconstructed.
- Alternative options:
  - To evaluate the computed PLS model, permutation analysis can be performed by randomly permuting a given percentage of the rows and then comparing the resulting Q2 value with that of the original PLS model.
  - To select informative regions in the original spectra, variable selection methods (e.g., VIP, UVE, GA) can be applied to the data set. This feature is not supported by other commercial software packages such as SIMCA-P and Unscrambler.
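As an illustration of the VIP score mentioned above, the following sketch computes VIP for a single-response PLS model using a commonly used definition: VIP_j = sqrt(p * sum_a(SSY_a * w_ja^2) / sum_a(SSY_a)), with p variables, normalized weights w, and SSY_a the y-variance explained by latent variable a. The PLS1 fit is a NIPALS-style stand-in, the data are synthetic, and the function names are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_lvs):
    """NIPALS-style PLS1 on mean-centered data; returns weights W, scores T, y loadings q."""
    Xk, yk = X - X.mean(axis=0), y - y.mean()
    W, T, q = [], [], []
    for _ in range(n_lvs):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        p = Xk.T @ t / tt
        q_a = (yk @ t) / tt
        Xk = Xk - np.outer(t, p)
        yk = yk - q_a * t
        W.append(w); T.append(t); q.append(q_a)
    return np.column_stack(W), np.column_stack(T), np.array(q)

def vip_scores(W, T, q):
    """Variable importance in projection for a single-y PLS model."""
    n_vars = W.shape[0]
    ssy = q ** 2 * np.sum(T ** 2, axis=0)            # y-variance explained per LV
    w_sq = (W / np.linalg.norm(W, axis=0)) ** 2      # squared normalized weights
    return np.sqrt(n_vars * (w_sq @ ssy) / ssy.sum())

# Synthetic data (assumption): only the first two variables carry information on y
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.1 * rng.normal(size=60)

W, T, q = pls1_fit(X, y, n_lvs=2)
vip = vip_scores(W, T, q)
print(np.round(vip, 2))
```

With this definition the squared VIP values sum to the number of variables, so a cut-off of 1 (a common rule of thumb, not a rule taken from this tutorial) marks variables with above-average importance.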

G. Prediction using new data set
  1. Score prediction
  2. Hotelling's T2 prediction
  3. SPE or DModX
  4. Contribution plot
  5. Y measured vs. predicted plot
  6. Residual plot
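The numbers behind the Y measured vs. predicted plot and the residual plot can be summarized as sketched below. The sketch assumes the predictions have already been obtained from a PLS model; the four value pairs are made up purely for illustration, and the function name is hypothetical.

```python
import numpy as np

def prediction_summary(y_measured, y_predicted):
    """Residuals, RMSEP, and R2 for a prediction set."""
    y_measured = np.asarray(y_measured, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    residuals = y_measured - y_predicted
    rmsep = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_measured - y_measured.mean()) ** 2)
    return residuals, rmsep, 1.0 - ss_res / ss_tot

# Made-up protein values, for illustration only
y_measured = [8.0, 8.5, 9.0, 9.5]
y_predicted = [8.1, 8.4, 9.2, 9.4]

residuals, rmsep, r2 = prediction_summary(y_measured, y_predicted)
print(np.round(residuals, 2), round(rmsep, 4), round(r2, 3))
```

Plotting `y_predicted` against `y_measured` gives the measured vs. predicted plot, and plotting `residuals` against the observation ID gives the residual plot.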

Revision 1.0 Haewoo Lee (Draft) Revision 2.0 Seongkyu Yoon (edited a few sentences)
