Liu 2020
Article info

Article history:
Received 8 May 2020
Received in revised form 30 June 2020
Accepted 7 July 2020
Available online 20 July 2020

Keywords:
NIR
Interval random forest
Non-destructive
Sunset yellow

Abstract

A fast quantitative method for determining sunset yellow content was established based on near-infrared (NIR) spectroscopy and interval random forest. The spectra of 132 cream pigment samples were obtained with an FT-NIR spectrometer, and various preprocessing methods such as standard normal variate (SNV), wavelet transform (WT), and Savitzky-Golay (SG) smoothing were used to smooth and denoise the original spectra. In this paper, WT and first-order differentiation were used as pretreatment, and the Kennard-Stone algorithm was used to divide the data set. Finally, partial least squares, interval partial least squares, random forest and interval random forest were used to construct an optimal quantitative analysis model. The experimental results show that the interval random forest can find the best sub-interval and thereby improve the prediction ability of the model. The R² (coefficient of determination) and RMSEP (root mean square error of prediction) of the prediction set are 0.8965 and 0.2454, respectively. The results show that near-infrared spectroscopy combined with the interval random forest algorithm is a fast and non-destructive method for detecting the content of sunset yellow in cream.
© 2020 Elsevier B.V. All rights reserved.
⁎ Corresponding author.
E-mail address: tanzhenglin@hbue.edu.cn (Z. Tan).
https://doi.org/10.1016/j.saa.2020.118718
1386-1425/© 2020 Elsevier B.V. All rights reserved.

1. Introduction

Margarine is commonly used in baked goods such as cakes and pastries. To brighten its color, artificial pigments are often added. Generally made from aniline dyes derived from coal, these pigments have no nutritional value for the human body and can affect children's physical development and cause disease, even cancer [1–4]. Thus, the analysis of artificial pigments in cream has become particularly urgent.

Our previous research on indigotine [5] achieved good results, with R² (coefficient of determination) reaching 0.9402 and RMSEP (root mean square error of prediction) 0.2509, so the focus of this study has shifted to sunset yellow, one of the artificial colors listed in China's hygienic use standards. Previous studies have shown that yellow pigments can harm human health, for example by damaging liver cells.

At present, artificial color detection methods include thin-layer chromatography, high-performance liquid chromatography, polarography, spectrophotometry, capillary electrophoresis and others [6–8], but all have their technical limitations. For example, thin-layer chromatography is cumbersome and has poor quantitative accuracy; spectrophotometry requires chemometric methods, and its data processing is very complicated [7]. Therefore, this paper chose NIR spectroscopy for this study.

Thanks to its rapid and non-destructive detection, NIR technology has been widely applied in many fields such as medical analysis, petroleum product analysis and molecular material analysis [9–11]. However, it also has a shortcoming: the acquired spectra are too noisy to be used directly on a large scale. Pre-processing is therefore necessary before the analytical model is established. Pre-processing generally includes SNV (standard normal variate), MSC (multiplicative scatter correction), SG (Savitzky-Golay), and others [12]. In practical analysis, the most widely used methods in NIR are PLSR (partial least squares regression) and MLR (multiple linear regression), along with many other linear and nonlinear analysis methods [13–15].

With the popularity of machine learning, various machine learning algorithms have also been applied in the field of NIR. For example, in 2018, Liu et al. used the SVM (support vector machine) method to analyse the content of camelina protein and obtained an RMSEC (root mean square error of calibration) and RMSEP of 0.83963 and 0.96578, respectively, which proved more effective than PLSR and PCR (principal component regression) [16]. More recently, Yang et al. employed machine learning methods such as SVM to analyse soil organic matter and pH [17]. Another common machine learning algorithm is Random Forest (RF), which is often used as a classification
2 J. Liu et al. / Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy 242 (2020) 118718
characteristic in a given data set, and the combined analysis can obtain more objective and fair results [19,20].

In the field of near-infrared spectroscopy, random forest is widely used not only for classification but also for quantitative analysis [21,22]. In 2006, some scholars performed adaptive discrete wavelet transform (DWT) on NIR spectra and then used penalized discriminant analysis (PDA), multivariate adaptive regression spline discriminant analysis (MARS-DA), and RF to model the quality of wine, with corresponding accuracy rates of 99.93%, 99.2% and 76.4% [23]. However, recent studies have shown that, for quantitative analysis, multiple trees can be combined in an RF to form an ensemble learner; with suitably chosen parameters, the prediction of concentration becomes more accurate and robust. In 2017, Chemura et al. [24] used the RF algorithm to test the ability of selected bands in the VIS/NIR range to predict plant water content (PWC) in coffee. They selected three bands after determining appropriate parameters, and the results showed that the selected bands could reliably predict PWC. When an RF is built, the trees are constructed randomly from the selected data set, so each tree is relatively independent. Comprehensive analysis and band selection are then performed to avoid over-fitting. Improved RF band screening can therefore provide accurate regression predictions.

In this study, we explored a new band screening method for the NIR spectroscopic analysis of the margarine pigment sunset yellow. It is worth noting that in previous studies using FT-NIR technology, the detection limit was on the order of 1% or one part per thousand [25,26], whereas this study reached one part in ten thousand, which makes the analysis more accurate. Because artificial colors have very similar structures, and some are even isomers, this study selected sunset yellow because of its high use in margarine products.

Fast and reliable testing is an important task in the field of food supervision. This study used partial least squares (PLS) and RF for quantitative analysis. The NIR spectra of the margarine samples were related to the corresponding artificial pigment concentrations. To better determine the model, this study also compared preprocessing methods such as WT, selected the best preprocessing method, and then optimized the final model. At the same time, the results of partial least squares (PLS) and random forest (RF) were compared. During the analysis, the influence of the selected wavelengths on the final model was tested, and the interval partial least squares method and the interval random forest were proposed. In addition, it was found that the performance indices of the final model change when the number of trees in the random forest and the construction method change.

2. Material and methods

12, Russia). The cream samples containing sunset yellow pigment were carefully loaded into a 40 cm³ sample cell so as to avoid air bubbles (air bubbles affect the instrument's near-infrared scanning and make the spectra highly inaccurate). The near-infrared spectra of the samples in the range of 8000–14,000 cm⁻¹ were recorded by the spectrometer as the average of 3 scans. Further information about the samples in this experiment is shown in Table 1.

Table 1
Dataset     Number of samples   Minimum (g/kg)   Maximum (g/kg)   Mean value (g/kg)
Data set    132                 0.0              0.1014           0.04631

2.2. Data preprocessing

As the original data contain a lot of irrelevant information, the original spectra need to be preprocessed before a model is built. The standard normal variate transform (SNV) [27,28], Savitzky-Golay smoothing [29] and wavelet transform [30,31] are often used, and they lead to different results when used separately or in combination. Therefore, a large number of experiments are needed to verify the results so that the optimal model and the best preprocessing method can be obtained.

Near-infrared spectra usually contain a lot of unwanted physical information from non-target factors, such as background noise and baseline drift. To obtain useful feature information, SNV processing can be performed. SNV is used to eliminate the influence of uneven particle distribution, surface scattering and different particle sizes on the spectrum, as well as the influence of the optical path on the diffuse reflectance spectrum. The calculation formula is as follows:

$$X_{i,\mathrm{SNV}} = \frac{X_{i,k} - \bar{X}_i}{\sqrt{\dfrac{\sum_{k=1}^{m} \left(X_{i,k} - \bar{X}_i\right)^2}{m-1}}} \tag{1}$$

where $\bar{X}_i$ is the average of the spectrum of the i-th sample, k = 1, 2, …, m, m is the number of wavelength points, and $X_{i,\mathrm{SNV}}$ is the transformed spectrum.

Savitzky-Golay (SG) filtering aims to smooth noisy data and suppress data points with large interference so that the spectrum can be processed simply in a preliminary step [32]. SG is based on local polynomial fitting in the time domain. By moving a window along the data and fitting polynomials to successive subsets, the convolution coefficients and the related derivative orders of all data points can be obtained. This formula is as follows:
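As a rough illustration of the two preprocessing steps described in this section, the sketch below applies Savitzky-Golay smoothing followed by the SNV scaling of Eq. (1) to a single spectrum. It is not the authors' implementation. The function names `snv` and `sg_smooth`, the example values, and the choice of a 5-point quadratic window (standard weights −3, 12, 17, 12, −3 divided by 35) are assumptions made for illustration; the paper does not state the window size or polynomial order it used.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre one spectrum on its own mean and
    divide by its own sample standard deviation (m - 1 denominator)."""
    mu = mean(spectrum)
    sd = stdev(spectrum)  # sample standard deviation, as in Eq. (1)
    return [(x - mu) / sd for x in spectrum]


# Savitzky-Golay smoothing weights for a 5-point window and a quadratic
# polynomial. The window size is an illustrative assumption.
SG5 = [-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35]


def sg_smooth(spectrum, weights=SG5):
    """Smooth by convolving a moving window with fixed SG weights.
    Edge points without a full window are left unchanged in this sketch."""
    half = len(weights) // 2
    out = list(spectrum)
    for j in range(half, len(spectrum) - half):
        window = spectrum[j - half : j + half + 1]
        out[j] = sum(w * x for w, x in zip(weights, window))
    return out


# Hypothetical absorbance values for one sample (not data from the paper).
raw = [0.42, 0.45, 0.51, 0.49, 0.58, 0.66, 0.63, 0.71]
smoothed = sg_smooth(raw)
scaled = snv(smoothed)
```

After `snv`, each spectrum has zero mean and unit sample standard deviation, which is what removes multiplicative and additive scatter effects between samples. Smoothing before scaling is one possible ordering; the paper reports testing several combinations.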
Table 2
Predictive performance of random forest (RF) calibration models with different wavelet functions.

Wavelet functions    R²       MSE      RPD      RMSEP
Haar                 0.7895   0.3443   1.4473   0.5868
Daubechies(4,4)      0.8271   0.0908   2.2219   0.3014
Daubechies(12,3)     0.8386   0.0848   2.3771   0.2912
Symlet(5,3)          0.8193   0.0949   2.3037   0.3081
Symlet(4,2)          0.8147   0.0973   2.2475   0.3120
Coif(5,4)            0.8310   0.0888   2.3336   0.2980

Table 4
Prediction performance under different interval numbers.

Interval number selection    R²       MSE      RPD      RMSEP
5                            0.8476   0.1009   2.5173   0.2795
10                           0.8965   0.0902   2.6195   0.2454
15                           0.7831   0.1127   2.1290   0.2957
20                           0.7915   0.1013   2.2374   0.2864
30                           0.7264   0.1739   1.9452   0.3798
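Table 4 compares models built with different numbers of spectral intervals. The general idea of interval selection can be sketched as follows, under stated assumptions: the wavelength axis is split into equal contiguous intervals, one model is fitted per interval on the calibration set, and the interval with the lowest RMSEP on a held-out set is kept. To stay self-contained, this sketch swaps in a plain least-squares line on each interval's mean absorbance as a stand-in regressor; it is not a random forest or PLS model, and every function name and data value here is hypothetical.

```python
from math import sqrt
from statistics import mean


def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    return sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def r2(y_true, y_pred):
    """Coefficient of determination."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def split_intervals(n_vars, n_intervals):
    """(start, stop) index pairs covering range(n_vars) in contiguous,
    near-equal blocks."""
    bounds = [i * n_vars // n_intervals for i in range(n_intervals + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def fit_line(x, y):
    """Stand-in regressor (NOT random forest or PLS): least-squares line."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sxx
    return slope, y_bar - slope * x_bar


def best_interval(X_cal, y_cal, X_val, y_val, n_intervals):
    """Fit one model per interval and return (RMSEP, (start, stop)) of
    the interval that predicts the validation set best."""
    results = []
    for start, stop in split_intervals(len(X_cal[0]), n_intervals):
        feat_cal = [mean(row[start:stop]) for row in X_cal]
        feat_val = [mean(row[start:stop]) for row in X_val]
        slope, intercept = fit_line(feat_cal, y_cal)
        pred = [slope * f + intercept for f in feat_val]
        results.append((rmsep(y_val, pred), (start, stop)))
    return min(results)


def make_spectrum(conc, k):
    """Hypothetical 12-point spectrum: only points 4..7 respond to conc;
    the other points carry a small concentration-independent pattern."""
    return [
        0.1 * j + (conc if 4 <= j < 8 else 0.01 * ((7 * k + 3 * j) % 5))
        for j in range(12)
    ]


y_cal = [0.00, 0.02, 0.04, 0.06, 0.08, 0.10]
y_val = [0.01, 0.05, 0.09]
X_cal = [make_spectrum(c, k) for k, c in enumerate(y_cal)]
X_val = [make_spectrum(c, k + 6) for k, c in enumerate(y_val)]
best_rmsep, best_span = best_interval(X_cal, y_cal, X_val, y_val, n_intervals=3)
```

Repeating the search for several values of `n_intervals` and comparing the best RMSEP of each run mirrors the kind of comparison reported in Table 4. In practice the stand-in line would be replaced by the regressor under study.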
Table 5
Prediction performance under different models.

Method      R²       MSE      RPD      RMSEP
PLSR        0.7572   0.1275   1.8749   0.3572
RF          0.8121   0.0987   2.4723   0.3142
Inv-PLSR    0.8570   0.0779   2.5868   0.2792
Inv-RF      0.8965   0.0902   2.6195   0.2454

latent variable was determined to be 10. Then, the different intervals were optimized, and the models were compared on the basis of wavelength interval selection. The specific results are shown in Table 5. Under the same preprocessing method, the R², MSE, RPD, and RMSEP of RF were 0.8121, 0.0987, 2.4723 and 0.3142, respectively, which was better than PLSR. With interval wavelength selection, the R², MSE, RPD, and RMSEP of Inv-RF were 0.8965, 0.0902, 2.6195 and 0.2454, respectively, which was better than Inv-PLSR under the same conditions. After interval wavelength selection was used, the prediction performance of both the PLSR model and the RF model improved to some extent.

Fig. 3 shows the predictions of the four models for the validation set. As can be seen from Fig. 3, the scatter plot of the random forest model was more concentrated and closer to the 45-degree regression line than that of PLSR. The closer the points are to the 45-degree line, the better the regression fit. Comparing each model before and after wavelength selection shows that prediction was better after wavelength selection.

4. Conclusion

Wavelet transform and the interval random forest algorithm were combined in this study to analyse the content of sunset yellow in margarine quickly and quantitatively. The FT-NIR spectra of 132 groups of cream were collected using a Fourier transform near-infrared spectrometer. Four different spectral preprocessing methods (MSC, SNV, D1st, WT) were compared, and the FT-NIR spectra of the cream samples were processed by combining various preprocessing methods. WT combined with D1st was selected as the preprocessing method for the analysis of the cream pigment spectrum. Band selection was performed on the NIR spectrum, and the influence of the importance threshold of different bands on the prediction performance of the model was compared. To further explore the prediction performance of the calibration model in this study, it was compared with the PLSR model under the same preprocessing method. The results showed that the calibration model of this study had better prediction performance, with a prediction set R² and RMSEP of 0.8965 and 0.2454, respectively. The results showed that D1st + WT-Inv-RF can accurately and quickly quantify the amount of sunset yellow pigment in margarine. At the same time, this study provides a theoretical basis and technical support for the detection of baked foods and the analysis of other indicators in the field of food supervision.

CRediT authorship contribution statement

Jun Liu, Siqi Sun, Zhenglin Tan and Yang Liu hereby solemnly declare that the submitted paper "Nondestructive detection of sunset yellow in cream based on near-infrared spectroscopy and interval random
forest" is the result of our research work, and there is no intellectual property dispute. The paper is completed by our cooperation. Except for the content quoted in the article, this paper does not contain any work that has been published or written by any other individual or group. We fully understand that the legal results of this statement are borne by us.

Declaration of competing interest

We declare that we have no financial and personal relationships with other people or organizations that can inappropriately influence our work. There is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (61906139, 61172150, 61803286), Hubei Provincial Natural Science Foundation of China under Grant (2019CFB173), the Foundation of Hubei Provincial Key Laboratory of Intelligent Robot (HBIR 201802) and the eleventh Graduate Innovation Fund of Wuhan Institute of Technology (CX2019240, CX2019241).

References

[1] A. Shakeri, V. Soheili, M. Karimi, S.A. Hosseininia, B.S. Fazly Bazzaz, Biological activities of three natural plant pigments and their health benefits, J. Food Meas. Charact. 12 (1) (2017) 356–361, https://doi.org/10.1007/s11694-017-9647-6.
[2] N. Martins, C.L. Roriz, P. Morales, L. Barros, I.C.F.R. Ferreira, Food colorants: challenges, opportunities and current desires of agro-industries to ensure consumer expectations and regulatory practices, Trends Food Sci. Technol. 52 (2016) 1–15, https://doi.org/10.1016/j.tifs.2016.03.009.
[3] R.G. Ackman, S.N. Hooper, Isoprenoid fatty acids in the human diet: distinctive geographical features in butterfats and importance in margarines based on marine oils, Canadian Institute of Food Science & Technology Journal 6 (3) (1973) 159–165.
[4] S.A.H. Goli, et al., The production of an experimental table margarine enriched with conjugated linoleic acid (CLA): physical properties, Journal of the American Oil Chemists Society 86 (5) (2009) 453–458.
[5] Z.T.J.L. Supei Zhang, Determination of the food dye indigotine in cream by nearinfrared spectroscopy technology combined with random forest model, Spectrochimica Acta Part A 2019.
[6] D.P. Song, H. Zhang, L.I. Qi, Comparison of national standards for edible pigments between China and foreign countries and progress on analytical techniques, Food Sci. 35 (3) (2014) 295–300.
[7] J. Wang, et al., Highly sensitive electrochemical determination of Sunset Yellow based on gold nanoparticles/graphene electrode, Anal. Chim. Acta 893 (2015), S0003267015010533.
[8] T. Pocock, M. Król, N.P. Huner, The determination and quantification of photosynthetic pigments by reverse phase high-performance liquid chromatography, thin-layer chromatography, and spectrophotometry, 274 (2004) 137–148.
[9] S. Jiang, et al., NIR-to-visible upconversion nanoparticles for fluorescent labeling and targeted delivery of siRNA, Nanotechnology 20 (15) (2009) 155101.
[10] F. Berset, Percentage of body fat and risk factors of coronary heart disease, Tidsskrift for Den Norske Lgeforening Tidsskrift for Praktisk Medicin Ny Rkke 112 (22) (1992) 2848–2851.
[11] M. Schnaiter, et al., UV-VIS-NIR spectral optical properties of soot and soot-containing aerosols, J. Aerosol Sci. 34 (10) (2003) 1421–1444.
[12] H. Fang, et al., Detection of activity of POD in tomato leaves based on hyperspectral imaging technology, Spectrosc. Spectr. Anal. 32 (8) (2012) 2228.
[13] Y. Liu, X. Sun, A. Ouyang, Nondestructive measurement of soluble solid content of navel orange fruit by visible–NIR spectrometric technique with PLSR and PCA-BPNN 43 (4) (2010) 0–607.
[14] N. Shetty, G. R., Quantification of fructan concentration in grasses using NIR spectroscopy and PLSR, Field Crop Res. 1 (120) (2011) 0–37.
[15] D.M.M. Gila, et al., Rapid quantification of total polyphenol content in EVOO using NIR sensor with wavelength selection and FS-MLR, 2015 IEEE International Conference on Imaging Systems and Techniques (IST), 2015.
[16] J. Liu, et al., Predicting the content of camelina protein using FT-IR spectroscopy coupled with SVM model, Clust. Comput. (2018), https://doi.org/10.1007/s10586-018-1838-3.
[17] M. Yang, et al., Evaluation of machine learning approaches to predict soil organic matter and pH using vis-NIR spectra, Sensors (2019) 19(2).
[18] V. Svetnik, A. Liaw, C. Tong, J.C. Culberson, R.P. Sheridan, B.P. Feuston, Random forest: a classification and regression tool for compound classification and QSAR modeling, J. Chem. Inf. Comput. Sci. 43 (6) (2003) 1947–1958, https://doi.org/10.1021/ci034160g.
[19] C. Strobl, et al., Bias in random forest variable importance measures: illustrations, sources and a solution 8 (1) (2007) (p. 25-0).
[20] D.R. Cutler, J.T.C. E., Random forests for classification in ecology, Ecological Society of America ESA Online Journals (2008), https://doi.org/10.1890/07-0539.1.
[21] N. Said, M. Abdul, Comparison between random forests, artificial neural networks and gradient boosted machines methods of on-line vis-NIR spectroscopy measurements of soil total nitrogen and total carbon, Sensors 17 (10) (2017) 2428.
[22] W. JI, et al., Using different data mining algorithms to predict soil organic matter based on visible-near infrared spectroscopy, Spectroscopy & Spectral Analysis 32 (9) (2012) 2393.
[23] D. Donald, et al., Adaptive wavelet modelling of a nested 3 factor experimental design in NIR chemometrics, Chemometrics & Intelligent Laboratory Systems 82 (1–2) (2006) 122–129.
[24] A. Chemura, M.O.D. T., Remote sensing leaf water stress in coffee (Coffea arabica) using secondary effects of water absorption and random forests, Physics and Chemistry of the Earth, Parts A/B/C (2017), https://doi.org/10.1016/j.pce.2017.02.011.
[25] E. Ercioglu, H.M. Velioglu, I.H. Boyaci, Determination of terpenoid contents of aromatic plants using NIRS, 178 (2018) 716.
[26] A.L.D.O. Antonio José Steidle Neto, A.L.D.A. Lopes, C.L. Ferraza, Non-destructive prediction of pigment content in lettuce based on visible–NIR spectroscopy, Journal of the Science of Food & Agriculture 97 (2017) (2017) 2015–2022.
[27] R.J. Barnes, M.S. Dhanoa, S.J. Lister, Letter: correction to the description of Standard Normal Variate (SNV) and De-Trend (DT) transformations in Practical Spectroscopy with Applications in Food and Beverage Analysis 2nd edition, J. Near Infrared Spectrosc. (1993) 1(1).
[28] T. Fearn, et al., On the geometry of SNV and MSC, Chemometrics & Intelligent Laboratory Systems 96 (1) (2009) 22–26.
[29] P.A. Gorry, General least-squares smoothing and differentiation by the convolution (Savitzky-Golay) method, Anal. Chem. 6 (62) (1990) 570–573.
[30] M. Antonini, et al., Image coding using wavelet transform 1 (2) (1992) 205–220.
[31] Y. Xu, et al., Wavelet transform domain filters: a spatially selective noise filtration technique 3 (6) (1994) 747–758.
[32] R.W. Schafer, What is a Savitzky-Golay filter? [lecture notes], Signal Processing Magazine IEEE 28 (4) (2011) 111–117.
[33] A.S. Lewis, G. Knowles, Image compression using the 2-D wavelet transform, IEEE Trans. Image Process. 1 (2) (2002) 244–250.
[34] A. Saptoro, T.M.O. V., A modified Kennard-Stone algorithm for optimal division of data for developing artificial neural network models, Chem. Prod. Process. Model. 1 (7) (2012) (p. 16-16).
[35] D.D. Claeys, T. Verstraelen, E. Pauwels, et al., Conformational sampling of macrocyclic alkenes using a Kennard-Stone-based algorithm.[J], J. Phys. Chem. A 114 (25) (2010) 6879–6887.
[36] M. Zhu, J. Xia, M.L. Yan, S.Y. Zhang, G.L. Cai, J. Yan, G.M. Ning, Feature selection and optimization of random forest modeling, Appl. Mech. Mater. 687-691 (2014) 1416–1419, https://doi.org/10.4028/www.scientific.net/AMM.687-691.1416.
[37] J. Bin, A.F.F. F., A modified random forest approach to improve multi-class classification performance of tobacco leaf grades coupled with NIR spectroscopy, RSC Adv. 36 (6) (2016) 30353–30361.
[38] D. Sharma, De-Biased Random Forest Variable Selection, Social Science Electronic Publishing, 2011.
[39] K.J. Archer, R.V. Kimes, Empirical characterization of random forest variable importance measures, Computational Statistics & Data Analysis 52 (4) (2008) 2249–2260.
[40] J. Li, et al., Analysis of soil nutrient content based on near infrared reflectance spectroscopy in Beijing region, Transactions of the Chinese Society of Agricultural Engineering 28 (2) (2012) 176–179.
[41] Blakey, J. Robert, Evaluation of avocado fruit maturity with a portable near-infrared spectrometer, Postharvest Biology & Technology 121 (2016) 101–105.
[42] V.J.B.R. Hamilton, Anal. Chem. 1 (69) (1997) 78–90.
[43] K.L. Alex Ander, C.F.G. J., Application of wavelet transform in infrared spectrometry: spectral compression and library search, Chemometrics & Intelligent Laboratory Systems 1-2 (43) (1998) 69–88.
[44] H.Y. Yoo, L.K.W. J., Selecting optimal basis function with energy parameter in image classification based on wavelet coefficients, 대한원격탐사학회지 5 (24) (2008) 437–444.