1 s2.0 S016599361400079X Main
ARTICLE INFO

Keywords: Biochemical fingerprint; Cancer diagnosis; Cancer staging; Data analysis; Machine learning; Sensitivity; Specificity; Spectroscopy; Validation; Vibrational spectroscopy

ABSTRACT

The use of vibrational spectroscopy for diagnosis and staging of cancer is extremely attractive, promising many benefits over the currently used histopathology methods. The hypothesis underlying this approach is that cancers have characteristic biochemical fingerprints that can be captured using spectroscopy. To relate complex multivariate spectra to disease state, machine-learning methods are typically used to recognize diagnostic spectral patterns. This article provides an extensive review of this field. The average diagnostic performance of the reviewed studies is impressive (>90% sensitivity and specificity) but most studies were small (<40 samples). Furthermore, diagnostic performance has often been calculated using methods now known to be overoptimistic. We conclude that, if the combination of spectroscopy and machine learning is to translate into clinical practice, larger studies are needed and researchers should routinely provide spectral data in support of their publications so that the data can be reanalyzed by other groups.
© 2014 Elsevier B.V. All rights reserved.
Contents
1. Introduction
1.1. Machine learning
1.2. Assessment of diagnostic performance
2. Studies by cancer type
2.1. Gastrointestinal cancer
2.1.1. Esophageal cancer
2.1.2. Stomach cancer
2.1.3. Colorectal cancer
2.2. Urological cancer
2.2.1. Prostate cancer
2.2.2. Bladder cancer
2.3. Breast cancer
2.4. Cervical cancer
2.5. Skin tumors
2.6. Lymph-node metastases
2.7. Lung cancer
2.8. Brain tumors
Abbreviations: ANN, Artificial neural networks; CART, Classification and regression trees; CV, Cross validation; DA, Discriminant analysis; IQR, Inter-quartile range; LDA, Linear discriminant analysis; LOOCV, Leave-one-out cross-validation; LR, Logistic regression; MNLR, Multinomial logistic regression; PC-DF, Principal component discriminant function; PCA, Principal components analysis; PLS-DA, Partial least squares discriminant analysis; QDA, Quadratic discriminant analysis; SIMCA, Soft independent modelling of class analogies; SM-LR, Sparse multinomial logistic regression; SVM, Support vector machine; VS, Vibrational spectroscopy.
* Corresponding author. Tel.: +44 (0)20 7882 6510; Fax: +44 (0)20 7882 7732.
E-mail address: c.bessant@qmul.ac.uk (C. Bessant).
http://dx.doi.org/10.1016/j.trac.2014.02.016
0165-9936/© 2014 Elsevier B.V. All rights reserved.
18 M. Sattlecker et al./Trends in Analytical Chemistry 59 (2014) 17–25
Table 1
Summary of all investigated articles, including methods and results. In some studies, multiple tissue samples were derived from one subject and analyzed as
individual samples
Classification Method Test method Spectroscopy method Tissue type Tissue samples Subjects Sens. % Spec. % Year Ref
Fourier transform IR (FTIR) spectroscopy in attenuated total reflection was investigated for premalignant (dysplastic) mucosa in the esophagus. An LDA model achieved a sensitivity of 92.0% and a specificity of 80.0% for Barrett’s specimens sub-classified as non-dysplasia and dysplasia when tested by LOOCV [16].

2.1.2. Stomach cancer

A PC-fed LDA model built for classifying dysplasia from normal gastric tissue based on Raman spectra (44 patients, 76 specimens) achieved a sensitivity of 95.2% and a specificity of 90.9% when assessed by LOOCV [21].

A different machine-learning approach was taken by Teh et al. [11], who investigated classification and regression trees (CART) for differentiating between normal and cancerous gastric tissue specimens (73 tissue samples from 53 patients). A sensitivity of 88.9% and a specificity of 92.9% were estimated when tested with an independent test set.

A three-class model for diagnosing and typing adenocarcinoma in the stomach was built using multinomial logistic regression (MNLR). This model predicted the pathology of 125 tissue specimens (from 72 patients) with sensitivities of 75.0–91.0% and specificities of 80.0–96.0% when assessed by LOOCV [31].

In a study investigating 10 patient samples, Kawabata et al. [33] reported that information derived from PCA can be used to differentiate between cancerous and non-cancerous tissue samples with a sensitivity of 73.0% and a specificity of 73.0%.

Soft independent modeling of class analogies (SIMCA) was employed for prediction of three different stomach pathologies (normal, adenoma and cancer) based on IR spectroscopic measurements. Although the data set was small, consisting of only 11 patient samples, an independent test set was used to evaluate the classification model. The SIMCA model achieved sensitivities of 30.0–87.0% and a specificity of 77.0% [39].

An LDA model was developed by Li et al. [15] for predicting four different stomach-tissue pathologies (healthy, superficial gastritis, atrophic gastritis, and gastric cancer). The developed model achieved sensitivities of 66.0–74.0% and a specificity of 90.0% for healthy samples when assessed by LOOCV.

2.1.3. Colorectal cancer

Although many Raman studies investigated spectral differences between normal and cancerous tissue, only a small number reported development and assessment of a classification model to provide an actual diagnosis. For example, Widjaja et al. [43] developed multi-class SVM models for predicting colon pathology (normal, hyperplastic polyps and adenocarcinoma) using Raman spectra derived from 105 tissue specimens (59 patients). A radial basis function SVM model achieved sensitivities of 99.4–99.9% and a specificity of 99.3% when tested by LOOCV.
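Several of the studies above report a range of sensitivities for a multi-class model tested by LOOCV. As a minimal sketch of how such figures can arise, the Python example below runs leave-one-out cross-validation and then computes one-vs-rest sensitivity and specificity for each class. It is illustrative only and is not the method of any reviewed study: the two-feature "spectra" are invented toy values, the class labels are placeholders, and a simple nearest-centroid rule stands in for the SVM and LDA classifiers discussed in the text.

```python
def class_centroids(X, y):
    """Mean spectrum of each class in the training data."""
    grouped = {}
    for spectrum, label in zip(X, y):
        grouped.setdefault(label, []).append(spectrum)
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in grouped.items()}


def predict(X_train, y_train, x):
    """Stand-in classifier (assumption): assign x to the nearest class centroid."""
    cents = class_centroids(X_train, y_train)
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))


def loocv_predictions(X, y):
    """Leave-one-out: each sample is predicted by a model trained on all others."""
    return [predict(X[:i] + X[i + 1:], y[:i] + y[i + 1:], X[i])
            for i in range(len(X))]


def per_class_sens_spec(y_true, y_pred):
    """One-vs-rest (sensitivity, specificity) for every class."""
    pairs = list(zip(y_true, y_pred))
    out = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        tn = sum(t != cls and p != cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        out[cls] = (tp / (tp + fn), tn / (tn + fp))
    return out


# Invented toy data: three samples per class, two "spectral" features each.
X = [[1.0, 0.1], [0.9, 0.2], [1.1, 0.0],   # normal
     [0.1, 1.0], [0.2, 0.9], [0.0, 1.1],   # polyp
     [1.0, 1.0], [0.9, 1.1], [0.2, 1.0]]   # cancer (last one resembles a polyp)
y = ["normal"] * 3 + ["polyp"] * 3 + ["cancer"] * 3

predictions = loocv_predictions(X, y)
metrics = per_class_sens_spec(y, predictions)
for cls, (sens, spec) in metrics.items():
    print(f"{cls}: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Where several specimens come from the same patient, holding out one specimen at a time can leave spectra from that patient in the training set, so holding out all specimens of one patient together may be a more conservative choice.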
2.2. Urological cancer

2.2.1. Prostate cancer

LDA was applied to distinguish benign from malignant prostate samples (37 patient samples) measured with a fiber-optic probe suitable for laparoscopic and endoscopic use. A sensitivity of 87.0% and a specificity of 84.0% were achieved when the model was assessed by LOOCV [20].

IR spectroscopy was also applied for grading of prostate-cancer tissue specimens. In a study of 39 patients, classification models using a PC discriminant function (PC-DF) analysis achieved an overall sensitivity of 92.3% and a specificity of 99.4% when assessed with an independent test set [34].

2.2.2. Bladder cancer

The predominantly applied machine-learning method in bladder cancer studies is LDA (e.g., it was used to develop a diagnostic model to discriminate between non-tumor and tumor bladder tissue by de Jong et al. [18]). The resulting model, which was built from Raman data obtained from 15 patient samples, yielded a sensitivity of 94.0% and a specificity of 92.0% when tested by LOOCV. In a similar approach, 24 patient samples, representing normal urothelium, cystitis and transitional cell carcinoma tissue, were used to develop a diagnostic LDA model. This classifier achieved a sensitivity of 89.0% and a specificity of 79.0% when tested by LOOCV [20].

More recent work investigated whether combining Raman spectroscopy with fluorescence-guided cystoscopy could improve specificity for diagnostic prediction of bladder biopsies. The LDA model employed, built on data derived from 38 patient samples, achieved a sensitivity of 100% and a specificity of 80.8% when assessed by LOOCV [27].

2.3. Breast cancer

In a Raman study investigating ex-vivo samples from breast tissue (normal, fibrocystic change, fibroadenoma and invasive cancer), a logistic regression (LR) was employed to differentiate between malignant and benign spectra. The model yielded a sensitivity of 94.0% and a specificity of 96.0% when tested by LOOCV [29]. The same machine-learning method was further investigated for its capability to classify freshly resected tissue samples, mimicking an in-vivo application. Thus, 129 tissue sites from 21 patients were measured and their pathology predicted by the LR model. A sensitivity of 83.0% and a specificity of 93.0% were reported [28].

A different classification approach was taken by Moreno et al. [37], who employed quadratic discriminant analysis (QDA) for distinguishing invasive ductal carcinoma (22 patients), fibrocystic breast conditions (six patients) and normal breast tissues (six patients). The QDA model separated normal from altered tissue with a sensitivity of 99.0% and specificity of 98.0% in the training data; unfortunately, no results were reported for test data.

ANN were used to distinguish between IR spectra representing fibroadenoma and ductal carcinoma in situ. ANN were tested with an independent test set and achieved a sensitivity of 93.0% and a specificity of 100% [9].

Micro-calcifications are commonly found in breast tissue and are often an indicator for malignant disease development. In an effort to exploit this, Haka et al. [30] investigated Raman spectroscopy and LR for predicting malignancies in breast tissue based on micro-calcifications. Spectra derived from 11 patient samples were classified with a sensitivity of 88.0% and a specificity of 93.0%. IR spectroscopy was also investigated for the potential to diagnose breast pathology based on micro-calcifications. Pathology-specific patterns (carbonate content and protein matrix:mineral ratios) were used to generate a two-matrix linear discriminant model (LDM) for differentiating between benign, ductal carcinoma in situ and invasive malignancies. In this study, sensitivities of 79.0–90.0% and specificities of 82.0–98.0% were reported [46].

FTIR spectroscopy was investigated to differentiate between types of breast disease based on breast calcifications. An SVM-ensemble classifier was developed and tested with an independent test set; it classified samples with sensitivities of 85.0–100.0% and a specificity of 75.0% [45].

2.4. Cervical cancer

Only a few studies investigated this potential application area of VS. IR Raman spectroscopy in combination with LDA modeling has been investigated for in-vivo diagnostics of cervical cancer. The classifier built by using spectra derived from 46 patients yielded a diagnostic sensitivity of 93.5% and a specificity of 97.8% when tested by LOOCV [24].

Raman spectroscopy has also been investigated for in-vitro diagnosis of cervical pre-cancer. A partial least squares discriminant analysis (PLS-DA) model was employed and tested by leave-two-samples-out cross-validation, achieving a sensitivity of 72.5% and a specificity of 89.2% [32].

2.5. Skin tumors

Skin is the most accessible organ of all, so it is most suitable for non-invasive in-vivo diagnostics using VS. Raman spectroscopy and sparse multinomial LR (SM-LR) have been used to distinguish between normal, basal cell carcinoma, squamous cell carcinoma and melanoma. In this study, based on 39 patients, an overall sensitivity and specificity of 100% in the training data was reported [41].

Based on this study, a Raman hand-held probe was developed and used to measure skin samples in 19 patients in vivo. SM-LR was employed to differentiate between normal and abnormal (basal cell carcinoma, squamous cell carcinoma and inflamed scar tissue) spectra. The assessment by cross-validation achieved a sensitivity of 100% and a specificity of 91.0% [40].

ANN were applied for diagnostic prediction of five different skin-lesion types, including normal skin, pigmented nevi, seborrheic keratosis, basal cell carcinoma and malignant melanoma. In this study, a total of 222 tissue samples were measured by Raman spectroscopy. The resulting spectra were used to build and test ANN by LOOCV, which achieved a sensitivity of 94.2% and a specificity of 98.6% [10].

2.6. Lymph-node metastases

Lymph-node assessment is an important step in staging cancers, especially since it is known that the presence of metastasis carries a worse prognosis for the patient. In order to allow a better assessment of the lymph-node status in breast-cancer patients, Raman spectroscopy has been investigated for a potential intraoperative application. SVMs have been successfully applied to differentiate metastatic lymph-node tissue samples from non-metastatic tissue samples (43 samples) with a sensitivity and specificity of 100% in an independent test set [42].

Similarly, SVMs were investigated by Horsnell et al. [44] for their potential for lymph-node diagnostics using Raman spectroscopy. The model achieved sensitivities of 71.0–81.0% and specificities of 91.0–97.0% when tested with LOOCV. In another approach, 38 lymph nodes were measured with a Raman hand-held probe. The resulting spectra were used to develop a principal component fed LDA model, which achieved a sensitivity of 92.0% and a specificity of 100% when tested by LOOCV [26].

In another study, 103 lymph nodes represented different pathologies, including primary lymph nodes from Hodgkin’s and non-Hodgkin’s lymphomas, and lymph nodes containing metastases from
squamous cell carcinomas and adenocarcinomas. An LDA model, developed for differentiating between these four groups, achieved sensitivities of 75.0–100% and specificities of 86.0–99.0% when tested by LOOCV [23].

Using FTIR spectroscopy, Liu et al. [14] measured 184 freshly removed cervical lymph nodes from 22 patients with papillary thyroid cancer undergoing thyroid surgery with lymph-node dissection. They developed an LDA model to predict metastasis in lymph nodes and tested it by LOOCV. The model achieved a testing sensitivity of 80.3% and a specificity of 91.9%.

2.7. Lung cancer

Only one study reported the investigation of Raman microspectroscopy for the diagnosis and prognosis of non-small cell lung cancer [38]. A total of 62 lung-tissue samples (28 normal, 34 cancerous) derived from 43 patients were analyzed. A random forest classification model was developed and assessed with an independent test set. This model yielded a diagnostic sensitivity of 90.0% and a specificity of 75.0%.

2.8. Brain tumors

Excisional biopsy can be a potential hazard for vulnerable organs, such as the brain. Taking this into account, VS would be an ideal tool for future in-vivo application in brain-tumor diagnostics. In addition, intraoperative application for estimation of tumor during resection would be highly desirable since excessive resection might result in brain damage. Conversely, an incomplete resection can cause the recurrence of the tumor.

Biopsies from three normal adrenal glands, 16 neuroblastomas, five ganglioneuromas, six nerve-sheath tumors, and one pheochromocytoma were collected for a Raman study. PCA and discriminant function analysis were used to build a classification model, which separated the different pathologies with sensitivities of 95.0–100% and specificities of 92.3–100% [36].

The same group investigated in a similar manner whether PCA and a discriminant model built using Raman spectra obtained from frozen samples would be capable of predicting the pathology of fresh samples. This classification approach yielded sensitivities of 80.8–100% and specificities of 64.3–100% [35].

In a study of 20 patients, an LDA model was applied to discriminate meningioma from normal dura. The resulting LDA model achieved a sensitivity of 100% and a specificity of 100% when assessed by LOOCV [25].

IR spectroscopy has also been investigated for diagnosis of brain tumors. For example, Beleites et al. applied a classifier system, consisting of a genetic algorithm for feature selection and an LDA model, for discriminating cancerous brain tissue (astrocytoma, glioblastoma) from normal brain tissue (a total of 59 tissue specimens). The developed classifier separated the IR spectra into four distinctive groups with sensitivities of 17.0–71.0% and a specificity of 95% when tested using four-fold cross-validation [12].

LDA was the most frequently applied method for developing classification models since it was used in 18 out of the 40 reviewed publications (45%).

The second most popular method was SVMs, used in about 10% of the studies, followed by LR used in 7% of studies, PC-DF used in 7% of studies, SM-LR in 5% of studies and ANN in 5% of the studies. In the remaining 20% of studies, varying methods were used, including CART, MNLR, PLS-DA, QDA, random forest, SIMCA and two-matrix discriminant analysis.

The prevalence of the various machine-learning methods is shown in Fig. 1. The number of publications in the field has increased year
on year, but we were unable to discern any significant trends in the popularity of individual machine-learning methods over time.

Fig. 1. Chart showing the frequency of machine-learning methods applied in 40 studies reporting the use of vibrational spectroscopy for cancer diagnostics.

The frequent use of LDA is most probably due to its easy applicability, since the optimization and development of LDA models is simple and requires little computing time and power. Another reason might be that, by using LDA, it is easy to identify what the discrimination is based upon in the spectral domain, thus enabling greater understanding of the disease-related changes in the spectra.

The overall median sensitivity across classification models reported in the reviewed studies was 90.2% (IQR = 11.2).

Examining the performance of individual methods showed that SM-LR achieved the highest median sensitivity of 100% (IQR = 0). ANN yielded the second-best median sensitivity of 93.6% (IQR = 0.6), followed by PC-DF with a median sensitivity of 92.3% (IQR = 3.5), LDA with a median sensitivity of 89.8% (IQR = 8.1), LR with a median sensitivity of 88.0% (IQR = 5.5), and SVM with a median sensitivity of 87.7% (IQR = 23.8). All classification methods achieved sensitivity well above 80% (Fig. 2A).

In comparison, the best median specificity of 99.3% (IQR = 0.7) was achieved by ANN. The second-best median specificity of 96.2% (IQR = 8.6) was achieved by PC-DF, followed by SVM with a median specificity of 96.1% (IQR = 6.2), SM-LR with a median specificity of 95.5% (IQR = 4.5), LR with a median specificity of 93.0% (IQR = 1.5) and LDA with a median specificity of 92.3% (IQR = 8.8). The overall median specificity was 93.0% (IQR = 9.7), so the overall specificity was higher than the overall sensitivity (Fig. 2B).

Fig. 2. A. Boxplots showing the sensitivity of machine-learning methods used in the reviewed work. Highest sensitivity was achieved by SM-LR. B. Boxplots showing the specificity of machine-learning methods. Strongest specificity was achieved by ANN.

Interestingly, non-linear methods, such as SVMs, demonstrated lower performance than simpler methods, such as LDA or LR. This might be because a simple classifier performs sufficiently well if sample groups are relatively easily separable. According to Occam’s razor, there is no reason to use a more complex classifier in this case. By contrast, when facing a more complex classification problem, methods such as SVMs might be applied because simpler methods, such as LDA, have failed.

Availability of data is another confounding factor because simple models are easier to train, so simple classification methods can outperform complex methods when the quantity of training data is a limiting factor. The methods used to test classification models also vary from paper to paper, so comparisons of reported performance must be treated with caution. Indeed, in the absence of the original spectral data used in these studies, it is impossible to make an objective comparison between the machine-learning methods employed.

There is a small number of studies in which different machine-learning methods were applied to the same data and performance reported in a consistent way. For example, we have demonstrated that LDA can perform well on classifying cancerous from non-cancerous spectra (sensitivity = 100%, specificity = 91.9%), but SVMs performed better (sensitivity = 100%, specificity = 100%) on the same data set [42].

4. Sample sizes

Across the reviewed studies, we found that the median number of subjects was 38. With the exception of the 2004 skin-cancer study [10], which involved 222 subjects, this number varied very little (inter-quartile range of 32 across all reviewed studies). In several studies, multiple tissue samples were taken from each patient, so the median number of tissue samples was higher at 61 (IQR = 61). Clearly, such a low sample number limits the applicability of a diagnostic model, particularly because of the highly multivariate nature of spectral data. Interestingly, there is no discernible trend in sample size over time.

In other research areas dealing with highly multivariate data (e.g., genome-wide association studies), it is now commonplace to increase sample numbers by sharing data. This is currently not common practice among research groups using VS for cancer diagnostics. This may be due to the assumption that it could be difficult to combine data from different datasets because of the wide variety of laboratory protocols and spectroscopic instrumentation used in different laboratories. However, this is strong motivation for data sharing: comparison of data from different laboratories would allow inter-instrument variation to be characterized and would help
the community to develop solutions. To date, the only study of this type has been performed by applying simulated data artifacts to independent test data that were then presented to an already trained classification model [47]. In that study, classification models were generally found to be able to deal with a significant amount of the instrument-to-instrument variation.

5. Prevalence of model-testing methods

As already mentioned, the choice of methods used to determine diagnostic performance can significantly affect sensitivity and specificity reported for a given study. The prevalence of testing methods used in the literature is shown in Fig. 3.

Fig. 3. Chart showing the frequency with which different testing methods were used. Notably, LOOCV is the most commonly applied method. No group reported the application of bootstrapping for testing classification models.

The majority (61%) of reviewed studies tested their diagnostic models using cross-validation. Some 23 studies used LOOCV, one used two-fold cross-validation and one study used four-fold cross-validation.

However, Westerhuis et al. [8] showed with permutation tests that improper use of cross validation leads to an overly optimistic assessment of diagnostic performance. Interestingly, none of the investigated studies used bootstrapping as a test approach; only cross-validation, independent test set or no validation at all were reported in the literature reviewed.

One reason why LOOCV is widely applied might be the generally low sample numbers in the studies reviewed. Splitting such a relatively small data set into a training set and independent test set might result in a training set that is too small to develop a stable classifier, and the limited size of the test set would inevitably result in a coarse estimate of model performance.

However, bootstrap resampling is well suited to low sample numbers and we would encourage the use of bootstrapping where people are currently using LOOCV.

Some 22% of all reviewed studies used an independent test set to assess their diagnostic models. Generally, this is the most thorough way of testing a diagnostic model, if a representative test set can be created. Even stricter testing would require a test set drawn from an independent cohort, which none of the reviewed studies used. Combining data from different studies would be one way to attempt this, if such data were made publicly available.

Six groups reported only training results. This is bad practice, because it does not give any indication of a model’s power in predicting unknown samples. It might demonstrate that diagnostic groups are separable, but this must be taken with caution because powerful non-linear classification methods, such as ANN, can easily be trained to separate samples within any training set (assuming there are no samples with identical spectra assigned to different groups) with 100% accuracy but fail when tested on unseen samples. Diagnostic potential will therefore not have been demonstrated.

6. Conclusions and recommendations

The median sensitivity across all reviewed studies was 90.2% and the median specificity 93.0%, with these metrics reaching 100% in some cases. These results suggest that VS coupled with machine learning has considerable promise for cancer diagnostics, and raise the question of why it is not already being translated into clinical practice. Here, we identify three potential obstacles to clinical uptake, and discuss how they might be overcome.

6.1. Histopathology informs the training data

In the studies reviewed, machine learning was carried out using training data for samples from which the “correct” diagnosis was
24 M. Sattlecker et al./Trends in Analytical Chemistry 59 (2014) 17–25
determined by histopathology. This means that the spectroscopic [11] S.K. Teh, W. Zheng, K.Y. Ho, M. Teh, K.G. Yeoh, Z. Huang, Diagnosis of gastric
cancer using near-infrared Raman spectroscopy and classification and regression
approach can only ever be as good as histopathology, which is un-
tree techniques, J. Biomed. Opt. 13 (2008) 034013.
likely to be 100% accurate due to inter-observer variability [1,2]. If [12] C. Beleites, G. Steiner, M.G. Sowa, R. Baumgartner, S. Sobottka, G. Schackert, et al.,
the diagnostic accuracy of spectroscopy is to exceed histopathol- Classification of human gliomas by infrared imaging spectroscopy and
ogy, then more accurately diagnosed training samples are re- chemometric image processing, Vib. Spectrosc. 38 (2005) 143–149.
[13] C. Kendall, J. Day, J. Hutchings, B. Smith, N. Shepherd, H. Barr, et al., Evaluation
quired, perhaps derived from patient-outcome data or molecular of Raman probe for oesophageal cancer diagnostics, Analyst 135 (2010)
biology. 3038–3041.
6.2. Small study sizes
The number of samples used was small in all but one of the studies reviewed, and that always raises questions about the validity of findings, especially with highly multivariate data, such as spectra. It is to be hoped that the promising diagnostic performance demonstrated by multiple small studies can be used to justify investment in larger studies that will carry more weight. In the absence of such larger studies, if research groups in this field routinely shared their data, then spectra from similar studies could be combined to create larger datasets, to which machine-learning methods could be applied.
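To make the sample-size concern concrete, the sketch below computes a 95% Wilson score interval for an observed sensitivity of 90% at two hypothetical study sizes. The function name `wilson_interval`, the sample counts, and the choice of the Wilson interval are illustrative assumptions and are not taken from the reviewed studies.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Hypothetical counts: 90% observed sensitivity with 20 and with 200 cancer samples.
for detected, total in [(18, 20), (180, 200)]:
    lo, hi = wilson_interval(detected, total)
    print(f"{detected}/{total}: sensitivity 95% CI = [{lo:.2f}, {hi:.2f}]")
```

Under these assumed counts, the interval for the 20-sample case is roughly three times as wide as for the 200-sample case, which is one way to quantify why larger or pooled datasets carry more weight.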
6.3. Limited model testing
The almost total reliance on cross validation (especially LOOCV) will be of concern to most machine-learning practitioners, as this is likely to lead to overly optimistic performance metrics that are rarely matched in clinical practice.
To achieve more representative metrics, we advise the use of bootstrapping instead of cross validation, and strongly recommend permutation testing to determine the statistical significance of the sensitivity and specificity values obtained. Again, availability of data from existing studies would be most helpful, as it would allow reanalysis using state-of-the-art validation methodology.
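As a rough illustration of the two recommendations, the sketch below estimates sensitivity and specificity from out-of-bag bootstrap samples and then runs a label-permutation test on the resulting balanced accuracy. Everything in it is an assumption made for illustration: the synthetic five-variable "spectra", the nearest-centroid stand-in classifier, the out-of-bag variant of the bootstrap, and all function names. It is not the procedure used in any of the reviewed studies.

```python
import random

def fit_centroids(X, y):
    """Stand-in classifier: store the mean spectrum of each class."""
    cents = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def predict(cents, x):
    """Assign x to the class with the nearest centroid (squared Euclidean)."""
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

def oob_bootstrap(X, y, n_boot, rng):
    """Pooled out-of-bag sensitivity and specificity over n_boot resamples."""
    n = len(y)
    tp = fn = tn = fp = 0
    for _ in range(n_boot):
        train = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        if len({y[i] for i in train}) < 2:
            continue  # resample lacks one class; skip it
        in_bag = set(train)
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        for i in range(n):
            if i in in_bag:
                continue  # test only on samples the model has not seen
            pred = predict(model, X[i])
            if y[i] == 1:
                tp += pred == 1
                fn += pred == 0
            else:
                tn += pred == 0
                fp += pred == 1
    return tp / max(tp + fn, 1), tn / max(tn + fp, 1)

def permutation_p_value(X, y, n_boot, n_perm, rng):
    """Fraction of label shuffles scoring at least as well as the true labels."""
    observed = sum(oob_bootstrap(X, y, n_boot, rng)) / 2  # balanced accuracy
    hits = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        if sum(oob_bootstrap(X, y_perm, n_boot, rng)) / 2 >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic data: 20 "cancer" (1) and 20 "normal" (0) samples, 5 variables each.
rng = random.Random(0)
y = [1] * 20 + [0] * 20
X = [[rng.gauss(1.5 if label else 0.0, 1.0) for _ in range(5)] for label in y]

sens, spec = oob_bootstrap(X, y, n_boot=50, rng=rng)
score, p_value = permutation_p_value(X, y, n_boot=20, n_perm=99, rng=rng)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
print(f"balanced accuracy={score:.2f} permutation p={p_value:.3f}")
```

Because each model is tested only on samples left out of its own resample, the estimate does not reuse training spectra for testing. The permutation step gives a reference distribution for what the same pipeline achieves when the labels carry no information.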
Acknowledgments
The genesis of this work was financially supported by Cranfield University and Gloucestershire Hospitals NHS Foundation Trust. Nick Stone was funded by an NIHR Career Scientist Research Fellowship.
[28] A.S. Haka, Z. Volynskaya, J.A. Gardecki, J. Nazemi, R. Shenk, N. Wang, et al.,
Diagnosing breast cancer using Raman spectroscopy: prospective analysis, J.
References Biomed. Opt. 14 (2009) 054023.
[29] A.S. Haka, K.E. Shafer-Peltier, M. Fitzmaurice, J. Crowe, R.R. Dasari, M.S. Feld,
Diagnosing breast cancer by using Raman spectroscopy, Proc. Natl Acad. Sci.
[1] C. Kendall, M. Isabelle, F. Bazant-Hegemark, J. Hutchings, L. Orr, J. Babrah, et al., U.S.A. 102 (2005) 12371–12376.
Vibrational spectroscopy: a clinical tool for cancer diagnostics, Analyst 134 [30] A.S. Haka, K.E. Shafer-Peltier, M. Fitzmaurice, J. Crowe, R.R. Dasari, M.S. Feld,
(2009) 1029–1045. Identifying microcalcifications in benign and malignant breast lesions by probing
[2] E. Montgomery, M.P. Bronner, J.R. Goldblum, J.K. Greenson, M.M. Haber, J. Hart, differences in their chemical composition using Raman spectroscopy, Cancer
et al., Reproducibility of the diagnosis of dysplasia in Barrett esophagus: a Res. 62 (2002) 5375–5380.
reaffirmation, Hum. Pathol. 32 (2001) 368–378. [31] S.K. Teh, W. Zheng, K.Y. Ho, M. Teh, K.G. Yeoh, Z. Huang, Near-infrared Raman
[3] R.G. Brereton, Chemometrics for Pattern Recognition, Wiley-Blackwell, Oxford, spectroscopy for early diagnosis and typing of adenocarcinoma in the stomach,
2009. Br. J. Surg. 97 (2010) 550–557.
[4] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning : Data [32] S. Duraipandian, W. Zheng, J. Ng, J.J. Low, A. Ilancheran, Z. Huang, In vivo
Mining, Inference, and Prediction, second ed., Springer, New York, 2009. diagnosis of cervical precancer using Raman spectroscopy and genetic algorithm
[5] J. Trevisan, P.P. Angelov, P.L. Carmichael, A.D. Scott, F.L. Martin, Extracting techniques, Analyst 136 (2011) 4328–4336.
biological information with computational analysis of Fourier-transform infrared [33] T. Kawabata, H. Kikuchi, S. Okazaki, M. Yamamoto, Y. Hiramatsu, J. Yang, et al.,
(FTIR) biospectroscopy datasets: current practices to future perspectives, Analyst Near-infrared multichannel Raman spectroscopy with a 1064 nm excitation
137 (2012) 3202–3215. wavelength for ex vivo diagnosis of gastric cancer, J. Surg. Res. 169 (2011)
[6] J.G. Kelly, J. Trevisan, A.D. Scott, P.L. Carmichael, H.M. Pollock, P.L. Martin-Hirsch, e137–e143.
et al., Biospectroscopy to metabolically profile biomolecular structure: a [34] M.J. Baker, E. Gazi, M.D. Brown, J.H. Shanks, P. Gardner, N.W. Clarke, FTIR-based
multistage approach linking computational analysis with biomarkers, J. spectroscopic analysis in the identification of clinically aggressive prostate
Proteome Res. 10 (2011) 1437–1448. cancer, Br. J. Cancer 99 (2008) 1859–1866.
[7] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation [35] H. Wills, R. Kast, C. Stewart, R. Rabah, A. Pandya, J. Poulik, et al., Raman
and model selection, IJCAI 14 (1995), 1137–1145. spectroscopy detects and distinguishes neuroblastoma and related tissues in
[8] J.A. Westerhuis, H.C.J. Hoefsloot, S. Smit, D.J. Vis, A.K. Smilde, E.J.J. van Velzen, fresh and (banked) frozen specimens, J. Pediatr. Surg. 44 (2009) 386–
et al., Assessment of PLSDA cross validation, Metabolomics 4 (2008) 81–89. 391.
[9] H. Fabian, N.A. Thi, M. Eiden, P. Lasch, J. Schmitt, D. Naumann, Diagnosing benign [36] R. Rabah, R. Webera, G.K. Serhatkulua, A. Caoa, H. Daia, A. Pandyaa, et al.,
and malignant lesions in breast tissue sections by using IR-microspectroscopy, Diagnosis of neuroblastoma and ganglioneuroma using Raman spectroscopy,
Biochim. Biophys. Acta 1758 (2006) 874–882. J. Pediatr. Surg. 43 (2007) 171–176.
[10] S. Sigurdsson, P.A. Philipsen, L.K. Hansen, J. Larsen, M. Gniadecka, H.C. Wulf, [37] M. Moreno, L. Raniero, E.A. Loschiavo Arisawa, A.M. do Espirito Santo, E.A.
Detection of skin cancer by classification of Raman spectra, IEEE Trans. Biomed Pereira dos Santos, R.A. Bitar, et al., Raman spectroscopy study of breast disease,
Eng. 51 (2004) 1784–1793. Theor. Chem. Acc. 125 (2010) 329–334.
M. Sattlecker et al./Trends in Analytical Chemistry 59 (2014) 17–25 25
[38] N.D. Magee, J.R. Beattie, C. Carland, R. Davis, K. McManus, I. Bradbury, et al., [43] E. Widjaja, W. Zheng, Z. Huang, Classification of colonic tissues using near-
Raman microscopy in the diagnosis and prognosis of surgically resected infrared Raman spectroscopy and support vector machines, Int. J. Oncol. 32
nonsmall cell lung cancer, J. Biomed. Opt. 15 (2010) 026015. (2008) 653–662.
[39] S.C. Park, S.J. Lee, H. Namkung, H. Chung, S.-H. Han, M.-Y. Yoon, et al., Feasibility [44] J.D. Horsnell, J.A. Smith, M. Sattlecker, A. Sammon, J. Christie-Brown, C. Kendall,
study for diagnosis of stomach adenoma and cancer using IR spectroscopy, Vib. et al., Raman spectroscopy – a potential new method for the intra-operative
Spectrosc. 44 (2007) 279–285. assessment of axillary lymph nodes, Surgeon. 10 (2012) 123–127.
[40] C.A. Lieber, S.K. Majumder, D.L. Ellis, D.D. Billheimer, A. Mahadevan-Jansen, In [45] M. Sattlecker, R. Baker, N. Stone, C. Bessant, Support vector machine ensembles
vivo nonmelanoma skin cancer diagnosis using Raman microspectroscopy, for breast cancer type prediction from mid-FTIR micro-calcification spectra,
Lasers Surg. Med. 40 (2008) 461–467. Chemom. Intell. Lab. Syst. 107 (2011) 363–370.
[41] C.A. Lieber, S.K. Majumder, D. Billheimer, D.L. Ellis, A. Mahadevan-Jansen, Raman [46] R. Baker, K.D. Rogers, N. Shepherd, N. Stone, New relationships between breast
microspectroscopy for skin cancer detection in vitro, J. Biomed. Opt. 13 (2008) microcalcifications and cancerl, Br. J. Cancer 103 (2010) 1034–1039.
024013. [47] M. Sattlecker, N. Stone, J. Smith, C. Bessant, Assessment of robustness and
[42] M. Sattlecker, C. Bessant, J. Smith, N. Stone, Investigation of support vector transferability of classification models built for cancer diagnostics using Raman
machines and Raman spectroscopy for lymph node diagnostics, Analyst 135 spectroscopy, J. Raman Spectrosc. 42 (2011) 897–903.
(2010) 895–901.