Trends in Analytical Chemistry 59 (2014) 17–25

Contents lists available at ScienceDirect

Trends in Analytical Chemistry


Journal homepage: www.elsevier.com/locate/trac

Current trends in machine-learning methods applied to spectroscopic cancer diagnosis
Martina Sattlecker a,b, Nicholas Stone c, Conrad Bessant d,*
a King’s College London, Institute of Psychiatry, London, UK
b NIHR Biomedical Research Centre for Mental Health and Biomedical Research Unit for Dementia at South London and Maudsley NHS Foundation Trust, London, UK
c School of Physics, University of Exeter, Exeter, UK
d School of Biological and Chemical Sciences, Queen Mary University of London, London, UK

Article info

Keywords: Biochemical fingerprint; Cancer diagnosis; Cancer staging; Data analysis; Machine learning; Sensitivity; Specificity; Spectroscopy; Validation; Vibrational spectroscopy

Abstract

The use of vibrational spectroscopy for diagnosis and staging of cancer is extremely attractive, promising many benefits over the currently used histopathology methods. The hypothesis underlying this approach is that cancers have characteristic biochemical fingerprints that can be captured using spectroscopy. To relate complex multivariate spectra to disease state, machine-learning methods are typically used to recognize diagnostic spectral patterns. This article provides an extensive review of this field. The average diagnostic performance of the reviewed studies is impressive (>90% sensitivity and specificity) but most studies were small (<40 samples). Furthermore, diagnostic performance has often been calculated using methods now known to be overoptimistic. We conclude that, if the combination of spectroscopy and machine learning is to translate into clinical practice, larger studies are needed and researchers should routinely provide spectral data in support of their publications so that the data can be reanalyzed by other groups.

© 2014 Elsevier B.V. All rights reserved.

Contents

1. Introduction
1.1. Machine learning
1.2. Assessment of diagnostic performance
2. Studies by cancer type
2.1. Gastrointestinal cancer
2.1.1. Esophageal cancer
2.1.2. Stomach cancer
2.1.3. Colorectal cancer
2.2. Urological cancer
2.2.1. Prostate cancer
2.2.2. Bladder cancer
2.3. Breast cancer
2.4. Cervical cancer
2.5. Skin tumors
2.6. Lymph-node metastases
2.7. Lung cancer
2.8. Brain tumors

Abbreviations: ANN, Artificial neural networks; CART, Classification and regression trees; CV, Cross validation; DA, Discriminant analysis; IQR, Inter-quartile range; LDA, Linear discriminant analysis; LOOCV, Leave-one-out cross-validation; LR, Logistic regression; MNLR, Multinomial logistic regression; PC-DF, Principal component discriminant function; PCA, Principal components analysis; PLS-DA, Partial least squares discriminant analysis; QDA, Quadratic discriminant analysis; SIMCA, Soft independent modelling of class analogies; SM-LR, Sparse multinomial logistic regression; SVM, Support vector machine; VS, Vibrational spectroscopy.
* Corresponding author. Tel.: +44 (0)20 7882 6510; Fax: +44 (0)20 7882 7732.
E-mail address: c.bessant@qmul.ac.uk (C. Bessant).

http://dx.doi.org/10.1016/j.trac.2014.02.016
0165-9936/© 2014 Elsevier B.V. All rights reserved.

3. Machine-learning methods: prevalence and performance
4. Sample sizes
5. Prevalence of model-testing methods
6. Conclusions and recommendations
6.1. Histopathology informs the training data
6.2. Small study sizes
6.3. Limited model testing
Acknowledgments
References

1. Introduction

Histopathology is currently the “gold standard” technique for diagnosis and staging across all types of cancer. Typically, tissue samples are taken from patients and examined by pathologists using various staining techniques. This approach has several limitations, including delays in providing diagnostic results and the potential for inter-observer disagreement [1,2]. To overcome these limitations, new methods are needed to allow rapid, non-invasive and high-throughput diagnosis. Vibrational spectroscopic techniques, especially infrared (IR) and Raman spectroscopy, exhibit the potential to overcome these limitations and provide an additional way of diagnosing and staging cancer by providing a biochemical profile of the tissue that varies according to whether or not cancer is present [1].

1.1. Machine learning

Because of the complex multivariate data that they produce, the adoption of IR or Raman methods as routine techniques for cancer diagnostics will depend strongly on the efficacy of data-analysis methods that can recognize spectroscopic patterns specific to cancer. Many different data-analysis techniques are used for this purpose, collectively referred to as machine-learning methods because they are trained to recognize diagnostic patterns by being presented with examples of data acquired from samples of known disease state [3,4]. Machine-learning techniques can be roughly broken down into methods that are built upon a statistical foundation, such as linear discriminant analysis (LDA) and partial least squares discriminant analysis (PLS-DA), and purer computational methods, such as support vector machines (SVMs), artificial neural networks (ANNs) and random forests. Each technique has its pros and cons in terms of ultimate performance (complex methods, like ANNs, should perform better, in theory) and optimization challenges (training and optimization are usually easier with statistical methods).

For our purposes, the ultimate aim of each type of machine-learning method is the same: to produce a mathematical relationship (commonly referred to as a model) that can classify a given sample into the correct disease state based solely on the spectroscopic data acquired from it. Depending on the application, classification may be between two states, separating healthy and diseased samples, or may involve multiple classes representing different cancer stages or sub-types. Details on how classification models are built for vibrational spectroscopy (VS) data have been described by Trevisan et al. [5] and Kelly et al. [6].

1.2. Assessment of diagnostic performance

Today, applying a machine-learning technique to a dataset is not technically challenging as off-the-shelf software tools are available for this purpose. However, using these techniques appropriately is paramount if meaningful results are to be obtained. Of particular importance is the way in which the diagnostic performance of a model is calculated. This is often referred to as model validation, although it rarely comes close to the experimental validation that would be needed to confirm the efficacy of a new method for clinical use. Indeed, the model is simply tested by applying it to data that were not used during the training stage, to see how many samples from this unseen data can be correctly identified. There are three main ways of doing this: cross validation, bootstrapping, and independent testing.

In the cross-validation approach, the data set is divided into k subsets (k-fold cross-validation, or leave-one-out cross-validation (LOOCV) when k = N, the number of samples). Each subset in turn is removed from the data set to form the test set, with all remaining samples being used to build the model. This procedure is repeated until each sample in the data set has been used once in the test set. Typically, k is selected based on the number of available samples, as too large a k may result in training instability, especially in small data sets [7]. Although popular, cross validation has been shown to produce overly optimistic performance metrics [8].

Bootstrapping is widely regarded as a less biased method, as it randomly selects a pre-defined number of samples for the test set. This procedure is repeated multiple times, so each sample can be represented various times in the test set.

The third validation method is the use of an independent test set. In this approach, a significant proportion of the samples is removed from the acquired data; these play no part in the model development and serve solely as a test set. The difficulty with the independent testing approach is constructing a representative test set, especially if the total number of samples available is small.

In this review, we consider the machine-learning approaches and associated validation methods applied to cancer diagnostics using VS over the past decade. The papers included in the review are summarized in Table 1. We considered only studies that reported sensitivity and specificity, as these measures are independent of disease prevalence in the study population, so they have the potential to be directly compared. Sensitivity is the percentage of positives (in this case cancerous samples) that are correctly identified as such. Conversely, specificity is the percentage of negatives that are correctly identified as such. For consistency, we quote sensitivity and specificity rounded to one decimal place. For multi-class studies, we quote the range of performance values obtained across the classes.

2. Studies by cancer type

2.1. Gastrointestinal cancer

2.1.1. Esophageal cancer

In the majority of Raman studies investigating this cancer type, LDA has been used for building diagnostic classification models. For example, principal component analysis (PCA)-fed LDA achieved sensitivities of 84.0–97.0% and specificities of 93.0–99.0% in a three-group classification approach when assessed by LOOCV [22]. In a more recent approach, Bergholt et al. [19] investigated 75 tissue samples from 27 patients for in-vivo diagnosis of esophageal cancer. They used the gathered spectra to generate an LDA model and tested the resulting model with LOOCV, which resulted in a sensitivity of 97.0% and a specificity of 95.2%. Further studies using LDA for predicting esophageal cancer are shown in Table 1.

Table 1
Summary of all investigated articles, including methods and results. In some studies, multiple tissue samples were derived from one subject and analyzed as individual samples.

Method         Test method   Spectroscopy       Tissue type           Samples Subjects Sens. (%) Spec. (%) Year Ref.
ANN            Independent   FTIR               Breast                22      22       93.0      100       2006 [9]
ANN            LOOCV         Raman              Skin                  222     222      94.2      98.6      2004 [10]
CART           Independent   Raman              Stomach               73      53       88.9      92.9      2008 [11]
LDA            4-fold CV     FTIR               Brain                 59      57       17.0–71.0 95.0      2005 [12]
LDA            Independent   Raman              Esophagus             114     49       66.0–84.0 81.0–96.0 2010 [13]
LDA            LOOCV         FTIR               Lymph nodes           184     22       80.3      91.9      2011 [14]
LDA            LOOCV         FTIR               Stomach               103     103      66.0–74.0 90.0      2005 [15]
LDA            LOOCV         FTIR               Esophagus             98      32       92.0      80.0      2007 [16]
LDA            LOOCV         Raman              Esophagus             87      44       84.0–97.0 93.0–99.0 2003 [17]
LDA            LOOCV         Raman              Bladder               15      15       94.0      92.0      2006 [18]
LDA            LOOCV         Raman              Esophagus             75      27       97.0      95.2      2011 [19]
LDA            LOOCV         Raman              Prostate              38      37       87.0      84.0      2005 [20]
LDA            LOOCV         Raman              Stomach               76      44       95.2      90.9      2008 [21]
LDA            LOOCV         Raman              Bladder               12      12       88.0      97.5      2002 [22]
LDA            LOOCV         Raman              Esophagus             89      44       84.0–97.0 93.0–99.0 2002 [22]
LDA            LOOCV         Raman              Lymph node metastasis 103     103      75.0–100  86.0–99.0 2010 [23]
LDA            LOOCV         Raman              Cervix                46      46       93.5      97.8      2009 [24]
LDA            LOOCV         Raman              Brain                 38      20       100       100       2005 [25]
LDA            LOOCV         Raman              Lymph node metastasis 38      20       92.0      100       2010 [26]
LDA            LOOCV         Raman              Bladder               29      24       89.0      79.0      2005 [20]
LDA            LOOCV         Raman-Fluorescence Bladder               92      38       100       80.8      2009 [27]
LR             Independent   Raman              Breast                121     21       83.0      93.0      2009 [28]
LR             LOOCV         Raman              Breast                149     58       94.0      96.0      2005 [29]
LR             Train result  Raman              Breast                90      11       88.0      93.0      2002 [30]
MNLR           LOOCV         Raman              Stomach               125     72       75.0–91.0 80.0–96.0 2010 [31]
PLS-DA         2-fold CV     Raman              Cervix                57      29       72.5      89.2      2011 [32]
PCA            Train result  Raman              Stomach               12      10       73.0      73.0      2011 [33]
PC-DF          Independent   FTIR               Prostate              40      39       92.3      99.4      2008 [34]
PC-DF          Train result  Raman              Brain                 62      31       80.8–100  64.3–100  2009 [35]
PC-DF          Train result  Raman              Brain                 31      31       95.0–100  92.3      2007 [36]
QDA            Train result  Raman              Breast                34      34       99.0      98.0      2010 [37]
Random forest  Independent   Raman              Lung                  34      34       90.0      75.0      2010 [38]
SIMCA          Independent   FTIR               Stomach               11      11       30.0–87.0 77.0      2007 [39]
SM-LR          LOOCV         Raman              Skin                  42      19       100       91.0      2008 [40]
SM-LR          Training only Raman              Skin                  39      39       100       100       2008 [41]
SVM            Independent   Raman              Lymph node            43      43       100       100       2010 [42]
SVM            LOOCV         Raman              Colorectal            105     59       99.4–99.9 99.3      2008 [43]
SVM            LOOCV         Raman              Lymph node            59      58       71.0–81.0 91.0–97.0 2012 [44]
SVM ensemble   Independent   FTIR               Breast cancer         71      71       85.0–100  75.0      2011 [45]
Two-matrix DA  LOOCV         Raman              Breast                236     110      79.0–90.0 82.0–98.0 2010 [46]
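
The prevalence figures discussed in Section 3 can be recomputed from the first column of Table 1. The sketch below, in plain Python, tallies the classification methods row by row. The variable names are illustrative, and the list is transcribed from the 40 rows of the table above.

```python
from collections import Counter

# Classification-method column of Table 1, one entry per row.
methods = (
    ["ANN"] * 2 + ["CART"] + ["LDA"] * 18 + ["LR"] * 3 + ["MNLR"]
    + ["PLS-DA"] + ["PCA"] + ["PC-DF"] * 3 + ["QDA"] + ["Random forest"]
    + ["SIMCA"] + ["SM-LR"] * 2 + ["SVM"] * 3 + ["SVM ensemble"]
    + ["Two-matrix DA"]
)

counts = Counter(methods)
total = len(methods)

# Share of table rows per method, in percent.
share = {method: 100 * n / total for method, n in counts.items()}

for method, n in counts.most_common():
    print(f"{method:<14} {n:>2}  {share[method]:.1f}%")
```

Counting the SVM ensemble together with the three plain SVM rows gives 4 of 40 rows, which matches the "about 10%" quoted for SVMs in Section 3.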

Fourier transform IR (FTIR) spectroscopy in attenuated total reflection was investigated for premalignant (dysplastic) mucosa in the esophagus. An LDA model achieved a sensitivity of 92.0% and a specificity of 80.0% for Barrett’s specimens sub-classified as non-dysplasia and dysplasia when tested by LOOCV [16].

2.1.2. Stomach cancer

A PC-fed LDA model built for classifying dysplasia from normal gastric tissue based on Raman spectra (44 patients, 76 specimens) achieved a sensitivity of 95.2% and a specificity of 90.9% when assessed by LOOCV [21].

A different machine-learning approach was taken by Teh et al. [11], who investigated classification and regression trees (CART) for differentiating between normal and cancerous gastric tissue specimens (73 tissue samples from 53 patients). A sensitivity of 88.9% and a specificity of 92.9% were estimated when tested with an independent test set.

A three-class model for diagnosing and typing adenocarcinoma in the stomach was built using multinomial logistic regression (MNLR). This model predicted the pathology of 125 tissue specimens (from 72 patients) with sensitivities of 75.0–91.0% and specificities of 80.0–96.0% when assessed by LOOCV [31].

In a study investigating 10 patient samples, Kawabata et al. [33] reported that information derived from PCA can be used to differentiate between cancerous and non-cancerous tissue samples with a sensitivity of 73.0% and a specificity of 73.0%.

Soft independent modeling of class analogies (SIMCA) was employed for prediction of three different stomach pathologies (normal, adenoma and cancer) based on IR spectroscopic measurements. Although the data set was small, consisting of only 11 patient samples, an independent test set was used to evaluate the classification model. The SIMCA model achieved sensitivities of 30.0–87.0% and a specificity of 77.0% [39].

An LDA model was developed by Li et al. [15] for predicting four different stomach-tissue pathologies (healthy, superficial gastritis, atrophic gastritis, and gastric cancer). The developed model achieved sensitivities of 66.0–74.0% and a specificity of 90.0% for healthy samples when assessed by LOOCV.

2.1.3. Colorectal cancer

Although many Raman studies investigated spectral differences between normal and cancerous tissue, only a small number reported development and assessment of a classification model to provide an actual diagnosis. For example, Widjaja et al. [43] developed multi-class SVM models for predicting colon pathology (normal, hyperplastic polyps and adenocarcinoma) using Raman spectra derived from 105 tissue specimens (59 patients). A radial basis function SVM model achieved sensitivities of 99.4–99.9% and a specificity of 99.3% when tested by LOOCV.
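
Ranges such as the 99.4–99.9% sensitivity quoted for the three-class colon model arise because a multi-class classifier has one sensitivity and one specificity per class. The sketch below shows one common way to obtain them, treating each class in turn as "positive" and all others as "negative" (one-vs-rest). This convention, the function names and the toy labels are assumptions for illustration; the reviewed studies may have computed their per-class figures differently.

```python
def sensitivity_specificity(y_true, y_pred, positive):
    """Return (sensitivity %, specificity %) for one class, rounded to one decimal place.

    Assumes the class has at least one positive and one negative sample.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = 100.0 * tp / (tp + fn)  # positives correctly identified
    specificity = 100.0 * tn / (tn + fp)  # negatives correctly identified
    return round(sensitivity, 1), round(specificity, 1)


def per_class_performance(y_true, y_pred):
    """One-vs-rest sensitivity and specificity for every class present in y_true."""
    return {c: sensitivity_specificity(y_true, y_pred, c) for c in sorted(set(y_true))}


# Hypothetical labels, not data from any reviewed study.
y_true = ["normal"] * 4 + ["polyp"] * 4 + ["cancer"] * 2
y_pred = ["normal", "normal", "normal", "polyp",
          "polyp", "polyp", "polyp", "polyp",
          "cancer", "normal"]

results = per_class_performance(y_true, y_pred)
sens_range = (min(s for s, _ in results.values()), max(s for s, _ in results.values()))
```

For these toy labels the per-class sensitivities span 50.0–100.0%, which is the kind of range reported for multi-class studies in Table 1.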

2.2. Urological cancer

2.2.1. Prostate cancer

LDA was applied to distinguish benign from malignant prostate samples (37 patient samples) measured with a fiber-optic probe suitable for laparoscopic and endoscopic use. A sensitivity of 87.0% and a specificity of 84.0% were achieved when the model was assessed by LOOCV [20].

IR spectroscopy was also applied for grading of prostate cancer-tissue specimens. In a study of 39 patients, classification models using a PC discriminant function (PC-DF) analysis achieved an overall sensitivity of 92.3% and a specificity of 99.4% when assessed with an independent test set [34].

2.2.2. Bladder cancer

The predominantly applied machine-learning method in bladder cancer studies is LDA (e.g., it was used to develop a diagnostic model to discriminate between non-tumor and tumor bladder tissue by de Jong et al. [18]). The resulting model, which was built from Raman data obtained from 15 patient samples, yielded a sensitivity of 94.0% and a specificity of 92.0% when tested by LOOCV. In a similar approach, 24 patient samples, representing normal urothelium, cystitis and transitional cell carcinoma tissue, were used to develop a diagnostic LDA model. This classifier achieved a sensitivity of 89.0% and a specificity of 79.0% when tested by LOOCV [20].

More recent work investigated whether combining Raman spectroscopy with fluorescence-guided cystoscopy could improve specificity for diagnostic prediction of bladder biopsies. The LDA model employed, built on data derived from 38 patient samples, achieved a sensitivity of 100% and a specificity of 80.8% when assessed by LOOCV [27].

2.3. Breast cancer

In a Raman study investigating ex-vivo samples from breast tissue (normal, fibrocystic change, fibroadenoma and invasive cancer), logistic regression (LR) was employed to differentiate between malignant and benign spectra. The model yielded a sensitivity of 94.0% and a specificity of 96.0% when tested by LOOCV [29]. The same machine-learning method was further investigated for its capability of classifying freshly resected tissue samples, mimicking an in-vivo application. Thus, 129 tissue sites from 21 patients were measured and their pathology predicted by the LR model. A sensitivity of 83.0% and a specificity of 93.0% were reported [28].

A different classification approach was taken by Moreno et al. [37], who employed quadratic discriminant analysis (QDA) for distinguishing invasive ductal carcinoma (22 patients), fibrocystic breast conditions (six patients) and normal breast tissues (six patients). The QDA model separated normal from altered tissue with a sensitivity of 99.0% and a specificity of 98.0% in the training data; unfortunately, no results were reported for test data.

ANNs were used to distinguish between IR spectra representing fibroadenoma and ductal carcinoma in situ. The ANNs were tested with an independent test set and achieved a sensitivity of 93.0% and a specificity of 100% [9].

Micro-calcifications are commonly found in breast tissue and are often an indicator of malignant disease development. In an effort to exploit this, Haka et al. [30] investigated Raman spectroscopy and LR for predicting malignancies in breast tissue based on micro-calcifications. Spectra derived from 11 patient samples were classified with a sensitivity of 88.0% and a specificity of 93.0%. IR spectroscopy was also investigated for the potential to diagnose breast pathology based on micro-calcifications. Pathology-specific patterns (carbonate content and protein matrix: mineral ratios) were used to generate a two-matrix linear discriminant model (LDM) for differentiating between benign, ductal carcinoma in situ and invasive malignancies. In this study, sensitivities of 79.0–90.0% and specificities of 82.0–98.0% were reported [46].

FTIR spectroscopy was investigated to differentiate between types of breast disease based on breast calcifications. An SVM-ensemble classifier was developed and tested with an independent test set, achieving sensitivities of 85.0–100.0% and a specificity of 75.0% [45].

2.4. Cervical cancer

Only a few studies investigated this potential application area of VS. IR Raman spectroscopy in combination with LDA modeling has been investigated for in-vivo diagnostics of cervical cancer. The classifier built by using spectra derived from 46 patients yielded a diagnostic sensitivity of 93.5% and a specificity of 97.8% when tested by LOOCV [24].

Raman spectroscopy has also been investigated for in-vitro diagnosis of cervical pre-cancer. A partial least squares discriminant analysis (PLS-DA) model was employed and tested by leave-two-samples-out cross-validation, achieving a sensitivity of 72.5% and a specificity of 89.2% [32].

2.5. Skin tumors

Skin is the most accessible organ of all, so it is most suitable for non-invasive in-vivo diagnostics using VS. Raman spectroscopy and sparse multinomial LR (SM-LR) have been used to distinguish between normal, basal cell carcinoma, squamous cell carcinoma and melanoma. In this study, based on 39 patients, an overall sensitivity and specificity of 100% in the training data was reported [41].

Based on this study, a Raman hand-held probe was developed and used to measure skin samples in 19 patients in vivo. SM-LR was employed to differentiate between normal and abnormal (basal cell carcinoma, squamous cell carcinoma and inflamed scar tissue) spectra. The assessment by cross-validation achieved a sensitivity of 100% and a specificity of 91.0% [40].

ANNs were applied for diagnostic prediction of five different skin-lesion types, including normal skin, pigmented nevi, seborrheic keratosis, basal cell carcinoma and malignant melanoma. In this study, a total of 222 tissue samples were measured by Raman spectroscopy. The resulting spectra were used to build and test ANNs by LOOCV, which achieved a sensitivity of 94.2% and a specificity of 98.6% [10].

2.6. Lymph-node metastases

Lymph-node assessment is an important step in staging cancers, especially since it is known that the presence of metastasis carries a worse prognosis for the patient. In order to allow a better assessment of the lymph-node status in breast-cancer patients, Raman spectroscopy has been investigated for a potential intra-operative application. SVMs have been successfully applied to differentiate metastatic lymph-node tissue samples from non-metastatic tissue samples (43 samples) with a sensitivity and specificity of 100% in an independent test set [42].

Similarly, SVMs were investigated by Horsnell et al. [44] for their potential for lymph-node diagnostics using Raman spectroscopy. The model achieved 71.0–81.0% sensitivities and 91.0–97.0% specificities when tested with LOOCV. In another approach, 38 lymph nodes were measured with a Raman hand-held probe. The acquired spectra were used to develop a principal component fed LDA model, which achieved a sensitivity of 92.0% and a specificity of 100% when tested by LOOCV [26].

In another study, 103 lymph nodes represented different pathologies, including primary lymph nodes from Hodgkin’s and non-Hodgkin’s lymphomas, and lymph nodes containing metastases from squamous cell carcinomas and adenocarcinomas. An LDA model, developed for differentiating between these four groups, achieved sensitivities of 75.0–100% and specificities of 86.0–99.0% when tested by LOOCV [23].

Using FTIR spectroscopy, Liu et al. [14] measured 184 freshly removed cervical lymph nodes from 22 patients with papillary thyroid cancer undergoing thyroid surgery with lymph-node dissection. They developed an LDA model to predict metastasis in lymph nodes and tested it by LOOCV. The model achieved a testing sensitivity of 80.3% and a specificity of 91.9%.

2.7. Lung cancer

Only one study reported the investigation of Raman microspectroscopy for the diagnosis and prognosis of non-small cell lung cancer [38]. A total of 62 lung-tissue samples (28 normal, 34 cancerous) derived from 43 patients were analyzed. A random forest classification model was developed and assessed with an independent test set. This model yielded a diagnostic sensitivity of 90.0% and a specificity of 75.0%.

2.8. Brain tumors

Excisional biopsy can be a potential hazard for vulnerable organs, such as the brain. Taking this into account, VS would be an ideal tool for future in-vivo application in brain-tumor diagnostics. In addition, intra-surgery application for estimation of tumor during resection would be highly desirable since excessive resection might result in brain damage. Conversely, an incomplete resection can cause the recurrence of the tumor.

Biopsies from three normal adrenal glands, 16 neuroblastomas, five ganglioneuromas, six nerve-sheath tumors, and one pheochromocytoma were collected for a Raman study. PCA and discriminant function analysis were used to build a classification model, which separated the different pathologies with sensitivities of 95.0–100% and specificities of 92.3–100% [36].

The same group investigated in a similar manner whether PCA and a discriminant model built using Raman spectra obtained from frozen samples would be capable of predicting the pathology of fresh samples. This classification approach yielded sensitivities of 80.8–100% and specificities of 64.3–100% [35].

In a study of 20 patients, an LDA model was applied to discriminate meningioma from normal dura. The resulting LDA model achieved a sensitivity of 100% and a specificity of 100% when assessed by LOOCV [25].

IR spectroscopy has also been investigated for diagnosis of brain tumors. For example, Beleites et al. applied a classifier system, consisting of a genetic algorithm for feature selection and an LDA model, for discriminating cancerous brain tissue (astrocytoma, glioblastoma) from normal brain tissue (a total of 59 tissue specimens). The developed classifier separated the IR spectra into four distinctive groups with sensitivities of 17.0–71.0% and a specificity of 95.0% when tested using four-fold cross-validation [12].

3. Machine-learning methods: prevalence and performance

LDA was the most frequently applied method for developing classification models since it was used in 18 out of the 40 reviewed publications (45%).

The second most popular method was SVMs, used in about 10% of the studies, followed by LR used in 7% of studies, PC-DF used in 7% of studies, SM-LR in 5% of studies and ANN in 5% of the studies. In the remaining 20% of studies, varying methods were used, including CART, MNLR, PLS-DA, QDA, random forest, SIMCA and two-matrix discriminant analysis.

The prevalence of the various machine-learning methods is shown in Fig. 1. The number of publications in the field has increased year

Fig. 1. Chart showing the frequency of machine-learning methods applied in 40 studies reporting the use of vibrational spectroscopy for cancer diagnostics.

Fig. 2. A. Boxplots showing the sensitivity of machine-learning methods used in the reviewed work. Highest sensitivity was achieved by SM-LR. B. Boxplots showing the
specificity of machine-learning methods. Strongest specificity was achieved by ANN.
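
For reference, the quantities summarized in Fig. 2 can be computed as in the minimal sketch below. It assumes that sensitivity and specificity are taken from confusion-matrix counts and that the IQR is the difference between the third and first quartiles. The review does not state which quartile convention was used, so the `inclusive` method here is an assumption, and all counts and percentages in the example are hypothetical rather than taken from the reviewed studies.

```python
from statistics import median, quantiles


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of truly cancerous samples that the model flags as cancerous."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of truly non-cancerous samples that the model does not flag."""
    return tn / (tn + fp)


def median_and_iqr(values):
    """Return (median, Q3 - Q1) for a list of per-study percentages."""
    # Assumed quartile convention; other conventions give slightly different IQRs.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q3 - q1


# Hypothetical confusion-matrix counts for a single study.
sens = 100 * sensitivity(tp=27, fn=3)  # 27 of 30 cancerous samples detected
spec = 100 * specificity(tn=15, fp=5)  # 15 of 20 normal samples cleared

# Hypothetical per-study sensitivities (%) reported for one method.
reported = [80.0, 85.0, 90.0, 95.0, 100.0]
med, iqr = median_and_iqr(reported)

print(f"sensitivity={sens:.1f}%, specificity={spec:.1f}%")
print(f"median={med:.1f}%, IQR={iqr:.1f}")
```

A method represented by only a handful of studies can show a very small IQR simply because few values contribute, which is worth keeping in mind when comparing the boxplots.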

on year, but we were unable to discern any significant trends in the popularity of individual machine-learning methods over time.

The frequent use of LDA is most probably due to its easy applicability, since the optimization and development of LDA models is simple and requires little computing time and power. Another reason might be that, by using LDA, it is easy to identify what the discrimination is based upon in the spectral domain, thus enabling greater understanding of the disease-related changes in the spectra.

The overall median sensitivity across classification models reported in the reviewed studies was 90.2% (IQR = 11.2).

Looking at the performance of individual methods showed that SM-LR achieved the highest median sensitivity of 100% (IQR = 0). ANN yielded the second-best median sensitivity of 93.6% (IQR = 0.6), followed by PC-DF with a median sensitivity of 92.3% (IQR = 3.5), LDA with a median sensitivity of 89.8% (IQR = 8.1), LR with a median sensitivity of 88.0% (IQR = 5.5), and SVM with a median sensitivity of 87.7% (IQR = 23.8). All classification methods achieved a median sensitivity well above 80% (Fig. 2A).

In comparison, the best median specificity of 99.3% (IQR = 0.7) was achieved by ANN. The second-best median specificity of 96.2% (IQR = 8.6) was achieved by PC-DF, followed by SVM with a median specificity of 96.1% (IQR = 6.2), SM-LR with a median specificity of 95.5% (IQR = 4.5), LR with a median specificity of 93.0% (IQR = 1.5) and LDA with a median specificity of 92.3% (IQR = 8.8). The overall median specificity was 93.0% (IQR = 9.7), so the overall specificity was higher than the overall sensitivity (Fig. 2B).

Interestingly, non-linear methods, such as SVMs, demonstrated lower performance than simpler methods, such as LDA or LR. This might be because a simple classifier performs sufficiently well if sample groups are relatively easily separable. According to Occam’s razor, there is no reason to use a more complex classifier in this case. By contrast, when facing a more complex classification problem, methods such as SVMs might be applied because simpler methods, such as LDA, have failed.

Availability of data is another confounding factor because simple models are easier to train, so simple classification methods can outperform complex methods when the quantity of training data is a limiting factor. The methods used to test classification models also vary from paper to paper, so comparisons of reported performance must be treated with caution. Indeed, in the absence of the original spectral data used in these studies, it is impossible to make an objective comparison between the machine-learning methods employed.

There is a small number of studies in which different machine-learning methods were applied to the same data and performance reported in a consistent way {e.g., we have demonstrated that LDA can perform well on classifying cancerous from non-cancerous spectra (sensitivity = 100%, specificity = 91.9%), but SVMs performed better (sensitivity = 100%, specificity = 100%) on the same data set [42]}.

4. Sample sizes

Across the reviewed studies, we found that the median number of subjects was 38. With the exception of the 2004 skin-cancer study [10], which involved 222 subjects, the number of subjects varied little (inter-quartile range of 32 across all reviewed studies). In several studies, multiple tissue samples were taken from each patient, so the median number of tissue samples was higher at 61 (IQR = 61). Clearly, such a low sample number limits the applicability of a diagnostic model, particularly because of the highly multivariate nature of spectral data. Interestingly, there is no discernible trend in sample size over time.

In other research areas dealing with highly multivariate data (e.g., genome-wide association studies), it is now commonplace to increase sample numbers by sharing data. This is currently not common practice among research groups using VS for cancer diagnostics. This may be due to the assumption that it could be difficult to combine data from different datasets because of the wide variety of laboratory protocols and spectroscopic instrumentation used in different laboratories. However, this is strong motivation for data sharing: comparison of data from different laboratories would allow inter-instrument variation to be characterized and would help

Fig. 3. Chart showing how frequently different testing methods were used. Notably, LOOCV is the most commonly applied method. No group reported the application of bootstrapping for testing classification models.
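
The testing strategies counted in Fig. 3 differ mainly in how samples are held back from training. As an illustration of the bootstrap option that none of the reviewed studies reported, the sketch below estimates sensitivity and specificity from out-of-bag samples. Several things here are assumptions made for the example and are not taken from any reviewed study: the nearest-centroid classifier (a simple stand-in for LDA, SVM and the like), the function names, the number of rounds, and the toy two-feature "spectra". It is a plain out-of-bag estimate and does not apply refinements such as the .632 correction.

```python
import random


def centroid(rows):
    """Feature-wise mean of a list of feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train(X, y):
    """Nearest-centroid model: one mean vector per class label."""
    return {label: centroid([x for x, lab in zip(X, y) if lab == label])
            for label in set(y)}


def predict(model, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(x, model[lab])))


def bootstrap_oob(X, y, rounds=200, seed=0):
    """Pooled out-of-bag sensitivity and specificity (label 1 = cancerous)."""
    rng = random.Random(seed)
    n = len(X)
    tp = fn = tn = fp = 0
    for _ in range(rounds):
        drawn = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        held_out = set(range(n)) - set(drawn)         # samples never drawn this round
        y_train = [y[i] for i in drawn]
        if len(set(y_train)) < 2:                     # skip rounds missing a class
            continue
        model = train([X[i] for i in drawn], y_train)
        for i in held_out:                            # test only on unseen samples
            guess = predict(model, X[i])
            if y[i] == 1:
                tp += guess == 1
                fn += guess != 1
            else:
                tn += guess == 0
                fp += guess != 0
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical, well-separated toy data: two features per "spectrum".
X = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.8], [1.2, 1.0],
     [5.0, 5.1], [4.9, 5.2], [5.1, 4.8], [5.2, 5.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

sens, spec = bootstrap_oob(X, y)
print(f"out-of-bag sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Each round trains on a full-size resample and tests only on samples that were not drawn, so every prediction is made on data the model has not seen, while the training set stays as large as the original data set. When several spectra come from the same patient, resampling would need to be done per patient rather than per spectrum to keep the held-out samples independent.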

the community to develop solutions. To date, the only study of this type has been performed by applying simulated data artifacts to independent test data that were then presented to an already trained classification model [47]. In that study, classification models were generally found to be able to deal with a significant amount of the instrument-to-instrument variation.

5. Prevalence of model-testing methods

As already mentioned, the choice of methods used to determine diagnostic performance can significantly affect sensitivity and specificity reported for a given study. The prevalence of testing methods used in the literature is shown in Fig. 3.

The majority (61%) of reviewed studies tested their diagnostic models using cross-validation. Some 23 studies used LOOCV, one used two-fold cross-validation and one study used four-fold cross-validation.

However, Westerhuis et al. [8] showed with permutation tests that improper use of cross validation leads to an overly optimistic assessment of diagnostic performance. Interestingly, none of the investigated studies used bootstrapping as a test approach; only cross-validation, independent test set or no validation at all were reported in the literature reviewed.

One reason why LOOCV is widely applied might be the generally low sample numbers in the studies reviewed. Splitting such a relatively small data set into a training set and independent test set might result in a training set that is too small to develop a stable classifier, and the limited size of the test set would inevitably result in a coarse estimate of model performance.

However, bootstrap resampling is well suited to low sample numbers and we would encourage the use of bootstrapping where people are currently using LOOCV.

Some 22% of all reviewed studies used an independent test set to assess their diagnostic models. Generally, this is the most thorough way of testing a diagnostic model, if a representative test set can be created. Even stricter testing would require a test set drawn from an independent cohort, which none of the reviewed studies used. Combining data from different studies would be one way to attempt this, if such data were made publicly available.

Six groups reported only training results. This is bad practice, because it does not give any indication of a model’s power in predicting unknown samples. It might demonstrate that diagnostic groups are separable, but this must be taken with caution because powerful non-linear classification methods, such as ANN, can easily be trained to separate samples within any training set (assuming there are no samples with identical spectra assigned to different groups) with 100% accuracy but fail when tested on unseen samples. Diagnostic potential will therefore not have been demonstrated.

6. Conclusions and recommendations

The median sensitivity across all reviewed studies was 90.2% and the median specificity 93.0%, with these metrics reaching 100% in some cases. These results suggest that VS coupled with machine learning has considerable promise for cancer diagnostics, and raise the question of why it is not already being translated into clinical practice. Here, we identify three potential obstacles to clinical uptake, and discuss how they might be overcome.

6.1. Histopathology informs the training data

In the studies reviewed, machine learning was carried out using training data for samples from which the “correct” diagnosis was

determined by histopathology. This means that the spectroscopic approach can only ever be as good as histopathology, which is unlikely to be 100% accurate due to inter-observer variability [1,2]. If the diagnostic accuracy of spectroscopy is to exceed histopathology, then more accurately diagnosed training samples are required, perhaps derived from patient-outcome data or molecular biology.

6.2. Small study sizes

The number of samples used was small in all but one of the studies reviewed, and that always raises questions about the validity of findings, especially with highly multivariate data, such as spectra. It is to be hoped that the promising diagnostic performance demonstrated by multiple small studies can be used to justify investment in larger studies that will carry more weight. In the absence of such larger studies, if research groups in this field routinely shared their data, then spectra from similar studies could be combined to create larger datasets, to which machine-learning methods could be applied.

6.3. Limited model testing

The almost total reliance on cross validation (especially LOOCV) will be of concern to most machine-learning practitioners, as this is likely to lead to overly optimistic performance metrics that are rarely matched in clinical practice.

To achieve more representative metrics, we advise the use of bootstrapping instead of cross validation, and strongly recommend permutation testing to determine the statistical significance of the sensitivity and specificity values obtained. Again, availability of data from existing studies would be most helpful, as it would allow reanalysis using state-of-the-art validation methodology.

Acknowledgments

The genesis of this work was financially supported by Cranfield University and Gloucestershire Hospitals NHS Foundation Trust. Nick Stone was funded by a NIHR Career Scientist Research Fellowship.

References

[1] C. Kendall, M. Isabelle, F. Bazant-Hegemark, J. Hutchings, L. Orr, J. Babrah, et al., Vibrational spectroscopy: a clinical tool for cancer diagnostics, Analyst 134 (2009) 1029–1045.
[2] E. Montgomery, M.P. Bronner, J.R. Goldblum, J.K. Greenson, M.M. Haber, J. Hart, et al., Reproducibility of the diagnosis of dysplasia in Barrett esophagus: a reaffirmation, Hum. Pathol. 32 (2001) 368–378.
[3] R.G. Brereton, Chemometrics for Pattern Recognition, Wiley-Blackwell, Oxford, 2009.
[4] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed., Springer, New York, 2009.
[5] J. Trevisan, P.P. Angelov, P.L. Carmichael, A.D. Scott, F.L. Martin, Extracting biological information with computational analysis of Fourier-transform infrared (FTIR) biospectroscopy datasets: current practices to future perspectives, Analyst 137 (2012) 3202–3215.
[6] J.G. Kelly, J. Trevisan, A.D. Scott, P.L. Carmichael, H.M. Pollock, P.L. Martin-Hirsch, et al., Biospectroscopy to metabolically profile biomolecular structure: a multistage approach linking computational analysis with biomarkers, J. Proteome Res. 10 (2011) 1437–1448.
[7] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, IJCAI 14 (1995) 1137–1145.
[8] J.A. Westerhuis, H.C.J. Hoefsloot, S. Smit, D.J. Vis, A.K. Smilde, E.J.J. van Velzen, et al., Assessment of PLSDA cross validation, Metabolomics 4 (2008) 81–89.
[9] H. Fabian, N.A. Thi, M. Eiden, P. Lasch, J. Schmitt, D. Naumann, Diagnosing benign and malignant lesions in breast tissue sections by using IR-microspectroscopy, Biochim. Biophys. Acta 1758 (2006) 874–882.
[10] S. Sigurdsson, P.A. Philipsen, L.K. Hansen, J. Larsen, M. Gniadecka, H.C. Wulf, Detection of skin cancer by classification of Raman spectra, IEEE Trans. Biomed. Eng. 51 (2004) 1784–1793.
[11] S.K. Teh, W. Zheng, K.Y. Ho, M. Teh, K.G. Yeoh, Z. Huang, Diagnosis of gastric cancer using near-infrared Raman spectroscopy and classification and regression tree techniques, J. Biomed. Opt. 13 (2008) 034013.
[12] C. Beleites, G. Steiner, M.G. Sowa, R. Baumgartner, S. Sobottka, G. Schackert, et al., Classification of human gliomas by infrared imaging spectroscopy and chemometric image processing, Vib. Spectrosc. 38 (2005) 143–149.
[13] C. Kendall, J. Day, J. Hutchings, B. Smith, N. Shepherd, H. Barr, et al., Evaluation of Raman probe for oesophageal cancer diagnostics, Analyst 135 (2010) 3038–3041.
[14] Y. Liu, Y. Xu, Y. Zhang, D. Wang, D. Xiu, Z. Xu, et al., Detection of cervical metastatic lymph nodes in papillary thyroid carcinoma by Fourier transform infrared spectroscopy, Br. J. Surg. 98 (2011) 380–384.
[15] Q.B. Li, X.J. Sun, Y.Z. Xu, L.M. Yang, Y.F. Zhang, S.F. Weng, et al., Diagnosis of gastric inflammation and malignancy in endoscopic biopsies based on Fourier transform infrared spectroscopy, Clin. Chem. 51 (2005) 346–350.
[16] T.D. Wang, G. Triadafilopoulos, J.M. Crawford, L.R. Dixon, T. Bhandari, P. Sahbaie, et al., Detection of endogenous biomolecules in Barrett’s esophagus by Fourier transform infrared spectroscopy, Proc. Natl Acad. Sci. U.S.A. 104 (2007) 15864–15869.
[17] C. Kendall, N. Stone, N. Shepherd, K. Geboes, B. Warren, R. Bennett, et al., Raman spectroscopy, a potential tool for the objective identification and classification of neoplasia in Barrett’s oesophagus, J. Pathol. 200 (2003) 602–609.
[18] B.W. de Jong, T.C. Schut, K. Maquelin, T. van der Kwast, C.H. Bangma, D.J. Kok, et al., Discrimination between nontumor bladder tissue and tumor by Raman spectroscopy, Anal. Chem. 78 (2006) 7761–7769.
[19] M.S. Bergholt, W. Zheng, K. Lin, K.Y. Ho, M. Teh, K.G. Yeoh, et al., In vivo diagnosis of esophageal cancer using image-guided Raman endoscopy and biomolecular modeling, Technol. Cancer Res. Treat. 10 (2011) 103–112.
[20] P. Crow, A. Molckovsky, N. Stone, J. Uff, B. Wilson, L.M. WongKeeSong, Assessment of fiberoptic near-infrared Raman spectroscopy for diagnosis of bladder and prostate cancer, Urology 65 (2005) 1126–1130.
[21] S.K. Teh, W. Zheng, K.Y. Ho, M. Teh, K.G. Yeoh, Z. Huang, Diagnostic potential of near-infrared Raman spectroscopy in the stomach: differentiating dysplasia from normal tissue, Br. J. Cancer 98 (2008) 457–465.
[22] N. Stone, C. Kendall, N. Shepherd, P. Crow, H. Barr, Near-infrared Raman spectroscopy for the classification of epithelial pre-cancers and cancers, J. Raman Spectrosc. 33 (2002) 564–573.
[23] L.E. Orr, J. Christie-Brown, J.C. Hutchings, K. McCarthy, S. Rose, M. Thomas, et al., Raman spectroscopy as a tool for the identification and differentiation of neoplasias contained within lymph nodes of the head and neck, BiOS USA (2010) 75481W.
[24] J. Mo, W. Zheng, J.J. Low, J. Ng, A. Ilancheran, Z. Huang, High wavenumber Raman spectroscopy for in vivo detection of cervical dysplasia, Anal. Chem. 81 (2009) 8908–8915.
[25] S. Koljenovic, T.B. Schut, A. Vincent, J.M. Kros, G.J. Puppels, Detection of meningioma in dura mater by Raman spectroscopy, Anal. Chem. 77 (2005) 7958–7965.
[26] J. Horsnell, P. Stonelake, J. Christie-Brown, G. Shetty, J. Hutchings, C. Kendall, et al., Raman spectroscopy – a new method for the intra-operative assessment of axillary lymph nodes, Analyst 135 (2010) 3042–3047.
[27] M.C. Grimbergen, C.F. van Swol, R.J. van Moorselaar, J. Uff, A. Mahadevan-Jansen, N. Stone, Raman spectroscopy of bladder tissue in the presence of 5-aminolevulinic acid, J. Photochem. Photobiol. B. 95 (2009) 170–176.
[28] A.S. Haka, Z. Volynskaya, J.A. Gardecki, J. Nazemi, R. Shenk, N. Wang, et al., Diagnosing breast cancer using Raman spectroscopy: prospective analysis, J. Biomed. Opt. 14 (2009) 054023.
[29] A.S. Haka, K.E. Shafer-Peltier, M. Fitzmaurice, J. Crowe, R.R. Dasari, M.S. Feld, Diagnosing breast cancer by using Raman spectroscopy, Proc. Natl Acad. Sci. U.S.A. 102 (2005) 12371–12376.
[30] A.S. Haka, K.E. Shafer-Peltier, M. Fitzmaurice, J. Crowe, R.R. Dasari, M.S. Feld, Identifying microcalcifications in benign and malignant breast lesions by probing differences in their chemical composition using Raman spectroscopy, Cancer Res. 62 (2002) 5375–5380.
[31] S.K. Teh, W. Zheng, K.Y. Ho, M. Teh, K.G. Yeoh, Z. Huang, Near-infrared Raman spectroscopy for early diagnosis and typing of adenocarcinoma in the stomach, Br. J. Surg. 97 (2010) 550–557.
[32] S. Duraipandian, W. Zheng, J. Ng, J.J. Low, A. Ilancheran, Z. Huang, In vivo diagnosis of cervical precancer using Raman spectroscopy and genetic algorithm techniques, Analyst 136 (2011) 4328–4336.
[33] T. Kawabata, H. Kikuchi, S. Okazaki, M. Yamamoto, Y. Hiramatsu, J. Yang, et al., Near-infrared multichannel Raman spectroscopy with a 1064 nm excitation wavelength for ex vivo diagnosis of gastric cancer, J. Surg. Res. 169 (2011) e137–e143.
[34] M.J. Baker, E. Gazi, M.D. Brown, J.H. Shanks, P. Gardner, N.W. Clarke, FTIR-based spectroscopic analysis in the identification of clinically aggressive prostate cancer, Br. J. Cancer 99 (2008) 1859–1866.
[35] H. Wills, R. Kast, C. Stewart, R. Rabah, A. Pandya, J. Poulik, et al., Raman spectroscopy detects and distinguishes neuroblastoma and related tissues in fresh and (banked) frozen specimens, J. Pediatr. Surg. 44 (2009) 386–391.
[36] R. Rabah, R. Weber, G.K. Serhatkulu, A. Cao, H. Dai, A. Pandya, et al., Diagnosis of neuroblastoma and ganglioneuroma using Raman spectroscopy, J. Pediatr. Surg. 43 (2007) 171–176.
[37] M. Moreno, L. Raniero, E.A. Loschiavo Arisawa, A.M. do Espirito Santo, E.A. Pereira dos Santos, R.A. Bitar, et al., Raman spectroscopy study of breast disease, Theor. Chem. Acc. 125 (2010) 329–334.

[38] N.D. Magee, J.R. Beattie, C. Carland, R. Davis, K. McManus, I. Bradbury, et al., Raman microscopy in the diagnosis and prognosis of surgically resected nonsmall cell lung cancer, J. Biomed. Opt. 15 (2010) 026015.
[39] S.C. Park, S.J. Lee, H. Namkung, H. Chung, S.-H. Han, M.-Y. Yoon, et al., Feasibility study for diagnosis of stomach adenoma and cancer using IR spectroscopy, Vib. Spectrosc. 44 (2007) 279–285.
[40] C.A. Lieber, S.K. Majumder, D.L. Ellis, D.D. Billheimer, A. Mahadevan-Jansen, In vivo nonmelanoma skin cancer diagnosis using Raman microspectroscopy, Lasers Surg. Med. 40 (2008) 461–467.
[41] C.A. Lieber, S.K. Majumder, D. Billheimer, D.L. Ellis, A. Mahadevan-Jansen, Raman microspectroscopy for skin cancer detection in vitro, J. Biomed. Opt. 13 (2008) 024013.
[42] M. Sattlecker, C. Bessant, J. Smith, N. Stone, Investigation of support vector machines and Raman spectroscopy for lymph node diagnostics, Analyst 135 (2010) 895–901.
[43] E. Widjaja, W. Zheng, Z. Huang, Classification of colonic tissues using near-infrared Raman spectroscopy and support vector machines, Int. J. Oncol. 32 (2008) 653–662.
[44] J.D. Horsnell, J.A. Smith, M. Sattlecker, A. Sammon, J. Christie-Brown, C. Kendall, et al., Raman spectroscopy – a potential new method for the intra-operative assessment of axillary lymph nodes, Surgeon 10 (2012) 123–127.
[45] M. Sattlecker, R. Baker, N. Stone, C. Bessant, Support vector machine ensembles for breast cancer type prediction from mid-FTIR micro-calcification spectra, Chemom. Intell. Lab. Syst. 107 (2011) 363–370.
[46] R. Baker, K.D. Rogers, N. Shepherd, N. Stone, New relationships between breast microcalcifications and cancer, Br. J. Cancer 103 (2010) 1034–1039.
[47] M. Sattlecker, N. Stone, J. Smith, C. Bessant, Assessment of robustness and transferability of classification models built for cancer diagnostics using Raman spectroscopy, J. Raman Spectrosc. 42 (2011) 897–903.
