
A Novel Pipeline Method for the Preprocessing of Mass Spectrometry Proteomics Data


Maria Anna v. Rapsomaniki, Panagiotis G. Zerefos, Konstantinos A. Theofilatos,
Spiridon D. Likothanassis, Athanasios K. Tsakalidis and Seferina P. Mavroudi

Manuscript received June 2010.
Maria Anna v. Rapsomaniki (rapsoman@ceid.upatras.gr), Konstantinos
A. Theofilatos (theofilk@ceid.upatras.gr), Spiridon D. Likothanassis
(likothan@ceid.upatras.gr), Athanasios K. Tsakalidis (tsak@ �i.gr) and
Seferina P. Mavroudi (mavroudi@ceid.upatras.gr) from University of
Patras, Greece. Panagiotis G. Zerefos (pzerefos@bioacademy.gr) from
Biomedical Research Foundation, Academy of Athens, Athens, Greece.
978-1-4244-6561-3/10/$26.00 ©2010 IEEE

Abstract- Mass spectrometry based methodologies allow us to analyze
complex mixtures of proteins from biological samples in order to identify
important proteomic patterns and discover novel disease biomarkers.
However, proteomics data are high-dimensional and complex and need to
undergo a number of preprocessing steps before they can be further
analyzed. This procedure involves various steps that interact in complex
ways and is an open research area, as it has been shown to have a large
effect on the final results. In this work we propose a novel pipeline
method for the preprocessing of MS-based data. Our proposed strategy
addresses some of the major problems when dealing with proteomics data,
like miscalibration, noise and redundancy. Our method exploits the
advantages of using the mean spectrum, ensuring at the same time higher
sensitivity. In order to illustrate the utility of our proposed method we
experimented on data coming from samples of patients with bladder cancer
or benign disease. The results were quite encouraging, as 31 statistically
important peaks were identified, some of which are not detected by
existing methods.

I. INTRODUCTION

The rapid developments in mass spectrometry (MS) and the introduction of
new experimental ionization methods, like matrix-assisted laser desorption
ionization (MALDI) and surface-enhanced laser desorption ionization
(SELDI), have made it possible to study protein expression levels in
complex mixtures of proteins from various biological samples, like serum
[1], plasma [2] and urine [3]. The data generated from these technologies
can be used to identify proteomic patterns that can successfully separate
states (e.g. normal versus disease) and possibly discover novel disease
biomarkers. Those patterns have high diagnostic significance, as they can
be used for early diagnosis, prognosis, monitoring disease progression or
therapeutic response [4]. This strategy has already been used in various
types of cancer, like ovarian [1] [5], breast [2] and prostate cancer [6],
giving interesting results.

However, the complex nature of proteomics data makes their analysis a
challenging task, as the initial raw data are very difficult to handle.
More specifically, the data retrieved after an MS experiment contain
hundreds of samples (i.e. mass spectra), and to each sample correspond
tens of thousands of features. In addition to this high dimensionality -
small sample size problem, each spectrum contains a great amount of noise
and artifacts, mostly due to the high sensitivity of the instrument,
sample contamination and electrical noise. Another common problem is the
miscalibration of the spectra that makes the data impossible to compare.
For all those reasons, it is clear that in order to extract knowledge
about the true underlying biological differences in the proteome, various
preprocessing steps need to be applied.

The main goal of preprocessing is to come up with a matrix of important
features (i.e. peaks) and their corresponding intensity values, which can
be further analyzed using a variety of computational methods. To achieve
this, one must first remove noise, artifacts and systematic bias without
loss of information and then detect and quantify a set of peaks.
Preprocessing involves various steps that are highly interrelated and it
has been shown that if those steps are not applied carefully, it will be
difficult to extract meaningful conclusions about the underlying disease
[7]. For each step, a number of methods have been proposed, making the
decision about the best combination of methods a very challenging task.
Furthermore, it is difficult to evaluate the performance of each method
and come up with a standard strategy, as for each dataset a different set
of methods appears to be more effective.

This paper proposes a new method for the preprocessing of proteomics data
that deals with the problematic characteristics of the data and exploits
the advantages of various existing methods. More specifically, our
proposed strategy focuses on three main problems: correcting the
miscalibration of the mass spectra, detecting the peaks in a sensitive yet
robust manner and extracting the true intensity values that correspond to
each peak. For the peak finding step, we used a method based on the mean
spectrum approach, where we first find the peaks per category, then apply
certain criteria to ensure their reproducibility and then combine them in
a single peak list. Instead of working with peak locations, we propose the
use of peak intervals, to ensure that the small shifts present in the data
do not interfere with the final results. Our proposed pipeline was applied
to a MALDI MS dataset, obtained by the Proteomics Research Unit of the
Biomedical Research Foundation, concerning patients with bladder cancer
(high or low grade) and benign bladder disease. After the preprocessing, a
classification step was applied, which achieved extremely
high performance. Furthermore, 31 statistically important peaks were
identified, some of which are not detected by existing methods.

Our method is programmed in MATLAB R2007b. All source codes are available
upon request.

II. BACKGROUND

Before proceeding to the next section, which describes our proposed
pipeline method, we first briefly review the separate steps usually
involved in MS data preprocessing. The first preprocessing step is usually
denoising, where one tries to remove noise without loss of information.
Usual approaches involve variations of the Wavelet Transform, and because
of its time-shift invariance, the Undecimated Discrete Wavelet Transform
(UDWT) [8] is becoming the standard choice for this task.

The next preprocessing step is usually baseline correction, where the aim
is to remove from the signal the exponential baseline curve that is
present because of the co-ionization of the matrix molecules. Various
methods are available, mainly based on iteratively fitting a curve to the
local minima of the signal [9].

The existence of systematic error is another common characteristic that
makes the comparison of different spectra problematic. Global
normalization of intensity values helps remove the experimental variations
and is usually performed according to the Total Ion Current (TIC), which
describes the total amount of protein present in the sample.

The peak finding and peak quantification steps are crucial in dimension
reduction, as from a spectrum of tens of thousands of m/z points we end up
with a single matrix consisting of some hundreds of peak locations and
their corresponding intensity values (matrix of protein expression
levels). The aim is to detect peak positions as local maxima and find
corresponding intensity values, based on a peak's height or area, for
every spectrum.

Finally, miscalibration and measurement errors cause the peaks to appear
in slightly different m/z locations and thus peak alignment is used to
correct those minor shifts. Commonly used algorithms try to find a common
set of m/z values that represent all the samples, so that the same peaks
appear at identical m/z locations. In some of them each signal is shifted
and scaled, so that the maximum alignment among spectra is achieved, while
in others the peaks are clustered so that the ones that represent the same
protein belong to the same bin. The latter category is often referred to
as peak matching and a common approach includes the use of hierarchical
clustering [10].

There are two main approaches available concerning the final two steps:
the first one proposes to detect and quantify peaks in individual spectra,
which gives n peak lists (where n is the number of samples) that need to
be matched afterwards [11]. This method is very sensitive, as it detects
all peaks, but requires the application of peak matching. Common problems
are that the results are often redundant, as even the insignificant or
random peaks are detected, and that the same peak is sometimes detected
twice, because of mistakes in the peak matching step. The second approach
suggests peak detection using the mean spectrum and then peak
quantification by finding the corresponding intensities in individual
spectra [4]. This method is becoming increasingly popular, as there is no
need to match peaks afterwards and also the peaks detected are
statistically important. However, a common problem with this approach is
that quite often small peaks that may be of biological importance
disappear in the mean. It has also been shown that if the spectra are
misaligned, the peaks are not detected correctly [12].

III. METHODOLOGY

Dataset: Our proposed method was tested on a MALDI dataset, composed of
urine samples that were collected from approximately 200 patients with
Bladder Cancer (High or Low Grade) or benign bladder disease. The samples
were run through the spectrometer over 9 days, giving 9 subsets (batches)
of mass spectra. For each batch, a set of control samples was also
provided. Each sample was run 5 times, giving 5 replicates. The
preprocessing of this dataset was very challenging because apart from the
aforementioned problems we had to deal with some additional ones. More
specifically, the calibration of the samples was not done systematically,
which resulted in incomparable spectra. Furthermore, like most urine
sample spectra, the spectra were significantly contaminated and contained
a great amount of irrelevant peaks [13]. All those reasons necessitated
the application of extra preprocessing steps.

Proposed Pipeline Method: Our proposed method is organized in two main
steps: finding the optimal peak list that describes all samples and
accurately quantifying the peaks in all spectra. For the first step, we
used the strategy shown in Fig. 1. First, we applied a replicate selection
step by computing the Euclidean distances between each replicate and their
mean and then selecting the one with the minimum distance, as we
considered it to be the most representative. In that way we obtain a more
manageable sample size without loss of information, as all 5 replicates
actually refer to the same sample. We also applied a resampling step, in
order to homogenize the m/z vectors and be able to compare different
samples under the same reference. To achieve that, each signal was
decimated into a more manageable m/z vector and at the same time an
antialias filter was applied, so that high-frequency noise is prevented
from folding into lower frequencies.

One of the biggest problems we had to correct was the miscalibration,
which caused the presence of shifts in the spectra, in some cases bigger
than 100 m/z units. For that reason we used an iterative process that
involved first the alignment of the control (reference) samples, then the
correction of the shifts (linear transformation) in individual spectra
using the control alignment and finally the global alignment using a
vector of reference peaks.

In order to find the most appropriate peak list, we propose a strategy
based on the idea of using the mean spectrum for
peak detection. As mentioned, this method results in a peak list with
statistically important peaks, as random peaks that appear only in
individual samples are usually not detected. However, there is no way to
tell if a small yet persistent peak, present in the majority of the
samples of one category and thus a possible biomarker, is detected or has
disappeared in the mean. On the other hand it is possible that sometimes
random peaks with high intensity values "survive" in the mean.

For all those reasons, we used the mean spectrum per category in order to
combine the robustness of peak finding through the mean spectrum with
higher sensitivity. We also applied an SNR cutoff criterion, discarding
the peaks whose Signal-to-Noise Ratio (SNR) was under a certain threshold.
We introduced a consistency criterion, counting the times each peak was
detected and discarding the ones with a frequency lower than 35%. Finally,
a peak matching algorithm was used in order to combine the separate peak
lists into a final one.

Fig. 1: Pipeline method for peak detection

In order to quantify the peaks in individual spectra, we used the method
shown in Fig. 2. More precisely, first we denoised the spectra using the
UDWT. Then, we computed the baseline in multiple shifted windows and
regressed it using a shape-preserving piecewise cubic interpolation.
Finally, we normalized the intensity values by standardizing the Area
under Curve (AUC) to the group median. Since small shifts still appear in
the spectra, we used peak intervals of fixed width, centered at each
peak's position, and searched for the maximum intensity value inside each
interval. In that way we ensure that the minor shifts do not cause
mistakes in peak quantification.

Fig. 2: Peak quantification

The final matrix of peak expression levels was further analyzed using an
approach (Fig. 3) that involved a simple filtering feature extraction
method (two-way t-test) followed by a Support Vector Machine (SVM)
classifier with a quadratic kernel function [14]. The details of this
analysis are beyond the scope of this paper but are available upon
request.

Fig. 3: Final Analysis (diagram: Feature Selection and SVM learning)

IV. RESULTS

We applied our proposed method to the dataset described earlier in order
to detect and quantify the peaks. In Table I we provide the results of the
peak finding step for each category, before and after the application of
the cutoff criteria. As we can see, each step significantly reduces the
dimensionality of the problem. The application of the peak matching
algorithm combined these 3 peak lists into a vector of 456 peak bins,
which are the features that will be further used in the analysis.

TABLE I
NUMBER OF PEAKS DETECTED PER CATEGORY

                           Benign   Low Grade   High Grade
initial set                   593         580          478
after SNR cutoff              477         451          393
after Consistency cutoff      416         415          338

The feature extraction and classification process was repeated 100 times,
to ensure reproducibility of the results, and each time the classifier was
built using a different feature set. The average values of the evaluation
performance measures for our method are provided in Table II.

TABLE II
CLASSIFICATION PERFORMANCE

Accuracy   Sensitivity   Specificity
  0.9903        0.9887        0.9911
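
Two steps of the peak detection pipeline of Fig. 1, replicate selection and the SNR and consistency cutoffs, can be sketched as follows. This is a minimal Python illustration and not the authors' MATLAB code: the function names, the input layout and the SNR threshold `snr_min` are assumptions made for the example (the text only says "a certain threshold"), while the 35% frequency cutoff follows the text. The candidate peaks other than m/z 3641.3 are made-up values.

```python
import math


def select_replicate(replicates):
    """Return the index of the replicate closest (Euclidean) to the replicate mean.

    replicates: list of equal-length intensity lists, one per replicate.
    """
    n = len(replicates)
    mean = [sum(column) / n for column in zip(*replicates)]
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(rep, mean)))
        for rep in replicates
    ]
    return distances.index(min(distances))


def filter_peaks(peaks, n_spectra, snr_min=3.0, min_freq=0.35):
    """Keep the m/z of peaks that pass both the SNR and the consistency cutoff.

    peaks: list of dicts with keys 'mz', 'snr' and 'count', where 'count' is
    the number of spectra of the category in which the peak was detected.
    snr_min is a hypothetical value; min_freq=0.35 follows the text.
    """
    kept = []
    for peak in peaks:
        if peak["snr"] < snr_min:
            continue  # SNR cutoff
        if peak["count"] / n_spectra < min_freq:
            continue  # consistency cutoff: detected in fewer than 35% of spectra
        kept.append(peak["mz"])
    return kept


replicates = [[1.0, 5.0, 2.0], [1.2, 5.5, 2.1], [3.0, 9.0, 4.0]]
candidates = [
    {"mz": 3641.3, "snr": 8.2, "count": 31},
    {"mz": 4102.7, "snr": 1.4, "count": 40},  # fails the SNR cutoff
    {"mz": 5210.0, "snr": 6.0, "count": 9},   # fails the consistency cutoff
]
print(select_replicate(replicates))            # 1
print(filter_peaks(candidates, n_spectra=50))  # [3641.3]
```

Both cutoffs are applied per category, so a peak that is persistent in only one category can still reach the peak matching step.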
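
The interval-based quantification described in Section III (Fig. 2) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the function name, the `half_width` value and the toy spectra are hypothetical, and the m/z vector is assumed to be sorted and shared by all spectra after resampling.

```python
from bisect import bisect_left, bisect_right


def quantify_peaks(mz, intensity, peak_positions, half_width=2.0):
    """Return the maximum intensity inside a fixed-width interval per peak.

    mz: ascending list of m/z values; intensity: list of the same length.
    half_width (in m/z units) is a hypothetical value, not taken from the paper.
    """
    values = []
    for center in peak_positions:
        lo = bisect_left(mz, center - half_width)
        hi = bisect_right(mz, center + half_width)
        window = intensity[lo:hi]
        # An interval that contains no m/z point is reported as zero intensity.
        values.append(max(window) if window else 0.0)
    return values


mz = [3638.0, 3639.0, 3640.0, 3641.0, 3642.0, 3643.0, 3644.0]
spectrum = [10.0, 12.0, 40.0, 95.0, 60.0, 15.0, 11.0]
shifted = [10.0, 12.0, 15.0, 40.0, 95.0, 60.0, 11.0]  # same peak, one m/z unit later

print(quantify_peaks(mz, spectrum, [3641.3]))  # [95.0]
print(quantify_peaks(mz, shifted, [3641.3]))   # [95.0]
```

Both toy spectra yield the same value for the peak bin, which is the point of using intervals instead of exact peak locations: a residual shift smaller than the interval half-width does not change the quantified intensity.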
In order to compare our results with existing methods, we also performed
peak finding using the overall mean spectrum [4], which gave us a matrix
of 484 peaks and corresponding intensities. The feature extraction step
failed to identify statistically important features among those 484 peaks.

A further investigation of the results revealed that using our method a
certain subset of 31 features is selected in every run of the feature
selection algorithm. This indicates that those features have high
discriminatory power and thus could be possible biomarkers. We also
observed that the mean spectrum method failed to identify 7 of those
features. In Fig. 4 we show one of these features, which as we can see is
far more abundant in cancer (both low and high grade) than in benign
samples.

[Figure: two panels, "Mean Benign Spectrum" and "Mean Cancer Spectra - Low and High Grade"; x-axis m/z from 3620 to 3655, y-axis from 0 to 10000]
Fig. 4: Peak at m/z location 3641.3

V. FUTURE WORK

In this paper we proposed a novel pipeline method for the preprocessing of
MS-based proteomics data, which deals with the demanding characteristics
of this type of bioinformatics data. Our future considerations involve the
development of a web-based tool, which will implement the basic steps of
our method. In addition, because of the large number of steps and
parameters involved in preprocessing, our tool will propose optimal
combinations of steps and optimal parameter values according to the
dataset each time provided, and in that way will ease the preprocessing
job for the unfamiliar user.

REFERENCES

[1] E.F. Petricoin III, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A.
Fusaro, S.M. Steinberg, G.B. Mills, C. Simone, D.A. Fishman, E.C. Kohn,
and others, "Use of proteomic patterns in serum to identify ovarian
cancer," The Lancet, vol. 359, 2002, pp. 572-577.
[2] L. Pusztai, B.W. Gregory, K.A. Baggerly, B. Peng, J. Koomen, H.M.
Kuerer, F.J. Esteva, W.F. Symmans, P. Wagner, G.N. Hortobagyi, C. Laronga,
O.J. Semmes, G.L.W. Jr, R.R. Drake, and A. Vlahou, "Pharmacoproteomic
analysis of prechemotherapy and postchemotherapy plasma samples from
patients receiving neoadjuvant or adjuvant chemotherapy for breast
carcinoma," Cancer, vol. 100, 2004, pp. 1814-1822.
[3] A. Vlahou, P.F. Schellhammer, S. Mendrinos, K. Patel, F.I. Kondylis,
L. Gong, S. Nasim, and G.L. Wright, "Development of a Novel Proteomic
Approach for the Detection of Transitional Cell Carcinoma of the Bladder
in Urine," Am J Pathol, vol. 158, Apr. 2001, pp. 1491-1502.
[4] J.S. Morris, K.R. Coombes, J. Koomen, K.A. Baggerly, and R. Kobayashi,
"Feature extraction and quantification for mass spectrometry in biomedical
applications using the mean spectrum," Bioinformatics, vol. 21, May 2005,
pp. 1764-1775.
[5] A. Vlahou, J.O. Schorge, B.W. Gregory, and R.L. Coleman, "Diagnosis of
ovarian cancer using decision tree classification of mass spectral data,"
Journal of Biomedicine and Biotechnology, vol. 2003, 2003, pp. 308-319.
[6] B. Adam, Y. Qu, J.W. Davis, M.D. Ward, M.A. Clements, L.H. Cazares,
O.J. Semmes, P.F. Schellhammer, Y. Yasui, Z. Feng, and G.L. Wright, "Serum
protein fingerprinting coupled with a pattern-matching algorithm
distinguishes prostate cancer from benign prostate hyperplasia and healthy
men," Cancer Research, vol. 62, Jul. 2002, pp. 3609-3614.
[7] K.A. Baggerly, J.S. Morris, and K.R. Coombes, "Reproducibility of
SELDI-TOF protein patterns in serum: comparing datasets from different
experiments," Bioinformatics (Oxford, England), vol. 20, Mar. 2004, pp.
777-785.
[8] K.R. Coombes, S. Tsavachidis, J.S. Morris, K.A. Baggerly, M. Hung, and
H.M. Kuerer, "Improved peak detection and quantification of mass
spectrometry data acquired from surface-enhanced laser desorption and
ionization by denoising spectra with the undecimated discrete wavelet
transform," Proteomics, vol. 5, Nov. 2005, pp. 4107-4117.
[9] K.A. Baggerly, J.S. Morris, J. Wang, D. Gold, L. Xiao, and K.R.
Coombes, "A comprehensive approach to the analysis of matrix-assisted
laser desorption/ionization-time of flight proteomics spectra from serum
samples," Proteomics, vol. 3, Sep. 2003, pp. 1667-1672.
[10] R. Tibshirani, T. Hastie, B. Narasimhan, S. Soltys, G. Shi, A. Koong,
and Q. Le, "Sample classification from protein mass spectrometry, by 'peak
probability contrasts'," Bioinformatics, vol. 20, Nov. 2004, pp.
3034-3044.
[11] Y. Yasui, D. McLerran, B. Adam, M. Winget, M. Thornquist, and Z.
Feng, "An Automated Peak Identification/Calibration Procedure for
High-Dimensional Protein Measures From Mass Spectrometers," Journal of
Biomedicine and Biotechnology, vol. 2003, Oct. 2003, pp. 242-248.
[12] K. Coombes, K. Baggerly, and J. Morris, "Pre-Processing Mass
Spectrometry Data," Fundamentals of Data Mining in Genomics and
Proteomics, 2007, pp. 79-102.
[13] P. Zerefos, J. Prados, S. Kossida, A. Kalousis, and A. Vlahou,
"Sample preparation and bioinformatics in MALDI profiling of urinary
proteins," Journal of Chromatography B, vol. 853, Jun. 2007, pp. 20-30.
[14] M. Hilario and A. Kalousis, "Approaches to dimensionality reduction
in proteomic biomarker studies," Brief Bioinform, Feb. 2008, p. bbn005.
[15] R. Armananzas, Y. Saeys, I. Inza, M. Garcia-Torres, C. Bielza, Y.V.D.
Peer, and P. Larranaga, "Peakbin Selection in Mass Spectrometry Data Using
a Consensus Approach with Estimation of Distribution Algorithms," IEEE/ACM
Transactions on Computational Biology and Bioinformatics, vol. 99, 2010.
