2010 - A Novel Pipeline Method For Pre Processing
Abstract- Mass spectrometry based methodologies allow us to analyze complex mixtures of proteins from biological samples in order to identify important proteomic patterns and discover novel disease biomarkers. However, proteomics data are high-dimensional and complex and need to undergo a number of preprocessing steps so that they can be further analyzed. This procedure involves various steps that interact in complex ways and is an open research area, as it has been shown that it has a big effect on the final results. In this work we propose a novel pipeline method for the preprocessing of MS-based data. Our proposed strategy addresses some of the major problems when dealing with proteomics data, like miscalibration, noise and redundancy. Our method exploits the advantages of using the mean spectrum, ensuring at the same time higher sensitivity. In order to illustrate the utility of our proposed method we experimented on data coming from samples of patients with bladder cancer or benign disease. The results were quite encouraging, as 31 statistically important peaks were identified, some of which are not detected by existing methods.

Manuscript received June 2010.
Maria Anna V. Rapsomaniki (rapsoman@ceid.upatras.gr), Konstantinos A. Theofilatos (theofilk@ceid.upatras.gr), Spiridon D. Likothanassis (likothan@ceid.upatras.gr), Athanasios K. Tsakalidis (tsak@[illegible].gr) and Seferina P. Mavroudi (mavroudi@ceid.upatras.gr) from University of Patras, Greece. Panagiotis G. Zerefos (pzerefos@bioacademy.gr) from Biomedical Research Foundation, Academy of Athens, Athens, Greece.

I. INTRODUCTION

The rapid developments in mass spectrometry (MS) and the introduction of new experimental ionization methods, like matrix-assisted laser desorption ionization (MALDI) and surface-enhanced laser desorption ionization (SELDI), have made it possible to study protein expression levels in complex mixtures of proteins from various biological samples, like serum [1], plasma [2] and urine [3]. The data generated from these technologies can be used to identify proteomic patterns that can successfully separate states (e.g. normal versus disease) and possibly discover novel disease biomarkers. Those patterns have high diagnostic significance, as they can be used for early diagnosis, prognosis, monitoring disease progression or therapeutic response [4]. This strategy has already been used in various types of cancer, like ovarian [1] [5], breast [2] and prostate cancer [6], giving interesting results.

However, the complex nature of proteomics data makes their analysis a challenging task, as the initial raw data are very difficult to handle. More specifically, the data retrieved after an MS experiment contain hundreds of samples (i.e. mass spectra), and each sample corresponds to tens of thousands of features. In addition to this high dimensionality - small sample size problem, each spectrum contains a great amount of noise and artifacts, mostly due to the high sensitivity of the instrument, sample contamination and electrical noise. Another common problem is the miscalibration of the spectra that makes the data impossible to compare. For all those reasons, it is more than obvious that in order to extract knowledge about the true underlying biological differences in the proteome, various preprocessing steps need to be applied.

The main goal of preprocessing is to come up with a matrix of important features (i.e. peaks) and their corresponding intensity values, which can be further analyzed using a variety of computational methods. To achieve this, one must first remove noise, artifacts and systematic bias without loss of information and then detect and quantify a set of peaks. Preprocessing involves various steps that are highly interrelated and it has been shown that if those steps are not applied carefully, it will be difficult to extract meaningful conclusions about the underlying disease [7]. For each step, a number of methods have been proposed, making the decision about the best combination of methods a very challenging task. Furthermore, it is difficult to evaluate the performance of each method and come up with a standard strategy, as for each dataset a different set of methods appears to be more effective.

This paper proposes a new method for the preprocessing of proteomics data that deals with the problematic characteristics of the data and exploits the advantages of various existing methods. More specifically, our proposed strategy focuses on three main problems: correcting the miscalibration of the mass spectra, detecting the peaks in a sensitive yet robust manner and extracting the true intensity values that correspond to each peak. For the peak finding step, we used a method based on the mean spectrum approach, where we first find the peaks per category, then apply certain criteria to ensure their reproducibility and then combine them in a single peak list. Instead of working with peak locations, we propose the use of peak intervals, to ensure that the small shifts present in the data do not interfere with the final results. Our proposed pipeline was applied to a MALDI MS dataset, obtained by the Proteomics Research Unit of the Biomedical Research Foundation, concerning patients with bladder cancer (high or low grade) and benign bladder disease. After the preprocessing, a classification step was applied, which achieved extremely [...]
[Figure label: Feature Selection and SVM learning]
IV. RESULTS
We applied our proposed method to the dataset described
earlier in order to detect and quantify the peaks. In Table I
we provide the results of the peak finding step for each
category, before and after the application of the cutoff
criteria. As we can see, each step significantly reduces the
dimensionality of the problem. The application of the peak
matching algorithm combined the 3 per-category peak lists into a vector
of 456 peak bins, which are the features that will be further
used in the analysis.
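
The sequence summarized above (peak finding on a per-category mean spectrum, a cutoff, then matching the per-category lists into peak bins) can be illustrated with a minimal sketch. It is not the algorithm used in this work: the function names, the height-based cutoff `min_height`, the merging tolerance `tol` and the toy spectra are all assumptions made only for illustration, and the cutoff criteria and matching rule of the actual pipeline are not reproduced here.

```python
from statistics import mean

def mean_spectrum(spectra):
    """Average the intensity at each m/z index over the spectra of one category."""
    return [mean(column) for column in zip(*spectra)]

def find_local_maxima(intensities, min_height):
    """Indices of local maxima whose intensity reaches min_height (assumed cutoff)."""
    return [i for i in range(1, len(intensities) - 1)
            if intensities[i] >= min_height
            and intensities[i] > intensities[i - 1]
            and intensities[i] >= intensities[i + 1]]

def merge_into_bins(peak_lists, mz, tol):
    """Merge per-category peak positions into (low, high) m/z intervals.

    Assumed rule: a peak joins the current bin if it lies within tol
    of the last peak already placed in that bin.
    """
    positions = sorted(mz[i] for peaks in peak_lists for i in peaks)
    bins = []
    for p in positions:
        if bins and p - bins[-1][1] <= tol:
            bins[-1][1] = p          # extend the current interval
        else:
            bins.append([p, p])      # open a new interval
    return [tuple(b) for b in bins]

def bin_features(spectrum, mz, bins):
    """One feature per bin: the maximum intensity of the spectrum inside the interval."""
    return [max(v for m, v in zip(mz, spectrum) if lo <= m <= hi)
            for lo, hi in bins]

# Toy data (hypothetical): 7 m/z points, three categories.
mz = [1000.0, 1000.5, 1001.0, 1001.5, 1002.0, 1002.5, 1003.0]
categories = {
    "A": [[0.0, 5.0, 1.0, 0.0, 0.0, 3.0, 0.0], [0.0, 7.0, 1.0, 0.0, 0.0, 1.0, 0.0]],
    "B": [[0.0, 1.0, 6.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 4.0, 0.0, 0.0, 0.0, 0.0]],
    "C": [[0.0, 0.0, 0.0, 0.0, 4.0, 9.0, 0.0]],
}

# Peaks are found per category on the mean spectrum, then merged into bins.
peak_lists = [find_local_maxima(mean_spectrum(s), min_height=1.5)
              for s in categories.values()]
bins = merge_into_bins(peak_lists, mz, tol=0.5)
features = bin_features(categories["A"][0], mz, bins)
```

In this toy case the peaks at 1000.5 and 1001.0, found in different categories, fall into a single interval, which is how intervals absorb small shifts between spectra. Each sample is then described by one intensity value per bin.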
TABLE I
NUMBER OF PEAKS DETECTED PER CATEGORY
REFERENCES
[1] E.F. Petricoin III, A.M. Ardekani, B.A. Hitt, P.J. Levine, V.A.
Fusaro, S.M. Steinberg, G.B. Mills, C. Simone, D.A. Fishman, E.C.