
SELDI-TOF Mass Spectrometry Protein Data

By Huong Thi Dieu La


References
• Alejandro Cruz-Marcelo, Rudy Guerra, Marina Vannucci, Yiting Li,
Ching C. Lau, and Tsz-Kwong Man. Comparison of algorithms for
pre-processing of SELDI-TOF mass spectrometry data.
Bioinformatics, 24(19):2129–2136, 2008.
• Robert Gentleman, Vincent Carey, Wolfgang Huber, Rafael
Irizarry, and Sandrine Dudoit. Bioinformatics and Computational
Biology Solutions Using R and Bioconductor (Statistics for Biology
and Health). Springer Science and Business Media, Inc, New York,
first edition, 2005.
• Haleem J. Issaq, Timothy D. Veenstra, Thomas P. Conrads, and
Donna Felschow. The SELDI-TOF MS approach to proteomics:
Protein profiling and biomarker identification. Biochemical and
Biophysical Research Communications, 292:587–592, 2002.
SELDI-TOF-MS
• Surface Enhanced Laser Desorption/Ionization
Time-of-Flight Mass Spectrometry
• Used to profile protein markers from tissue or
bodily fluids and thus identify biomarkers that
can aid in diagnosis, prognosis or treatment.
• Applications: psychiatric disease, renal function,
cancer (pancreatic, prostate, ovarian, and
breast)
SELDI-TOF-MS Components
• ProteinChip array
– Retains specific proteins from the sample
• Reader
– Measures the molecular weights of the retained
proteins and generates a trace showing the
relative abundance vs. the molecular weights of
these proteins
• Software
– Identifies differences in protein abundances
between two samples
Source: http://www.rci.rutgers.edu/~layla/AnalMedChem511/pdf_files/RB_pdf/403feature_issaq.pdf
Preparation
• Biological samples are processed via
fractionation.
• Fractionation: the process of splitting the
original sample into subsamples which contain
proteins that are more homogeneous
EAM: Energy Absorbing Molecule

Source:http://urology.jhu.edu/research/img1/proteomics13.jpg
Preprocessing of MS data
• Alignment of the spectra
• Filtering (denoising)
• Baseline subtraction
• Normalization
• Peak detection
• Clustering of peaks
• Peak quantification
SELDI-TOF-MS software
• ProteinChip Software 3.1
• SpecAlign
• Cromwell
• PROcess
• MassSpecWavelet
PROcess package
• Process a single spectrum
• Process a set of spectra
Process a single spectrum
• Baseline subtraction
• Peak detection
Baseline subtraction
• Purpose: To level off the elevated, non-constant
baseline caused by chemical noise in the
EAM and by ion overload, thus making different
spectra comparable.
• Solution: Use local regression to estimate the
bottom of a spectrum, then subtract that
estimate from the spectrum.
• Two approaches: fit the local regression to
– The points below a certain quantile
– Local minima: yields better results when
estimating the baseline.
Baseline subtraction: algorithm
• For each spectrum, find local minima by
segmenting the m/z range.
• Fit a local regression to the local minima of each
spectrum
• Subtract the estimated baseline from each
spectrum
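The three steps above can be sketched in base R on a synthetic spectrum. This is only an illustration of the idea, not the code behind bslnoff(); the function name estimate_baseline, the number of segments, and the loess span are assumptions chosen for the example.

```r
# Baseline estimation from local minima: illustrative sketch in base R,
# not the bslnoff() implementation. n_segments and span are assumed values.
estimate_baseline <- function(mz, intensity, n_segments = 20, span = 0.5) {
  # 1. Segment the m/z range and locate the minimum within each segment
  seg <- cut(seq_along(mz), breaks = n_segments, labels = FALSE)
  idx <- vapply(split(seq_along(mz), seg),
                function(i) i[which.min(intensity[i])], integer(1))
  minima <- data.frame(mz = mz[idx], y = intensity[idx])

  # 2. Fit a local regression (loess) through the local minima
  fit <- loess(y ~ mz, data = minima, span = span, degree = 1)

  # 3. Evaluate the fit on the full m/z grid; loess does not extrapolate,
  #    so values outside the range of the minima take the nearest fitted value
  base <- predict(fit, newdata = data.frame(mz = mz))
  ok <- !is.na(base)
  approx(mz[ok], base[ok], xout = mz, rule = 2)$y
}

# Synthetic spectrum: decaying baseline + one peak + small positive noise
set.seed(1)
mz <- seq(1000, 10000, length.out = 2000)
true_base <- 20 * exp(-mz / 3000)
peak <- 50 * exp(-0.5 * ((mz - 5000) / 30)^2)
intensity <- true_base + peak + abs(rnorm(length(mz), sd = 0.2))

baseline  <- estimate_baseline(mz, intensity)
corrected <- intensity - baseline   # baseline-subtracted spectrum
```

Because the fit goes through segment minima rather than all points, the peak itself barely influences the estimated baseline.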
### Load libraries
library(survival)
library(Icens)
library(PROcess)

### Read in the raw spectrum
fdat <- system.file("Test", package="PROcess")
fs <- list.files(fdat, pattern = "\\.*csv\\.*", full.names=TRUE)
f1 <- read.files(fs[1])

### Plot the raw spectrum
jpeg("f1.jpeg", width=480, height=480)
plot(f1, type="l", xlab="m/z")
title(basename(fs[1]))
dev.off()

### Remove the baseline
jpeg("f2.jpeg", width=480, height=480)
bseoff <- bslnoff(f1, method="loess", bw=0.1, xlab="m/z", plot=TRUE)
title(basename(fs[1]))
dev.off()
Peak detection
• Purpose: To detect peaks that represent the set
of proteins that are differentially expressed
between different samples.
Peak Detection: algorithm
• Smooth the spectrum using moving averages of ks
nearest neighbors
• Compute local variability as the median of the
absolute deviations of kv nearest neighbors.
• Identify local maxima of the smoothed spectrum
using three thresholds:
– The signal-to-noise ratio: local smooth / local variability
– The detection threshold for the whole spectrum
– The shape ratio: the area under the curve within a small
distance of a peak candidate / the maximum of all such
peak areas in a spectrum
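A rough base-R sketch of this three-threshold scheme follows. It is not isPeak() itself: the function name detect_peaks, the default threshold values, and the use of a sum of smoothed intensities as the "area" are all assumptions made for illustration.

```r
# Three-threshold peak picking: illustrative sketch, not isPeak() itself.
# ks, kv and all threshold defaults are assumed values.
detect_peaks <- function(mz, y, ks = 11, kv = 81,
                         snr_min = 3, detect_min = 2, shape_min = 0.2,
                         area_half_width = 0.003) {
  n <- length(y)
  # 1. Smooth with a centred moving average over ks neighbours
  sm <- as.numeric(stats::filter(y, rep(1 / ks, ks), sides = 2))
  sm[is.na(sm)] <- y[is.na(sm)]            # keep raw values at the edges

  # 2. Local variability: median absolute deviation over kv neighbours
  half <- kv %/% 2
  local_mad <- vapply(seq_len(n), function(i) {
    w <- y[max(1, i - half):min(n, i + half)]
    median(abs(w - median(w)))
  }, numeric(1))
  local_mad[local_mad == 0] <- min(local_mad[local_mad > 0])

  # 3. Candidate peaks: local maxima of the smoothed spectrum
  is_max <- c(FALSE, diff(sign(diff(sm))) == -2, FALSE)

  # Area near each candidate, approximated by summing the smoothed
  # intensities within +/- area_half_width * m/z of the candidate
  area <- rep(0, n)
  for (i in which(is_max)) {
    area[i] <- sum(sm[abs(mz - mz[i]) <= area_half_width * mz[i]])
  }

  keep <- is_max &
    sm / local_mad > snr_min &             # signal-to-noise ratio
    sm > detect_min &                      # whole-spectrum detection threshold
    area / max(area) > shape_min           # shape ratio
  data.frame(mz = mz[keep], intensity = sm[keep])
}

# Synthetic baseline-subtracted spectrum with peaks near m/z 3000 and 7000
set.seed(42)
mz <- seq(1000, 10000, by = 2)
y <- 40 * exp(-0.5 * ((mz - 3000) / 10)^2) +
     25 * exp(-0.5 * ((mz - 7000) / 10)^2) +
     rnorm(length(mz), sd = 0.3)
peaks <- detect_peaks(mz, y)
print(peaks)
```

Noise-only local maxima fail both the signal-to-noise and the detection thresholds, so only candidates on the two simulated peaks should survive.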
### Peak detection
jpeg("f3.jpeg", width=480, height=480)
pkgobj <- isPeak(bseoff, span=81, sm.span=11, plot=TRUE, zerothrsh=2, area.w=0.003, ratio=0.2)
dev.off()

### Inspect peaks in a particular range of m/z values
jpeg("f4.jpeg", width=480, height=480)
specZoom(pkgobj, xlim=c(5000,10000))
dev.off()
Processing a set of calibration spectra
• Apply baseline subtraction
• Normalize spectra
• Cutoff selection
• Identify peaks
• Quality assessment
• Get proto-biomarkers
Example Data Set
• A set of 8 spectra from a calibration data set
– The same 5 proteins are present in the sample:
1084, 1638, 3496, 5807, and 7034 amu
### Known masses (amu) of the 5 calibration proteins
amu.cali <- c(1084,1638,3496,5807,7034)

### Read in and plot the 8 spectra, marking the protein positions with salmon vertical lines
jpeg("f5.jpeg", width=1080, height=560)
par(mfrow=c(2,4))
plotCali <- function(f, main, lab.cali){
  x <- read.files(f)
  plot(x, main=main, ylim=c(0,max(x[,2])), type="n")
  abline(h=0, col="gray")
  abline(v=amu.cali, col="salmon")
  # Label the known masses along the top axis when requested
  if(lab.cali)
    axis(3, at=amu.cali, labels=amu.cali,
         las=3, tick=FALSE, col="salmon", cex.axis=0.94)
  lines(x)
  return(invisible(x))
}
dir.cali <- system.file("calibration", package="PROcess")
files <- dir(dir.cali, full.names=TRUE)
i <- seq(along=files)
mapply(plotCali, files, LETTERS[i], i <=2)
dev.off()
Baseline subtraction
• Similar to baseline subtraction for a single
spectrum
• R code:
Mcal <- rmBaseline(dir.cali, plot=TRUE)
head(Mcal)
        060503peptidecalib_1_128.csv 060503peptidecalib_1_16.csv
3.6385                     0.7253853                   0.7485778
3.6458                     0.6859291                   0.6960419
3.65287                    0.6856960                   0.7088729
3.65972                    0.6985420                   0.7249795
3.66635                    0.6885195                   0.6953421
3.67276                    0.6752363                   0.6885879


Normalize Spectra
• Purpose: reduce variation due to experimental
noise
• Total ion normalization:
– Calculate each spectrum's area under the curve
(AUC) for m/z values greater than the selected cutoff
– Scale all spectra to the median AUC
– Assumptions:
  • The number of proteins being over-expressed is
  approximately equal to the number of proteins being
  under-expressed.
  • The number of proteins whose expression levels
  change is small relative to the total number of
  proteins bound to the protein array surface
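Total ion normalization can be sketched in a few lines of base R. The sketch assumes a matrix laid out like the output of rmBaseline (one column per spectrum, m/z values as row names) and approximates the AUC by the sum of intensities above the cutoff; the function name normalize_tic and the toy data are hypothetical.

```r
# Total ion normalization: illustrative sketch, not renorm() itself.
# Assumes M has one column per spectrum and m/z values as row names.
normalize_tic <- function(M, cutoff) {
  mz <- as.numeric(rownames(M))
  X <- M[mz > cutoff, , drop = FALSE]    # keep m/z values above the cutoff
  auc <- colSums(X)                      # AUC approximated by the intensity sum
  sweep(X, 2, median(auc) / auc, `*`)    # rescale each spectrum to the median AUC
}

# Toy example: three spectra with the same shape at 1x, 2x and 4x scale
mz <- seq(100, 1000, by = 100)
shape <- c(5, 9, 4, 1, 2, 6, 3, 1, 2, 1)
M <- cbind(a = shape, b = 2 * shape, c = 4 * shape)
rownames(M) <- mz

N <- normalize_tic(M, cutoff = 300)
colSums(N)   # every spectrum now has the median AUC (32)
```

After scaling, the three toy spectra become identical, which is the intended effect when spectra differ only by an overall intensity factor.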
Cutoff selection
• Choose a cutoff point such that the magnitude
of the noise is relatively stable above that point.
• Algorithm for a single cutoff point:
– Baseline-subtracted spectra within the group
are normalized to the median of the sums of
intensities of the spectra
– The standard deviation of intensities at each
m/z value is calculated
– The mean of those standard deviations is
computed.
• Repeat for different cutoff points and plot the
average standard deviations vs. the cutoff points.
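The cutoff scan described above can be written as a small base-R sketch. It mirrors the listed steps rather than the internals of avesd(); the function name average_sd and the simulated data (noisier below m/z 1000) are assumptions for illustration.

```r
# Average SD above a cutoff: illustrative sketch, not avesd() itself.
# Assumes M has one column per spectrum and m/z values as row names.
average_sd <- function(M, cutoff) {
  mz <- as.numeric(rownames(M))
  X <- M[mz > cutoff, , drop = FALSE]
  totals <- colSums(X)
  X <- sweep(X, 2, median(totals) / totals, `*`)  # normalize to the median total
  mean(apply(X, 1, sd))                           # mean of the per-m/z SDs
}

# Simulated baseline-subtracted spectra whose noise is larger below m/z 1000
set.seed(7)
mz <- seq(100, 10000, by = 10)
noise_sd <- ifelse(mz < 1000, 5, 0.5)
M <- sapply(1:6, function(j) 10 + rnorm(length(mz), sd = noise_sd))
rownames(M) <- mz

cutoffs <- c(100, 300, 1000, 3000)
sds <- sapply(cutoffs, average_sd, M = M)
print(data.frame(cutoff = cutoffs, average_sd = sds))

# Plot the scan when running interactively
if (interactive()) plot(cutoffs, sds, log = "x", xlab = "cutpoint", ylab = "average sd")
```

In this simulation the average SD flattens once the cutoff passes the noisy low-mass region, which is the pattern one looks for when choosing the cutoff.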
### Cutoff selection
cts <- round(10^(seq(2,4,length=14)))
sdsFirst <- sapply(cts, avesd, Ma=Mcal)
jpeg("f6.jpeg", width=480, height=480)
par(mfrow=c(1,1))
plot(cts, sdsFirst, xlab="cutpoint", pch=21,
     bg="red", log="x", ylab="average sd")
dev.off()

### Normalize spectra - cutoff point m/z=400
M.r <- renorm(Mcal, cutoff=400)


Identify Peaks
• Similar to peak detection for a single baseline-adjusted
spectrum
• R code:
### Identify peaks
peakfile <- "calipeak.csv"
getPeaks(M.r, peakfile, ratio=0.1)


Quality Assessment
• Purpose: Identify and eliminate spectra of
poor quality
• Based on 3 parameters:
– Quality: measure of separation of signal from
noise
– Retain: the number of high peaks in a single
spectrum
– Peak: the number of peaks in a spectrum
relative to the average number of peaks of the
whole set of spectra being considered
• Poor quality spectra: Quality < 0.4, Retain < 0.1,
Peak < 0.5.
Quality assessment: algorithm
• Estimate the noise by subtracting from each
spectrum its moving average with a window
size of 5 points.
• Calculate the noise envelope as 3 times the
standard deviation of the noise in a 250 point
window.
• Calculate the area under each spectrum, A0
• Calculate the area after subtracting the noise
envelope from the spectrum, A1
• Obtain Quality, Retain, and Peak
Quality assessment: algorithm
• Quality: A1/A0
• Retain: the number of points with height
greater than 5 times the noise envelope / the total
number of points in the spectrum
• Peak: the number of peaks detected in each spectrum
/ the average number of peaks for all
spectra in a run
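The Quality and Retain definitions can be illustrated with a short base-R sketch. It is not the quality() function from PROcess: the name spectrum_quality is hypothetical, the 250-point windows are taken as consecutive non-overlapping blocks, and negative values after subtracting the envelope are clipped at zero, both of which are assumptions the slides do not spell out.

```r
# Quality and Retain for one spectrum: illustrative sketch, not quality().
spectrum_quality <- function(y, ma_window = 5, env_window = 250, retain_factor = 5) {
  n <- length(y)
  # Noise = spectrum minus its centred 5-point moving average
  ma <- as.numeric(stats::filter(y, rep(1 / ma_window, ma_window), sides = 2))
  ma[is.na(ma)] <- y[is.na(ma)]
  noise <- y - ma
  # Noise envelope = 3 x SD of the noise in consecutive 250-point blocks
  # (assumption: a short final block is merged into the previous one)
  block <- pmin(ceiling(seq_len(n) / env_window), max(1, n %/% env_window))
  envelope <- 3 * ave(noise, block, FUN = sd)
  a0 <- sum(y)                          # area under the spectrum
  a1 <- sum(pmax(y - envelope, 0))      # area left above the envelope (clipped at 0)
  c(Quality = a1 / a0,
    Retain  = mean(y > retain_factor * envelope))
}

# Same two simulated peaks with low and high noise
set.seed(3)
idx <- 1:2000
signal <- 50 * exp(-0.5 * ((idx - 600) / 10)^2) +
          50 * exp(-0.5 * ((idx - 1400) / 10)^2)
clean <- signal + abs(rnorm(2000, sd = 0.05))
noisy <- signal + abs(rnorm(2000, sd = 1))
q_clean <- spectrum_quality(clean)
q_noisy <- spectrum_quality(noisy)
print(rbind(clean = q_clean, noisy = q_noisy))

# Peak: peaks per spectrum relative to the average over the run
n_peaks <- c(8, 8, 6, 10)               # hypothetical peak counts
peak_ratio <- n_peaks / mean(n_peaks)   # 1.00 1.00 0.75 1.25
```

The noisier spectrum loses more of its area to the envelope, so its Quality value comes out lower than that of the clean spectrum.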
qualRes <- quality(M.r, peakfile, cutoff=400)
qualRes
                               Quality    Retain      peak
060503peptidecalib_1_128.csv 0.4144087 0.1710994 0.9696970
060503peptidecalib_1_16.csv  0.4558286 0.1406047 0.9696970
060503peptidecalib_1_2.csv   0.4971926 0.1178203 0.9696970
060503peptidecalib_1_256.csv 0.4095177 0.1778567 0.7272727
060503peptidecalib_1_32.csv  0.3556932 0.1297756 0.9696970
060503peptidecalib_1_4.csv   0.5220848 0.1432037 1.2121212
060503peptidecalib_1_64.csv  0.4790304 0.1430304 1.2121212
060503peptidecalib_1_8.csv   0.4174718 0.1201594 0.9696970


Get Proto-biomarkers
• Peak alignment: grouping peaks across spectra that are
likely to represent the same protein.
• Proto-biomarkers: peaks aligned across
spectra
• To obtain a proto-biomarker:
– Generate an interval around each peak that is
centered at the m/z value of the peak (0.3%)
– Determine which actual peaks are represented
by a proto-biomarker
– Use the maximum value as the height of that
proto-biomarker
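One simple way to picture the alignment is a greedy grouping of sorted peaks, sketched below in base R. This is a simplification and not the pk2bmkr() algorithm: the function name align_peaks and the toy peak table are hypothetical, 0.3% is treated as the allowed distance above the first peak of a group, and the proto-biomarker position is taken as the mean m/z of its peaks.

```r
# Greedy peak alignment: illustrative sketch, not pk2bmkr() itself.
# peaks: data frame with columns spectrum, mz, intensity (assumed layout).
align_peaks <- function(peaks, tol = 0.003) {
  peaks <- peaks[order(peaks$mz), ]
  group <- integer(nrow(peaks))
  g <- 0L
  upper <- -Inf
  for (i in seq_len(nrow(peaks))) {
    if (peaks$mz[i] > upper) {               # outside the current interval:
      g <- g + 1L                            # start a new proto-biomarker
      upper <- peaks$mz[i] * (1 + tol)
    }
    group[i] <- g
  }
  centers <- tapply(peaks$mz, group, mean)   # position of each proto-biomarker
  # Height per spectrum = maximum intensity among its peaks in the group
  heights <- tapply(peaks$intensity, list(peaks$spectrum, group), max)
  colnames(heights) <- paste0("M", round(centers))
  heights
}

# Toy peak table for three spectra
peaks <- data.frame(
  spectrum  = c("A", "A", "B", "B", "C"),
  mz        = c(3497, 5810, 3499, 5814, 3498),
  intensity = c(12.0, 30.5, 10.2, 28.1, 11.4)
)
bmk <- align_peaks(peaks)
print(bmk)   # columns M3498 and M5812; spectrum C has NA for M5812
```

A spectrum without a peak inside a given interval gets NA for that proto-biomarker, which makes missing detections easy to spot.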
### Get proto-biomarkers
bmkfile <- "calibmk.csv"
bmk1 <- pk2bmkr(peakfile, M.r, bmkfile, p.fltr=0.5)
mk1 <- round(as.numeric(gsub("M", "", names(bmk1))))
mk1 ### [1] 2906 3498 5812 7036

### Replot the spectra and overlay each one with its m/z axis doubled (blue)
jpeg("f7.jpeg", width=1080, height=560)
par(mfrow=c(2,4))
plotCali2 <- function(...){
  x <- plotCali(...)
  lines(x[,1]*2, x[,2]+25, col="blue")
}
mapply(plotCali2, files, LETTERS[i], i <=2)
dev.off()
Analyze the result
• 5 known proteins: 1084, 1638, 3496, 5807,
7034
• Obtained 4 proto-biomarkers: 2906, 3498,
5812, and 7036
• Within 0.3% of m/z values of known proteins:
3498, 5812, and 7036
• Peaks attributed to larger proteins carrying two charges:
2 × 2906 = 5812 (≈ 5807) and 2 × 3496 = 6992 (≈ 7034)
• Failed to detect peaks at m/z=1084 and 1638
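The comparison between detected and known masses is plain arithmetic and can be checked in base R. The sketch reuses the 0.3% window as the matching tolerance, which is an assumption; the doubled column tests whether twice a detected m/z matches a known mass, as expected for a doubly charged ion.

```r
# Compare detected proto-biomarkers with the known calibrant masses
known <- c(1084, 1638, 3496, 5807, 7034)   # calibrant masses (amu)
found <- c(2906, 3498, 5812, 7036)         # proto-biomarkers reported above
tol <- 0.003                               # 0.3% tolerance (assumed to apply here)

# Smallest relative difference between a value and any known mass
min_rel_diff <- function(m) min(abs(known - m) / known)

direct  <- sapply(found, min_rel_diff)       # singly charged interpretation
doubled <- sapply(2 * found, min_rel_diff)   # doubly charged: m/z is half the mass

result <- data.frame(found,
                     pct_off_direct  = round(100 * direct, 2),
                     direct_match    = direct <= tol,
                     pct_off_doubled = round(100 * doubled, 2),
                     doubled_match   = doubled <= tol)
print(result)
```

With these numbers, 3498, 5812, and 7036 each lie within 0.3% of a known mass, and 2 × 2906 = 5812 lies within 0.3% of 5807. By contrast, 2 × 3498 = 6996 is about 0.5% below 7034, so that doubly charged assignment is looser.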
Summary
• PROcess package:
– Process SELDI-TOF-MS data
– Advantage: produces more reproducible results
regarding peak quantification
– Limitation: The results were not homogeneous
across laser intensities
