
Geranyl acetate

C12H20O2
Mass Spectral Libraries
An Ever-Expanding Resource for
Chemical Identification

Steve Stein
Mass Spectrometry Data Center
National Institute of Standards and Technology
Gaithersburg, Maryland, USA
Evolution of the NIST MS Library

[Timeline figure, 1970’s to 2010’s. Labels recoverable from the slide: NIH/EPA collections (Fales, Heller); to EPA, Cincinnati (Budde); to NIST; begin manual evaluation; structures; GC retention; tandem MS; exact m/z; Red Books; 9-track tape; 300 baud modem; PC-XT version; evaluated library; AMDIS; peptides.]
NIST/EPA/NIH MS Library
[Bar chart: libraries distributed per year, 1988 to 2011; vertical axis from 0 to 5000.]
Library Growth

[Figure: library growth over the EPA/NIH, EPA, and NIST periods.]


Mass Spectra
• Flux of Ions at Discrete Masses
– Actually Mass/Charge (m/z)
– Mass reflects elemental composition
• Methane = CH4 = 16 nominal = 16.0313 u
• Isotope = 13CH4 = 17.0347 u (1%)
– Perfect spectrum – intensity vs chemical formula

• Ion is a ‘charged’ molecule (interconnected atoms)


– # protons ≠ # electrons
– Ions controllable with electric/magnetic fields
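The difference between nominal and exact mass can be illustrated with a minimal sketch. The isotope masses are standard values rounded to six decimals; the `exact_mass` helper and the composition dictionaries are illustrative names, not part of the presentation.

```python
# Isotope masses in unified atomic mass units (u), rounded standard values.
ISOTOPE_MASS = {
    "12C": 12.000000,
    "13C": 13.003355,
    "1H": 1.007825,
}

def exact_mass(composition):
    """Sum isotope masses for a composition given as {isotope: count}."""
    return sum(ISOTOPE_MASS[isotope] * count for isotope, count in composition.items())

# Methane and its 13C isotopologue (nominal masses 16 and 17).
methane = exact_mass({"12C": 1, "1H": 4})
methane_13c = exact_mass({"13C": 1, "1H": 4})

print(f"CH4   exact mass: {methane:.4f} u")
print(f"13CH4 exact mass: {methane_13c:.4f} u")
```

Because each element has a characteristic mass defect, an accurately measured m/z narrows the set of possible elemental compositions.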
Mass Spectrometers

GC/MS LC/MS: Electrospray/MALDI


Electron Ionization Ion Trap/Collision Cell/..
Samples are often very complex

[Chromatograms (signal intensity vs retention time): GC/MS of Environmental Sample; LC-MS/MS of Protein Digest.]
Goal: Chemical Identity

• Elemental Composition
(measurable by MS)

• Chemical Structure
(invisible to MS)
Reveal Structure as Spectrum:
A Mass “Fragmentogram”
[Reaction scheme: e- + M (mass = 140 u) → M+ + 2e-, followed by fragmentation of the molecular ion into fragment ions of mass = 99 u and mass = 125 u (the latter by loss of CH3). Structures are drawn on the slide.]
Structure/Spectrum Space
examples of structures with similar spectra
Mass Spectra Reproducible Over Time

O’Neal et al.
Anal. Chem.
1951

NIST
2012

A mass spectrum is a property of an ion


Molecular Fingerprints

VX

HD

GB
Spectra Can be Interpreted, Not Predicted

MS
Interpreter
?
Library Search

• “Fingerprint” Identification
– Identify compound by matching spectrum to library spectrum

• Compare Query to Library Spectra


– Derive ‘score’ for each library spectrum
• Reflect likelihood both spectra are from same compound
– Arrange results by score
• Create ‘Hitlist’
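The search loop described above (score every library spectrum, then sort) might be sketched as follows. The scoring function here is a deliberately crude stand-in (fraction of shared m/z values), and the compound names and peak lists are invented for illustration; nothing below is the NIST search algorithm.

```python
def shared_peak_score(query, candidate):
    """Stand-in score: fraction of m/z values the two spectra share (0 to 1)."""
    q, c = set(query), set(candidate)
    union = q | c
    return len(q & c) / len(union) if union else 0.0

def build_hitlist(query, library, score_fn=shared_peak_score, top_n=10):
    """Score every library spectrum against the query; return best hits first."""
    scored = [(score_fn(query, spectrum), name) for name, spectrum in library.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]

# Toy spectra as {integer m/z: abundance}; names and peaks are hypothetical.
library = {
    "compound_A": {43: 100, 58: 30, 71: 12},
    "compound_B": {43: 80, 57: 100, 85: 20},
    "compound_C": {77: 100, 105: 90, 51: 25},
}
query = {43: 100, 58: 28, 71: 10, 85: 2}

for score, name in build_hitlist(query, library):
    print(f"{score:.2f}  {name}")
```

Passing the score as a function keeps the ranking step independent of how similarity is defined, so a different similarity measure can be swapped in without changing the hitlist code.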
Traditional Library Search

[Screenshot panels: Search List, Query Spectrum, Score Histogram, Library Spectrum, Hit List]

2011 Version - 213K EI, 5K CID, 71K RI Compounds


Non-Traditional Peptide Spectrum Search

[Screenshot panels: Spectrum List, Query Spectrum, Score Histogram, Hit List, Library (Consensus) Spectrum]

For Protein Inference and Quality Monitoring


Spectrum Similarity Score
\[
\cos\theta = \frac{\sum Q\,L}{\sqrt{\sum Q^{2}\,\sum L^{2}}}
\]
• Each spectrum expressed as a vector
• Cosine of Angle between Query and Library Spectra
• Q, L = weight (abundance, mass) : for each peak
• weight (abundance, mass)
– Abundance
– Abundance * m/z
– Certainty * Abundance
– …
S.E. Stein, D.R. Scott, "Optimization and Testing of Mass Spectral Search Algorithms for Compound Identification", J. Amer. Soc. Mass Spectrom., 5, 859-866 (1994)
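A minimal sketch of the cosine score with a configurable peak weight is shown below. The power-law form `abundance**a * mz**b` is one way to cover the "Abundance" and "Abundance * m/z" options listed on the slide; the function names, parameter names, and example peaks are assumptions, and the exponents are not the optimized values from the cited paper.

```python
import math

def weight(abundance, mz, abundance_power=1.0, mz_power=0.0):
    """Peak weight = abundance**a * mz**b; a=1, b=0 gives plain abundance."""
    return (abundance ** abundance_power) * (mz ** mz_power)

def cosine_score(query, library_spectrum, **weight_args):
    """Cosine of the angle between two spectra given as {m/z: abundance}."""
    q = {mz: weight(ab, mz, **weight_args) for mz, ab in query.items()}
    lib = {mz: weight(ab, mz, **weight_args) for mz, ab in library_spectrum.items()}
    # Only peaks present in both spectra contribute to the dot product.
    dot = sum(q[mz] * lib[mz] for mz in q.keys() & lib.keys())
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in lib.values())
    )
    return dot / norm if norm else 0.0

# Hypothetical spectra for illustration.
query = {43: 100, 58: 28, 71: 10}
candidate = {43: 100, 58: 30, 71: 12}

plain = cosine_score(query, candidate)                  # weight = abundance
mass_weighted = cosine_score(query, candidate, mz_power=1.0)  # abundance * m/z
print(f"abundance only: {plain:.4f}, abundance * m/z: {mass_weighted:.4f}")
```

Weighting by m/z gives more influence to high-mass peaks, which tend to be more specific to a compound than the low-mass ions shared by many structures.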
Peaks are Highly Correlated
by mass and abundance

[Figure 3. Joint Peak Occurrence Probabilities for Different Abundances. Horizontal axis: m/z difference (0 to 100). Vertical axis: relative frequency (relative probabilities, 0 to 100). Curves for relative abundance ranges 50-100% (big peaks), 10-20% (medium peaks), and 1-2% (small peaks).]

S.E. Stein and D.N Heller, J. Amer. Soc. Mass Spectrom. 2006, 17, 823-835.
Score Confidence Level
• How to Express Identification Certainty?
– Related to broad range of Identity problems
– Can it be quantitative?

• Follow Bayes
– Follow changes in confidence

• Bayesian Notation
– P ( ID is correct | Threshold Score )
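Bayes Rule*

\[
\underbrace{\frac{P(\mathrm{ID}\mid\mathrm{Score})}{P(\mathrm{FP}\mid\mathrm{Score})}}_{\text{Final Confidence}}
=
\underbrace{\frac{P(\mathrm{ID})}{P(\mathrm{FP})}}_{\text{Starting Confidence}}
\times
\underbrace{\frac{P(\mathrm{Score}\mid\mathrm{ID})}{P(\mathrm{Score}\mid\mathrm{FP})}}_{\text{Change in Confidence}}
\]

• P ( ID ) / P ( FP ): Prior Probability: Analyte is Identified Correctly Before Experiment
• P ( Score | ID ): Reproducible Spectrum
• P ( Score | FP ): False Positive Potential
• P ( Score | ID ) / P ( Score | FP ): Influence of Library Search

* Odds Version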
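The odds form of Bayes' rule multiplies the prior odds of a correct identification by the likelihood ratio of the observed score. A minimal numeric sketch follows; it assumes ID and FP are the only two outcomes, so P(FP) = 1 - P(ID), and all input probabilities are invented for illustration.

```python
def posterior_odds(prior_id, p_score_given_id, p_score_given_fp):
    """Final odds = starting odds * likelihood ratio (odds form of Bayes' rule)."""
    prior_fp = 1.0 - prior_id            # assumption: ID and FP are exhaustive
    starting_odds = prior_id / prior_fp
    likelihood_ratio = p_score_given_id / p_score_given_fp
    return starting_odds * likelihood_ratio

def odds_to_probability(odds):
    """Convert odds in favor of a correct ID into a probability."""
    return odds / (1.0 + odds)

# Hypothetical inputs: a 10% prior, and a score that is 30 times more likely
# for a correct identification than for a false positive.
odds = posterior_odds(prior_id=0.10, p_score_given_id=0.60, p_score_given_fp=0.02)
print(f"final odds {odds:.2f}, P(ID | Score) = {odds_to_probability(odds):.3f}")
```

Even a strong likelihood ratio leaves noticeable doubt when the prior is low, which is why the prior probability is treated as its own step in the slides that follow.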
I. Prior Probability
How plausible is the ID?
• Seen before under similar conditions?

• Expert knowledge
– Expected, plausible, unlikely, impossible

• Citations
– Google, ChemSpider, PubChem, MS Library, …
– Human Metabolite DB, Merck Index, ..

• Weak link in Identification


– Most compounds in a library cannot be in sample
II. Spectrum Variability
P( Score | ID )
• False Negative Potential
– FN when Correct ID has low score

• Spectra are Kinetic Properties of Ions


– But can vary due to instrument bias
– Include known variations in library

• But, Spectra Can Vary For Other Reasons


– Low S/N
– Contaminants
– Instrument problem
– Chemical reaction before/after Ionization
Instrument ‘Noise Signature’
250 Hexachlorobenzene Spectra
same instrument, calibration mix

[Plot: horizontal axis 0 to 300, vertical axis 0 to 1000; bars show quartiles.]
Typical Interlab Spectrum Variation
Energy Dependence
[Figure: three HCD MS/MS spectra (relative abundance vs m/z, 100 to 1300) of the same precursor, m/z 678.22, labeled [M+H+K]2+, from run Gly-3_NGA2-200x-HCD-5to55, scans #967 to #969 (RT 6.09 to 6.10), acquired at collision energy settings 30, 35, and 40. At settings 30 and 35 the base peak is the precursor (m/z 678.2228 and 678.2226); at setting 40 the base peak is m/z 138.0547.]
Ion Source Decomposition
[Figure: comparison at 150 °C and 280 °C; the structure shown on the slide contains an OH group.]
III. False Positive Potential
P( Score | FP )
• Wrong Compound ID – High Score

• Analyte Structure Is Unique


– But is its spectrum?
– NO – Cannot be shown beyond ‘reasonable doubt’
• Spectrum insufficient to ‘reconstruct’ molecule
– But, may be unique for all plausible compounds
• FP should be restricted to plausible compounds

• Sometimes Only Class ID is Possible


– Even if retention time is matched
Large Libraries Can Show Uniqueness

Science, Aug. 22, 2008


Asara et al.
Response to Comment on “Protein Sequences from Mastodon
and Tyrannosaurus rex Revealed by Mass Spectrometry”
Varieties of FPs
• Accidental
– Rare for good quality, informative spectra

• Low information content


– Few reliable peaks

• MS Class Identification
– Different compounds yield same set of ions due to structure similarity
MS Specific Class ID
Same low mass ions (benzoyl)
Random Match
False Positives above 800
Hit List Contains Structural Information

S.E. Stein, "Chemical Substructure Identification by Mass Spectral Library Searching", J. Amer. Soc. Mass Spectrom. 6 (1995), 644-655.
Is Analyte In Library?
• Ideal library contains all plausible compounds

• P(correct) = P(in library) x P(correct, assuming in library)


Estimating Probabilities of Correct Identification from Results of Mass-Spectra Library Searches, JASMS 5 p. 316 (1994)

• P(in library) = f(score)


• P(correct compound) = f(score – score next best ID)

• Optional ‘Prior Probability’ correction if compound is in other collections
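The factorization P(correct) = P(in library) x P(correct, assuming in library) can be sketched as below. The slide states only that the first factor depends on the score and the second on the gap to the next-best hit; the logistic curves, their parameters, and the 0 to 999 score scale used here are purely hypothetical placeholders, not the functions from the cited JASMS paper.

```python
import math

def logistic(x, midpoint, steepness):
    """Generic S-shaped curve used only as a placeholder for f(...)."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - midpoint)))

def p_in_library(score):
    """Hypothetical f(score): higher top score, more likely analyte is in library."""
    return logistic(score, midpoint=700.0, steepness=0.02)

def p_correct_given_in_library(score, next_best_score):
    """Hypothetical f(score - next best): larger gap, more likely top hit is right."""
    return logistic(score - next_best_score, midpoint=50.0, steepness=0.05)

def p_correct(score, next_best_score):
    """P(correct) = P(in library) * P(correct | in library)."""
    return p_in_library(score) * p_correct_given_in_library(score, next_best_score)

# Same top score, different separation from the runner-up (invented values).
clear_winner = p_correct(score=900, next_best_score=700)
close_call = p_correct(score=900, next_best_score=890)
print(f"clear winner: {clear_winner:.2f}, close call: {close_call:.2f}")
```

Separating the two factors makes it explicit that a high score alone is not enough: a runner-up with nearly the same score lowers confidence in the top hit even when the analyte is very likely in the library.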
How to Handle Unknown Components?

Major Category: Plant “Metabolites”

Vetiver Oil
Many components not identified by GC/MS

NIST Library
Identified Manually
Not Identified
Unidentified Recurring Spectrum Library

99% - RI 1200

69% - RI 1860

57% - RI 2504

From 3,700 pediatric urine samples.


Derived 203 recurring spectra with same retention.
Working on blood, essential oils, biologic drugs
Rumsfeld Quadrants
• Identified by Library
– Known Knowns (Target Library, Expected by Analyst): Expected and found
– Unknown Knowns (Comprehensive Library, Unexpected by Analyst): Not expected but found

• Not Identified by Library
– Known Unknowns (Expected by Analyst): Expected but not found (Concentration too low, not in library, …)
– Unknown Unknowns (Unexpected by Analyst): Not expected and not found (Recurrent Spectral Libraries)
Our Pipeline
