
Feature pubs.acs.org/ac

Mass Spectral Reference Libraries: An Ever-Expanding Resource for Chemical Identification

The basic principles, practices, and pitfalls in the process of compound identification by searching mass spectral reference libraries are presented. Factors affecting the identification process are discussed as members of one of three major contributors to identification confidence: prior probability, risk of false negative results, and risk of false positive results. More general concerns and the problem of “unknown unknowns” are then explored.

Stephen Stein*
Chemical and Biochemical Reference Data Division, NIST, 100 Bureau Drive, Gaithersburg, Maryland, United States

This article not subject to U.S. Copyright. Published 2012 by the American Chemical Society

Robert Gates

Determining the identity of a compound is a central task in chemical analysis. For compounds in complex mixtures, fragmentation patterns of their ions are widely used for this purpose. These patterns (spectra) can provide both the elemental composition of the compound and a direct read-out of labile bonds. Steady advances in the sensitivity and resolution of mass spectrometers are capable of unearthing increasing numbers of components in virtually any mixture. Dealing with the ever-increasing number of identifiable compounds and associated digital data presents a major impediment to the effective use of these new instruments. Spectral libraries are an established tool for dealing with this deluge of data: collections of chemical structures and their spectra that can permit fast, reliable identifications for any compound whose fragmentation pattern is measured by the instrument.

For volatile compounds, a GC/MS analysis, followed by library searching, has long been an accepted way of making identifications in complex mixtures. For any acquired spectrum, this method locates the most similar spectra in the reference library and presents the compounds that generated them in a “hit list” sorted by their similarity to the acquired spectrum. These libraries have steadily grown in coverage since their first computer use over 40 years ago, when they contained about 6700 compounds;1 comprehensive libraries now contain spectra for hundreds of thousands of compounds.2,3 A wider representation of relevant compounds not only increases the chance of identifying a compound but also adds to the confidence of an identification by demonstrating the uniqueness of its spectrum.

While the same principles apply to tandem MS libraries acquired by LC/MS, usage is far more limited, even as the availability of instruments that generate these spectra has mushroomed in recent years. As discussed later, this is a result of the limited size and availability of these libraries.

The objective of this article is to acquaint current and potential users of these libraries with the basic principles, practices, and pitfalls of identification by library searching. While earlier impediments of limited storage and low processing speed have disappeared (in fact, a single data file can exceed the size of the largest library), we show that progress depends on the growth and exploitation of relevant, high quality libraries.

OUTLINE
We first examine the fundamentals of chemical identification by library searching using an intuitive “Bayesian” statistical model.4 This divides the confidence of identification into three factors: prior probability, false negative potential, and false positive potential. We then examine various aspects of library search identification according to these factors. Then, we review spectrum similarity measures in current use. Next, we examine basic features of electron ionization (EI) and collision induced dissociation (CID) of ions and their implications for library building and use. Ideas underlying library quality and the role of high resolution are then presented. Then, we describe “recurrent spectra” libraries, a new class of libraries of unidentified spectra in common materials. We conclude with an overview of the process of chemical identification by MS library searching.



Published: June 22, 2012
dx.doi.org/10.1021/ac301205z | Anal. Chem. 2012, 84, 7274−7282

IDENTIFICATION CONFIDENCE
A library search should yield not only the most probable identity of the compound that generated the query spectrum but also the confidence that it is correct. The best identification is often simply the one with the highest similarity “score”. This score should reflect the likelihood that the two spectra being compared were generated by the same compound. However, establishing the confidence that the identification is correct is more difficult. Using a Bayesian approach, this confidence is composed of two terms: the confidence prior to the library search and the change in confidence from the search. This is expressed most directly in an “odds” formulation of the Bayes rule (eq 1).

Equation 1 uses the compact Bayesian notation, where P(X|Y) is the probability of X being correct if Y is true while P(X) is the simple probability of X being correct. Specifically, ID denotes a possibly correct identification and P(ID|Score) is the probability that this ID is correct if it exceeds a given similarity search Score. P(FP|Score) is the corresponding probability that the library match is wrong (a false positive, FP, result) at or above that Score.

PRIOR PROBABILITY
The first term in eq 1 is the desired confidence or, more specifically, the odds that an identification is correct after applying results of a library search. The second term represents this confidence prior to library searching; it is the well-known “Prior Probability” of Bayesian statistics. While this concept is unappealing to many because it can be difficult, or impossible, to derive precisely, its estimation is required if one wishes to determine the probability that an identification is correct. Its estimation may use results of earlier related experiments, expert judgment, or any other knowledge concerning the matrix or the experiment. Table 1 lists various contributors to Prior Probability.

In the Bayesian spirit, the NIST Mass Spectral program5 applies a “common compounds” score correction based on how frequently a compound appears in a selected set of nonspectral data collections; if common and uncommon compounds have the same score, the common compound is more likely to be correct. In a similar way, an increasing number of citations for the putative identification, with added weight if it is relevant to the sample under study, can serve to increase the “prior”.6 For well-studied materials such as human blood or plasma, the presence or absence of a compound in a comprehensive collection such as the Human Metabolome Database (HMDB)7 can help establish the plausibility of the identification of a specific compound in that matrix.

FALSE NEGATIVE RISK AND SPECTRUM VARIABILITY
The numerator of the last term of eq 1B, P(Score|ID), is the probability that the Score threshold would be met if the compound is present at identifiable levels in the sample. If, for any reason, differences between the search and library spectra of the target compound are too great, the Score will be below threshold, resulting in a false negative finding. While complete, reliable spectra in the reference library will reduce this risk, the spectrum of a given compound can differ substantially from its library spectrum for a number of other reasons. We now examine the origins of this variability (also, see Table 2).

Table 2. Search Spectrum Variations (False Negative Risk)
general class | specific type | cause
spurious peaks | contaminant | overlapping mixture component: RT for GC/MS; RT and m/z for LC/MS-MS
spurious peaks | background | ionization region/carrier fluid contaminant
spurious peaks | splitting/merging | centroiding or resolution problems
low concentration (low signal/noise) | variable intensity | ion counts/centroid errors
low concentration (low signal/noise) | missing peaks | low signal; MS/MS, near precursor ion
low concentration (low signal/noise) | risk of spurious peaks | noise, more likely in low signal strength spectra
energy variation | thermal (EI ionizer) | thermal energy in precursor causes more fragmentation (GC/MS only)
energy variation | collision energy | tuning/ion optics problem
isomerization/reaction | in sample | chemical reaction prior to injection
isomerization/reaction | decomposition in ionizer | reaction on hot ionizer surface (GC/MS only)
isomerization/reaction | ion isomerization/adduction | solvent adducts, special concern in ion trap
instrument | intensity saturation (nonlinear) | excessive concentration
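The odds formulation that the text calls eq 1 can be made concrete with a small numerical sketch. Read term by term from the description in the text (posterior odds first, prior odds second, and a last term with P(Score|ID) in the numerator and P(Score|FP) in the denominator), the relation is taken here to be P(ID|Score)/P(FP|Score) = [P(ID)/P(FP)] × [P(Score|ID)/P(Score|FP)]. This is a reconstruction from those definitions, not a transcription of the printed equation. The function names, the assumption that P(FP) = 1 − P(ID), and the example numbers are all illustrative.

```python
def posterior_odds(prior_prob, p_score_given_id, p_score_given_fp):
    """Odds that an identification is correct after the library search.

    prior_prob        -- P(ID), estimated before the search (0 < P(ID) < 1)
    p_score_given_id  -- P(Score|ID): chance the correct compound reaches the Score
    p_score_given_fp  -- P(Score|FP): chance a wrong compound reaches the Score
    Assumption: the prior chance of a wrong answer is P(FP) = 1 - P(ID).
    """
    if not 0.0 < prior_prob < 1.0:
        raise ValueError("prior_prob must lie strictly between 0 and 1")
    if p_score_given_fp <= 0.0:
        raise ValueError("p_score_given_fp must be positive")
    prior_odds = prior_prob / (1.0 - prior_prob)
    likelihood_ratio = p_score_given_id / p_score_given_fp
    return prior_odds * likelihood_ratio


def odds_to_probability(odds):
    """Convert odds (e.g. 10 for '10 to 1') to a probability."""
    return odds / (1.0 + odds)


# Hypothetical numbers: a compound judged 10% likely to be present beforehand,
# a 90% chance that a true match reaches the score threshold, and a 1% chance
# that a wrong compound does.
odds = posterior_odds(0.10, 0.90, 0.01)
print(round(odds, 3), round(odds_to_probability(odds), 3))
```

With these made-up inputs a modest prior is raised to roughly 10-to-1 odds. This shows why both a low false negative risk (large numerator) and a low false positive risk (small denominator) matter.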

Table 1. Prior Probability (Chance of Compound being in Sample)
general class | how to find | plausibility
cited as present in similar material | in specialist compilations/literature | already identified as possible component
known to exist, any reference to compound | on web, cited in literature, assigned registry number | the more citations, the more likely correct, especially if relevant
expert chemist opinion | experience with material and methods, stable, plausible product of known component | reasonable to find in analysis, stable, feasible product of known components
found in earlier experiments | analysis of the same/equivalent mixture by similar procedure | expect to find since was identified earlier

The mass spectrum of a given ion is, in effect, an energy-dependent property of the ion, and therefore, if the energy is controlled, the spectrum should be highly reproducible. Fragmentation paths open to the ion are determined by its structure or, at a deeper level, by its potential energy surface (PES). Subject to the PES, relative intensities of fragment ions are determined by the internal energy of the precursor. Fortunately, the way energy was acquired plays no role in fragmentation since energy relaxation is very fast.8 Energy distributions and fragmentation time scales depend primarily on the fragmentation method, and there are few methods in use.

For electron ionization, the earliest method and the primary one used for GC/MS, fragmentation occurs following the ejection of an electron from a molecule by a high energy electron. Here, the thermal energy of the molecule adds to the excitation energy of the ionization processes. This added energy is similar to that in photoelectron spectroscopy,9 which arises from a difference in the energy of the initially formed ion and its most stable “ground state” structure, a molecular property. This leads to the high reproducibility of these spectra, which do not significantly vary above an electron energy of about 30 eV. Since thermal energy increases with molecular size, fragmentation of larger molecules can be sensitive to the temperature of the molecules prior to ionization. The importance of this energy is made clear by the significant increase in abundance of higher mass ions when the molecules are cooled in supersonic-beam “cold EI”,10 an effect that increases with increasing molecular size.

Fragmentation in an ion trap proceeds by “slow heating” of the isolated ion,11 with fragmentation occurring when the dissociation rate exceeds the collision rate. “Resonance excitation” is the generally preferred activation method. Here, only the precursor ion is excited; its fragments cool upon formation, leading to high fragmentation reproducibility and little energy dependence.

A third, common fragmentation method accelerates an ion, typically at 5−50 eV, into a millitorr collision cell where its translational energy is converted, collision-by-collision, to internal energy.12 When the energy gained by collision is sufficient to dissociate the ion prior to the next collision, fragmentation occurs. This “beam-type” collision cell dissociation is the method used in quadrupole time-of-flight, “Higher Energy Dissociation” (HCD), or triple quadrupole (QQQ) instruments. This is also “slow heating”, though in this case ions acquire energy as they decelerate. Since product ions initially possess the same velocity as their precursor, they can continue to gain energy and fragment, producing spectra that depend on the collision energy of the ion entering the cell. Therefore, library reference spectra need to cover a range of energies.

So-called “post source decay” fragmentation in MALDI is also a multistep heating process and generates spectra not dissimilar to those in a triple quadrupole, though at an energy inherent to the ion generation process. High energy, single collision fragmentation, formerly the principal method of tandem mass spectrometry, is less common nowadays, finding use in MALDI-TOF-TOF collision cells. Other fragmentation methods employ chemical energy (chemical ionization and electron capture dissociation, for example) and excitation by light (infrared multiphoton dissociation, for example). Libraries are not available for these less common varieties of fragmentation.

Spectrum variations are most often associated with low signal strength or contaminant peaks. The ratio of the maximum to median peak intensity provides a simple measure of signal-to-noise that may be readily derived from a spectrum. Values lower than 3 rarely provide confident identifications while values over 10 are usually sufficient for identification. Impurity peaks from background or cofragmentation of contaminants are ever-present problems. Deconvolution software for GC/MS has been developed to extract ions based on the common chromatographic shape, thereby eliminating background signals and contributions from closely eluting components. The freely available AMDIS program5,13 is widely used for this purpose. On the other hand, for LC/MS-MS, cofragmentation is limited by selecting a narrow m/z range for fragmentation. However, even here, cofragmentation can be a serious problem, especially for low intensity precursors. The increasing number of precursor ions detectable at decreasing intensity means that cofragmentation becomes an ever greater risk as the intensity of the precursor declines. Cofragmentation of coeluting isomers, which of course have the same precursor mass, remains a general problem in both GC/MS and LC/MS.

One solution is the use of multiple-level ion trap fragmentation (MSn), whose spectra have been incorporated in libraries to distinguish isomeric glycans.14 In LC-MS/MS, cofragmenting ions with different precursor m/z values can be identified in the source MS1 spectra, though this appears not to be widely used at the present time. While various strategies can be applied to identify multiple components in a single spectrum, significant contamination will reduce identification confidence to unacceptably low levels; better chromatography is the preferred solution to this problem.

Another source of spectrum variation arises from chemical reaction prior to fragmentation. The decomposition of organophosphate esters prior to analysis by GC/MS led to both an error in our reference library and a research publication.15 Reaction of a precursor ion prior to fragmentation is a quite different problem, which can occur in the ion source or even in the collision cell. LC solvents can, for example, associate with analyte ions. A worrisome case was reported where fragmentation of an ion from the drug Norfloxacin depended strongly on the in-source energy setting of the instrument.16 This was traced to ion rearrangement caused by initial loss and subsequent gain of water after being formed in the electrospray.17 This study also found that, even after isolation in a collision cell or ion trap, sufficiently reactive ions can form solvent adducts.

Finally, spectrum variation can be caused by differences in instrument conditions such as different tuning or temperatures, often leading to systematic variations that depend only on m/z. However, as long as masses are sufficiently accurate, this will not often prevent an identification. While match algorithms will compensate for such m/z bias, the best scores will result when library and query spectra have been measured on the same instrument.
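The maximum-to-median intensity ratio described in this section as a simple signal-to-noise measure is easy to compute from a peak list. The sketch below is illustrative only: the function names and the three verdict labels are hypothetical, and the cut-offs of 3 and 10 are the rule-of-thumb values quoted in the text rather than validated thresholds.

```python
from statistics import median


def max_to_median_ratio(intensities):
    """Ratio of the largest peak intensity to the median peak intensity."""
    peaks = [x for x in intensities if x > 0]  # ignore empty/zero channels
    if not peaks:
        raise ValueError("spectrum has no peaks with positive intensity")
    return max(peaks) / median(peaks)


def identification_outlook(intensities, low=3.0, high=10.0):
    """Coarse verdict using the rule-of-thumb cut-offs (3 and 10) from the text."""
    ratio = max_to_median_ratio(intensities)
    if ratio < low:
        return "rarely confident"
    if ratio > high:
        return "usually sufficient"
    return "intermediate"


# Hypothetical peak lists: one dominated by a strong base peak, one nearly flat.
print(identification_outlook([100, 5, 4, 6, 3]))   # ratio 100 / 5 = 20
print(identification_outlook([10, 8, 9, 7]))       # ratio 10 / 8.5, below 3
```

A check of this kind is cheap enough to run on every query spectrum before searching, so that low-quality spectra can be flagged rather than silently matched.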



FALSE POSITIVE RISK AND SPECTRUM UNIQUENESS
The denominator of the last term, P(Score|FP), is the probability that a false hit will exceed the threshold Score for identification. Table 3 presents common sources of false positive results in library searching. For good quality spectra, these false positive results are primarily a consequence of the inability of mass spectrometry to distinguish between certain structural features. Positional isomers on aromatic rings are perhaps the best known cases (isomers of difluorobenzene, for example), but there are a variety of others (Figure 1). Reliable identification of a compound depends on uniqueness of its spectrum.

Table 3. Incorrect Library Identifications (False Positive Risk)
general class | specific type | cause | comment
lost structural information upon fragmentation | class identification | compounds sharing structure segment that generates ion fragments: remainder lost as neutral fragment | significant structural moiety not among fragment ions
lost structural information upon fragmentation | no molecular ion (GC-EI/MS) | unstable ionized molecules | may depend on ionizer temperatures
lost structural information upon fragmentation | one dominant peak (other than common losses) | one bond far weaker than all others in ion | lipids with charge at one end commonly yield few informative ions
accidental match | unrelated compounds with matching spectra | accidental fragment peak overlap | very rare for good quality spectra with more than a few peaks (RT or high resolution may resolve)
identical by MS | optical isomers | same fragmentation energetics | requires chiral separation
identical by MS | equilibrated mixtures | tautomers, conformers | most stable form usually dominates, libraries should link together
same connectivity | diastereomers, cis−trans (Z/E), conformers, ... | primarily noncovalent, proximity effects | peak intensities vary due to different fragmentation energies
isomers with minimal differences in fragmentation | aromatic ring positional isomers | fragment ions do not reveal connectivity | methyl benzenes, bromobiphenyls, ...
isomers with minimal differences in fragmentation | stable cyclic structures | isomerization generates indistinguishable products | naphthalene/azulene is classic case
uncertain structure in library | position of attachment, unsaturation or stereo guessed | structure details not known with certainty and have no predictable effect on spectra | based on assumptions in library or CAS rn number assignment, no accepted method for representing this uncertainty

Spectrum Uniqueness. Since a compound has a unique chemical structure, in principle, each compound might generate an equally unique spectrum. Unfortunately, fragmentation follows a limited set of low energy pathways and rarely provides evidence of all structural features. This also generally prevents the derivation of a unique structure from a spectrum alone. Most spectra simply do not contain enough information. To illustrate, the presence of a nitrogen atom in a molecule can strongly direct and limit the fragmentation of both ionized and protonated molecules, sometimes without leaving a trace of remote bonding. Moreover, rearrangement may scramble bonding in the molecule.

A major exception to this rule is found in “shotgun proteomics”, where, owing to the predictability of their fragmentation, peptides are identified solely from their tandem spectra using a list of possible amino acid sequences. As a result, libraries of peptide fragmentation spectra are now larger than the biggest “small molecule” spectral libraries.18 Moreover, this field has the luxury of being able to estimate “false discovery rates” because of the ability to construct appropriate libraries of false identifications; such measures of reliability are not available for other classes of compounds. However, libraries add considerable confidence to peptide identification since relative intensities of product ions are not well predicted and peptides in these libraries contain “prior probability” information.19,20

Requirements for spectrum uniqueness are, in principle, greatly reduced by the fact that only the compounds that might actually be present in the sample need to be considered. Unfortunately, no systematic means exists for generating the full list of these possibilities. A good match of a measured spectrum to a single library spectrum in a comprehensive library provides objective evidence of uniqueness. If several spectra match and all belong to related compounds, then a class identification is indicated and it is up to the expert to decide if any member of this class is correct. Benzene and fulvene (a nonaromatic isomer) have similar spectra and both would match a benzene spectrum. A benzene identification in this case requires a tacit application of prior probabilities. Moreover, if a library suitable for flame analysis contained many of the other isomers of C6H6, many more false identifications could be made in nonflame analysis; clearly, prior probabilities are implicit in making identifications of even the simplest compounds.

Class Identification. Since compounds with similar mass spectra generally have related structures, a large proportion of high scoring, false library hits are actually correct “class identifications”. We have examined this issue using our electron ionization library3 where each spectrum was searched against the library with exact matches omitted (Figure 2). Distributions of the highest scoring false hits at different scores are shown separately for cases where wrong hits have the same and different formulas as the query compound. Inspection of results shows that, when scores are high and the formula matches, structural isomers are almost always responsible. Lower scoring mismatches most often have different formulas than the query compounds and generate similar ions but have different neutral losses.

Figure 3 shows the distribution of numbers of false hits above a score of 800 (a “good” match21). Examination of these compounds shows that almost all (>90%) have significant structure similarity to the query compound. Others undergo known rearrangements but otherwise have similar fragmentation qualities. This figure also shows that nearly 2/3 of compounds in the library are unique (all false hits have scores less than 800). Note that some spectra generate enormous numbers of high scoring wrong identifications, E-3-Eicosene being the worst with 253 false IDs with scores above 800. This presents a challenge in hydrocarbon analysis, since the library is rich in linear hydrocarbons and differences in many of these spectra are small. In the area of plant metabolomics, alpha-pinene (C10H16) is a related example, having 40 different compounds with scores above 800, many of which are plausible identifications. In these cases, matching retention times is essential for identification by GC/MS. Unfortunately, LC retention is far more variable, which has prevented wide use of retention for identification in complex mixtures with LC/MS.

Structural similarity within some of these classes may only be evident to those with an understanding of fragmentation rules. An example of this is shown in Figure 1a, and even here, the similarity of fragmentation paths is only evident to most after examining the spectra. A very rare example of a coincidental match of a spectrum with more than a few peaks is shown in Figure 1b. Figure 1c shows a more common case where spectra are similar for two quite different compounds because they fragment primarily to the same product ion.

Fragment and Formula Recognition. The structural similarity of high scoring identifications was exploited to extract substructure information from the hit list even in the absence of a clear identification.22 Using the score-weighted occurrences of over 30 common structural features in the hit list, their presence or absence in the query compound could be derived. This includes features such as aromatic rings or carbonyl groups and quantities related to chemical formulas.

Peak Occurrences Are Highly Correlated. A spectrum with more than a few peaks can be highly distinctive; a spectrum of four peaks between m/z 50 and 250, with a 1 m/z precision, will match only 1 out of 1.6 billion randomly generated four peak spectra, without even considering peak intensities! However, the actual level of uniqueness is far lower since mass spectral peaks are very highly correlated; they depend on the highly nonrandom distribution of chemical structures. For example, the chance of finding two major peaks separated by 1 Da (hydrogen atom or isotope) or 14 Da (methylene) is more than five times greater than any separation between 5 and 10 Da.23 Of course, only m/z values corresponding to a subset of the elements in the fragmenting ion are allowed.
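The “1 out of 1.6 billion” figure can be checked with a few lines of arithmetic. The text does not say how it was obtained. One reading that reproduces it, assumed here, is that each of the four peaks is placed independently on one of 200 unit-mass positions between m/z 50 and 250, giving 200^4 possibilities. The sketch also shows the smaller count obtained if peak order is ignored and the four positions must be distinct.

```python
from math import comb


def count_independent_placements(n_positions, n_peaks):
    """Each peak is drawn independently from the available unit-mass positions."""
    return n_positions ** n_peaks


def count_distinct_peak_sets(n_positions, n_peaks):
    """Unordered sets of distinct positions (peak order ignored, no repeats)."""
    return comb(n_positions, n_peaks)


# Assumption: m/z 50 to 250 at 1 m/z precision is treated as 200 positions.
N_POSITIONS = 250 - 50
N_PEAKS = 4

print(count_independent_placements(N_POSITIONS, N_PEAKS))  # 1600000000
print(count_distinct_peak_sets(N_POSITIONS, N_PEAKS))      # 64684950
```

Either count is an upper bound on real discriminating power, for the reason the paragraph gives: peak positions in actual spectra are strongly correlated rather than random.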



SCORES A library search finds the library spectra that most closely match the query spectrum and assigns each a “goodness of fit” score for creating a sorted “hit list”. Ideally, this score measures the likelihood that a library compound generated the acquired spectrum. Two varieties of scoring methods are used in current 7277

dx.doi.org/10.1021/ac301205z | Anal. Chem. 2012, 84, 7274−7282

Analytical Chemistry

Feature

Figure 1. Varieties of false identifications (a) class identification, two compounds yielding the same set of ions at the same intensities due to similarity in structure and fragmentation pathways, (b) entirely different structures yielding ions of the same formulas, and (c) the same low mass ions from a substructure in common (benzoyl) for two very different precursor molecules.

they are more discriminating than lower mass peaks. “Reverse” scores assign no weight to peaks present only in the search spectrum, under the assumption that they arise from impurities. The second scoring method begins with individual peak occurrence statistics and then applies corrections to account for peak correlation and other factors; this method is called Probability Based Matching, PBM.24 However complex, except for their speed, experts should

commercial systems. One is based on a simple measure of similarity that can be interpreted as the cosine of the “angle” between two spectra, where each spectrum is a vector with an axis along each of its m/z values. It is, in effect, a normalized “dot product”21 of these two vectors. Intensity and m/z weighting corrections have been used to optimize performance. For example, higher mass peaks can be given greater weight since 7278

dx.doi.org/10.1021/ac301205z | Anal. Chem. 2012, 84, 7274−7282

Analytical Chemistry

Feature

the highest absolute score. Assuming a compound is in the library, the relative likelihood that any hit is correct depends on its relative position and score in the hit list.



EI-MS VS ESI-MS/MS While electron ionization libraries are routinely used for GC/MS analysis, tandem mass spectral libraries have only found a few specialized applications. Their relatively small size (a few thousand compounds) and limited availability (some are only available on a single instrument platform, some only as an image) are the principal reasons for this low usage. This limited size is partly a result of the far longer time for which EI libraries have been developed; the first electron ionization libraries were available on magnetic tape some 40 years ago,1 while the technology for readily acquiring high quality, reproducible tandem spectra has, arguably, only been around for 10 years. However, there are other reasons for this slow development: Reproducibility. It has long been known that electron ionization above 40 eV generate highly reproducible spectra, providing stable molecular fingerprints. Spectra acquired 60 years ago are virtually identical to those of today (Figure 4). Early collision-induced dissociation experiments (CID) employed high (kV) energy single collisions, and spectra were dominated by instrument-dependent scattering. Spectra from in-source fragmentation (at the vacuum inlet of an electrospray interface), widely used in earlier LC/MS instruments that were unable to select individual ions for fragmentation, were subject to interference from coeluting compounds and hence were unreliable sources of reference spectra. Moreover, a significant advantage of GC is the reproducibility of retention properties across instruments, facilitating the use of comprehensive libraries;26 reproducibility across platforms for LC is far less, and general purpose collections are not available. Energy Dependence. As discussed above, in beam-type, multiple-collision cells (Qtof, HCD, QQQ instruments), spectra depend greatly on collision energy, adding to the difficulty of identification by library searching. 
Some libraries record spectra at low, medium, and high energies.7,27 Some libraries deal with this problem by averaging over a range of energies,28 by standardizing conditions,29 or using less restrictive matching criteria.30 Our approach is to include spectra over as wide a range of energies as possible and let the software sort out the best hits for a given ion. Chemical Space. Unlike EI, electrospray and MALDI are not limited to volatile, thermally stable molecules, so the potential number of molecules that can be identified is far greater. At first glance, this might discourage the idea of building a comprehensive library. However, even EI libraries cover only a small fraction of theoretically possible volatile compounds. They focus instead on identifiable components found in mixtures of practical interest. The number of times a compound is identified is the best single measure of its utility in a mass spectral library. This same concept limits the number of spectra needed for a comprehensive library for many matrixes. In fact, even though they are available in abundance, we no longer add spectra of newly synthesized compounds to our library since it is very unlikely that they will be present in a practical analysis (except in the synthesizer’s lab). Some classes of compounds are quite tractable. The number of mammalian metabolites is thought to be in the low thousands, and the number of widely used pesticides, drugs, bulk chemicals, and their degradation products is not overwhelming. For some materials, notably plant metabolites, the potential number appears to be extremely

Figure 2. Distributions of best matching incorrect identifications. Separate curves are shown when the chemical formulas of the query and best matching wrong compounds are identical and when they are different. Another pair of curves is shown for cases where molecular weights are identical and are different. These curves used match factor bins of 8 units (maximum 1000). Results are based on results of searching each spectrum in the NIST 2011 EI library again the complete library.

Figure 3. Distribution of all matching identifications with scores above 800. Distributions of numbers of identifications for searching each spectrum in the 2011 NIST EI library against the full library (black circles) and fraction of identifications less than or equal the numbers of identifications (note that X includes the one correct identification for search).

always outperform any algorithm. Modern systems use some form of filtering to quickly reject spectra that cannot be correct; if the correct compound is in the library and does not appear in the hit list, this is the likely culprit.



IS THE COMPOUND IN THE LIBRARY? A common concern is whether a compound producing an unidentified spectrum is actually in the library. Moreover, even if it is, if there is more than one good match, it may not be clear which one is correct. These two factors have been separated, and a probability has been derived for each based on computed scores.25 These estimates are derived from results of searches of all replicate (alternate) spectra against the main library in the NIST/EPA/NIH Mass Spectral Library. The confidence that the compound (or its “class” equivalent) is in the library depends on 7279

dx.doi.org/10.1021/ac301205z | Anal. Chem. 2012, 84, 7274−7282


Figure 4. Electron ionization mass spectra of n-hexadecane measured 61 years apart. Upper spectrum from ref 38. Bottom spectrum measured at NIST on a quadrupole.

in the use of exact masses of product ions for spectrum matching purposes.
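The binning of peaks by integral mass discussed under High Resolution Libraries can be illustrated with a minimal sketch. It assumes that simple rounding of m/z to the nearest integer is an adequate definition of a bin, which is an assumption of this example and not a description of any particular search program. The function name and the peak values are hypothetical.

```python
from collections import defaultdict


def bin_to_integer_mass(peaks):
    """Sum intensities of all peaks whose m/z rounds to the same integer.

    `peaks` is an iterable of (m/z, intensity) pairs. Rounding to the
    nearest integer is an assumption made for this sketch.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(round(mz))] += intensity
    return dict(sorted(binned.items()))


# Hypothetical high-resolution peaks; two of them fall into the same unit-mass bin.
high_res = [(82.0651, 40.0), (82.0660, 10.0), (182.1176, 100.0), (105.0335, 25.0)]
unit_res = bin_to_integer_mass(high_res)
print(unit_res)
```

After binning, a query measured at any resolution can be compared with a unit-mass reference spectrum on the same integer m/z axis.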

large, and only a small fraction has been identified. However, even here, if the focus is on a limited number of materials, the problem may be tractable. On the other hand, identification by tandem MS libraries offers some advantages. First, unlike EI-MS, it always reveals the molecular ion, greatly adding to identification confidence. Equally significant is its ability to select ions by m/z for fragmentation, providing a means of avoiding contaminant peaks. While many of the compounds in EI and ESI libraries may overlap, virtually all precursor ions are different. However, simple bond breaking in an EI radical cation (odd electron) generally leads to a relatively stable even electron cation, which can be a product or even the precursor of a protonated molecule in ESI or MALDI. For example, many of the fragment ions from ionized and protonated cocaine are the same, including the two major products m/z 182 and 82. The simple cleavage leading to loss of the benzoyl moiety in ionized cocaine leads to the same ion as loss of benzoic acid in protonated cocaine. Searching the EI library with tandem MS spectra can, in fact, greatly aid structure elucidation.31 A search mode in the NIST MS Search Program is specially designed for this purpose.3 It matches peaks in the query spectrum that are also present in library spectra, searching for tandem spectra that are, in effect, embedded within EI spectra.
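The idea of a tandem spectrum being "embedded" within an EI spectrum can be sketched as follows. This is only an illustration of the concept and is not the algorithm used in the NIST MS Search Program: the scoring function, the compound names, and the peak lists are all hypothetical. The score is the fraction of query intensity carried by peaks whose integer m/z also occur in the library spectrum, and library peaks absent from the query are ignored.

```python
def embedded_fraction(query, library):
    """Fraction of query intensity found at m/z values present in `library`.

    Both spectra are dicts mapping integer m/z to intensity. Extra peaks in
    the library spectrum are not penalized, so a query can score highly when
    it is contained within a richer library spectrum.
    """
    total = sum(query.values())
    if total == 0:
        return 0.0
    matched = sum(i for mz, i in query.items() if mz in library)
    return matched / total


# Hypothetical tandem MS query and a two-entry hypothetical EI library.
query_msms = {182: 100.0, 82: 60.0, 150: 40.0}
ei_library = {
    "compound A": {303: 20.0, 182: 100.0, 105: 30.0, 82: 90.0, 150: 5.0, 77: 25.0},
    "compound B": {182: 50.0, 91: 100.0, 65: 20.0},
}

ranked = sorted(
    ei_library,
    key=lambda name: embedded_fraction(query_msms, ei_library[name]),
    reverse=True,
)
print(ranked)
```

A practical implementation would also weigh how well the relative intensities agree; this sketch tests only for peak presence.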



LIBRARY QUALITY Reference libraries are an integral part of many instruments, and most users presume their reliability. Library builders know, however, of the many possible sources of error. The “annotation” in particular is especially prone to error. Perhaps the worst case is an excellent spectrum but for the wrong compound! Over the years, a variety of computer-assisted quality control methods have been developed, but we have found that a human analyst is required to make the final decisions about whether a spectrum should be added to the library.33 Even so, given our inability to predict spectra, the best assurance of reliability of the compound’s spectrum is an independent determination. Four different aspects of library quality are described below. Spectrum Accuracy and Completeness. A spectrum should contain all major fragments down to 1% to 2% of the largest peak and from m/z 40 to the precursor isotopes. Peak splitting, m/z rounding errors, and peak saturation are common concerns. Compounds with major contaminant peaks are not added to the library. No effort is made to “improve” spectra by deleting or altering peaks. Comprehensiveness. The value of an entry in a comprehensive library is, simply put, proportional to the number of times the compound is identified by users. Since there is no way to measure this value, experts must select the compounds for addition based on perceived needs of users. In principle, library building can cease once all identifiable components of a mixture are included. However, as measurement sensitivity increases, new compounds will inevitably be revealed and library building is likely to continue at its present pace for the foreseeable future (Figure 5). Energy Dependence. For collision-energy-dependent spectra, libraries should include spectra over the range of practical energies (from very low to high extents of decomposition). Chemical Structure. 
A large portion of the effort in maintaining a library is devoted to drawing and correcting chemical structures. These are essential for users to quickly make



HIGH RESOLUTION LIBRARIES As the resolution of mass spectrometers increases, so does their ability to convert acquired m/z values into chemical formulas. For a compound not in the library, this can greatly aid structure determination, as illustrated in recently proposed “fragmentation trees”, which can be “aligned” with spectra from libraries to find structurally related compounds.32 Accurate precursor mass is also helpful in reducing the number of potential false positive identifications in a library search. However, once a reference spectrum has been determined, formulas for most peaks can be deduced from fragmentation rules. Furthermore, we have found no cases where combining (binning) peaks according to their integral mass adversely affects search results. In fact, doing so eliminates concerns about the resolution of a particular measurement. Consequently, there appears to be little benefit


The most commonly occurring compounds in the library can then be subjected to further attempts at structure elucidation. Once identified, that knowledge can be readily available for all future analyses where the compound appears. We have developed such a library for urine and are pursuing this strategy for human plasma, essential oils, and eventually plant metabolites. Fiehn has already reported progress in this area.36



THE BIG PICTURE: RUMSFELD’S QUADRANTS We now present an overview of the process of identification by library searching. This approach extends a concept first expressed by Donald Rumsfeld, former U.S. Secretary of Defense, who while commenting on uncertainties in war presented the concept of “known knowns” and “unknown unknowns”.37 In this spirit, we divide the outcome of library searching into four “Rumsfeld Quadrants” (Figure 6), formed by the intersection of two yes/no

Figure 5. Numbers of compounds and replicate spectra in the NIST/EPA/NIH mass spectral library of electron ionization spectra over the period of distribution, beginning with the EPA/NIH “Red Books”.39

identifications, but the underlying “connection table” format is also needed to efficiently find identical and similar molecules. Ambiguities in chemical structure, caused by tautomerism (fast rearrangements, usually due to H-migration), and uncertainties in stereochemistry are a continual source of uncertainty for both users and evaluators but are not reliably handled by the naming and registry systems in common use. A method for generating a “canonical” (unique) string of characters (its digital “name”) for a specified structure was developed specifically for this purpose, where ambiguities can be explicitly represented in different “layers”. This is the basis of the InChI, the IUPAC Chemical Identifier,34 which is central to finding identical structures in our libraries and is fast becoming a popular general method for representing chemical structures in other areas of chemistry. Fortunately, in mass spectrometry, the stereo and tautomer “layers” generally have little influence on mass spectra, so these layers can usually be safely ignored when locating spectra.
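Ignoring the stereo layers when locating spectra can be sketched with plain string handling. The example assumes that dropping every layer beginning with /b, /t, /m, or /s is sufficient to remove stereochemistry; the function name is hypothetical, and the result is meant only as a comparison key, not as a valid InChI in its own right. Tautomer handling is not addressed here.

```python
# Layer prefixes that carry stereochemistry in an InChI string:
# /b (double bond), /t (tetrahedral), /m (inversion flag), /s (stereo type).
STEREO_PREFIXES = ("b", "t", "m", "s")


def strip_stereo_layers(inchi):
    """Return the InChI with all stereochemical layers removed."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers if layer[:1] not in STEREO_PREFIXES]
    return "/".join([head, *kept])


l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

# Both enantiomers reduce to the same key, so either would retrieve the same spectra.
print(strip_stereo_layers(l_alanine) == strip_stereo_layers(d_alanine))
```

Because the formula layer always begins with an uppercase element symbol or a digit, it cannot be mistaken for one of the lowercase stereo prefixes.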



RECURRENT, UNIDENTIFIED SPECTRA For the foreseeable future, a sizable fraction of components in practical samples will remain unknown. Guessing the identity of a truly unknown constituent and then synthesizing it for confirmation can be very expensive and may even fail, so few will undertake the effort. If no similar spectrum is found in a library, identification is often futile. Further, as instrument sensitivity improves and lower concentration compounds are observed, the problem will worsen. The problem is already acute for plant derived metabolites35 but is an issue in the analysis of virtually any biomaterial. High resolution is helpful but rarely sufficient to reliably identify an isomer. We are developing a practical way of dealing with this problem. The approach is to derive unidentified, good quality recurring spectra for materials of general interest and make them available as a supplemental library. Such materials would include plasma, urine, and common reference materials and matrixes. A key to this analysis is the use of reliable spectrum extraction tools; the AMDIS program is being extended for this purpose for GC/MS. Such work is more straightforward for LC/MS-MS where cofragmentation is less of a problem for major components, especially if high resolution instruments are used. However, methods for extracting reliable spectra for single compounds and then combining them to form a consensus spectrum are still a developing area. Such libraries promise to identify a large fraction of currently unidentified, but commonly observed components by their spectra, which along with their retention characteristics can both link an unidentified compound to other analyses and find correlations with identified compounds.
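One simple way to combine replicate spectra into a consensus spectrum is sketched below. The text notes that this is still a developing area, so the scheme is only an illustrative assumption: each replicate is scaled to a base peak of 100, peaks seen in fewer than half of the replicates are discarded, and the median intensity is kept. The function names, the `min_fraction` parameter, and the peak values are hypothetical.

```python
from statistics import median


def normalize(spectrum, base=100.0):
    """Scale a {m/z: intensity} spectrum so its largest peak equals `base`."""
    top = max(spectrum.values())
    return {mz: base * intensity / top for mz, intensity in spectrum.items()}


def consensus_spectrum(replicates, min_fraction=0.5):
    """Median-intensity consensus of peaks present in enough replicates."""
    normalized = [normalize(s) for s in replicates]
    all_mz = set().union(*normalized)
    consensus = {}
    for mz in sorted(all_mz):
        values = [s[mz] for s in normalized if mz in s]
        # Keep a peak only if it recurs in at least `min_fraction` of replicates.
        if len(values) / len(normalized) >= min_fraction:
            consensus[mz] = median(values)
    return consensus


# Three hypothetical replicate spectra of the same unidentified component.
replicates = [
    {57: 200.0, 71: 120.0, 85: 80.0},
    {57: 50.0, 71: 31.0, 85: 19.0, 99: 2.0},
    {57: 1000.0, 71: 590.0, 85: 410.0},
]
result = consensus_spectrum(replicates)
print(result)
```

The peak at m/z 99 appears in only one of the three replicates and is dropped, which is how a sporadic contaminant peak would be treated under this assumption.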

Figure 6. “Rumsfeld Quadrants” showing the intersection of yes/no answers for whether analysts expect a compound to be identified in the sample (prior probability) and whether it was identified in a library search.

questions: (1) is the compound expected to be in the mixture (its prior probability) and (2) has the compound been reliably identified by library searching (its score exceeds the identification threshold)? The upper left quadrant represents “known knowns”, the confirmation of expectations. This confirms both the expertise of the analyst and the quality of the experiment. The upper right quadrant, “unknown knowns”, expresses the value of a comprehensive library, finding correct identifications that were not expected. The lower left quadrant, “known unknowns” represents a failure of the process, either an erroneous expectation of the analyst or a failure of the library search process. The expected component was not identified. Finally, the lower right quadrant, “unknown unknowns” remains a principal challenge to identification. Library searching reveals a spectrum for a compound unknown to both the analyst and the library. Improving this situation is the motivation of the “recurrent spectrum” library described above, where all compounds observable by mass spectrometry, but as yet unidentified, will reside.
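The two yes/no questions that define the quadrants can be written as a small classifier. This is a sketch under stated assumptions: the function names are hypothetical, and the default threshold of 800 is chosen for illustration only, since the appropriate identification threshold depends on the application.

```python
def rumsfeld_quadrant(expected, identified):
    """Map the two yes/no answers to the corresponding quadrant label."""
    if identified:
        return "known known" if expected else "unknown known"
    return "known unknown" if expected else "unknown unknown"


def classify_hit(score, expected, threshold=800):
    """Treat a compound as identified when its score exceeds `threshold`.

    The default threshold is an illustrative assumption, not a recommended value.
    """
    return rumsfeld_quadrant(expected, score > threshold)


for score, expected in [(920, True), (920, False), (450, True), (450, False)]:
    print(score, expected, classify_hit(score, expected))
```

In this framing the prior probability is reduced to a yes/no expectation, which mirrors Figure 6 but discards the graded confidence an analyst would normally carry.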



THE FUTURE Mass spectral reference libraries are storehouses of fragmentation properties of compounds identifiable by mass spectrometry.


It is not clear when libraries will be “big enough”, though this answer will depend on the application, the analyst, and developments in instrumentation. As mass spectrometers continue to increase in sensitivity and resolution and the knowable content of a mixture increases, both library expansion and software development will be needed to manage this increase in complexity. Applications to tandem MS are still in their early stages, and extension to recurring, unidentified spectra (unknown unknowns) has just begun. The tools will permit an increasingly detailed analysis of complex mixtures that will further enhance the ever-increasing power of the mass spectrometers themselves.



(9) Turner, D. D.; Baker, C.; Baker, A. D.; Brundle, C. R. Molecular Photoelectron Spectroscopy; Wiley: New York, 1970.
(10) Amirav, A. J. Phys. Chem. 1990, 94, 5200−5202.
(11) McLuckey, S. A.; Goeringer, D. E. J. Mass Spectrom. 1997, 32, 461−474.
(12) Knyazev, V. D.; Stein, S. E. J. Am. Soc. Mass Spectrom. 2009, 21, 425−439.
(13) Stein, S. E. J. Am. Soc. Mass Spectrom. 1999, 10, 770−781.
(14) Zhang, H.; Singh, S.; Reinhold, V. N. Anal. Chem. 2005, 77, 6263−6270.
(15) Maney, J. P.; Miller, G. C.; Comeau, J. K.; Van Wyck, N. E.; Fencl, M. K. Environ. Sci. Technol. 1995, 29 (8), 2147−2149.
(16) Kaufmann, A.; Butcher, P.; Maden, K.; Widmer, M.; Giles, K.; Uria, D. Rapid Commun. Mass Spectrom. 2009, 23, 985.
(17) Neta, N.; Godugu, B.; Liang, Y.; Simon-Manso, Y.; Yang, X.; Stein, S. E. Rapid Commun. Mass Spectrom. 2010, 24, 3271−3278.
(18) NIST Libraries of Peptide Tandem Mass Spectra; http://peptide.nist.gov.
(19) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; Stein, S. E.; Aebersold, R. Nat. Methods 2008, 5 (10), 873−875.
(20) Craig, R.; Cortens, J. C.; Fenyo, D.; Beavis, R. C. J. Proteome Res. 2006, 5, 1843−1849.
(21) Stein, S. E.; Scott, D. R. J. Am. Soc. Mass Spectrom. 1994, 5, 859−866.
(22) Stein, S. E. J. Am. Soc. Mass Spectrom. 1995, 6, 644−655.
(23) Stein, S. E.; Heller, D. N. J. Am. Soc. Mass Spectrom. 2006, 17, 823−835.
(24) McLafferty, F. W.; Hertel, R. H.; Villwock, R. D. Org. Mass Spectrom. 1974, 9, 690−702.
(25) Stein, S. E. J. Am. Soc. Mass Spectrom. 1994, 4, 316−323.
(26) Babushok, V. I.; Linstrom, P. J.; Reed, J. J.; Zenkevich, I. G.; Brown, R. L.; Mallard, W. G.; Stein, S. E. J. Chromatog., A 2007, 1157, 414−421.
(27) Horai, H.; et al. J. Mass Spectrom. 2010, 45, 703−714; http://www.massbank.jp.
(28) Josephs, J. L.; Sanders, M. Rapid Commun. Mass Spectrom. 2004, 18 (7), 743−759.
(29) Lemire, S. W.; Busch, K. L. J. Mass Spectrom. 1996, 31, 280−288.
(30) Bristow, A. W. T.; Webb, K. S.; Lubben, A. T.; Halket, J. Rapid Commun. Mass Spectrom. 2004, 18, 1447−1454.
(31) Weissberg, A.; Dagan, S. Int. J. Mass Spectrom. 2011, 299, 158−168.
(32) Rasche, R.; Scheubert, K.; Hufsky, F.; Zichner, T.; Kai, M.; Svatoš, A.; Böcker, S. Anal. Chem. 2012, 84, 3417−3426.
(33) Ausloos, P. A.; Clifton, C. L.; Lias, S. G.; Mikaya, A. I.; Stein, S. E.; Tchekovskoi, D. V.; Sparkman, O. D.; Zaikin, V.; Zhu, D. J. Am. Soc. Mass Spectrom. 1999, 10 (1999), 287−299.
(34) The IUPAC International Chemical Identifier (InChI): http://www.iupac.org/home/publications/e-resources/inchi.html.
(35) Fiehn, O.; Kopka, J.; Trethewey, R. N.; Willmitzer, L. Anal. Chem. 2000, 72, 3573−3580.
(36) Skogerson, K.; Wohlgemuth, G.; Barupal, D. K.; Fiehn, O. BMC Bioinf. 2011, 12 (1), 321−336.
(37) Rumsfeld, D. “There are known knowns”; http://en.wikipedia.org/wiki/There_are_known_knowns.
(38) O’Neal, M. J.; Weir, T. P. Anal. Chem. 1950, 23 (6), 830−843.
(39) Heller, S. R.; Milne, G. W. A. EPA/NIH Mass Spectral Data Base, Nat. Stand. Ref. Data Ser.; U.S. Government Printing Office: Washington D.C., 1978; Volume 63, issue 3.

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.

Biography

Steve Stein obtained his B.S. degree in Chemistry at the University of Rochester in 1969 and Ph.D. in Physical Chemistry from the University of Washington under Prof. B.S. Rabinovitch in 1974. His thesis work involved experimental and theoretical aspects of unimolecular decomposition kinetics, areas which he has continued to use to the present day, switching from neutral to ionic reactants along the way. As a postdoc and staff member at SRI International under Sidney Benson, he learned the practical tools of “Thermochemical Kinetics” and pursued them for some time in the Chemistry Department of West Virginia University. He joined the National Bureau of Standards (now the National Institute of Standards and Technology) over 30 years ago and discovered the joys of reference databases some 20 years ago. He has led the development of the NIST Mass Spectral Databases for most of this time, expanding them to include tandem mass spectra and proteomics, as well as trying to more closely link chemical structures with spectra.



ACKNOWLEDGMENTS The author wishes to acknowledge D.V. Tchekhovskoi of NIST for his assistance in data analysis and Robert Gates ([email protected]) for his artwork. Any mention of commercial items in this document does not imply recommendation or endorsement by the National Institute of Standards and Technology nor does it imply that the products identified are necessarily the best available for the purpose.



REFERENCES

(1) Stenhagen, E.; Abrahamson, S.; McLafferty, F. W. Atlas of Mass Spectral Data (Computer Tape); Wiley: Canada, 1969.
(2) McLafferty, F. W. Wiley Registry of Mass Spectral Data, 9th ed.; Wiley: Hoboken, NJ, 2009.
(3) NIST/NIH/EPA Mass Spectral Library, Standard Reference Database 1, NIST 11. Standard Reference Data Program, National Institute of Standards and Technology: Gaithersburg, MD, USA, 2011.
(4) Howson, C.; Ulrich, P. Scientific Reasoning: The Bayesian Approach, 2nd ed.; Open Court Publishing Co., Chicago, IL, 1992.
(5) NIST Mass Spectral Search Program, v. 2.0f, and Automated Mass Spectral Deconvolution and Identification Software programs are freely available at http://chemdata.nist.gov/mass-spc/ms-search/ with test library.
(6) Little, J. L.; Williams, A. J.; Pshenichnov, A.; Tkachenko, V. J. Am. Soc. Mass Spectrom. 2012, 23, 179−185.
(7) Wishart, D. S.; Knox, C.; Guo, A. C.; et al. Nucleic Acids Res. 2009, 37 (Database issue), D603−D610; http://www.hmdb.ca.
(8) Baer, T.; Hase, W. L. Unimolecular Reaction Dynamics; Oxford University Press, New York, 1996.
