Lookup Peaks: A Hybrid of de Novo Sequencing and Database Search for Protein Identification by Tandem Mass Spectrometry

Marshall Bern,* Yuhan Cai,† and David Goldberg

Anal. Chem. 2007, 79, 1393-1400
Publication Date (Web): January 23, 2007

Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, California 94304

A powerful technique for peptide and protein identification is tandem mass spectrometry followed by database search using a program such as SEQUEST or Mascot. These programs, however, become slow and lose sensitivity when allowing nonspecific cleavages or peptide modifications. De novo sequencing and hybrid methods such as sequence tagging offer speed and robustness for wider searches, yet these approaches require better spectra with more complete and consecutive fragmentation and, hence, are less sensitive to low-abundance peptides. Here we describe a new hybrid method that retains the sensitivity of pure database search. The method uses a small amount of de novo analysis to identify likely b- and y-ion peaks, called “lookup peaks”, that can then be used to extract candidate peptides from the database, with the number of candidates tunable to fit a computing budget. We describe a program called ByOnic that implements this method, and we benchmark ByOnic on several data sets, including one of mouse blood plasma spiked with low concentrations of recombinant human proteins. We demonstrate that ByOnic is more sensitive than sequence tagging and, indeed, more sensitive than the three most popular pure database search tools (SEQUEST, Mascot, and X!Tandem) on both the peptide and protein levels. On the mouse plasma samples, ByOnic consistently found spiked proteins missed by the other tools.

A central problem of proteomics is the identification of proteins and their modifications in biological samples. One approach separates the proteins by gel electrophoresis, excises and digests selected spots, and then analyzes the resulting peptide mixture using mass spectrometry. A higher-throughput approach, called “shotgun proteomics”, digests the proteins first, separates the resulting complex peptide mixture using one or more stages of liquid chromatography (LC), and then flows the sample directly into the mass spectrometer using electrospray ionization (ESI).
In either case, tandem mass spectrometry (denoted MS/MS) has proved to be the most sensitive identification technique, often able to identify proteins from just one or two peptides. Tandem mass spectrometry employs an initial round of mass measurement, followed by fragmentation of ions in a selected mass-over-charge range and a second round of measurement.

Database search programs, such as SEQUEST (1), Mascot (2), OMSSA (3), and X!Tandem (4), have been used to analyze tandem mass spectra of peptides for ∼10 years. These programs extract a set of candidate peptides from a protein database using parent ion mass and possibly cleavage specificity and then score the candidates using the peaks observed in the tandem spectrum.

A weak point of database search is the identification of modified peptides. A modification shifts the parent ion mass and many of the peaks within a spectrum, disrupting both the lookup and the scoring phases. SEQUEST (5, 6) and Mascot handle this difficulty by expanding the database to include anticipated modifications. When allowing many types of modifications, this approach becomes slow on even a medium-size database and loses sensitivity on unmodified peptides (7) as the number of candidates explodes. Hence Mascot (8) and X!Tandem (9) offer a mode that builds a small database containing only the proteins found by unmodified peptides and then considers modifications of this small database. This solution, however, also suffers from some limitations: it cannot find a protein that is represented only by modified peptides; it can find only anticipated modifications; and it may increase the difficulty of separating true identifications from false positives, as it unpredictably increases the number of proteins with two or more chance peptide identifications.

* Corresponding author: (tel) (650) 812-4443; (fax) (650) 812-4471; (e-mail) [email protected].
† Present address: Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195.
10.1021/ac0617013 CCC: $37.00. © 2007 American Chemical Society. Published on Web 01/23/2007. Analytical Chemistry, Vol. 79, No. 4, February 15, 2007.

(1) Eng, J.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(2) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(3) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. J. Proteome Res. 2004, 3, 958-964.
(4) Craig, R.; Beavis, R. C. Bioinformatics 2004, 20, 1466-1467.
(5) MacCoss, M. J.; McDonald, W. H.; Saraf, A.; Sadygov, R.; Clark, J. M.; Tasto, J. J.; Gould, K. L.; Wolters, D.; Washburn, M.; Weiss, A.; Clark, J. I.; Yates, J. R., 3rd. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 7900-7905.
(6) Yates, J. R., 3rd; Eng, J. K.; McCormack, A. L.; Schieltz, D. Anal. Chem. 1995, 67, 1426-1436.
(7) Chamrad, D. C.; Korting, G.; Stuhler, K.; Meyer, H. E.; Klose, J.; Bluggel, M. Proteomics 2004, 4, 619-628.
(8) Creasy, D. M.; Cottrell, J. S. Proteomics 2002, 2, 1426-1434.
(9) Craig, R.; Beavis, R. C. Rapid Commun. Mass Spectrom. 2003, 17, 2310-2316.

Another approach computes an amino acid sequence de novo, that is, without reference to the database, and then uses this sequence as a probe to extract a small set of candidates that do not necessarily match the observed parent ion mass. The candidates can then be scored relatively quickly, perhaps using an error-tolerant scoring function (10), thereby alleviating the limitations above. The probe sequence may be either a single long sequence that is matched in an error-tolerant way, as in SPIDER (11) or OpenSea (12), or a number of short (usually 3-letter) sequence tags, at least one of which must match exactly, as in GutenTag (13), MultiTag (14), or InsPecT (15), all implementations of Mann and Wilm’s original idea (16). This hybrid approach requires a somewhat better spectrum, that is, one with more complete fragmentation and fewer noise peaks, than database search, because it relies on the presence of a “ladder” of successive peaks for the computation of the probe sequence. Hence it generally loses some of the unmodified matches, but makes up for the loss by finding more modified peptides. For example, on a popular test set (17) of 18 940 spectra of reference proteins published by the Institute for Systems Biology (ISB), SEQUEST run without modifications finds 2756 valid identifications. InsPecT loses 488 of these identifications (15), but then recoups the loss by adding 947 new identifications for a total of 3215. Not all of the new identifications are modifications; many of them are unmodified peptides found by improvements in scoring.

In this paper, we describe a new hybrid approach, implemented in a program called ByOnic, that does not require a ladder of spectral peaks and, hence, can identify spectra containing modified peptides and mixtures of peptides without losing sensitivity on unmodified peptides. In fact, ByOnic is more sensitive than the most popular database search programs. On the ISB data set, ByOnic makes 3420 valid identifications when not allowing modifications and 3863 when modifications are enabled, for sensitivity improvements of 24% over SEQUEST and 20% over InsPecT. We also demonstrate ByOnic’s sensitivity on samples of greater relevance for biomarker discovery, such as human blood plasma and mouse plasma spiked with recombinant human proteins.
On samples containing 1 µg of spiked proteins/mL, ByOnic averaged 8.33 out of 13 spiked proteins detected (meaning at least two matched spectra), whereas Mascot averaged 6.0.

The key novelty of ByOnic is its algorithm for finding candidates in the database. ByOnic uses both parent ion mass (in the case of modifications, with a suitably wide tolerance) and a small number of peaks within the spectrum to find candidates. The peaks are ones that the program judges to be b- or y-ion peaks, due to their intensities and relationships to other peaks in the spectrum. The closest idea already in the literature appears to be Tang et al.’s (18) use of the 40 tallest spectral peaks for indexed lookup of tryptic peptides.

(10) Pevzner, P. A.; Dancik, V.; Tang, C. L. J. Comput. Biol. 2000, 7, 777-787.
(11) Han, Y.; Ma, B.; Zhang, K. J. Bioinf. Comput. Biol. 2005, 3, 697-716.
(12) Searle, B. C.; Dasari, S.; Wilmarth, P. A.; Turner, M.; Reddy, A. P.; David, L. L.; Nagalla, S. R. J. Proteome Res. 2005, 4, 546-554.
(13) Tabb, D. L.; Saraf, A.; Yates, J. R., 3rd. Anal. Chem. 2003, 75, 6415-6421.
(14) Sunyaev, S.; Liska, A. J.; Golod, A.; Shevchenko, A.; Shevchenko, A. Anal. Chem. 2003, 75, 1307-1315.
(15) Tanner, S.; Shu, H.; Frank, A.; Wang, L. C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V. Anal. Chem. 2005, 77, 4626-4639.
(16) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, 4390-4399.
(17) Keller, A.; Purvine, S.; Nesvizhskii, A. I.; Stolyar, S.; Goodlett, D. R.; Kolker, E. Omics 2002, 6, 207-212.
(18) Tang, W. H.; Halpern, B. R.; Shilov, I. V.; Seymour, S. L.; Keating, S. P.; Loboda, A.; Patel, A. A.; Schaeffer, D. A.; Nuwaysir, L. M. Anal. Chem. 2005, 77, 3931-3946.

On the continuum from “pure” database search (SEQUEST, Mascot, OMSSA, X!Tandem) through sequence tagging (GutenTag and InsPecT for short tags, SPIDER and OpenSea for long tags) to pure de novo sequencing (Lutefisk, PEAKS, PepNovo, EigenMS), the lookup-peak technique sits just to the left of sequence tagging. The sequence tagging approach taken by GutenTag and InsPecT uses three-letter sequence tags, computed from the spectrum de novo, to find initial database matches, which are then checked for flanking masses and other corroborating information. The lookup-peak technique is “sequence tagging without the tags”, because each lookup peak is essentially the same as a flanking mass, that is, the putative total mass of a prefix or suffix of the peptide. Because the technique avoids the error-prone computation of the three-letter sequence, it suffers very little loss of sensitivity from pure database search; nevertheless, it retains most of the high filtering power of sequence tagging. For example, ByOnic might use two integers, say 500 and 615, that appear to be (the integer parts of) b- or y-ion peaks. Assuming that 500 and 615 do not form a complementary pair summing to the parent ion mass, this pair of integers filters the database by a factor of ∼55 times 18, or ∼1000. A random peptide contains a b-ion at 500 with probability ∼1/110 because residues average ∼110 Da, and similarly it contains a y-ion at 500 with probability ∼1/110, for a total probability of ∼1/55 for either a b- or y-ion. (The probability that a smaller integer such as 200 is the mass of a b- or y-ion depends upon the number of ways it can be formed as a sum of residue masses, but for integers larger than ∼400, these initial fluctuations smooth out.) Given that the peptide contains a b- or y-ion at 500, it contains one at 615 with probability ∼1/18, because there are 18 distinct amino acid residue masses, counting K (128.09 Da) and Q (128.06 Da) as indistinguishable.
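The estimate for the (500, 615) example can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the reasoning in the preceding paragraph; the function and constant names are ours, not part of ByOnic.

```python
# Back-of-the-envelope filtering factor for two lookup peaks one residue apart.
# Constants are the approximations quoted in the text: residues average ~110 Da,
# and there are 18 distinguishable residue masses (I/L identical, K/Q merged).
AVG_RESIDUE_MASS = 110.0
DISTINCT_RESIDUE_MASSES = 18


def single_peak_factor() -> float:
    """Filtering factor from one integer matching either a b-ion or a y-ion."""
    p_b = 1.0 / AVG_RESIDUE_MASS  # chance that a random peptide has a b-ion there
    p_y = 1.0 / AVG_RESIDUE_MASS  # same chance for a y-ion
    return 1.0 / (p_b + p_y)      # about 55


def adjacent_pair_factor() -> float:
    """Filtering factor from two integers one residue mass apart, e.g. (500, 615)."""
    return single_peak_factor() * DISTINCT_RESIDUE_MASSES


if __name__ == "__main__":
    print(round(single_peak_factor()))    # 55
    print(round(adjacent_pair_factor()))  # 990, i.e. roughly 1000
```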
In the case that ByOnic uses two far-apart integers, say 500 and 1000, the filtering factor increases to ∼55², or ∼3000, because the occurrence of a b- or y-ion at 500 is nearly independent of such an occurrence at 1000. If ByOnic uses three integers with two of them separated by a residue mass and the third far away, it achieves a filtering factor of ∼55² times 18, or ∼54 000. These back-of-the-envelope calculations turn out to be reasonably good estimates, with (500, 615) giving an empirical filtering factor of 815, (500, 1000) giving 3870, and (500, 615, 1000) giving 49 260 for human peptides of mass between 1525 and 1555. For comparison, a three-letter sequence tag, without flanking masses, matches an arbitrary position in the protein database with probability ∼1 in 18³, or ∼1 in 6000. To compute such a tag, the de novo analysis must find peaks in the spectrum representing four successive cleavages (or three cleavages and a terminus), a more difficult task than simply finding two or three not-necessarily sequential b- or y-ion peaks.

Lookup peaks, on the other hand, can incur a slight cost in error tolerance. A three-letter tag in a 15-residue peptide has an 80% chance of still matching the database after a single modification, whereas a lookup peak has only a 50% chance, because a b-ion peak shifts with a modification to its left and a y-ion peak shifts with one to its right. However, if the parent ion mass M is known to the closest integer, then we can use both a peak and its complement, that is, both i and M − i. One peak from a complementary pair always survives a single modification, and hence, in the case of known parent mass and a single modification, lookup peaks lose no sensitivity and theoretically still outperform sequence tagging. Of course, if there are two modifications, one at each end of the peptide, lookup peaks may fail where sequence tagging succeeds, but this case should be relatively rare in most samples.

METHODS

In this section, we briefly describe the algorithms behind ByOnic. A complete description is given in the Supporting Information. There are three basic steps: (1) computing lookup peaks, (2) searching the database to find candidates, and (3) scoring the candidates.

Lookup Peaks. We compute lookup peaks in pairs, even though we generally use the peaks individually in the database search. Each pair (i, j) corresponds to a pair of peaks in the spectrum (a ladder of just two peaks), at masses m and m′, such that m′ − m is the mass of an amino acid residue (within an instrument-dependent tolerance) and i and j are the “integer parts” of m and m′. We define the integer part of a mass m to be the closest integer to 0.9995m in order to correct for the “mass defects” of amino acid residues. Amino acid residues average 111 Da with an average fractional part (mass defect) of 0.056, and hence a peptide of mass m will have an average mass defect of (0.056/111)m = 0.0005m. Using our definition, a peptide with mass 1814.1 has integer part 1813, which (with overwhelming probability) matches the sum of the integer parts of its amino acid residue masses. We compute 14 (i, j) pairs, typically containing 15-20 distinct peaks; these are the lookup peaks. If the user indicates that the integer part of the parent ion mass M is known, we also include the 14 complementary pairs (M − j, M − i) for a total of 28 pairs. The number of (i, j) pairs is not critical; a larger number of pairs gives somewhat finer control of the filtering factor, because the step from 3 to 4 matches out of 30 lookup peaks is smaller than the step from 2 to 3 matches out of 20.

To compute pairs, we first process the spectrum by manipulating the intensities of peaks. We start by normalizing the raw intensities (ion counts) by a measure of local intensity, in order to measure the strengths of peaks relative to those around them.
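The integer-part definition and the complementary pairs introduced earlier in this subsection are simple enough to sketch directly. The code below is a minimal illustration under the definitions in the text; the function names are ours, and the parent integer 1540 in the example is an arbitrary value chosen for demonstration.

```python
def integer_part(mass: float) -> int:
    """Closest integer to 0.9995 * mass, which removes the average mass defect."""
    return round(0.9995 * mass)


def complementary_pairs(pairs, parent_int):
    """Map each lookup pair (i, j) to its complement (M - j, M - i)."""
    return [(parent_int - j, parent_int - i) for (i, j) in pairs]


if __name__ == "__main__":
    # Example from the text: a peptide of mass 1814.1 has integer part 1813.
    print(integer_part(1814.1))                      # 1813
    # Hypothetical parent integer M = 1540 with one lookup pair.
    print(complementary_pairs([(500, 615)], 1540))   # [(925, 1040)]
```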
We identify likely isotope series (evenly spaced peaks differing by 0.5 or 1 Da with intensities in reasonable agreement with natural isotope abundances) and boost the intensities of the monoisotopic (+0) peaks and downgrade the intensities of the higher isotope peaks. Similarly, we boost the intensities of peaks accompanied by likely water loss peaks and downgrade the intensities of the likely water loss peaks. We then propagate intensity around the spectrum: a strong peak boosts the intensity of each weaker peak (bringing its intensity up to the average of the two) that differs from it by a residue mass, within the instrument tolerance. The propagation is a simple way to enhance spectral ladders, without fixing the ladder length or restricting attention to any one sequence of peaks. The propagation iterates for two cycles.

We then compute all pairs (i, j), for which the corresponding peaks differ by a residue mass and are both among the top 200 peaks by (modified) intensity. We sort the list of pairs (there are on the order of 100 pairs at this point) by min{I(i), I(j)}, where I( ) denotes modified peak intensity. Finally, we pick pairs from the top (higher intensity) of the list on down, but imposing a limit: we use only the top two pairs for each integer in each position. If we have already chosen (500, 613) and (500, 615), we cannot also choose (500, 599), but we can choose (413, 500). If the total intensity of pairs falls below a threshold, the spectrum is rejected as a noise spectrum; thus ByOnic contains a built-in quality filter (19).

Finding Candidates. ByOnic can use either peak pairs or individual peaks to look up candidates. We measured the accuracy of lookup peaks on a set of well-identified ion trap spectra from the Open Proteomics Database (20), accession opd00006_ECOLI. On every one of 305 spectra, at least one of the lookup peaks was correct, that is, matched either a b- or y-ion (see Supporting Information, Table 1). Two of the 305 spectra had only a single correct lookup peak, and 3 had only two correct lookup peaks, meaning that 300 out of 305, or 98.4%, had 3 or more lookup peaks correct. A total of 298 out of 305 spectra, or 97.7%, had at least one (i, j) pair correct, meaning that the pair matched successive b- or y-ions. By contrast, PepNovo (21), probably the best generator of three-letter tags, generated at least one correct tag on 275 out of the 305 spectra (90.2%), when generating 20 three-letter tags for each spectrum. (In assessing PepNovo, we counted I and L as interchangeable, and similarly K and Q.) This result is not surprising, as some of the training set spectra (for example, those from proline-rich peptides) do not have significant peaks for four successive cleavages.

In a linear scan (22) of the protein database, it is straightforward to compute the integer parts of the b- and y-ions of all peptides with mass falling within the parent ion mass window and then compute the number of matches to the lookup peaks. With careful implementation, the running time is almost the same as the time for a scan that checks only the parent ion mass, as the bottleneck on most computers will be moving the database and candidates through the cache memory, rather than the arithmetic operations. On a Sun-Fire v440, each linear scan of an 80-Mbyte database takes ∼1 s.
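The match-counting step of the linear scan can be sketched as follows. This is a minimal illustration, not ByOnic's implementation: it assumes singly charged b- and y-ions, standard monoisotopic residue masses, and the integer-part convention defined above, and all names (`fragment_integers`, `count_matches`, `filter_candidates`) are hypothetical. The exact ion conventions used by ByOnic are given in the Supporting Information and may differ.

```python
# Monoisotopic residue masses in Da (standard values).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # added to both ion types (assumption: charge +1)
WATER = 18.01056   # y-ions carry an extra water


def integer_part(mass: float) -> int:
    """Closest integer to 0.9995 * mass, as defined in the text."""
    return round(0.9995 * mass)


def fragment_integers(peptide: str) -> set:
    """Integer parts of all singly charged b- and y-ions of a peptide."""
    masses = [MONO[aa] for aa in peptide]
    total = sum(masses)
    ions = set()
    prefix = 0.0
    for m in masses[:-1]:                      # one cleavage after each residue
        prefix += m
        ions.add(integer_part(prefix + PROTON))                  # b-ion
        ions.add(integer_part(total - prefix + WATER + PROTON))  # y-ion
    return ions


def count_matches(peptide: str, lookup_peaks) -> int:
    """Number of lookup peaks explained by a b- or y-ion of the peptide."""
    return len(fragment_integers(peptide) & set(lookup_peaks))


def filter_candidates(peptides, lookup_peaks, min_matches: int):
    """Keep peptides that match at least min_matches lookup peaks."""
    return [p for p in peptides if count_matches(p, lookup_peaks) >= min_matches]


if __name__ == "__main__":
    print(sorted(fragment_integers("GAS")))                  # [58, 106, 129, 177]
    print(filter_candidates(["GAS", "GAG"], [129, 177], 2))  # ['GAS']
```

A real scan would also restrict peptides to the parent ion mass window before counting matches; that step is omitted here for brevity.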
Faster performance can be achieved by batching a number of lookups per scan or, for tryptic digests, by building a database containing only the peptides of interest. The latter approach reduces the lookup time for fully tryptic peptides to 0.08 s and for semitryptic peptides to 0.62 s/spectrum, with a parent mass uncertainty of ±3 Da.

On a tryptic digest, a reasonable level of filtering can be achieved by requiring a single lookup-peak match for database peptides with two tryptic termini, two matches for peptides with one tryptic terminus, and three matches for arbitrary peptides. Using the lookup peaks from 14 lookup pairs, this set of requirements filters the database peptides falling within the parent mass window by factors of about 3, 16, and 150 for tryptic, semitryptic, and nontryptic peptides (giving an aggregate factor of ∼80) and reduces the scoring time for a search without modifications to a fraction of the lookup time. The empirically measured loss rate (would-be top-scoring peptides that did not make it past the filter) is less than 1% for unmodified peptides.

On a nonspecific digest, a reasonable level of filtering can be achieved by requiring three individual peak matches or one full pair for any peptide, which results in a filtering factor of ∼40, and an empirically measured loss rate of less than 2%. Requiring two pairs or four individual peaks out of 20 lookup pairs gives a factor of ∼200 and a loss rate of ∼3%. (See Figure 1 in the Supporting Information for a graph of loss rate versus filtering factor.)

(19) Bern, M.; Goldberg, D.; McDonald, W. H.; Yates, J. R., 3rd. Bioinformatics 2004, 20 (Suppl 1), I49-I54.
(20) Prince, J. T.; Carlson, M. W.; Wang, R.; Lu, P.; Marcotte, E. M. Nat. Biotechnol. 2004, 22, 471-472.
(21) Frank, A.; Tanner, S.; Bafna, V.; Pevzner, P. J. Proteome Res. 2005, 4, 1287-1295.
(22) Edwards, N.; Lippert, R. Second International Workshop on Algorithms in Bioinformatics; Springer-Verlag: New York, 2002; pp 68-81.

Scoring. We couple our novel lookup method with a scorer that is both sensitive and flexible. The scorer is an empirically developed “dot-product” scorer, similar to ones previously used for de novo sequencing (23, 24). Like the module that computes the lookup peaks, the scorer normalizes the raw intensities by a measure of local intensity, and downgrades or upgrades the intensities depending upon the appearance of isotope peaks. It further modifies the peak intensities to reflect peak ranks (tallest, second tallest, and so forth), as shown in Figure 2 in the Supporting Information. The result of this preprocessing is a modified spectrum typically containing 20-300 peaks with real-valued intensities running from about 0 to 30.0. Then for each candidate peptide, the scorer computes a spectrum of theoretical peaks and expected intensities. The theoretical spectrum includes not only b- and y-ions but also a-ions, doubly charged y-ions, and water and ammonia losses. The score is a sum of “benefits” for theoretical peaks found in the (modified) observed spectrum and “penalties” for theoretical peaks not found. Each benefit is a product of three terms: (1) the peak intensity in the observed spectrum, (2) the expected intensity in the theoretical spectrum, and (3) a weight reflecting the closeness of the mass match. For (2), we use a simplified version of published statistics (25, 26) that models peak intensity as a function of cleavage position and flanking residues.
For (3), we use a smooth function of the mass error that approximates the distribution of mass errors in well-identified spectra, rather than a fixed cutoff (square window) as used by SEQUEST or SILVER (27, 28). The smooth function drops to zero (no match) at a user-settable tolerance, with a default value of 0.4 Da for ion trap instruments. For time-of-flight (TOF) instruments, ByOnic recalibrates the m/z measurements of the modified spectrum based upon tentative matches of observed and theoretical peaks; for this step, ByOnic uses robust linear regression as described previously (24). After recalibration, ByOnic uses a default tolerance of 0.08 Da for QTOF. The penalty for a theoretical peak not found is simply a constant C times its expected intensity; experiments showed that C = 2 worked well and that any C with 1 < C < 3 was almost equally good. The potential benefit of an observed peak is up to ∼10 times the penalty for missing the peak, but only intense peaks with perfect accuracy contribute such large benefits.

Each improvement, starting from a baseline SEQUEST-like scorer, was checked empirically on the 305 OPD spectra and on 1124 spectra from data set 1 described below. For example, if we use a fixed cutoff for peak matching rather than the smooth function, we lose ∼9% of the valid identifications. If we score only b- and y-ions, and ignore a-ions, neutral losses, and doubly charged y-ions, we lose another 7% of the valid identifications.

(23) Ma, B.; Zhang, K.; Hendrie, C.; Liang, C.; Li, M.; Doherty-Kirby, A.; Lajoie, G. Rapid Commun. Mass Spectrom. 2003, 17, 2337-2342.
(24) Bern, M.; Goldberg, D. J. Comput. Biol. 2006, 13, 364-378.
(25) Havilio, M.; Haddad, Y.; Smilansky, Z. Anal. Chem. 2003, 75, 435-444.
(26) Tabb, D. L.; Smith, L. L.; Breci, L. A.; Wysocki, V. H.; Lin, D.; Yates, J. R., 3rd. Anal. Chem. 2003, 75, 1155-1163.
(27) Elias, J. E.; Gibbons, F. D.; King, O. D.; Roth, F. P.; Gygi, S. P. Nat. Biotechnol. 2004, 22, 214-219.
(28) Gibbons, F. D.; Elias, J. E.; Gygi, S. P.; Roth, F. P. J. Am. Soc. Mass Spectrom. 2004, 15, 910-912.

ByOnic provides a menu of some 50 variable modifications (Table 2 in the Supporting Information) along with the two “fixed” cysteine modifications, carbamidomethylation (+57) and carboxymethylation (+58). ByOnic assumes there will be at most one variable modification per peptide, with the exceptions of the following common modifications: hydroxylated proline (hydroxyproline), for which it allows at most four; oxidized methionine and deamidated N and Q, for which it allows at most two; and the phosphorylation modifications, S[+80], T[+80], Y[+80], S[-18], and T[-18], for which it allows any combination of at most three. ByOnic also supports mutation search (any one-letter change) and “blind” modification search (any integer change to any one residue), which is also supported by several recently developed, more specialized tools (18, 29). Scoring with modifications enabled is reasonably fast (with the same level of lookup filtering, ∼2-10 times slower than no-modification scoring, and typically faster than the candidate lookup), but scoring mutations and blind modifications is slow (up to 500 times slower than no-modification scoring), and hence, these modes are best used with either stringent filtering or a small database.

Second Identifications. ByOnic optionally removes all the peaks explained by the top-scoring candidate and writes out a “knockout spectrum” that can be sent back through ByOnic (both candidate lookup and scoring) in order to make a second (or third or fourth) identification, a coeluting peptide with close m/z, whose score was overshadowed by the first peptide. This second-identification capability means that ByOnic, like ProbIDTree (30), can handle multiplexed spectra, such as those produced (31) by “data-independent CID”. For the peak removal step, we do use a square window for peak matching, because we remove peaks entirely rather than just downgrading them.
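The benefit/penalty score and the knockout step described in this section can be sketched together. This is an illustration under stated assumptions, not ByOnic's scorer: the text does not give the form of the smooth weight, so `match_weight` below is a hypothetical bell-shaped function that merely falls to zero at the tolerance, and the expected intensities are supplied by the caller rather than modeled from cleavage position and flanking residues. All function names are ours.

```python
def match_weight(error_da: float, tol: float = 0.4) -> float:
    """Hypothetical smooth weight: 1 at zero error, falling to 0 at the tolerance."""
    x = abs(error_da) / tol
    return (1.0 - x * x) ** 2 if x < 1.0 else 0.0


def score(observed, theoretical, tol: float = 0.4, penalty_c: float = 2.0) -> float:
    """Sum of benefits for matched theoretical peaks minus penalties for missed ones.

    observed:    list of (m/z, modified intensity)
    theoretical: list of (m/z, expected intensity)
    """
    total = 0.0
    for t_mz, expected in theoretical:
        # Benefit = observed intensity * expected intensity * closeness weight,
        # taking the best-matching observed peak.
        best = max(
            (inten * expected * match_weight(o_mz - t_mz, tol)
             for o_mz, inten in observed),
            default=0.0,
        )
        if best > 0.0:
            total += best
        else:
            total -= penalty_c * expected   # penalty: C times expected intensity
    return total


def knockout(observed, theoretical, tol: float = 0.4):
    """Remove observed peaks inside a square window around any theoretical peak."""
    return [(mz, inten) for mz, inten in observed
            if all(abs(mz - t_mz) > tol for t_mz, _ in theoretical)]


if __name__ == "__main__":
    obs = [(500.0, 10.0), (700.0, 5.0)]
    theo = [(500.0, 1.0), (600.0, 0.5)]
    print(score(obs, theo))     # 10*1*1 - 2*0.5 = 9.0
    print(knockout(obs, theo))  # [(700.0, 5.0)]
```

The knockout function uses a hard window, matching the text's remark that peak removal uses a square window, whereas scoring uses the smooth weight.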
RESULTS AND DISCUSSION

We compared the performance of ByOnic to SEQUEST, Mascot (version 2.1.03, November 2005), and X!Tandem (version 2006.04.01) on three protein data sets: (1) a well-known data set (17) from the ISB, containing 18 940 (37 044 spectrum files, counting +2 and +3 parent charge assignments separately) LC-ESI ion trap spectra (Thermo Finnigan LCQ instrument) of trypsin-digested reference proteins, (2) a data set from PPD, Inc. (Menlo Park, CA), containing 1200 capillary-LC-ESI-QTOF spectra of trypsin-digested human blood plasma, and (3) a data set from PPD, Inc., containing 85 948 LC-ESI ion trap spectra (Thermo Electron LTQ) of mouse blood plasma, spiked with known concentrations of soluble human proteins. Data sets 2 and 3 were depleted of very abundant proteins (serum albumin, immunoglobulins α and γ, α-1 antitrypsin, transferrin, haptoglobin, and fibrinogen) using a multiple affinity removal system (Agilent). For these data sets, the cysteine was carboxymethylated (+58). Data set 1 has no cysteine modification. For data set 3, the human proteins were produced recombinantly and hence were pure, contaminated only with low concentrations of Escherichia coli proteins.

(29) Tsur, D.; Tanner, S.; Zandi, E.; Bafna, V.; Pevzner, P. A. Nat. Biotechnol. 2005, 23, 1562-1567.
(30) Zhang, N.; Li, X. J.; Ye, M.; Pan, S.; Schwikowski, B.; Aebersold, R. Proteomics 2005, 5, 4096-4106.
(31) Venable, J. D.; Dong, M. Q.; Wohlschlegel, J.; Dillin, A.; Yates, J. R. Nat. Methods 2004, 1, 39-45.

Figure 1. Listing of MS/MS peptide identification tools on a range from pure database search to full de novo sequencing. Pure database search can make identifications from lower-quality spectra, but tends to be slow and brittle. Our new identification tool, ByOnic, uses only a small amount of de novo analysis and, thus, retains the sensitivity of pure database search, while adding speed and error tolerance.

Figure 2. Comparison of three peptide identification programs on the Institute for Systems Biology data set. To plot ROC curves for SEQUEST, we fixed the Delta threshold (at 0 and 0.08) and varied the Xcorr threshold (with +2 and +3 spectra requiring 0.5 and 1.0 greater Xcorr than +1). For example, Xcorr thresholds of 1.5, 2.0, and 2.5 along with a Delta threshold of 0.08 gave sensitivity of 0.11 and precision of 0.75. For X!Tandem and ByOnic, we simply varied the score thresholds. “ByOnic Mods” shows ByOnic’s results when run with ∼40 modifications enabled; its maximum sensitivity is 0.204 (3863 valid identifications out of 18 940 spectra).

Data set 1 was chosen because it is a de facto standard, with published results from SEQUEST (17), InsPecT (15), and blind modification searching (29). Data set 2 provides a realistic test on data from a very different type of instrument, and data set 3 provides a realistic test on a spiked sample, thereby enabling assessment of false negative as well as false positive rate. We remark that because blood plasma contains a large number of well-confirmed low- to medium-abundance proteins (32, 33) it makes an excellent sample for technology development.

(32) Adkins, J. N.; Monroe, M. E.; Auberry, K. J.; Shen, Y.; Jacobs, J. M.; Camp, D. G., 2nd; Vitzthum, F.; Rodland, K. D.; Zangar, R. C.; Smith, R. D.; Pounds, J. G. Proteomics 2005, 5, 3454-3466.
(33) States, D. J.; Omenn, G. S.; Blackwell, T. W.; Fermin, D.; Eng, J.; Speicher, D. W.; Hanash, S. M. Nat. Biotechnol. 2006, 24, 333-338.

Reference Mixture. The ISB data set contains fairly pure reference proteins with known concentrations in two different mixtures. Figure 2 gives results on this data set, comparing ByOnic to SEQUEST and X!Tandem. We used the SEQUEST results in the ISB download and ran ByOnic and X!Tandem with the ISB-supplied 88-Mbyte protein database, which includes human sequences along with the proteins in the mixture. All three tools searched semitryptic and nontryptic peptides as well as tryptic peptides. ByOnic was configured to require one, two, and three lookup peaks for tryptic, semitryptic, and nontryptic peptides, giving a filtering factor of ∼80 as discussed above. We used ISB’s definition of a valid identification: a spectrum whose best “match” (top-scoring identification, regardless of score) is to a peptide from 1 of the 18 proteins deliberately placed in the sample, or to keratin.

On this data set, ByOnic proved more sensitive than the other tools, but was slightly less precise than X!Tandem (more false positives) when identifications were filtered very stringently. (However, some of ByOnic’s false positives, for example, actin, are plausible contaminants.) SEQUEST was both less sensitive and less precise than the other two tools. ISB’s “mixture A” contains rabbit myosin at 2 nM concentration. SEQUEST completely missed this protein, but X!Tandem matched one spectrum to it and ByOnic matched two. X!Tandem’s ROC curve shows a slight S-shape, meaning that many valid identifications had poor scores (“E-values”) and that among the poor-scoring identifications, those with E-value greater than 1.0, score is not a good predictor of validity.

Figure 3. Comparison of three peptide identification programs on a data set of 1200 QTOF spectra of blood plasma. All three tools were run with 10 common modifications enabled (hydroxyproline, oxidized M, H, and W, deamidated N and Q, pyro-glu from N-terminal Q and E, and acetylated K and N-terminus). A Mascot score threshold of 30 gives sensitivity 0.55 and precision 0.92, and a more stringent threshold of 40 gives sensitivity 0.42 and precision 0.996. ByOnic’s second-identification feature doubled the total number of identifications, thereby improving sensitivity but decreasing precision.

ByOnic gives 3420 valid identifications (top-scoring identifications to the reference proteins plus keratin) when run without modifications. ByOnic loses 124 of SEQUEST’s 2756 identifications, given in the ISB download, primarily due to differences in scoring rather than lookup failure. Sequence tagging, as embodied in InsPecT, loses a much larger number, 488, primarily due to inaccurate sequence tags (15).

ByOnic gives 3863 valid identifications when run with modifications enabled, including 821 valid identifications with modifications. Notice that the increase in the number of valid identifications from 3420 to 3863 is substantially less than the number of valid identifications with modifications. There are two reasons for this discrepancy. First, modified peptides, especially ones with small modifications such as deamidation (which adds only 0.98 Da), often match the correct protein even in no-modification mode. Second, ∼4% of the valid unmodified identifications are lost when ByOnic is run with many modifications enabled, due to the increase in the number of possible candidates. A comparative study (7) found similar sensitivity losses in SEQUEST and Mascot (allowing only two modifications) and also found SEQUEST to be more sensitive than Mascot. To avoid the 4% loss, ByOnic can be run in both no-modification and modification modes, with the preferred identification chosen by protein rank, rather than by score alone. This approach finds 4003 valid identifications on the ISB data set.
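The identification counts quoted for the ISB data set can be cross-checked with simple arithmetic. The snippet below uses only numbers stated in this paper; the variable and function names are ours.

```python
# Counts quoted in the text for the ISB data set.
total_spectra = 18_940
sequest_ids = 2756
inspect_ids = sequest_ids - 488 + 947   # InsPecT: loses 488, adds 947
byonic_nomod_ids = 3420
byonic_mod_ids = 3863
byonic_modified_only = 821              # valid identifications carrying a modification


def pct_gain(new: int, old: int) -> float:
    """Relative improvement of new over old, in percent."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    print(inspect_ids)                                       # 3215
    print(round(pct_gain(byonic_nomod_ids, sequest_ids)))    # 24 (% over SEQUEST)
    print(round(pct_gain(byonic_mod_ids, inspect_ids)))      # 20 (% over InsPecT)
    print(round(byonic_mod_ids / total_spectra, 3))          # 0.204 maximum sensitivity
    # Net gain from enabling modifications is smaller than the 821 modified hits.
    print(byonic_mod_ids - byonic_nomod_ids)                 # 443
```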

Analytical Chemistry, Vol. 79, No. 4, February 15, 2007

As reported previously,15 the ISB data set contains a substantial number of modified peptides, even some unusual modifications such as oxidized tyrosine, Y[+16], and dehydrated aspartic acid, D[-18]. Like InsPecT,15 ByOnic found the following prevalent modifications: deamidated N (142 peptides), deamidated Q (135), sodiated D (70), oxidized M (66), oxidized Y (65), dehydrated D (65), disulfide bridge (60) (which appears as C[-2] in samples without cysteine modification), sodiated E (56), and pyro-glu from Q (17). ByOnic also found hydroxyprolines in bovine β-casein, with several spectra matching the semitryptic peptide P[+16]LP[+16]PTVMFPPQSVLSLSQSK. Finally, ByOnic matched a number of spectra to two valid peptides, for example, 12 spectra containing both AQQPDGLAVVGVFLK from bovine carbonic anhydrase and VYVEELKPTPEGDLEILLQK from bovine β-lactoglobulin. These identifications were found accidentally, meaning that the same spectrum matched different valid proteins with +2 and +3 parent charge assignments and were then confirmed by ByOnic’s second-identification search. For fairness, we counted these spectra only once each in Figure 2.

Table 1. Number of Spiked Proteins Detected for Each of Mascot, X!Tandem, and ByOnic (a)

               Mascot                    X!Tandem                  ByOnic
spike concn    ≥2 spectra  ≥1 spectrum   ≥2 spectra  ≥1 spectrum   ≥2 spectra  ≥1 spectrum
100 ng/mL      0, 0, 0     2, 1, 1       0, 0, 0     1, 1, 3       1, 1, 0     3, 3, 2
1 µg/mL        7, 6, 5     10, 7, 8      8, 7, 8     11, 10, 9     10, 8, 8    12, 10, 10
10 µg/mL       11, 11, 10  11, 11, 11    11, 11, 11  12, 12, 11    13, 13, 13  13, 13, 13

(a) All three tools were run on fully tryptic peptides from similar databases with the same 10 common modifications enabled. For each tool, the left column gives the number of spiked proteins matched to at least two spectra for each of three replicates, and the right column gives the number matched to at least one spectrum. Thus, ByOnic found 10 of the spiked proteins with at least two spectra and 12 with at least one spectrum on the first 1-µg replicate.

(34) Anderson, N. L.; Polanski, M.; Pieper, R.; Gatlin, T.; Tirumalai, R. S.; Conrads, T. P.; Veenstra, T. D.; Adkins, J. N.; Pounds, J. G.; Fagan, R.; Lobley, A. Mol. Cell. Proteomics 2004, 3, 311-326.

QTOF Spectra of Human Blood Plasma. Data set 2 contains QTOF spectra of human blood plasma, a sample material of high importance for biomarker discovery. Figure 3 gives Mascot, ByOnic, and X!Tandem results for data set 2. For this data set, we ran all three tools with common modifications (oxidation, deamidation, and pyro-glu) enabled, and we defined a valid identification to be a match to one of the 59 abundant plasma proteins34 that were found by at least two tools (Table 3 in the Supporting Information). Mascot searched tryptic and semitryptic peptides (first tryptic and then semitryptic on as-yet-unidentified spectra) from a 40-Mbyte human protein database, but ByOnic and X!Tandem, which are faster than Mascot, used the 88-Mbyte ISB database and searched for nontryptic peptides as well. Nontryptic peptides, however, contributed less than 3% of ByOnic’s valid identifications, barely balancing the loss of sensitivity due to the increase in candidates. Because QTOF parent ion masses are reliable, ByOnic used both (i, j) and complementary (M-j, M-i) pairs for lookup, requiring one, two, and three lookup peak matches as above. The use of complementary pairs made no difference for unmodified matches but increased the number of valid modification matches by ∼5%.

On this data set, ByOnic was again more sensitive than the other two tools. ByOnic’s use of m/z recalibration explains some of its advantage, as ByOnic without recalibration loses ∼5% of its valid identifications, halving the gap between its ROC curve and X!Tandem’s. ByOnic also appears to be more sensitive than the other tools at the protein level. ByOnic found all 59 valid proteins, and moreover, it found complement C9, N-acetylmuramoyl-L-alanine amidase, apolipoprotein M, and fibulin, four known plasma proteins,34 each with multiple spectra. Of these, X!Tandem missed the amidase, Apo M, and fibulin, and Mascot missed C9 and fibulin. (Hence fibulin was not counted among the 59 valid proteins.)

X!Tandem again shows the highest precision, meaning that it gives the fewest false positives, when filtering so stringently that only 50-70% of the true positives are accepted. X!Tandem uses a first pass to identify the proteins in the sample from unmodified, tryptic peptides, and a second pass using the protein database generated in the first pass to identify semitryptic and modified peptides. This multipass technique provides a large speed-up, especially for data sets, such as this QTOF set, for which the number of spectra is much smaller than the number of proteins in the database.
The multipass technique can also improve both sensitivity and precision at the peptide level by focusing the database on the proteins most likely to be in the sample. We note that lookup-peak filtering is compatible with multipass search and indeed with other methods for focusing the database, such as offline curation.

The QTOF data set is rich in peptides, so we also processed the spectra using ByOnic’s second-identification feature. This more thorough analysis indeed finds a greater number of valid identifications, but most knockout spectra do not contain peptides, so precision decreases slightly. We would argue, however, that the most important points on the ROC curves are the right end points (the maximum sensitivities achievable), because false positives can be screened much more effectively by a peptides-to-proteins program such as ProteinProphet35 than by score thresholding. On the QTOF data set, second identifications matched lower-abundance proteins disproportionately often and helped lift three known plasma proteins (ficolin, transthyretin, and complement factor I) from “one-hit wonders” with single spectral matches to more solidly identified two-hit proteins.

We also ran the QTOF spectra with all modifications enabled, and in blind modification mode. (These results are not shown in Figure 3.) The QTOF data set, like most plasma samples, does not appear to be heavily modified. The most common modifications in the QTOF data set were deamidation (35 spectra), pyro-glu from Q (16), and oxidized methionine (9). The blind modification search added 27 more matches to the top plasma proteins, less than a 3% increase. The following integer shifts each appeared at least three times: S[+28] (possibly formylation), +3 and +4 near the N-terminus (probably misassigned parent masses), and Q[+22] (sodiation on an atypical residue).

Spiked Mouse Blood Plasma. Although data sets 1 and 2 offer some evidence of ByOnic’s greater sensitivity at the protein level, data set 3, the spiked mouse plasma, provides the best test bed for assessing protein sensitivity. This data set includes nine separate runs through the LC-ESI-LTQ setup, three replicates each of three different samples. Each run contains ∼10 000 spectra, each of which was searched with +1, +2, and +3 parent charge assignments against the tryptic peptides (allowing any number of missed cleavages) from essentially equivalent 42-Mbyte databases. (The only difference in the databases was that Mascot’s database included other human sequences as decoys, whereas ByOnic and X!Tandem used reversed mouse sequences.) We considered only fully tryptic peptides (due to Mascot’s limitations), and we configured ByOnic to require two lookup peak matches, for a filtering factor of ∼16.
Common modifications (oxidation, deamidation, and pyro-glu) were again enabled. The three samples all contain the same 13 soluble human proteins (Table 4 in the Supporting Information) spiked into mouse plasma, but at varying concentrations: 100 ng, 1 µg, and 10 µg per milliliter of plasma. One of the spiked proteins, human apolipoprotein A-I, is an abundant plasma protein, and hence, its detection presents a “homologue problem”. We counted only unambiguous spectra, that is, spectra that matched peptides of human Apo A-I that do not appear in mouse Apo A-I. As above, we defined a spectrum’s “match” to be its top-scoring identification, regardless of score. This definition introduces a small amount of noise into the results. Each run includes ∼10 000 spectra, and the 13 spiked proteins constitute ∼0.011% of the protein database; thus we expect 1.1 random matches to spiked proteins in each sample. (The 1.1 is actually an overestimate, because not all spectra find matches.) Two matches to the same spiked protein, however, remains a fairly rare chance occurrence.

As shown in Table 1, ByOnic outperformed X!Tandem and X!Tandem outperformed Mascot in this experiment. Mascot not only found fewer spiked proteins but matched fewer spectra to the ones it did find. For example, on the three 1-µg replicates, Mascot matched 24, 23, and 18 spectra to the spiked proteins; X!Tandem matched 33, 30, and 28; and ByOnic matched 35, 36, and 27. For all three tools, some spiked proteins were detected only in weak, low-scoring spectra. For example, if we filter Mascot’s peptide identifications by discarding all identifications with scores below 30.0, then on the 1-µg replicates, Mascot finds 4, 5, and 5 spiked proteins with two or more spectra, instead of 7, 6, and 5, and it finds 9, 6, and 8 with one or more spectra, instead of 10, 7, and 8.

(35) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. Anal. Chem. 2003, 75, 4646-4658.

Finally, we performed one test of ByOnic versus itself, in order to measure the speed and sensitivity tradeoff in a realistic setting. We searched one run of spiked mouse plasma (9540 spectra) with +1, +2, and +3 parent charge assignments against tryptic and semitryptic peptides, with common modifications enabled. When ByOnic required no lookup peak matches (no filtering at all), it took 5421 min to run the data and made 3149 valid identifications (matches to the top 100 proteins), with 569 of these matches to modified peptides. When ByOnic required two lookup peak matches for tryptic peptides and three for semitryptic peptides, it took 580 min and made 3116 valid identifications, with 531 of these matches to modified peptides. Lookup peaks gave a speedup of 9.35 with a sensitivity loss of 1.05%, with virtually all lookup peak failures occurring on modified peptides. (We did not use complementary pairs.) The filtering factor was about 16 on tryptic peptides and 40 on semitryptic peptides, but the time for the database scans now dominates, so the speedup was not as great as the filtering factor.
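The chance-match estimate and the speed test in this section rest on simple arithmetic, which the following sketch reproduces from the figures quoted in the text. The Poisson model for random hits, and the assumption that such hits are spread evenly over the 13 spiked proteins, are simplifications introduced here for illustration and are not stated in the paper.

```python
import math

# Figures quoted in the text for the spiked mouse plasma runs.
SPECTRA_PER_RUN = 10_000
SPIKED_FRACTION = 0.011 / 100   # spiked proteins are ~0.011% of the database
N_SPIKED = 13

# Expected number of random top-scoring matches landing on a spiked protein.
expected_random = SPECTRA_PER_RUN * SPIKED_FRACTION

# Assumption (not from the paper): random hits are Poisson distributed and
# spread evenly over the 13 spiked proteins.
lam = expected_random / N_SPIKED
p_two_or_more = 1.0 - math.exp(-lam) * (1.0 + lam)        # one given protein
p_any_two_or_more = 1.0 - (1.0 - p_two_or_more) ** N_SPIKED

# Speed and sensitivity trade-off for the ByOnic-versus-itself test.
speedup = 5421 / 580                        # minutes without / with filtering
sensitivity_loss = (3149 - 3116) / 3149     # fraction of valid identifications lost

print(f"expected random matches per run: {expected_random:.2f}")
print(f"P(>=2 random matches to one given spiked protein): {p_two_or_more:.4f}")
print(f"P(>=2 random matches to any spiked protein): {p_any_two_or_more:.3f}")
print(f"speedup: {speedup:.2f}, sensitivity loss: {sensitivity_loss:.2%}")
```

Under these assumptions, two chance matches to the same spiked protein occur in only a few percent of runs, which is why the two-spectrum columns of Table 1 are the more trustworthy ones.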

CONCLUSION We have demonstrated identification sensitivity superior to SEQUEST, Mascot, and X!Tandem at both the peptide and protein levels. Other groups have also demonstrated improvements at the peptide level using fragmentation statistics,25,26 machine learning,27 multitool consensus,36 and “peptide-centric” databases.37 Sophisticated scorers using techniques such as Hidden Markov Models, however, are typically slower than simple scorers such as our dot-product algorithm, and hence, the benefits of scoring advances may be limited to small databases37 or to rescoring simple-scorer hits,27 unless coupled with a method for finding the most promising candidate peptides in the database. Two such methods have been described previously: sequence tagging,16 which filters the database at the peptide level, and the multipass method used by X!Tandem,9 which filters the database just once for an entire run of spectra, at the protein level.

In this paper, we have proposed a third method, lookup peaks, for reducing the number of candidate peptides. This method requires fewer peaks and less sequential fragmentation than sequence tagging and, hence, can identify lower quality spectra. We have demonstrated that the lookup-peak method loses sufficiently few correct candidates that our overall system (candidate finding and scoring) is more sensitive than the most popular database-search programs, capable of finding low-abundance proteins in blood plasma that would be missed by the other tools. Because sensitivity, especially at the protein level, is often the single most important performance metric for an identification technology, we believe that the lookup-peak method is the first hybrid of database search and de novo sequencing that can replace pure database search.

SUPPORTING INFORMATION AVAILABLE Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

(36) Resing, K. A.; Meyer-Arendt, K.; Mendoza, A. M.; Aveline-Wolf, L. D.; Jonscher, K. R.; Pierce, K. G.; Old, W. M.; Cheung, H. T.; Russell, S.; Wattawa, J. L.; Goehle, G. R.; Knight, R. D.; Ahn, N. G. Anal. Chem. 2004, 76, 3556-3568.
(37) Yen, C. Y.; Russell, S.; Mendoza, A. M.; Meyer-Arendt, K.; Sun, S.; Cios, K. J.; Ahn, N. G.; Resing, K. A. Anal. Chem. 2006, 78, 1071-1084.

ACKNOWLEDGMENT We thank Hua Lin, Thomas Shaler, Ted Jones, and Christopher Becker of PPD, Inc. in Menlo Park for supplying us with valuable data and advice; John R. Yates, III of The Scripps Research Institute for motivating this work; and Ari Frank of UC San Diego for help with PepNovo.

Received for review September 8, 2006. Accepted December 14, 2006.

AC0617013