Anal. Chem. 2005, 77, 7581-7593
MASPIC: Intensity-Based Tandem Mass Spectrometry Scoring Scheme That Improves Peptide Identification at High Confidence Chandrasegaran Narasimhan,*,†,‡ David L. Tabb,‡,# Nathan C. VerBerkmoes,§ Melissa R. Thompson,†,§ Robert L. Hettich,§ and Edward C. Uberbacher‡
Graduate School of Genome Science and Technology, University of Tennessee-Oak Ridge National Laboratory, 1060 Commerce Park, Oak Ridge, Tennessee 37830-8026, and Organic and Biological Mass Spectrometry, Chemical Sciences Division, and Genome Analysis and Systems Modeling, Life Sciences Division, Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, Tennessee 37831-6131
Algorithmic search engines bridge the gap between large tandem mass spectrometry data sets and the identification of proteins associated with biological samples. Improvements in these tools can greatly enhance biological discovery. We present a new scoring scheme for comparing tandem mass spectra with a protein sequence database. The MASPIC (Multinomial Algorithm for Spectral Profile-based Intensity Comparison) scorer converts an experimental tandem mass spectrum into a m/z profile of probability and then scores peak lists from potential candidate peptides using a multinomial distribution model. The MASPIC scoring scheme incorporates intensity, spectral peak density variations, and m/z error distribution associated with peak matches into a multinomial distribution. The scoring scheme was validated on two standard protein mixtures and an additional set of spectra collected on a complex ribosomal protein mixture from Rhodopseudomonas palustris. The results indicate a 5-15% improvement over Sequest for high-confidence identifications. The performance gap grows as sequence database size increases. Additional tests on spectra from proteinase-K digest data showed similar performance improvements demonstrating the advantages in using MASPIC for studying proteins digested with less specific proteases. All these investigations show MASPIC to be a versatile and reliable system for peptide tandem mass spectral identification. In the postgenome era, analyzing and understanding the proteome has become an increasingly important research priority. While RNA expression measurements have great utility,1 direct measurements of the proteome can capture the actual state of proteins. Protein function itself can be affected at multiple levels * Corresponding author: (phone) 865 574-2771; (fax) 865 576-5332; (e-mail)
[email protected]. † Graduate School of Genome Science and Technology, University of Tennessee-Oak Ridge National Laboratory. ‡ Life Sciences Division, Oak Ridge National Laboratory. # Current location: Department of Biomedical Informatics, Vanderbilt University Medical Center. § Chemical Sciences Division, Oak Ridge National Laboratory. (1) DeRisi, J. L.; Iyer, V. R.; Brown, P. O. Science 1997, 278, 680-686. 10.1021/ac0501745 CCC: $30.25 Published on Web 10/22/2005
© 2005 American Chemical Society
through a variety of mechanisms.2 Knowing under what conditions proteins are expressed, and what their actual state is, provides valuable insight into the dynamics of cell processes.3 Mass spectrometry (MS) has emerged as the standard tool to study proteins in complex mixtures.4,5 In the “bottom up” or “shotgun” proteomics approach, protein mixtures are digested into peptides using a proteolytic enzyme. The masses of these peptides alone can be used by peptide mass mapping to identify proteins present in a simple mixture.6-8 However, tandem mass spectrometry can be used to break peptides into smaller fragments. Because of substantial fragmentation at peptide bonds, peptide tandem mass spectra usually contain a series of b and y ions (along with other types)9 that can be used to infer the amino acid sequence of that peptide. By coupling tandem mass spectrometry to highperformance liquid chromatography (HPLC) separations, it is possible to employ such a strategy to analyze extraordinarily complex mixtures of proteins. Computational approaches have been developed to identify these MS/MS spectra. Because of the vast number of spectra that can be collected using modern mass spectrometers, algorithmic analysis is a crucial bridge between raw mass spectrometry data and peptide/protein identification.10 Identification methods fall into three main categories: de novo sequencing,11-15 database (2) Steen, H.; Mann, M. Nat. Rev. 2004, 5, 699-711. (3) Boguski, M. S.; Mcintosh, M. W. Nature 2003, 422, 233-237. (4) Gygi, S. P.; Rist, B.; Gerber, S. A.; Turecek, F.; Gelb, M. H.; Aebersold, R. Nat. Biotechnol. 1999, 17, 994-999. (5) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50. (6) MOWSE (http://www.hgmp.mrc.ac.uk/Bioinformatics/Webapp/mowse/ mowsedoc.html - 4.6). (7) Eriksson, J.; Chait, B. T.; Fenyo, D. Anal. Chem. 2000, 72, 999-1005. (8) Zhang, W.; Chait, B. T. Anal. Chem. 2000, 72, 2482-2489. (9) Wysocki, V. 
H.; Tsaprailis, G.; Smith, L. L.; Breci, L. A. J. Mass Spectrom. 2000, 35, 1399-406. (10) MacCoss, M. J.; Wu, C. C.; Liu, H.; Sadygov, R.; Yates, J. R., III. Anal. Chem. 2003, 75, 6912-6921. (11) Fernandez-de-Cossio, J.; Gonzalez, J.; Betancourt, L.; Besada, V.; Padron, G.; Shimonishi, Y.; Takao, T. Rapid Commun. Mass Spectrom. 1998, 12, 1867-1878. (12) Dancik, V.; Addona, T. A.; Caiser, L. R.; Vath, J. E.; Pevzner, P. A. J. Comput. Biol. 1999, 6, 327-342. (13) Chen, T.; Kao, M. Y.; Tepel, M.; Rush, J.; Church, G. M. J. Comput. Biol. 2001, 8, 325-337. (14) Taylor, J. A.; Johnson, R. S. Anal. Chem. 2001, 73, 2594-2604.
Analytical Chemistry, Vol. 77, No. 23, December 1, 2005 7581
identifications,16-25 and peptide sequence tagging.26-29 Database identification is a procedure that generates “theoretical spectra” of peptides extracted from a database of protein sequences and compares them to experimental tandem mass spectra. The candidate peptide that generates the highest scoring theoretical spectrum match is reported as the best match to a particular spectrum. Sequest16 and Mascot17 are tools for this type of comparison. Among the three approaches, database identification techniques are currently the most sensitive and reliable and thus the most widely used. In fact, de novo and tagging approaches have often used spectral identifications from Sequest for both training and comparison purposes. These two popular software systems, Sequest and Mascot, use different scoring schemes for identifying peptides. The Mascot scoring method calculates the probability of misassigning a certain number of fragment peaks in a database. Sequest uses a crosscorrelation-based metric for scoring the match between theoretical and experimental spectra. The cross-correlation scoring scheme at least partially combines intensity as well as a fragment m/z match property into a single scoring function. Some methods have attempted to model theoretical spectrum generation30,31 with an aim of improving spectral identifications while others have attempted to improve the number of identifications using a better probability model. Havilio et al.21 used a bipartite model, wherein a log of odds score for occurrence of fragments was explained by certain events (like isotope peaks, water loss etc.) against random chance. The scorer is very close to the one used in the de novo work by Dancik et al.12 Bafna and Edwards19 used an empirical probability distribution of false positive identifications for a spectrum and also specified a procedure by which prior knowledge of fragmentation behavior could be incorporated into the scoring. 
Recently, efforts have been made to model peptide identifications from a database search as theoretical distributions. The work reported by Sadygov and Yates23 models peptide comparison as a hypergeometric distribution. The question sought to be answered by the model is, "Out of all matches of fragments from a database, what is the probability that K peak matches come from a particular peptide, by random chance?” This model works on (15) Lu, B.; Chen, T. J. Comput. Biol. 2003, 10, 1-12. (16) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989. (17) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567. (18) Clauser, K. R.; Baker, P. R.; Burlingame, A. L. Anal. Chem. 1999, 71, 28712882. (19) Bafna, V.; Edwards, N. Bioinformatics 2001, 17, S13-S21. (20) Zhang, N.; Aebersold, R.; Schwikowski, B. Proteomics 2002, 2, 1406-1412. (21) Havilio, M.; Haddad, Y.; Smilansky, Z. Anal. Chem. 2003, 75, 435-444. (22) Fenyo, D.; Beavis, R. C. Anal. Chem. 2003, 75, 768-774. (23) Sadygov, R. G.; Yates, J. R. Anal. Chem. 2003, 75, 3792-3798. (24) Fridman, T.; Razumovskaya, J.; VerBerkmoes, N.; Hurst, G.; Protopopescu, V.; Xu, Y. J. Bioinformatics Comput. Biol. 2005, 3, 455-476. (25) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. J. Proteome Res. 2004, 3, 958-964. (26) Mann, M.; Wilm, M. Anal. Chem. 1994, 66, 4390-4399. (27) Sunyaev, S.; Liska, A. J.; Golod, A.; Shevchenko, A.; Shevchenko, A. Anal. Chem. 2003, 75, 1307-1315. (28) Tabb, D. L.; Saraf, A.; Yates, J. R., III. Anal. Chem. 2003, 75, 6415-6421. (29) Searle, B. C.; Dasari, S.; Turner, M.; Reddy, A. P.; Choi, D.;. Wilmarth, P. A.; McCormack,| A. L.; David, L. L.; Nagalla, S. R. Anal. Chem. 2004, 76, 2220-2230. (30) Schutz, F.; Kapp, E. A.; Simpson, R. J.; Speed, T. P. Biochem. Soc. Trans. 2003, 31, 1479-1483. (31) Elias, J. E.; Gibbons, F. D.; King, O. D.; Roth, F. P.; Gygi, S. 
P. Nat. Biotechnol. 2004, 22, 214-219.
the principle that a true identification of a spectrum is likely to contain more matches than false identifications. Another scorer from Fridman et al.24 also models spectral comparison hypergeometrically but also incorporates properties of the experimental spectrum, such as the number of peaks observed. Neither of these two probability-based methods incorporates intensities into the scoring scheme. Scoring may also incorporate the peak density variations within a spectrum. ProbID, a recent system developed by Zhang et al.,20 uses Bayesian networks to model and score spectra. ProbID takes into account intensity, m/z error, and signal-to-noise factors. But, it achieves the objective through a multistage scoring scheme (Bayesian network), which creates some difficulties. For example, the probability of getting a match for an m/z value is dependent on the number of noisy peaks in the spectrum, yet the scorer treats them as independent events. The intensities of peaks in the experimental spectrum are normalized to one and treated as probabilities (or weights), a system that may lead to an underemphasis of fragment ions at low and high m/z values. In essence, three properties of any match contribute to its probability of being correct: accounting for intense peaks, matching of fragment ions with high m/z fidelity, and explaining peaks that are located in sparse regions of a spectrum. These aspects of a match must be combined in some way to produce a single score, whether by cross-correlation or in a probabilistic framework. In this work, we have designed a probability-based scoring method that explicitly models these three properties. It yields improved scoring discrimination and excellent speed. 
The scorer, termed MASPIC (Multinomial Algorithm for Spectral Profile-based Intensity Comparison), converts an experimental spectrum into a probability profile along the m/z axis (referred to as a m/z profile) and then evaluates potential matches using a multinomial distribution.32 Implemented as a scoring module in DBDigger,33 this system enables significant improvement in peptide identification relative to Sequest. In this report, we compare MASPIC’s scores with Sequest’s cross-correlation using protein samples of defined content along with more biologically relevant samples. We show that MASPIC outperforms Sequest most substantially when large numbers of candidate sequences are considered for each spectrum.
(32) Feller, W. An Introduction to Probability Theory and Its Applications, 3rd ed.; John Wiley and Sons: New York, 1967; Vol. I. (33) Tabb, D. L.; Narasimhan, C.; Strader, M. B.; Hettich, R. L. Anal. Chem. 2005, 77, 2464-2474.
MATERIALS AND METHODS
Chemicals and Reagents. All salts, DTT, trifluoroacetic acid, and guanidine were obtained from Sigma Chemical Co. (St. Louis, MO). Sequencing grade trypsin, from Promega (Madison, WI), was used for all protein digestion reactions. The water and acetonitrile used in all sample cleanup and HPLC applications were HPLC grade from Burdick & Jackson (Muskegon, MI), and the 98% formic acid used in these applications was purchased from EM Science (an affiliate of Merck KGaA, Darmstadt, Germany).
Protein Standard Mixture Composition. All protein standards used in this study were obtained from Sigma Chemical Co. Two stock protein standard mixtures were prepared for this study. The first protein standard mixture (PSM) contained the following purified proteins: hemoglobin α-chain and β-chain (2 mg total)
from Bos taurus, carbonic anhydrase (2 mg) from B. taurus, myoglobin (1 mg) from Equus caballus, albumin (5 mg) from B. taurus, alcohol dehydrogenase chain I and II (3 mg total) from Saccharomyces cerevisiae, and lysozyme (1 mg) from Gallus gallus. The second protein standard mixture (extended PSM) was created by mixing the following proteins at approximately equal molar concentrations: alcohol dehydrogenase from E. caballus liver, alcohol dehydrogenase from S. cerevisiae, α-amylase from Bacillus subtilis, albumin from B. taurus, β-amylase from Ipomoea batatas, carbonic anhydrase IV from B. taurus, conalbumin from G. gallus, concanavalin A from Canavalia ensiformis, cytochrome c from B. taurus, deoxyribonuclease I from B. taurus, lysozyme from G. gallus, β-lactoglobulin A from B. taurus, β-lactoglobulin B from B. taurus, ribonuclease A from B. taurus, ribonuclease B from B. taurus, thyroglobulin from B. taurus, albumin from Homo sapiens, apomyoglobin from E. caballus, hemoglobin (α and β) from E. caballus, and (apo)-transferrin from B. taurus.
The PSM and the extended PSM were dissolved in 6 M guanidine and 10 mM DTT to make stock solutions. An aliquot of each stock solution was incubated at 60 °C for 1 h. The guanidine and DTT were diluted 6-fold with 50 mM Tris, 10 mM CaCl2 (pH 7.8), and sequencing grade trypsin was added at 1:100 (w/w). The digestion was performed with gentle shaking at 37 °C for 18 h followed by a second addition of trypsin at 1:100 (w/w) and an additional 5-h incubation. A third protein standard mixture termed extended PSM Prot-K was created by taking an aliquot of the extended PSM stock solution and diluting it in 8 M urea in 0.2 M Na2CO3 (pH 11.0). Proteinase-K was added at 1:100 (w/w) and incubated at 37 °C for 3 h; an additional 1:100 proteinase-K was added followed by a 45-min incubation at 37 °C. After the digestion, all three protein standard mixtures were treated with 20 mM DTT for 1 h at 37 °C as a final reduction step.
Protein standard mixtures were desalted with Sep-Pak Plus C18 solid-phase extraction cartridges (Waters, Milford, MA). Samples were concentrated and solvent was exchanged into 100% water/0.1% formic acid by centrifugal evaporation to ∼1 mg/mL starting material, filtered, aliquoted, and frozen at -80 °C until LC-MS/MS analysis. LC-MS/MS Analysis of the PSM, Extended PSM, and Extended PSM Prot-K. The PSM sample was analyzed by a single-dimensional capillary-LC-ES-MS/MS experiment. The LCMS/MS experiment was performed using an Ultimate HPLC (LC Packings, a division of Dionex, San Francisco, CA) coupled to an LCQ-DECA ion trap mass spectrometer (Thermo Finnigan, San Jose, CA) equipped with an electrospray source. Injections were made with a Famos (LC Packings) autosampler onto a 20-µL loop. For this analysis ∼20 µg of the PSM was injected. Peptides were injected onto a Vydac (Grace-Vydac, Hesperia, CA) C18 column (300 µm i.d. × 25 cm, 300 Å with 5-µm particles) at a flow rate of 4 µL/min and separated over 90 min with a reversed-phase gradient from 95% H2O/5% ACN/0.5% formic acid to 30% H2O/ 70% ACN/0.5% formic acid. Peptides were eluted directly into a Finnigan electrospray source with 100-µm-i.d. fused silica. For all 1D LC/MS/MS data acquisition, the LCQ was operated in the data-dependent mode, where the top four peaks in every full MS scan were subjected to MS/MS analysis. Dynamic exclusion was enabled with a repeat count of 2 and exclusion duration of 1 min. Extended PSM and extended PSM Prot-K standards were analyzed in triplicate by a nano-LC-nano-ES-MS/MS experiment. The LC-MS/MS experiments were performed on an integrated Famos/Switchos/Ultimate 1D HPLC system (LC Packings)
directly coupled to a quadrupole ion trap mass spectrometer (LCQDECAXPplus, Thermo Finnigan) outfitted with a nanospray source. The diluted (∼10 ng/µL) extended PSM and extended PSM Prot-K mixtures were injected via the Famos autosampler (50 µL per injection) onto a C18 trapping cartridge at 30 µL/min (300 µm i.d. × 5 mm, 100 Å with 5-µm particles) (C18 PepMap, LC Packings). For this analysis, ∼500 ng of the protein standards was injected for each run. The trapping cartridge was washed for 10 min after injection with loading solvent (100% H2O with 0.1% formic acid). The trapping cartridge was then switched in-line with the nanoresolving column, and peptides eluted from the trapping cartridge and resolved on a C18 nanocolumn (75 µm i.d. × 25 cm, 300 Å with 5-µm particles) (Grace-Vydac). Peptides were resolved via gradient elution (∼150 nL/min) from 100% Solvent A (95% H2O/5% ACN with 0.1% formic acid) to 50% Solvent B (30% H2O/70% ACN with 0.1% formic acid) for 100 min, followed by a 20-min wash with 100% Solvent B and a 20-min equilibration back to Solvent A before the next run. The outlet from the resolving column was directly connected to the nanospray source with a short piece of fused silica. For all experiments, the MS was operated with the following parameters: ES voltage 2.0 kV, heated capillary temperature of 225 °C, and m/z range of 400-2000. The MS was operated with five microscans averaged for full scans and MS/MS scans, 5-Da isolation widths for MS/MS isolations, and 35% collision energy for collision-induced dissociation. For all replicates, the MS was operated in the data-dependent MS/MS mode, where the four most abundant peaks in every full MS scan were subjected to MS/MS analysis. Dynamic exclusion was enabled with a repeat count of 1. Algorithm. 
The MASPIC scoring scheme converts the experimental spectrum into a probability profile along the m/z axis (referred to as the m/z profile) and then compares it to candidate peptides using a multinomial distribution. Each probability in the m/z profile measures how likely a particular type of match or mismatch is to occur by random chance; the profile is generated during the preprocessing step. First, the experimental spectrum is divided into zones based on parent charge. The boundaries of these zones are determined by the fragment charges they can contain. The +2 charged spectra are divided into two zones: the m/z region where there are fragments with charges +1 and +2 and the m/z region with +1 fragments alone (i.e., greater than the parent m/z). Similarly, the +3 charged spectra are divided into three zones: the m/z region where there are fragments with charges +1, +2, and +3, the m/z region with fragment charges +1 and +2, and the m/z region with just +1 fragments. The +1 charged spectra contain just one zone. Within each zone, peaks are binned into classes based on intensity. The number of peaks (the size) in the most intense class is equal to the number of b and y ion fragments expected in that zone (computed from a weighted average residue mass of 119.40). After that, the size of the class increases by a factor of 2 (the bin growth factor) as intensity decreases. Thus, a random match to a peak in the second most intense class is twice as likely as a random match to a peak in the most intense class. This decreases the importance of a match with decreasing intensity. The total number of peaks considered for scoring in each zone is equal to the expected number of theoretical peaks multiplied by 10 (signal-to-noise ratio of 0.1). The total number of peaks for scoring, the number of peaks in the initial bin, and the bin growth factor together determine the total number of intensity bins in a zone.
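The zoning and binning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DBDigger implementation: the function names, the proton mass constant, the rule that the boundary for fragment charge c sits at the m/z of the precursor carrying c charges, and the truncation of the last intensity class are all assumptions. The expected number of b and y fragments is taken as an input because the text does not spell out how it is derived from the average residue mass.

```python
PROTON = 1.00728  # assumed proton mass, used only to convert m/z to neutral mass


def zone_bounds(precursor_mz, charge, mz_min, mz_max):
    """Split [mz_min, mz_max] into zones by the highest fragment charge allowed.

    Assumption: fragments of charge c cannot lie above the m/z of the
    precursor carrying c charges, so those values are used as boundaries.
    """
    neutral = charge * (precursor_mz - PROTON)
    # Boundaries for fragment charges charge, charge-1, ..., 2 (ascending m/z).
    cuts = [(neutral + c * PROTON) / c for c in range(charge, 1, -1)]
    edges = [mz_min] + [c for c in cuts if mz_min < c < mz_max] + [mz_max]
    return list(zip(edges[:-1], edges[1:]))


def class_sizes(expected, growth=2, multiplier=10):
    """Sizes of intensity classes, most intense first.

    The first class holds `expected` peaks, each later class is `growth`
    times larger, and the total is capped at `multiplier * expected`.
    Assumption: the last class is truncated to fit the cap.
    """
    remaining = expected * multiplier
    sizes, size = [], expected
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
        size *= growth
    return sizes


def bin_peaks(peaks, expected):
    """Group (mz, intensity) peaks of one zone into intensity classes."""
    ranked = sorted(peaks, key=lambda p: p[1], reverse=True)
    classes, start = [], 0
    for size in class_sizes(expected):
        chunk = ranked[start:start + size]
        if not chunk:
            break
        classes.append(chunk)
        start += size
    return classes
```

For a +2 precursor at m/z 623.6, `zone_bounds(623.6, 2, 200.0, 2000.0)` yields two zones that meet at the precursor m/z, and `class_sizes(5)` yields four classes whose sizes sum to 50 peaks.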
MASPIC also gives higher weight to matches within ±0.2 m/z unit of an experimental peak and slightly less weight to matches within ±(0.2-0.5) m/z unit of an experimental peak. In terms of random-match probabilities, a low-accuracy match is 1.5 times more likely than a high-accuracy match, so a high-accuracy match carries correspondingly more weight. Thus, within each intensity class, peaks can have two categories of “potential matches” based on m/z error. The combined random match probability of peaks in an intensity class and in a specific m/z category is the fraction of the m/z range of the zone occupied by such peaks. There is a probability associated with a mismatch within each zone as well. It is computed as the fraction of the m/z range of the zone left unoccupied by the spectral peaks considered for scoring. By combining these, each m/z position in the spectrum has a defined probability associated with it. The probability values for the above-described m/z profile for each zone are calculated in the following way. Let the total m/z range for ith zone and jth class and m/z error category be qij. Let all the m/z profile values within each zone be θij, where i represents the zone number and j represents the unique class and m/z error category. The values of θij are defined as follows:
$$\theta_{ij} = \frac{q_{ij}}{\sum_{j} q_{ij}}$$
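As an illustration of this definition, the sketch below computes the θ values for a single zone from the number of peaks in each intensity class. It is a simplified reading, not the authors' code: it assumes that the match windows of different peaks do not overlap, that the mismatch region counts as one more category in the sum over j, and that the function and key names are free choices. Under these assumptions a high-accuracy window spans 0.4 m/z unit per peak and a low-accuracy window spans 0.6 m/z unit per peak, which reproduces the factor of 1.5 mentioned in the text.

```python
def mz_profile(zone_width, class_counts, hi_tol=0.2, lo_tol=0.5):
    """Return {category: theta} for one zone.

    zone_width   -- m/z range covered by the zone
    class_counts -- number of peaks in each intensity class, most intense first
    Assumption: match windows of different peaks do not overlap.
    """
    q = {}
    for j, n_peaks in enumerate(class_counts):
        # m/z range within +/- hi_tol of the peaks in class j
        q[(j, "high")] = n_peaks * 2 * hi_tol
        # m/z range between hi_tol and lo_tol on both sides of those peaks
        q[(j, "low")] = n_peaks * 2 * (lo_tol - hi_tol)
    occupied = sum(q.values())
    if occupied >= zone_width:
        raise ValueError("match windows cover the whole zone")
    # Everything not covered by a match window counts as a mismatch.
    q["miss"] = zone_width - occupied
    total = sum(q.values())
    return {key: value / total for key, value in q.items()}
```

With `mz_profile(400.0, [5, 10])`, the high-accuracy category of the most intense class receives θ = 2.0/400 = 0.005, the second class receives twice that value, and the mismatch category receives 385/400.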
For a theoretical m/z peak list from a candidate peptide, the corresponding matching classes and their probabilities are determined through a lookup on this m/z profile. A multinomial distribution is then employed as a scoring function to determine the score. The theoretical spectrum consists of a list of m/z values of +1 charged b and y ions for +1 and +2 parent spectra. It contains both +1 and +2 charged fragment ions for +3 parent spectra. The monoisotopic masses of the fragment ions are used to calculate the m/z values of the theoretical spectrum. Theoretical m/z values that lie within 0.6 m/z unit of one another are merged into one, to account for the resolution of ion trap instruments. The theoretical m/z peak list, thus generated, is matched against the m/z profile (see Supporting Information Figure S1). A count is kept of how many of the theoretical m/z values fall within each class and category and how many fail to match any peak. Let the number of m/z values in the theoretical peak list within a zone be Ni. Let the counts of such theoretical peak assignments to the various matches and mismatches be denoted by the vector 〈kij〉, where i represents the zone and j represents the unique class and category code. The probability of generating 〈kij〉 for θij is given by
$$P(\langle k_{ij} \rangle) = \prod_{i} \left[ \frac{N_i!}{\prod_{j} k_{ij}!} \prod_{j} \theta_{ij}^{k_{ij}} \right]$$
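A minimal sketch of this probability, evaluated in log space and returned as a negative natural logarithm, is given below. The function name and the dictionary-based input layout are assumptions made for illustration. The log-gamma function is used for the factorials so that large counts do not overflow.

```python
from math import lgamma, log


def multinomial_score(zones):
    """Return -ln P(<k_ij>) for a candidate peptide.

    zones -- list of (counts, thetas) pairs, one per zone; both are dicts
             keyed by category, and the counts of a zone sum to N_i.
    """
    log_p = 0.0
    for counts, thetas in zones:
        n_i = sum(counts.values())
        log_p += lgamma(n_i + 1)  # ln N_i!
        for category, k in counts.items():
            if k:
                # k * ln(theta) - ln(k!) for each occupied category
                log_p += k * log(thetas[category]) - lgamma(k + 1)
    return -log_p
```

For a single zone with two equally likely categories and one theoretical peak in each, P = 2!/(1!·1!) × 0.5 × 0.5 = 0.5, so the function returns ln 2 ≈ 0.693. A candidate whose peaks all fall in a category with θ = 1 gives P = 1 and a score of 0.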
The score P(〈kij〉), as in previous works,23,24 tests the null hypothesis that a peptide candidate sequence matches a spectrum by random chance. Thus, P(〈kij〉) takes high values for the average numbers of matches expected by chance and low values for rare, distinctive match patterns. The negative natural logarithm of P(〈kij〉) is calculated, and this value is reported as the score by the MASPIC system for each matching process. Testing Protocol. The MASPIC system and Sequest were tested on four different data sets obtained using LCQ-quadrupole
ion trap instruments. These were an 8-protein standard mixture (referred to as the PSM test set), a 20-protein standard mixture digested with trypsin (referred to as the extended PSM data set), a ribosomal protein test set (referred to as ribosomal data set) prepared from the R. palustris enriched 70S ribosome complex,34 and a 20-protein standard mixture digested with proteinase-K (referred to as the extended PSM Prot-K). For all four data sets, a list of true positive proteins was generated, either based on proteins added to the standard mixtures (the three protein standard mixtures) or determined through analysis (the ribosomal data set; see Results and Discussion). A set of decoy sequences (proteins known not to be present in the samples) was added to the true positives list to create the test database.35 The size of the decoy databases was varied to study performance degradation and to simulate genomes of varying size. The charge states of the above stated spectra were determined using an internally developed system.36 Searches were then performed with DBDigger and Sequest to generate a list of true and false identifications along with a score for such identifications. All peptide hits that are reported from this true positives list are considered as true identifications. These identifications were converted to true and false distributions for their corresponding score. These distributions were evaluated, and the lowest score for which there was 1 false identification for every 19 true identifications (95% confidence) was calculated for +1, +2, and +3 charge parent spectra. These cutoffs were then used with DTASelect37 to identify the proteins and peptides present in the sample. DTASelect was run with options -p 2 (report proteins with at least 2 unique peptide hits), -d 0 (do not use filter for deltCN), and -DB (for generating DB-Peptides.txt file). 
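The 95% confidence cutoff described above can be illustrated with a short sketch. It is one reading of the procedure under stated assumptions, not the scripts used in the study: the function name is hypothetical, identifications are counted at or above each candidate threshold, and the lowest threshold at which true identifications outnumber false ones by at least 19 to 1 is returned. In practice it would be applied separately to +1, +2, and +3 parent spectra.

```python
def cutoff_at_ratio(true_scores, false_scores, ratio=19):
    """Lowest score threshold with at least `ratio` true IDs per false ID.

    Returns None when no threshold satisfies the ratio.
    """
    candidates = sorted(set(true_scores) | set(false_scores), reverse=True)
    best = None
    for threshold in candidates:
        n_true = sum(s >= threshold for s in true_scores)
        n_false = sum(s >= threshold for s in false_scores)
        # Require at least one true ID so an empty set never qualifies.
        if n_true > 0 and n_true >= ratio * n_false:
            best = threshold
    return best
```

The quadratic loop keeps the sketch short; for large result sets a single pass over the sorted scores with running counts would do the same job.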
The DB-Peptides.txt file obtained from running DTASelect was further analyzed to find the number of unique +1, +2, and +3 charged parent peptides identified. To further validate the quality of performance, receiver operating characteristic (ROC) curves were generated and plotted for the different runs. To maintain neutrality in performance evaluation, scripts with the same options were used to process and analyze outputs from all software. All the searches were performed with no posttranslational modifications on amino acid sequences.
(34) Strader, M. B.; Verberkmoes, N. C.; Tabb, D. L.; Connelly, H. M.; Barton, J. W.; Bruce, B. D.; Pelletier, D. A.; Davison, B. H.; Hettich, R. L.; Larimer, F. W.; Hurst, G. B. J. Proteome Res. 2004, 3, 965-978. (35) Yates, J. R., III; Eng, J. K.; McCormack, A. L. Anal. Chem. 1995, 67, 3202-3210. (36) Razumovskaya, J.; Fridman, T.; Day, R.; Uberbacher, E.; VerBerkmoes, N.; Gorin, A. Proc. 52nd Am. Soc. Mass Spectrom. Conf. 2004, MPE 074. (37) Tabb, D. L.; McDonald, W. H.; Yates, J. R., III. J. Proteome Res. 2002, 1, 21-26. (38) Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Nucleic Acids Res. 1997, 25, 3389-3402.
Test Data Sets. Sequest and the MASPIC system were compared on the PSM test, extended PSM, ribosomal, and extended PSM Prot-K data sets. The proteins present in the PSM and the extended PSM samples were eukaryotic, with the exception of α-amylase. Hence, decoy databases consisting of prokaryotic Open Reading Frames (ORFs) were constructed. The prokaryotic genomes used were Yersinia pseudotuberculosis, Shewanella oneidensis, Rhodopseudomonas palustris, Nostoc punctiforme, Burkholderia xenovorans, and Escherichia coli K12. Different sets of decoy databases were constructed by increasing the number of proteins in the decoy set roughly by a factor of 2. The database was modified by removing high-scoring BLAST38 hits for α-amylase. This was done to prevent skewed results due to
misidentifications arising from protein homologues in the decoy database. Thus, for the PSM test, extended PSM, and extended PSM Prot-K data sets, four different databases with size ranging from a proteome of a prokaryote to a higher eukaryote were constructed. The PSM test, extended PSM, and extended PSM Prot-K data sets contained 1297, 2197, and 2200 spectra, respectively. For the ribosomal data set, spectra from a separate LC-MS/ MS experiment on the same sample were used to generate a true positive protein list. Since the exact list of proteins present in the sample were not known, we constructed an approximate true positives list. To achieve this goal, data from 2-D LC-MS/MS runs of the ribosomal data set34 were used to generate the true positives list. First, a tryptic search was performed with the 2-D separated ribosomal protein MS/MS data against the R. palustris protein sequence database using DBDigger33 with the MASPIC system and Sequest. Proteins with at least two peptide hits at 95% confidence cutoff scores from the searches were obtained using DTASelect.37 A protein was considered to be present in the sample if it was reported to be present either by DBDigger or Sequest. A total of 170 proteins were obtained (this includes the ribosomal proteins and nonspecific binding proteins that were copurified with the ribosome) of which 106 were reported to be present by both DBDigger and Sequest. Since R. palustris is a prokaryote, a decoy database constructed from prokaryotes might contain exact matches for the correct peptide sequences since the ribosome is highly conserved. A recently annotated protein sequence from the Arabidopsis thaliana genome was downloaded from GENBANK and used as a decoy data set. Since A. thaliana also contains ribosomal proteins, all proteins that were annotated as ribosomal proteins (∼300) were removed. Additionally, a tryptic search was also performed on the 2-D separated ribosomal protein MS/MS data with the modified A. 
thaliana database, and the results confirmed that no particular A. thaliana protein produced more than two peptide hits to the MS/MS data. With this final A. thaliana protein file, three decoy databases were created with 6141, 12 489, and 25 249 proteins, respectively. The 170 R. palustris protein sequences were added to these decoy databases to make three test databases. The R. palustris 1-D LC-MS/MS ribosomal data set contained multiple LC analyses and 6942 spectra. RESULTS AND DISCUSSION The MASPIC system was tested as a part of DBDigger and was compared with Sequest’s cross-correlation scoring scheme. The PSM test, extended PSM, and ribosomal data sets were searched twice with both tryptic and nontryptic (search with no enzyme specificity) search options. The extended PSM Prot-K data were searched only using the nontryptic search option. Counts of peptides and spectra identified correctly with 95% confidence along with their charge state were obtained using DTASelect. A simple test of the number of true identifications clearing a threshold is inadequate. This is because, with an increase in the number of true identifications, there is a possibility that there will be a corresponding increase in number of false identifications. To make sure that such a scenario does not occur, ROC curves were plotted for the different runs. ROC curves presented in this paper have the number of true identifications (IDs) on the y-axis and the number of false identifications on the x-axis. The curves reflect the number of true and false identifications made by each scorer at a particular score threshold. For each method tested,
there is a point on the curve where the ratio of true identifications to false identifications is 19:1. This point corresponds to 95% confidence. This point also shows how many true and false identifications were made by that particular scorer at that confidence level. There are potential caveats in this performance analysis. For tryptic searches, DBDigger has no limit on missed cleavage sites. Hence, for all tryptic searches performed using Sequest, the number of missed cleavage sites was set to five (the maximum allowed by that software). This could lead to bias in the tryptic searches. Yet another factor is that Sequest uses a preliminary scorer in filtering candidate peptides and retains only the top 500 candidates for evaluation by cross-correlation. DBDigger does not filter candidate peptides. The lack of prefiltering leads to a larger number of candidate sequences to be differentiated by MASPIC, but its use may eliminate some good candidate peptides for Sequest. In general, though, these two differences are likely to have only subtle impact on the scoring of peptide sequences. MASPIC Scoring Scheme. The MASPIC system incorporates three factors: intensity, expected peak density, and m/z error variations, into a single scoring function. Unlike other previous algorithms, these factors are incorporated into the scoring scheme with intent to test the null hypothesis using a multinomial distribution. Figure 1 shows an example of a spectrum identified by MASPIC. The spectrum was assigned to a +2 charged parent peptide with sequence “R.TPEVDDEALEK.F”. The spectral assignment shows that most of the intense peaks in the spectrum were accounted for by the assignable fragments of the peptide. When manual peak assignment is performed in mass spectrometry, researchers attempt to assign as many “high-intensity” peaks as possible. 
If two candidate sequences match different sets of “low-intensity” peaks, then there is no way of confidently stating which of the two sets of sequences is correct. When two candidate sequences differ in the intensity of peaks matched, then it is more likely that the candidate that matches more intense peaks is the correct one. Peptides fragmenting under low-energy CID predominantly produce b and y ions,9 and hence, high-intensity peaks tend to be ions from these b and y series. This makes such peptide assignments unique and more significant. Hence, the MASPIC system converts the intensity distributions into probabilities in such a way that high-intensity peaks are given more importance. The MASPIC system also incorporates peak density variations within an experimental spectrum in the form of “zones”. It is well known that the fragments of certain charge states are found in certain regions of the spectrum. For example, the [M + H]+ of the peptide sequence R.TPEVDDEALEK.F from Figure 1 is 1246.31 Da, and thus, the m/z value for the doubly charged precursor of the peptide is 623.6. A +2 fragment of this peptide cannot be found at regions with m/z greater than 623.6. Peak density (the measure of the number of peaks found per unit m/z value) within a spectrum can vary considerably between zones (Supporting Information Figure S2). This means that certain zones within a spectrum are likely to produce more matches by random chance than others. Thus, it is more significant to match peaks in regions of low peak density than in regions of high peak density. Additionally, the experimental m/z errors follow a normal distribution in the neighborhood of the theoretical fragment ion m/z values (Supporting Information Figure S3). Thus, more weight should be placed on a peak when the experimental m/z value is closer to the theoretical m/z value. All of the above stated Analytical Chemistry, Vol. 77, No. 23, December 1, 2005
Figure 1. Spectrum corresponding to a +2 charged peptide sequence R.TPEVDDEALEK.F from β-lactoglobulin precursor. This peptide was identified by MASPIC with a score of 73.94 and a deltCN [16,37] of 0.46. Sequest identified this spectrum with a cross-correlation score of 3.38 and a deltCN of 0.32.
scoring factors, intensity, peak density variations, and m/z errors were incorporated into the MASPIC system through the m/z profile. The process of converting an experimental spectrum into an m/z profile is explained using the +2 charged spectrum in Figure 1. This spectrum is first divided into two zones at the precursor m/z value of 623.6. The peaks within each zone are grouped based on their intensities. For the purpose of illustration and explanation, the peaks in each zone were classified into three intensity classes (class-A, class-B, class-C). Figure 2a represents the modified spectrum obtained from preprocessing the spectrum in Figure 1 after this classification process. The heights of the peaks are the inverse of the probability value associated with it. Thus, the height reflects the relative importance of matching a particular peak. The height also reflects the intensity class the peaks belong to. The peaks in the region with an m/z value greater than 623.6 (zone 2) have a different height compared to a peak in the region less than 623.6 (zone 1). This height difference is caused by the difference in random match probability due to different expected concentrations of peaks within each zone. Thus, unlike previous work,20 peak densities and importance of intensities matched are incorporated in a single step. Figure 2b shows the m/z region from 585.0 to 645.0 of Figure 2a that contains the border between zone 1 and zone 2. The actual border is at m/z 623.6. Figure 2b contains one class-A peak, one class-B peak, and seven class-C peaks from zone 1. The figure also contains one class-B peak and one class-C peak from zone 2. Figure 2c illustrates the probability as a function of m/z for the corresponding region in Figure 2b. The areas of the plot between the peaks represent the significance (expressed as probability) of a theoretical peak falling into such a region. The jump between zone 1 and zone 2 represents the difference of this value between zones. 
The peaks themselves are shown to be associated with two probabilities, for potential matches within ±0.2 m/z unit and for potential matches ±0.2-0.5 m/z unit. This procedure is not applied to +3 spectra (see later discussion). Thus, an experimental spectrum has been converted to a profile along the m/z axis reflecting the probability of random match. The two-dimensional information associated with a peak in the experimental spectrum (m/z value and intensity) has been converted to one-dimensional information (as a measure of probability associated with m/z value). Now, a theoretical m/z peak list can be associated with any of the groups in each zone. For instance, Figure 2c shows a match for the y5 (m/z 589.3) fragment from peptide R.TPEVDDEALEK.F to an experimental peak at m/z 589.4 with high accuracy (within ±0.2 m/z unit of the experimental peak). These associations are then evaluated against the multinomial distribution and produce a score for that peak list. The MASPIC system is unique in its use of a multinomial distribution to test the null hypothesis. The chief reason for its use is to permit multiple dimensions of information to be combined probabilistically. Because information about peak intensities, match m/z fidelities, and peak densities is all composited together in MASPIC, a multinomial system is necessary. A hypergeometric system is useful for simple evaluations of combinations of matches and mismatches, but these systems are not easily scalable to many dimensions of information. Matching spectra to peptide sequences by a simple count of matches and mismatches is sufficient to separate true and false identifications for an “easier” subset of identifiable spectra. However, many spectra contain ion series that are incomplete due to residue content, regions that are dense with secondary fragment ions, or other variations that limit the applicability of simple scoring schemes. To cope with this spectral quality heterogeneity, MASPIC combines multiple match properties. 
A multinomial distribution has proven to be a useful way to amalgamate these multiple categorical variables.
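To make the multinomial idea concrete, the following Python sketch computes the probability of a set of match counts under a multinomial null model. It is only an illustration under stated assumptions: the four categories, their random-match probabilities, and the use of the negative log10 probability as a score are hypothetical choices made here, not the exact MASPIC formulation.

```python
import math

def multinomial_log10_prob(counts, probs):
    """Return log10 of the multinomial probability of `counts`,
    given one random-match probability per category in `probs`."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probs must sum to 1")
    n = sum(counts)
    # Natural log of the multinomial coefficient n! / (k1! k2! ...).
    log_p = math.lgamma(n + 1) - sum(math.lgamma(k + 1) for k in counts)
    for k, p in zip(counts, probs):
        if k > 0:
            if p == 0.0:
                return float("-inf")  # impossible outcome under the null
            log_p += k * math.log(p)
    return log_p / math.log(10)

# Hypothetical categories: class-A match, class-B match, class-C match, no match.
probs = [0.01, 0.03, 0.06, 0.90]
good_candidate = [4, 5, 3, 8]    # many matches to intense peaks
poor_candidate = [0, 1, 2, 17]   # few matches, all to weak peaks

score_good = -multinomial_log10_prob(good_candidate, probs)
score_poor = -multinomial_log10_prob(poor_candidate, probs)
print(f"good: {score_good:.2f}  poor: {score_poor:.2f}")
```

In this toy setting the candidate that matches more high-intensity peaks is far less likely to arise by chance and therefore receives the larger score, which is the behavior the text describes qualitatively.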
Figure 2. Process of transforming the spectrum in Figure 1 into an m/z probability profile. (a) shows the transformed spectrum in Figure 1 after dividing the spectrum into two zones and grouping the peaks within each zone into three distinct classes in each zone. (b) is the zoomed in region of the rectangular area between 585 and 645 m/z units in (a). It shows the zone boundary at the precursor m/z. (c) an illustration of the m/z profile for the corresponding region in 2 (b).
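The transformation summarized in Figure 2 can be sketched in a few lines of Python. This is a simplified illustration, not the MASPIC implementation: the equal-sized rank-based split into three intensity classes, the function names `build_profile` and `match_fragment`, and the example peak list are all assumptions introduced here. Only the zone boundary (precursor m/z 623.6), the two match windows (±0.2 and ±0.2-0.5 m/z unit), and the y5 example (theoretical m/z 589.3, experimental m/z 589.4) come from the text.

```python
def build_profile(peaks, precursor_mz):
    """Label each (mz, intensity) peak with a zone and an intensity class.

    Zone 1 holds peaks at or below the precursor m/z, zone 2 the rest.
    Within each zone, peaks are ranked by intensity and split into three
    roughly equal classes A, B, C (an assumed classification rule).
    """
    zones = {1: [], 2: []}
    for mz, intensity in peaks:
        zones[1 if mz <= precursor_mz else 2].append((mz, intensity))
    profile = []
    for zone, zone_peaks in zones.items():
        ranked = sorted(zone_peaks, key=lambda p: p[1], reverse=True)
        for rank, (mz, _intensity) in enumerate(ranked):
            cls = "ABC"[rank * 3 // len(ranked)]
            profile.append({"mz": mz, "zone": zone, "cls": cls})
    return profile

def match_fragment(profile, theoretical_mz, tight=0.2, loose=0.5):
    """Return (zone, class, window) for the closest peak within `loose`
    m/z units of the theoretical fragment, or None if nothing matches."""
    best = None
    for peak in profile:
        error = abs(peak["mz"] - theoretical_mz)
        if error <= loose and (best is None or error < best[0]):
            best = (error, peak)
    if best is None:
        return None
    error, peak = best
    window = "tight" if error <= tight else "loose"
    return peak["zone"], peak["cls"], window

# Hypothetical peak list around the zone boundary of the Figure 1 spectrum.
peaks = [(589.4, 900.0), (600.1, 120.0), (610.2, 40.0),
         (650.0, 300.0), (700.5, 50.0)]
profile = build_profile(peaks, precursor_mz=623.6)
print(match_fragment(profile, 589.3))  # y5 fragment from the text
```

Each matched fragment ends up in one category (zone, intensity class, error window), and the counts per category are what a multinomial model would then evaluate.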
Even though the current study found the multinomial distribution-based MASPIC system to compare well with Sequest, there are situations when the scores can be misleading. For instance, there can be candidate peptides that produce unusually low numbers of matches for a spectrum. Since the scorer tests for the null hypothesis, such candidates can also have low P(〈kij〉), just like good quality matches. To overcome this, when the total
number of peak matches for a peptide falls below the mean of the multinomial distribution, the score is set to zero. Effect of Individual Components of the MASPIC System. The MASPIC system incorporates three features for improving spectral identification: intensity information, density zoning, and m/z error. These features were assessed individually to determine the contribution of each to the performance of the system as a
Figure 3. Effect of intensity on performance of MASPIC scoring scheme evaluated by switching off the intensity factor in the scoring process. ROC curves for a nontryptic search from the extended PSM data set for (a) +1 charged spectra, (b) +2 charged spectra, and (c) +3 charged spectra. The plots demonstrate that intensity information is needed to enhance the performance of the scoring scheme.
whole. One of the three factors modeled into the MASPIC system is intensity of peaks matched. Figure 3 shows ROC curves for the cases when multiple intensity classes are used versus when a single intensity class (no-intensity model) is used. One anomaly worth discussing is the result for the no-intensity curve for +3 parent spectra. It shows marked performance degradation. This could be because +3 parent spectra are modeled with both +1 and +2 charged fragments; hence, they can match more peaks by random chance. Yet another analysis was performed where +3 spectra were modeled with only +1 charged fragments. The result looked more reasonable and comparable to results from the other two charge states. Thus, when more peaks are modeled into the theoretical spectrum (+2 fragments in this case), discrimination in the form of intensity becomes more necessary to enhance performance. The probability of finding a peak at any given location in a spectrum is not a constant, so MASPIC divides the spectrum into different zones based on inherent peak density variations within a spectrum. To test the importance of this feature, the MASPIC system was set up to consider only a single zone. Figure 4 shows the effect of zoning in discrimination of true and false identifications. The ROC curve shows that there is a significant improvement in performance when the zoning component was included in the MASPIC system. The effect of zoning was more pronounced for +3 spectra than for +2 spectra. Peaks matched from low-resolution ion trap data produce a normal distribution for m/z error (Supporting Information Figure S3). The MASPIC system takes this factor into account and gives different levels of importance for matches within ±0.2 m/z and those that match ±0.2-0.5 m/z. Figure 5 shows the contribution of the m/z error factor to the performance of +1 parent charge spectrum identifications. While analyzing results due to m/z error, it was noticed that +3 charge parent spectra performance showed a marginal decrease when m/z error was considered. Figure S3 shows the reason; the +2 fragments of +3 parent spectra do not follow the same normal distribution as the +1 fragments of +3 parent spectra. This is likely because the monoisotopic peak and the first isotopic peak are sometimes combined into a single peak by the centroiding routine of the mass spectrometer. Because it is not known beforehand which of the experimental peaks are +2 and which are +1, the m/z error criterion cannot be applied effectively. For this reason, the m/z error factor is not modeled by MASPIC for +3 parent spectra. Comparison to Sequest’s Cross-Correlation. The MASPIC results were compared to Sequest cross-correlation results for each of the four samples included in this survey. Identifications were classified by experimental sample, algorithm employed, database size, protease specificity, and precursor charge. These results are presented in Supporting Information Tables S1-S4. The values in these tables were generated using a deltCN cutoff of 0.0 for both Sequest and MASPIC. The aim was to compare cross-correlation and its probabilistic equivalent on a one-to-one basis. Results that include the effects of deltCN are presented in the following section. It is important to note that
Figure 4. ROC curve for +2 and +3 charged spectra showing the importance of “zoning” of experimental spectrum for analysis. The +2 parent spectra are divided into two zones and +3 parent spectra are divided into three zones for analysis. The figure shows the difference in performance of the MASPIC system when there are multiple zones versus a single zone for +2 and +3 charged spectra.
Figure 5. MASPIC system test with and without the m/z error component. The ROC curves for +1 parent spectra of the extended PSM data set are presented to show the contribution of the m/z error data in identification of the true spectrum.
Sequest’s cross-correlation includes isotopic and neutral loss peaks for scoring whereas the MASPIC system includes just monoisotopic b and y ions. For all four samples, MASPIC successfully identified more spectra than Sequest’s cross-correlation. For the standard PSM, both algorithms identified more spectra in the tryptic database searches than in the unconstrained searches. For the database with the fewest decoy sequences, MASPIC identified 322 spectra while Sequest’s cross-correlation identified 315. The difference was more substantial with the unconstrained search of the trypsin digest of the extended PSM; MASPIC’s 811 spectra represent a 19% improvement over Sequest cross-correlation’s 681. In the ribosome sample, the tryptic search resulted in 1738 spectra confidently identified by MASPIC versus 1531 for Sequest’s cross-correlation (see Figure 6 for ROC curves comparing the algorithms in tryptic search on this sample). For the proteinase-K digest of the extended PSM, MASPIC identified 325 spectra while Sequest’s cross-correlation identified 282. MASPIC thus identified more spectra than Sequest’s cross-correlation in all four samples, even at the smallest database size.
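The relative improvements implied by the spectrum counts quoted above can be checked with a short calculation. The sketch below uses only the counts given in the text; the helper name `percent_improvement` and the sample labels are introduced here for illustration.

```python
def percent_improvement(maspic_count, sequest_count):
    """Relative gain of MASPIC over Sequest, in percent of the Sequest count."""
    return 100.0 * (maspic_count - sequest_count) / sequest_count

# (MASPIC, Sequest) spectrum counts as quoted in the text.
spectrum_counts = {
    "standard PSM, smallest decoy database": (322, 315),
    "extended PSM, unconstrained search": (811, 681),
    "ribosome, tryptic search": (1738, 1531),
    "extended PSM proteinase-K digest": (325, 282),
}

for sample, (maspic, sequest) in spectrum_counts.items():
    print(f"{sample}: {percent_improvement(maspic, sequest):.1f}% more spectra")
```

The 19% figure quoted for the extended PSM follows directly from 811 versus 681 spectra; the other three gains are smaller but all positive.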
The numbers of identified spectra decrease as database size increases. However, the magnitude of this effect is less for MASPIC than for Sequest’s cross-correlation. The number of spectra identified by tryptic search in the standard PSM declined by 10.9% for MASPIC as the database size increased by a factor of 7.6, while the number of spectra Sequest’s cross-correlation identified decreased by 11.4%. In the ribosome searches, the database size was scaled by a factor of 4.1. MASPIC’s results diminished by 6.8%, while Sequest’s decreased by 12.1%. Similar trends were observed for unconstrained searches. The extended PSM was digested with trypsin, but the unconstrained searches identified substantial numbers of peptides with nontryptic ends. As the database scaled in size by a factor of 7.6, MASPIC identified 14.2% fewer spectra while Sequest’s cross-correlation identified 16.7% fewer. When these same proteins were digested by proteinase-K instead, far fewer peptides were identified. As the database size increased, MASPIC identified 16.0% fewer spectra while Sequest’s cross-correlation identified 22.7% fewer. Such consistent superior performance for samples digested with proteinase-K digests can potentially enhance biological discovery of membrane proteins.39 While in every case expanding search space caused a diminishment in the number of spectra identified, MASPIC proved more robust against this reduction in the face of database size increases and protease specificity. Confident identifications from each of the four samples were evaluated to determine the proportion of spectra from +1, +2, and +3 charged precursors. Identifications from singly charged precursors comprised a lower percentage of the identifications for unconstrained searches than for tryptic searches. In the extended PSM, MASPIC’s singly charged peptides accounted for 13% of the tryptic search identifications but only 10% of the unconstrained identifications. 
For Sequest’s cross-correlation, this proportion dropped from 8 to 4%. This difference is probably a result of the relative simplicity of +1 peptide spectra. Because they contain fewer peaks, they provide less information for a scorer to assign them uniquely to a particular candidate sequence, and unconstrained searches yield far more candidate sequences for comparison to each spectrum than tryptic searches.33 When MASPIC and Sequest’s cross-correlation identifications
(39) Wu, C. C.; MacCoss, M. J.; Howell, K. E.; Yates, J. R., III. Nat. Biotechnol. 2003, 21, 532-538.
Figure 6. ROC curves for the tryptic search from the ribosomal data set for (a) +1 charged spectra, (b) +2 charged spectra, and (c) +3 charged spectra. The performance of the MASPIC system and Sequest is shown with and without a deltCN filter criterion of 0.1. See also Supporting Information Figure S6.
were compared in unconstrained searches for the extended PSM trypsin digest and proteinase-K digest samples, the singly charged peptide identifications accounted for a larger proportion of the MASPIC results (for the trypsin digest, MASPIC’s +1 peptides were 10% vs Sequest’s 4%). Thus, the MASPIC system resolves true from false identifications more successfully for spectra from singly charged peptides. To establish whether the identifications from the two algorithms were overlapping, we evaluated the intersection of the two sets (see Supporting Information Table S5). Typically, the algorithms perform most similarly on doubly charged peptides, but substantial differences can be found for singly and triply charged peptides. The extended PSM showed dramatic differences for singly charged peptides because Sequest’s cross-correlation achieved far fewer correct identifications at this database size. MASPIC’s tryptic search for the ribosome sample gave many more singly charged peptides, but Sequest cross-correlation’s unconstrained search was more successful than MASPIC’s. Upon examination, 13 of these 23 peptides contained arginine residues, causing them to be dominated by relatively few intense peaks.9 For such spectra, the current intensity classification scheme of the MASPIC system is likely to be inadequate. Yet, it is worth noting that ribosomal proteins tend to be more basic than the rest of the proteome and in general there are far fewer peptides with nonmobile protons than peptides with mobile protons.40 The divergence between the two algorithms for triply charged peptides was also significant, suggesting that using multiple search algorithms on the same database can yield benefits for improved sequence coverage. On all four data sets the MASPIC system consistently performed better than Sequest’s cross-correlation. The improved discrimination of MASPIC makes it well-suited for proteome analysis of higher-order eukaryotes that have larger genomes of increased complexity. The MASPIC system is also likely to help in situations when there is a need for a database incorporating multiple genomes. Based on the unconstrained database searches, MASPIC is superior in situations where nonspecific proteases are used. For example, in situations where multiple proteolytic enzymes are used for the purpose of PTM mapping,41 the MASPIC system’s enhanced discrimination should prove very useful. A limited evaluation of the primary Mascot scoring metric and MASPIC did not produce results reflective of the aggregate Mascot scores and evaluation. An explanation of our attempts at this comparison can be found in the Supporting Information (Figure S5).
(40) Kapp, E. A.; Schutz, F.; Reid, G. E.; Eddes, J. S.; Moritz, R. L.; O’Hair, R. A.; Speed, T. P.; Simpson, R. J. Anal. Chem. 2003, 75, 6251-6264.
Table 1. Performance Comparison between Sequest and the MASPIC System for the PSM Test Set with deltCN = 0.1 (a)

PSM test set; columns 3-6 give results by number of decoy ORFs

result (b)          attribute            4467           8933           17400          34023
MASPIC tryptic      peptides (spectra)   260 (326)      252 (311)      248 (302)      238 (291)
                    charge state         [99,106,55]    [93,106,53]    [88,106,54]    [85,105,48]
                    % decrease                          {3.2}          {5.7}          {10.7}
Sequest tryptic     peptides (spectra)   251 (321)      240 (305)      237 (302)      229 (291)
                    charge state         [90,104,57]    [81,102,57]    [85,101,51]    [79,99,51]
                    % decrease                          {3.8}          {9.9}          {11.9}
MASPIC nontryptic   peptides (spectra)   260 (308)      250 (298)      236 (281)      234 (277)
                    charge state         [89,118,53]    [79,120,51]    [75,113,48]    [78,110,46]
                    % decrease                          {9.1}          {14.8}         {15.9}
Sequest nontryptic  peptides (spectra)   242 (304)      229 (278)      220 (268)      204 (252)
                    charge state         [87,114,41]    [76,111,42]    [72,106,42]    [68,102,34]
                    % decrease                          {9.2}          {13.0}         {27.0}

(a) The values given without any brackets are the number of unique peptides identified (“P”). This is a qualitative measure. The values given in parentheses (“(Q)”) are the number of spectra identified. This is a quantitative measure. The values given in brackets (“[X,Y,Z]”) are the breakdown of peptide identifications in terms of parent charge states +1, +2, and +3. Note that X + Y + Z should equal the number of unique peptides identified (P). The values given in braces (“{R}”) are the percentage decrease in number of peptides identified when compared to results from the smallest decoy database (results in column 3). (b) Each cell contains the number of unique peptide identifications, number of spectra identified, number of +1, +2, and +3 peptide identifications, and percentage decrease with decoy database size.
Effect of deltCN on True and False Identifications. Previous studies have concluded that deltCN, the difference in cross-correlation score between the first and second best identifications normalized by the score of the best identification, can be a good discriminator for separating true and false identifications.16,42 Thus, to perform the comparison under conditions more typical for a Sequest analysis, the following calculations and analyses were performed. MASPIC’s deltCN was defined and computed the same way as Sequest’s deltCN. All true and false identifications with deltCN < 0.1 were removed, and the primary scores were again used to plot ROC curves and generate 95% confidence cutoffs. The ROC curves agree with the previous reports that the quality of Sequest results improves with the use of deltCN. Importantly, the ROC curves show that the quality of MASPIC results also improves when deltCN is considered (see Figure 6 and similar figures in the Supporting Information). The results with deltCN set to 0.1 are presented in Tables 1-4. These tables, along with tables with no deltCN filters (presented in Supporting Information), are analyzed and discussed here. For the PSM test set, deltCN enhances +1 charge state identifications for both tryptic and nontryptic searches. For the extended PSM data set, Sequest’s +1 charge identification improves significantly for tryptic searches. In the nontryptic search mode, for the extended PSM data set, Sequest showed improvement in all three charge states whereas MASPIC showed improvement predominantly in +1 and +3. Results from the ribosomal data set show improvement for Sequest on all three charge states for both tryptic and nontryptic searches. MASPIC shows improvement for +3 in tryptic mode and for +1 and +3 in nontryptic mode. The results for the proteinase-K data set showed a pattern similar to nontryptic searches on extended PSM and ribosomal data sets. Thus, the deltCN filter affects the two scorers in different ways. In summary, for tryptic searches, Sequest gains mainly from +1 spectra whereas MASPIC gains from +3 spectra, and for nontryptic searches, Sequest improves on all three charge states whereas MASPIC improves on +1’s and +3’s. All these charge state and database size-dependent variations point to a potential way of choosing deltCN based on these factors (instead of a single value). Overall, as expected, the effect of deltCN is more pronounced with increasing decoy database size. Table 5 shows common and unique identifications made by both methods with deltCN as a filter. From analyzing the table we can conclude that MASPIC and Sequest differ somewhat in the peptides identified in the nontryptic search mode and that MASPIC identifies most of the peptides Sequest identifies in the tryptic search mode. The tables with deltCN filters sometimes show an increase in peptide identifications with increasing decoy database size. For the extended PSM data set using Sequest in the tryptic search mode, the number of +1 charge identifications actually increased for the two largest decoy databases (17 400 and 34 023 ORFs) in comparison to the 8933-ORF decoy database (Table 2). On further investigation, it was observed that some of the +1 charged false identifications with high cross-correlation scores end up with smaller and smaller deltCN values with increasing database size (the number of candidate peptides increases with database size, and new candidates can replace the second-best match and yield a smaller deltCN), resulting in their elimination from the false identification list. MASPIC showed a more consistent behavior with increasing database size. For the PSM test set, on average over four different databases, MASPIC performed 4.5% better for tryptic and 9.7% better for nontryptic searches. For the extended PSM, MASPIC on average performed 12.2 and 9.9% better on tryptic and nontryptic searches, respectively. For the ribosome data set, MASPIC on average performed 12.7 and 7.6% better on tryptic and nontryptic
(41) MacCoss, M. J.; McDonald, W. H.; Saraf, A.; Sadygov, R.; Clark, J. M.; Tasto, J. J.; Gould, K. L.; Wolters, D.; Washburn, M.; Weiss, A.; Clark, J. I.; Yates, J. R., III. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 7900-7905.
(42) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383-5392.
Table 2. Performance Comparison between Sequest and the MASPIC System for the Extended PSM Test Set with deltCN = 0.1 (a)

extended PSM data set; columns 3-6 give results by number of decoy ORFs

result (b)          attribute            4467            8933            17400           34023
MASPIC tryptic      peptides (spectra)   421 (499)       409 (487)       401 (474)       401 (475)
                    charge state         [61,211,149]    [58,206,145]    [52,204,145]    [52,204,145]
                    % decrease                           {2.22}          {4.0}           {6.4}
Sequest tryptic     peptides (spectra)   384 (456)       353 (422)       364 (432)       354 (420)
                    charge state         [44,198,142]    [21,196,136]    [45,189,130]    [41,188,125]
                    % decrease                           {11.9}          {14.6}          {21.6}
MASPIC nontryptic   peptides (spectra)   753 (845)       724 (812)       713 (799)       678 (759)
                    charge state         [113,422,218]   [97,410,217]    [94,409,210]    [80,397,201]
                    % decrease                           {7.9}           {12.8}          {16.3}
Sequest nontryptic  peptides (spectra)   701 (787)       654 (736)       644 (726)       611 (686)
                    charge state         [96,417,188]    [77,405,172]    [79,399,166]    [67,381,163]
                    % decrease                           {5.1}           {13.7}          {19.9}

(a) The values given without any brackets are the number of unique peptides identified (“P”). This is a qualitative measure. The values given in parentheses (“(Q)”) are the number of spectra identified. This is a quantitative measure. The values given in brackets (“[X,Y,Z]”) are the breakdown of peptide identifications in terms of parent charge states +1, +2, and +3. Note that X + Y + Z should equal the number of unique peptides identified (P). The values given in braces (“{R}”) are the percentage decrease in number of peptides identified when compared to results from the smallest decoy database (results in column 3). (b) Each cell contains the number of unique peptide identifications, number of spectra identified, number of +1, +2, and +3 peptide identifications, and percentage decrease with decoy database size.
Table 3. Performance Comparison between Sequest and the MASPIC System for the Ribosomal Data Set with deltCN = 0.1 (a)

ribosome data set; columns 3-5 give results by number of decoy ORFs

result (b)          attribute            6141             12489            25249
MASPIC tryptic      peptides (spectra)   920 (1771)       888 (1706)       869 (1674)
                    charge state         [127,478,315]    [116,472,300]    [106,465,298]
                    % decrease                            {2.0}            {6.9}
Sequest tryptic     peptides (spectra)   837 (1604)       788 (1509)       752 (1442)
                    charge state         [117,441,279]    [100,429,259]    [96,414,242]
                    % decrease                            {8.3}            {13.7}
MASPIC nontryptic   peptides (spectra)   795 (1493)       756 (1411)       713 (1340)
                    charge state         [92,439,264]     [87,423,246]     [73,401,239]
                    % decrease                            {9.5}            {16.0}
Sequest nontryptic  peptides (spectra)   730 (1396)       700 (1344)       673 (1296)
                    charge state         [87,407,236]     [80,393,227]     [71,380,222]
                    % decrease                            {8.7}            {18.0}

(a) The values given without any brackets are the number of unique peptides identified (“P”). This is a qualitative measure. The values given in parentheses (“(Q)”) are the number of spectra identified. This is a quantitative measure. The values given in brackets (“[X,Y,Z]”) are the breakdown of peptide identifications in terms of parent charge states +1, +2, and +3. Note that X + Y + Z should equal the number of unique peptides identified (P). The values given in braces (“{R}”) are the percentage decrease in number of peptides identified when compared to results from the smallest decoy database (results in column 3). (b) Each cell contains the number of unique peptide identifications, number of spectra identified, number of +1, +2, and +3 peptide identifications, and percentage decrease with decoy database size.
searches. For the proteinase-K data set, MASPIC on average performed 4.9% better. Overall, based on ROC curves and tables, MASPIC either identifies more peptides than Sequest or identifies peptides with higher confidence than Sequest even when deltCN is set to Sequest’s optimal performance criterion of 0.1. Overall, Sequest seems to benefit more from considering deltCN than MASPIC. However, the deltCN calculated for MASPIC is on a log scale; hence, in terms of probability, it is likely to be more significant.
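The filtering and cutoff procedure used in this section can be summarized with a small Python sketch. It assumes the usual Sequest-style definition of deltCN (score gap between the best and second-best candidates divided by the best score) and assumes that each identification is already labeled true or false, for example by whether it maps to a decoy protein. The function names and the toy data are hypothetical; this is not the DTASelect implementation.

```python
def delt_cn(best_score, second_score):
    """Normalized gap between the best and second-best candidate scores."""
    if best_score <= 0:
        return 0.0
    return (best_score - second_score) / best_score

def cutoff_at_confidence(ids, min_delt_cn=0.1, ratio=19.0):
    """Find the most permissive score cutoff with true:false >= ratio.

    `ids` is a list of (best_score, second_score, is_true) tuples.
    Returns (cutoff, n_true, n_false), or None if no cutoff qualifies.
    A ratio of 19:1 corresponds to 95% confidence.
    """
    kept = sorted(((best, is_true) for best, second, is_true in ids
                   if delt_cn(best, second) >= min_delt_cn), reverse=True)
    n_true = n_false = 0
    result = None
    for i, (score, is_true) in enumerate(kept):
        if is_true:
            n_true += 1
        else:
            n_false += 1
        # Evaluate only after all identifications tied at this score are counted.
        if i + 1 < len(kept) and kept[i + 1][0] == score:
            continue
        if n_true >= ratio * n_false:
            result = (score, n_true, n_false)
    return result

# Toy data: 19 confident true hits, then a mix of weaker hits.
ids = [(100 - i, 50, True) for i in range(19)]
ids += [(80, 40, False), (70, 35, True), (60, 58, False), (55, 20, False)]
print(cutoff_at_confidence(ids))
```

In the toy data, the false identification with score 60 is removed by the deltCN filter because its second-best candidate scores nearly as well, which lets the cutoff extend down to a score of 70 while keeping the 19:1 ratio.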
CONCLUSION
This study presents a probability-based scoring scheme that performs experimental-theoretical spectrum comparisons yielding improved peptide identification and reliability. The MASPIC system currently incorporates intensities, density zoning, and m/z error factors. The approach shows promise for further improvements as it does not currently model neutral loss and isotope peaks. Future work can take any of the following directions: (1) There are obvious implications for extending this system to incorporate theoretical model intensities. Recent efforts to build more accurate theoretical intensity models should provide an
(43) Feller, W. An Introduction to Probability Theory and Its Applications, 3rd ed.; John Wiley and Sons: New York, 1967; Vol. I, pp 59 and 172. (44) Burington, R. S.; May, D. C. Handbook of Probability and Statistics with Tables, 2nd ed.; McGraw-Hill Publications: New York, 1970; pp 103-104.
Table 4. Performance Comparison between Sequest and the MASPIC System for the Extended PSM Proteinase-K Digest with deltCN = 0.1 (a)

extended PSM Prot-K; columns 3-6 give results by number of decoy ORFs

result (b)          attribute            4467            8933            17400           34023
MASPIC nontryptic   peptides (spectra)   354 (371)       348 (364)       322 (336)       308 (321)
                    charge state         [30,202,122]    [29,199,120]    [25,186,111]    [22,178,108]
                    % decrease                           {0.0}           {6.5}           {19.0}
Sequest nontryptic  peptides (spectra)   346 (360)       331 (344)       301 (314)       293 (306)
                    charge state         [34,206,106]    [12,200,119]    [9,185,107]     [12,176,105]
                    % decrease                           {4.44}          {19.5}          {29.4}

(a) The values given without any brackets are the number of unique peptides identified (“P”). This is a qualitative measure. The values given in parentheses (“(Q)”) are the number of spectra identified. This is a quantitative measure. The values given in brackets (“[X,Y,Z]”) are the breakdown of peptide identifications in terms of parent charge states +1, +2, and +3. Note that X + Y + Z should equal the number of unique peptides identified (P). The values given in braces (“{R}”) are the percentage decrease in number of peptides identified when compared to results from the smallest decoy database (results in column 3). (b) Each cell contains the number of unique peptide identifications, number of spectra identified, number of +1, +2, and +3 peptide identifications, and percentage decrease with decoy database size.
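Table 5. Common^a and Unique^b Peptide Identifications Made by Both the Algorithms on the Four Data Sets with deltCN = 0.1
data set and search type | parent charge | both methods | MASPIC | Sequest
PSM test tryptic | 1 | 71 | 14 | 8
PSM test tryptic | 2 | 98 | 7 | 1
PSM test tryptic | 3 | 41 | 7 | 10
PSM test nontryptic | 1 | 61 | 17 | 7
PSM test nontryptic | 2 | 99 | 11 | 3
PSM test nontryptic | 3 | 29 | 17 | 5
extended PSM tryptic | 1 | 39 | 13 | 2
extended PSM tryptic | 2 | 186 | 18 | 2
extended PSM tryptic | 3 | 124 | 21 | 1
extended PSM nontryptic | 1 | 56 | 24 | 11
extended PSM nontryptic | 2 | 365 | 32 | 16
extended PSM nontryptic | 3 | 154 | 47 | 9
ribosome tryptic | 1 | 85 | 21 | 11
ribosome tryptic | 2 | 397 | 68 | 17
ribosome tryptic | 3 | 226 | 72 | 16
ribosome nontryptic | 1 | 57 | 16 | 14
ribosome nontryptic | 2 | 354 | 47 | 26
ribosome nontryptic | 3 | 190 | 49 | 32
extended PSM Prot-K nontryptic | 1 | 10 | 12 | 2
extended PSM Prot-K nontryptic | 2 | 158 | 20 | 18
extended PSM Prot-K nontryptic | 3 | 82 | 26 | 23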
^a The number of peptides identified by both MASPIC and Sequest in each of the three charge states. These values are presented in the third column.
^b The number of peptide identifications made by one of the systems and not by the other. These values are provided in column 4 for MASPIC and column 5 for Sequest.
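The common and unique counts defined in the footnotes of Table 5 correspond to a set intersection and two set differences. The sketch below illustrates this under stated assumptions: the function name `compare_identifications` and the peptide sequences are hypothetical and are not taken from the paper's data. The second part uses the "extended PSM Prot-K nontryptic" row of Table 5 to recover per-scorer totals by charge state.

```python
def compare_identifications(maspic_ids, sequest_ids):
    """Split two collections of peptide identifications into common and unique parts."""
    maspic_ids, sequest_ids = set(maspic_ids), set(sequest_ids)
    return {
        "both": maspic_ids & sequest_ids,          # identified by both scorers
        "MASPIC only": maspic_ids - sequest_ids,   # unique to MASPIC
        "Sequest only": sequest_ids - maspic_ids,  # unique to Sequest
    }


# Hypothetical peptide sequences, for illustration only.
maspic = {"AVLGK", "TTPEER", "LMNQK"}
sequest = {"AVLGK", "LMNQK", "GGHWR"}
split = compare_identifications(maspic, sequest)
print({name: sorted(ids) for name, ids in split.items()})

# "extended PSM Prot-K nontryptic" row of Table 5, charge states +1, +2, +3.
both = [10, 158, 82]
maspic_only = [12, 20, 26]
sequest_only = [2, 18, 23]

# Common + unique gives each scorer's total identifications per charge state.
maspic_total = [b + u for b, u in zip(both, maspic_only)]
sequest_total = [b + u for b, u in zip(both, sequest_only)]
print(maspic_total, sequest_total)
```

The totals computed from this row, [22, 178, 108] for MASPIC and [12, 176, 105] for Sequest, agree with charge-state breakdowns listed for the two scorers in Table 4.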
excellent framework for this effort.28,30,31,40 (2) The MASPIC system’s architecture for multiple binning could potentially be improved. Specifically, there might be better ways of binning and partitioning the experimental spectrum to improve performance further. For instance, one can adopt the concepts of MASPIC into the framework of PEP_PROBE.23 In this way, the size of the database could also be factored into the scoring scheme. (3) Neutral loss information has not been incorporated into the current system. Results from work done by other groups indicate
an improvement in performance when such losses are included.12,14,16,21 (4) In this work, a multinomial distribution was used to compare the m/z profile with a theoretical peak list. Other models, such as the Poisson and hypergeometric distributions, have been used for this purpose.23,24 While the multinomial distribution is an adequate framework for the spectral comparison problem,43,44 it might be worthwhile to frame the problem using alternative theoretical or numerical distributions. (5) The MASPIC system is currently implemented for ion trap instruments. Other types of mass spectrometers (e.g., Q-TOFs and linear ion traps) that are in use for studying proteins will likely require their own sets of optimizations for the MASPIC system. These continued efforts should lead to even better tandem MS peptide identification systems.

ACKNOWLEDGMENT
The authors thank Manesh Shah for helping with the evaluation of the software mentioned. The authors also thank Dr. Michael Brad Strader for providing the R. palustris ribosomal data. The authors also thank internal reviewers, Dr. Hayes McDonald, Dr. Loren Hauser, Dr. Andrey Gorin, and Dr. Frank Larimer, for suggesting improvements to the manuscript. C.N. thanks Dr. Andrey Gorin for providing Perl scripts for analyzing a database of spectral assignments. This work was supported by the U.S. Department of Energy’s Office of Biological and Environmental Research and Office of Advanced Scientific Computing under the Genomics:GTL Program.

SUPPORTING INFORMATION AVAILABLE
Visit http://compbio.ornl.gov/MASPIC/index.html for supplemental data and ORF size dependent cutoff values for MASPIC and Sequest. Additional information is noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review January 27, 2005. Accepted September 9, 2005.
AC0501745