Anal. Chem. 2008, 80, 7846–7854

Sequential Interval Motif Search: Unrestricted Database Surveys of Global MS/MS Data Sets for Detection of Putative Post-Translational Modifications

Jian Liu, Alexandre Erassov, Patrick Halina, Myra Canete, Nguyen Dinh Vo, Clement Chung, Gerard Cagney, Alexandr Ignatchenko, Vincent Fong, and Andrew Emili*

Banting and Best Department of Medical Research, Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada

Tandem mass spectrometry is the prevailing approach for large-scale peptide sequencing in high-throughput proteomic profiling studies. Effective database search engines have been developed to identify peptide sequences from MS/MS fragmentation spectra. Since proteins are polymorphic and subject to post-translational modifications (PTM), however, computational methods for detecting unanticipated variants are also needed to achieve true proteome-wide coverage. In contrast to existing “unrestrictive” search tools, we present a novel algorithm, termed SIMS (for Sequential Interval Motif Search), that interprets pairs of product ion peaks, representing potential amino acid residues or “intervals”, as a means of mapping PTMs or substitutions in a blind database search mode. An effective heuristic software program was likewise developed to evaluate, rank, and filter optimal combinations of relevant intervals to identify candidate sequences, and any associated PTM or polymorphism, from large collections of MS/MS spectra. The prediction performance of SIMS was benchmarked extensively against annotated reference spectral data sets; it compared favorably with, and was complementary to, current state-of-the-art methods. An exhaustive discovery screen using SIMS also revealed thousands of previously overlooked putative PTMs in a compendium of yeast protein complexes and in a proteome-wide map of adult mouse cardiomyocytes.
We demonstrate that SIMS, freely accessible for academic research use, addresses gaps in current proteomic data interpretation pipelines, improving overall detection coverage and facilitating comprehensive investigations of the fundamental multiplicity of the expressed proteome.

* To whom correspondence should be addressed. Tel: (416) 946-7281. Fax: (416) 978-7437. E-mail: [email protected].

10.1021/ac8009017 CCC: $40.75 © 2008 American Chemical Society. Published on Web 09/13/2008.

Analytical Chemistry, Vol. 80, No. 20, October 15, 2008

Biological diversity is enhanced by the accumulation of protein variants. Site-specific post-translational modifications (PTMs), in particular, are ubiquitous and diverse. Yet while the preponderance of sequence polymorphisms among outbred organisms can be confidently estimated by genomic techniques,1,2 the diversity of PTMs, both within and among different proteins, remains uncertain. A hint of the underlying complexity was suggested by exhaustive shotgun sequencing of human lens proteins,3 which produced evidence of extensive oxidation, methylation, and other forms of putative chemical alterations. Large-scale proteomic surveys of soluble proteins isolated from human cell lines have also tentatively mapped thousands of putative sites of phosphorylation, ubiquitination, and acetylation.4–6 Since such alterations likely affect both the biophysical and biochemical properties of the proteome, comprehensive maps of PTMs (and polymorphisms) on a global scale should ultimately lead to a better understanding of the molecular mechanisms that govern the activities of biochemical pathways. The detection of PTMs in a systematic and unbiased manner is therefore the subject of growing research interest.

Tandem mass spectrometry (MS/MS)-based sequencing of protein digests is currently the cornerstone of high-throughput protein characterization. Several configurations of MS instrumentation allow for the isolation and fragmentation of precursor peptide ions to generate informative product ion spectra.7 Instrument parameters are usually chosen such that fragmentation occurs primarily along the peptide amide bond backbone to generate informative b- and y-ion ladders. Subsequent matching of the resulting MS/MS spectra to the theoretical patterns predicted for peptides present in a protein sequence database typically works well as a “bottom-up” approach to protein identification. In principle, stable covalent modifications to specific amino acid residues can also be deduced from spectra as offset product ion series.8 Nevertheless, automated determination of PTMs and polymorphisms in a reliable and comprehensive manner remains challenging.

(1) Chamary, J. V.; Parmley, J. L.; Hurst, L. D. Nat. Rev. Genet. 2006, 7, 98–108.
(2) Calarco, J. A.; Xing, Y.; Cáceres, M.; Calarco, J. P.; Xiao, X.; Pan, Q.; Lee, C.; Preuss, T. M.; Blencowe, B. J. Genes Dev. 2007, 21, 2963–2975.
(3) MacCoss, M. J.; McDonald, W. H.; Saraf, A.; Sadygov, R.; Clark, J. M.; Tasto, J. J.; Gould, K. L.; Wolters, D.; Washburn, M.; Weiss, A.; Clark, J. I.; Yates, J. R. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 7900–7905.
(4) Olsen, J. V.; Blagoev, B.; Gnad, F.; Macek, B.; Kumar, C.; Mortensen, P.; Mann, M. Cell 2006, 127, 635–648.
(5) Beausoleil, S. A.; Jedrychowski, M.; Schwartz, D.; Elias, J. E.; Villén, J.; Li, J.; Cohn, M. A.; Cantley, L. C.; Gygi, S. P. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 12130–12135.
(6) Crosas, B.; Hanna, J.; Kirkpatrick, D. S.; Zhang, D. P.; Tone, Y.; Hathaway, N. A.; Buecker, C.; Leggett, D. S.; Schmidt, M.; King, R. W.; Gygi, S. P.; Finley, D. Cell 2006, 127, 1401–1413.
(7) Nesvizhskii, A. I. Methods Mol. Biol. 2007, 367, 87–119.

Historically, conventional de novo sequence prediction tools, such as PEAKS9 and PepNovo,10 have had very limited capabilities to detect unknown PTMs, as fixed amino acid masses are specified a priori. On the other hand, popular database search programs, such as SEQUEST11 and Mascot,12 are usually run under the assumption that peptides are either chemically unmodified or at most decorated by common, well-studied types of PTMs (e.g., phosphorylation). Moreover, since complexity increases exponentially when considering combinations of PTMs, it is necessary to constrain the search space for such conventional tools. For instance, while Mascot allows a maximum of nine variable modifications, this strategy is discouraged as overly computationally demanding. To remedy this limitation, several search engines13,14 have now adopted a multipass search strategy, wherein one first conducts a basic search to limit the search space, such as by first identifying candidate proteins based on unmodified peptides, and then assigns single modifications to the unassigned spectra. However, the fundamental constraint of prespecifying modifications is not lifted. More flexible ad hoc methods like the SALSA database search algorithm15 also rely on a user’s expertise in peptide fragmentation to interpret MS/MS spectra, which is impractical for unrestrictive searches of large collections of spectra. Consequently, polymorphic variants and PTMs usually remain undetected and unreported in current large-scale proteomic profiling studies.

To tackle this essential problem, a new generation of computational approaches has been developed in the past few years that increasingly enables unrestricted PTM searching. In general, these so-called “blind” search tools can be divided into two basic classes.
In the first group, exemplified by sophisticated algorithms such as ModifiComb16 and Spectral Network,17 PTM detection is performed by systematic spectrum-to-spectrum comparison, such that unassigned spectra bearing a PTM are aligned to annotated versions of the corresponding unmodified peptide, with an incomplete overlap indicating a potential modification site. Despite their conceptual efficiency, such methods suffer from a limited capacity to discover uniquely modified peptides. The second group aims to improve database search procedures to allow for more efficient PTM detection. Such tools can be further subclassified into two major categories. One set predicts PTM occurrence by deducing the optimal alignment between an observed spectrum and a theoretical unmodified sequence representation using peak-shifting graphical procedures. The most highly cited software tool in this category is the InsPecT program,18 which uses dynamic programming for the spectral alignment. This software has been used to document large numbers of putative PTMs in diverse spectral data sets.19,20 The TwinPeak21 search tool likewise identifies peptide modifications by optimizing the cross-correlations between experimental and theoretical spectra. The second class of tools, which includes OpenSea,22 GutenTag,23 and Popitam,24 first attempts to find a subset of residues, or sequence “tags”, by interpreting the major features in the spectrum graphs and then maps these back to their cognate protein sequences, with modifications inferred from any gaps in the observed ion series continuity. Another innovative graph-based tool, PFSM,25 which constructs so-called finite state machines, can also be generalized for blind detection of PTMs, but its success hinges on the proper recognition of the critical ion series.

(8) Cantin, G. T.; Yates, J. R. J. Chromatogr., A 2004, 1053, 7–14.
(9) Ma, B.; Zhang, K.; Hendrie, C.; Liang, C.; Li, M.; Doherty-Kirby, A.; Lajoie, G. Rapid Commun. Mass Spectrom. 2003, 17, 2337–2342.
(10) Frank, A.; Tanner, S.; Bafna, V.; Pevzner, P. A. J. Proteome Res. 2005, 4, 1287–1295.
(11) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
(12) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551–3567.
(13) Creasy, D. M.; Cottrell, J. S. Proteomics 2002, 2, 1426–1434.
(14) Craig, R. B. Rapid Commun. Mass Spectrom. 2003, 17, 2310–2316.
(15) Liebler, D. C.; Hansen, B. T.; Davey, S. W.; Tiscareno, L.; Mason, D. E. Anal. Chem. 2002, 74, 203–210.
(16) Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A. Mol. Cell. Proteomics 2006, 5, 934–948.
(17) Bandeira, N.; Tsur, D.; Frank, A.; Pevzner, P. A. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, 6140–6145.

In this report, we introduce a complementary database search approach, termed Sequential Interval Motif Search (SIMS), which allows for blind modification searches of large collections of MS/MS spectra. Our standalone software is expressly designed for unbiased, proteome-scale surveys of candidate PTMs and polymorphic variants during standard global “bottom-up” shotgun proteomic surveys. Unlike in the aforementioned tools, the principal inference strategy is not based on the mapping of individual peaks or the alignment of contiguous sequence tags. Rather, the search process revolves around the notion of intervals, corresponding to pairs of product ion peaks whose m/z difference is equivalent to an amino acid residue mass (within a predefined mass tolerance).
The PTM mapping problem is therefore formulated as a “stitching” procedure, wherein the stringent selection of intervals that best optimizes the similarity between the sequence and the spectrum reveals both the site and mass of putative PTMs. We have benchmarked the software extensively against two other prevailing search engines (namely, SEQUEST and InsPecT) under a variety of test scenarios. Our results demonstrate that SIMS is both effective and complementary to these search tools in a high-throughput proteomic setting and confirm its suitability as the basis for the reliable systematic detection of PTMs in large collections of MS/MS spectra at an affordable computational cost. We provide unfettered access to this software to facilitate biological discovery by the broader proteomic community.

(18) Tanner, S.; Shu, H.; Frank, A.; Wang, L. C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V. Anal. Chem. 2005, 77, 4626–4639.
(19) Tanner, S.; Pevzner, P. A.; Bafna, V. Nat. Protoc. 2006, 1, 67–72.
(20) Tsur, D.; Tanner, S.; Zandi, E.; Bafna, V.; Pevzner, P. A. Nat. Biotechnol. 2005, 23, 1562–1566.
(21) Havilio, M.; Wool, A. Anal. Chem. 2007, 79, 1362–1368.
(22) Searle, B. C.; Dasari, S.; Willmarth, P. A.; Turner, M.; Reddy, A. P.; David, L. L.; Nagalla, S. R. J. Proteome Res. 2005, 4, 546–554.
(23) Tabb, D. L.; Saraf, A.; Yates, J. R. Anal. Chem. 2003, 75, 6415–6421.
(24) Hernandez, P.; Gras, R.; Frey, J.; Appel, R. D. Proteomics 2003, 3, 870–878.
(25) Falkner, J.; Andrews, P. Bioinformatics 2005, 21, 2177–2184.

METHODS

The SIMS software suite was developed in C++. A schematic overview of the kernel modules is outlined in Figure 1. Briefly, MS/MS spectra are imported in batch mode and processed individually to improve recognition of the primary embedded product ion series. To enhance interpretability, complementary virtual “ghost” peaks are predicted for spectra lacking complete b- or y-ion ladders (see Supporting Information). The program then extracts “intervals”, defined as all pairs of peaks with a difference in m/z equal to an amino acid residue mass (as specified in Sims_config.txt; default mass tolerance preset at 0.8 Da for use with ion trap spectra). Weights are assigned to individual intervals based on the intensities of the associated peaks, followed by ranking and filtering of lower scoring intervals. The default mode is to retain the 400 top-ranked intervals (this number can be varied) for the subsequent database search, which cross-maps significant intervals against candidate peptide sequences imported from a reference protein database. Instead of performing an exhaustive peak alignment, a “preliminary” search is conducted using a more efficient search procedure based on extensive filtering and pruning, with the aim of compiling the optimal subset of adjacent intervals that maximizes concordance (and consequently the preliminary score) with each imported peptide string. Modification (or δ) masses are inserted as needed to compensate for breaks between nonconsecutive intervals. Although restricted to only a single possible occurrence per spectrum, this mode of PTM mapping is constrained only by the user-specified modification mass range (default is [10, 200] Da). Candidate peptides are assigned a preliminary score based on the sum of ion weights of the matched intervals, with the correct sequence and PTM combination intuitively accounting for the most prominent peaks. Finally, at the “approval” stage, the 500 most probable candidates derived from the preliminary search, with or without a putative PTM, are evaluated by a dot-product function to derive the correlation between the observed experimental spectrum and theoretical peak representations of the candidate sequences. These correlation scores determine the final ranking, while any predicted modification mass of the top-ranked peptide is further refined by a second, more precise round of peak alignment. A statistical model is then used to assign a confidence (probability) score to top-ranked peptide identifications.

Figure 1. Schematic diagram illustrating the SIMS-based PTM search process. Key tasks at each sequential stage in the spectra analysis pipeline for interpreting large-scale MS/MS data sets are indicated (Sims_config.txt specifies the parameters of the database search, and simsmod.cfg indicates the allowed modification range for each amino acid).

Finding and Filtering Intervals. Intervals are derived in straightforward fashion by computing pairwise m/z differences among the processed peak lists. To reduce spurious intervals after insertion of ghost peaks (see Supporting Information), interval generation is limited to pairs of true or ghost peaks after peak filtering. Because a mixed peak implies both a real and a ghost peak, these complementary peaks serve to connect real and ghost peaks (i.e., ghost and real peaks can be linked via such peaks). In particular, since the intensities of such peaks are signified (see Supporting Information), the generation of intervals effectively emphasizes complementary peaks. Since it is not uncommon to generate thousands of putative intervals from noisy spectra, most of which are spurious, stringent filtering is necessary to reduce the computational burden while retaining correct and relevant intervals. To this end, SIMS assigns a weight to each interval based on the respective peak intensities as follows. First, each individual peak is given a weight equal to the square root of the observed intensity. The peak list is then segmented along the m/z axis into a series of equal-sized regions of 300 Da, with the peak weights normalized within each region (using a range of [1, 100]). Interval weights are then assigned based on the sum of the associated peak weights. The intervals are then rank sorted, and the 400 top-scoring intervals are retained and indexed for fast retrieval at the subsequent “preliminary” database search phase.
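The interval weighting described above can be illustrated with a short Python sketch (SIMS itself is written in C++, and this is not its code). The text gives the square-root weighting, the 300 Da regions, the [1, 100] range, the 0.8 Da tolerance, and the top-400 cutoff. Min-max scaling as the normalization method, the function names, and the tuple layout are assumptions made for illustration, and ghost peaks are not modeled.

```python
import math
from itertools import combinations

# Standard monoisotopic amino acid residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def peak_weights(peaks, region=300.0, lo=1.0, hi=100.0):
    """Square-root intensity weights, rescaled to [lo, hi] within each m/z region.

    peaks: list of (mz, intensity). Min-max rescaling is an assumption; the
    paper only states that weights are normalized to the range [1, 100].
    """
    raw = [math.sqrt(intensity) for _, intensity in peaks]
    regions = {}
    for idx, (mz, _) in enumerate(peaks):
        regions.setdefault(int(mz // region), []).append(idx)
    weights = [0.0] * len(peaks)
    for members in regions.values():
        rmin = min(raw[i] for i in members)
        rmax = max(raw[i] for i in members)
        for i in members:
            if rmax == rmin:
                weights[i] = hi  # a lone peak (or all equal) gets the top weight
            else:
                weights[i] = lo + (hi - lo) * (raw[i] - rmin) / (rmax - rmin)
    return weights


def find_intervals(peaks, tol=0.8, keep=400):
    """Return up to `keep` intervals as (weight, low_mz, high_mz, residue)."""
    weights = peak_weights(peaks)
    order = sorted(range(len(peaks)), key=lambda i: peaks[i][0])
    intervals = []
    for a, b in combinations(order, 2):  # a always has the lower m/z
        diff = peaks[b][0] - peaks[a][0]
        for residue, mass in RESIDUE_MASS.items():
            if abs(diff - mass) <= tol:
                intervals.append(
                    (weights[a] + weights[b], peaks[a][0], peaks[b][0], residue)
                )
    intervals.sort(key=lambda item: item[0], reverse=True)
    return intervals[:keep]


if __name__ == "__main__":
    demo = [(175.1, 100.0), (246.2, 400.0), (500.0, 50.0)]
    for weight, low, high, residue in find_intervals(demo):
        print(f"{low:.1f} -> {high:.1f}  {residue}  weight={weight:.1f}")
```

At a 0.8 Da tolerance several residues are indistinguishable (L and I are isobaric, and K and Q differ by about 0.04 Da), so a single peak pair can produce more than one interval. The pairwise loop is quadratic in the number of peaks, which is one reason the paper filters peaks before generating intervals.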
After completion of interval scoring, the process of spectrum interpretation is effectively transformed into a problem of finding an optimal path that connects the correct set of appropriate intervals.

Preliminary Database Search. At this stage, SIMS digests in silico (default enzyme is trypsin) the protein sequences imported from the specified reference database and evaluates the resulting peptides against a given spectrum. Starting at each cleavage site of a protein sequence, SIMS conducts the search process in two sequence orientations: first matching theoretical b-ions in an N- to C-terminal manner and then matching y-ions in the reverse direction. The search process can be conceived as finding an optimal matching path corresponding to the true interval set embedded in the peak list. Each node along the path therefore corresponds to an amino acid in the target sequence and is characterized by the cumulative m/z value and the corresponding matching score, which is essentially the weighted sum of matched peaks. Proper PTM mapping therefore relies on the detection of telltale shifts in the interpeak mass values of the (modified) b-/y-ion series. Although users can specify arbitrary modification masses, a biologically realistic perspective should be kept in mind. Without loss of generality, assuming at most one interval per amino acid per node, the candidate matching paths are dynamically established in a “depth first search”26 starting from an artificially inserted “head” seed node (i.e., at 1 and 19 Da for the b- and y-ion series, respectively). At each node in the path, SIMS assesses whether there is an interval corresponding to the current amino acid being read in from the peptide. If present, it advances to the flanking peaks and continues the matching process in an iterative manner. Otherwise, SIMS attempts the following two alternate options in turn: (1) it adds a gap in the path and advances to the next node to continue the path elaboration; (2) alternatively, it determines all peaks falling within the modification range allowable for the particular amino acid and then iteratively imposes tentative PTMs before moving on to the next node. When a gap is added, the path score is increased by a modest increment of 1 to penalize the missing peak. However, since experimental m/z values (particularly in ion trap spectra) are not always accurate, matching the next peak is permitted within the preset tolerance (1 Da) whenever a gap is appended. This cycle continues until the algorithm reaches the end of the peak list and matches the precursor mass within the precursor mass tolerance. A match score is then returned for each peptide, and only the 500 highest-scoring candidates are kept for the subsequent approval stage.

(26) Cormen, T. H.; Leiserson, C. E.; Rivest, R. L.; Stein, C. Introduction to Algorithms, 2nd ed.; MIT Press: Cambridge, MA, 2001.

To improve efficiency, the search process is implemented in an iterative manner, rather than by recursion. Other measures have also been taken to enhance the speed. First, when initiating a search for a particular peptide, the widest mass range allowable for a possible PTM can be determined in advance, as the modification range for each amino acid is known. Offending peptides are removed from further consideration if the actual precursor mass falls outside this allowable range. Similarly, if addition of a PTM results in the peptide undershooting or overshooting the parent ion mass, the search aborts. Pruning is also conducted when the number of gaps along the path exceeds a predefined threshold.
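The precursor-mass prefilter mentioned above lends itself to a compact illustration. The following Python sketch is not the SIMS implementation: the function names and the `mod_range` dictionary (standing in for the per-residue ranges that simsmod.cfg is said to hold) are hypothetical, and applying the 3.0 Da precursor tolerance to both edges of the window is an assumption.

```python
# Standard monoisotopic masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def widest_mod_window(seq, mod_range):
    """Widest (low, high) modification mass allowed by any residue in seq.

    mod_range maps residue -> (low, high) in Da; residues absent from the
    mapping are treated as unmodifiable. Returns None if nothing is allowed.
    """
    windows = [mod_range[aa] for aa in seq if aa in mod_range]
    if not windows:
        return None
    return min(w[0] for w in windows), max(w[1] for w in windows)


def passes_prefilter(seq, precursor_mz, charge, mod_range, tol=3.0):
    """True if the peptide could explain the precursor, bare or with one PTM."""
    neutral = (precursor_mz - PROTON) * charge
    delta = neutral - peptide_mass(seq)
    if abs(delta) <= tol:  # unmodified peptide already fits
        return True
    window = widest_mod_window(seq, mod_range)
    if window is None:
        return False
    return window[0] - tol <= delta <= window[1] + tol


if __name__ == "__main__":
    mods = {aa: (10.0, 200.0) for aa in RESIDUE_MASS}  # paper's default range
    mz = (peptide_mass("PEPTIDE") + 79.966) / 2 + PROTON  # phospho-sized shift
    print(passes_prefilter("PEPTIDE", mz, 2, mods))
```

Because the check needs only the peptide mass and a precomputed window, it can be run before any interval matching, which is why it is cheap relative to the path search it avoids.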
When one PTM is allowed, in the worst-case scenario, SIMS attempts to place the modification at each amino acid in turn, aligning the rest of the amino acid residues. Therefore, the computational complexity for a given peptide candidate is O(n^2), where n is the sequence length.

Computing Spectral Pairwise Similarity. To finalize the ranking of short-listed candidates from the preliminary search, SIMS adopts a pairwise spectral similarity metric between the experimental spectrum and each of the top 500 peptides. This statistical measure27 has proven effective for singling out true positive matches when the candidate pool is reasonably small and has been adopted by other search engines such as SEQUEST.11 While more sophisticated kinetic models have been developed to predict collision-induced dissociation spectra for use in MS/MS assignments,28,29 due to lack of access, we opted for a simplified model for generating theoretical fragmentation patterns. For each top-ranked candidate, either bearing or devoid of a putative PTM, hypothetical singly charged b- and y-ion series (including the corresponding native isotopic peaks) are predicted. This theoretical peak list is vectorized by partitioning the m/z axis into a sequence of consecutive small bins of the same size (i.e., 1 Da by default). The experimental spectrum is then binned in the same manner, and the correlation is computed based on the normalized inner product of the two vectors. As it is assumed that assignment of a putative PTM to the true modification site results in a maximal correlation score, SIMS further computes the similarity with the modification at all possible sites along a peptide sequence. Finally, the peptide matches (with or without a predicted PTM) are ranked by their corresponding correlation scores.

(27) Frewen, B. E.; Merrihew, G. E.; Noble, W. S.; MacCoss, M. J. Anal. Chem. 2006, 78, 5678–5684.
(28) Zhang, Z. Anal. Chem. 2004, 76, 3908–3922.
(29) Sun, S.; Meyer-Arendt, K.; Eichelberger, B.; Brown, R.; Yen, C. Y.; Old, W. M.; Pierce, K.; Cios, K. J.; Ahn, N. G.; Resing, K. A. Mol. Cell. Proteomics, in press.

Tuning the Modification Mass. As the discrete binning process used when computing spectral similarity is not very sensitive to mass drifts, any modification mass determined initially is subject to measurement error. To improve accuracy, the modification mass is further refined by a more precise alignment of the theoretical peaks to the experimental spectrum, with an m/z alignment error calculated for each pair of matched cognate peaks. Since error in the preliminary modification mass should affect all modified ions, the experimentally observed peaks are subgrouped depending on whether the corresponding ions bear the putative modification or not. Since these two sets always contain the same number of b-/y-ions, regardless of the modification locus, the average errors in these two groups should be similar if the predicted modification mass is accurate. The original modification mass prediction is thus adjusted by small increments to minimize the difference between the two groups’ average errors.

Assessing Discriminating Scores for PTM Identification. As proteomic studies typically generate large numbers of MS/MS spectra, a discriminating measure is necessary for gauging the likelihood of a particular search result, as was developed for PTMfinder.30 One pragmatic approach is to search an equivalent number of reference proteins and reversed “decoy” sequences under the assumption that if the search process fails to recognize the correct peptide/PTM combination, there is an equal probability of the resulting false positive match being either a reference or a decoy sequence.
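The equal-probability assumption behind the reversed decoy search can be made concrete with a small Python sketch. This is a one-dimensional simplification for illustration and not the multidimensional model used by SIMS. The `REV_` prefix, the function names, and the `(score, is_decoy)` layout are hypothetical.

```python
def reverse_decoys(proteins):
    """Build reversed decoy entries for a {name: sequence} protein database."""
    return {"REV_" + name: seq[::-1] for name, seq in proteins.items()}


def estimate_fdr(matches, cutoff):
    """Estimate the false discovery rate among forward matches at a score cutoff.

    matches: list of (score, is_decoy). If a wrong match is equally likely to
    hit a forward or a reversed sequence, the number of decoy hits at or above
    the cutoff approximates the number of false forward hits there.
    """
    targets = sum(1 for score, is_decoy in matches if score >= cutoff and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= cutoff and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


if __name__ == "__main__":
    database = {"P1": "MKTAYR"}
    database.update(reverse_decoys(database))  # search forward and reversed together
    scored = [(0.99, False), (0.98, False), (0.97, False), (0.96, True), (0.50, True)]
    print(sorted(database), estimate_fdr(scored, 0.95))
```

Searching the forward and reversed sequences as one concatenated database keeps the two pools the same size, which is what allows the decoy count to stand in for the number of false forward matches.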
Based on such an assumption, a multidimensional statistical model was developed under the same principles as a previous study of unmodified peptide matches by our group,31 wherein each top-ranked peptide (both forward and reversed matches) is mapped to a data point in a hyperspace using a set of SIMS identification features. This space is then partitioned into a multidimensional array of smaller hypercubes, wherein the estimated probability of true positives for a hypercube is determined based on the ratio of forward to reversed matches. This measure was adopted as the discriminating score and was used to estimate the false discovery rate (FDR) for a collection of tentative identifications.

(30) Tanner, S.; Payne, S. H.; Dasari, S.; Shen, Z.; Willmarth, P. A.; David, L. L.; Loomis, W. F.; Briggs, S. P.; Bafna, V. J. Proteome Res. 2008, 7, 170–181.
(31) Kislinger, T.; Cox, B.; Kannan, A.; Chung, A.; Hu, P.; Ignatchenko, A.; Scott, M. S.; Gramolini, A. O.; Morris, Q.; Hallett, M. T.; Rossant, J.; Hughes, T. R.; Frey, B.; Emili, A. Cell 2006, 125, 173–186.

EXPERIMENTAL RESULTS

SIMS was implemented as standalone software for both the Windows and Linux operating systems. It was optimized for blind database searching of MS/MS spectra for peptides bearing a single modification. As triply charged precursor ions usually generate complicated fragmentation patterns, SIMS restricts the search to singly or doubly charged precursors. Throughout the proof-of-concept experiments outlined below, the following default settings were used: mass tolerances for precursor and product ion peak matching of 3.0 and 0.8 Da,


Figure 2. Benchmarking of SIMS-based PTM identification accuracy. Shown are test results in terms of modification site precision (left panels) and predicted modification mass (right panels) over two sets of reference spectra. The histograms indicate the consistency of SIMS predictions with (a) Ascore-based phosphopeptide5 annotations and (b) InsPecT-based detection of putatively modified human lens proteins.20 SIMS generally returned the same exact sequences/PTM loci combination, with only a small residual discrepancy in terms of δ mass (typically [...]

Figure 3. [...] >0.95) PTM predictions detected for ∼1600 subunits of affinity-purified yeast protein complexes.33 The pattern spans the entire modification mass range [0, 200] Da, with local trends. (b) Distribution of 307 570 putative PTM instances, mapping to ∼1800 proteins across a broad [-200, 500] Da mass range, identified by global shotgun profiling of the adult mouse heart proteome.35 Most of the candidate modifications were enriched in a narrow [20, 150] Da mass range.

over 2 million ion trap MS/MS spectra produced during a comprehensive proteomic survey of affinity purified endogenous yeast protein complexes.33 A subset (∼4.5%) of these spectra had previously been mapped with high confidence to 4087 different polypeptides components of 547 putative multiprotein complexes using SEQUEST. In this current work, we research the remaining spectra using SIMS to look for evidence of overlooked PTMs, using a modification mass range setting of [0, 200] Da. Using a discriminating score cutoff of >0.95, producing an estimated FDR of 4.3%, this relatively unrestrictive PTM search unveiled one or more high-confidence putative modifications on ∼1600 distinct proteins, or roughly one-third of the expressed yeast proteome, corresponding to 117 distinct complexes (i.e., with at least one subunit modified per complex). Although many of these modifications represented common in vitro artifacts, such as carboxyamidomethylated cysteine, oxidized methionine, and sodium/potassium salt adducts, the predicted PTM patterns were very diverse overall (see Figure 3a), spanning the full allowable δ mass range. Supporting Information Table S-3 lists the 50 most frequent variants detected, and their apparent amino acid preferences. Putative site-specific phosphorylation events were one of the most frequent PTMs and were preferentially associated with the amino acids serine and threonine (see Supporting Information (33) Krogan, N. J.; Gagney, G.; Yu, H.; Zhong, G.; et al. Nature 2006, 440, 637–643.

Figure S-2). The result was consistent with our biological expectation that tyrosine phosphorylation is exceptionally rare in yeast, with very few known biological targets. Only 40% of these identified sequences were reported in a previous large-scale yeast phosphopeptide profiling study34 (Supporting Information Table S-4 contains the complete list of these phosphorpeptides). These results indicate that SIMS is capable of detecting plausible phosphopeptide variants and localizing the altered amino acid fairly accurately. Closer examination of the remaining assignments indicated the predicted residues were typically immediately adjacent to serine or threonine residues (see Supporting Information Figure S-3). Presumably the modifications indeed occurred at the serine/threonine sites, with the subtle shifts caused by the inherent inaccuracy of PTM site localization based on noisy, low-resolution ion trap spectra. Detecting Variants in a Global Survey of the Mouse Heart Proteome. We next applied SIMS to a giant data set of 16 million MS/MS ion trap spectra obtained in a recent exhaustive proteomic survey of adult mouse heart tissue,35 of which a subset (5%) were previously mapped with high confidence to ∼6200 proteins using SEQUEST. The blind PTM search was performed over a wide mass range [-200, 500] Da to detect potential PTMs. However, (34) Li, X.; Berber, S. A.; Rudner, A. D.; Beausoleil, S. A.; Haas, W.; Ville´n, J.; Elias, J. E.; Gygi, S. P. J. Proteome Res. 2007, 6, 1190–1197. (35) Gramolini, A. O.; Kislinger, T.; Alikhani-Koopaei, R.; Fong, V.; et al. Mol. Cell. Proteomics 2008, 7, 519–533.

Analytical Chemistry, Vol. 80, No. 20, October 15, 2008

7851

Figure 4. Empirically derived search confidence scores. Histogram showing the numbers of PTM candidates detected by SIMS analysis of the mouse heart proteome data set35 above various arbitrary discriminating threshold scores. The vast majority of decoy matches (inverted peptide sequences) are typically associated with low discriminating scores, giving a good separation and low false discovery rate estimate at higher score cutoffs.

given the scope of this study, we first inferred peptide sequence tags to accelerate the analysis by constraining the search space. Since sequence tags have been extensively studied, we did not develop a new module for this purpose. Rather, we used the wellestablished PepNovo10 generator to produce candidate sequence tags consisting of three consecutive amino acid residues, specifying a conservative probability cutoff of at least 5% per tag (i.e., the likelihood that the true peptide does not actually contain the tag is deemed less than 5%). Figure 4 shows the cumulative number of spectra putatively bearing modifications predicted by SIMS above various confidence thresholds. Our statistical model achieved good overall discrimination, as the fraction of decoy (reversed) peptide matches tended to have low scores whereas markedly higher scores were mostly associated with native (forward) peptide candidates. A stringently filtered subset of 307 570 high-confidence (with an estimated FDR of 4.7%) putatively modified peptides was produced (Figure 3b), revealing variants for roughly one-third (∼1800 proteins) of the expressed mouse heart proteome. Over half of these candidates (∼1000) had evidence for at least one alternate form of modification (Supporting Information Figure S-4 depicts the histogram of PTM multiplicity). The δ mass distribution (Figure 3a) was generally consistent with that observed with the yeast complexes, implying that SIMS has no implicit bias when defining putative modification masses during the search process, but the actual PTM patterns were markedly different. Supporting Information Table S-5 lists the top 50 most common modifications 7852

Analytical Chemistry, Vol. 80, No. 20, October 15, 2008

found in heart, which again include common chemical artifacts but also many plausible biological instances. We further compared SIMS performance against InsPecT-based PTM predictions over a large subset of randomly selected heart spectra (see Supporting Information). While the two algorithms overlapped substantially in their tentative PTM identifications (Supporting Information Figure S-5), these benchmarking results also confirmed that SIMS is able to discover many unique, high-confidence matches in a manner complementary to InsPecT.

DISCUSSION

One major goal of the emerging discipline of systems biology is a more complete description of the activity states of critical components of key cellular pathways and processes. Since PTMs and polymorphisms are often functionally significant, we developed the SIMS software to facilitate broader systematic investigations of the global molecular diversity of cells and tissues, adding another perspective to studies of the interactome33,36 and expressed proteome.37

(36) Gavin, A. C.; Aloy, P.; Grandi, P.; Krause, P.; et al. Nature 2006, 440, 631–636.
(37) Kislinger, T.; Rahman, K.; Radulovic, D.; Cox, B.; Rossant, J.; Emili, A. Mol. Cell. Proteomics 2003, 2, 96–106.

As the search engine’s name implies, SIMS relies on finding an informative and meaningful set of “intervals” in the processed MS/MS spectra. Proper selection and scoring of appropriate intervals is therefore a key issue for achieving effective database matching. The detection of gaps in the b- and y-ion fragment series is also exploited to indicate potential variants. Though thousands of candidate intervals can be derived from a noisy MS/MS spectrum, we have found that relevant intervals representing the genuine b-/y-ion ladders are usually assigned high rankings after normalization of the corresponding peak weights (as shown in Supporting Information Figures S-7 and S-8). Thus, proper data processing facilitates the elucidation of biologically meaningful patterns.

Since the ion ladders of ion trap and other spectra are usually incomplete, we explicitly insert “ghost” peaks to reduce the number of obfuscating gaps due simply to missing information rather than the presence of a PTM (see Supporting Information). While the complementarity of the major product ion series has been widely exploited in MS/MS spectral analysis, for instance, to derive more accurate precursor mass estimates,38 to the best of our knowledge, no other existing search engine inserts virtual peaks into the original spectra as a preprocessing step. To validate the usefulness of these additions, we evaluated SIMS performance with this feature inactivated and the corresponding interval set size reduced by half to make a fair comparison. Accuracy dropped by 5.3% for the phosphopeptide data sets described in the Experimental Results section (see also Supporting Information Table S-1). Moreover, with the mouse heart spectra, the SIMS overlap with InsPecT/SEQUEST fell by 6.3% (shown graphically in Supporting Information Figure S-5 and summarized in detail in Table S-6), further confirming the utility of this functionality. Though limited, such an improvement demonstrates that the inclusion of ghost peaks can enhance prediction performance. Additional investigations were performed on the phosphopeptide and human lens data sets to assess the impact of interval set size thresholds.
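The two preprocessing ideas discussed here, complementary “ghost” peaks and a capped set of ranked intervals, can be sketched as follows. This is an illustrative approximation only, since the actual SIMS procedure is described in the Supporting Information. It assumes singly charged fragments, for which the m/z values of complementary b- and y-ions sum to the neutral peptide mass plus two proton masses. The function names, the reduced ghost weight (`ghost_weight`), the product-of-weights interval score, and the abbreviated residue table are all hypothetical choices.

```python
from itertools import combinations

PROTON = 1.007276  # proton mass in Da

# Monoisotopic residue masses (Da) for a subset of amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "D": 115.02694,
    "K": 128.09496, "E": 129.04259, "F": 147.06841, "R": 156.10111,
}


def add_ghost_peaks(peaks, neutral_mass, tol=0.5, ghost_weight=0.5):
    """Add a complementary peak for every (mz, weight) peak that lacks one.

    Assumes 1+ fragments: mz_b + mz_y = neutral_mass + 2 * PROTON.
    Ghost peaks receive a reduced weight (an assumed down-weighting).
    """
    out = list(peaks)
    for mz, weight in peaks:
        comp = neutral_mass + 2 * PROTON - mz
        if comp <= 0:
            continue
        # Skip if a peak (observed or ghost) already sits at this m/z.
        if all(abs(comp - other) > tol for other, _ in out):
            out.append((comp, weight * ghost_weight))
    return sorted(out)


def candidate_intervals(peaks, tol=0.5, max_intervals=400):
    """Return up to max_intervals (score, low_mz, high_mz, residue), best first.

    An interval is a pair of peaks whose m/z gap matches a residue mass.
    """
    found = []
    for (mz1, w1), (mz2, w2) in combinations(sorted(peaks), 2):
        gap = mz2 - mz1
        for residue, mass in RESIDUE_MASS.items():
            if abs(gap - mass) <= tol:
                found.append((w1 * w2, mz1, mz2, residue))
    found.sort(reverse=True)  # highest-scoring intervals first
    return found[:max_intervals]
```

For a toy spectrum holding only the b1 and b2 ions of the peptide GAK, `add_ghost_peaks` restores the matching y-ions, so `candidate_intervals` finds the alanine interval in both ion series rather than in one. The `max_intervals` argument plays the role of the interval set size threshold.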
After we relaxed the limit from the default (400) to 1000 or more candidate intervals, accuracy improved only slightly (i.e., by less than 5%), whereas the overall search time nearly doubled (Supporting Information Table S-7), a major consideration when dealing with large-scale profiling data sets. Therefore, the empirical default setting provides a reasonable tradeoff between accuracy and computational cost. Nevertheless, even in default mode, the SIMS search speed is typically slower than that of InsPecT, though its theoretical computational complexity is lower. This difference is mainly caused by implementation-related factors. For instance, InsPecT has a built-in sequence tag generator for filtering peptides. To speed up the process, SIMS can likewise import sequence tags from programs like PepNovo. Using such tags, search time can be reduced by 30% or more, whereas PTM detection sensitivity remains largely unaffected (Supporting Information Figure S-5).

We have established that the SIMS correlation analysis improves considerably upon the initial search rankings. According to a recent study,39 computing a correlation score using only the observed b-/y-ions present in the peak lists can potentially provide a more sensitive metric for assessing pairwise spectral similarity. To test this in practice, we developed a version of SIMS that computes a correlation score based on the actual b-/y-ion series. PTM mapping performance dropped substantially, however, in comparison with the standard dot-product assessment mode. The fundamental cause appears to be that false positive matches with high correlation become more likely when certain significant peaks, mostly true b-/y-peaks, are ignored by the correlation scoring function.

We present the SIMS suite as an open source toolkit to the proteomics community in hopes of complementing existing tools and soliciting additional functionality. Both the source code and executables for Windows and Linux platforms are freely available under the general GNU license schema via http://emililab.med.utoronto.ca/. For example, in order to further optimize SIMS performance, the development of a plug-in module incorporating ad hoc models of precursor ion fragmentation patterns should be beneficial, since it is well-known that certain PTMs markedly affect peptide dissociation.40-42 For instance, the spectra of phosphopeptides are often characterized by a pronounced neutral loss ion peak, which is exploited by certain MS/MS analysis tools.43 Improvements that make the software suite computationally more efficient are also encouraged.

(38) Venable, J. D.; Xu, T.; Cociorva, D.; Yates, J. R. Anal. Chem. 2006, 78, 1921–1929.
(39) Fälth, M.; Svensson, M.; Nilsson, A.; Sköld, K.; Fenyö, D.; Andren, P. E. J. Proteome Res. 2008, 7, 3049–3053.

CONCLUSIONS

Our extensive benchmarking of SIMS, and its rigorous application to real-life yeast and mouse proteome studies, demonstrate the unique capabilities of this search engine for reliably detecting protein variants across a wide range of possible modification masses. Once SIMS detects the correct peptide sequence, it generally assigns the modification to the proper (or nearby) position with reasonably good δ mass precision.
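For readers unfamiliar with the δ mass, it is the difference between the measured neutral precursor mass and the calculated mass of the unmodified candidate peptide. The sketch below shows the arithmetic and a simple lookup against a few well-known modification masses. It is an illustration under stated assumptions rather than the SIMS implementation: the function names, the 0.5 Da tolerance, and the short modification table are hypothetical choices.

```python
WATER = 18.010565  # monoisotopic mass of H2O in Da

# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# A few common modification mass shifts (Da); illustrative, not exhaustive.
KNOWN_DELTAS = {
    "methylation": 14.01565,
    "oxidation": 15.99491,
    "acetylation": 42.01057,
    "phosphorylation": 79.96633,
}


def peptide_mass(sequence: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def delta_mass(observed_neutral_mass: float, sequence: str) -> float:
    """Observed precursor mass minus the unmodified peptide mass."""
    return observed_neutral_mass - peptide_mass(sequence)


def annotate_delta(delta: float, tol: float = 0.5) -> list:
    """Names of known modifications whose mass lies within tol of delta."""
    return [name for name, mass in KNOWN_DELTAS.items()
            if abs(delta - mass) <= tol]
```

A precursor that is about 79.97 Da heavier than the candidate sequence would thus be flagged as a possible phosphorylation, whereas an unrestricted search would also retain δ masses that match no entry in such a table.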
(40) Ghesquière, B.; Damme, J. V.; Martens, L.; Vandekerckhove, J.; Gevaert, K. J. Proteome Res. 2006, 5, 2438–2447.
(41) Dongré, A. R.; Jones, J. L.; Somogyi, A.; Wysocki, V. H. J. Am. Chem. Soc. 1996, 118, 8365–8374.
(42) Leitner, A.; Foettinger, A.; Linder, W. J. Mass Spectrom. 2007, 42, 950–959.
(43) Lehmann, W. D.; Krüger, R.; Salek, M.; Huang, C. J. Proteome Res. 2007, 6, 2866–2873.
(44) Ahn, N. G.; Shabb, J. B.; Old, W. M.; Resing, K. A. ACS Chem. Biol. 2007, 2, 39–52.

Yet despite its general efficacy, only a small fraction of the total acquired spectra are typically interpreted with high confidence by SIMS, or indeed any other existing search algorithm, presumably due in part to potentially intractable issues such as suboptimal stochastic ion fragmentation patterns44 and precursor ion multiplexing, which pose a major ongoing conundrum for the field. While we believe that SIMS is complementary to existing tools and search strategies and so will help to chart the remarkable complexity of the proteome in the postgenomic era, the general lack of convergence with well-established search engines means that no single software tool will establish itself as the de facto standard. Indeed, a trend sees proteomics laboratories choosing to combine the identifications made by multiple engines to increase both detection coverage and confidence.45 In practice, a meta-search strategy coupling SIMS with other extant software, such as SEQUEST, InsPecT, or newer, innovative tools, should facilitate proteomic discoveries in years to come.
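One simple way to realize such a meta-search is sketched below. It is a hypothetical illustration, not a published protocol: each engine is assumed to report at most one peptide per spectrum, the union of all reports provides coverage, and the number of agreeing engines serves as a crude confidence indicator. The function name and the input layout are assumptions.

```python
from collections import Counter, defaultdict


def combine_identifications(results):
    """Merge per-engine identifications by simple voting.

    results: {engine_name: {spectrum_id: peptide}}
    returns: {spectrum_id: (peptide, n_votes, n_engines_reporting)}
    Ties are resolved in favor of the peptide reported first.
    """
    votes = defaultdict(Counter)
    for ids in results.values():
        for spectrum_id, peptide in ids.items():
            votes[spectrum_id][peptide] += 1

    combined = {}
    for spectrum_id, counter in votes.items():
        peptide, n_votes = counter.most_common(1)[0]
        combined[spectrum_id] = (peptide, n_votes, sum(counter.values()))
    return combined
```

Spectra identified by a single engine extend coverage, while spectra for which two or more engines agree on the same peptide can be given higher priority for follow-up.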

ACKNOWLEDGMENT This study was supported by research grants from the Ontario Genomics Institute and Genome Canada, the McLaughlin Centre for Molecular Medicine, and the Heart & Stroke/Richard Lewar Centre of Excellence to A.E. The authors thank laboratory members for critical suggestions and access to experimental spectra, and Dr. S. Gygi (Harvard University) and Dr. P. Pevzner (University of California, San Diego) for generously providing annotated spectra.

SUPPORTING INFORMATION AVAILABLE Additional information as noted in text. This material is available free of charge via the Internet at http://pubs.acs.org.

(45) Resing, K. A.; Meyer-Arendt, K.; Mendoza, A. M.; Aveline-Wolf, L. D.; Jonscher, K. R.; Pierce, K. G.; Old, W. M.; Cheung, H. T.; Russell, S.; Wattawa, J. L.; Goehle, G. R.; Knight, R. D.; Ahn, N. G. Anal. Chem. 2004, 76, 3556–3568.

Received for review May 1, 2008. Accepted August 12, 2008.



AC8009017