Implicit Biology in Peptide Spectral Libraries - Analytical Chemistry

Article pubs.acs.org/ac

Implicit Biology in Peptide Spectral Libraries

Manor Askenazi* and Michal Linial

Department of Biological Chemistry, Hebrew University of Jerusalem, Jerusalem 91904, Israel

Supporting Information

ABSTRACT: Mass spectral libraries are collections of mass spectra curated specifically to facilitate the identification of small molecules, metabolites, and short peptides. One of the most comprehensive peptide spectral libraries is curated by NIST and contains upward of half a million annotated spectra dominated by human and model organisms including budding yeast and mouse. While motivated primarily by the technological goal of increasing sensitivity and specificity in spectral identification, we have found that the NIST spectral library constitutes a surprisingly rich source of biological knowledge. In this Article, we show that data-mining of these published libraries while applying strict empirical thresholds yields many characteristics of protein biology. In particular, we demonstrate that the size and increasingly comprehensive nature of these libraries, generated from whole-proteome digests, enables inference from the presence but crucially also from the absence of spectra for individual peptides. We illustrate implicit biological trends that lead to significant absence of spectra accounted for by complex post-translational modifications and overlooked proteolytic sites. We conclude that many subtle biological signatures such as genetic variants, regulated proteolysis, and post-translational modifications are exposed through the systematic mining of spectral collections originally compiled as general-purpose, technology-oriented resources.

Spectral libraries have played a central role in mass spectrometry (MS) since the earliest days of GC/MS, where they have always provided the primary means by which compounds are identified.1 The absence of a comprehensive theory for predicting fragmentation patterns from chemical structure has driven mass spectrometry to rely on curated libraries of measurements from pure (standard) samples. When an unknown sample needs to be identified, e.g., in a toxicological or forensic scenario, spectra are generated from the sample and compared to existing spectra in the library.2 Once a sufficiently similar match is found, an identity is reported.

Until recently, the field of MS-based proteomics constituted an exception to this rule. Proteins are typically digested using specific enzymes (e.g., trypsin), yielding short peptides that tend to fragment in a predictable pattern along the peptide's backbone. This property, along with the completion of genome sequences for the major model organisms, has provided the basis for the development of algorithms (e.g., SEQUEST, Mascot, X!Tandem, Myrimatch, and OMSSA) that can effectively assign peptides to spectra without the need for pure standards.3 The algorithms all function by predicting the main peaks due to backbone fragmentation and comparing the predicted peaks with the observed ones using ad hoc or probabilistic scoring methods. It is important to note that, while the fragmentation pattern of peptides is predictable with respect to the m/z axis (i.e., which masses to expect), there is currently no definitive mechanism for predicting the relative intensity of each resulting fragment (i.e., the Y-axis of the spectrum, corresponding to the ion count). Peptide libraries effectively solve this problem empirically and for this reason show increased sensitivity compared to protein database searching algorithms (for previously observed peptides). The combination of this increase in sensitivity with higher identification speeds has driven a strong increase in the adoption of peptide libraries in proteomics.4

In the absence of large-scale collections of synthesized peptide standards, the curators of spectral libraries generate high-quality peptide spectral libraries by collecting the raw spectral data for as many experiments as possible. These data files are then run through a uniform, general-purpose identification pipeline. This process generates peptide-spectral matches (PSMs) on a comparable quality scale that can then be appropriately combined to generate consensus PSMs of higher aggregate quality. These consensus PSMs constitute the core of the spectral library.5 Spectral libraries vary in the level of disclosure regarding the number and source of the spectra that make up the consensus spectrum: the NIST libraries are particularly helpful in that they provide both the number of individual spectra used to generate every consensus PSM and the experimental data sets from which these spectra were acquired. This information is provided in a convenient and easily parsed text-based format, and it is this primary information that was mined in our study. We will show that,

© 2012 American Chemical Society

Received: June 17, 2012 Accepted: August 21, 2012 Published: August 21, 2012

dx.doi.org/10.1021/ac301674y | Anal. Chem. 2012, 84, 7919−7925


Figure 1. Statistically significant correlation between a protein’s number of molecules per cell and its maximal unique peptide scan count in the NIST spectral library.
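The comparison summarized in Figure 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function names, the input dictionaries, and the choice to correlate raw (untransformed) values are all assumptions, since the text does not state whether a transformation was applied.

```python
from math import sqrt


def max_unique_scan_count(peptide_scans, peptide_to_proteins):
    """Return {protein: scan count of its most-observed unique peptide}.

    peptide_scans: {peptide: number of scans in the library}
    peptide_to_proteins: {peptide: set of proteins containing the peptide}
    (both layouts are hypothetical, chosen for illustration)
    """
    best = {}
    for peptide, scans in peptide_scans.items():
        proteins = peptide_to_proteins.get(peptide, set())
        if len(proteins) != 1:  # skip shared or unmapped peptides
            continue
        (protein,) = proteins
        best[protein] = max(best.get(protein, 0), scans)
    return best


def paired_nonzero(a, b):
    """Keep only keys with nonzero values in both measurements."""
    keys = [k for k in a if a[k] > 0 and b.get(k, 0) > 0]
    return [a[k] for k in keys], [b[k] for k in keys]


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

Restricting each protein to a single peptide means the proxy cannot grow simply because a longer protein yields more peptides.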

despite the technologically oriented nature of the NIST library, the richness of the primary information yields interesting biological observations. We benefit in particular from the diversity and scale of the library, in that it aggregates results from a large range of experimental conditions and yields a high overall coverage. For example, the yeast library achieves a coverage of 27% at the sequence level and 93% at the protein level. Clearly, such coverage, based on the compilation of many experiments, is not achievable by any individual experiment, however exhaustive (e.g., a median of 23% coverage on 63% of yeast proteins6).

MATERIALS AND METHODS

Data Sets. The analyses described in this manuscript are based on publicly available data sets from multiple sources. The yeast and human ion trap peptide libraries were downloaded from NIST (http://peptide.nist.gov/), the specific files being 2011_05_24_yeast_consensus_final_true_lib.tar.gz and 2011_05_26_human_consensus_final_true_lib.tar.gz. To maximize the biological relevance of spectral counts derived from the libraries, we excluded counts from data sets marked as “isb_synthpeptides”, as these samples are synthetic. For each entry in the NIST MSP files, the following data elements were extracted: (1) the peptide sequence as represented in the “Name” entry and (2) the list of samples and associated total spectral counts as represented by the “Sample” entry in the “Comments”. The Saccharomyces cerevisiae strain S288C proteome was downloaded from the Saccharomyces Genome Database (http://downloads.yeastgenome.org/sequence/S288C_reference/orf_protein/orf_trans.fasta.gz) in March 2012. The human proteome was downloaded in March 2012 from UniProt7 (http://www.uniprot.org/uniprot/) by querying for Homo sapiens entries, excluding sequences annotated as “Fragment”. To avoid low-quality sequences, we included only fully curated sequences (referred to by UniProtKB as “reviewed”).

Standard Peptides. Purely tryptic peptides were generated for every protein, keeping only those with a mass of 500−3000 Da and a minimum length of 6 amino acids. Peptides containing underspecified amino acids were excluded (e.g., “B” for aspartic acid or asparagine). We refer to the resulting peptides for each protein as the protein’s “standard peptides”. For each standard peptide, we record the number of scans used to generate its consensus PSM in the NIST library. We also define the Neighboring Scan Counts score (NSC) as the average number of scans for the two adjacent standard peptides (the N- and C-terminal neighbors). There are instances in which only one neighbor is defined (i.e., when the peptide in question is itself the most N- or C-terminal standard peptide).

N-Terminal Analysis. Each N-terminus was generated with and without the initiation methionine. Only the most abundant form was kept in the subsequent analyses. The exception is the N-terminal analysis itself, in which the difference in the abundance of the two forms is reported.
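The extraction of peptide sequences and per-sample counts described under Data Sets can be sketched as below. The field layout is an assumption made for illustration (a `Name: PEPTIDE/charge` line, and a comment line containing `Sample=<n>/<dataset>,<count>,.../<dataset>,<count>,...`); it should be checked against the actual NIST files before use. The function names are hypothetical.

```python
import re

# Data sets whose names start with this prefix are synthetic and are excluded,
# following the text; matching by prefix is an assumption.
SYNTHETIC_PREFIX = "isb_synthpeptides"


def parse_msp(lines):
    """Yield (peptide, {dataset: count}) for each library entry.

    Assumed layout (not verified against the NIST specification):
      Name: PEPTIDE/charge
      Comment: ... Sample=<n>/<dataset>,<count>,.../<dataset>,<count>,... ...
    """
    peptide = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("Name:"):
            peptide = line.split(":", 1)[1].strip().split("/")[0]
        elif line.startswith("Comment") and peptide is not None:
            samples = {}
            match = re.search(r"\bSample=(\S+)", line)
            if match:
                # The first "/"-separated field is the number of data sets.
                for field in match.group(1).split("/")[1:]:
                    parts = field.split(",")
                    if len(parts) >= 2 and parts[1].isdigit():
                        samples[parts[0]] = int(parts[1])
            yield peptide, samples
            peptide = None


def biological_scan_count(samples):
    """Sum counts over data sets, skipping the synthetic-peptide data sets."""
    return sum(count for name, count in samples.items()
               if not name.startswith(SYNTHETIC_PREFIX))
```

In use, the generator would be fed an open MSP file handle, and `biological_scan_count` applied to each yielded dictionary.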


typically use more complex scan-counting transformations,10,11 and yet, even under this extreme constraint of only one peptide per protein, a significant correlation between scan number in the NIST library and the number of molecules per cell is observed. The number of molecules per cell is estimated from tandem affinity purification (TAP-IP). The correlation for data points having nonzero values by both measurement techniques is high (R = 0.62, p-value < 1.0 × 10^−16) and compares well with individual experimental results (e.g., R = 0.58 using total spectral count per protein in a very large yeast fractionation study12).

“A Tale of Two Termini”: Mining the Methionine Aminopeptidase (MAP) Activity. Having established that spectral counts in the NIST library can be used to approximate protein quantity, we proceeded to investigate a robust, proteome-wide biological phenomenon, namely, the processing of N-terminal methionines. Evidence of this process is provided by the detection of either one or both of the two N-terminal forms (with and without the initiating methionine) for many of the proteins in the spectral library. We hypothesized that the overwhelming amount of peptide data would enable the straightforward characterization of methionine aminopeptidase (MAP) activity, as well as the facile regeneration of known biochemical constraints on MAP enzymatic activity. We (indirectly) studied MAP activity by comparing the number of scans seen with and without a leading methionine. For any N-terminal peptide attributable to multiple proteins, one protein was chosen at random, to avoid artificial signal inflation in the subsequent functional analysis of the results. Upon inspection, it became apparent that in almost all cases only one form is seen in the library: we therefore classified proteins as “M” or “notM” on the basis of simple majority. We first estimated overall MAP activity using the ratio of “notM”/(“M”+“notM”) spectra. This fraction is 83% in yeast and 86% in human. An alternative estimate uses the same ratio at the protein rather than the spectral level (i.e., using protein counts rather than spectral counts). The protein-based estimate of MAP activity is 68% in yeast and 66% in human. While these values are substantially lower than the spectrum-based values, the estimate remains remarkably similar across the two organisms.

We investigated the biochemical basis underlying MAP activity by tabulating the “M” vs “notM” status of every protein. The main feature that determines MAP activity is the amino acid immediately following the protein's leading methionine (referred to as position P2). Detailed experimental analysis of point mutations has demonstrated the impact of the radius of gyration13 on the activity of MAP.14 The MAP enzyme was found to be least active when the P2 amino acid had a large radius of gyration. It is interesting to note that a simple mining of experiments not explicitly designed to characterize MAP activity15,16 can indeed regenerate this fact (Figure 2, Table 1). Remarkably, the existence in the human data set of protein isoforms with P2 amino acids of high and low radius of gyration allows us to confirm the trend by observing a shift in scan counts from notM to M (Table 2) within the same protein. A striking exception to the gyration rule is that of beta- and gamma-actin, which are both seen (extremely frequently) in the methionine-free form, despite being effectively incompatible with MAP activity (Table 3). To our knowledge, the enzyme responsible for this transformation has not yet been identified, though a candidate gene from rat liver was isolated.17 Finally, we functionally characterized the methionine-bearing proteins

Proteotypicality Scores. Proteotypicality scores (PTP scores) for the standard yeast and human peptides were extracted as tab-delimited files (Ens64_Sc_proteotypic_mapped.tsv and Ens64_Hs_proteotypic_mapped.tsv) from the PTPAtlas which is itself a component of PeptideAtlas8 (http://www.peptideatlas.org/ptpatlas/). Unobserved Peptides. In the analysis of “absent peptides”, a secondary search for the 13 unobserved peptides reported in Table 4 was undertaken to ensure that they were indeed present in the proteome searched by NIST (human_100_021411_FWD_combined.fasta). A flow diagram describing the use of these data sets and the intermediate and final report files generated by the analysis along with the names of the Python scripts are shown in Supplementary Figure S1, Supporting Information. All source files, input datasets, intermediate files, and result files are available upon request.
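The “standard peptide” and NSC definitions given in Materials and Methods can be sketched as follows. Several details are assumptions not stated in the text: cleavage is suppressed before proline, masses are monoisotopic, unobserved neighbors count as zero scans, and all function names are hypothetical.

```python
# Monoisotopic residue masses in Da (peptide mass = residues + one water).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_peptides(sequence, skip_before_proline=True):
    """Cleave C-terminal to K/R (assumption: no cleavage before P)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence[:-1]):
        if aa in "KR":
            if skip_before_proline and sequence[i + 1] == "P":
                continue
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    return sum(MONO[aa] for aa in peptide) + WATER


def standard_peptides(sequence, min_mass=500.0, max_mass=3000.0, min_len=6):
    """Fully tryptic peptides within the mass and length limits of the text."""
    kept = []
    for pep in tryptic_peptides(sequence):
        # Residues such as B, Z, or X are underspecified and are rejected.
        if len(pep) < min_len or any(aa not in MONO for aa in pep):
            continue
        if min_mass <= peptide_mass(pep) <= max_mass:
            kept.append(pep)
    return kept


def neighboring_scan_counts(standard, scans):
    """NSC per standard peptide: mean scan count of its adjacent neighbors."""
    counts = [scans.get(p, 0) for p in standard]
    nsc = []
    for i in range(len(counts)):
        neighbors = counts[max(i - 1, 0):i] + counts[i + 1:i + 2]
        nsc.append(sum(neighbors) / len(neighbors) if neighbors else 0.0)
    return nsc
```

Terminal peptides have a single neighbor, so their NSC is simply that neighbor's scan count.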



RESULTS AND DISCUSSION

“Singular Peptides”: A Straightforward Proxy for Protein Quantity. The primary role of the NIST library is to provide reference (consensus) spectra for a large collection of peptides from a target organism. However, in addition to the spectrum itself, the NIST library provides useful information regarding the number of individual scans used to generate the consensus spectrum. Throughout the following analyses, we consider the number of scans as a rough approximation for peptide and, hence, protein abundance in the cells of the target organism. A validation of this assumption can be seen in Figure 1. A straightforward quantitation strategy appears to correlate well with experimental results that were explicitly designed for estimating the quantities of proteins in cells.9 The scan-based quantitation is based on the number of scans from the most commonly observed tryptic peptide uniquely attributable to each protein. We selected the single, most observed peptide, per protein, to eliminate the unavoidable correlation with protein length. Quantitative approaches based on scan-counting

Figure 2. MAP activity (as defined in Table 1) versus radius of gyration for each P2 amino acid. Size corresponds to protein evidence, defined as the total number of proteins with observed N-termini (M + notM). Color denotes MAP activity (green = high; orange = low).


by the use of a gene-list enrichment analysis tool.18 To control for the inherent abundance bias of proteins with characterized N-termini, we used the two-list version of the tool, supplying the list of M-bearing proteins as a target and the complete list of N-terminally characterized proteins as a control. We found a broad characterization in the human data sets in the form of an unexpected enrichment (p-value = 5.07 × 10^−4) toward the Gene Ontology19 (GO) cellular component “integral to the membrane” (GO:0016021). This observation is consistent with a recently described bias20 toward N-terminal acetylation of cytosolic proteins. The suggested process would be that proteins destined for the cytosol tend to be acetylated, a modification that depends on prior MAP activity. In contrast, proteins destined for the membrane will tend to retain their N-terminal methionine in the cytosol until the completion of signal peptide removal.

“The Sound of Silence”: Identifying Surprisingly Absent Peptides. We proceeded to refine our data-mining from a broad proteome-wide phenomenon (e.g., N-terminal cleavage) to protein-specific modifications. We hypothesized that, as libraries become more comprehensive in scope, it should be possible to highlight overlooked biological phenomena such as rare post-translational modifications (PTMs), cleavage sites, or sequence variants from the absence of peptide detections. In defining measures to highlight such unexpected “gaps” in coverage, we emphasized local considerations over total protein coverage to enable the mining of “gaps” even in proteins with otherwise imperfect coverage. We therefore processed the scan numbers of all the standard peptides for any given protein, searching for peptides that appear to be “missing” despite having immediate neighbors identified with high frequency. The consistency with which such peptides remain elusive can be demonstrated with, e.g., a yeast peptide from the protein ADE5-7: DGMPFVGVLFTGMILVK, which is absent from the NIST library. This peptide is missing despite having standard neighbors that are observed hundreds of times (TIDSQIVKPTIDGMR with 227 spectra and TNQLVPEVLEYNVR with 245 spectra). Considering two very comprehensive peptide identification repositories, namely, PeptideAtlas and GPMDB,21 we see that the peptide is, in fact, entirely absent from the PeptideAtlas database and reported only once in GPMDB. There it appears to be present22 due to a possibly incorrect identification of the semitryptic form DGMPFVGVL, a previously unreported peptide which may have been missed in another independent study23 (Figure 3; for more details, see Supporting Information). A straightforward explanation for the lack of any observation of the fully tryptic form would be an inherent incompatibility of the peptide with the tandem MS identification protocols or some aspects of the physical chemistry of the MS itself. This phenomenon of relative compatibility (often termed proteotypicality) is well documented.24,25 To control for proteotypicality, we used the PTP databases for yeast and human (see Materials and Methods), which assign each standard peptide a score (PTP) predicting its proteotypicality. To this score, we added a measure we defined as the Neighboring Spectral Count (NSC), namely, the average spectral count of both of a peptide's standard neighbors, or of its single neighbor if the peptide is itself N- or C-terminal. Combining these two scores appears to compensate for weaknesses in either score taken separately: the NSC does not adequately penalize short peptides, whereas the PTP score does not benefit from protein level information

Table 1. Number of Proteins Observed with (M) and without (notM) Their Leading Methionine, Broken down by P2 Amino Acid Identity (P2-AA)a

a The entries are sorted by MAP activity, defined as notM/(notM + M), and reported (high activity = green; low activity = orange) along with the radius of gyration for the P2 amino acid. The maximum radius of gyration in the notM entries and minimal radius of gyration in the M entries are shown in bold.
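The MAP activity ratio, notM/(notM + M), can be computed at the spectral level, at the protein level, and per P2 amino acid. The sketch below assumes a hypothetical input of (P2 amino acid, M scans, notM scans) tuples, one per protein; the handling of ties, which are left unclassified here, is an assumption, since the text only specifies a simple majority.

```python
from collections import defaultdict


def classify(m_scans, notm_scans):
    """Label a protein by simple majority; ties stay unclassified (assumption)."""
    if notm_scans > m_scans:
        return "notM"
    if m_scans > notm_scans:
        return "M"
    return None


def ratio(notm, m):
    """notM / (notM + M), or None when there is no evidence at all."""
    total = notm + m
    return notm / total if total else None


def map_activity(records):
    """records: iterable of (p2_amino_acid, m_scans, notm_scans) per protein."""
    spectra = {"M": 0, "notM": 0}
    by_p2 = defaultdict(lambda: {"M": 0, "notM": 0})
    for p2, m_scans, notm_scans in records:
        spectra["M"] += m_scans
        spectra["notM"] += notm_scans
        label = classify(m_scans, notm_scans)
        if label is not None:
            by_p2[p2][label] += 1
    proteins_m = sum(c["M"] for c in by_p2.values())
    proteins_notm = sum(c["notM"] for c in by_p2.values())
    return {
        "spectral": ratio(spectra["notM"], spectra["M"]),
        "protein": ratio(proteins_notm, proteins_m),
        "by_p2": {p2: ratio(c["notM"], c["M"]) for p2, c in by_p2.items()},
    }
```

Because abundant proteins contribute many scans each, the spectral-level estimate can sit well above the protein-level estimate when highly observed proteins are mostly in the methionine-free form.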

Table 2. Pairs of Isoforms of the Same Protein Having Different P2 Amino Acids and Consequently Undergoing Different Rates of MAP Activitya

a An empty entry in the isoform column denotes the canonical form of the protein. The M and notM columns correspond to the number of scans of the N-terminal peptide with (M) and without (notM) a leading methionine. For both scans and P2 amino acids, green denotes high MAP activity while orange denotes low MAP activity. CRBA1 Isoform 2 constitutes an interesting exception to the rule (low MAP activity with a P2 amino acid which is typically associated with high MAP activity, as seen in Figure 2 and Table 1).

Table 3. Evidence of Targeted MAP-Like Activity for Actinsa

a The actin variants ACTB and ACTG are observed overwhelmingly without their N-terminal methionine, despite having P2 amino acids incompatible with typical MAP activity (columns defined as in Table 2).
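A two-list enrichment test of the kind applied to the M-bearing proteins (target list against the full set of N-terminally characterized proteins as background) is commonly evaluated with a hypergeometric tail probability. The sketch below shows that generic calculation only; it is not claimed to reproduce the exact statistics of the cited tool, and the function name and arguments are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(background, annotated, target, hits):
    """P(X >= hits) for X ~ Hypergeometric.

    background: number of proteins in the control (background) list
    annotated:  background proteins carrying the GO term
    target:     number of proteins in the target list (subset of background)
    hits:       target proteins carrying the GO term
    """
    upper = min(annotated, target)
    total = comb(background, target)
    tail = sum(
        comb(annotated, i) * comb(background - annotated, target - i)
        for i in range(hits, upper + 1)
    )
    return tail / total
```

Using the N-terminally characterized proteins as the background, rather than the whole proteome, is what controls for the abundance bias mentioned in the text.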


Figure 3. Independent identification of the semitryptic DGMPFVGVL in a study yielding extremely high coverage of ADE5-7 but no identification of DGMPFVGVLFTGMILVK (see Supporting Information for detailed discussion).

(which is implicitly present in the NSC score). This can be shown empirically: e.g., if we consider NSC scores > 200 (which correspond to the top 3.8% of all NSC values in human standard peptides), there is an unhelpful overrepresentation of short peptides as compared to the equivalent percentile of human PTP scores (the top 3.4%, or a PTP > 0.93), as seen in Figure 4. Conversely, a PTP score greater than 0.93 does not enrich for observed human peptides nearly as effectively as an NSC value greater than 200 (Figure 5). By studying the empirical distributions of PTP and NSC scores for human peptides, it is possible to estimate a probability of observation for every peptide given its PTP and NSC scores. A threshold can then be selected to highlight “surprising gaps” in coverage. In our case, we estimated a 99% probability of observation for any peptide having an NSC score greater than 200 and a PTP score greater than 0.93 (Figure 5). At these very high thresholds, the thirteen peptides that satisfy these conditions while remaining unobserved in the NIST library (Figure 5 and Table 4) can be considered “surprisingly absent”. Furthermore, to guard against potential differences in proteome sources, a secondary search confirmed that the absent peptides were indeed present in the proteome searched by NIST when creating the spectral library (see Materials and Methods).

“From Absence to Evidence”: Mining Protein-Specific Modifications. We next inspected the explicit biology behind each of the “surprisingly absent” peptides. Of the thirteen peptides in Table 4, five contain documented N-linked GlcNAcs reported in UniProtKB and localized by N-glycosidase digestion followed by standard MS-based identification. The resulting N-deamidations observed in such spectra are not reported in PeptideAtlas. However, the GPMDB contains thousands of spectra confirming the existence of these glycosylation sites. Three of the peptides correspond to sequence variants. Notably, these sequence variants, while being “canonical” in UniProtKB, apparently correspond to the minority case in the

Figure 4. Distribution of peptide lengths colored by NSC and PTP score thresholds.


Figure 5. Relative enrichment of observed peptides by NSC and PTP score thresholds. The thirteen peptides shown in the figure are the only unobserved peptides exceeding both thresholds (NSC > 200 and PTP > 0.93), which together correspond to an empirical observation probability of 99%.
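The thresholding step behind Figure 5 can be sketched as follows: among peptides exceeding both cutoffs, the observed fraction gives an empirical probability of observation, and the unobserved remainder forms the “surprisingly absent” list. The record layout (dictionaries with `nsc`, `ptp`, and `scans` keys) and the function name are assumptions for illustration; the default cutoffs are the ones quoted in the text.

```python
def surprising_absences(peptides, nsc_min=200, ptp_min=0.93):
    """Return (empirical observation probability, list of absent peptides).

    peptides: iterable of dicts with keys 'nsc', 'ptp', and 'scans'
              (a hypothetical layout; 'scans' == 0 means unobserved).
    """
    selected = [p for p in peptides
                if p["nsc"] > nsc_min and p["ptp"] > ptp_min]
    if not selected:
        return None, []
    absent = [p for p in selected if p["scans"] == 0]
    observed_fraction = 1 - len(absent) / len(selected)
    return observed_fraction, absent
```

Raising either cutoff shrinks the selected set, so the thresholds trade the confidence of each reported gap against the number of gaps that can be reported.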

Table 4. Short List of “Surprisingly Absent” Peptides in Human NIST Librarya

a Hypotheses for the absence of each peptide are presented where possible, and scan counts from PeptideAtlas and the GPMDB are also reported and broken down by variant sequence (VAR) or post-translational modification (MOD) where available. PTP and NSC are reported for each peptide along with the total number of scans for the protein (Scans) as well as the total and observed “standard peptides” for the protein. Putative VARs are colored blue while MODs are colored green. Secondary or ambiguous annotations are underlined.

human samples underlying the NIST library (as well as PeptideAtlas and the GPMDB). Three of the remaining peptides belong to collagen chains that are known to undergo intense modification. While PeptideAtlas does not report such peptides, at least in the case of GAPGAVGAPGPAGATGDR (line 4, Table 4) there is overwhelming evidence in GPMDB to warrant localization of hydroxyprolines (positions P678 and P684 are not currently curated as hydroxyproline sites in the UniProt entry for P08123). It is interesting to note that the GPMDB database allows for rare PTM identifications in its search pipeline on a protein-specific basis (using an internally curated database of known protein−PTM pairs). It follows that the absence of hydroxyproline identifications in GEAGDPGPPGLPAYSPHPSLAK is not likely due to overlooked hydroxyprolines, which suggests that other PTMs are more likely to account for the absence of identifications for this peptide.

Of the remaining two peptides, the case of GNLTNMETNGVVPGM (line 7, Table 4) is of particular interest. The peptide corresponds to the C-terminus of 4-hydroxyphenylpyruvate dioxygenase (HPPD), a protein which is known26 to undergo a C-terminal cleavage of the last 2 or 6 amino acids in the rat or porcine cases, respectively. The


(6) Nagaraj, N.; Kulak, N. A.; Cox, J.; Neuhauser, N.; Mayr, K.; Hoerning, O.; Vorm, O.; Mann, M. Mol. Cell. Proteomics 2012, 11, M111.013722.
(7) The UniProt Consortium. Nucleic Acids Res. 2010, 38, D142−148.
(8) Deutsch, E. W.; Lam, H.; Aebersold, R. EMBO Rep. 2008, 9, 429−434.
(9) Ghaemmaghami, S.; Huh, W.-K.; Bower, K.; Howson, R. W.; Belle, A.; Dephoure, N.; O’Shea, E. K.; Weissman, J. S. Nature 2003, 425, 737−741.
(10) Ishihama, Y.; Oda, Y.; Tabata, T.; Sato, T.; Nagasu, T.; Rappsilber, J.; Mann, M. Mol. Cell. Proteomics 2005, 4, 1265−1272.
(11) Braisted, J. C.; Kuntumalla, S.; Vogel, C.; Marcotte, E. M.; Rodrigues, A. R.; Wang, R.; Huang, S.-T.; Ferlanti, E. S.; Saeed, A. I.; Fleischmann, R. D.; Peterson, S. N.; Pieper, R. BMC Bioinf. 2008, 9, 529.
(12) Liu, H.; Sadygov, R. G.; Yates, J. R., 3rd. Anal. Chem. 2004, 76, 4193−4201.
(13) Levitt, M. J. Mol. Biol. 1976, 104, 59−107.
(14) Sherman, F.; Stewart, J. W.; Tsunasawa, S. BioEssays 1985, 3, 27−31.
(15) Moerschell, R.; Hosokawa, Y.; Tsunasawa, S.; Sherman, F. J. Biol. Chem. 1990, 265, 19638−19643.
(16) Frottin, F.; Martinez, A.; Peynot, P.; Mitra, S.; Holz, R. C.; Giglione, C.; Meinnel, T. Mol. Cell. Proteomics 2006, 5, 2336−2349.
(17) Sheff, D. R.; Rubenstein, P. A. J. Biol. Chem. 1992, 267, 20217−20224.
(18) Eden, E.; Navon, R.; Steinfeld, I.; Lipson, D.; Yakhini, Z. BMC Bioinf. 2009, 10, 48.
(19) Ashburner, M.; Ball, C. A.; Blake, J. A.; Botstein, D.; Butler, H.; Cherry, J. M.; Davis, A. P.; Dolinski, K.; Dwight, S. S.; Eppig, J. T.; Harris, M. A.; Hill, D. P.; Issel-Tarver, L.; Kasarskis, A.; Lewis, S.; Matese, J. C.; Richardson, J. E.; Ringwald, M.; Rubin, G. M.; Sherlock, G. Nat. Genet. 2000, 25, 25−29.
(20) Forte, G. M. A.; Pool, M. R.; Stirling, C. J. PLoS Biol. 2011, 9, e1001073.
(21) Craig, R.; Cortens, J. P.; Beavis, R. C. J. Proteome Res. 2004, 3, 1234−1242.
(22) Lee, M. V.; Topper, S. E.; Hubler, S. L.; Hose, J.; Wenger, C. D.; Coon, J. J.; Gasch, A. P. Mol. Syst. Biol. 2011, 7, 514.
(23) Swaney, D. L.; Wenger, C. D.; Coon, J. J. J. Proteome Res. 2010, 9, 1323−1329.
(24) Craig, R.; Cortens, J. P.; Beavis, R. C. Rapid Commun. Mass Spectrom. 2005, 19, 1844−1850.
(25) Mallick, P.; Schirle, M.; Chen, S. S.; Flory, M. R.; Lee, H.; Martin, D.; Ranish, J.; Raught, B.; Schmitt, R.; Werner, T.; Kuster, B.; Aebersold, R. Nat. Biotechnol. 2007, 25, 125−131.
(26) Neve, S.; Aarenstrup, L.; Tornehave, D.; Rahbek-Nielsen, H.; Corydon, T. J.; Roepstorff, P.; Kristiansen, K. Cell Biol. Int. 2003, 27, 611−624.

absence of this peptide suggests a similar (but currently unreported) modification in humans as well.



CONCLUSIONS

In this manuscript, we demonstrate that, by careful analysis of large-scale spectral libraries, it is possible to extract biological knowledge ranging from broad aspects of cellular proteosynthesis to specific enzymatic activity and PTMs (even when these were neither targeted nor searched for by the library curators). We chose the NIST library as it represents one of the largest peptide spectral libraries and has become a de facto standard in mass spectrometry. However, a similar protocol can be applied to any large, comprehensive library, even if it is customized to a targeted subset of cell biology (e.g., a specific cellular compartment, developmental stage, or pathological state). We highlighted two major aspects of protein biology that are relevant to any protein, namely, the overall quantitative level of proteins and the removal of N-terminal methionines. In addition, we showed that PTMs can be inferred from the absence of spectra for peptides that are otherwise predicted to be highly proteotypic. The application of this approach to specialized spectral libraries would benefit from prior biological knowledge specific to the targeted domain, in effect augmenting proteotypicality with other relevant priors. The integration of this contextual knowledge into a generalized Bayesian framework for the detection of unexpectedly absent spectra constitutes a natural avenue for future research.



ASSOCIATED CONTENT

Supporting Information

A flow-diagram of the analysis and additional supplementary information. This material is available free of charge via the Internet at http://pubs.acs.org.



AUTHOR INFORMATION

Corresponding Author

*Tel: +972 2 6585425. Fax: +972 2 6586448. E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We thank Nati Linial for useful comments and suggestions. We thank Tomer Ravid and Tsiona Eliyau for experimental testing of candidates for protein proteolysis in yeast. This study was supported by the Binational Science Foundation (BSF) (219/07) and the Prospects consortium [EU FRVII]. Funding for open access charge: Prospects consortium (EU FRVII).

REFERENCES

(1) Ausloos, P.; Clifton, C. L.; Lias, S. G.; Mikaya, A. I.; Stein, S. E.; Tchekhovskoi, D. V.; Sparkman, O. D.; Zaikin, V.; Zhu, D. J. Am. Soc. Mass Spectrom. 1999, 10, 287−299.
(2) Stein, S. E.; Scott, D. R. J. Am. Soc. Mass Spectrom. 1994, 5, 859−866.
(3) Eng, J. K.; Searle, B. C.; Clauser, K. R.; Tabb, D. L. Mol. Cell. Proteomics 2011, 10, R111.009522.
(4) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Proteomics 2007, 7, 655−667.
(5) Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; Stein, S. E.; Aebersold, R. Nat. Methods 2008, 5, 873−875.
