Anal. Chem. 2004, 76, 276-282
Prediction of Posttranslational Modifications Using Intact-Protein Mass Spectrometric Data
Mark R. Holmes† and Michael C. Giddings*,†,‡
Departments of Microbiology & Immunology and of Biomedical Engineering, School of Medicine, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599-7290
We present a Web-based application that uses whole-protein masses determined by mass spectrometry to identify putative co- and posttranslational proteolytic cleavages and chemical modifications. The protein cleavage and modification engine (PROCLAME) requires as input an intact mass measurement and a precursor identification based on peptide mass fingerprinting or tandem mass spectrometry. This approach predicts mass-modifying events using a depth-first tree search, bounded by a set of rules controlled by a custom-built fuzzy logic engine, to explore a large number of possible combinations of modifications accounting for the experimental mass. Candidates are saved during a search if they are within a user-specified instrument mass accuracy; the total number of possible candidates searched is based on a specified fuzzy cutoff score. Candidates are scored and ranked using a simple probabilistic model. There is generally not enough information in an intact mass measurement to determine a single unique protein characterization; however, the program provides utility by expediting the identification of sets of putative events consistent with the mass data and ranking them for further investigation. This approach uses a simple, intuitive rule base and lends itself to discovery of unannotated posttranslational events. We have assessed the program with both in silico-generated test data and with published data from an analysis of large ribosomal subunit proteins, both from the yeast S. cerevisiae. Results indicate a high degree of sensitivity and specificity in characterizing proteins whose masses resulted from reasonable proteolysis and covalent modification scenarios. The application is available on the web at http://proclame.unc.edu.

* Corresponding author. Tel.: +1 (919) 843-3513. Fax: +1 (919) 962-8103. E-mail: [email protected].
† Department of Microbiology & Immunology.
‡ Department of Biomedical Engineering.
(1) Taylor, G. K.; Kim, Y.-B.; Forbes, A. J.; Meng, F.; McCarthy, R.; Kelleher, N. L. Anal. Chem. 2003, 75, 4081-4086.
276 Analytical Chemistry, Vol. 76, No. 2, January 15, 2004

A major goal of mass spectrometry (MS) in proteomics is the identification and characterization of cellular proteins from extremely precise mass measurements. While a diverse set of software tools is available for studying peptides resulting from proteolytic digestion of sample proteins, few development efforts are underway to support inference from mass measurements of intact proteins. The ProSight PTM tool, under development by Kelleher,1 represents one of a small number of projects dedicated
to direct experimental integration of “top-down” and “bottom-up” proteomic computational analysis. A small but growing number of studies report the measurement of intact masses,2 generally using a combination of electrospray ionization4 and either time-of-flight or Fourier transform ion cyclotron resonance (FTICR) mass spectrometry. Work with whole-cell lysates is also progressing in the laboratories of Lubman5 and Kelleher,6 while Cohen has developed novel profiling methods using 2D gel electrophoresis.7,8

Protein characterization based on mature intact masses is difficult because precursors are frequently modified following translation; several major types of event are well known to alter the mass of a maturing protein. Posttranslational modifications (PTMs) involve the covalent addition or removal of chemical groups on particular amino acids in the polypeptide chain. Phosphorylation, by far the most common PTM,9 accounts for at least half of those known, affecting proteins involved in signal transduction, enzyme activation, and other key metabolic functions. Glycosylation or lipoylation occurs frequently, especially in membrane proteins. Cleavage of a localization signal from the peptide backbone is also common, following the delivery of a protein to its functional destination.

(2) VerBerkmoes, N. C.; Bundy, J. L.; Hauser, L.; Asano, K. G.; Razumovskaya, J.; Larimer, F.; Hettich, R. L.; Stephenson, J. L., Jr. J. Proteome Res. 2002, 1, 239-252.
(3) Lee, S. W.; Berger, S. J.; Martinovic, S.; Pasa-Tolic, L.; Anderson, G. A.; Shen, Y.; Zhao, R.; Smith, R. D. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 5942-5947.
(4) Smith, R. D.; Loo, J. A.; Edmonds, C. G.; Barinaga, C. J.; Udseth, H. R. Anal. Chem. 1990, 62, 882-899.
(5) Lubman, D. M.; Kachman, M. T.; Wang, H.; Gong, S.; Yan, F.; Hamler, R. L.; O’Neil, K. A.; Zhu, K.; Buchanan, N. S.; Barder, T. J. J. Chromatogr., B: Anal. Technol. Biomed. Life Sci. 2002, 782, 183-196.
(6) Meng, F.; Cargile, B. J.; Patrie, S. M.; Johnson, J. R.; McLoughlin, S. M.; Kelleher, N. L. Anal. Chem. 2002, 74, 2923-2929.
(7) Ariel, N.; Zvi, A.; Makarova, K. S.; Chitlaru, T.; Elhanany, E.; Velan, B.; Cohen, S.; Friedlander, A. M.; Shafferman, A. Infect. Immun. 2003, 71, 4563-4579.
(8) Cohen, A. M.; Rumpel, K.; Coombs, G. H.; Wastling, J. M. Int. J. Parasitol. 2002, 32, 39-51.
(9) Manning, D. R.; DiSalvo, J.; Stull, J. T. Mol. Cell. Endocrinol. 1980, 19, 1-19.
10.1021/ac034739d CCC: $27.50 © 2004 American Chemical Society. Published on Web 12/09/2003

The protein cleavage and modification engine (PROCLAME) uses intact mass measurements to determine sets of putative events accounting for the measured protein mass. The program considers two types of modification events: simple covalent modifications, such as phosphorylation, and proteolysis. At this time, the program does not address lipid or polysaccharide modifications due to their greater complexity. The analysis requires at least two measurements from a protein sample: an intact mass measurement and an identifying measurement, such as a peptide mass fingerprint or a tandem mass spectrum of selected peptides.

The peptide mass fingerprint process involves proteolytically digesting the protein sample using an enzyme such as trypsin and then measuring the fragment masses by MS. The resulting fingerprint is matched against putative masses that result from in silico digestion of entries from either protein or translated nucleotide databases. Matches are most commonly made against a database of known proteins, using available software tools.10-12 In contrast, the genome fingerprint scanning (GFS) method we recently reported13 matches the fingerprint against uninterpreted genome sequence, identifying the coding locus without reliance upon prior genome annotation.

Tandem mass spectrometry (MS/MS) can be used to provide further identifying information. Select peptides from the aforementioned proteolytic digest are serially fragmented from both the carboxy (C) and amino (N) termini, producing a spectrum that can be used to identify the peptide producing it. Various methods can be used for matching MS/MS data, such as the cross-correlation approach.14 PROCLAME is designed to complement these when operating as part of a larger integrated tool set for protein characterization.

Computational analysis of putative chemical modifications for peptides has been implemented by several groups. Both FindMod15 and MASCOT10 match peptide mass data against a list of possible PTMs. Localization signals can be predicted using such tools as PSORT,16-18 SignalP,19,20 and MITOPROT.21 Complex modification events such as glycosylation and lipoylation are more difficult to predict, due to variations in the length and composition of chains added to the peptide backbone, though at least one group has built software, GlycoMod,22 that analyzes N-linked polysaccharides on peptides, using a database of common carbohydrate structures.
However, matching intact protein data presents greater challenges because many more modifications can affect the mass of a complete protein than of a single peptide.

Protein characterization is aided by intact-protein measurement. For example, the identification of a signal peptide cleavage event is much more straightforward with intact data than with peptides, because the former identification relies on a positively observed entity, whereas the latter relies on the absence of one or more peptides from among the spectra, and peptides may be absent from a spectrum for several reasons. When PTMs are analyzed, the confidence in a particular characterization will be greatly strengthened if the same modifications are observed in both the intact mass data and the resultant peptides.

(10) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551.
(11) MacCoss, M. J.; Wu, C. C.; Yates, J. R., 3rd Anal. Chem. 2002, 74, 5593-5599.
(12) Clauser, K. R.; Baker, P.; Burlingame, A. L. Anal. Chem. 1999, 71, 2871-2882.
(13) Giddings, M. C.; Shah, A. A.; Gesteland, R.; Moore, B. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 20-25.
(14) Yates, J. R., 3rd; Eng, J. K.; McCormack, A. L.; Schieltz, D. Anal. Chem. 1995, 67, 1426-1436.
(15) Wilkins, M. R.; Gasteiger, E.; Gooley, A. A.; Herbert, B. R.; Molloy, M. P.; Binz, P. A.; Ou, K.; Sanchez, J. C.; Bairoch, A.; Williams, K. L.; Hochstrasser, D. F. J. Mol. Biol. 1999, 289, 645-657.
(16) Nakai, K.; Kanehisa, M. Proteins 1991, 11, 95-110.
(17) Nakai, K.; Kanehisa, M. Genomics 1992, 14, 897-911.
(18) Nakai, K.; Horton, P. Trends Biochem. Sci. 1999, 24, 34-36.
(19) Nielsen, H.; Engelbrecht, J.; Brunak, S.; von Heijne, G. Int. J. Neural Syst. 1997, 8, 581-599.
(20) Nielsen, H.; Brunak, S.; von Heijne, G. Protein Eng. 1999, 12, 3-9.
(21) Claros, M. G. Comput. Appl. Biosci. 1995, 11, 441-447.
(22) Cooper, C. A.; Gasteiger, E.; Packer, N. H. Proteomics 2001, 1, 340-349.

To use PROCLAME, the user provides the intact mass of the protein and either an open-reading frame identifier or an amino acid sequence. The ability to enter a raw and perhaps unpublished protein sequence is of particular benefit to those studying alternative transcripts, polymorphisms, or any putative protein beyond the ∼6400 S. cerevisiae cDNA-based entries currently in the database. Other application parameters specify the estimated instrument accuracy in ppm and the fuzzy search cutoff score that controls how exhaustively the program searches. Users can choose to test for either PTMs or cleavages affecting intact precursors or can select simultaneous testing for both cleavages and modifications. A standard list of 14 PTMs can be used for the search, or the list may be customized by the user. The probability and fuzzy frequency scores of each PTM can also be modified.

The search space for this problem is too large to explore comprehensively. Since PTMs can both increase and decrease the total mass, the time complexity of the search is O(n^m), where n is the number of modifications considered and m is the size of the modification list considered. This mandates the use of heuristics to limit the search space to realistic candidates, so that the calculation completes in a reasonable time. A custom-built fuzzy logic engine determines modification candidate fitness using a small set of intuitive rules that heuristically bounds the search space. The engine uses simple, plain-English rules. The modification lists returned by the search are ranked by a separate, probability-like scoring method that estimates the frequency of various types of modifications.
While it is desirable to accurately represent the probability of observing a given modification event, this is not possible at present because of the limited experimental information available. We use estimated probability values, based on an analysis of Swiss-Prot, to provide a starting point assisting the user’s consideration of the putative modifications identified. We expect that this scoring function can be refined as data from large-scale proteomics studies accumulate. Users can also enter their own probability estimates that pertain to a given realm of biological study.

METHODS AND SOFTWARE

The modification search can be envisioned as a tree traversal, with the original calculated mass for the unmodified ORF at the root and each successive level representing the consideration of an additional modification (Figure 1). Running on a 2-GHz system, the application can predict a precursor’s PTMs and proteolytic cleavages in about 1 min when using a reasonable search cutoff score (e.g., 0.7). Result quality depends on the accuracy of the mass spectrum; this is discussed further in the Results and Discussion section.

Figure 1. Sample traversal through the search tree. The search is depth first, with leftmost nodes being the most frequent modifications. For compactness, different nodes containing the same PTM are shown with multiple paths. Thick lines indicate the path to a single match. Dotted lines indicate paths not shown for clarity. This particular scenario, having 3 PTMs and 96 ppm ∆Mass from the observation (out of a specified MMA tolerance of 100 ppm), might have a low probability score but might also rank highly among the results, depending on other viable scenarios found.

PROCLAME’s general rules limit bias against the discovery of novel events; context-specific rules would limit the set of potential predictions. While it is true that some of its matches can be unlikely in a given experimental context, it is equally true, and potentially more important, that other, novel matches produced by the application could not have been predicted by other tools that rely heavily on sequence context. The program’s purpose is to guide investigators toward a manageable set of
possible event scenarios that could produce the observed intact mass within the specified instrument accuracy.

Because of its generality, PROCLAME can produce matches that are not viable for a particular biological milieu. For example, in the study of yeast large-subunit ribosomal proteins used for testing the application,3 no phosphorylation was found by the researchers, possibly due to the alkaline nature of these subunit proteins. However, using the standard list of modifications, PROCLAME proposed matches containing phosphorylations because it knows only that this PTM is very common in general. In such cases, where knowledge is available to narrow the scope of the search, the user may adjust the list and scoring of searched-for modifications accordingly.

Usage. To begin a search, the user enters the observed intact protein mass and a protein identity from peptide mass fingerprinting or MS/MS. Alternatively, a protein sequence can be entered. The investigator can choose to test for single or double cleavages, for covalent modifications, or for both. Three additional fields specify search constraints. The quality cutoff score is the fuzzy fitness score below which the search will stop going deeper; this score is described in detail below. The match precision score specifies the estimated mass measurement accuracy (MMA) of the mass spectrometer. Candidates with masses outside this range are not considered matches and are not saved. Users can choose the number of matches to be saved for review; typically, only a small set of 10 or 20 is of potential interest. Finally, a button allows access to a screen where the list of PTMs can be modified or rescored.

Following completion of the search, a hyperlinked list of matches is displayed; the list is normally ranked by probability score but can be sorted on other fields. A download button allows capture of a text file containing the input information and the complete list of matches.
A brief visual review of this file often reveals related
combinations of events, potentially narrowing the focus of confirmatory analyses. A sample match file is in the Supporting Information. The program is freely available on the Web at http://proclame.unc.edu. Users wishing to run many tests can also be set up for batch mode, in which all tests are run in series and their output written to text files.

Methodology. If provided with an identifying measurement, the program retrieves from the database the sequence of the protein(s) believed to be translated from the specified locus. Alternatively, the user’s own sequence can be used. This provides the baseline mass for the immature protein. Then the application selects a set of hypothetical cleavages from the unmodified precursor that, within a reasonable range, could match the target mass after the subsequent inclusion of a number of chemical modifications. The protein sequence is traversed from both ends and its mass is recalculated with the removal of each amino acid. Each such cleavage is retained if its mass falls within a reasonable range of the target to allow for subsequent PTMs; this tolerance is currently set to 5 kDa. The scoring of precursor cleavages is discussed in the Probability Scoring section.

These candidates are then used as starting points in the traversal of the search space containing combinations of covalent modifications. Figure 1 presents a schematic view of this process. This tree is searched depth first and is heuristically bounded by the fuzzy logic engine. Each modification is added and the final mass recalculated. PTMs continue to be added in order until the candidate becomes unrealistic as dictated by its fuzzy fitness score. At this point the search traversal returns to the next highest node in the tree, continuing recursively until all matches have been collected for a given precursor. The process then begins again using the next putative cleaved precursor.
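The two stages described above (enumerating cleaved precursors, then walking the modification tree depth first) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the PROCLAME implementation: the function names, the residue mass table (approximate average masses), the three-entry PTM list, and the use of a fixed depth limit in place of the fuzzy fitness bound are all hypothetical. Only the 5 kDa window, the ppm match test, and the three mass shifts (from Table 1) come from the text.

```python
from typing import Iterator, Tuple

WATER = 18.0153  # average mass of H2O (Da), added once per chain

# Approximate average residue masses in Da (illustrative, not PROCLAME's own table).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}

# Three PTMs from Table 1 (average mass shifts in Da), most frequent first.
PTMS = [("phosphorylation", 79.9799), ("acetylation", 42.0373), ("methylation", 14.0269)]


def cleaved_precursors(seq: str, target: float,
                       window: float = 5000.0) -> Iterator[Tuple[int, int, float]]:
    """Yield (n_removed, c_removed, mass) for truncations within `window` Da of target."""
    residues = [RESIDUE_MASS[aa] for aa in seq]
    for n_cut in range(len(seq)):
        mass = sum(residues[n_cut:]) + WATER
        for c_cut in range(len(seq) - n_cut):
            if abs(mass - target) <= window:
                yield n_cut, c_cut, mass
            mass -= residues[len(seq) - 1 - c_cut]  # drop one more C-terminal residue


def search_ptms(mass, target, ppm, max_depth, start=0, path=()):
    """Depth-first walk over PTM combinations, yielding those within `ppm` of target."""
    if abs(mass - target) / target * 1e6 <= ppm:
        yield path
    if len(path) >= max_depth:  # crude stand-in for the fuzzy quality cutoff
        return
    for i in range(start, len(PTMS)):  # start=i avoids revisiting permutations
        name, delta = PTMS[i]
        yield from search_ptms(mass + delta, target, ppm, max_depth, i, path + (name,))


def find_matches(seq, target, ppm=20.0, max_depth=3):
    """Seed one PTM search from every retained cleaved precursor."""
    matches = []
    for n_cut, c_cut, mass in cleaved_precursors(seq, target):
        for ptms in search_ptms(mass, target, ppm, max_depth):
            matches.append((n_cut, c_cut, ptms))
    return matches
```

Calling `find_matches(sequence, observed_mass, ppm=20.0)` returns tuples of residues removed from each terminus plus the PTM names applied. Replacing the depth limit with a score that also weighs mass distance and modification frequency would bring the sketch closer to the bounded search the text describes.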
If the measured mass of any candidate is within the specified instrument accuracy, the candidate is considered a match and is added to the array of matches presented to the user at the end of the search. As matches are added they are collated by score. If the array size exceeds the match set size specified by the user, then the poorest-scoring matches are discarded. Fuzzy Logic: A Rule-Based Approach. The search for posttranslational events is constrained using fuzzy logic (FL), which represents a nonbinary system of quantification and reasoning.23 An FL system uses rules to measure state and relate antecedent and consequent conditions in a nonbinary way. Assessment is performed by functions that determine the degree of membership in a set; for example, a fuzzy function might determine that a mass has a high degree of membership in a set named Light, low membership in a set named Heavy, and half membership in a set called Middleweight. These “fuzzy fitness” values can be used within a generalized framework of intuitive rules that are adaptable to automated decision making. Fuzzy logic rule sets are easily adjusted and have the advantage of being easy to interpret by subject matter experts. A fuzzy rule set contains assertions relating fuzzy fitness in different domains of interest. For example, a rule might state that if a molecule’s mass is Light then its size is Small; more specifically, such a rule relates a degree of Lightness to, perhaps, a similar degree of Smallness. Notably, this rule does not relate mass to size per se but rather it relates a particular type of mass to a degree of size. To get a complete, automation-capable framework relating all possible masses to all sizes, additional rules are required, such as one that states that if a mass is Very Heavy then its size is Very Large. An important requirement for the viability of a fuzzy system is the selection of a well-integrated group of membership sets. 
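The notion of graded set membership can be illustrated with a short Python sketch built on the S-shaped curve that the Fuzzy Engine section defines. The piecewise S-function follows that definition (midpoint a, width b). The Z- and Pi-function constructions, the set midpoints and widths, and all function names are assumptions made for illustration, since the text gives only their general shapes.

```python
def s_function(x: float, a: float, b: float) -> float:
    """S-shaped membership: 0 below a-b, 0.5 at the midpoint a, 1 above a+b."""
    if x < a - b:
        return 0.0
    if x <= a:
        return (x - (a - b)) ** 2 / (2 * b ** 2)
    if x <= a + b:
        return 1.0 - ((a + b) - x) ** 2 / (2 * b ** 2)
    return 1.0


def z_function(x: float, a: float, b: float) -> float:
    """Mirror image of the S-curve, taken here as 1 - S (an assumption)."""
    return 1.0 - s_function(x, a, b)


def pi_function(x: float, a: float, b: float) -> float:
    """Bell shape peaking at a: rising S-curve on the left, falling Z-curve
    on the right (an assumed construction)."""
    if x <= a:
        return s_function(x, a - b, b)
    return z_function(x, a + b, b)


# Hypothetical sets for a protein-mass domain (Da); midpoints and widths are made up.
MASS_SETS = {
    "Light": lambda m: z_function(m, 20000.0, 10000.0),
    "Middleweight": lambda m: pi_function(m, 30000.0, 10000.0),
    "Heavy": lambda m: s_function(m, 40000.0, 10000.0),
}


def memberships(mass: float) -> dict:
    """Degree of membership of one mass in every set of the domain."""
    return {name: fn(mass) for name, fn in MASS_SETS.items()}


def smallness(mass: float) -> float:
    """Rule 'if mass is Light then size is Small': the degree carries over."""
    return memberships(mass)["Light"]
```

With these made-up sets, a 12 kDa protein is almost fully Light, barely Middleweight, and not Heavy at all, which is the kind of overlapping classification the rule base relies on.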
Sets and their values must be well-chosen for each domain in the problem context. For example, if high temperature is an important consideration, then a fuzzy domain called Temperature might contain sets named Warm, Hot, and Very Hot, which overlap each other to allow the construction of a balanced rule set. Fuzzy Engine. A custom engine was constructed using the Java programming language. The Engine class encapsulates control logic for managing three other classes named Set, Domain, and Rule, which implement three critical fuzzy logic entities. The Set class contains methods encoding the three basic fuzzy functions S, Z, and Pi, which can indicate degrees of membership within a range of related values. The three functions, named for the shapes of their curves, are based on the S-function:
$$
S(x,a,b) =
\begin{cases}
0, & x < a-b \\
\dfrac{(x-(a-b))^2}{2b^2}, & a-b \le x \le a \\
1-\dfrac{((a+b)-x)^2}{2b^2}, & a < x \le a+b \\
1, & x > a+b
\end{cases}
$$

where a is the midpoint of the membership set and b is its width. The Z-function is its inverse, and the Pi-function combines the two into a bell-shaped curve. One instance of the Set class encodes degrees of membership for one of usually several ranges of related values. The Set class was coded in a generic fashion for portability. Instantiation of the class populates a common reflected method that provides an efficient, common interface regardless of the specific function being used in that instance.

The Domain class encapsulates a group of sets and covers all possible values within a particular sphere of interest such as the delta between candidate and observed masses. Each domain should contain multiple sets, to allow for a full range of classification. An illustrated integration of one domain with three sets is included in Supporting Information. Our application currently contains three input domains (Delta, Depth, and Frequency) and one output domain, Quality. The Delta domain encompasses the distance between a calculated candidate mass and the observed target mass and includes sets such as “Close” and “Far”. The Depth domain encodes the range of depths in the search tree; though this range is theoretically infinite, in practice it is understood that most proteins will be affected by relatively few modifications. This domain contains several sets such as “Shallow” (one or several modifications) and “Deep” (more than about seven modifications). The Frequency domain encodes the estimated probability that a given candidate modification will occur; it uses an arbitrary one-point scale whose values are based on analysis of PTM annotations in the Swiss-Prot24 and YPD databases.25 Phosphorylation, by far the most frequently observed PTM, is present in the Frequency domain with a high score in the “Frequent” set and a low score in the “Seldom” set.

Fuzzy Rules. Selection and specification of rules is critical to acquiring a complete and balanced overview of all possible test states that the program may encounter. The simplest form of rule has the format if (antecedent state(s)) then (consequent state). In our system, the consequent to all rules applies to one of the sets in the Quality domain, and the antecedent state can be one or two conditions in the three input domains. For example, the rule if candidate mass is close to target mass then candidate quality is good can be more precisely stated as the candidate mass’s degree of membership in the Close set of the Delta domain is the candidate’s degree of membership in the Good set of the Quality domain. The more “close” the candidate’s mass to the target mass, the more “good” its quality.

Our system currently uses 10 simple rules, all relating conditions in the domains Delta (∆ mass to target), Depth (number of events), and Frequency (averaged estimated incidence of modifying event) to a degree of membership in the Quality domain; these are listed in Supporting Information. The current rule set encodes only a few common-sense assertions, avoiding protein context rules in order to preserve the ability to discover novel events. For example, one rule states that if Frequency is High then Quality is Good. During a search, each of the rules is fired and the results are added to a common array representing the Quality domain. After all rules are fired, the resulting landscape is integrated over 40 intervals and its center of mass becomes a single quality score for the candidate. This quality score is compared with the investigator’s specified fuzzy cutoff score in order to bound the search.

(23) Zadeh, L. A. IEEE Trans. Fuzzy Syst. 1996, 4, 103-111.
(24) Bairoch, A.; Apweiler, R. Nucleic Acids Res. 2000, 28, 45-48.
(25) Costanzo, M. C.; Hogan, J. D.; Cusick, M. E.; Davis, B. P.; Fancher, A. M.; Hodges, P. E.; Kondu, P.; Lengieza, C.; Lew-Smith, J. E.; Lingner, C.; Roberg-Perez, K. J.; Tillberg, M.; Brooks, J. E.; Garrels, J. I. Nucleic Acids Res. 2000, 28, 73-76.

Probability Scoring. A separate probabilistic scoring system was implemented, offering a simple way to score the likelihood that a candidate is the true positive, something that the fuzzy score does not provide.26 It also offers a context in which to relate the number of posttranslational events to the relative chances that each of those events will occur. To confirm the validity of this system, a series of randomly generated scenarios was run to test for sensitivity and specificity; these are discussed in Results and Discussion below.

The approximate probability score (p-score) of a candidate is a number between 0 and 1, with 1.0 indicating maximal likelihood that the events in a match actually produced the observed mass. This score determines a candidate’s rank among all the other matches stored during a test. It is calculated by multiplying the scores assigned to each of the modifying events in that putative match: $p_{\mathrm{total}} = p_1 \times p_2 \times p_3 \times \cdots \times p_n$, where $p_x$ is the p-score assigned to event x. P-scores were assigned in much the same way as fuzzy frequency scores, though with some differences. For example, cleavage of the N-terminal methionine is ubiquitous but is also probably independent of the likelihood that an additional cleavage will subsequently occur for subcellular localization. As much as possible, the estimated probabilities for related alternative events were assigned to add up to 1.

The scoring of peptide cleavages was based on the following simple rules. Removal of one or two amino acids from the N-terminus is common and thus given the highest probability (0.8), while the absence of this event is given a correspondingly low score (0.1). Among further N-terminal cleavages, removal of up to 30 amino acids is given a relatively high score (0.7), up to 50 a lower score (0.2), and over 50 a much lower score (0.05).
Cleavages of any length from the C-terminus are scored relatively low (0.05), and large double cleavages, from both ends of the protein, are also scored low (0.1). Table 1 includes the default scores assigned to each PTM, which can be modified to suit alternative hypotheses about likelihood and frequency. The scores are only approximate, and they do not yet account for the likelihood that any given event will occur in multiples or in conjunction with another event. For example, a protein may be more likely to be doubly or triply phosphorylated than singly phosphorylated. We have designed the system to accommodate this in the future.

Data Sources. The application database comprises two data sets whose data were obtained from the following databases: SGD (http://www.yeastgenome.org), MITOP,27 DeltaMass (http://www.abrf.org/index.cfm/dm.home), Swiss-Prot,24 and YPD.25 The proteins data set was compiled from two sources, SGD and MITOP. The posttranslational modifications data set contains the 313 possible PTMs in the DeltaMass database, along with statistically inferred values estimating the known frequency of each modification. To select and rank the modifications for testing, we quantitatively reviewed annotations in the Swiss-Prot and YPD databases. In Swiss-Prot, the MOD_RES subset of entries within the FEATURE_TAG annotations provided PTM frequencies for numerous species, while the PTMs annotated in YPD provided complementary values curated specifically for yeast. Both annotations yielded a generally similar view of modification frequency, which was encoded in our PTM data set. Modifications were scored, and a few events of interest to the research team were noted. For tractability, the default list was reduced to 14 PTMs (Table 1) for use in searches, but the user has the ability to add to or remove from this list.

Table 1. Default Posttranslational Modifications and Scores Used by PROCLAME, Sorted by Probability Score (a)

name                                     p-score   freq score   av mass (Da)   mono mass (Da)
phosphorylation                          0.5       0.8          79.9799        79.9663
acetylation (Ac)                         0.1       0.5          42.0373        42.0106
methylation                              0.1       0.5          14.0269        14.0157
amide formation (C terminus)             0.05      0.2          -0.9847        -0.9840
deamidation                              0.05      0.3          0.9847         0.9840
disulfide bond formation                 0.05      0.3          -2.0159        -2.0157
farnesylation                            0.05      0.2          204.3556       204.1878
formylation (CHO)                        0.05      0.3          28.0104        27.9949
myristoylation                           0.05      0.2          210.3598       210.1984
ornithine (from arginine)                0.05      0.2          -42.0400       -42.0218
oxidation of methionine (to sulfone)     0.05      0.3          31.9988        31.9898
oxidation of methionine (to sulfoxide)   0.05      0.3          15.9994        15.9949
palmitoylation                           0.05      0.2          238.4136       238.2297
selenocysteine (from serine)             0.05      0.2          62.9606        63.9216

(a) The frequency score is used in bounding the fuzzy logic search, while the p-score is used to rank the list of matches found.

(26) Zadeh, L. A. Technometrics 1995, 37, 271-276.
(27) Scharfe, C.; Zaccaria, P.; Hoertnagel, K.; Jaksch, M.; Klopstock, T.; Dembowski, M.; Lill, R.; Prokisch, H.; Gerbitz, K. D.; Neupert, W.; Mewes, H. W.; Meitinger, T. Nucleic Acids Res. 2000, 28, 155-158.

RESULTS AND DISCUSSION

PROCLAME was tested with several data sets, including one set of 75 mass measurements of large-subunit ribosomal proteins published in the literature and one in silico set of 40 randomly generated scenarios. An additional series of six randomly generated scenarios was tested with six increasing levels of instrument accuracy, to test for specificity by relating the false-positive rate to the specified MMA.

Hypothetical Scenarios in Silico. To test PROCLAME for a full spectrum of scenarios, a Perl script was written to generate random event sets. The test generator selected a random protein from the database and then selected a random cleavage and a random number of randomly chosen PTMs. After the scenario was chosen, an arbitrary amount of instrument error was introduced, and the final hypothetical observed mass was calculated. In selecting cleavages, the test program chose randomly between cleavages of 0, 1, or 2-50 amino acids from the N terminus. The number of hypothetical PTMs was randomized between 0 and 4, and to introduce noise, an instrument error of 0-20 ppm was randomly chosen with an equal chance of being either positive or negative. Forty such scenarios were generated, all assuming the use of average isotopic masses.
Note that many of the test cases are so improbable that they would likely never be observed, but their inclusion was critical in determining the sensitivity of PROCLAME at varying levels of event probability.
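The scenario generator described above can be approximated with the following Python sketch. The original was a Perl script whose details are not given, so this is an illustration under assumptions: the function name, the uniform choice among the three cleavage classes, the four-entry PTM subset (average mass shifts from Table 1), and the approximate residue masses are hypothetical.

```python
import random

WATER = 18.0153  # average mass of H2O (Da)

# Approximate average residue masses in Da (illustrative values).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}

# A subset of Table 1 (average mass shifts in Da).
PTM_SHIFT = {
    "phosphorylation": 79.9799,
    "acetylation": 42.0373,
    "methylation": 14.0269,
    "deamidation": 0.9847,
}


def make_scenario(seq, rng, max_ptms=4, max_error_ppm=20.0):
    """Build one random test case: N-terminal cleavage, PTMs, and signed ppm error."""
    # Cleavage class: none, one residue, or 2-50 residues (equal weighting is assumed).
    kind = rng.choice(["none", "single", "long"])
    if kind == "none":
        n_cut = 0
    elif kind == "single":
        n_cut = 1
    else:
        n_cut = rng.randint(2, min(50, len(seq) - 1))
    # Between 0 and max_ptms randomly chosen modifications, repeats allowed.
    ptms = [rng.choice(list(PTM_SHIFT)) for _ in range(rng.randint(0, max_ptms))]
    true_mass = sum(RESIDUE_MASS[aa] for aa in seq[n_cut:]) + WATER
    true_mass += sum(PTM_SHIFT[name] for name in ptms)
    # Instrument error of 0-20 ppm with an equal chance of either sign.
    error_ppm = rng.uniform(0.0, max_error_ppm) * rng.choice([-1.0, 1.0])
    observed = true_mass * (1.0 + error_ppm * 1e-6)
    return {"n_cut": n_cut, "ptms": ptms, "true_mass": true_mass,
            "error_ppm": error_ppm, "observed_mass": observed}
```

Passing a seeded `random.Random` instance makes a batch of scenarios reproducible, which helps when the same cases are rerun at different fuzzy cutoff scores.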
Table 2. Sample Scenarios Having Probability Scores around 0.002

p-score   cleavage (p-score)    PTMs (p-score)
0.002     1 N-terminal (0.8)    phosphorylation (0.5) + methylation (0.1) + oxidation of methionine to sulfoxide (0.05)
0.002     1 N-terminal (0.8)    formylation (0.05) + farnesylation (0.05)
0.00175   9 N-terminal (0.7)    disulfide bond formation (0.05) + palmitoylation (0.05)
0.001     3 N-terminal (0.8)    phosphorylation (0.5) + oxidation of methionine to sulfone (0.05) + myristoylation (0.05)
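The p-scores in Table 2 follow directly from the product rule given in the Probability Scoring section. The short Python sketch below reproduces two of them and ranks the scenarios. The function names and the tuple layout are hypothetical; the per-event scores are the defaults listed in Tables 1 and 2.

```python
from math import prod

# Default PTM p-scores taken from Table 1 (subset).
PTM_P = {
    "phosphorylation": 0.5,
    "methylation": 0.1,
    "formylation": 0.05,
    "farnesylation": 0.05,
    "disulfide bond formation": 0.05,
    "palmitoylation": 0.05,
    "oxidation of methionine (to sulfoxide)": 0.05,
}


def p_score(cleavage_p, ptms):
    """Multiply the cleavage score by the score of every PTM in the scenario."""
    return cleavage_p * prod(PTM_P[name] for name in ptms)


def rank_matches(matches):
    """Sort (label, cleavage_p, ptms) tuples by descending p-score."""
    return sorted(matches, key=lambda m: p_score(m[1], m[2]), reverse=True)


row_1 = ("row 1", 0.8, ["phosphorylation", "methylation",
                        "oxidation of methionine (to sulfoxide)"])
row_3 = ("row 3", 0.7, ["disulfide bond formation", "palmitoylation"])

# 0.8 * 0.5 * 0.1 * 0.05 is approximately 0.002; 0.7 * 0.05 * 0.05 is approximately 0.00175
ranked = rank_matches([row_3, row_1])
```

Because every additional event multiplies in a factor of at most 0.5, scenarios with many rare modifications fall quickly toward the low p-score range where the results below show sensitivity degrading.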
Figure 2. Putative ranks for 40 randomly generated in silico tests, run at two fuzzy cutoff scores, with second-order power trend lines. Solid diamonds and trend line represent a cutoff score of 0.6, while open triangles and dotted trend line resulted from tests rerun at cutoff score 0.5. Ranking was determined by adding one to the number of false positives in each test. Sensitivity changes markedly below p-scores of ∼0.002.
The 40 hypothetical scenarios were tested twice; one run used a fuzzy cutoff score of 0.6 and a match set size of 200, and the second run used a cutoff score of 0.5 with 300 matches. The two series were done to assess the number of matches that would be missed or found as the size of the search space was changed. The large set sizes allowed more accurate measurement of the number of false positives for lower-probability events that were not matched. Among the 40 random scenarios, 21 cases had probability scores (p-scores) of 0.002 or greater; when tested with a fuzzy cutoff score of 0.6, 18 were ranked first and three ranked second. The remaining 19 cases had p-scores between 0.002 and 0.000 001; 5 of these were ranked within the top 10, and 11 were not matched at all. These cases with p-scores below 0.002 had too many false positives to be of practical predictive value. To test the effect on sensitivity of enlarging the search space, the series was rerun at a lower fuzzy cutoff score of 0.5. Among the 23 cases with p-scores of 0.001 75 or greater, 19 of these were ranked first, three ranked second, and one ranked fourth. Notably, two of these cases were not found when run at the higher cutoff score, indicating some increase in sensitivity. The remaining 17 cases had p-scores between 0.001 and 0.000 001; 4 of these were ranked in the top 10 and 8 were not matched at all. Interestingly, among the tests in both series that did not match, many would have ranked in the top 10 if the program had ignored PTMs around 1 Da (e.g., amidation); also, most of these latter cases had high levels of instrument error, close to the 20 ppm specified for the experiment. These results indicate that PROCLAME can correctly rank matches within the top several candidates, with scenario p-scores of ≥∼0.002 when using a fuzzy cutoff score of ≤0.6. Figure 2 shows the rankings and p-scores of both test series; the test cases and numerical values are included in Supporting Information.
In practical use, tests can be rerun with lowered fuzzy cutoff scores in order to provide an increased level of sensitivity. The only disadvantage to enlarging the search space by lowering the fuzzy cutoff score is that tests take longer: several minutes versus ∼1 min. In practice, this typically does not matter because tests can be run in batch mode and the results obtained later at the user’s convenience.
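The ranking convention used in these tests (one plus the number of false positives) can be sketched as below. The example assumes that a false positive is any stored candidate whose p-score exceeds that of the true scenario, and that an unmatched scenario has no rank; the function name `rank_of_true_scenario` and the candidate labels are hypothetical.

```python
def rank_of_true_scenario(candidates, true_id):
    """Return 1 + the number of candidates that outscore the true scenario.

    `candidates` is a list of (scenario_id, p_score) pairs that passed the
    mass filter. Returns None when the true scenario was not matched at all.
    """
    scores = dict(candidates)
    if true_id not in scores:
        return None
    true_score = scores[true_id]
    false_positives = sum(1 for score in scores.values() if score > true_score)
    return 1 + false_positives


# Hypothetical match list for a single test case.
matches = [
    ("phosphorylation", 0.5),
    ("Met loss + acetylation", 0.04),
    ("Met loss + 2 methylations", 0.008),
]
print(rank_of_true_scenario(matches, "Met loss + acetylation"))
```

With this made-up match list, one candidate outscores the true scenario, so the function returns a rank of 2.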
(a) Table 2 footnote: The application as currently configured will usually rank scenarios up to this level of complexity as either the first or second most likely matches among all those found, with instrument precisions of 20 ppm or less.
Table 3. Rankings for Six Randomly Generated in Silico Tests, Run at Six Different MMA Constraints

                      rank, by test p-score (events) (a)
specified MMA (ppm)   0.7 (1)   0.04 (1)   0.02 (2)   0.004 (2)   0.008 (2)   0.000125 (4)
0.5                   1         1          1          1           1           1
1                     1         1          1          1           1           1
5                     1         1          2          1           2           1
20                    1         2          2          1           2           1
50                    1         3          2          2           5           3
100                   1         4          2          4           6           6

(a) Numbers in parentheses indicate the number of cleavage or modification events, excepting cleavage of the N-terminal methionine.
To provide a realistic perspective of scenarios whose probability is around the limit of the application’s reliability, Table 2 lists several sets of events at p-scores of 0.002.

A second suite of tests was performed in order to elucidate the false-positive rates at different levels of instrument precision. The MMA constraint tells the application to store and rank any matches it finds that are within the specified ppm limit and to discard matches outside this range. Six scenarios were randomly generated, having p-scores of 0.7, 0.04, 0.02, 0.008, 0.004, and 0.000 125. The exact final masses were calculated for each scenario, and no instrument error was introduced, so that the program could make very precise guesses about each one. Then the scenarios were tested at six levels of hypothetical MMA: 100, 50, 20, 5, 1, and 0.5 ppm. In all six cases, the true positive was ranked either first or second at an MMA of 20 ppm or lower; the highest number of false positives, six, was observed at 100 ppm for the two cases with lowest p-scores. Table 3 shows the results for the complete series, as well as the number of mass-modifying events in each scenario.

Tests of Published Experimental Data. A recent study by Lee et al.3 used FTICR-MS to examine the intact masses of the 65 ribosomal large subunit proteins (rpLs) of yeast, in which all but 5 of the 59 observed proteins were calculated as having posttranslational cleavages or modifications. Following mass measurement, the investigators used in-house software to infer common cleavages and PTMs. The presence of double measurements and duplicate genes yielded 75 tests to perform with PROCLAME. For these tests, an MMA tolerance of 25 ppm was specified due to the accuracy of this instrument.
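The MMA constraint used in these tests amounts to a relative mass-error filter, which can be sketched as follows. The helper names `ppm_error` and `within_mma` are hypothetical, and the choice to express the error relative to the observed mass is an assumption; the text does not state which mass PROCLAME uses as the denominator.

```python
def ppm_error(candidate_mass, observed_mass):
    """Signed mass difference in parts per million, relative to the observed mass."""
    return (candidate_mass - observed_mass) / observed_mass * 1e6


def within_mma(candidate_mass, observed_mass, mma_ppm):
    """True when the candidate lies inside the specified +/- ppm window."""
    return abs(ppm_error(candidate_mass, observed_mass)) <= mma_ppm


# Hypothetical masses in Da: a 0.4 Da difference at 20 000 Da is 20 ppm.
observed = 20000.0
candidate = 20000.4
print(ppm_error(candidate, observed))
print(within_mma(candidate, observed, 25.0), within_mma(candidate, observed, 5.0))
```

At a 25 ppm tolerance this made-up candidate would be stored and ranked, whereas at 5 ppm it would be discarded, which is why tighter MMA settings reduce the number of false positives.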
Of the 75 tests performed, the application correctly ranked as first (62) or second (4) the same result as the investigators in 66 cases, or 88%. Of the remaining nine cases, two were associated with one protein for which the investigator-identified modification ranked sixth according to PROCLAME. The other seven tests did not match the published results. One of these mismatches arose from the listed average mass, which the authors confirmed to be a typographical error. Another nonmatching case, rpL25, is inconsistent with N-terminal Met loss and was noted as such by PROCLAME. In a third disagreement, the protein sequence for rpL26B differed in our database from that used in the published study. Finally, four tests associated with a single protein were listed with masses that are inconsistent with the PTMs hypothesized; in these cases, PROCLAME identified an alternative scenario which appears to be very reasonable. The study authors inferred that the protein produced by the duplicate gene rpL23AB was affected by terminal methionine loss and either seven methylations or four methylations plus an acetylation. Excluding predictions containing phosphorylations, which are unlikely to occur in this milieu, PROCLAME predicted the most likely scenario as an N-terminal Met loss, two acetylations, and a methylation (total p-score, 0.0008). This alternative scenario identified by PROCLAME appears plausible within this biological context and is only 4.5 ppm smaller than the observed mass. Though both of the above test suites were performed with MMA thresholds of 25 ppm or less, we have performed preliminary tests with unpublished data at precisions as low as 150 ppm and found that the application is usually able to propose very reasonable scenarios that are consistent with investigators’ hypotheses.

Summary and Future Plans. The application is currently nearing release for scriptable, command-line use.
This will add significant flexibility to high-throughput operation, particularly on high-performance clusters; this modularization will also, importantly, ease the integration of PROCLAME with other analysis
tools such as GFS for providing genomic input and, hopefully, tools that might correlate peptide data against the application’s match list. Other desirable enhancements include the ability to dynamically add or adjust rules, to allow more context sensitivity for particular studies, and further research on and population of the probability scores for reported co-occurring events such as multiple phosphorylations. The PROCLAME application provides utility to investigators working with intact-protein MS data by offering a list of scenarios that can assist in guiding subsequent characterization. The program performs well in the prediction of event sets having probability scores of ≥0.002 and should improve in reliability as planned refinements are implemented. The simple rule base allows for prediction of novel sets of mass-modifying events and provides an apt framework for improvements in both sensitivity and specificity.

ACKNOWLEDGMENT

The authors thank Atul Shah for editorial and programming assistance and Dr. Clyde Hutchison for guidance on PTMs. We thank Drs. Sang-Won Lee, Ljiljana Paša-Tolić, and Richard Smith for assistance in working with their published data. We are grateful for the opportunity to have begun this work in the laboratory of Drs. Raymond Gesteland and John Atkins, Department of Human Genetics, University of Utah (Salt Lake City, UT). This work was supported by NIH Genome Scholar award HG00044 to M.C.G.

SUPPORTING INFORMATION AVAILABLE

Additional information as noted in the text. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review July 4, 2003. Accepted October 23, 2003. AC034739D