DAPPLE 2: a Tool for the Homology-Based Prediction of Post-Translational Modification Sites


Article

DAPPLE 2: a tool for the homology-based prediction of post-translational modification sites
Brett Trost, Farhad Maleki, Anthony Kusalik, and Scott Napper
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b00304 • Publication Date (Web): 01 Jul 2016


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

DAPPLE 2: a tool for the homology-based prediction of post-translational modification sites

Brett Trost,∗,†,‡ Farhad Maleki,‡ Anthony Kusalik,‡ and Scott Napper†,¶

†Vaccine and Infectious Disease Organization, University of Saskatchewan, Saskatoon, SK, Canada
‡Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada
¶Department of Biochemistry, University of Saskatchewan, Saskatoon, SK, Canada

E-mail: [email protected]
Phone: +1-306-966-1495

Abstract

The post-translational modification of proteins is critical for regulating their function. While many post-translational modification sites have been experimentally determined, particularly in certain model organisms, experimental knowledge of these sites is severely lacking for many species. Thus, it is important to be able to predict sites of post-translational modification in such species. Previously, we described DAPPLE, a tool that facilitates the homology-based prediction of one particular post-translational modification, phosphorylation, in an organism of interest using known phosphorylation sites from other organisms. Here we describe DAPPLE 2, which expands and improves upon DAPPLE in three major ways. First, it predicts sites for many post-translational modifications (20 different types) using data from several sources (15 online databases). Second, it can make predictions approximately 2–7 times faster than DAPPLE, depending on the database size and the organism of interest. Third, it simplifies and accelerates the process of selecting predicted sites of interest by categorizing them based on gene ontology terms, keywords, and signaling pathways. We show that DAPPLE 2 can successfully predict known human post-translational modification sites using, as input, known sites from species that are either closely (e.g., mouse) or distantly (e.g., yeast) related to human. DAPPLE 2 can be accessed at http://saphire.usask.ca/saphire/dapple2.

Keywords: post-translational modifications, phosphorylation, acetylation, ubiquitination, methylation, glycosylation, nitrosylation, homology, cross-species comparisons

Introduction

Post-translational modifications (PTMs) are changes made to proteins after they are translated. PTMs may alter various properties of a protein, including its cellular localization, three-dimensional structure, catalytic activity, and ability to interact with other biomolecules [1]. Many PTMs (including those of interest in this paper) involve the addition of a chemical group to a particular amino acid residue. For example, sulfation is the addition of a sulfate group to a tyrosine residue, and appears to play a role in controlling protein-protein interactions [2], while the addition or removal of the small ubiquitin-related modifier (SUMO) is known to be involved in the regulation of gene expression [3]. While PTM sites can be identified using techniques like mass spectrometry, these methods can be expensive and time-consuming, and are not available to many laboratories. Thus, the ability to predict putative sites of modification is of considerable importance.

Many computational methods for predicting post-translational modification sites have been proposed, most of which are based on machine learning and/or statistics. These methods are too numerous to list exhaustively; for instance, over 40 techniques exist for phosphorylation alone [4], and methods are available for several other PTMs, including glycosylation [5], acetylation [6], ubiquitination [7], and nitrosylation [8].

Previously, we developed DAPPLE, a tool that facilitates the prediction of one particular PTM, phosphorylation [9]. Unlike machine learning-based tools, DAPPLE uses a direct homology-based approach, in which phosphorylation sites in the organism of interest (the “target organism”) are predicted based on sequence similarity to experimentally-determined phosphorylation sites in other organisms. Specifically, experimentally-determined phosphorylation sites are represented in DAPPLE by 15-mer peptides, with the residue known to be phosphorylated in the center. A length of 15 was chosen because it is consistent with the specificity determinants of many kinases, and is the length used by several machine learning-based tools [4,10–12]. For each 15-mer peptide, DAPPLE uses BLAST to search for similar 15-mer peptides in the proteome of the target organism.

As a hypothetical example, suppose that the target organism is mouse, that residue S45 is a known phosphorylation site in the human protein with accession number P12345, and that the 15-mer peptide corresponding to this site is CDEPLMNSQSQSTRY. The serine residue in the central position of this peptide is the one that is known to be phosphorylated. Further, suppose that the closest match to CDEPLMNSQSQSTRY in the mouse proteome is CDEPLMNSNSQSTRY, which occurs in the mouse protein with accession number P67890. Since CDEPLMNSQSQSTRY and CDEPLMNSNSQSTRY differ at only one position, it is likely that the central serine residue in CDEPLMNSNSQSTRY represents a real phosphorylation site.
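The hypothetical example above can be sketched in a few lines of Python. This is only an illustration of the idea, not DAPPLE's implementation: the function names `extract_window` and `count_differences` are invented here, the comparison is ungapped and position-by-position, and returning `None` for sites too close to a protein terminus is an assumption, since the text does not say how such sites are handled.

```python
def extract_window(protein_seq, site_pos, flank=7):
    """Return the peptide centered on the 1-based position site_pos.

    With flank=7 this is a 15-mer. Returns None when the window would run
    past either end of the protein (an assumption made for this sketch).
    """
    start = site_pos - 1 - flank
    end = site_pos + flank
    if start < 0 or end > len(protein_seq):
        return None
    return protein_seq[start:end]


def count_differences(query, match):
    """Count positions at which two equal-length peptides differ (no gaps)."""
    if len(query) != len(match):
        raise ValueError("peptides must have the same length")
    return sum(1 for a, b in zip(query, match) if a != b)


# Peptides from the hypothetical human/mouse example in the text.
query = "CDEPLMNSQSQSTRY"
match = "CDEPLMNSNSQSTRY"
n_diff = count_differences(query, match)
print(f"central residue: {query[7]}, sequence differences: {n_diff}")
```

The two peptides differ only at the ninth position (Q versus N), so the count is one, and the central residue in both is serine.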
We have previously shown that the fewer sequence differences that exist between a query 15-mer and its best match in the target proteome, the more likely it is that the best match represents a known phosphorylation site [13]. For example, an exact match (a match having zero sequence differences) is likely to contain a known phosphorylation site, while a match with seven sequence differences is less likely to contain a known site. Along with the best-matching peptide, DAPPLE reports numerous additional pieces of information regarding each match that allow the user to evaluate its plausibility, including descriptions of the query and matching proteins, the locations of the putative phosphorylation sites in the protein sequences, whether the two proteins are putative orthologues, and so on. For more details, please see the original manuscript describing DAPPLE [9], as well as a follow-up paper [13], which characterizes and evaluates DAPPLE in further detail.

Here we describe DAPPLE 2, a substantial expansion of DAPPLE. DAPPLE 2 contains three major additions and improvements. First, DAPPLE 2 does not focus solely on phosphorylation, but instead facilitates the prediction of 20 different PTMs. Second, DAPPLE 2 has greatly increased speed. Third, DAPPLE 2 facilitates the rapid identification of predicted PTM sites of interest via the inclusion of gene ontology (GO) [14], keyword, and signaling pathway information. Details regarding these improvements, as well as a description of how DAPPLE 2 was validated, are provided in the next section.

Experimental procedures

Additional post-translational modifications

In order to include a large variety of PTMs in DAPPLE 2, a list of databases of experimentally-determined PTM sites was compiled from the literature. While the information provided by each database differed, each record from a given database included, at a minimum, the identifier of a particular protein and the residue in that protein that is known to be post-translationally modified. After collection, the data from each database were converted into a common format. If a given database did not use UniProt identifiers to refer to the protein in which a particular PTM occurs, then the ID mapping feature of UniProt was used to convert the supplied identifiers to UniProt identifiers. A database record was removed if the UniProt identifier corresponding to that record did not exist in the complete UniProt proteome for the corresponding organism, or if the 15-mer peptide corresponding to a given site contained ambiguous amino acids. Redundant database records were removed.
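The filtering rules described above can be illustrated with a short Python sketch. It is not the authors' pipeline: the record layout (accession, 1-based position), the function name `filter_records`, and the decision to drop sites whose 15-mer would extend past a protein terminus are all assumptions made for illustration. "Ambiguous amino acids" is interpreted here as any character outside the 20 standard residues.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def filter_records(records, proteome):
    """Apply the three filters described in the text to (accession, position) records.

    proteome maps UniProt accession -> protein sequence (hypothetical layout).
    Returns a list of (accession, position, 15-mer) tuples.
    """
    seen = set()
    kept = []
    for acc, pos in records:
        seq = proteome.get(acc)
        if seq is None:
            continue  # identifier absent from the complete proteome
        start, end = pos - 8, pos + 7  # 15-mer centered on 1-based pos
        if start < 0 or end > len(seq):
            continue  # assumption: terminal sites are skipped in this sketch
        peptide = seq[start:end]
        if not set(peptide) <= STANDARD_AA:
            continue  # 15-mer contains an ambiguous amino acid
        if (acc, pos) in seen:
            continue  # redundant record
        seen.add((acc, pos))
        kept.append((acc, pos, peptide))
    return kept


proteome = {
    "P1": "MCDEPLMNSQSQSTRYK",
    "P2": "MCDEPLMNXQSQSTRYK",  # X marks an ambiguous residue
}
records = [("P1", 9), ("P1", 9), ("P2", 9), ("P9", 9)]
print(filter_records(records, proteome))
```

Of the four toy records, only one survives: the duplicate, the record with an ambiguous residue, and the record whose accession is missing from the proteome are all removed.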


Table 1 lists the databases from which the data incorporated into DAPPLE 2 were derived, as well as the reference and URL for each. For each PTM that was represented in these databases, Table 2 lists the databases that contained data for that PTM, as well as the aggregate number of unique sites after the filtering procedures described above.

Faster database searching

The majority of the compute time used by DAPPLE is consumed by BLAST, which is used to find homologues of the sites (represented as 15-mer peptides) in the proteome of the target organism. DAPPLE 2 provides the option to use an alternative search method called RAPSearch2 [15]. RAPSearch2 uses a reduced amino acid alphabet and a variable seed length to reduce search time. Both the sensitivity and speed of DAPPLE 2 when using RAPSearch2 as compared to when using BLAST were evaluated using several different combinations of PTM database and target organism. Specifically, four different PTM databases were tested (see also Tables 1 and 2):

1. the CPLM butyrylation database (39 records);
2. the N-ace acetylation database (1215 records);
3. the mUbiDiDa ubiquitination database (48810 records); and
4. the PhosphoSitePlus phosphorylation database (273326 records).

These databases were selected because they spanned the spectrum of database sizes. Each database was tested using two different target organisms: human (89005 proteins) and Saccharomyces cerevisiae strain ATCC 204508 / S288c (6643 proteins). These organisms were selected because their proteomes had disparate sizes, and because they are evolutionarily distant. For measuring both speed and sensitivity, three separate trials were conducted for each combination of database, target organism, and search method (RAPSearch2 or BLAST).


To measure speed, DAPPLE 2 was run on the (otherwise quiescent) SAPHIRE web server (http://saphire.usask.ca), which contains four 3.2 GHz Intel processors and 32 GB of memory, and the elapsed time for each run was recorded. Both RAPSearch2 and BLAST were set to use four threads.

To compare the sensitivity of RAPSearch2 with the sensitivity of BLAST, we answered the following question for each combination of database and target organism: “If BLAST reports a given number of sequence differences between a query peptide (15-mer) and its best match in the target organism, how many sequence differences are reported by RAPSearch2 for that same query peptide?” For instance, suppose that, for a given peptide, both BLAST and RAPSearch2 find a best match in the target organism with two sequence differences. In this particular case, BLAST and RAPSearch2 could be said to be equally sensitive. However, suppose that for another peptide, BLAST finds a best match with two sequence differences but RAPSearch2’s best match has four sequence differences. In this case, BLAST could be said to be more sensitive than RAPSearch2.

To compare the sensitivity of BLAST and RAPSearch2 among all the peptides in a given database, the above question was answered for each peptide. The results were then summarized by computing the percentage of BLAST hits with X sequence differences that had RAPSearch2 hits with Y sequence differences. For example, suppose that, for a hypothetical database and target organism, there were 100 query peptides having X = 2 sequence differences according to BLAST. If 98 of these peptides also had Y = 2 sequence differences according to RAPSearch2, then BLAST and RAPSearch2 could be said to have similar sensitivity for these short-peptide searches. If, however, RAPSearch2 found 60 peptides that had Y = 2 sequence differences and 20 peptides that had Y = 4 sequence differences, and 20 peptides had no match at all, then RAPSearch2 could be said to be substantially less sensitive than BLAST.
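The summary described above is a row-normalized cross-tabulation, and a minimal Python sketch may make it concrete. The input layout (one dictionary per search tool mapping each query peptide to its number of sequence differences), the function name `sensitivity_table`, and the `"none"` label for peptides without a RAPSearch2 match are assumptions made for illustration, not details taken from DAPPLE 2.

```python
from collections import Counter, defaultdict

NO_MATCH = "none"  # label used here for peptides with no RAPSearch2 hit


def sensitivity_table(blast_diffs, rap_diffs):
    """Percentage of BLAST hits with X differences that have Y differences
    according to RAPSearch2.

    blast_diffs: dict peptide -> X; rap_diffs: dict peptide -> Y
    (a peptide missing from rap_diffs counts as NO_MATCH).
    Returns dict X -> dict Y -> percentage of that row.
    """
    counts = defaultdict(Counter)
    for peptide, x in blast_diffs.items():
        y = rap_diffs.get(peptide, NO_MATCH)
        counts[x][y] += 1
    table = {}
    for x, row in counts.items():
        total = sum(row.values())
        table[x] = {y: 100.0 * n / total for y, n in row.items()}
    return table


# The second hypothetical scenario from the text: 100 peptides with X = 2,
# of which 60 have Y = 2, 20 have Y = 4, and 20 have no match.
blast = {f"pep{i}": 2 for i in range(100)}
rap = {f"pep{i}": 2 for i in range(60)}
rap.update({f"pep{i}": 4 for i in range(60, 80)})
table = sensitivity_table(blast, rap)
print(table[2])
```

Each row of the resulting table sums to 100%, which is what makes rows for different values of X directly comparable.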


Improved ability to identify sites of interest

Depending on the size of the PTM database used as input, the number of predictions made by DAPPLE 2 can be very large, making it time-consuming to identify predictions of interest. To remedy this, the output table generated by DAPPLE 2 includes additional columns containing keywords, GO terms [14], and Reactome signaling pathways [16] for both the query protein and the hit protein, allowing the user to easily search for results of interest. This search can be done using the search functions in the user’s text editor or spreadsheet program, or (for more advanced users) a UNIX program like grep. Keywords and GO terms for a given UniProt entry were derived from the UniProt text-format record for that entry. A mapping of UniProt accession numbers to Reactome pathways was downloaded from the Reactome website (http://www.reactome.org/download/current/UniProt2Reactome_All_Levels.txt).
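As an alternative to a text editor or grep, the same kind of search can be scripted. The sketch below is illustrative only: the column names (`query_accession`, `hit_accession`, `hit_keywords`) and the sample rows are hypothetical and are not the actual DAPPLE 2 headers, which are documented in Table S8.

```python
import csv
import io


def find_rows(tsv_text, column, term):
    """Return rows of a tab-delimited table whose `column` contains `term`
    (case-insensitive substring match)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    term = term.lower()
    return [row for row in reader if term in (row.get(column) or "").lower()]


# Hypothetical two-row output table; real column names may differ.
SAMPLE = (
    "query_accession\thit_accession\thit_keywords\n"
    "P12345\tP67890\tKinase; Transferase\n"
    "P11111\tP22222\tHydrolase\n"
)

for row in find_rows(SAMPLE, "hit_keywords", "kinase"):
    print(row["query_accession"], row["hit_accession"], row["hit_keywords"])
```

Restricting the match to a named column avoids false hits that a plain text search would pick up from protein descriptions or other fields.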

Validation

The same procedure that was previously used to validate DAPPLE for phosphorylation sites [13] was used to validate DAPPLE 2 for three additional PTMs: acetylation, glycosylation, and ubiquitination. These three PTMs were chosen because, after phosphorylation, they had the greatest numbers of known sites. Briefly, DAPPLE 2 was used to identify PTMs in the human proteome starting with queries from species other than human. We then ascertained the relationship between the number of sequence differences between a query 15-mer and its best match in the human proteome, and the likelihood that the best match is a known human PTM site.


Results

Additional post-translational modifications

Fifteen PTM databases, containing data for 20 different PTMs, were compiled from the literature (Tables 1 and 2). The total number of unique sites among all PTMs was 630786. The post-translational modification with the greatest number of unique sites was phosphorylation, with 379247. Phosphorylation data were also present in more databases than any other PTM (eight databases). Other prominent PTMs, both in terms of the number of unique sites and the number of databases that contained data for them, included ubiquitination (149689 unique sites in five databases), acetylation (70650 unique sites in six databases), glycosylation (12576 unique sites in two databases), methylation (8724 unique sites in three databases), and nitrosylation (4050 unique sites in two databases). Most PTMs had at least 100 unique sites; exceptions to this included butyrylation (39), crotonylation (58), malonylation (32), and propionylation (49).

Faster database searching

As described above, DAPPLE 2 gives the user the option of choosing BLAST or RAPSearch2 for the purposes of searching for homologues of the 15-mer peptides in the proteome of the target organism. Empirical tests were done using four PTM databases and two target organisms in order to evaluate how the two search tools compare in terms of speed and sensitivity. Timing results showed that the advantage in speed of RAPSearch2 over BLAST was negligible for small databases, but substantial (2–7× faster) for large databases (Figure 1). For example, using database #1 (39 records), as defined above, and with the human proteome as the target, the average time taken by DAPPLE 2 when using RAPSearch2 was 3.7 minutes versus 4.0 minutes for BLAST. In contrast, when using database #4 (273326 records) and the proteome of S. cerevisiae as the target, the average time taken by DAPPLE 2 was 21.3 minutes when using RAPSearch2 versus 147.0 minutes for BLAST.

To determine whether this increase in speed comes at a significant cost in terms of sensitivity, the sensitivity of RAPSearch2 was compared to that of BLAST. These results are summarized in Table 3 (for database #4 and human as the target organism) and Tables S1–S7 (the remaining combinations of database and target organism). As an example of reading these tables, consider the row with heading “4” in Table 3. This row describes, for peptides whose best match according to BLAST had four sequence differences in the human proteome, what percentage of these peptides had a given number of sequence differences with their best match according to RAPSearch2. Specifically, none of these peptides had zero or one sequence differences according to RAPSearch2; 0.3% and 5% had two and three differences, respectively. For a majority of these peptides (58.6%), the number of sequence differences reported by BLAST and the number reported by RAPSearch2 were equal (four). However, a sizable percentage (32.8%) of the peptides with four sequence differences according to BLAST had either more than 7 sequence differences according to RAPSearch2, or had no match at all in the human proteome (last column), indicating that BLAST exhibited much greater sensitivity for many peptides.

Overall, for query peptides with a close match in the target proteome, the sensitivities of the two tools were nearly identical; however, the weaker the match, the less sensitive RAPSearch2 was compared to BLAST. This suggests that, if primarily strong matches are expected (e.g., if the target organism is closely related to those represented in the PTM database), then RAPSearch2 would be preferable. However, if weaker matches are expected (e.g., if the target organism is distantly related to those represented in the PTM database), then selecting BLAST would be preferable despite the extra computational time.
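The figures quoted in this subsection can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text; the helper name `speedup` is invented here, and the final step assumes that each row of Table 3 sums to 100%, so that whatever is not accounted for by the quoted columns must fall in the columns for five, six, or seven differences.

```python
def speedup(blast_minutes, rapsearch_minutes):
    """Ratio of BLAST run time to RAPSearch2 run time, rounded to one decimal."""
    return round(blast_minutes / rapsearch_minutes, 1)


# Average run times quoted in the text (minutes).
small_db = speedup(4.0, 3.7)     # database #1, human target
large_db = speedup(147.0, 21.3)  # database #4, S. cerevisiae target

# Row "4" of Table 3 as quoted: percentages for 0, 1, 2, 3 and 4 differences,
# plus the last column (more than 7 differences or no match).
quoted = [0.0, 0.0, 0.3, 5.0, 58.6, 32.8]
remainder = round(100.0 - sum(quoted), 1)  # share left for 5-7 differences

print(small_db, large_db, remainder)
```

The small-database case shows essentially no gain, while the large-database case sits at the upper end of the reported 2–7× range.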

Validation

Unlike PTM prediction methods that are based on machine learning, the traditional metrics of sensitivity and specificity are problematic to calculate for DAPPLE and DAPPLE 2, making it difficult to determine their accuracy. For further explanation, see the Discussion section of this paper, as well as a previous paper describing DAPPLE [13]. Nonetheless, it is possible to evaluate the usefulness of DAPPLE 2 by answering the question: “How does the number of sequence differences between a query 15-mer (containing an experimentally-determined PTM site) and its best match in the target proteome affect the likelihood that the best match contains an experimentally-determined PTM site?” There are two goals associated with answering this question: first, to show that DAPPLE 2 can indeed yield true positive matches, and second, to assess the relationship between the number of sequence differences and the likelihood that the match is a known PTM site.

Using the same methodology as was previously employed for phosphorylation sites [13], here we have addressed the above question for three additional PTMs: acetylation, glycosylation, and ubiquitination. BLAST was used as the search method, as it was shown in the previous section to have higher sensitivity than RAPSearch2. The results are given in Table 4, which shows that DAPPLE 2 indeed yields a large number of true positive matches for all three PTM types. For instance, of the 9247 known mouse acetylation sites whose corresponding 15-mer had zero sequence differences with its best match in the human proteome, 3817 (41.3%) were known human sites. Because not all human PTM sites have been discovered, these numbers must be interpreted with caution. Specifically, in the previous example, it is not the case that the remaining 5430 sites are all false positives; rather, many are likely to be real human acetylation sites that have yet to be discovered.

The data also show that the usefulness of DAPPLE 2 is not limited to situations in which the known PTMs are from species that are closely related to the target species. For instance, the percentage of ubiquitination sites from S. cerevisiae with two sequence differences in the human proteome that are known human ubiquitination sites is high (93/139 = 66.9%).
This shows that known PTM sites from even distantly related species can be used to identify human sites. (Interestingly, this percentage is higher than the corresponding percentages for rat and mouse: 34.4% and 36.1%, respectively. This observation should also be interpreted with caution: there is undoubtedly substantial bias in which PTM sites have been examined in prior studies, and it is possible that the PTM sites in S. cerevisiae that have been well-studied are those that are more likely to be conserved in a wide variety of lineages.)

Table 4 also shows that, consistent with our previous results [13], the percentage of matches that are known human PTM sites decreases as the number of sequence differences increases. Thus, when interpreting output from DAPPLE 2, it is prudent to first consider selecting putative sites that have few or no sequence differences relative to the corresponding experimentally-determined site. Finally, from Table 4, one can also easily calculate how many potentially novel sites are predicted by DAPPLE 2. As an example, for known acetylation sites from Bos taurus having matches in the human proteome with zero sequence differences, the number of predicted sites that are not known human sites (and thus are potentially real sites that have yet to be discovered) is 107 − 71 = 36.
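The two quantities read off Table 4 in this section, the percentage of matches that are known human sites and the number of potentially novel sites, follow from the same pair of counts. The sketch below reproduces the three examples quoted in the text; the function name `summarize` and the dictionary keys are illustrative choices, not part of DAPPLE 2.

```python
def summarize(n_matches, n_known):
    """Summarize one cell of a Table 4-style validation result.

    n_matches: query sites whose best human match has a given number of differences
    n_known:   how many of those matches are already known human PTM sites
    """
    return {
        "percent_known": round(100.0 * n_known / n_matches, 1),
        "potentially_novel": n_matches - n_known,
    }


# Counts quoted in the text.
mouse_acetylation = summarize(9247, 3817)    # zero differences
yeast_ubiquitination = summarize(139, 93)    # two differences
cattle_acetylation = summarize(107, 71)      # zero differences

print(mouse_acetylation, yeast_ubiquitination, cattle_acetylation)
```

As the text cautions, the "potentially novel" count is an upper bound on new discoveries rather than a count of false positives, since it mixes undiscovered real sites with incorrect predictions.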

Input and output

DAPPLE 2 uses a simple web-based user interface. There are five parameters: the target organism (the organism for which predictions are desired), the PTM of interest, the database of known PTM sites, the search method (RAPSearch2 or BLAST), and the maximum number of matches per query PTM site. For the database of known PTM sites, the user has the option of selecting either an individual database, or a non-redundant aggregation of the data from all of the databases that contained data for the chosen PTM. The target organism can be selected from any of the 432 organisms for which complete UniProt proteomes are currently available. Once a job is complete, the user will receive an e-mail containing a link to the results in spreadsheet-compatible, tab-delimited text format. In order to list the highest-confidence predictions first, the results are sorted in order of increasing number of sequence differences between the query 15-mer peptide and its best match in the proteome of the target organism. Table S8 contains a guide to interpreting the columns of the output table generated by DAPPLE 2, while sample DAPPLE 2 output can be accessed at http://saphire.usask.ca/saphire/dapple2/sample_output.txt.
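To illustrate how two of these settings shape the output, the sketch below orders a handful of hits so that those with the fewest sequence differences come first and keeps at most a fixed number of matches per query site. The record fields (`query`, `hit`, `differences`) and the function name `rank_predictions` are hypothetical; how DAPPLE 2 itself applies the maximum-matches parameter is not described in the text, so this is only one plausible reading.

```python
def rank_predictions(hits, max_matches):
    """Sort hits by increasing number of sequence differences and keep at
    most max_matches hits per query site (an assumed interpretation)."""
    ordered = sorted(hits, key=lambda h: h["differences"])  # stable sort
    kept, per_query = [], {}
    for hit in ordered:
        n = per_query.get(hit["query"], 0)
        if n < max_matches:
            kept.append(hit)
            per_query[hit["query"]] = n + 1
    return kept


hits = [
    {"query": "Q1", "hit": "H1", "differences": 3},
    {"query": "Q1", "hit": "H2", "differences": 0},
    {"query": "Q2", "hit": "H3", "differences": 1},
    {"query": "Q1", "hit": "H4", "differences": 5},
]
ranked = rank_predictions(hits, max_matches=2)
print([h["hit"] for h in ranked])
```

Because the list is sorted before truncation, the matches retained for each query are always its closest ones.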


Discussion

In this paper, we have described DAPPLE 2, which allows the user to predict sites for many different PTMs in a wide variety of organisms. As verification of the methodology, we have shown that it can successfully predict many sites that have already been experimentally determined to exist.

The homology-based approach used by DAPPLE 2 is different from other methods for predicting PTM sites, most of which use machine learning approaches. Homology-based prediction has a number of advantages over machine learning-based methods. First, it relies entirely on sequence conservation, which is well-established as a powerful predictor of structure and function. Second, DAPPLE 2 uses a simple, transparent procedure that does not involve the use of more opaque models like artificial neural networks; the user can easily see and understand why a particular prediction was made. Third, and perhaps most importantly, machine learning methods typically require a substantial amount of training data in order to create an accurate predictive model, and thus would not work well for PTMs for which little training data are available, such as butyrylation and propionylation (see Table 2). In contrast, the accuracy of DAPPLE 2 is not affected by the number of known sites for a particular PTM.

There are also disadvantages associated with DAPPLE 2’s homology-based prediction strategy. First, the potential number of predicted sites in the target organism is limited by the number of known sites in other organisms. For instance, if there are only 50 known sites for a particular PTM, then no more than 50 predicted sites could be identified in the target organism. This is in contrast to machine learning-based methods, which could theoretically identify all actual sites in the target organism (although, due to limited accuracy, would not typically do so in practice). Second, because machine learning-based methods may incorporate subtle sequence-based features into their models, they have the potential to identify putative PTM sites that would not be identified by the direct detection of sequence conservation.

ACS Paragon Plus Environment

Page 12 of 29

Page 13 of 29

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

The discussion above represents a qualitative comparison between DAPPLE 2 and machine learning-based methods. It would also be of interest to perform a quantitative comparison; however, DAPPLE 2’s direct use of sequence homology makes such a comparison difficult. The reasons for this have been discussed previously (see Supplementary File S2 of Trost et al. [13]), and thus will not be reiterated here. However, with appreciation of the obstacles to applying traditional measures of accuracy, this paper (as well as a previous work [13]) provided a measure of validation for DAPPLE 2 by posing the following question: “How does the number of sequence differences between a query 15-mer (containing an experimentally-determined PTM site) and its best match in the target proteome affect the likelihood that the best match contains an experimentally-determined PTM site?” By answering this question, we have shown that DAPPLE 2 indeed gives many true positive matches. Additionally, it was found that the lower the number of sequence differences, the higher the likelihood that the match is an experimentally-determined PTM site. This suggests that, when selecting DAPPLE 2 matches for further consideration, users should preferentially select matches with few or no sequence differences.

As with all classification methods, the accuracy and usefulness of DAPPLE 2 are inextricably linked to the quality of the input data. Thus, it is important for users of DAPPLE 2 to be able to assess the degree of confidence in each known PTM site on which its predictions are based. The output table generated by DAPPLE 2 contains two columns that facilitate this assessment.
Specifically, the “high-throughput references” column contains the number of references that describe the use of high-throughput techniques, such as mass spectrometry, to discover and/or characterize a particular PTM site, while the “low-throughput references” column contains the number of references that describe the use of low-throughput techniques, such as site-directed mutagenesis. When selecting predicted PTM sites in the organism of interest (which may form the basis of further research), we encourage users of DAPPLE 2 to carefully consider the number of high- and low-throughput references. Of course, greater confidence can be placed in the existence of a particular PTM site if there are several references supporting it. However, the technique(s) used are also important; for example, one would generally be more confident in a site identified using low-throughput techniques than in one identified solely by high-throughput techniques.

In the original version of DAPPLE, which predicted only phosphorylation sites, the output table contained four columns that indicate the degree of similarity between the query 15-mer and its best match in the target proteome. Although there is no generally accepted length for protein kinase recognition sequences, a length of 15 was chosen because it is consistent with the set of residues considered to be important for recognition. 4 Thus, the first such column contains the number of sequence differences between the entire query 15-mer and its closest match, while the second column contains the number of non-conservative sequence differences between the entire query 15-mer and its closest match. The third and fourth columns contain the same information, except that the number of differences is calculated only for the nine-residue window surrounding the modified residue (i.e., the modified residue plus four residues on either side). The purpose of also including the 9-mer columns in the original version of DAPPLE was that the residues closer to the modified residue may have a greater impact on the ability of the catalyzing enzyme to recognize the site than residues that are farther away. 18 In DAPPLE 2, there is a second reason why the 9-mer columns are useful: some PTMs may have more local modification sequences, and thus only residues closer to the modified residue may be relevant. Therefore, users of DAPPLE 2 should consider the characteristics of the particular PTM being predicted when deciding whether to use the 15-mer sequence difference columns or the 9-mer sequence difference columns.

A promising avenue for future work on DAPPLE 2 would be to incorporate filters based on known or predicted properties of a given protein.
For instance, many post-translational modifications are generally considered to occur only for intracellular proteins or only for extracellular proteins. Phosphorylation, for example, is generally limited to intracellular proteins. Thus, DAPPLE 2 could potentially be made faster and more accurate when predicting phosphorylation by performing a pre-processing step to determine, and limit the search to, intracellular proteins. Although theoretically appealing, there would be multiple challenges associated with such a strategy. First, for a significant portion of the proteome, cellular localization is not known and would thus have to be computationally predicted. Although computational methods for localization prediction are reasonably accurate (see Wan et al. 19 and references therein), it is not clear whether adding more layers of prediction (i.e., predicting the localization of a protein as well as the phosphorylation of residues within that protein) would lead to an overall increase in accuracy. Second, while many PTMs are thought to be specific to either intracellular or extracellular locales, there is evidence that these distinctions may not be as strong as previously thought. For example, while phosphorylation was once considered to be an intracellular-only event, there is a growing number of examples of extracellular phosphorylation. 20 Despite these challenges, it is plausible that incorporating location data into DAPPLE 2, if done carefully, could increase its accuracy.
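The four sequence-difference columns described in this discussion (total and non-conservative differences, each over the full 15-mer and over the central nine-residue window) can be illustrated with a short Python sketch. This is not DAPPLE's code. The function names, the dictionary keys, and especially the residue grouping used to decide whether a substitution is conservative are assumptions made for illustration; the text does not specify which substitution scheme DAPPLE uses.

```python
# ASSUMPTION: illustrative residue groups; the paper does not specify how
# DAPPLE classifies a substitution as conservative.
ILLUSTRATIVE_GROUPS = ["AVLIMC", "FWY", "ST", "NQ", "DE", "KRH", "G", "P"]
GROUP_OF = {aa: i for i, group in enumerate(ILLUSTRATIVE_GROUPS) for aa in group}


def is_conservative(a, b):
    """True if two residues fall in the same illustrative group."""
    return a in GROUP_OF and b in GROUP_OF and GROUP_OF[a] == GROUP_OF[b]


def count_differences(query, match, non_conservative_only=False):
    """Count mismatched positions between two equal-length peptides."""
    if len(query) != len(match):
        raise ValueError("peptides must have the same length")
    return sum(
        1
        for a, b in zip(query, match)
        if a != b and not (non_conservative_only and is_conservative(a, b))
    )


def difference_columns(query, match, flank=4):
    """Return the four difference counts for a peptide centered on the site."""
    if len(query) % 2 == 0:
        raise ValueError("peptide length must be odd so the site is centered")
    center = len(query) // 2
    # Modified residue plus `flank` residues on either side (9-mer for flank=4).
    window = slice(center - flank, center + flank + 1)
    return {
        "diff_full": count_differences(query, match),
        "nonconservative_diff_full": count_differences(query, match, True),
        "diff_window": count_differences(query[window], match[window]),
        "nonconservative_diff_window": count_differences(
            query[window], match[window], True
        ),
    }


# Hypothetical peptides: three mismatches overall, one inside the 9-mer window.
columns = difference_columns("AAAAAAASAAAAAAA", "VAAAAADSAAAAAAK")
```

In this made-up example, the A-to-V change at the first position counts as conservative under the assumed grouping, so the full-peptide counts differ between the total and non-conservative columns, while the single mismatch inside the window is non-conservative in both.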

Conclusions

PTMs play a critical role in regulating protein function. 1 The mechanisms by which they modulate function are complex, with one particular modification often constituting the signal for other modifications. These modifications then collectively influence the behavior of the protein. The exact relationship between combinations of PTMs and function, a language that has been dubbed the “PTM code”, 21 remains poorly understood. However, efforts have been made to catalogue, predict, and characterize the PTM code. For example, PTMcode 2 is a database of known and predicted associations between PTMs, and currently contains data from 19 eukaryotic species. 22,23

Identifying PTM sites is the first step in understanding the PTM code. In this paper, we have described DAPPLE 2, a web-based tool that allows users to identify putative PTM sites in an organism of interest using known sites from other organisms. DAPPLE 2 features three primary improvements over the original DAPPLE: the ability to predict sites for many post-translational modifications, not just phosphorylation; the option to reduce search times by using RAPSearch2 rather than BLAST; and the inclusion of GO term, keyword, and pathway information, which substantially reduces the effort involved in identifying results that are related to a particular biological pathway or process of interest. Given its versatility, we believe that DAPPLE 2 should be a valuable tool for a wide range of researchers.

Supporting Information Available

Please see the file supporting_information.pdf for the following pieces of supporting information.

Table S1: Sensitivity comparison between BLAST and RAPSearch2 using database #1 (see text) as the database and human as the target proteome.

Table S2: Sensitivity comparison between BLAST and RAPSearch2 using database #2 (see text) as the database and human as the target proteome.

Table S3: Sensitivity comparison between BLAST and RAPSearch2 using database #3 (see text) as the database and human as the target proteome.

Table S4: Sensitivity comparison between BLAST and RAPSearch2 using database #1 (see text) as the database and Saccharomyces cerevisiae strain ATCC 204508 / S288c as the target proteome.

Table S5: Sensitivity comparison between BLAST and RAPSearch2 using database #2 (see text) as the database and Saccharomyces cerevisiae strain ATCC 204508 / S288c as the target proteome.

Table S6: Sensitivity comparison between BLAST and RAPSearch2 using database #3 (see text) as the database and Saccharomyces cerevisiae strain ATCC 204508 / S288c as the target proteome.

Table S7: Sensitivity comparison between BLAST and RAPSearch2 using database #4 (see text) as the database and Saccharomyces cerevisiae strain ATCC 204508 / S288c as the target proteome.

Table S8: Meanings of the columns in DAPPLE 2’s output table.

This material is available free of charge via the Internet at http://pubs.acs.org/.

References

(1) Hunter, T. The age of crosstalk: phosphorylation, ubiquitination, and beyond. Mol Cell 2007, 28, 730–8.
(2) Moore, K. L. The biology and enzymology of protein tyrosine O-sulfation. J Biol Chem 2003, 278, 24243–6.
(3) Gill, G. Post-translational modification by the small ubiquitin-related modifier SUMO has big effects on transcription factor activity. Curr Opin Genet Dev 2003, 13, 108–13.
(4) Trost, B.; Kusalik, A. Computational prediction of eukaryotic phosphorylation sites. Bioinformatics 2011, 27, 2927–35.
(5) Hamby, S. E.; Hirst, J. D. Prediction of glycosylation sites using random forests. BMC Bioinformatics 2008, 9, 500.
(6) Hou, T.; Zheng, G.; Zhang, P.; Jia, J.; Li, J.; Xie, L.; Wei, C.; Li, Y. LAceP: lysine acetylation site prediction using logistic regression classifiers. PLoS One 2014, 9, e89575.
(7) Radivojac, P.; Vacic, V.; Haynes, C.; Cocklin, R. R.; Mohan, A.; Heyen, J. W.; Goebl, M. G.; Iakoucheva, L. M. Identification, analysis, and prediction of protein ubiquitination sites. Proteins 2010, 78, 365–80.
(8) Xue, Y.; Liu, Z.; Gao, X.; Jin, C.; Wen, L.; Yao, X.; Ren, J. GPS-SNO: computational prediction of protein S-nitrosylation sites with a modified GPS algorithm. PLoS One 2010, 5, e11290.
(9) Trost, B.; Arsenault, R.; Griebel, P.; Napper, S.; Kusalik, A. DAPPLE: a pipeline for the homology-based prediction of phosphorylation sites. Bioinformatics 2013, 29, 1693–5.
(10) Yaffe, M. B.; Leparc, G. G.; Lai, J.; Obata, T.; Volinia, S.; Cantley, L. C. A motif-based profile scanning approach for genome-wide prediction of signaling pathways. Nat Biotechnol 2001, 19, 348–53.
(11) Xue, Y.; Ren, J.; Gao, X.; Jin, C.; Wen, L.; Yao, X. GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy. Mol Cell Proteomics 2008, 7, 1598–608.
(12) Biswas, A. K.; Noman, N.; Sikder, A. R. Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information. BMC Bioinformatics 2010, 11, 273.
(13) Trost, B.; Napper, S.; Kusalik, A. Case study: using sequence homology to identify putative phosphorylation sites in an evolutionarily distant species (honeybee). Brief Bioinform 2015, 16, 820–9.
(14) Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nat Genet 2000, 25, 25–9.
(15) Zhao, Y.; Tang, H.; Ye, Y. RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics 2012, 28, 125–6.
(16) Croft, D. et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res 2011, 39, D691–7.
(17) Blom, N.; Sicheritz-Pontén, T.; Gupta, R.; Gammeltoft, S.; Brunak, S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics 2004, 4, 1633–49.
(18) Neuberger, G.; Schneider, G.; Eisenhaber, F. pkaPS: prediction of protein kinase A phosphorylation sites with the simplified kinase-substrate binding model. Biol Direct 2007, 2, 1.
(19) Wan, S.; Mak, M.-W.; Kung, S.-Y. Sparse regressions for predicting and interpreting subcellular localization of multi-label proteins. BMC Bioinformatics 2016, 17, 97.
(20) Yalak, G.; Vogel, V. Extracellular phosphorylation and phosphorylated proteins: not just curiosities but physiologically important. Sci Signal 2012, 5, re7.
(21) Creixell, P.; Linding, R. Cells, shared memory and breaking the PTM code. Mol Syst Biol 2012, 8, 598.
(22) Minguez, P.; Letunic, I.; Parca, L.; Bork, P. PTMcode: a database of known and predicted functional associations between post-translational modifications in proteins. Nucleic Acids Res 2013, 41, D306–11.
(23) Minguez, P.; Letunic, I.; Parca, L.; Garcia-Alonso, L.; Dopazo, J.; Huerta-Cepas, J.; Bork, P. PTMcode v2: a resource for functional associations of post-translational modifications within and between proteins. Nucleic Acids Res 2015, 43, D494–502.
(24) Wang, L.; Du, Y.; Lu, M.; Li, T. ASEB: a web server for KAT-specific acetylation site prediction. Nucleic Acids Res 2012, 40, W376–9.
(25) Liu, Z.; Wang, Y.; Gao, T.; Pan, Z.; Cheng, H.; Yang, Q.; Cheng, Z.; Guo, A.; Ren, J.; Xue, Y. CPLM: a database of protein lysine modifications. Nucleic Acids Res 2014, 42, D531–6.
(26) Lu, C.-T.; Huang, K.-Y.; Su, M.-G.; Lee, T.-Y.; Bretaña, N. A.; Chang, W.-C.; Chen, Y.-J.; Chen, Y.-J.; Huang, H.-D. DbPTM 3.0: an informative resource for investigating substrate site specificity and functional association of protein post-translational modifications. Nucleic Acids Res 2013, 41, D295–305.
(27) Lee, T.-Y.; Chen, Y.-J.; Lu, C.-T.; Ching, W.-C.; Teng, Y.-C.; Huang, H.-D.; Chen, Y.-J. dbSNO: a database of cysteine S-nitrosylation. Bioinformatics 2012, 28, 2293–5.
(28) Johansen, M. B.; Kiemer, L.; Brunak, S. Analysis and prediction of mammalian protein glycation. Glycobiology 2006, 16, 844–53.
(29) Chen, T.; Zhou, T.; He, B.; Yu, H.; Guo, X.; Song, X.; Sha, J. mUbiSiDa: a comprehensive database for protein ubiquitination sites in mammals. PLoS One 2014, 9, e85744.
(30) Lee, T.-Y.; Hsu, J. B.-K.; Lin, F.-M.; Chang, W.-C.; Hsu, P.-C.; Huang, H.-D. NAce: using solvent accessibility and physicochemical properties to identify protein N-acetylation sites. J Comput Chem 2010, 31, 2759–71.
(31) Yao, Q.; Bollinger, C.; Gao, J.; Xu, D.; Thelen, J. J. P(3)DB: an integrated database for plant protein phosphorylation. Front Plant Sci 2012, 3, 206.
(32) Gnad, F.; Gunawardena, J.; Mann, M. PHOSIDA 2011: the posttranslational modification database. Nucleic Acids Res 2011, 39, D253–60.
(33) Durek, P.; Schmidt, R.; Heazlewood, J. L.; Jones, A.; MacLean, D.; Nagel, A.; Kersten, B.; Schulze, W. X. PhosPhAt: the Arabidopsis thaliana phosphorylation site database. An update. Nucleic Acids Res 2010, 38, D828–34.
(34) Dinkel, H.; Chica, C.; Via, A.; Gould, C. M.; Jensen, L. J.; Gibson, T. J.; Diella, F. Phospho.ELM: a database of phosphorylation sites–update 2011. Nucleic Acids Res 2011, 39, D261–7.
(35) Sadowski, I.; Breitkreutz, B.-J.; Stark, C.; Su, T.-C.; Dahabieh, M.; Raithatha, S.; Bernhard, W.; Oughtred, R.; Dolinski, K.; Barreto, K.; Tyers, M. The PhosphoGRID Saccharomyces cerevisiae protein phosphorylation site database: version 2.0 update. Database (Oxford) 2013, 2013, bat026.
(36) Yang, C.-Y.; Chang, C.-H.; Yu, Y.-L.; Lin, T.-C. E.; Lee, S.-A.; Yen, C.-C.; Yang, J.-M.; Lai, J.-M.; Hong, Y.-R.; Tseng, T.-L.; Chao, K.-M.; Huang, C.-Y. F. PhosphoPOINT: a comprehensive human kinase interactome and phospho-protein database. Bioinformatics 2008, 24, i14–20.
(37) Hornbeck, P. V.; Kornhauser, J. M.; Tkachev, S.; Zhang, B.; Skrzypek, E.; Murray, B.; Latham, V.; Sullivan, M. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Res 2012, 40, D261–70.
(38) Chernorudskiy, A. L.; Garcia, A.; Eremin, E. V.; Shorina, A. S.; Kondratieva, E. V.; Gainullin, M. R. UbiProt: a database of ubiquitylated proteins. BMC Bioinformatics 2007, 8, 126.
(39) Hart, G. W.; Akimoto, Y. In Essentials of Glycobiology; Varki, A., Cummings, R. D., Esko, J. D., Freeze, H. H., Stanley, P., Bertozzi, C. R., Hart, G. W., Etzler, M. E., Eds.; Cold Spring Harbor Laboratory Press, 2008; Chapter 18.

Tables

Table 1: List of databases of experimentally-determined PTM sites used by DAPPLE 2. Note that ASEB and N-Ace are not formal databases, but rather are datasets compiled for the purposes of training a predictor. Abbreviations are as follows: ASEB, acetylation set enrichment based; CPLM, compendium of protein lysine modifications; dbPTM, database of post-translational modifications; dbSNO, database of S-nitrosylation; mUbiSiDa, mammalian ubiquitination site database; P3DB, plant protein phosphorylation database; PhosPhAt, Arabidopsis protein phosphorylation site database.

| Database name | Reference | URL |
|---|---|---|
| ASEB | 24 | http://bioinfo.bjmu.edu.cn/huac/ |
| CPLM | 25 | http://cplm.biocuckoo.org |
| dbPTM | 26 | http://dbptm.mbc.nctu.edu.tw |
| dbSNO | 27 | http://dbsno.mbc.nctu.edu.tw |
| GlycateBase | 28 | http://www.cbs.dtu.dk/databases/GlycateBase-1.0 |
| mUbiSiDa | 29 | http://reprod.njmu.edu.cn/mUbiSiDa |
| N-Ace | 30 | http://n-ace.mbc.nctu.edu.tw |
| P3DB | 31 | http://www.p3db.org |
| PHOSIDA | 32 | http://www.phosida.com |
| PhosPhAt | 33 | http://phosphat.uni-hohenheim.de |
| Phospho.ELM | 34 | http://phospho.elm.eu.org |
| PhosphoGRID | 35 | http://www.phosphogrid.org |
| PhosphoPOINT | 36 | http://kinase.bioinformatics.tw |
| PhosphoSitePlus | 37 | http://www.phosphosite.org |
| UbiProt | 38 | http://ubiprot.org.ru/index.php |

Table 2: Databases containing data for each PTM. Databases are as follows: a, ASEB; b, CPLM; c, dbPTM; d, dbSNO; e, GlycateBase; f, mUbiSiDa; g, N-Ace; h, P3DB; i, PHOSIDA; j, PhosPhAt; k, Phospho.ELM; l, PhosphoGRID; m, PhosphoPOINT; n, PhosphoSitePlus; o, UbiProt. The parentheses after each letter contain the number of sites unique to that database (number before the semicolon) and the number of sites in that database that were found in at least one other database (number after the semicolon) for that PTM. The last column contains the number of unique sites when data from the databases were combined. “O-β-GlcNAcylation” refers to the addition of β-linked N-acetylglucosamine to serine or threonine residues, 39 while “Glycosylation” includes other types of O-linked glycosylation sites, as well as C-linked and N-linked sites.

| PTM | Databases | Total Unique Sites |
|---|---|---|
| Acetylation | a(100;4246), b(36515;9938), c(1616;7679), g(22;1193), i(973;2472), n(19628;10594) | 70650 |
| Butyrylation | b(39;0) | 39 |
| Carboxylation | c(232;0) | 232 |
| Crotonylation | b(58;0) | 58 |
| Glycation | b(279;36), e(34;36) | 349 |
| Glycosylation | c(7112;529), i(4934;529) | 12576 |
| Malonylation | b(32;0) | 32 |
| Myristoylation | c(148;0) | 148 |
| Methylation | b(437;491), c(192;813), n(7021;960) | 8724 |
| Nitrosylation | c(30;2982), d(1038;2982) | 4050 |
| O-β-GlcNAcylation | c(607;0) | 607 |
| Palmitoylation | c(237;0) | 237 |
| Prenylation | c(127;0) | 127 |
| Phosphoglycerylation | b(136;0) | 136 |
| Phosphorylation | c(16002;131941), h(24682;3921), i(26309;35616), j(5208;3931), k(1139;34492), l(9177;10845), m(65;10235), n(157189;116137) | 379247 |
| Propionylation | b(49;0) | 49 |
| Succinylation | b(2417;0) | 2417 |
| Sulfation | c(114;0) | 114 |
| Sumoylation | b(323;545), c(118;663), n(154;661) | 1305 |
| Ubiquitination | b(76598;51034), c(217;22983), f(5052;43758), n(15222;45119), o(0;185) | 149689 |
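The three kinds of counts reported in Table 2 (sites unique to a database, sites shared with at least one other database, and total unique sites) correspond to simple set operations. The sketch below shows one way such counts could be derived. It is not the authors' code, and the representation of a site as a (protein accession, position) pair, the function name, and the toy data are all assumptions made for illustration.

```python
def unique_and_shared(sites_by_db):
    """Summarize overlap between databases for a single PTM.

    sites_by_db maps a database label to a set of site identifiers.
    Returns ({label: (n_unique, n_shared)}, n_total_unique).
    """
    summary = {}
    for name, sites in sites_by_db.items():
        # All sites present in any of the other databases.
        others = set().union(
            *(s for other, s in sites_by_db.items() if other != name)
        )
        shared = sites & others
        summary[name] = (len(sites - shared), len(shared))
    total_unique = len(set().union(*sites_by_db.values()))
    return summary, total_unique


# Hypothetical toy data: each site is a (protein accession, position) pair.
toy = {
    "b": {("P1", 10), ("P1", 25), ("P2", 7)},
    "e": {("P2", 7), ("P3", 40)},
}
summary, total = unique_and_shared(toy)
```

With only two databases, the shared count is necessarily the same for both, which matches rows of Table 2 such as Glycation, where 279 + 34 + 36 = 349.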

Table 3: Sensitivity comparison between BLAST and RAPSearch2 using database #4 (the PhosphoSitePlus phosphorylation database; see text) as the database and human as the target organism. The row and column headings represent the number of sequence differences between a given query peptide (15-mer) and its best match in the target organism for BLAST and RAPSearch2, respectively. The number at row x and column y represents the percentage of query peptides with x sequence differences according to BLAST that had y sequence differences according to RAPSearch2. The numbers in each row sum to 100%. † Represents cases in which the best match for a given peptide had either eight or more sequence differences, or there was no match whatsoever in the target organism.

Rows: # seq. diff. (BLAST). Columns: # seq. diff. (RAPSearch2).

| BLAST \ RAPSearch2 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | † |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 100.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 99.9 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.1 | 94.7 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 3.7 |
| 3 | 0.0 | 0.0 | 3.6 | 79.0 | 0.4 | 0.4 | 0.6 | 0.5 | 15.6 |
| 4 | 0.0 | 0.0 | 0.3 | 5.0 | 58.6 | 0.9 | 1.3 | 1.0 | 32.8 |
| 5 | 0.0 | 0.0 | 0.1 | 0.9 | 7.2 | 36.3 | 2.2 | 1.5 | 51.8 |
| 6 | 0.0 | 0.1 | 0.1 | 0.4 | 1.4 | 7.3 | 18.7 | 2.0 | 70.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.1 | 0.3 | 1.2 | 3.5 | 6.4 | 88.4 |
| † | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.2 | 0.3 | 99.4 |
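A table of this form is a row-normalized cross-tabulation of the per-peptide results from the two search tools. The following sketch shows how such a matrix could be computed from paired difference counts. It is an illustration only: the function names and the use of None to denote "no match" are assumptions, not part of DAPPLE 2.

```python
from collections import Counter

NO_MATCH = "†"  # Table 3 label for eight or more differences, or no match


def bin_label(n_diff, cutoff=8):
    """Map a difference count (None meaning no match) to a table label."""
    return NO_MATCH if n_diff is None or n_diff >= cutoff else n_diff


def row_percentages(pairs):
    """Build {row_label: {column_label: percentage}} from paired results.

    pairs is an iterable of (blast_diff, rapsearch_diff), one per query peptide.
    """
    counts = Counter((bin_label(b), bin_label(r)) for b, r in pairs)
    row_totals = Counter()
    for (row, _col), n in counts.items():
        row_totals[row] += n
    return {
        row: {
            col: 100.0 * n / row_totals[row]
            for (r, col), n in counts.items()
            if r == row
        }
        for row in row_totals
    }


# Hypothetical paired results for six query peptides.
table = row_percentages([(0, 0), (0, 0), (2, 2), (2, 3), (2, None), (2, 9)])
```

Because each row is normalized by its own total, the percentages in a row sum to 100 before rounding; after rounding to one decimal place, as in Table 3, a row may sum to 99.9 or 100.1.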

Table 4: Validation of DAPPLE 2 for acetylation sites, glycosylation sites, and ubiquitination sites. Known PTM sites, which were represented as 15-residue peptides with the modified residue in the center, from the species in the first column (abbreviations: Bt, Bos taurus; Ce, Caenorhabditis elegans; Cp, Cavia porcellus; Dm, Drosophila melanogaster; Mm, Mus musculus; Pf, Plasmodium falciparum; Rn, Rattus norvegicus; Sc, Saccharomyces cerevisiae; Tg, Toxoplasma gondii) were searched against the human proteome using DAPPLE 2, with BLAST as the search method. The remaining columns indicate the number of matches from DAPPLE 2 with a given number of sequence differences that are known to be human PTM sites (the number before the slash) out of the total number of peptides with that number of sequence differences (the number after the slash). For example, there were 107 known acetylation sites from Bos taurus whose corresponding 15-mer peptide had zero sequence differences with its best match in the human proteome. Of these, 71 were experimentally-determined human acetylation sites (66%). If a cell contains “–”, it means that there were no peptides from that species whose best match in the human proteome had the indicated number of sequence differences. Only species with at least 200 sites for a particular PTM are included in the table.

Column headings 0 to 7 give the number of sequence differences.

| PTM | Species | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|---|
| Acetylation | Bt | 71/107 | 18/49 | 19/36 | 10/25 | 2/17 | 3/16 | 1/4 | 1/8 |
| Acetylation | Cp | 298/466 | 135/242 | 47/129 | 29/73 | 14/29 | 13/25 | 10/18 | 1/1 |
| Acetylation | Dm | 91/116 | 65/82 | 50/107 | 77/151 | 62/145 | 53/158 | 41/210 | 25/415 |
| Acetylation | Mm | 3817/9247 | 1111/3891 | 546/2501 | 273/1526 | 159/1004 | 95/658 | 34/462 | 27/464 |
| Acetylation | Pf | 7/7 | 3/4 | 5/10 | 6/7 | 2/6 | 7/15 | 5/28 | 3/76 |
| Acetylation | Rn | 2574/6644 | 867/2906 | 433/1779 | 243/1114 | 112/608 | 61/458 | 37/320 | 15/248 |
| Acetylation | Sc | 19/28 | 32/50 | 43/79 | 47/133 | 37/164 | 63/236 | 49/402 | 36/877 |
| Acetylation | Tg | 12/12 | 11/15 | 3/9 | 9/14 | 11/19 | 8/18 | 4/38 | 2/95 |
| Glycosylation | Bt | 28/42 | 25/42 | 22/34 | 25/39 | 13/28 | 14/30 | 6/22 | 3/21 |
| Glycosylation | Ce | 1/2 | 0/1 | 0/1 | – | 1/1 | 1/6 | 1/22 | 4/100 |
| Glycosylation | Mm | 328/1607 | 252/1053 | 203/836 | 165/644 | 123/486 | 81/371 | 60/280 | 24/241 |
| Glycosylation | Rn | 30/72 | 24/42 | 17/24 | 8/26 | 10/18 | 3/9 | 2/9 | 1/14 |
| Ubiquitination | Bt | 165/209 | 2/3 | 0/1 | – | – | – | – | – |
| Ubiquitination | Mm | 10479/18014 | 2553/6051 | 1255/3473 | 574/2126 | 317/1409 | 171/831 | 117/646 | 68/599 |
| Ubiquitination | Rn | 1451/2497 | 264/539 | 100/291 | 35/115 | 19/87 | 14/69 | 4/27 | 6/38 |
| Ubiquitination | Sc | 42/46 | 57/81 | 93/139 | 99/173 | 88/217 | 119/274 | 107/494 | 152/1258 |
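Each cell of Table 4 can be read as a fraction of best matches that are known human PTM sites. As a small worked example, the sketch below converts cells of the form "known/total" into percentages and applies this to the Mus musculus acetylation row. The helper names are hypothetical; the numbers are taken directly from Table 4.

```python
def parse_cell(cell):
    """Return (known, total) for a cell such as '71/107', or None for '–'."""
    cell = cell.strip()
    if cell in {"–", "-", ""}:
        return None
    known, total = cell.split("/")
    return int(known), int(total)


def percent_known(cell):
    """Percentage of matches that are known sites, or None if undefined."""
    parsed = parse_cell(cell)
    if parsed is None or parsed[1] == 0:
        return None
    known, total = parsed
    return 100.0 * known / total


# Mus musculus acetylation row of Table 4 (0 to 7 sequence differences).
mm_acetylation = ["3817/9247", "1111/3891", "546/2501", "273/1526",
                  "159/1004", "95/658", "34/462", "27/464"]
mm_percentages = [percent_known(cell) for cell in mm_acetylation]
```

For this row the percentage falls at every step as the number of sequence differences grows, which is the pattern the validation question in the Discussion is concerned with. Rows with small denominators, such as those for Plasmodium falciparum, are noisier and should be read with correspondingly more caution.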

Figure Legends

Figure 1: Comparison of the time taken to run DAPPLE 2 when using BLAST or RAPSearch2 on different combinations of databases and target organisms. All runs were performed on the SAPHIRE web server, which contains four 3.2 GHz Intel processors and 32 GB of memory. Both RAPSearch2 and BLAST, run separately, were set to use four threads. Data represent the mean running time over three separate trials. Database numbers are as follows. Database #1: the CPLM butyrylation database (39 records); database #2: the N-Ace acetylation database (1215 records); database #3: the mUbiSiDa ubiquitination database (48810 records); and database #4: the PhosphoSitePlus phosphorylation database (273326 records).
