PhoStar: Identifying Tandem Mass Spectra of Phosphorylated


Article

PhoStar: Identifying tandem mass spectra of phosphorylated peptides before database search Sebastian Dorl, Stephan Winkler, Karl Mechtler, and Viktoria Dorfer J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.7b00563 • Publication Date (Web): 23 Oct 2017 Downloaded from http://pubs.acs.org on October 24, 2017



Journal of Proteome Research

PhoStar: Identifying tandem mass spectra of phosphorylated peptides before database search

Sebastian Dorl,∗,† Stephan Winkler,† Karl Mechtler,‡,¶ and Viktoria Dorfer∗,†

University of Applied Sciences Upper Austria, Bioinformatics Research Group, Softwarepark 11, 4232 Hagenberg, Austria, Research Institute of Molecular Pathology (IMP), Protein Chemistry, Campus-Vienna-Biocenter 1, 1030 Vienna, Austria, and Institute of Molecular Biotechnology (IMBA), Protein Chemistry, Vienna Biocenter (VBC), Dr. Bohr-Gasse 3, 1030 Vienna, Austria

E-mail: [email protected]; [email protected]

Abstract

Standard proteomics workflows use tandem mass spectrometry followed by sequence database search to analyse complex biological samples. The identification of proteins carrying post-translational modifications (PTMs), for example phosphorylation, is typically addressed by allowing variable modifications in the searched sequences. Accounting for these variations exponentially increases the combinatorial space in the database, which leads to increased processing times and more false positive identifications. The tool presented here, PhoStar, identifies spectra that originate from phosphorylated peptides before database search using a supervised machine learning approach. The model for the prediction of phosphorylation was trained and validated with an accuracy of 97.6% on a large set of high-confidence spectra collected from publicly available experimental data. Its power was further validated by predicting phosphorylation in the complete NIST human and mouse higher-energy collisional dissociation (HCD) spectral libraries, achieving an accuracy of 98.2% and 97.9%, respectively. We demonstrate the application of PhoStar by using it for spectra filtering before database search. In database search of HeLa samples the peptide search space was reduced by 27–66% while finding at least 97% of total peptide identifications (at 1% FDR) compared to a standard workflow.

∗ To whom correspondence should be addressed
† University of Applied Sciences Upper Austria, Bioinformatics Research Group, Softwarepark 11, 4232 Hagenberg, Austria
‡ Research Institute of Molecular Pathology (IMP), Protein Chemistry, Campus-Vienna-Biocenter 1, 1030 Vienna, Austria
¶ Institute of Molecular Biotechnology (IMBA), Protein Chemistry, Vienna Biocenter (VBC), Dr. Bohr-Gasse 3, 1030 Vienna, Austria

Keywords

mass spectrometry, proteomics, post-translational modification, phosphorylation, search space reduction, machine learning, random forest classification

Introduction

Tandem mass spectrometry analysis coupled with separation via liquid chromatography (LC-MS/MS) has become the driving force in high-throughput proteomics research (1,2). Sequence database search is a common method to derive peptide and protein identifications from tandem mass spectra (3). A variety of database search engines are available for this task (4–9). In brief, a peptide dictionary is created by in-silico digestion of a protein sequence database and is used to find candidate peptides that can explain the experimental spectra. Then, each experimental spectrum is compared to the theoretical spectra derived from the candidate peptide sequences. The total number of comparisons that have to be made, i.e. the total number of candidate peptides, is called the peptide search space. A big search space not only leads to long processing times, but also to more false positive identifications because of the increased probability of finding high-scoring incorrect matches (10). Additionally, database search engines allow for the identification of post-translational modifications (PTMs) by adding modified sequences to the search space. When accounting for multiple possible modifications and multiple modification sites, the search space increases exponentially. In open database search, which considers many different PTMs, the search space usually increases by several orders of magnitude (11). For instance, using a standard database of HeLa proteome data we saw the search space increase by 330% on average when considering phosphorylation(S,T,Y) as a variable modification.

Several different strategies have been developed to address the search space explosion in database search for modified peptides. Multi-pass database searches can help to limit the search space size (12). Spectral alignment approaches allow for fast comparisons of experimental spectra to many similar candidate peptide sequences (13,14). Sequence tag approaches can be used to efficiently filter out candidate peptide sequences (15,16). These approaches focus on optimizing the search algorithm and are dependent on the peptide database.

Alternatively, data-centric strategies aim to extract information on PTMs directly from the sample dataset. A common approach is based on the observation that both the modified and unmodified versions of a peptide are usually present in the sample. Such spectral pairs can be found either by looking for precursor mass shifts that correspond to specific PTMs (17,18) or by matching spectra based on MS/MS fragment ion similarity (19–21). Thus, these methods are limited by the abundance of these spectral pairs in the sample and their effectiveness depends on the dataset composition.

Strategies to assess individual spectra have also been developed. Several methods are available to predict spectrum quality from raw data (22,23). The goal is usually the removal of low-quality spectra before database search to decrease the overall search space without losing identifications. Similarly, charge state prediction can help to avoid searching candidate peptides of different charge states (24–26). In contrast, there have been few attempts at finding peptides with PTMs using spectrum features. Liang et al. recently presented a machine learning approach to identify N-glycopeptide spectra (27). Lu et al. published Colander, which uses a support vector machine to filter non-phosphorylated spectra and improve search space in phosphopeptide-enriched datasets. Colander uses five spectrum features and was trained on 752 high-confidence spectra (50% of which were phosphorylated) from linear and quadrupole ion traps (28).

We present PhoStar, which identifies individual spectra of phosphorylated peptides without consulting a sequence database or examining an entire dataset. PhoStar uses a classifier trained by machine learning to distinguish tandem mass spectra of phosphorylated peptides. This is possible because peptides with phosphorylated serine, threonine, or tyrosine often show distinct MS/MS peak patterns resulting from the neutral loss of the phosphate group during fragmentation (29). PhoStar uses a random forest model (30) which considers the neutral loss events in each individual spectrum to calculate a phosphorylation score. The model was trained in a supervised learning process using a training dataset of 2.8 million high-confidence identified spectra. We obtained these by re-processing multiple publicly available experimental datasets spanning different organisms and tissue types.
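To make the search space growth described in this introduction concrete, the following minimal sketch counts how many distinct phosphorylation patterns a single peptide contributes when phosphorylation(S,T,Y) is a variable modification. The function name, the example sequence, and the simple cap on the total number of phosphate groups are illustrative assumptions; search engines apply their own limits and this is not how any particular engine enumerates candidates.

```python
from math import comb


def count_phospho_variants(peptide: str, max_mods: int) -> int:
    """Count phosphorylation patterns with 0..max_mods phosphates on S/T/Y.

    Assumption: every S, T, or Y is an independent candidate site and the
    only restriction is a cap on the total number of phosphate groups.
    """
    sites = sum(peptide.count(residue) for residue in "STY")
    return sum(comb(sites, k) for k in range(min(max_mods, sites) + 1))


# Hypothetical tryptic peptide with four candidate sites (S, T, S, Y).
example = "SAMPLETSYK"
for cap in range(5):
    print(cap, count_phospho_variants(example, cap))
```

Each additional candidate site doubles the number of patterns when no cap is applied, which is why restricting the variable modification to spectra that are likely phosphorylated pays off.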

Methods

Training data preparation

To provide training data for the machine learning process we assembled a set of high-confidence identified phosphorylated and non-phosphorylated spectra by re-processing public data. Spectra in .RAW file format were downloaded from the PRIDE repository (31). A complete list of the experiments used can be found in Supplemental Table S1. Peptides were identified using the MS Amanda (8) database search engine in the Proteome Discoverer environment (Version 2.1.0.81, Thermo Fisher Scientific) and validated with Percolator (32). Spectra were searched against the UniProt Swiss-Prot FASTA databases (v2016-04-13) for human, mouse, or rat (depending on the sample) combined with a database of common contaminants (33). Each database was concatenated with a shuffled decoy database of equal size. The search parameters used for MS Amanda were: precursor ion mass tolerance of 5 ppm, fragment ion tolerance of 20 ppm, full tryptic peptides only, maximum of two missed cleavages, carbamidomethyl(C) set as static modification, as well as oxidation(M) and phosphorylation(S,T,Y) set as variable modifications. No more than two phosphorylations of the same amino acid (serine, threonine, or tyrosine) were allowed. Identified spectra were filtered to 1% FDR at PSM level as determined by Percolator. We discarded any identifications of contaminants as well as any identifications with an MS Amanda score less than 200.

The assembled training dataset contains 2,839,295 identified spectra, 38.9% of which were from phosphorylated peptides (see Supplemental Table S1). The majority of spectra (70.7%) originate from experiments on HeLa cell line extracts. The remainder were taken from experiments using different human, mouse, or rat tissue types and cell lines to form a more generalized training dataset. The experiments were specifically selected to include exclusively QExactive high-accuracy fragment ion data and no label-based quantification.
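The filtering criteria described above can be summarized in a short sketch. The `Psm` record and its field names are hypothetical and do not correspond to the export format of Proteome Discoverer, MS Amanda, or Percolator; only the three criteria (1% FDR at PSM level, no contaminants, score of at least 200) are taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Psm:
    """Hypothetical record for one peptide-spectrum match."""
    spectrum_id: str
    score: float          # search engine score
    q_value: float        # PSM-level q-value from the validation step
    is_contaminant: bool
    is_phospho: bool      # label later used for supervised training


def select_training_psms(psms: List[Psm],
                         max_q: float = 0.01,
                         min_score: float = 200.0) -> List[Psm]:
    """Keep PSMs that pass the FDR, contaminant, and score criteria."""
    return [
        p for p in psms
        if p.q_value <= max_q and not p.is_contaminant and p.score >= min_score
    ]


demo = [
    Psm("s1", 350.0, 0.001, False, True),
    Psm("s2", 150.0, 0.001, False, False),   # score below 200
    Psm("s3", 420.0, 0.050, False, False),   # fails 1% FDR
    Psm("s4", 500.0, 0.001, True, False),    # contaminant
]
print([p.spectrum_id for p in select_training_psms(demo)])
```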

Feature calculation

PhoStar calculates a total of 60 numerical features for each individual spectrum. A complete list is given in Supplemental Table S2. These features were constructed to best describe the abundance of neutral losses and specific (phosphorylated) amino acids in the spectrum. We record neutral loss of H3PO4 and HPO3, as well as H2O, NH3, and several possible combinations of the aforementioned. Precursor neutral losses are found by direct matching of experimental peaks to the expected m/z values calculated from the MS1 information. The calculation of fragment ion features depends on finding peak pairs with a specific m/z difference (e.g. 97.98 Da for H3PO4) among all possible peak pairs of the spectrum. Peak matching always considers isotopes with either 0, 1 or 2 13C atoms. Accepted charge states for fragment ions are +1 and +2. For precursor ions the charge is assumed to match the MS1 information. Furthermore, for each peak in the spectrum we calculate the theoretical m/z of the possible complementary fragment ions. All of these theoretical ions are also eligible to form peak pairs for the calculation of fragment ion features (provided each pair still contains at least one observed peak).

The feature calculation is influenced by two input parameters: peak picking depth P and fragment ion tolerance T. Peak picking is performed to reduce the number of noise peaks in the spectrum. The m/z range of the spectrum is divided into windows of 100 m/z width. Only the P highest peaks in each window are kept. The calculation of features concerning the precursor ion and its neutral losses is done before peak picking and all precursor-related peaks are removed. Thus, P strictly refers to the number of fragment ion peaks per 100 m/z window. T is the maximum accepted distance (in Da or ppm) when matching an individual peak to a theoretical m/z value, e.g. when looking for precursor neutral loss peaks. Double tolerance is assumed when looking for peak pairs of a specific difference, e.g. a peak pair with a distance of 97.98 m/z to indicate neutral loss of H3PO4, since both peaks are subject to full measurement uncertainty.
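The two preprocessing ideas described in this section, peak picking per 100 m/z window and counting peak pairs separated by a neutral loss mass, can be sketched as follows. This is an illustration under stated assumptions and not the PhoStar implementation: the function names are hypothetical, the ppm tolerance is evaluated at the higher m/z of each pair, and isotope handling, complementary ions, and precursor peak removal are left out for brevity.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)

H3PO4_MASS = 97.9769  # monoisotopic mass of H3PO4 in Da


def pick_peaks(peaks: Sequence[Peak], depth: int, window: float = 100.0) -> List[Peak]:
    """Keep the `depth` most intense peaks in every m/z window."""
    bins = defaultdict(list)
    for mz, intensity in peaks:
        bins[int(mz // window)].append((mz, intensity))
    kept: List[Peak] = []
    for members in bins.values():
        members.sort(key=lambda peak: peak[1], reverse=True)
        kept.extend(members[:depth])
    return sorted(kept)


def count_loss_pairs(peaks: Sequence[Peak], loss_mass: float, tol_ppm: float,
                     charges: Sequence[int] = (1, 2)) -> int:
    """Count peak pairs whose m/z difference matches loss_mass / charge.

    The tolerance is doubled because both peaks of a pair carry the full
    measurement uncertainty.
    """
    mzs = sorted(mz for mz, _ in peaks)
    count = 0
    for i, low in enumerate(mzs):
        for high in mzs[i + 1:]:
            tolerance = 2 * tol_ppm * 1e-6 * high
            if any(abs((high - low) - loss_mass / z) <= tolerance for z in charges):
                count += 1
    return count


spectrum = [(101.0, 5.0), (150.0, 9.0), (199.0, 1.0), (250.0, 3.0)]
print(pick_peaks(spectrum, depth=2))
print(count_loss_pairs([(500.0, 1.0), (597.9769, 1.0)], H3PO4_MASS, tol_ppm=10))
```

The pair search is quadratic in the number of peaks, which is one reason why reducing the spectrum to P peaks per window beforehand is useful.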

Machine learning

PhoStar uses a supervised machine learning approach to distinguish between phosphorylated and non-phosphorylated spectra. A classification model was constructed using the random forest algorithm (30). This algorithm constructs multiple decision trees, each on a random subset of the training data. The classification score for a given sample is calculated by averaging the votes (1 for phosphorylated, 0 for not phosphorylated) from all trees in the forest. Thus, the classification score lies in the interval [0, 1], with a score close to 1 indicating a spectrum that is highly likely to be phosphorylated. PhoStar determines all spectra with a classification score above a certain threshold C to be phosphorylated. Therefore, the threshold C regulates how strict the model is in accepting spectra as phosphorylated.

Training and validation of the random forest model were done in HeuristicLab (34) (Version 3.3.13). The parameters for the algorithm were M = 0.5 (fraction of features used in each tree), R = 0.1 (fraction of samples used in each tree), and N = 50 (number of trees in the random forest). Several feature calculation parameter combinations were tested.
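The scoring rule described above, averaging binary tree votes and comparing the result to the threshold C, can be sketched in a few lines. The stub trees below are hypothetical stand-ins for trained decision trees and the two-element feature vectors are invented for illustration; the sketch does not reproduce the HeuristicLab model.

```python
from typing import Callable, Sequence

# A "tree" is any callable that maps a feature vector to a vote of 0 or 1.
Tree = Callable[[Sequence[float]], int]


def phospho_score(trees: Sequence[Tree], features: Sequence[float]) -> float:
    """Average the votes of all trees; the result lies in [0, 1]."""
    votes = [tree(features) for tree in trees]
    return sum(votes) / len(votes)


def is_phosphorylated(score: float, threshold: float = 0.5) -> bool:
    """Accept a spectrum as phosphorylated if its score exceeds C."""
    return score > threshold


# Hypothetical stub trees operating on two made-up features.
stub_forest = [
    lambda f: int(f[0] > 0.5),
    lambda f: int(f[1] > 0.5),
    lambda f: int(f[0] + f[1] > 1.5),
]

for features in ([0.9, 0.4], [0.9, 0.8]):
    score = phospho_score(stub_forest, features)
    print(features, score, is_phosphorylated(score))
```

Raising the threshold makes the classifier stricter, while lowering it accepts more spectra as phosphorylated.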


The classification performance is shown in Table 1. For the final model, feature calculation parameters were set to 10 ppm fragment ion tolerance and a peak picking depth of 10 peaks per 100 m/z. Model performance was confirmed using 10-fold cross validation, which showed an average accuracy of 97.6%.

Table 1: Validation of PhoStar phosphorylation model on the training dataset using different parameters for feature calculation: fragment ion match tolerance T and peak picking depth P in peaks per 100 m/z. 10-fold cross validation was performed. We show the average values for accuracy (ACC), true positive rate (TPR), true negative rate (TNR), and positive predictive value (PPV).

tolerance T   picking depth P   ACC      TPR      TNR      PPV
10 ppm        15 peaks          97.59%   95.36%   98.79%   97.68%
15 ppm        10 peaks          97.38%   94.94%   98.68%   97.46%
5 ppm         10 peaks          97.43%   94.97%   98.75%   97.60%
10 ppm        5 peaks           97.18%   94.50%   98.61%   97.31%
10 ppm        10 peaks          97.60%   95.32%   98.82%   97.73%
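For reference, the four measures reported in Table 1 follow directly from the counts of a confusion matrix. The helper below is a minimal sketch with a hypothetical function name and invented example counts; it is not taken from the PhoStar code.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute ACC, TPR, TNR, and PPV from confusion matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # accuracy
        "TPR": tp / (tp + fn),                   # true positive rate
        "TNR": tn / (tn + fp),                   # true negative rate
        "PPV": tp / (tp + fp),                   # positive predictive value
    }


# Invented counts for illustration only.
print(classification_metrics(tp=90, fp=5, tn=95, fn=10))
```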

Results

Validation on spectral libraries

To demonstrate the performance of the PhoStar model we used it to classify spectral library data. The National Institute of Standards and Technology (NIST) provides curated spectral libraries for a variety of organisms on their web site (http://chemdata.nist.gov). Libraries of tandem mass spectra from Orbitrap HCD experiments are available for human and mouse. We concatenated the three different HCD libraries (label-free, iTRAQ-4, and iTRAQ-4 phosphorylated) for each organism. The resulting libraries contained 2,552,942 spectra (8.75% phosphorylated) for human and 124,665 spectra (12.6% phosphorylated) for mouse. Prediction was done by calculating features and phosphorylation scores for each library spectrum. Spectra with a phosphorylation score above the classification threshold C (default C = 0.5) were predicted to be phosphorylated. The prediction results were then compared to the true labels in the library.


Figure 1: Predicting spectra of phosphorylated peptides in NIST HCD spectral libraries. The jitter plots show 5000 randomly selected samples from either human or mouse spectral libraries. PhoStar classifies a spectrum with a phosphorylation score above the threshold C = 0.5 as phosphorylated.

PhoStar was able to identify phosphorylated spectra with an average accuracy of 98.04% in the NIST spectral libraries (see Figure 1 and Supplemental Table S3). Average true positive rate and true negative rate were 91.23% and 98.83%, respectively. Overall accuracy could further be improved by individually adjusting the classification threshold C for each library (see Supplemental Figure S1). Feature calculation during prediction used parameters of fragment ion tolerance T = 10 ppm and peak picking depth P = 10 peaks per 100 m/z. These were also the parameters used for the training process (see Supplemental Figure S2 for full results).

Accuracy in the spectral libraries is higher than during the model training process since the spectral libraries had a smaller relative amount of phosphorylated spectra (8.75% and 12.6%) compared to the 38.9% in the training dataset. More non-phosphorylated spectra will lead to higher overall accuracy since the model shows a better true negative rate than true positive rate in all experiments.

The true positive rate of phosphorylation prediction is limited by the number of phosphorylated spectra which contain characteristic peaks that can be picked up by the features. Some phosphorylated peptides will not produce sufficient neutral losses to be correctly classified (see Supplemental Figure S3). The extent of neutral losses depends on a multitude of factors including the phosphorylated amino acid residue, the overall amino acid sequence, and the precursor charge state (29). The development of dedicated models for specific charge states or phosphorylated residues could potentially improve results. Furthermore, since the only available NIST HCD libraries with phosphorylated spectra are the iTRAQ-4 libraries, all phosphorylated spectra in the test data are labeled with iTRAQ-4 by necessity. However, this does not hinder PhoStar classification since the reporter ions do not directly interfere with any of the feature calculations.
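The dependence of overall accuracy on the share of phosphorylated spectra, discussed above, follows from writing accuracy as a prevalence-weighted mean of the true positive and true negative rates. The sketch below plugs the average rates quoted in this section into that identity. Treating the rates as fixed across datasets is a simplifying assumption, so the numbers illustrate the trend and are not a reproduction of the reported results.

```python
def expected_accuracy(prevalence: float, tpr: float, tnr: float) -> float:
    """Accuracy as the prevalence-weighted mean of TPR and TNR."""
    return prevalence * tpr + (1.0 - prevalence) * tnr


# Average rates reported for the spectral libraries in this section.
TPR, TNR = 0.9123, 0.9883

for label, prevalence in [("human library", 0.0875),
                          ("mouse library", 0.126),
                          ("training-like mix", 0.389)]:
    print(f"{label}: {expected_accuracy(prevalence, TPR, TNR):.4f}")
```

Because the true negative rate exceeds the true positive rate, the expected accuracy rises as the fraction of phosphorylated spectra falls.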

Application in peptide identification

PhoStar can be integrated with peptide identification workflows to filter spectra before database search. This allows for subsequent database search with reduced search space. PhoStar was used to split input spectra into predicted phosphorylated and non-phosphorylated output files. Then, a separate database search was performed on each of the two using the MS Amanda search engine. The variable modification phosphorylation(S,T,Y) was considered only for the spectra predicted to be phosphorylated. The results were compared in terms of number of identified peptides and size of search space to a standard database search where all spectra were searched with variable modification phosphorylation(S,T,Y). The total search space was calculated by adding up search space values for every PSM as reported by the MS Amanda search engine.

The test samples used were HeLa cell line datasets taken from Sharma et al. (36). This experimental data included proteome-wide datasets as well as data from phospho-enriched samples with six replicates per control group. We used three of the six replicates for both the proteome and phospho-enriched control groups (the other replicates were part of the training dataset).

PhoStar feature calculation parameters were fragment ion tolerance T = 10 ppm and peak picking depth P = 10 peaks per 100 m/z. Spectra with a phosphorylation score above the threshold C = 0.5 were used in the search with phosphorylation(S,T,Y) considered. C controls the trade-off between true positive rate and false positive rate during spectrum filtering. Adjusting the parameter to fit experimental requirements can improve results (see the Supplemental Material for further discussion of C). Database search parameters were equal to the searches described for the training data preparation except for the variable modification phosphorylation(S,T,Y). After the database search, PSMs from both parts were pooled and filtered to 1% FDR using a simple score-based ranking.
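The final pooling and filtering step can be illustrated with a generic target-decoy ranking. The text only states that a simple score-based ranking was used, so the decoy-to-target ratio as FDR estimate, the tuple layout, and the function name below are assumptions made for this sketch.

```python
from typing import Iterable, List, Tuple

ScoredPsm = Tuple[float, bool]  # (search engine score, is_decoy)


def filter_by_fdr(psms: Iterable[ScoredPsm], max_fdr: float = 0.01) -> List[ScoredPsm]:
    """Return target PSMs above the lowest score cutoff that keeps the
    estimated FDR (decoys / targets) at or below max_fdr."""
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = cutoff = 0
    for index, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = index  # longest ranked prefix that satisfies the limit
    return [psm for psm in ranked[:cutoff] if not psm[1]]


# Hypothetical results of the two separate searches, pooled before filtering.
phospho_search = [(300.0, False), (280.0, True)]
standard_search = [(290.0, False), (270.0, False)]
pooled = phospho_search + standard_search
print(filter_by_fdr(pooled, max_fdr=0.1))
```

Pooling before filtering means both searches share one score threshold, which keeps the reported FDR comparable to a single standard search.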

Figure 2: Comparison of PhoStar-assisted MS Amanda database search results for triplicate HeLa cell line samples measured by Sharma et al. (36). (A) The number of identified peptides is based on the number of unique sequences identified at 1% FDR (peptide-spectrum match level). (B) Search space is calculated as the sum of individual search space values reported for each PSM in an individual search. Values are scaled to the highest number observed for triplicates in that respective group.

The number of identified peptides in the PhoStar-assisted workflow for phosphopeptide-enriched samples ranged between 97.13% and 99.93% of the number identified by the standard workflow, whereas the total number of identified peptides for the proteome samples was between 100.18% and 100.47% (see Figure 2 and Supplemental Table S4). Meanwhile, the total search space decreased to 53.65%–76.03% for phosphopeptide-enriched samples and 31.19%–47.38% for the proteome samples. This reduced search space led to shorter total analysis runtime for the assisted workflow (see Supplemental Figure S4).

In all cases the PhoStar-assisted search identified a similar number of total peptides within a significantly reduced search space. The search space is reduced by making sure that spectra which are very unlikely to be phosphorylated are not subjected to database search with variable modification phosphorylation(S,T,Y). The effect was most pronounced for the proteome samples since these datasets include several times more non-phosphorylated than phosphorylated spectra. Furthermore, this limits the number of false positive matches of unmodified spectra to phosphorylated peptides. This is the reason why we see an increase of identified non-phosphorylated peptides at constant FDR in all samples. For the proteome samples this led to a net increase in the number of total identified peptides (Supplemental Table S4).

Discussion and conclusion

We have presented PhoStar, a tool for the identification of tandem mass spectra that are highly likely to originate from phosphorylated peptides. As demonstrated, applying PhoStar before database search can help to optimize the peptide search space on a spectrum-by-spectrum basis during database search which considers phosphorylation as a variable modification. PhoStar is easily combinable with other identification workflows and should achieve similar results. Especially in non-enriched samples a search can be done with only a fraction of the complexity while still identifying an equal number of peptides. Similar machine learning approaches could be used to identify a variety of PTMs in tandem mass spectra. An integrative solution using different classifiers covering multiple common modifications could allow for more PTM identifications in routine experiments without the search space explosion commonly associated with searching many PTMs.

Besides the demonstrated workflow of splitting and later re-pooling the data, PhoStar can also be used to simply filter the data by discarding either the phosphorylated or the non-phosphorylated set. Alternatively, the distribution of PhoStar phosphorylation scores among all spectra could be used as a fast estimate of how many phosphorylated peptides are in the dataset without performing full identification. We also suggest that the PhoStar phosphorylation score could be used to improve PSM validation after database search, since a phosphorylated peptide matched to a spectrum with a low phosphorylation score is more likely to be a false positive match. This could be used as additional information to separate true positive from false positive hits and increase confident peptide identifications during validation. PhoStar presents all numerical features together with the output so any information is open for further downstream applications.
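The idea of using the score distribution as a fast estimate of the phosphorylated share of a dataset can be sketched as follows. The first function simply counts spectra above the threshold. The second applies a standard prevalence correction for an imperfect classifier, which is not part of the paper and assumes that the true positive and true negative rates measured elsewhere carry over to the dataset at hand. All names and numbers are illustrative.

```python
from typing import Sequence


def predicted_fraction(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of spectra whose phosphorylation score exceeds the threshold."""
    return sum(score > threshold for score in scores) / len(scores)


def corrected_fraction(observed: float, tpr: float, tnr: float) -> float:
    """Correct the observed fraction for misclassification.

    Derived from observed = p * TPR + (1 - p) * (1 - TNR), solved for p
    and clipped to [0, 1]. Assumes fixed, known TPR and TNR.
    """
    estimate = (observed + tnr - 1.0) / (tpr + tnr - 1.0)
    return min(1.0, max(0.0, estimate))


# Invented scores for illustration.
scores = [0.1, 0.9, 0.7, 0.2]
observed = predicted_fraction(scores)
print(observed, corrected_fraction(observed, tpr=0.95, tnr=0.99))
```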

Usage and Availability

PhoStar is available free of charge in the download section at http://bioinformatics.fh-hagenberg.at as a Windows executable accompanied by an extensive user manual and example data. The tool is called on the command line using a simple syntax requiring the input spectrum file and three parameters for spectrum processing: peak picking depth P in peaks per 100 m/z, fragment ion tolerance T in either Da or ppm, and score threshold C between 0 and 1 (default C = 0.5).

P and T directly influence feature calculation and should be chosen based on the properties of the input spectra. We recommend P = 10 peaks per 100 m/z and T = 10 ppm for high-accuracy fragment ion spectra, e.g. QExactive data. Since the model included with PhoStar was trained using spectra with HCD fragmentation, performance may vary when using data with alternative fragmentation methods. Electron-transfer dissociation (ETD) in particular changes the occurrence of neutral loss peaks during fragmentation (35). Since this is an important feature for the classification, the results must be critically examined.

The threshold parameter C determines the minimum score value required for a spectrum to be classified as phosphorylated. For instance, a lower value of C will allow the model to more leniently accept spectra as phosphorylated. The optimal C depends on the study goal and should be carefully considered by the user (for more information see the Supplemental Material S1).

Input files can be supplied in Mascot Generic Format (MGF) or spectral library text format (MSP). PhoStar produces three output files: a tab-separated value file with all of the calculated features and phosphorylation scores, and two spectrum files of the same format as the input including only the spectra that have been classified as phosphorylated or non-phosphorylated, respectively. Pre-processing steps used for feature calculation are only temporary and do not affect the spectrum output files. Thus, the result files can be used exactly as the input files and be combined with any identification workflow of the user's choice. For a description of the columns in the tab-separated output file see Supplemental Table S2.
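As an illustration of the MGF input format mentioned above, the following minimal reader collects the parameter lines and peak lists of each BEGIN IONS / END IONS block. It is a simplified sketch and not the parser used by PhoStar: the function name is hypothetical, the example spectrum is invented, and file-level header parameters outside the ion blocks are ignored. A reader like this could, for example, be used to count the spectra in each of the two spectrum output files.

```python
from typing import Dict, Iterable, List


def read_mgf(lines: Iterable[str]) -> List[Dict]:
    """Parse MGF text into a list of {'params': {...}, 'peaks': [...]} dicts."""
    spectra: List[Dict] = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None and line:
            if "=" in line:
                # Parameter line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z followed by an optional intensity
                fields = line.split()
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                current["peaks"].append((float(fields[0]), intensity))
    return spectra


# Invented example spectrum.
example_mgf = """BEGIN IONS
TITLE=example spectrum 1
PEPMASS=650.3210
CHARGE=2+
500.0000 1200.5
597.9769 830.0
END IONS
"""
parsed = read_mgf(example_mgf.splitlines())
print(len(parsed), parsed[0]["params"]["PEPMASS"], parsed[0]["peaks"])
```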

Acknowledgement

This work was supported by the Austrian Science Fund (FWF) (TRP 308-N15) and the Joint JKU/UAS PhD Program in Informatics organized by the Johannes Kepler University Linz and the University of Applied Sciences Upper Austria (application period from August 1, 2015 until October 11, 2015).

Supporting Information

The following files are available free of charge at the ACS website (http://pubs.acs.org): phostar dorl2017 supplemental.pdf

Supplementary Table S1: Comprehensive list of spectra in the training dataset.
Supplementary Table S2: Full description of all features used by PhoStar.
Supplementary Table S3: PhoStar classification performance on NIST spectral libraries.
Supplementary Figure S1: Accuracy during classification depending on score threshold.
Supplementary Figure S2: ROC curves for classification depending on score threshold.
Supplementary Figure S3: Fragment ion neutral losses during the classification of the spectral libraries.
Supplementary Figure S4: Runtime comparison for database searches using the PhoStar-assisted workflow.
S1: Explanation of the score threshold parameter C.
Supplementary Table S4: Detailed result comparison for database searches using the PhoStar-assisted workflow.

References

(1) Aebersold, R.; Mann, M. Mass-spectrometric exploration of proteome structure and function. Nature 2016, 537, 347–355.
(2) Gillet, L. C.; Leitner, A.; Aebersold, R. Mass Spectrometry Applied to Bottom-Up Proteomics: Entering the High-Throughput Era for Hypothesis Testing. Annu. Rev. Anal. Chem. 2016, 9, 449–472.
(3) Schmidt, A.; Forne, I.; Imhof, A. Bioinformatic analysis of proteomics data. BMC Syst. Biol. 2014, 8 Suppl 2, S3.
(4) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–67.
(5) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466–7.
(6) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3, 958–64.
(7) Cox, J.; Neuhauser, N.; Michalski, A.; Scheltema, R. A.; Olsen, J. V.; Mann, M. Andromeda: A peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 2011, 10, 1794–1805.
(8) Dorfer, V.; Pichler, P.; Stranzl, T.; Stadlmann, J.; Taus, T.; Winkler, S.; Mechtler, K. MS Amanda, a universal identification algorithm optimized for high accuracy tandem mass spectra. J. Proteome Res. 2014, 13, 3679–84.
(9) Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 2014, 5, 5277.
(10) Nesvizhskii, A. I. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J. Proteomics 2010, 73, 2092–123.
(11) Na, S.; Paek, E. Software eyes for protein post-translational modifications. Mass Spectrom. Rev. 2015, 34, 133–147.
(12) Tharakan, R.; Edwards, N.; Graham, D. R. M. Data maximization by multipass analysis of protein mass spectra. Proteomics 2010, 10, 1160–1171.
(13) Tsur, D.; Tanner, S.; Zandi, E.; Bafna, V.; Pevzner, P. A. Identification of posttranslational modifications by blind search of mass spectra. Nat. Biotechnol. 2005, 23, 1562–1567.
(14) Chen, Y.; Chen, W.; Cobb, M. H.; Zhao, Y. PTMap–A sequence alignment software for unrestricted, accurate, and full-spectrum identification of post-translational modification sites. Proc. Natl. Acad. Sci. 2009, 106, 761–766.
(15) Dasari, S.; Chambers, M. C.; Slebos, R. J.; Zimmerman, L. J.; Ham, A.-J. L.; Tabb, D. L. TagRecon: High-Throughput Mutation Identification through Sequence Tagging. J. Proteome Res. 2010, 9, 1716–1726.
(16) Na, S.; Bandeira, N.; Paek, E. Fast Multi-blind Modification Search through Tandem Mass Spectrometry. Mol. Cell. Proteomics 2012, 11, M111.010199.
(17) Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A. ModifiComb, a new proteomic tool for mapping substoichiometric post-translational modifications, finding novel types of modifications, and fingerprinting complex protein mixtures. Mol. Cell. Proteomics 2006, 5, 935–48.
(18) Fu, Y.; Xiu, L.-Y.; Jia, W.; Ye, D.; Sun, R.-X.; Qian, X.-H.; He, S.-M. DeltAMT: a statistical algorithm for fast detection of protein modifications from LC-MS/MS data. Mol. Cell. Proteomics 2011, 10, M110.000455.
(19) Wilhelm, T.; Jones, A. M. E. Identification of related peptides through the analysis of fragment ion mass shifts. J. Proteome Res. 2014, 13, 4002–4011.
(20) Falkner, J. A.; Falkner, J. W.; Yocum, A. K.; Andrews, P. C. A spectral clustering approach to MS/MS identification of post-translational modifications. J. Proteome Res. 2008, 7, 4614–4622.
(21) Bandeira, N.; Tsur, D.; Frank, A.; Pevzner, P. A. Protein identification by spectral networks analysis. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 6140–5.
(22) Ding, J.; Shi, J.; Wu, F.-X. SVM-RFE based feature selection for tandem mass spectrum quality assessment. Int. J. Data Min. Bioinform. 2011, 5, 73–88.
(23) Zou, A.-M.; Wu, F.-X.; Ding, J.-R.; Poirier, G. G. Quality assessment of tandem mass spectra using support vector machine (SVM). BMC Bioinformatics 2009, 10 Suppl 1, S49.
(24) Zou, A.-M.; Shi, J.; Ding, J.; Wu, F.-X. Charge state determination of peptide tandem mass spectra using support vector machine (SVM). IEEE Trans. Inf. Technol. Biomed. 2010, 14, 552–8.
(25) Carvalho, P. C.; Cociorva, D.; Wong, C. C. L.; Carvalho, M. d. G. d. C.; Barbosa, V. C.; Yates, J. R., III. Charge prediction machine: tool for inferring precursor charge states of electron transfer dissociation tandem mass spectra. Anal. Chem. 2009, 81, 1996–2003.
(26) Sharma, V.; Eng, J. K.; Feldman, S.; von Haller, P. D.; MacCoss, M. J.; Noble, W. S. Precursor Charge State Prediction for Electron Transfer Dissociation Tandem Mass Spectra. J. Proteome Res. 2010, 9, 5438–5444.
(27) Liang, S.-Y.; Wu, S.-W.; Pu, T.-H.; Chang, F.-Y.; Khoo, K.-H. An adaptive workflow coupled with Random Forest algorithm to identify intact N-glycopeptides detected from mass spectrometry. Bioinformatics 2014, 30, 1908–16.
(28) Lu, B.; Ruse, C. I.; Yates, J. R. Colander: a probability-based support vector machine algorithm for automatic screening for CID spectra of phosphopeptides prior to database search. J. Proteome Res. 2008, 7, 3628–34.
(29) Boersema, P. J.; Mohammed, S.; Heck, A. J. R. Phosphopeptide fragmentation and analysis by mass spectrometry. J. Mass Spectrom. 2009, 44, 861–78.
(30) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
(31) Vizcaíno, J. A.; Csordas, A.; Del-Toro, N.; Dianes, J. A.; Griss, J.; Lavidas, I.; Mayer, G.; Perez-Riverol, Y.; Reisinger, F.; Ternent, T.; Xu, Q.-W.; Wang, R.; Hermjakob, H. 2016 update of the PRIDE database and its related tools. Nucleic Acids Res. 2016, 44, D447–D456.
(32) Käll, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 2007, 4, 923–925.
(33) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26, 1367–1372.
(34) Wagner, S.; Kronberger, G.; Beham, A.; Kommenda, M.; Scheibenpflug, A.; Pitzer, E.; Vonolfen, S.; Kofler, M.; Winkler, S.; Dorfer, V.; Affenzeller, M. In Adv. Methods Appl. Comput. Intell.; Klempous, R., Nikodem, J., Jacak, W., Chaczko, Z., Eds.; Topics in Intelligent Engineering and Informatics; Springer, 2014; Vol. 6; Chapter Architecture and Design of the HeuristicLab Optimization Environment, pp 197–261.
(35) Mikesh, L. M.; Ueberheide, B.; Chi, A.; Coon, J. J.; Syka, J. E.; Shabanowitz, J.; Hunt, D. F. The utility of ETD mass spectrometry in proteomic analysis. Biochim. Biophys. Acta - Proteins Proteomics 2006, 1764, 1811–1822.
(36) Sharma, K.; D’Souza, R. C. J.; Tyanova, S.; Schaab, C.; Wiśniewski, J. R.; Cox, J.; Mann, M. Ultradeep human phosphoproteome reveals a distinct regulatory nature of Tyr and Ser/Thr-based signaling. Cell Rep. 2014, 8, 1583–94.

Figure 3: For TOC only.
