Mining Large Scale Tandem Mass Spectrometry ... - ACS Publications

Dec 14, 2015 - Apache Spark large scale computing framework. .... data storage over large computer clusters or cloud computing .... In all searches, t...
0 downloads 13 Views 2MB Size
Subscriber access provided by ORTA DOGU TEKNIK UNIVERSITESI KUTUPHANESI

Article

Mining large scale MS/MS data for protein modifications using spectral libraries Oliver Horlacher, Frédérique Lisacek, and Markus Müller J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.5b00877 • Publication Date (Web): 14 Dec 2015 Downloaded from http://pubs.acs.org on December 16, 2015

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Mining large scale MS/MS data for protein modifications using spectral libraries

Oliver Horlacher1,2, Frederique Lisacek1,2* and Markus Müller1,2* 1

Proteome Informatics Group, SIB Swiss Institute of Bioinformatics, Geneva, 1211, Switzerland

2

Centre Universitaire de Bioinformatique, University of Geneva, Geneva, 1211, Switzerland

* Corresponding authors:

Markus Müller, Frédérique Lisacek Swiss Institute of Bioinformatics, University of Geneva CMU Rue Michel-Servet 1 1211 Geneva, Switzerland Email: [email protected], [email protected]

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Abstract Recent experimental improvements in post translational modification (PTM) detection by MS/MS has allowed the identification of vast numbers of PTMs. Open modification searches (OMS) of MS/MS data, which do not require prior knowledge of the modifications present in the sample, further increased the diversity of detected PTMs. Despite much effort there is still a lack of functional annotation of PTMs. One possibility to narrow the annotation gap is to mine MS/MS data deposited in public repositories and to correlate the PTM presence with biological metainformation attached to the data. Since the data volume can be quite substantial and contain tens of millions of MS/MS spectra, the data mining tools have to be able to cope with big data. Here we present two tools, Liberator and MzMod, which are built using the MzJava class library and the Apache Spark large scale computing framework. Liberator builds large MS/MS spectrum libraries and MzMod searches them in an OMS mode. We applied these tools to a recently published set of 25 million spectra from 30 human tissues and present tissue specific PTMs. We also compared the results to the ones obtained with the OMS tool MODa and the search engine X!Tandem.

Keywords Proteomics, PTM, MS/MS, Hadoop, Apache Spark, parallel computing, open modification search, big data, human tissues

ACS Paragon Plus Environment

Page 2 of 33

Page 3 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Introduction

Post translational modifications (PTM) are chemical alterations of proteins, which can change their binding properties, 3D structure, enzymatic activity, life span and subcellular localization. The addition and removal of most PTMs is tightly regulated by enzymes and aberrations in enzyme activity were shown to be involved in disease 1. There are more than 200 different PTMs recorded in the UniProt protein knowledgebase 2 ranging from small chemical modifications (e.g. acetylation, phosphorylation) to the addition of large biomolecules (e.g. glycosylation, ubiquitination). This collection of PTMs opens up the possibility for complex regulation and PTM crosstalk 2,3,4. Recent advances in mass spectrometry (MS), PTM enrichment and sample processing methods has made it possible to identify thousands of sites for a specific PTM in a single study 5. Peptide fragment (MS/MS) spectra define accurate and sensitive fingerprints to identify a PTM and its position on the peptide sequence. Traditional peptide spectrum match (PSM) scores evaluate the quality of a match between an eventually modified peptide and a spectrum. However, if several potentially modified residues are close the each other these match scores may retrieve the true peptide, but are often not able to distinguish the different modification sites, especially for lower quality spectra (so called ∆-correct identifications 6). Therefore special PTM positioning scores were introduced 7,8, which allow selecting the PSMs with an unambiguous PTM location. The identification and positioning accuracy of PTMs benefit from the high accuracy of modern MS/MS instruments on the precursor ion and fragment ion m/z values 9. On large datasets, MS/MS sequence searches with many (>5) variable modifications are being challenged due to increased search time and reduced sensitivity 10. Unbiased or open modification searches (OMS) provide a valuable alternative since the modifications do not need to be preconfigured but are read out from the MS/MS data directly, and the computational complexity of the search is independent of the number of modifications present in the data. This is possible because OMS rely on the alignment of modified query spectra to unmodified database spectra, where fragment peaks are matched in accordance with the masses of the modifications 11,12. In this way, information about both mass shifts and modification sites is provided. In most practical applications, OMS is performed on the set of spectra that remain unidentified after a prior standard sequence search, and after a reduction of the peptide search space to limit search time and to increase sensitivity 10. For example Tsur et al. 6 search all proteins, which have at least one unmodified peptide match. Their MS-Alignment algorithm, an improved version of the original algorithm 12, retrieved mostly singly and some doubly modified peptides. After this first OMS, a so-called PTM

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

frequency matrix was calculated, which registers how often a mass shift is positioned on a particular amino acid, and this matrix is then used to correct wrongly positioned PTMs. The MODi tool 13 on the other hand obtains a strongly reduced set of candidate peptides by filtering with sequence tags extracted from MS/MS spectra. It combines several sequence tags into a tag chain data structure and introduces mass shifts to explain the gaps in this chain when matching against a database peptide. The MODa algorithm 14 combines the tag chain approach with spectrum alignment through a dynamic programming algorithm that yields the highest scoring tag chain and mass shifts for each candidate peptide. The ModifiComb algorithm 15 searches all peptides previously identified by a standard sequence search. The rationale behind this approach is that many modifications are present in substoichiometric amounts and their unmodified counterparts are likely to be found as well. Savitski et al. also used the ∆M-histogram, i.e. the histogram of all observed mass shifts, as a summary representation of all modifications found in a sample. Other tools do not only use the sequences of identified peptides, but their consensus spectra stored in spectral libraries as starting point for an OMS with one modification per alignment 16,17,18,19, since the MS/MS spectrum of the unmodified peptide can provide valuable information for the identification of its modified forms 18. Unfortunately, most of the modified PSMs detected by all these OMS approaches correspond to abundant protein modification introduced during sample processing, which do not contain biological information (e.g. oxidation, carbamylation, deamidation, pyro-Glu, N-terminal hydrolysis). Interesting new biological modifications are much rarer and need to be carefully extracted from this background. While most high throughput studies yield new modifications sites of specific PTMs, some studies focus on the identification of new PTMs or PTMs not expected to occur in a given set of proteins 20,21. Much work has been carried out on histone proteins and several new PTMs involved in transcriptional regulation were found 22. For example, Chen et al. 23 found propionylation and butyrylation of several lysine sites on histone H4 proteins. These PTMs are related to lysine acetylation and several lines of biological evidence suggested their presence prior to the study. Therefore they could be detected in a standard MS/MS sequence search by configuring them as variable modifications. Two unexpected histone modifications (tyrosine hydroxylation and lysine crotonylation) were found using OMS 24. Similar OMS work was carried out on a whole proteome level. A recent study investigated the effect of formaldehyde fixation and paraffin embedding (FFPE) on clinical tissue samples, and found that methylation of lysine was increased in FFPE samples 25. Another large scale analysis of Synechococcus 7002 proteins 26 yielded 23 PTMs, some of which were overexpressed in functional categories such as respiration and photosynthesis processes.

ACS Paragon Plus Environment

Page 4 of 33

Page 5 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Chick et al. 27 used a dataset of 1.2 million MS/MS spectra from HEK293 cells to look for unanticipated modifications. Using a modification mass window of ±500Da and an estimated FDR of about 0.3% on the spectrum level for their OMS search they managed to identify an additional 184’000 spectra (~1/2 of the number of spectra identified in an exact search), indicating that a large portion of the spectra unassigned by a standard sequence search without variable modifications could be attributed to modified peptides. When compared to a standard sequence search with the 15 most common variable modifications, OMS only picked up about 50% of the modified spectra found in the standard search, since OMS is more prone to error and requires more conservative thresholds. However, more than 100’000 spectra could only be found by OMS, among them many rare biological PTMs and amino acid substitutions. At present, the functional annotation lags behind the fast discovery of new modification sites. Bioinformatics studies based on the evolutionary conservation of modification sites and their position in functional or structural protein domains 28,29,30 help prioritizing their functional importance, but have to be followed by further experimental functional characterization. Also, there is an ongoing discussion about the functional relevance of low stoichiometry modification sites of abundant proteins found in large scale studies 31,32. The large gap in functional annotations of PTMs may be narrowed by mining existing proteomic MS/MS data, which are stored in publically accessible repositories 33. These data provide a rich deposit for mining biologically relevant information 34,35,36,37. Other repositories incorporate information from large scale PTM studies and provide tools to mine and interpret these data 38,39. For more specific questions it may be required to search the MS/MS data directly. For example, public data can be used to verify findings obtained using a proprietary dataset 25. Matic et al. 40 made use of the fact that ADP-ribosylated proteins also become enriched by phosphoprotein enrichment protocols, and successfully mined a large phophoproteomic MS/MS dataset for ADP-ribosylation. Zhou et al. 41 reanalyzed MS/MS data from human sperm enriched for acetylated proteins 42, and the search for ubiquitinations and phosphorylations revealed many additional modification sites and opened up the possibility to study PTM crosstalk. A challenge facing bioinformatics is the ever-growing size of MS datasets and the need to process spectra by the tens of millions. The sheer volume imposes the use of distributed data processing frameworks such as Hadoop MapReduce 43 (hadoop.apache.org) and Apache Spark 44 (spark.apache.org). In contrast to multithreading, which operates on a single computer, these frameworks distribute the computing tasks and data storage over large computer clusters or cloud computing resources. Software written within these frameworks is able to analyze the data as one

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

block without the need to write additional wrapper scripts to split and recombine the data. When executed on a computer network the data are automatically divided into partitions, which are distributed to the available computers, and the results obtained on these partitions are combined. If the computation of a partition fails, the failure is noticed and the subtask is automatically dispatched again. Hadoop is already used in bioinformatics, primarily for nextgeneration sequencing analysis 45 but also for proteomics 46,47,48, while Spark was more recently introduced 49,50. Spark extends on the functionality and performance of Hadoop by allowing in-memory data storage and providing many additional functions 44. The main data structure is the Spark RDD (Resilient Distributed Dataset), which is a collection of immutable elements that can be operated on in parallel. Spark offers a whole list of parallel operations on RDDs, such as reading from and writing to various file formats, map, reduce, filter, sample, join and many more. For mining MS/MS data OMS provides an attractive complement to standard sequence searches, which overlook unexpected but important modifications. To scan large datasets containing tens of millions of MS/MS spectra the scalability of OMS becomes an important issue. In this paper we present Liberator and MzMod. Liberator is a tool to create spectral libraries for large datasets. MzMod is a completely refactored and parallelized version of the QuickMod tool 18, which searches spectral libraries in OMS mode. They are implemented in an efficient way based on the MzJava class library 51 and the Hadoop and Apache Spark frameworks. This allows running these tools in parallel on a single machine or on a computer cluster. We tested Liberator and MzMod on a large dataset of 25 million spectra from several human tissues 36. We compared results with the standard sequence search tool X!Tandem and the OMS tool MODa, and we tested the accuracy of the FDR calculation with respect to PTMs. Finally we present some preliminary biological results obtained with these data.

ACS Paragon Plus Environment

Page 6 of 33

Page 7 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Experimental Section

MS/MS Data

The MS/MS dataset used in this study is described in Kim et al. 36 and was downloaded from the PRIDE 52 data repository (www.ebi.ac.uk/pride/, dataset identifier PXD000561). Briefly, proteins from 30 human tissues were fractionated and digested with trypsin. Peptides were separated in a micro-capillary RPLC column and fragmentation spectra were measured by LTQ-Orbitrap Elite or LTQ-OrbitrapVelos mass spectrometers using HCD fragmentation. The 2200 raw mass spectrometry files (.dat) were centroided and converted to .mgf files by the MSConvert software version 3.0.4472 53. MzJava classes parse these .mgf files and converte them to Hadoop files (http://wiki.apache.org/hadoop/ SequenceFile), which are binary flat files consisting of a list of key-value pairs. The Hadoop files we used for this work are ordinary local binary files living on a single disk and do not require installation of the Hadoop file system. Only spectra with defined precursor charge and more than 20 peaks were considered in this conversion resulting in 23’944’976 spectra written to the Hadoop files. If not otherwise stated in the text, the whole dataset was used. Some tests were performed only on the adult liver subset of the data (1’022’964 spectra) as indicated in the text.

Cluster calculations

The large volume of MS/MS spectra were processed by X!Tandem, MODa, Liberator and MzMod on the VitalIT (www.vital-it.ch) computer cluster using IBM’s load sharing facility (LSF) to distribute the jobs. In order to run Liberator and MzMod on the cluster three LSF jobs were submitted: one job to run the master, an array job to run the workers, and another job to run the spark job. The master uses one core and the default ram (1G). The workers were run using an LSF array job that launches 32 workers each of which has 8 CPUs. The Liberator workers were set to use 32G of ram and the MzMod workers used 16G of ram. The job was launched on one core with default ram (1G). Spark restarts a jab that has failed and it was set up to tolerate 85 failures before giving up the job. The Hadoop

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

and Spark .jar files were distributed to the worker nodes via rsync and there was no need to install the Hadoop file system on the cluster. X!Tandem and MODa were distributed using simple LSF array jobs.

X!Tandem search

The X!Tandem54 search (SLEDGEHAMMER release) matched the spectra from the .mgf files against the 20199 reviewed human protein sequences in the UniProt database (www.uniprot.org) without considering protein variants. X!Tandem only reports the highest scoring peptide for each spectrum. The mass tolerance was set to 10ppm on the precursor and 0.05Da on the fragment masses, and isotopic error correction was switched on. All cysteine residues were carbamidomethylated. Candidate peptides are fully tryptic with up to 2 missed cleavages. X!Tandem was used first to build a spectral library of unmodified peptides.. No variable modifications were configured for this search and the peptides carrying a default X!Tandem modification (acetylation of protein N-terminus, ammonium loss of glutamine and cysteine, water loss of glutamic acid) were excluded from the library. This search was run separately since we wanted to measure the execution time required to build an unmodified spectral library (X!Tandem search time goes up with the number of modifications). Next, we configured X!Tandem to search for variable modifications. Since X!Tandem cannot handle more than one modification per amino acid and we wanted to study both methylation and ubiquitination (GlyGly) on lysine, we had to perform the variable modification search in two runs: the first one with oxidation of methionine, phosphorylation of serine, threonine and tyrosine, methylation of lysine as variable modifications and the second one with ubiquitination (GlyGly) on lysine and hexose on asparagine. In all searches the refinement option was turned off. The X!Tandem result files were converted to the pepXML format by the Tandem2XML program from the trans proteomic pipeline (TPP) 55(version 4.7) and these pepXML files were read using the MzJava (mzjava.expasy.org, version 1.1) parser and converted to Hadoop files for further processing. Reversed protein sequences were concatenated to the .fasta sequence file and the global FDR was calculated as two times the number of reversed sequence hits divided by the number of target and decoy hits 56, which yields a more conservative FDR estimation 57. Calculation of the FDR in subgroups of the PSMs can be problematic for 4 reasons: 1) PSMs in the subgroup may show a score distribution different from the global one. 2) The prior probabilities of obtaining a correct or incorrect hit may be different in the subgroup compared to all PSMs. 3) The

ACS Paragon Plus Environment

Page 8 of 33

Page 9 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

estimation of the group FDR may be difficult if the group size is small. 4) For more complex subgroups (e.g. those coming from selection with multiple criteria) the decoy PSMs may not accurately mimic the incorrect target PSMs 58. We estimated the group FDR using a method proposed by Fu & Qian 59. Their calculation tackles problems 1)-3) and infers the group FDR from the global FDR using an exact mathematical relationship. One important parameter in this relationship - the probability that a decoy hit of a given score belongs to the subgroup - is computed in a robust way, which yields more reliable FDR estimates for small subgroups.

Workflow overview Figure 1 summarizes the Liberator/MzMod workflow we applied in this paper. First spectra and PSM are read from text file and converted to Hadoop files. Liberator is executed to create the spectrum library and export the unidentified spectra. MzMod iterates through all unidentified query spectra and matches them against the spectra from the library. All library spectra within the allowed OMS mass tolerance are aligned to the query spectra and scores are calculated to evaluate these alignments. Finally, using the hits to a library of decoy spectra, the FDR is calculated and all hits passing the FDR threshold are exported to a text file. Below we provide more detailed descriptions of these steps.

Spectrum library creation. All unmodified (i.e. cysteine carbamidomethylated, but otherwise no modifications) PSMs found in the X!Tandem search were compiled to build a library of annotated consensus MS/MS spectra by our in-house tool Liberator version 2.0. Liberator is written in Java and based on the MzJava class library (mzjava.expasy.org). In this work .mgf and .pepXML files were parsed, but the tool is able to read the several spectrum file (e.g .mgf, mzXML, .mzML, .msp) and PSM file formats (e.g. pepXML, mzIdentML, Andromeda, MODa). The tool is modular and the classes dealing with main the algorithms can be set in a configuration file. In brief, Liberator first reads the list of spectra and PSMs links peptide ions to the matching spectra , for each peptide ion combines the matching spectra to a consensus spectrum, and calculates a decoy spectrum for each consensus. Supplementary Figure 1A shows an overview of the Liberator process: Spark RDDs (blue boxes) are divided into partitions (filled blue boxes), which are sent to the different computer cluster node CPUs. filter and map operations can be performed efficiently separately on each partition, whereas rightOuterJoin and groupByKey operations require exchange of objects between

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 33

partitions. The processing steps of the Liberator tool and the parameters used in this work are now described in more detail: 1) Spectrum extraction: All MS/MS spectra with more than 20 peaks and a determined precursor charge between 1 and 3 were extracted from the Hadoop file. 2) PSM extraction: all unmodified PSMs with charge between 1 and 3 and a X!Tandem hyperscore higher than 35.52 corresponding to a global FDR of 1% and with a sequence length between 7 and 40 amino acids are extracted from the Hadoop file. 3) Joining spectra and PSMs: here all matching spectra are linked to their peptide ions with the Spark rightOuterJoin operation. Spectra which did not match an unmodified peptide at the FDR threshold are put aside and written to a Hadoop file. Then all matching spectra with the same peptide ion key (i.e. same charge, same peptide sequence, same modifications) are grouped together using the Spark groupByKey function. 4) Consensus creation: this is the core of the Liberator tool. First, all PSMs of a given peptide key are sorted by their hyperscore and at most the 300 highest scoring PSMs are retained. Then the similarity matrix using the cosine similarity of the square root transformed retained spectra and the minimal spanning tree (MST) are calculated. The MST is cut at a similarity threshold of 0.6 and only the part of the tree with the highest total hyperscore is kept. This step discards bad PSMs from a search engine with low spectrum similarity to the good PSMs. The spectrum nodes of this pruned tree are then used to calculate the consensus spectrum. Peaks are inserted into the consensus spectrum if they are detected in at least 20% of the spectra in the pruned tree or if they correspond to a b- or y-ion, and the intensities of consensus peaks are summed over all spectra. The resulting consensus spectrum is then cleaned by a filter, which only accepts fragment peaks if they are among the 2 highest peaks in a local ±10Da window around the peak. 5) Write library to Hadoop file: all consensus spectra and their annotations are written to a Hadoop file. 6) Decoy creation: decoy spectra are created with the Deliberator tool 60, which is also based on the MzJava classes to create randomized spectra. Briefly, the peptide sequence of the PSM peptide is shuffled and the b- or y-ion peaks of the spectra are reset to the values corresponding to the shuffled sequence. If the b- or y-ions of the shuffled peptide are too similar to the ions of the original peptide (cosine similarity must be smaller than 0.8), the process is repeated up to 10 times. The not annotated peaks are moved by a mass shift of 7 Da with probability of 0.2 and left

ACS Paragon Plus Environment

Page 11 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

in place otherwise, but precursor peaks are left untouched. The decoy spectra and their annotations are written to a Hadoop file.

MODa search

The OMS using MODa version 1.23 14 was run on the same UniProt database as the X!Tandem search with a precursor mass tolerance of 10ppm and fragment mass tolerance of 0.05Da. The instrument type was set to QTOF (the only options are QTOF and ion trap and QTOF was assumed the better choice for HCD data). The high mass resolution option was turned on and correction of precursor masses was switched off. MODa was run in OMS or blind search mode with one allowed modification in the mass range 0-400Da per peptide. Peptides were required to be fully tryptic with up to 2 missed cleavage sites, and all cysteine residues were carbamidomethylated. MODa results files were parsed using an in house program available in the MzJava class library (class MODaPsmReader). PSMs with equal scores were given the same rank. Therefore for some spectra several peptides come up in the first rank. Decoy hits were obtained by the reversed protein sequences attached to the .fasta file.

MzMod search

MzMod is a completely redesigned version of the QuickMod tool. It runs the code in parallel based on the Apache Spark framework, and uses much faster spectrum alignment and better scoring algorithms from the MzJava class library. Further MzMod is implemented in a modular way, which facilitates to test and adapt its functionality. MzMod reads the sequence file containing the not identified query spectra and splits them into several partitions, which are then analyzed in parallel (Supplementary Figure 1B). The algorithm runs through each partition sequentially and for each query spectrum it extracts the library spectra, which are within the mass tolerance of the OMS. It then aligns the query and library spectra, calculates the alignment scores and exports all hits which pass a FDR filter. Here we would like to present the different steps in more detail. 1) Reading query spectra: The query spectra have been sorted before storing them in the Hadoop file. The sort key contains the spectrum precursor charge and m/z value. A predicate can be set while reading for example to

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

select only spectra of sufficient quality. After reading query spectra are converted to IndexedSpectrum objects, which allow fast lookup of matching peaks. 2) Querying each partition: MzMod uses the Spark framework to split the list of query spectra into several partitions, which will be analyzed in parallel. Each partition is searched sequentially: for every query spectrum the library spectra, which fit into the mass tolerance of the OMS search, are retrieved from their sorted sequence file. Since both query and library spectra are sorted, the extraction of library spectra is incremental, i.e. for the next query spectrum only the head and tail of the retrieved list library spectra needs to be adjusted: the spectra at the tail of the list that fall out of the mass window are removed and the new spectra at the head that fall within the mass window have to be read from the sequence file. Every query spectrum partition has its own copy of the spectrum library (Supplementary Figure 1B). Library spectra are converted to IndexedPeptideSpectrum objects for fast lookup of matching peaks. 3) Fast alignment and scoring of query and library spectra: To align query and library spectra and score these alignments with the normalized dot product (NDP) or cosine score 61 we use a heuristic approach. For each spectrum-spectrum match (SSM) the algorithm runs through all library spectrum peaks and for each peak it picks the matching query spectrum peaks. ‘Matching’ means that the m/z value difference is either 0 or compatible with the precursor mass difference ∆M, i.e. for an annotated library peak with known charge Z the mass difference is ∆M/Z, where for a not annotated peak all charges from 1 to the precursor charge are considered. Next for all matching peak pairs the product of their intensity is calculated and the pairs are sorted in decreasing order by their intensity product. The pair with the highest intensity product is taken and all other pairs containing one of the two peaks drop out from the list, since it is required that no peak can match twice. Then the next highest scoring pair is processed in the same way and the algorithm continues until no pair is left. The weighted NDP score wNDPscore can be directly summed up from the list of the matching pairs, where a lower weight is given to pairs containing unannotated library peaks (weight = 2/3). At the moment this algorithm works for one modification per peptide, but it could be extended to several modifications by including a heuristic that extracts the allowed mass shifts from the query and library spectrum pair before the alignment. 4) Meta-scoring of query and library spectra: After alignment the N (N=12) best SSMs for every spectrum are kept and a more refined positioning score is calculated for all retained SSMs. These score calculations can be done on each partition separately without exchange of information between partitions (Supplementary Figure 1B). In

ACS Paragon Plus Environment

Page 12 of 33

Page 13 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

contrast to the alignment algorithm in step 4), the positioning score takes the library peptide sequence information into account and provides an alignment that is consistent, e.g. a b-ion and the complementary y-ion cannot be both modified for a single modification. For each peptide of length L the modification mass detected in step 4) is placed on every amino acid position i. First the unmodified theoretical spectrum S0 = {m0j| j ϵ all singly charged b- and yions} is calculated and then for each position i =1..L the modified theoretical spectrum Si = {mij| j ϵ all singly charged b- and y-ions}. For every position i and ion j the relative intensity Rij of the matching ion peaks in the query spectrum is calculated:  =

  

(Equation 1)

where Iij is the intensity of fragment ion mij in the query spectrum. The positioning score Pscore is then defined as

 = max − ∏∈  ∙ ∏∈1 −  

(Equation 2)

where M is the set of fragment ions in Si carrying the modification and  its complement. Finally the metascore Mscore is defined by  = !"#  +  ∙ %& 

(Equation 3)

where ICscore is the ratio of query spectrum peak intensities matching Si divided by the total intensity in the spectrum. Finally for every query spectrum we also calculate a delta score ∆Mscore, which is the difference of the highest Mscore of this spectrum and the runner up Mscore, and add this value to the Mscore.

(  =

 + ∆ 

(Equation 4)

5) FDR calculation: 500’000 query spectra are sampled and searched against the decoy database (see above). This number of sampled spectra gives accurate estimates of the FDR, which is calculated separately for modified and unmodified SSMs as well as different charge states. The decoy peptides are searched separately and the decoy score provide an estimate of the null distribution. The FDR is calculated by fitting the decoy distribution into the target distribution to determine the class probability π0 62. We make the conservative assumption that all hits with lowest scores are random hits drawn from the null distribution. Then we calculate the number of hits below a score

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 33

threshold Θ in both null (N0(Θ)) and target (NT(Θ)) distribution and move the score threshold Θ in steps i from the minimal score to the median score of the null distribution. Then we plot N0(Θi) against NT(Θi) and fit a straight line to these values. The slope of the line gives us the scaling factor with which we multiply N0(Θ) in order to estimate the number of false positives. 6) Result generation and export: Finally the results passing the FDR threshold are exported to a tabular text file.

Clustering of peptide lists

In order to cluster peptide lists we calculate a similarity score between two lists. Firstly a peptide list is reduced to the unique peptide sequences and the similarity score is calculated as two times the size of the intersection of the two lists divided by the sum of the sizes of the two lists. If the lists are equal then the score is 1, if they do not have any peptides in common the score is 0. The similarity matrix containing all tissue pair similarities is processed using the hierarchical clustering function hclust (agglomeration method set to “average”) in the R statistical environment (https://www.r-project.org). All plots in this paper were done in R.

Results and Discussion

Summary statistics for X!Tandem search on whole dataset

The 2200 X!Tandem variable modification searches were performed on the VitalIT cluster within a total CPU time of 40 hours without and 240 hours with variable modifications (Supplementary Table 1). All PSMs with a hyperscore higher than 35.52 corresponding to a global FDR of 1% at the spectrum level were accepted for further analysis (Supplementary Figure 2A). Notably, the FDR thresholds calculated for each run separately showed a high variation (Supplementary Figure 2B), and we preferred to apply a global threshold to guarantee similar PSM quality for all runs. This result shows that a run-by-run analysis of the MS/MS data can introduce quite a high variability and it is better to analyze all spectra as one block. There were over 5 million unmodified PSMs and 881’976 modified, where oxidation was the most abundant modification followed by phosphorylation and N-terminal acetylation.

ACS Paragon Plus Environment

Page 15 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Overall, doubly charged spectra provided most identifications followed by triply charged ones, whereas singly charged ones only contributed marginally. The target database X!Tandem hyperscore distributions showed quite a strong dependency on the modification state (Supplementary Figure 2C). The heavier modifications (GlyGly, hexose) produced the lowest scores, followed by methylation and phosphorylation, whereas oxidized peptides yielded a score distribution identical to unmodified peptides and acetylated peptides produced even slightly higher scores. The decoy score distributions obtained from the reversed protein sequences did not reveal such a dependency (Supplementary Figure 2D). Therefore, the FDR was calculated separately for each modification (group FDR, for details see Experimental Section section), where only singly modified peptides were included in these groups. In agreement with previously published results 59 our group FDR values showed a large variation for the modifications investigated, where phosphorylation, methylation, GlyGly and hexose FDRs were strongly increased (Table 1). In order to decrease these high error rates we applied a separate hyperscore threshold for each modification corresponding to a group FDR of 1%. These adapted thresholds strongly decreased the number of PSMs for phosphorylation, methylation, GlyGly, and hexose (Table 1). The PSM count for oxidation and for the unmodified PSMs slightly increased after applying a group FDR threshold of 1%. For protein N-terminal acetylation the group FDR always remained below 0.1% for all hyperscore values larger than 10. The decoy score density distribution for acetylated peptides is similar to the other decoy score distribution (Supplementary Figure 2D). However, compared to other modifications there were few acetylated decoy hits, which may be an artifact of using reversed protein sequences as decoys. In order to keep the PSM hyperscores at an acceptable level, we adopted a conservative approach and only applied the group threshold for modified PSMs if it was higher than the global score threshold. Along the same lines we calculated the group FDR for all PSMs with 0, 1 and 2 modifications (not counting carbamidomethlyation). Supplementary Figure 3A reveals that the error increases with the number of modifications and the group FDR in doubly modified PSMs is almost 10% and adjustment of the group FDR to 1% halved the number of doubly modified PSMs (Supplementary Figure 3B). After adjustment, singly modified peptides accounted for about 10% of the total PSMs and doubly modified PSMs for about 1%. In order to better distinguish wrong from true modified PSMs, we investigated how the presence of the unmodified form of a given peptide influences the identification error of modified spectra. We calculated the group FDR for all modified PSMs that also have a corresponding unmodified peptide match (shared peptide group, SPG). The number of decoy hits in the acetylation and oxidation SPG were so strongly reduced that the group FDR never

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 16 of 33

reached the 1% level even for the lowest hyperscores of 10. In order to maintain a minimal match quality for acetylation and oxidation SPG PSMs, we applied the global hyperscore threshold of 35.52 to these SPGs. For the other modifications the restriction to the SPG resulted in reduced thresholds at group FDR of 1% and increased number of hits (Table 1). These results suggest that for PTMs with a lower stoichiometry we can obtain more hits on large datasets by just searching for modifications on peptides that were already identified in their unmodified form. However, this strategy leads to a loss in the total numbers of hits for PTMs like N-terminal acetylation or oxidation with higher stoichiometry. The division of the modified PSMs into shared peptide group and its complement is reminiscent of the inclusion of the number of sibling modifications in the Bayesian model in iProphet

63

. However,

the type of the modification cannot be included in their model which could lead to biased results.

Summary statistics for MODa search on whole dataset

The MODa OMS search was performed on the whole UniProt and database on the VitalIT cluster. It required a total CPU time of 1697 hours (Supplementary Table 1). On average 70% of the spectra produced at least one PSM entry in the MODa result files without any score threshold filtering. Supplementary Figure 4 shows the target and decoy distributions of the MODa score. The score thresholds at an FDR of 1% were calculated separately for modified and unmodified PSMs. If only the highest ranking hits were considered, MODa identified 1’166’775 unmodified and 191’822 modified PSMs. About 2% of the MODa results showed several best PSMs per spectrum with equal scores, but different peptide sequences. Next we evaluated whether the reversed decoy sequences provide error estimates, which are consistent with the X!Tandem searches. Therefore we extracted the spectra with high confident hits in the SPG X!Tandem results (SPG FDR of 1%) and compared the identified X!Tandem and MODa peptide sequences (bare sequences stripped from modifications). If FDR estimates were accurate, we would expect less than 2% difference for the spectra that were identified by both tools. However, we found that the percentage of spectra, for which MODa picked a peptide sequence that is different to that found by X!Tandem, was always significantly higher, especially for modified peptides where the percentage was up to 20% (Supplementary Figure 5). Since there presumably is little error in the X!Tandem hits, we suspect that the FDR in the MODa search was underestimated. Even more stringent thresholds than the ones obtained with the reversed decoy sequences did not lower the disagreement to X!Tandem below 6% (Supplementary Figure 6). While Chick et al. 27 state that the FDR

ACS Paragon Plus Environment

Page 17 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

calculated by reversing database sequences were accurate for their OMS tool, our results indicate that these FDR values are too optimistic especially for the modified MODa results.

Summary statistics for Liberator/MzMod search on test and whole datasets

First the consensus creation algorithm of Liberator was compared to different approaches. Supplementary Figure 7 compares the consensus spectrum of human serum albumin peptide LDELRDEGK with the consensus spectrum from the NIST library (peptide.NIST.gov). The two spectra are nearly identical even though different data was used in both cases, indicating the reproducibility of peptide MS/MS spectra. Supplementary Figure 8 shows the cosine similarities between consensus spectra calculated with Liberator and SpectraST 64on a subset of the data. Most consensus spectra are very similar (cosine score > 0.8). The few differences are mainly due to the different filtering criteria applied, which lead to different numbers of peaks and reduced similarity scores. Creation of the spectrum library with Liberator took a bit more than 26 hours total CPU time, whereas the MzMod search of all 24 million spectra finished in 1526 hour- roughly the same time as MODa (Supplementary Table 1). Supplementary Figure 9 shows the MzMod Mscore (Equation 3) distribution for the highest ranking scores obtained from the adult liver data. The decoy distributions obtained by the decoy spectra method fit the target distributions quite well. In order to evaluate the accuracy of the MzMod FDR calculation using shuffled library spectra we compared the results to X!Tandem. If we assume that both X!Tandem and MzMod hits contain 1% error then we would expect that less than 2% of the spectra identified by both tools yield different peptide sequences. Supplementary Figure 10 reveals that this is approximately the case for both doubly and triply charged SSMs and the MzMod FDR is quite accurate. Also contrary to the results for X!Tandem and MODa the group FDRs calculated for the 6 different PTMs are not far from the global level of 1% (Supplementary Figure 11). Next we tested the power of the different MzMod scores (wNDPscore, Mscore, and dMscore) using the high scoring X!Tandem results (FDR of 1%) as a gold standard (Figure 2). The Mscore clearly performs better than the dot product score wNDPscore indicating the importance of the peptide sequence information for the alignment of modified and unmodified spectra. The number of conflicting peptide sequences found by MzMod and X!Tandem can be further reduced by combining the delta score and Mscore into dMscore (Equation 4). The reason for this improvement is that in many cases when MzMod reports a conflicting peptide in the first rank, the ‘true’ peptide is in the second rank with a close Mscore and

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

a low delta score. The dMscore favors those highest scoring SSMs, which have no close runner up score, and therefore suppresses these conflicts. Similar to delta scores used for PTM positioning 8 the dMscore increases accuracy at the cost of discarding some SSMs with high Mscore but low delta scores. However up to a conflict rate of 5% dMscore outperforms Mscore. In Figure 3 we compare the performance of MzMod Mscore and the MODa score. As outlined above the conflict rate between MODa and X!Tandem is never below 6%, where MzMod is always around 2%. In order to investigate these conflicting SSMs further, we calculated the singly charged b-ions for every peptide (including modifications) and redefined a conflict between two peptides if their sets of b-ion masses were different. With this b-ion similarity the number of conflicts between MODa and X!Tandem was strongly reduced (Figure 3 dashed lines) and in most conflicts MODa picked up a modified peptide with the same b-ions as the X!Tandem peptide. For example when MODa matched the peptide ELEEISERLEEAGGATSAQIEL(33.950)NK, X!Tandem found the peptide ELEEISERLEEAGGATSAQIEM(15.995)NK (numbers in parenthesis indicate modification mass in Da at this position). MzMod did not make this mistake since the unmodified peptide ELEEISERLEEAGGATSAQIELNK was never identified. For a ‘blind’ OMS tool the two alignments are almost identical since they are based on the same masses. MzMod is saved by its reduced search space, which prevents the erroneous alignment in this case. In another case both MODa and MzMod disagreed with the X!Tandem peptide sequence and b-ion masses. X!Tandem matched the spectrum to TTGIVM(O)DSGDGVTHIVPIYEGYALPHAILR. The not oxidized form of the peptide was also identified, along with the very similar peptide TTGIVMDSGDGVTHTVPIYEGYALPHAILR, which differs only in the highlighted amino acid (IT). Both MzMod and MODa picked the latter peptide with a mass shift of 28.006 Da near the N-terminus, which does not make much sense and the oxidation found by X!Tandem is the more probable match. This example illustrates that the blind alignment with arbitrary mass shifts is prone to give wrong results. More work on postprocessing OMS results is required to exclude these spurious matches and to strike a balance between being open and being blind.

Comparison of X!Tandem, MODa and MzMod

Next we compared the number of identified PSMs and the number of unique peptides for X!Tandem, MODa and MzMod. For X!Tandem we took the numbers from the SPG PSMs with the corresponding thresholds

ACS Paragon Plus Environment

Page 18 of 33

Page 19 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

(last row in Table 1). MODa and MzMod modified PSMs were filtered with a threshold corresponding to a FDR of 1% for modified PSMs or SSMs, respectively. The results in Figure 4 indicate that for all modifications searched by X!Tandem, it identified the largest number of spectra and unique peptides, followed by MzMod and MODa. This is qualitatively consistent with the results reported by Chick et al. 27, which were obtained with an adapted Sequest search tool 65 run in exact and OMS mode. Since the X!Tandem search used variable modifications on defined amino acids, it had to explore a smaller search space and encountered fewer ambiguities than MODa and MzMod, where any mass shift can sit on any amino acid. Therefore it is reasonable that XTandem requires less stringent thresholds at a given error rate and therefore finds more PSMs. The results reveal that the use of spectral libraries for OMS provides an advantage mainly due to the reduced search space. MzMod identified about 1/3 of the X!Tandem PSMs and about 4-5 times the number of MODa PSMs. The only exception being N-terminal acetylation, where the nonacetylated peptides are rarely identified and MzMod was not able to align the acetylated peptides to a spectrum library entry. The MzMod search identified a total 1’408’641 modified SSMs. Many of the detected modifications are situated between 0 and 100 Da and correspond to standard chemical modification mass shifts encountered in proteomics samples or wrongly assigned precursors (Supplementary Figure 12, Supplementary Table 2). Other frequent mass shifts can be introduced when spectra of abundant peptides, which differ only by one amino acid from a library peptide, are matched. These spectra were not matched in the X!Tandem search either because the hyperscore was just below threshold or the mutated peptide is not in the sequence database. 321’557 SSMs remained after removal of the modifications at +1Da, +16Da, +28Da, +32Da, +43Da, and +57Da. MODa produced a total of 206’725 PSMs, 92’916 of which were left after removal. X!Tandem, MzMod and MODa taken together matched about 2’215’212 spectra to modified peptides corresponding to 40% of the number of unmodified PSMs. Supplementary Figure 13 shows the Venn diagram for the modified PSMs identified by the 3 tools used in this study.

Linking PTMs to tissues

The unmodified peptides identified in the X!Tandem search are tissue specific. Clustering of the peptide lists (see Method Section) identified in each tissue revealed that the similarity dendrogram is comparable to the one obtained with RNA profiles 66 (see Experimental Section for details on clustering). Fetal tissues as well as immune system cells form clear clusters (Supplementary Figure 14A) The clustering profile of the phosphorylated peptides

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 33

identified by X!Tandem is similar but also reveals some interesting differences. For example fetal brain tissue now clusters with the adult frontal cortex and spinal cord tissues (Supplementary Figure 14B). The level of phosphorylation depends quite strongly on the tissue, with brain tissues and immune cells having the highest spectral counts, and adult kidney and liver tissues the lowest (Supplementary Figure 15). Lysine succinylation is a recently discovered PTM 21, which adds a succinyl group of mass of 100.0160 Da to lysine residues. This PTM reverses the charge of lysine from +1 to -1 and therefore has a strong effect on the binding properties and structure of the modified protein. MzMod found 180 distinct succinylated peptides matching 259 spectra, some of which are already annotated in UniProt. An MzMod hit was considered as a succinylation if the mass shift was within 0.005 Da of the succinylation mass and the modification was positioned on or immediately next to a lysine. It can be distinguished from peptides which are both carbamidomethylated and carbamylated (mass shift: 100.0273). The PSM count distribution (Figure 5) showed strong spikes in the adult frontal cortex, adult liver and adult heart tissues, which are known for their high metabolic rates 67 . These results indicate a connection between lysine succinylation and metabolic pathways consistent with previous reports 68. This preliminary analysis of the results indicates that it is possible to discover correlations between tissue types and PTMs in these data and will be followed by a further investigation. Inclusion of quantitative information from the MS1 level will hopefully yield more accurate PTM tissue profiles, and it may allow us mine the data for tissue specific modifications of individual peptides.

Conclusions

MS/MS data repositories are a valuable recourse for the discovery and annotation of PTMs. Standard search tools are able to search these datasets within reasonable time using large computer clusters and to extract thousands of modified proteins for preconfigured variable modifications. If the modification of interest and its amino acid substrates are known in advance then standard sequence search tools should be used to find them. Searching at a global FDR threshold, the error can be inflated in the subgroup of modified PSMs and depend on the type of modification. The error also increases with the number of modifications per peptide. Therefore we recommend calculating the FDR separately for each configured PTM and for different numbers of PTMs per peptide. For modifications of low stoichiometry it is again advantageous to calculate the local FDR for the group of modified

ACS Paragon Plus Environment

Page 21 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

PSMs, for which the unmodified peptide was also detected. Instead of calculating local FDRs one could also use a tool such as iProphet 63, which adapts the statistical model for the different subgroups. OMS is required if the modifications or their attachment sites in the proteins are unknown. The tools we present here, Liberator and MzMod, help building a spectral library based OMS workflow for big MS/MS data. Since they are implemented efficiently based on the MzJava class library and Apache Spark large scale computing framework, they are able to handle millions of spectra. Our results show that the reduced search space of the spectral library (compared to all tryptic peptides) and the improved MzMod scoring lead to a favorable performance for those modified peptides, whose unmodified peptide is likely to be present in the library. We only observed decreased performance for high stoichiometry modifications where the unmodified peptide is rarely detected. Based on the results summarized in Figure 4, we recommend to first search the high stoichiometry (e.g. Nterminal acetylation) and the most abundant (e.g. phosphorylation, methylation, lysine acetylation, ubiquitination and oxidation) PTMs in the sample as variable modifications using a sequence search tool. If it is not known which modifications are most abundant this information can be obtained from the ∆M-histogram of an OMS search on a subsample of the data. Then a spectrum library can be built on the sequence search results, which forms the basis to mine the data for unknown or unanticipated PTMs. Gaps in the spectrum library could be filled with publicly available library spectra or predicted spectra. If modified spectra are also inserted into the library, this approach will be able to pick up additionally modified versions of already modified spectra. The results of the OMS search themselves could be as well inserted into the spectra library in order to search for further multiply modified peptides. Such an approach could be very useful to study the correlation between PTMs events and PTM crosstalk. Future work will also deal with modification specific scoring systems, which make use of neutral loss peaks and intensity changes induced by PTMs. Further post-processing tools to discard ‘impossible’ alignments or to ‘blast’ alignments in order to check whether they could as well be interpreted as a sequence variant without the need for a PTM will help to analyze the long OMS result lists of large datasets. Preliminary analysis of the results revealed many connections between PTMs and human tissues. So far this analysis is based on spectral counting, which is not a very accurate quantitative method for single peptides. These spectral counts should be complemented by more accurate quantitative information from MS1 peak volumes. With the availability of SWATH data 69 in public repositories the information gained in the standard sequence database

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 33

searches and OMS in the form of spectral libraries could be used to extract accurate quantitative information, which is needed to increase the sensitivity to detect relationships between tissues and PTMs.

Availability Liberator and MzMod are open-source tools distributed under the AGPL v3.0 license. They require Java 1.8 and Apache Spark 1.3 and Apache Hadoop 2.3 or higher. The code and a description how to run it on LSF are available on https:// bitbucket.org/sib-pig/mzmod.

Acknowledgments We would like to thank the VitalIT team for their kind and efficient support.

ACS Paragon Plus Environment

Page 23 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

References

(1) (2) (3)

(4) (5) (6) (7)

(8)

(9)

(10) (11) (12) (13)

(14) (15)

(16)

(17)

(18)

Bannister, A. J.; Kouzarides, T. Regulation of chromatin by histone modifications. Cell Res. 2011, 21 (3), 381–395. Beltrao, P.; Bork, P.; Krogan, N. J.; van Noort, V. Evolution and functional cross‐talk of protein post‐translational modifications. Mol. Syst. Biol. 2013, 9 (1). Minguez, P.; Parca, L.; Diella, F.; Mende, D. R.; Kumar, R.; Helmer‐Citterich, M.; Gavin, A.‐C.; Noort, V. van; Bork, P. Deciphering a global network of functionally associated post‐translational modifications. Mol. Syst. Biol. 2012, 8 (1). Izzo, A.; Schneider, R. Chatting histone modifications in mammals. Brief. Funct. Genomics 2010, 9 (5‐6), 429–443. Olsen, J. V.; Mann, M. Status of Large‐scale Analysis of Post‐translational Modifications by Mass Spectrometry. Mol. Cell. Proteomics 2013, 12 (12), 3444–3452. Tsur, D.; Tanner, S.; Zandi, E.; Bafna, V.; Pevzner, P. A. Identification of post‐translational modifications by blind search of mass spectra. Nat. Biotechnol. 2005, 23 (12), 1562–1567. Beausoleil, S. A.; Villen, J.; Gerber, S. A.; Rush, J.; Gygi, S. P. A probability‐based approach for high‐ throughput protein phosphorylation analysis and site localization. Nat Biotech 2006, 24 (10), 1285–1292. Savitski, M. M.; Lemeer, S.; Boesche, M.; Lang, M.; Mathieson, T.; Bantscheff, M.; Kuster, B. Confident Phosphorylation Site Localization Using the Mascot Delta Score. Mol. Cell. Proteomics MCP 2011, 10 (2). Boyne, M. T.; Garcia, B. A.; Li, M.; Zamdborg, L.; Wenger, C. D.; Babai, S.; Kelleher, N. L. Tandem Mass Spectrometry with Ultrahigh Mass Accuracy Clarifies Peptide Identification by Database Retrieval. J. Proteome Res. 2009, 8 (1), 374–379. Ahrné, E.; Müller, M.; Lisacek, F. Unrestricted identification of modified proteins using MS/MS. Proteomics 2010, 10 (4), 671–686. Na, S.; Paek, E. Software eyes for protein post‐translational modifications. Mass Spectrom. Rev. 2014. Pevzner, P. A.; Dančík, V.; Tang, C. L. Mutation‐Tolerant Protein Identification by Mass Spectrometry. J. Comput. Biol. 2000, 7 (6), 777–787. Na, S.; Jeong, J.; Park, H.; Lee, K.‐J.; Paek, E. Unrestrictive Identification of Multiple Post‐ translational Modifications from Tandem Mass Spectrometry Using an Error‐tolerant Algorithm Based on an Extended Sequence Tag Approach. Mol. Cell. Proteomics 2008, 7 (12), 2452–2463. Na, S.; Bandeira, N.; Paek, E. Fast Multi‐Blind Modification Search Through Tandem Mass Spectrometry. Mol. Cell. Proteomics 2012, 11 (4). Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A. ModifiComb, a New Proteomic Tool for Mapping Substoichiometric Post‐translational Modifications, Finding Novel Types of Modifications, and Fingerprinting Complex Protein Mixtures. Mol. Cell. Proteomics 2006, 5 (5), 935–948. Falkner, J. A.; Falkner, J. W.; Yocum, A. K.; Andrews, P. C. A Spectral Clustering Approach to MS/MS Identification of Post‐Translational Modifications. J. Proteome Res. 2008, 7 (11), 4614– 4622. Ye, D.; Fu, Y.; Sun, R.‐X.; Wang, H.‐P.; Yuan, Z.‐F.; Chi, H.; He, S.‐M. Open MS/MS spectral library search to identify unanticipated post‐translational modifications and increase spectral identification rate. Bioinformatics 2010, 26 (12), i399–i406. Ahrné, E.; Nikitin, F.; Lisacek, F.; Müller, M. QuickMod: A Tool for Open Modification Spectrum Library Searches. J. Proteome Res. 2011, 10 (7), 2913–2921.

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(19) (20) (21) (22) (23)

(24)

(25)

(26)

(27)

(28)

(29) (30)

(31)

(32)

(33)

(34)

Ma, C. W. M.; Lam, H. Hunting for Unexpected Post‐Translational Modifications by Spectral Library Searching with Tier‐Wise Scoring. J. Proteome Res. 2014, 13 (5), 2262–2271. Mukherjee, S.; Hao, Y.‐H.; Orth, K. A newly discovered post‐translational modification‐‐the acetylation of serine and threonine residues. Trends Biochem. Sci. 2007, 32 (5), 210–216. Zhang, Z.; Tan, M.; Xie, Z.; Dai, L.; Chen, Y.; Zhao, T. Identification of lysine succinylation as a new post‐translational modification. Nat. Chem. Biol. 2011, 7 (1), 58–63. Arnaudo, A. M.; Garcia, B. A. Proteomic characterization of novel histone post‐translational modifications. Epigenetics Chromatin 2013, 6 (1), 24. Chen, Y.; Sprung, R.; Tang, Y.; Ball, H.; Sangras, B.; Kim, S. C.; Falck, J. R.; Peng, J.; Gu, W.; Zhao, Y. Lysine Propionylation and Butyrylation Are Novel Post‐translational Modifications in Histones. Mol. Cell. Proteomics 2007, 6 (5), 812–819. Tan, M.; Luo, H.; Lee, S.; Jin, F.; Yang, J. S.; Montellier, E.; Buchou, T.; Cheng, Z.; Rousseaux, S.; Rajagopal, N.; et al. Identification of 67 Histone Marks and Histone Lysine Crotonylation as a New Type of Histone Modification. Cell 2011, 146 (6), 1016–1028. Zhang, Y.; Muller, M.; Xu, B.; Yoshida, Y.; Horlacher, O.; Nikitin, F.; Garessus, S.; Magdeldin, S.; Kinoshita, N.; Fujinaka, H.; et al. Unrestricted modification search reveals lysine methylation as major modification induced by tissue formalin fixation and paraffin embedding. PROTEOMICS 2015, 15 (15), 2568–2579. Yang, M.‐K.; Yang, Y.‐H.; Chen, Z.; Zhang, J.; Lin, Y.; Wang, Y.; Xiong, Q.; Li, T.; Ge, F.; Bryant, D. A.; et al. Proteogenomic analysis and global discovery of posttranslational modifications in prokaryotes. Proc. Natl. Acad. Sci. U. S. A. 2014. Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. A mass‐ tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotechnol. 2015, 33 (7), 743–749. Beltrao, P.; Albanèse, V.; Kenner, L. R.; Swaney, D. L.; Burlingame, A.; Villén, J.; Lim, W. A.; Fraser, J. S.; Frydman, J.; Krogan, N. J. Systematic Functional Prioritization of Protein Posttranslational Modifications. Cell 2012, 150 (2), 413–425. Landry, C. R.; Levy, E. D.; Michnick, S. W. Weak functional constraints on phosphoproteomes. Trends Genet. 2009, 25 (5), 193–197. Henriksen, P.; Wagner, S. A.; Weinert, B. T.; Sharma, S.; Bačinskaja, G.; Rehman, M.; Juffer, A. H.; Walther, T. C.; Lisby, M.; Choudhary, C. Proteome‐wide Analysis of Lysine Acetylation Suggests its Broad Regulatory Scope in Saccharomyces cerevisiae. Mol. Cell. Proteomics 2012, 11 (11), 1510– 1522. Levy, E. D.; Michnick, S. W.; Landry, C. R. Protein abundance is key to distinguish promiscuous from functional phosphorylation based on evolutionary information. Philos. Trans. R. Soc. B Biol. Sci. 2012, 367 (1602), 2594–2606. Weinert, B. T.; Iesmantavicius, V.; Moustafa, T.; Schölz, C.; Wagner, S. A.; Magnes, C.; Zechner, R.; Choudhary, C. Acetylation dynamics and stoichiometry in Saccharomyces cerevisiae. Mol. Syst. Biol. 2014, 10 (1). Perez‐Riverol, Y.; Alpi, E.; Wang, R.; Hermjakob, H.; Vizcaíno, J. A. Making proteomics data accessible and reusable: Current state of proteomics databases and repositories. PROTEOMICS 2014, n/a – n/a. Farrah, T.; Deutsch, E. W.; Omenn, G. S.; Sun, Z.; Watts, J. D.; Yamamoto, T.; Shteynberg, D.; Harris, M. M.; Moritz, R. L. State of the human proteome in 2013 as viewed through PeptideAtlas: comparing the kidney, urine, and plasma proteomes for the biology‐ and disease‐driven Human Proteome Project. J. Proteome Res. 2014, 13 (1), 60–75.

ACS Paragon Plus Environment

Page 24 of 33

Page 25 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(35)

(36)

(37)

(38)

(39)

(40) (41)

(42) (43) (44) (45) (46) (47)

(48) (49)

(50)

(51) (52)

Journal of Proteome Research

Wilhelm, M.; Schlegl, J.; Hahne, H.; Gholami, A. M.; Lieberenz, M.; Savitski, M. M.; Ziegler, E.; Butzmann, L.; Gessulat, S.; Marx, H.; et al. Mass‐spectrometry‐based draft of the human proteome. Nature 2014, 509 (7502), 582–587. Kim, M.‐S.; Pinto, S. M.; Getnet, D.; Nirujogi, R. S.; Manda, S. S.; Chaerkady, R.; Madugundu, A. K.; Kelkar, D. S.; Isserlin, R.; Jain, S.; et al. A draft map of the human proteome. Nature 2014, 509 (7502), 575–581. Wang, M.; Weiss, M.; Simonovic, M.; Haertinger, G.; Schrimpf, S. P.; Hengartner, M. O.; Mering, C. von. PaxDb, a Database of Protein Abundance Averages Across All Three Domains of Life. Mol. Cell. Proteomics 2012, 11 (8), 492–500. Naegle, K. M.; Gymrek, M.; Joughin, B. A.; Wagner, J. P.; Welsch, R. E.; Yaffe, M. B.; Lauffenburger, D. A.; White, F. M. PTMScout, a Web Resource for Analysis of High Throughput Post‐translational Proteomics Studies. Mol. Cell. Proteomics 2010, 9 (11), 2558–2570. Hornbeck, P. V.; Kornhauser, J. M.; Tkachev, S.; Zhang, B.; Skrzypek, E.; Murray, B.; Latham, V.; Sullivan, M. PhosphoSitePlus: a comprehensive resource for investigating the structure and function of experimentally determined post‐translational modifications in man and mouse. Nucleic Acids Res. 2011, gkr1122. Matic, I.; Ahel, I.; Hay, R. T. Reanalysis of phosphoproteomics data uncovers ADP‐ribosylation sites. Nat. Methods 2012, 9 (8), 771–772. Zhou, T.; Xia, X.; Liu, J.; Wang, G.; Guo, Y.; Guo, X.; Wang, X.; Sha, J. Beyond single modification: Reanalysis of the acetylproteome of human sperm reveals widespread multiple modifications. J. Proteomics 2015, 126, 296–302. Sun, G.; Jiang, M.; Zhou, T.; Guo, Y.; Cui, Y.; Guo, X.; Sha, J. Insights into the lysine acetylproteome of human sperm. J. Proteomics 2014, 109, 199–211. Dean, J.; Ghemawat, S. MapReduce: Simplified Data Processing on Large Clusters. Commun ACM 2008, 51 (1), 107–113. Zaharia, M.; Chowdhury, M.; Franklin, M. J.; Shenker, S.; Stoica, I. Spark: cluster computing with working sets. Proc. 2nd USENIX Conf. Hot Top. Cloud Comput. 2010. Taylor, R. C. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics 2010, 11 (Suppl 12), S1. Pratt, B.; Howbert, J. J.; Tasman, N. I.; Nilsson, E. J. MR‐Tandem: parallel X!Tandem using Hadoop MapReduce on Amazon Web Services. Bioinformatics 2012, 28 (1), 136–137. Kalyanaraman, A.; Cannon, W. R.; Latt, B.; Baxter, D. J. MapReduce Implementation of a Hybrid Spectral Library‐Database Search Method for Large‐scale Peptide Identification. Bioinformatics 2011. Hung, C.‐L.; Hua, G.‐J. Cloud Computing for Protein‐Ligand Binding Site Comparison. BioMed Res. Int. 2013, 2013. Wiewiórka, M. S.; Messina, A.; Pacholewska, A.; Maffioletti, S.; Gawrysiak, P.; Okoniewski, M. J. SparkSeq: fast, scalable and cloud‐ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics 2014, 30 (18), 2652–2653. Freeman, J.; Vladimirov, N.; Kawashima, T.; Mu, Y.; Sofroniew, N. J.; Bennett, D. V.; Rosen, J.; Yang, C.‐T.; Looger, L. L.; Ahrens, M. B. Mapping brain activity at scale with cluster computing. Nat. Methods 2014, 11 (9), 941–950. Horlacher, O.; Nikitin, F.; Alocci, D.; Mariethoz, J.; Müller, M.; Lisacek, F. MzJava: An open source library for mass spectrometry data processing. J. Proteomics. Vizcaíno, J. A.; Côté, R. G.; Csordas, A.; Dianes, J. A.; Fabregat, A.; Foster, J. M.; Griss, J.; Alpi, E.; Birim, M.; Contell, J.; et al. The Proteomics Identifications (PRIDE) database and associated tools: status in 2013. Nucleic Acids Res. 2013, 41 (D1), D1063–D1069.

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(53) (54) (55)

(56) (57) (58) (59) (60)

(61) (62) (63)

(64)

(65)

(66)

(67)

(68)

(69)

Page 26 of 33

Kessner, D.; Chambers, M.; Burke, R.; Agus, D.; Mallick, P. ProteoWizard: open source software for rapid proteomics tools development. Bioinformatics 2008, 24 (21), 2534–2536. Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466–1467. Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74 (20), 5383–5392. Elias, J. E.; Gygi, S. P. Target‐decoy search strategy for increased confidence in large‐scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207–214. Jeong, K.; Kim, S.; Bandeira, N. False discovery rates in spectral identification. BMC Bioinformatics 2012, 13 (Suppl 16), S2. Everett, L. J.; Bierl, C.; Master, S. R. Unbiased Statistical Analysis for Multi‐Stage Proteomic Search Strategies. J. Proteome Res. 2010, 9 (2), 700–707. Fu, Y.; Qian, X. Transferred Subgroup False Discovery Rate for Rare Post‐translational Modifications Detected by Mass Spectrometry. Mol. Cell. Proteomics 2014, 13 (5), 1359–1368. Ahrné, E.; Ohta, Y.; Nikitin, F.; Scherl, A.; Lisacek, F.; Müller, M. An improved method for the construction of decoy peptide MS/MS spectra suitable for the accurate estimation of false discovery rates. PROTEOMICS 2011, 11 (20), 4085–4095. Stein, S. E.; Scott, D. R. Optimization and testing of mass spectral library search algorithms for compound identification. J. Am. Soc. Mass Spectrom. 1994, 5 (9), 859–866. Storey, J. D. A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol. 2002, 64 (3), 479–498. Shteynberg, D.; Deutsch, E. W.; Lam, H.; Eng, J. K.; Sun, Z.; Tasman, N.; Mendoza, L.; Moritz, R. L.; Aebersold, R.; Nesvizhskii, A. I. iProphet: Multi‐level Integrative Analysis of Shotgun Proteomic Data Improves Peptide and Protein Identification Rates and Error Estimates. Mol. Cell. Proteomics 2011, 10 (12). Lam, H.; Deutsch, E. W.; Eddes, J. S.; Eng, J. K.; King, N.; Stein, S. E.; Aebersold, R. Development and validation of a spectral library searching method for peptide identification from MS/MS. PROTEOMICS 2007, 7 (5), 655–667. Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5 (11), 976–989. Son, C. G.; Bilke, S.; Davis, S.; Greer, B. T.; Wei, J. S.; Whiteford, C. C.; Chen, Q.‐R.; Cenacchi, N.; Khan, J. Database of mRNA gene expression profiles of multiple human organs. Genome Res. 2005, 15 (3), 443–450. Wang, Z.; Ying, Z.; Bosy‐Westphal, A.; Zhang, J.; Schautz, B.; Later, W.; Heymsfield, S. B.; Müller, M. J. Specific metabolic rates of major organs and tissues across adulthood: evaluation by mechanistic model of resting energy expenditure1234. Am. J. Clin. Nutr. 2010, 92 (6), 1369–1377. Rardin, M. J.; He, W.; Nishida, Y.; Newman, J. C.; Carrico, C.; Danielson, S. R.; Guo, A.; Gut, P.; Sahu, A. K.; Li, B.; et al. SIRT5 Regulates the Mitochondrial Lysine Succinylome and Metabolic Networks. Cell Metab. 2013, 18 (6), 920–933. Gillet, L. C.; Navarro, P.; Tate, S.; Röst, H.; Selevsek, N.; Reiter, L.; Bonner, R.; Aebersold, R. Targeted Data Extraction of the MS/MS Spectra Generated by Data‐independent Acquisition: A New Concept for Consistent and Accurate Proteome Analysis. Mol. Cell. Proteomics 2012, 11 (6).

ACS Paragon Plus Environment

Page 27 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Table 1: Number of PSMs Resulting from Different Ways to Calculate the FDR

Unmodified

Phosphorylation

Methylation

Acetylation

Oxidation

0.53

33.9

30.5

0.02

0.53

62.5

100

4’503’248

24’980

18’053

60’128

546’331

12’368

3’365

5’477’245

7’345

3’723

60’128*

546’331*

2’004

198

-

12’065

8’928

18’383*

538’108*

4’034

203

PTM group FDR (%) at global FDR = 1% Number of PSMs in PTM group at global FDR = 1% Number of PSMs in PTM group at group FDR = 1% Number of PSMs in shared peptide group at group FDR = 1% *Global FDR = 1% threshold (see main text)

ACS Paragon Plus Environment

GlyGly

Hexose

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 1 Liberator/MzMod flowchart. Orange boxes represent Hadoop files, light blue boxes Apache Spark RDDs, and dark blue boxes text files.

ACS Paragon Plus Environment

Page 28 of 33

Page 29 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 2 Comparison to the X!Tandem results. The x-axis represents the rank of the MzMod scores wNDPscore, Mscore, or dMscore . The y-axis indicates the number of conflicting SSMs with score lower than the score at a given rank. A conflict occurs if the peptides identified by X!Tandem and MzMod for a given spectrum have different sequences. The sloping lines indicate conflict rates of 0.01, 0.02, 0.03, 0.04, 0.05 and 0.1.

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 3 Comparison of MODa and MzMod to the X!Tandem results. The x-axis represents the rank of the MzMod score Mscore or the MODa score. The y-axis indicates the number of conflicting SSMs with score lower than the score at the given rank. A conflict between the peptides identified by X!Tandem and MzMod or MODa for a given spectrum is evaluated in two ways: first by sequence comparison (solid lines) and second by comparison of the masses of singly charged b-ions (dashed lines). The sloping lines indicate conflict rates of 0.01, 0.02, 0.03, 0.04, 0.05 and 0.1.

ACS Paragon Plus Environment

Page 30 of 33

Page 31 of 33

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Proteome Research

Figure 4 Comparison of X!Tandem (light gray), MODa (gray) and MzMod (dark gray) for PSMs phosphorylated on STY, Nterminal acetylated, methylated on K, oxidized on M, ubiquitinated (GlyGly) on K, and with a hexose on N. The plot shows both number of PSMs (left bar) and number of unique peptides (right bar).

ACS Paragon Plus Environment

Journal of Proteome Research

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 5 Lysine succinylation PSM count identified by MzMod as a function of tissue

ACS Paragon Plus Environment

Page 32 of 33

Page 33 of 33

for TOC only

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42

Journal of Proteome Research

ACS Paragon Plus Environment