Article pubs.acs.org/jpr

SpecOMS: A Full Open Modification Search Method Performing All-to-All Spectra Comparisons within Minutes

Matthieu David,†,‡ Guillaume Fertin,*,† Hélène Rogniaux,‡ and Dominique Tessier*,‡

† LS2N UMR CNRS 6004, Université de Nantes, F-44300 Nantes, France
‡ INRA UR1268 Biopolymères Interactions Assemblages, F-44316 Nantes, France



S Supporting Information *

ABSTRACT: The analysis of discovery proteomics experiments relies on algorithms that identify peptides from their tandem mass spectra. The almost exhaustive interpretation of these spectra remains an unresolved issue. At present, a significant number of missing interpretations is probably due to peptides displaying post-translational modifications and variants that yield spectra that are particularly difficult to interpret. However, the emergence of a new generation of mass spectrometers that provide high fragment ion accuracy has paved the way for more efficient algorithms. We present new software, SpecOMS, that can handle the computational complexity of pairwise comparisons of spectra in the context of large volumes. SpecOMS can compare a whole set of experimental spectra generated by a discovery proteomics experiment to a whole set of theoretical spectra deduced from a protein database in a few minutes on a standard workstation. SpecOMS exploits those capabilities to improve the peptide identification process, allowing strong competition between all possible peptides for spectrum interpretation. Remarkably, this software resolves the drawbacks (i.e., efficiency problems and decreased sensitivity) that usually accompany open modification searches. We highlight this promising approach using results obtained from the analysis of a public human data set downloaded from the PRIDE (PRoteomics IDEntification) database.

KEYWORDS: bioinformatics, proteomics, MS/MS, open modification search, pattern-matching, algorithms, PTM, peptide identification, spectra comparison



Received: May 17, 2017. Published: June 29, 2017.
Journal of Proteome Research. DOI: 10.1021/acs.jproteome.7b00308. J. Proteome Res. 2017, 16, 3030−3038.
© 2017 American Chemical Society

INTRODUCTION

The identification of tandem mass spectra obtained in discovery proteomics experiments remains a formidable challenge. Indeed, a recent study covering all complete proteome experiments deposited into the PRIDE (PRoteomics IDEntification) database1 up to April 2015 showed that on average 75% of the spectra analyzed by mass spectrometry remained unidentified.2 Another study suggests that half of the false-negative hits may be due to modified peptides and that these misidentifications considerably impact protein identification and quantification.3

The most common strategy for the interpretation of MS/MS spectra consists of comparing the experimental spectra to a set of ideal spectra (also called theoretical spectra) extrapolated from the predicted fragmentation of peptides derived from a protein database. A score function determines the best candidate(s) for each spectrum and ranks the list of peptide-spectrum matches (PSMs) in decreasing order based on the assumption that the most reliable PSMs have a higher score. A threshold based on a measure of the statistical significance determines which PSMs are accepted for the inference of proteins according to a given false discovery rate (FDR).4

Algorithms need to process the huge number of spectra obtained in MS/MS experiments quickly. Therefore, the standard approach reduces the number of theoretical spectra to take into account in the comparisons by selecting peptides whose masses are within a certain range of the experimental spectrum mass. At this step, some post-translational modifications (PTMs) may be added as “variable modifications” (VMs); however, these modifications need to be user-defined prior to the search. Moreover, there are limitations on the number of VMs that can reasonably be included in an analysis because the inclusion of too many VMs leads to longer search times and more false-positive identifications.5 In practice, no more than three or four VMs are considered in a typical analysis. In most cases, methionine oxidation is the only specified VM in database searches, although the Unimod databank6 describes more than 1000 possible PTMs. Moreover, PTM databases are probably far from complete because not all rare or labile PTMs have been discovered to date.

To refine our understanding of the origin of spectral misinterpretations, we previously developed the algorithm SpecXtract,7 which is able to compute the number of common (or shared) fragment masses for any pair of spectra (se,st) such that se ∈ SE and st ∈ ST, where SE and ST refer to the sets of experimental and theoretical spectra, respectively. SpecXtract relies on a well-designed data structure (SpecTrees), which allows very fast computations, even on a standard workstation. The number of shared masses is a very simple score to evaluate the similarity of two MS/MS spectra and is the key feature on which more elaborate scores are built. A high number of shared masses is largely accepted to lead to more reliable identification. Therefore, we started this study with the assumption that the number of shared masses could be precise enough to find PSMs given the mass accuracy provided by the new generation of mass spectrometers. This is the score we will use throughout this article.

Once SpecTrees has reported all of the pairs (se,st) that share at least a given number of fragment ion masses, we would like to explain the mass difference (or mass delta) between st and se. For this purpose, we developed a new algorithm, called SpecFit, that analyzes the pairs provided by SpecXtract. Several factors can explain a mass delta between spectra, including the incomplete digestion of the protein, which leads to a missed cleavage and results in a peptide whose length is greater than expected, or truncation of the peptide at one of its extremities, which produces a semitryptic peptide (due to proteolysis by endogenous proteases present in the sample or in-source fragmentation within the mass spectrometer8). Finally, a mass delta could also be explained by one or more PTMs or variants. Altogether, SpecTrees, SpecXtract, and SpecFit constitute a software workflow referred to as SpecOMS in this manuscript.

SpecOMS, which does not require mass filtering before comparing the spectra, is a type of software solution developed for an open modification search (OMS), which is also sometimes referred to as a “blind search” or “unrestrictive search”. These tools are designed to discover unpredictable PTMs and variants that affect the primary amino acid sequence. Traditionally, OMS algorithms are computationally intensive and therefore are mainly performed on spectra that remain unidentified after a first search with a small subset of variable PTMs. Moreover, the search space is always reduced, whether in terms of considered peptides or considered PTMs. For interested readers, several in-depth reviews of OMS algorithms are available.5,9

We report the development of SpecOMS and the results obtained based on a standard data set downloaded from PRIDE. We demonstrate that the innovative algorithms developed in SpecOMS overcome the main drawbacks that usually accompany OMS searches, namely, the excessive computation time and the increase in the FDR. Initial insights into the interesting results that SpecOMS can achieve are also provided.

Figure 1. Overview of the SpecOMS workflow. Rectangles represent data and rounded boxes represent processes.



METHODS

The SpecOMS workflow is presented in Figure 1. After a brief overview of SpecTrees and SpecXtract, we report in detail the development of SpecFit. Detailed information on the algorithmic aspects of SpecTrees and SpecXtract is available elsewhere for the interested reader.7

SpecTrees Data Structure

The SpecTrees data structure efficiently stores all of the information required to compute the number of shared masses for any pair of spectra. The data structure is composed of one or several tree structures whose edges are oriented from child to parent. In each tree, each node represents a single spectrum and contains the spectrum identifier sid and a counter csid. The latter is used to (partially or totally) count the number of shared masses between the spectrum sid and all other spectra located higher in the tree. Notably, the same sid can be associated with several nodes in SpecTrees.

Processing Experimental Spectra. To discriminate ion fragments from noise, we selected only the k most intense masses in each spectrum, with an exclusion window related to the accuracy, namely, [−2·accuracy, +2·accuracy]. We therefore ensure that two distinct experimental masses cannot match the same theoretical mass. If multiple peaks are present in the exclusion window, then the most intense peak is selected. Next, we introduced two parameters. The first, pr, is the position of the rightmost significant digit of the accuracy (the fractional separator being at position 0); it is used to transform the floating point numbers that represent masses into integers, for efficiency purposes. The second parameter, x, adjusts the exact tolerance of the measurements: more or fewer masses are inserted into the SpecTrees data structure to reflect this tolerance. For instance, a tolerance of 0.03 Da is separated into (pr = 2, x = 3), and seven masses (the exact mass ±3) are inserted into the SpecTrees data structure.

Building SpecTrees. The construction of the SpecTrees data structure requires a preprocessing step called Bucket Clustering that creates a collection of sets referred to as buckets. Each bucket is associated with a mass m and collects the identifiers of all spectra (both experimental and theoretical) that exhibit m. Once created, each bucket is given a unique identifier (B0, B1, etc.) and is sorted by spectrum identifiers (in increasing order). Finally, the collection of buckets itself is also sorted by lexicographical order.
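The preprocessing and Bucket Clustering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Java implementation: the function names (`select_peaks`, `to_int`, `bucket_clustering`) are hypothetical, peaks are assumed to be (mass, intensity) pairs, the exclusion window is applied greedily from the most intense peak downward, and the 2x + 1 neighbouring integer masses are assumed to be inserted for experimental spectra only.

```python
from collections import defaultdict


def select_peaks(peaks, k=60, accuracy=0.02):
    """Keep at most k masses, most intense first, skipping any peak that lies
    within 2 * accuracy of an already kept (more intense) peak."""
    kept = []
    for mass, _intensity in sorted(peaks, key=lambda p: p[1], reverse=True):
        if all(abs(mass - m) > 2 * accuracy for m in kept):
            kept.append(mass)
            if len(kept) == k:
                break
    return sorted(kept)


def to_int(mass, pr=2):
    """Turn a floating point mass into an integer with pr decimal digits."""
    return round(mass * 10 ** pr)


def bucket_clustering(theoretical, experimental, pr=2, x=3):
    """Return the sorted collection of buckets (sorted lists of spectrum ids).

    theoretical / experimental: dict mapping spectrum id -> list of masses.
    """
    buckets = defaultdict(set)
    for sid, masses in theoretical.items():
        for m in masses:
            buckets[to_int(m, pr)].add(sid)
    for sid, masses in experimental.items():
        for m in masses:
            center = to_int(m, pr)
            # Assumption: the tolerance is modelled on the experimental side
            # by inserting the integer mass and its neighbours within +/- x.
            for key in range(center - x, center + x + 1):
                buckets[key].add(sid)
    # Each bucket is sorted by spectrum id; the collection is then sorted
    # lexicographically.
    return sorted(sorted(ids) for ids in buckets.values())


# Toy example: two theoretical spectra (ids 0 and 1), one experimental (id 2).
theoretical = {0: [100.02], 1: [200.00]}
experimental = {2: select_peaks([(100.00, 10.0), (100.03, 50.0), (200.05, 20.0)], k=2)}
buckets = bucket_clustering(theoretical, experimental)
print(buckets[:2])
```

In this toy example the experimental peak at 100.03 falls within the 0.03 Da tolerance of the theoretical mass 100.02, so spectra 0 and 2 end up in a common bucket, whereas 200.05 is too far from 200.00 to share a bucket with spectrum 1.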


Figure 2. Example of construction of SpecTrees and extraction of the number of shared masses between spectrum s4 and all of the other spectra. Tree structures are represented with regular black arrows, and the additional array of lists SpecIdent[] is represented with red dashed arrows. Bold green dashed arrows represent the path followed for the extraction of the number of shared masses and the green numbers in parentheses represent the different steps in the computation.

The collection of buckets is sequentially processed. Using the first bucket, SpecTrees creates the first tree, which contains one node per spectrum identifier. The counter associated with each node is set to 1, indicating that one mass is shared between these spectra (namely, the mass associated with that bucket). The processing of the next buckets repeats the same sequence. As long as the spectrum identifiers are similar to the identifiers from the previous bucket, the counter associated with each node is incremented by 1. When one spectrum identifier differs from the previous bucket, a new node is created and its counter is set to 1. This new node is attached to the current tree as a new branch. When the first spectrum identifier of a given bucket differs from that of the previous bucket, a new tree is created by a process similar to the one described for the first bucket. For efficiency purposes, an additional array of lists, called SpecIdent[], maintains the list of nodes associated with each spectrum identifier. An illustration of the structure obtained through the SpecTrees construction process described above is provided in Figure 2.
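The construction just described behaves like a prefix tree over the sorted buckets. The sketch below illustrates that idea under stated assumptions: the names `Node`, `build_spectrees`, and `shared_masses` are hypothetical, a dictionary of children is used to find shared prefixes (the original may proceed differently), and the extraction routine simply walks the child-to-parent edges from every node of a spectrum. It is not the authors' implementation.

```python
class Node:
    """One tree node: a spectrum id, a counter, and a link to its parent."""

    def __init__(self, sid, parent):
        self.sid = sid
        self.parent = parent      # edges are oriented from child to parent
        self.count = 0            # number of buckets passing through this node
        self.children = {}


def build_spectrees(buckets):
    """buckets: list of sorted spectrum-id lists (one list per mass)."""
    roots = {}                    # first id of a bucket -> root of a tree
    spec_ident = {}               # spectrum id -> list of its nodes
    for ids in sorted(buckets):
        level, parent = roots, None
        for sid in ids:
            node = level.get(sid)
            if node is None:      # ids diverge from earlier buckets: new branch
                node = Node(sid, parent)
                level[sid] = node
                spec_ident.setdefault(sid, []).append(node)
            node.count += 1
            level, parent = node.children, node
    return roots, spec_ident


def shared_masses(spec_ident, sid):
    """Masses shared between sid and every spectrum located above it."""
    shared = {}
    for node in spec_ident.get(sid, []):
        ancestor = node.parent
        while ancestor is not None:
            shared[ancestor.sid] = shared.get(ancestor.sid, 0) + node.count
            ancestor = ancestor.parent
    return shared


# Five buckets over three spectra; spectrum 2 plays the experimental spectrum.
buckets = [[0, 2], [0, 1, 2], [1, 2], [0, 2], [2]]
roots, spec_ident = build_spectrees(buckets)
print(shared_masses(spec_ident, 2))
```

For the toy buckets above, spectrum 2 shares three masses with spectrum 0 and two with spectrum 1, which matches a direct count over the buckets. Giving experimental spectra the largest identifiers places them deepest in the trees, so only they need to be used as starting points.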

SpecXtract Algorithm

For each spectrum si referenced in SpecIdent[i], SpecXtract computes the number of masses shared with all spectra sj such that 0 ≤ j ≤ i − 1. For each (si,sj), SpecXtract searches all of the branches of SpecTrees that contain both si and sj and computes the sum of the counters from the nodes whose sid corresponds to si. By repeating this computation for all of the experimental spectra, SpecXtract is able to extract the number of shared masses between any pair of spectra. Note that a substantial computation time is saved when SpecTrees is built in such a way that experimental spectra are located at the deepest levels in the trees (i.e., when they appear below the theoretical spectra).

Pairs of spectra are memorized together with their number of shared masses and their mass deltas for further analysis by our SpecFit algorithm. In this article, when given a pair (se, st) of spectra, where se (respectively st) is an experimental (respectively theoretical) spectrum, the mass delta m(se) − m(st) will be denoted by Δm(se,st) (or Δm if clear from the context).

SpecFit Search

The underlying principle of SpecFit is that a missed cleavage, a semitryptic peptide, a PTM, or a variant could explain the mass delta between an experimental and a theoretical spectrum. For instance, a theoretical spectrum representing a peptide with one modification typically has only 50% of its masses shared with the theoretical spectrum representing the unmodified peptide. The other 50% of the masses are shifted in the spectrum by the value of the modification (for a detailed explanation on how masses are moved when a modification is introduced in a spectrum, see, e.g., ref 5). Consequently, incorporating Δm at the right location in st may significantly increase the number of shared masses between se and st.

For each experimental spectrum se, SpecFit considers each pair (se, st) output by SpecXtract together with its score (i.e., the number of shared masses) and its mass difference Δm, as long as the score exceeds a minimal threshold value. SpecFit outputs only one pair among these, which corresponds to the best peptide inferred from se. SpecFit works as follows: If at least one pair (se,st) satisfying Δm = 0 exists, then SpecFit returns the PSM with Δm = 0 of highest score. Otherwise, if all pairs have a nonzero Δm, then the four cases below are considered in the following order:

(1) Search among the pairs with Δm > 0 for the presence of a missed cleavage peptide. For this step, let pi be the peptide associated with st. For each protein P such that pi belongs to P, SpecFit compares Δm to m(pi−1) and m(pi+1) (where pi−1 and pi+1 are the tryptic peptides that, respectively, precede and follow pi in P). If one of these masses equals Δm, then the sequence of pi−1 (respectively pi+1) is concatenated at the beginning (respectively, at the end) of the sequence of pi and a new score is computed. If the score is improved, then it is updated for this pair. When all pairs with Δm > 0 have been considered, if at least one score has been improved in the process, then SpecFit returns the PSM with the highest improved score. If not, step 2 below is executed.

(2) Search among the pairs with Δm < 0 for the presence of a semitryptic peptide. For this step, amino acids are removed one by one starting from the left side of the peptide p represented by st until the mass of the shortened peptide reaches the mass of se. If the two masses are equal, then a new theoretical spectrum is built from the reduced peptide sequence and compared with se; the score is updated if it is improved. This process is then repeated starting from the right side of the peptide p. When all pairs with Δm < 0 have been considered, if at least one score has been improved in the process, then SpecFit returns the PSM with the highest improved score. If not, step 3 is executed.

(3) For each pair (se,st), a shift alignment is performed in an attempt to locate the mass delta of the experimental spectrum. This step consists of progressively modifying the location of Δm in st, each time leading to a slightly different theoretical spectrum that is compared anew with se. The score is updated if it is improved. When all pairs have been considered, if at least one score has been improved in the process, then SpecFit returns the PSM with the highest improved score. If not, step 4 is executed.
(4) If SpecFit has not returned a result for se by this stage (i.e., no improvement in the score has been obtained by steps 1−3 above), then SpecFit returns the PSM with the highest score among the original PSMs for se.
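The shift alignment of step 3 can be illustrated with a small sketch. It is written under several assumptions and is not the SpecFit code: the function names (`by_ions`, `shared`, `shift_alignment`) are hypothetical, spectra are reduced to singly charged monoisotopic b and y ions, the experimental spectrum is simulated from the peptide SAMPLEK carrying a +15.9949 Da shift on its methionine, and the score is a plain count of theoretical masses matched within 0.02 Da. The residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses (Da) for the amino acids used in the example.
RESIDUE = {"S": 87.03203, "A": 71.03711, "M": 131.04049, "P": 97.05276,
           "L": 113.08406, "E": 129.04259, "K": 128.09496}
WATER, PROTON = 18.01056, 1.00728


def by_ions(peptide, shift_pos=None, delta=0.0):
    """Singly charged b and y ion masses, with an optional mass shift `delta`
    placed on the residue at index `shift_pos`."""
    masses = [RESIDUE[aa] for aa in peptide]
    if shift_pos is not None:
        masses[shift_pos] += delta
    total, prefix, ions = sum(masses), 0.0, []
    for m in masses[:-1]:
        prefix += m
        ions.append(prefix + PROTON)                  # b ion
        ions.append(total - prefix + WATER + PROTON)  # complementary y ion
    return ions


def shared(theoretical, experimental, tol=0.02):
    """Number of theoretical masses matched by an experimental mass."""
    return sum(any(abs(t - e) <= tol for e in experimental) for t in theoretical)


def shift_alignment(peptide, experimental, delta, tol=0.02):
    """Try delta at every position; return (best position, best score).
    The position is None when no placement beats the unshifted spectrum."""
    best_pos, best = None, shared(by_ions(peptide), experimental, tol)
    for pos in range(len(peptide)):
        score = shared(by_ions(peptide, pos, delta), experimental, tol)
        if score > best:
            best_pos, best = pos, score
    return best_pos, best


delta_m = 15.9949                                   # simulated mass delta
experimental = by_ions("SAMPLEK", shift_pos=2, delta=delta_m)
print(shared(by_ions("SAMPLEK"), experimental))     # unmodified peptide
print(shift_alignment("SAMPLEK", experimental, delta_m))
```

In this simulation the unmodified peptide shares 6 of its 12 fragment masses with the modified spectrum, in line with the 50% figure quoted above, and placing Δm on the third residue recovers all 12.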

Evaluation of Incorrect Identifications

A significant proportion of the PSMs produced by SpecOMS might be incorrect depending on their score and the mass accuracy of the experimental measurements. Similarly to conventional searches, we estimate the reliability of the identifications using a target-decoy approach.10 The principle of this approach is to evaluate the proportion of incorrect identifications by searching the experimental mass spectra against a decoy database. This decoy database is constructed by reversing all amino acids of each peptide sequence from the target database, except for the last K or R. A previous study stated that no fundamental principle of the target-decoy method was violated by OMS.11 Indeed, none of the documented pitfalls that invalidate the target-decoy approach are present in our study.12 For instance, we do not incorporate protein inference knowledge and PSMs are searched with the same mass accuracy as the measurements. Moreover, we do not extrapolate the FDR of a subset of PSMs from the FDR of the entire set of PSMs.

SpecOMS Implementation

The SpecOMS components are implemented in Java 1.8. The input data files are composed of the experimental spectra and the protein and contaminant databases. Two configuration files allow set up of (a) the input and output files and (b) the SpecOMS parameters. The data generated after executing the software are stored in a formatted text file for convenience. All of the tests reported in this study were executed on a workstation equipped with an Intel i7 (2.9 GHz) and 12 GB of memory dedicated to the Java Virtual Machine, running under Windows 7.

Data Sets

Collection of Experimental Spectra. We downloaded multiple sets of experimental spectra from the PRIDE database deposit archived under the PXD001468 identifier (from b1906_293T_proteinID_01A_QE3_122212.raw to b1925_293T_proteinID_05A_QE3_122212.raw). These spectra were obtained from the HEK293 human cell line, analyzed by the LC−MS/MS technique on a Q-Exactive Orbitrap spectrometer (Thermo Fisher Scientific). Detailed information concerning sample collection and preparation was previously published.11 We converted the files from .raw to .mgf using RawConverter version 1.1.0.19 (64 bits).13 All data sets were limited to spectra containing more than 20 mass values and with 2+ or 3+ charges. Only the k = 60 most intense masses were selected in spectra, except when we tested the influence of parameter k. The data set corresponding to the b1906_293T_proteinID_01A_QE3_122212.raw file is further referred to in this manuscript as the HEK293 data set; it contains 37 685 spectra and was already analyzed with an open modification search strategy.11 Other spectra data sets were only used in runtime measurement experiments to increase the number of experimental spectra to process.

Collection of Theoretical Spectra. We downloaded the human protein database GRCh37 from the Ensembl genome assembly and added most of the common contaminants obtained from the common Repository of Adventitious Proteins (cRAP). A collection of peptides was created from this database by replicating the action of the trypsin enzyme (with systematic cleavage after arginine and lysine). This set of peptides was filtered to remove peptides that were too short (fewer than 7 amino acids) or too long (more than 25 amino acids) because these peptides usually hinder the identification process. Redundant peptide sequences were deleted, and the remaining peptides generated 510 685 unique theoretical spectra. Each theoretical spectrum contained the (computed) set of masses corresponding to the most frequent fragments generated by an Orbitrap spectrometer, that is, singly charged monoisotopic b-ion and y-ion fragments. A fixed modification of 57.021464 Da that corresponds to a carbamidomethylation was added to the monoisotopic mass of the cysteine residue.

Software Settings. Several performance tests were conducted using two OMS methods, MODa14 (release v1.51) and PIPI15 (release 1.2.9), and a more traditional search engine, X!Tandem16 (version Sledgehammer). To be fair to MODa and SpecOMS, which have no multicore implementation, PIPI was executed on a single thread as well. Each software was configured with the following parameters (default values are used when unspecified below):

(1) MODa: PPMTolerance = 10, Fragment ion tolerance = 0.02 Da, AutoPMCorrection = 0, BlindMode = 1, enzyme constraint min number termini = 1, missed cleavage = 1, Highresolution = ON. During the performance test, the minimum/maximum tolerance = −0.02 Da/1000 Da. This tolerance was enlarged to −1000 Da/+2000 Da for the performance evaluation of modified peptide identifications. Peptide identifications at 1% FDR were obtained using anal_moda.jar.

(2) PIPI: ms1 tolerance = 10 ppm, ms2 tolerance = 0.02 Da, mz bin offset = 0, PTM db = unimod.txt, min ptm mass = −0.02 Da, max ptm mass = 1000 Da, missed cleavage = 1.

(3) X!Tandem: fragment monoisotopic mass error = 0.02 Da, parent monoisotopic mass error = ±5 ppm, quick acetyl = no, quick pyrolidone = no, parent monoisotopic mass isotope error = no.

RESULTS

Given a set Se of experimental spectra and a set St of theoretical spectra, SpecXtract produces all pairs of spectra (se,st) (with se ∈ Se and st ∈ St) that share a minimum number of fragment masses, along with their score (i.e., number of shared masses). When the mass delta differs from zero, SpecFit searches for missed cleavages or semitryptic peptides or enhances the PSM score through the introduction of a PTM or a variant in the peptide sequence.

SpecOMS Is a Remarkably Fast Workflow with Low Memory Requirements

Execution time and memory consumption of SpecOMS were measured using data sets of varying sizes. The runtimes measured for the different experiments are split according to steps of the SpecOMS workflow (Table 1). We first used the HEK293 data set, composed of 37 685 experimental spectra, tested against the target database of 510 685 theoretical spectra (experiment 1). SpecOMS executes the full search in roughly 5 min and requires 2.1 GB of memory (i.e., clearly below the capacity of low-priced workstations). This computation time is perfectly compatible with the throughput of current proteomics experiments. Comparatively, MODa processed the data set in 6.5 h using 10 GB of memory, while PIPI, executed in a single-threaded configuration, required 30 h and fully used the 25 GB available memory. Second, the HEK293 data set was analyzed with a search space enlarged to the union of the target and the decoy databases (experiment 2), representing twice as many theoretical spectra as in experiment 1. The runtime of SpecOMS remained under 15 min and the memory footprint under 4 GB. Finally, we constructed larger experimental spectra data sets (by merging spectra from several experimental data sets as explained in the Methods section) and performed the search against the target database (experiments 3 to 6). The runtime of SpecOMS increases with the size of the experimental spectra set but always remained below 45 min, even though the number of spectra exceeded 200 000 in experiment 6. The runtime of SpecOMS grows faster than linearly with the number of experimental spectra (e.g., between experiments 1 and 6, the number of experimental spectra increases roughly 5.5-fold, while the SpecXtract + SpecFit computation time increases roughly 8.6-fold). This is, however, not a significant drawback: because the SpecTrees construction time is negligible compared with the extraction step performed by SpecXtract, large experimental spectra data sets can simply be split into separate smaller data sets.

The maximum allowed mass deviation in the experimental measurements must be taken into account to decide whether a mass is shared between two spectra. As new mass spectrometers with very high resolution at both the MS and MS/MS levels become more common in proteomics laboratories, algorithms must benefit from these precise mass measurements. This accuracy is particularly beneficial for the identification of PTMs and variants.17 As shown in Figure 3, the fragment ion mass accuracy has a limited impact on the SpecOMS computation time.

Figure 3. Execution time for the full SpecOMS workflow depending on the fragment ion mass accuracy (HEK293 data set).

Table 1. Execution Time and Memory Consumption of SpecOMS Depending on the Size of the Experimental and Theoretical Spectra Datasets (accuracy = 0.02 Da)

experiment   theoretical spectra   experimental spectra   SpecTrees time (s)   SpecXtract + SpecFit time (min)   memory (GB)
1            510 685               37 691                 12.88                5.07                              2.1
2            1 021 371             37 691                 20.40                14.13                             3.9
3            510 685               80 816                 18.89                14.59                             2.4
4            510 685               119 294                29.49                20.02                             2.7
5            510 685               163 113                32.08                31.03                             3.1
6            510 685               206 750                37.84                43.65                             3.5
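The scaling between experiments 1 and 6 can be checked directly from the Table 1 values. The short calculation below is only a back-of-the-envelope sketch: the variable names are illustrative, and the split estimate assumes that each chunk of experiment-1 size would take the experiment-1 runtime, which the source does not measure.

```python
# Values copied from Table 1 (experiments 1 and 6).
spectra_1, spectra_6 = 37_691, 206_750   # experimental spectra
minutes_1, minutes_6 = 5.07, 43.65       # SpecXtract + SpecFit time (min)

spectra_ratio = spectra_6 / spectra_1    # growth in the number of spectra
time_ratio = minutes_6 / minutes_1       # growth in the runtime

# Assumption: a chunk of experiment-1 size always costs the experiment-1 time,
# and the (short) SpecTrees construction repeated for each chunk is ignored.
split_estimate = spectra_ratio * minutes_1

print(f"spectra x{spectra_ratio:.2f}, runtime x{time_ratio:.2f}")
print(f"estimated runtime if split into experiment-1-sized chunks: {split_estimate:.1f} min")
```

Under that assumption, processing the experiment-6 spectra in chunks would take roughly 28 min instead of the measured 43.65 min, which is the rationale for splitting large data sets.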

SpecOMS Requires a High Fragment Ion Mass Accuracy

We designed a series of experiments using the HEK293 data set to evaluate the influence of both the PSM score threshold and the fragment ion mass accuracy on the number of PSMs obtained from the target and decoy databases before the execution of SpecFit. The results are displayed in the form of two series of curves in Figure 4. Unsurprisingly the number of PSMs can rise sharply regardless of the fragment ion mass accuracy if the threshold score is too low due to a high number of random PSMs. When the score is >10 (as shown in Figure 4a), the number of PSMs increases until the fragment ion mass accuracy parameter reaches the mass accuracy of the mass spectrometer (0.02 Da). Thereafter, the curves remain almost stable. In contrast, Figure 4b shows that with the exception of a score higher than 16, the slope of the curve steepens when the fragment ion mass accuracy parameter exceeds the mass accuracy of the mass spectrometer, leading to many additional random PSMs. We can conclude that with the exception of PSMs with scores >16 (which represent a small subset of PSMs), the discrimination between correct and random PSMs requires a high fragment ion mass accuracy. Indeed, in contrast with more conventional approaches, SpecOMS does not filter theoretical spectra based on a precursor mass criterion before identification. Although the precursor high mass accuracy given by the most recent mass spectrometers is not used to reduce 3034

DOI: 10.1021/acs.jproteome.7b00308 J. Proteome Res. 2017, 16, 3030−3038

Article

Journal of Proteome Research

SpecOMS Identification Performance According to the Experimental Spectra Preprocessing

Informative masses are often the most abundant masses in MS/ MS spectra, and thus their relative intensity can be used to separate them from noise. Because SpecOMS selects only the k most intense masses in each experimental spectrum, we measured the effect of k on the results (See Table 2 for k = 60 and Table S-1 for k = 30, 40, 50, 70, 80, 90 and 100). First, when Δm = 0, a score of 9 is sufficient to distinguish between correct and random PSMs with an FDR varying from 0.2 (k = 30) to 2.1% (k = 100). Unsurprisingly, missed cleavage peptides are also identified with high confidence at a score of 9 regardless of k. For these two PSM subsets, the identification relies on an additional piece of information: the equality between the mass of the experimental spectrum and the mass of the peptide. Importantly, unlike traditional approaches, this information is not used a priori to generate the PSMs but is an a posteriori complementary element. Semitryptic peptides are often considered as the cause or a major contributing factor to the majority of negative mass deltas suggested by OMS algorithms.11 In our study, we must admit that we have identified a small number of semitryptic peptides. Surprisingly, while an additional parent ion masse constraint is added, the FDR associated with semitryptic PSMs increases rapidly from 2.5% (k = 30) to 14% (k = 100). This degradation of specificity is more dramatic when Δm < 0; even when we consider only the PSMs having at least 14 shared masses, the FDR varies from 8% (k = 30) to 75% (k = 100). These high FDRs are disconcerting and, to the best of our knowledge, have never been discussed before. The introduction of a negative mass delta into theoretical spectra constricts more fragment ion masses into a smaller range of masses, which is a potential source of noise. Thus peptide identifications with negative mass delta seem to concentrate a large part of the interpretation errors of a full OMS analysis. 
This point would require an in-depth study in the future. Next, when the precursor mass information cannot be used to infer the PSMs (i.e., when Δm > 0 but PSMs do not correspond to missed cleavage peptides), a higher score is needed to obtain a reasonable FDR. A score of 12 generates an FDR varying from 0.8% (k = 30) to 7.4% (k = 100). When k = 60, an FDR of 2.6% in the HEK293 data set leads to the identification of an additional substantial set of 2420 spectra. Those additional PSMs can be divided into the following two categories: (i) several hundreds PSMs for which SpecFit improves the score and suggests a position for the shift into the

Figure 4. Number of PSMs in the target database (a) and in a decoy database (b) for several scores depending on the mass accuracy (HEK293 data set).

the search space, the high fragment ion mass accuracy is useful when discriminating between correct and random PSMs in the huge search space produced by all peptides derived from the protein database. For example, for the HEK293 data set, an accuracy of 0.02 Da, which is in accordance with the tolerance given by the mass spectrometer, is a good trade-off between sensitivity and specificity.

Table 2. Number of PSMs Obtained by SpecFit for the HEK293 Dataset with a Fragment Mass Accuracy of 0.02 Da and at Most 60 Selected Peaks in the Experimental Spectruma score Δm = 0 Δm ≠ 0

semitryptic missed cleavage PTM/variant (Δm > 0) PTM/variant (Δm < 0)

a

8

9

10

11

12

13

14

15

16+

8 ≥ (total)

1232 148 0 0 0 0 2100 4096 1165 1779

1233 36 23 8 19 1 1288 1174 920 1202

1193 15 39 4 70 1 907 411 612 826

1161 0 45 1 88 0 798 114 463 496

1093 0 37 0 103 0 628 41 251 255

914 0 30 0 111 0 540 13 174 145

781 0 24 0 139 0 424 6 108 43

548 0 20 0 98 0 315 1 74 26

1211 0 28 0 280 0 513 2 125 12

9366 199 246 13 908 2 7513 5858 3892 4784

PSMs are grouped by categories depending on Δm. PSMs in the target database are in bold, and PSMs in the decoy database are in italics. 3035

DOI: 10.1021/acs.jproteome.7b00308 J. Proteome Res. 2017, 16, 3030−3038

Article

Journal of Proteome Research

Figure 5. SpecOMS identification (scan 12186) of a peptide exhibiting an unanticipated missed cleavage site on the first amino acid (K). Masses marked with a star on the peptide ladder were gained using SpecFit.

Figure 6. SpecOMS identification (scan 34487) of a glycosylated peptide (N-acetylglucosamine). Such an identification with a large neutral loss (203.0805 Da) remains unaccessible to traditional methods. The presence of MS/MS marker ions (204.0872, 186.0766, and 168.066 Da) reinforces this identification.

will result in the loss of most of the shared masses. For its part, SpecOMS was able to identify PSMs with high scores (50 with a score above 12), while these PSMs were not identified by MODa. In the full list of identifications provided by MODa, we were also surprised by the total absence of PSMs with a negative mass delta below −100 Da and the very small number of mass deltas over 400 Da, although we configured the peptide mass tolerance between −1000 and +2000 Da.
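The effect of modification placement on the number of shared masses can be checked with a small simulation. The sketch below builds singly charged b and y ions for an arbitrary peptide (PEPTIDEK, chosen only for illustration), applies a 57.02146 Da shift at arbitrary positions regardless of residue type, and counts the masses still shared with the unmodified theoretical spectrum within 0.02 Da. The function names and the simple two-pointer comparison are assumptions; SpecOMS itself relies on the SpecTrees data structure rather than on pairwise loops like this one.

```python
# Monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER = 1.00728, 18.01056


def fragment_masses(peptide, shifts=None):
    """Sorted singly charged b and y ion masses; shifts maps position -> mass shift."""
    shifts = shifts or {}
    residues = [RESIDUE_MASS[aa] + shifts.get(i, 0.0) for i, aa in enumerate(peptide)]
    masses = []
    prefix = 0.0
    for mass in residues[:-1]:            # b1 .. b(n-1)
        prefix += mass
        masses.append(prefix + PROTON)
    suffix = 0.0
    for mass in reversed(residues[1:]):   # y1 .. y(n-1)
        suffix += mass
        masses.append(suffix + WATER + PROTON)
    return sorted(masses)


def shared_masses(theoretical, experimental, tol=0.02):
    """Count one-to-one matches between two sorted mass lists within tol."""
    i = j = shared = 0
    while i < len(theoretical) and j < len(experimental):
        diff = theoretical[i] - experimental[j]
        if abs(diff) <= tol:
            shared += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1
        else:
            j += 1
    return shared


PEPTIDE = "PEPTIDEK"
SHIFT = 57.02146
unmodified = fragment_masses(PEPTIDE)
single = shared_masses(unmodified, fragment_masses(PEPTIDE, {3: SHIFT}))
close = shared_masses(unmodified, fragment_masses(PEPTIDE, {3: SHIFT, 4: SHIFT}))
far = shared_masses(unmodified, fragment_masses(PEPTIDE, {0: SHIFT, 7: SHIFT}))
print(single, close, far)  # 7 6 0
```

For a peptide of n residues, a single modification leaves n − 1 of the 2(n − 1) fragment masses unshifted, and two modifications at positions p < q leave n − 1 − (q − p). Two adjacent modifications therefore cost only one more shared mass than a single one, whereas modifications on the first and last residues shift every b and y ion.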

theoretical spectra and (ii) 1000 PSMs for which SpecFit is unable to increase the score. In summary, SpecOMS identifies 11 708 spectra with Δm > 0 (∼31% of the 2+ and 3+ experimental spectra) with a global FDR of 1.1% when k = 60. This value therefore appears to be a good trade-off between sensitivity and specificity.

Impact of a Modification on Peptide Detection

Simulation experiments were conducted to evaluate the performance of SpecOMS in the identification of peptide modifications. We analyzed the HEK293 data set, deliberately omitting the fixed carbamidomethylation on cysteines, with both SpecOMS and MODa in similar configurations. An execution of X!Tandem was also conducted with modifications limited to the fixed carbamidomethylation of cysteines to provide a reference point. At 1% FDR, X!Tandem identified 1691 peptides of length ranging from 7 to 25 amino acids modified only by carbamidomethylation(s) on cysteines. Within 5 min, SpecOMS provided the same peptide identifications for 879 peptides (52%) with a mass delta of 57.02 Da (1 modified cysteine), 114.04 Da (2 modified cysteines), or 171.06 Da (3 modified cysteines); 580 spectra were not identified, while the remainder were given an alternate identification. The execution of MODa took 6.5 h to identically identify 1170 of the spectra (69%); 499 spectra were not identified, and only 25 additional spectra were given an alternate identification. We examined in detail the difference between MODa and SpecOMS. A large part of the small peptides (fewer than 10 amino acids) were rarely recognized by SpecOMS because only PSMs with more than eight shared masses were considered in this experiment. It should also be noted that several modifications on the same peptide may drastically decrease the number of shared masses, depending on how they are distributed: two close modifications behave nearly as a single one, but two modifications at both extremities of the peptide

SpecFit Interpretation of the Positive Mass Delta

Although the high mass accuracy of the precursor has no influence on the PSMs found by SpecOMS, this high accuracy provides crucial information when a mass delta must be explained. Because the localization of a modification site is a difficult task requiring unambiguously assignable ions, the position given by SpecFit must be considered approximate. Several postsearch engine tools are, however, available to measure the reliability of the modification site locations (AScore,18 LuciPHOr2,19 SLoMo20). A more exhaustive comparison of software able to locate PTMs is available elsewhere.21 As expected, the most frequent mass deltas correspond to known predominant PTMs, including oxidation, dioxidation, carbamylation, deamidation, formylation, and natural isotopic peptides (Figure S-1, Supporting Information). Interestingly, SpecOMS also exhibits more original modifications, and the rate of increase in the score given by SpecFit provides valuable information because it indicates the presence of an ion series. We illustrate such identifications associated with a large rate of increase with an unanticipated cleavage site (Figure 5), a variant (Figure S-2, Supporting Information), and a rarer PTM (Figure S-3, Supporting Information). We also note that SpecOMS suggests more than 1000 identifications with a mass delta >500 Da (see the results file, Supporting Information), a limit that current OMSs do not


PSMs and SpecOMS implements a simple score function. Nevertheless, we demonstrated in our experiments that the number of identifications that could be obtained was sufficient based on the high mass accuracy provided by the new generation of spectrometers. Moreover, using its score, SpecOMS introduces an easy stratification of the search space.26 Our algorithm can be adapted to allow independent score thresholds for modified and unmodified peptides, so that more fragmentation evidence is required for spectra displaying modifications. With this approach, all possible PTMs are in competition during the interpretation of the experimental spectra, and the high bias toward some PTMs due to the testing of too many peptides can be avoided.12 For instance, a recent and systematic study of protein methylation showed that the FDR associated with a subset of methylated peptides can deteriorate to as much as 80%, even though the global FDR was ∼1%.27 In SpecOMS, the separate threshold for modified peptides contributes to a low FDR, even for this subset of peptides. Consequently, SpecOMS is a well-suited algorithm for mining unusual PTMs in spectral databases to extract rare PTM candidates with sufficient information from their fragmentation patterns, which is useful for preparing follow-up biological experiments.

SpecOMS can identify a large number of spectra with strong competition between interpretations. However, many routes to improve the number of PSMs above a threshold score for a given FDR still exist. In addition to choosing the best PSMs, the score function of a search engine must rank the PSMs from the most to the least reliable, thereby defining the set of PSMs used to infer the proteins. Most of the usual search engines find the same PSMs despite different score functions when the search space is strictly equal. In contrast, the score function implemented in each search engine can have a great impact on the PSM ranking.28 SpecOMS could potentially implement a more sophisticated score function to rank the PSMs and increase the number of accepted identifications at a given FDR. Nevertheless, the current simplicity of the score function facilitates the interpretation of the results. The introduction of complexity into the score function may be delayed because the time and effort required for result interpretation can also slow down the use of OMS approaches. Moreover, in addition to the mass shifts, SpecFit could make use of PTM knowledge to find the correct identifications and systematically search for the presence of signature ions or neutral loss behaviors to secure confidence in the interpretations.

In conclusion, we have presented SpecOMS as an open modification search approach that is able to interpret a complete proteomics experiment in a short amount of time and that bypasses the problems associated with the FDR inherent to a huge peptide space. This technique has already produced very interesting results, and a considerable number of paths remain open to further improve this algorithm in the future, as presented above. The current Java implementation of SpecOMS is available at https://github.com/matthieu-david/SpecOMS.

overcome because a large mass range typically induces excessive computational time. More importantly, SpecFit has the opportunity to increase the PSM scores during the execution of SpecXtract. This strategy can recover many identifications because modified peptides can bypass other candidates when their scores are re-evaluated during the execution of SpecFit. However, SpecFit was not always able to improve the score even when the PSMs were associated with a high score. We have identified at least two explanations for this stability. First, although a high mass accuracy should facilitate a correct charge state evaluation, some spectra obviously retain a charge-state error (Figure S-4, Supporting Information). Despite the very large mass delta that results, this error can be detected because SpecOMS places no upper limit on the mass delta. Second, a labile PTM may exist, although the fragmentation pattern does not contain this modification in the b and y ions. Figure 6 illustrates this case with a glycosylated peptide carrying an N-acetylglucosamine (GlcNAc), a PTM confirmed by the presence of signature MS/MS ions.
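The signature ions mentioned for the GlcNAc case can be looked for directly in the peak list. The sketch below checks for the three singly charged HexNAc oxonium ions (the intact ion and its losses of one and two water molecules) within the 0.02 Da fragment tolerance used in this study. The function name and the example peak list are hypothetical: only the three marker peaks come from the Figure 6 caption, and the other peaks are invented to make the example complete. This is not a feature of SpecFit.

```python
# Theoretical m/z of singly charged HexNAc oxonium ions:
# intact ion, loss of one water, loss of two waters.
HEXNAC_MARKERS = (204.0867, 186.0761, 168.0655)


def find_marker_ions(peaks, markers=HEXNAC_MARKERS, tol=0.02):
    """Map each marker m/z to the closest observed peak within tol, or None."""
    found = {}
    for marker in markers:
        closest = min(peaks, key=lambda mz: abs(mz - marker), default=None)
        if closest is not None and abs(closest - marker) <= tol:
            found[marker] = closest
        else:
            found[marker] = None
    return found


# Marker peaks from the Figure 6 caption plus invented filler peaks.
peaks = [129.1022, 168.066, 186.0766, 204.0872, 301.1870, 402.2347]
result = find_marker_ions(peaks)
for marker, observed in result.items():
    print(marker, observed)
print("all markers present:", all(v is not None for v in result.values()))
```

A check like this complements the shared-mass score: a labile modification leaves the b and y ions unshifted, so the marker ions may be the only fragment-level evidence that the precursor mass delta corresponds to a real PTM.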



DISCUSSION AND CONCLUSIONS

Screening large discovery experiments routinely with a full OMS strategy (i.e., without any mass filter of any kind) was, until recently, often considered impractical in terms of both computational time and sensitivity.5,9 SpecOMS took up this challenge because the increase in accuracy given by the new generation of mass spectrometers should facilitate the interpretation of MS/MS spectra.11 Indeed, the results from our experiments demonstrate that SpecOMS is a particularly rapid open modification search algorithm. Moreover, despite specific attention directed toward its spatial and temporal performance, there is still room for improvement, in particular with a parallel implementation. In the future, we may be able to analyze very large experiments within a few minutes on a standard workstation. We also note the low memory usage resulting from the particularly condensed information in SpecTrees, so that SpecOMS can search very large protein databases, such as those required in proteogenomics studies, on a workstation.

Because SpecOMS can process a very large number of spectra comparisons in a short time, each experimental spectrum can be compared to each theoretical spectrum to find the best PSMs. Unlike variable searches, which allow modifications to occur at any selected amino acid on all protein sequences and thereby multiply the size of the peptide space (i.e., the set of peptides actually compared to each experimental spectrum), SpecOMS defines a large but stable peptide space deduced directly from the protein database. Inside this delimited peptide space, our key idea was to favor PSMs with at least a minimum number of shared masses. This strategy is not new and has been implemented in ModifiComb22 and in tools designed to identify modified peptides by comparison with spectral libraries.23−25 However, these algorithms are limited to comparisons between experimental spectra due to efficiency concerns. As a consequence, a PTM can only be found if unmodified peptides have been correctly identified. Additionally, spectral libraries are only available for a small collection of organisms. SpecOMS resolves these limitations and extends the comparisons to a very large number of theoretical spectra (more than 500 000 in the experiment presented in this study). SpecOMS requires high-quality PSMs to obtain an acceptable FDR because the number of comparisons fosters random



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.7b00308.




(13) He, L.; Diedrich, J.; Chu, Y. Y.; Yates, J. R. Extracting Accurate Precursor Information for Tandem Mass Spectra by RawConverter. Anal. Chem. 2015, 87, 11361−11367.
(14) Na, S.; Bandeira, N.; Paek, E. Fast multi-blind modification search through tandem mass spectrometry. Mol. Cell. Proteomics 2012, 11, M111.010199.
(15) Yu, F.; Li, N.; Yu, W. PIPI: PTM-Invariant Peptide Identification Using Coding Method. J. Proteome Res. 2016, 15, 4423−4435.
(16) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466−1467.
(17) Kim, M. S.; Zhong, J.; Pandey, A. Common errors in mass spectrometry-based analysis of post-translational modifications. Proteomics 2016, 16, 700−714.
(18) Beausoleil, S. A.; Villen, J.; Gerber, S. A.; Rush, J.; Gygi, S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24, 1285−1292.
(19) Fermin, D.; Avtonomov, D.; Choi, H.; Nesvizhskii, A. I. LuciPHOr2: site localization of generic post-translational modifications from tandem mass spectrometry data. Bioinformatics 2015, 31, 1141−1143.
(20) Bailey, C. M.; Sweet, S. M.; Cunningham, D. L.; Zeller, M.; Heath, J. K.; Cooper, H. J. SLoMo: automated site localization of modifications from ETD/ECD mass spectra. J. Proteome Res. 2009, 8, 1965−1971.
(21) Chalkley, R. J.; Clauser, K. R. Modification site localization scoring: strategies and performance. Mol. Cell. Proteomics 2012, 11, 3−14.
(22) Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A. ModifiComb, a new proteomic tool for mapping substoichiometric post-translational modifications, finding novel types of modifications, and fingerprinting complex protein mixtures. Mol. Cell. Proteomics 2006, 5, 935−948.
(23) Ahrne, E.; Nikitin, F.; Lisacek, F.; Muller, M. QuickMod: A tool for open modification spectrum library searches. J. Proteome Res. 2011, 10, 2913−2921.
(24) Ye, D.; Fu, Y.; Sun, R. X.; Wang, H. P.; Yuan, Z. F.; Chi, H.; He, S. M. Open MS/MS spectral library search to identify unanticipated post-translational modifications and increase spectral identification rate. Bioinformatics 2010, 26, 399−406.
(25) Falkner, J. A.; Falkner, J. W.; Yocum, A. K.; Andrews, P. C. A spectral clustering approach to MS/MS identification of posttranslational modifications. J. Proteome Res. 2008, 7, 4614−4622.
(26) Alves, G.; Yu, Y. K. Improving peptide identification sensitivity in shotgun proteomics by stratification of search space. J. Proteome Res. 2013, 12, 2571−2581.
(27) Hart-Smith, G.; Yagoub, D.; Tay, A. P.; Pickford, R.; Wilkins, M. R. Large Scale Mass Spectrometry-based Identifications of Enzyme-mediated Protein Methylation Are Subject to High False Discovery Rates. Mol. Cell. Proteomics 2016, 15, 989−1006.
(28) Tessier, D.; Lollier, V.; Larre, C.; Rogniaux, H. Origin of Disagreements in Tandem Mass Spectra Interpretation by Search Engines. J. Proteome Res. 2016, 15, 3481−3488.

Table S-1: Impact of the number of selected masses on SpecOMS interpretations. Figure S-1: SpecOMS provides insights on the PTMs repartition in the sample. Figure S-2: SpecOMS identifies peptide variants. Figure S-3: SpecOMS identifies unusual PTMs. Figure S-4: SpecOMS identifies charge error assignments. (PDF) Results produced from the analysis of the HEK293 data set by SpecOMS. (XLSX)

AUTHOR INFORMATION

Corresponding Authors

*E-mail: [email protected]. Tel: +33 (0)2 51 12 58 24.
*E-mail: [email protected]. Tel: +33 (0)2 40 67 51 76.

ORCID

Matthieu David: 0000-0001-9405-8892
Dominique Tessier: 0000-0003-1503-7693

Notes

The authors declare no competing financial interest. The current Java implementation of SpecOMS is available at https://github.com/matthieu-david/SpecOMS.



ACKNOWLEDGMENTS

This project is partly funded by the Région Pays de la Loire (France) GRIOTE program (2013-2018).



REFERENCES

(1) Martens, L.; Hermjakob, H.; Jones, P.; Adamski, M.; Taylor, C.; States, D.; Gevaert, K.; Vandekerckhove, J.; Apweiler, R. PRIDE: the proteomics identifications database. Proteomics 2005, 5, 3537−3545.
(2) Griss, J.; Perez-Riverol, Y.; Lewis, S.; Tabb, D. L.; Dianes, J. A.; Del-Toro, N.; Rurik, M.; Walzer, M. W.; Kohlbacher, O.; Hermjakob, H.; Wang, R.; Vizcaino, J. A. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nat. Methods 2016, 13, 651−656.
(3) Bogdanow, B.; Zauber, H.; Selbach, M. Systematic Errors in Peptide and Protein Identification and Quantification by Modified Peptides. Mol. Cell. Proteomics 2016, 15, 2791−2801.
(4) Choi, H.; Nesvizhskii, A. I. False discovery rates and related statistical concepts in mass spectrometry-based proteomics. J. Proteome Res. 2008, 7, 47−50.
(5) Ahrne, E.; Muller, M.; Lisacek, F. Unrestricted identification of modified proteins using MS/MS. Proteomics 2010, 10, 671−686.
(6) Creasy, D. M.; Cottrell, J. S. Unimod: Protein modifications for mass spectrometry. Proteomics 2004, 4, 1534−1536.
(7) David, M.; Fertin, G.; Tessier, D. SpecTrees: An Efficient Without a Priori Data Structure for MS/MS Spectra Identification. Algorithms in Bioinformatics 2016, 9838, 65−76.
(8) Kim, J. S.; Monroe, M. E.; Camp, D. G.; Smith, R. D.; Qian, W. J. In-source fragmentation and the sources of partially tryptic peptides in shotgun proteomics. J. Proteome Res. 2013, 12, 910−916.
(9) Na, S.; Paek, E. Software eyes for protein post-translational modifications. Mass Spectrom. Rev. 2015, 34, 133−147.
(10) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207−214.
(11) Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotechnol. 2015, 33, 743−749.
(12) Chalkley, R. J. When target-decoy false discovery rate estimations are inaccurate and how to spot instances. J. Proteome Res. 2013, 12, 1062−1064.
