Article pubs.acs.org/jpr
PIPI: PTM-Invariant Peptide Identification Using Coding Method Fengchao Yu,† Ning Li,*,†,‡ and Weichuan Yu*,†,§ †
Division of Biomedical Engineering, The Hong Kong University of Science and Technology, Hong Kong, China Division of Life Science, The Hong Kong University of Science and Technology, Hong Kong, China § Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China ‡
S Supporting Information *
ABSTRACT: In computational proteomics, the identification of peptides with an unlimited number of post-translational modification (PTM) types is a challenging task. The computational cost associated with database search increases exponentially with respect to the number of modified amino acids and linearly with respect to the number of potential PTM types at each amino acid. The problem becomes intractable very quickly if we want to enumerate all possible PTM patterns. To address this issue, one group of methods named restricted tools (including Mascot, Comet, and MS-GF+) only allow a small number of PTM types in database search process. Alternatively, the other group of methods named unrestricted tools (including MS-Alignment, ProteinProspector, and MODa) avoids enumerating PTM patterns with an alignment-based approach to localizing and characterizing modified amino acids. However, because of the large search space and PTM localization issue, the sensitivity of these unrestricted tools is low. This paper proposes a novel method named PIPI to achieve PTM-invariant peptide identification. PIPI belongs to the category of unrestricted tools. It first codes peptide sequences into Boolean vectors and codes experimental spectra into real-valued vectors. For each coded spectrum, it then searches the coded sequence database to find the top scored peptide sequences as candidates. After that, PIPI uses dynamic programming to localize and characterize modified amino acids in each candidate. We used simulation experiments and real data experiments to evaluate the performance in comparison with restricted tools (i.e., Mascot, Comet, and MS-GF+) and unrestricted tools (i.e., Mascot with error tolerant search, MS-Alignment, ProteinProspector, and MODa). Comparison with restricted tools shows that PIPI has a close sensitivity and running speed. Comparison with unrestricted tools shows that PIPI has the highest sensitivity except for Mascot with error tolerant search and ProteinProspector. These two tools simplify the task by only considering up to one modified amino acid in each peptide, which results in a higher sensitivity but has difficulty in dealing with multiple modified amino acids. The simulation experiments also show that PIPI has the lowest false discovery proportion, the highest PTM characterization accuracy, and the shortest running time among the unrestricted tools. KEYWORDS: peptide identification, unrestricted PTM identification, database search
1. INTRODUCTION
where k is the number of modified amino acids and n is the average number of potential PTM types at each modified amino acid. The number of theoretical spectra becomes very large, even when we only consider a few PTM types. Some tools9−15 use tags to accelerate searching speed. A tag is a sequence fragment inferred from a spectrum. InsPecT11 is a typical example. Given an experimental spectrum, it first infers tags. Then, it picks out peptide sequences containing these tags. Finally, it uses these peptide sequences to generate a smaller database for fine search. During database search, InsPecT enumerates all PTM patterns, which still results in a large number of theoretical spectra if it considers many PTM types. Thus, restricted tools only allow a small number of modified amino acids and PTM types during a database search.
Shotgun proteomics has achieved great success after more than 20 years of development. On the basis of the database search idea, researchers have proposed many tools to identify peptides. According to the approaches to dealing with post-translational modification (PTM), we can classify these tools into two categories: restricted tools1−15 and unrestricted tools.5,16−35 Restricted tools generate theoretical spectra by in silico fragmenting peptide sequences. They infer an experimental spectrum’s corresponding peptide sequence by finding the most similar theoretical spectrum. These tools need to generate different theoretical spectra corresponding to different PTM patterns. Given a peptide sequence, the number of theoretical spectra follows (n + 1)k
Received: May 27, 2016 Published: October 17, 2016
(1) © 2016 American Chemical Society
4423
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research
Figure 1. Workflow of PIPI.
distances are invariant to PTM. Then, it converts peaks into a vector and searches for up to 20 candidates. PIPI does not need to infer the exact locations of modified amino acids during the database search. Thus, it bypasses the PTM localization problem in database search and leads to a better performance in both peptide identification and PTM characterization. The rest of the paper is organized as follows: Section 2 describes coding, database search, PTM localization and characterization, final score calculation, and q-value estimation. Section 3 presents three sets of experiments to demonstrate the performance of PIPI. Section 4 discusses the relationship between PIPI and other existing tools.
Unrestricted tools identify spectra with unlimited PTM types. Spider18 and OpenSea20 obtain parts of a spectrum’s sequence by de novo sequencing.36−38 Then, they infer the modified amino acids by comparing the sequence parts with candidate peptide sequences from a database. Mascot with error tolerant search39 uses two steps to identify as many spectra as possible. It first identifies all spectra in a restricted manner. Then, it generates a small database based on the identification result. With the small database, it tries different approaches to searching candidates matching for each spectrum. These approaches include allowing one more missed cleavages, trying all modifications one by one, substituting each amino acid with the others one by one. It is worth mentioning that Mascot tries these possibilities sequentially, which means that each spectrum can only have one of the above situations. MS-Alignment17 compares an experimental spectrum with PTM-free theoretical spectra. It uses dynamic programming to find the overlapping peaks and treats gaps between the overlapping parts as modified amino acids. MS-Alignment only supports up to two modified amino acids in each peptide. The Paragon Algorithm40 in ProteinPilot Software uses a twopass method to cover unspecified modifications. The first pass identifies spectra in a restricted manner, and the second pass identifies spectra in a unrestricted manner with the help of tags. Finally, it incorporates information from both passes to calculate a final score for each spectrum. PeaksPTM41 identifies as many peptides as possible. It first identifies peptides without any PTM using a restricted approach. Then, it identifies peptides containing one PTM using an exhaustive search approach. After getting these two kinds of peptides, it rescores them for achieving a high sensitivity. Finally, it uses the PTM identified in the previous steps to search for peptides with multiple PTM types. MODa31 infers various lengths of sequence fragments from a spectrum and aligns tags against peptide sequences. It also uses dynamic programming to find the best alignment result. After the alignment, it calculates a score for each sequence and selects the one with the highest score. All of these tools’ scoring functions rely on the PTM pattern, which means that the accuracy of PTM localization strongly influences the performance of the identification. However, PTM localization is not an easy task. Although various methods have been proposed,42−44 it is still difficult to determine the exact locations.45−47 We propose a PTM-invariant peptide identification method named PIPI. PIPI belongs to the category of unrestricted tools. It first builds a theoretical database of peptide sequences by converting each sequence into a coded Boolean vector. Each element in the vector indicates whether the corresponding three-amino-acids tag exists in the original sequence. When analyzing a spectrum, PIPI only extracts peaks whose relative
2. METHODOLOGY Figure 1 shows the workflow of PIPI. There are four major steps: 1. Peptide sequence coding and spectrum coding. 2. Database search. 3. PTM localization and characterization. 4. Final score calculation and q-value estimation. We will describe each step in detail. 2.1. Peptide Sequence Coding and Spectrum Coding
2.1.1. Peptide Sequence Coding. PIPI in silico digests proteins to peptides and codes peptide sequences into Boolean vectors called coded sequences. Given a peptide sequence, PIPI extracts length-3 tags from the N-terminal to the C-terminal with an overlapped sliding window (Figure 2). Because there is no intensity information in a peptide sequence, values in the coded sequence are either 0 or 1. Elements corresponding to extracted tags have 1 values and the others have 0 values. To estimate q-value later on, PIPI appends the original database with a decoy database generated by shuffling. For each digested peptide in the concatenated database, PIPI codes it into a coded sequence. After coding, PIPI indexes all vectors based on their corresponding peptides’ masses. 2.1.2. Spectrum Coding. Given a spectrum, PIPI first removes noisy peaks and normalizes peak intensities. It uses the peak intensity with the highest frequency as a threshold48 and eliminates all peaks whose intensities are smaller than the threshold. Then, PIPI replaces each peak’s intensity with its square root and normalizes the peak intensities, as in SEQUEST,2 by dividing the whole m/z range into 10 regions. In each region, it normalizes peak intensities so that the highest one equals 1. Some ions may not be detected in a spectrum due to the limited fragmentation efficiency and instrument’s detection sensitivity. PIPI checks each peak to see if its complementary peak exists in a spectrum. If not, it adds the complementary peak with the same intensity to the corresponding location. 4424
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research
amino acids, including two additional ones, “U” and “O”. With the setting above, we obtain the length of the vector Lv =
213 − 21 − 21 × 20 + 21 + 21 × 20 = 4851 2 (4)
The order of the tags does not matter as long as it is consistent in the whole workflow. Figure 3 illustrates how PIPI codes a spectrum.
Figure 2. Illustration of peptide sequence coding and database building. Top: PIPI extracts tags from a peptide sequence with an overlapped sliding window. It codes different tags into a Boolean vector. Each peptide sequence in the database becomes a coded vector. Bottom: PIPI codes all peptide sequences into Boolean vectors one by one.
Figure 3. Illustration of spectrum coding. PIPI extracts length-3 tags from a spectrum and codes these tags into a vector. Its indexes indicate different tags, and its values are the intensity summations of the peaks forming the corresponding tags.
Two peaks are complementary to each other if the sum of their m/z values equals Sm + 2 × pm, where Sm is the precursor mass of the spectrum and pm is the mass of a proton. Please note that PIPI only considers single charged fragmentations. PIPI also adds two one-intensity peaks with the m/z values corresponding to the N-terminal and the C-terminal, respectively. After adding peaks, PIPI expresses a spectrum as a matrix SLs×2, where Ls is the total number of peaks. The elements si,1 and si,2 are the m/z value and the intensity of the ith peak, respectively. Two peaks form a peak pair if they satisfy the following relationship |sj,1 − si,1 − Δk| ≤ 2τ, where k ∈ [1, 22] is an index of the 22 amino acids (including “U” and “O”), Δk is the mass of one of the 22 amino acids, and τ is the tandem mass spectrum (MS/MS) mass tolerance. A peak pair consisting of the ith and the jth peak is denoted as P(i, j). Two peak pairs P(i, j) and P(i′, j′) can be linked if j = i′, and a number of pairs can be linked sequentially to form a tag. For simplicity, we denote an Lt length tag as P1P2...PLt. In practice, most spectra can produce many tags due to the large number of noisy peaks. If there were more than 200 tags in a spectrum, PIPI would divide the whole m/z range into 10 regions and keep the top 20 tags in each region. PIPI uses the m/z value of a tag’s first peak to locate the tag. As long as the first peak is in the region, the tag is considered as in the region. PIPI codes tags with the same length into a vector v = [v1, v2, ..., vi, ..., vLv], where Lv is the length of the vector and
vi =
∑ si,2 i∈0
0 = {i|i ∈ PP 1 2 ... PLt }
2.2. Similarity Measure and Database Search
2.2.1. Similarity Measure. PIPI uses the cross-correlation coefficient to measure the similarity between a coded sequence and a coded spectrum S(v1, v2) =
(v1)T v2 ∥v1∥∥v2∥
(5)
where v1 is a coded sequence and v2 is a coded spectrum. There are two parts in the cross-correlation coefficient: a dot product in the numerator and a product of two l-2 norms in the denominator. Given two vectors, the dot product measures the overlapping level, and the denominator normalizes the dot product. The cross-correlation coefficient measures the similarity between a spectrum and a peptide sequence. To choose the right tag length, we studied the discriminant power of different lengths empirically. We used the database of Homo sapiens (human) from UniProtKB/Swiss-Prot (20 205 proteins, 2015-11 release). We in silico digested these proteins with trypsin and kept peptides with masses from 500 to 5000 Da. In digestion, we allowed no missed cleavage. There are in total 638 480 nonredundant peptides. PIPI coded all of these peptides and calculated the cross-correlation coefficients using pairs of coded sequences whose peptides’ mass differences were within the range of [−250, 250] Da. It used different tag lengths, from two to four amino acids, for coding. Table 1 shows the relative frequencies of the cross-correlation coefficients from 0 to 0.5. Please note that the cross-correlation
(2) (3)
Here i ∈ P1P2...PLt means that i belongs to one of the indexes of the peaks forming P1P2...PLt. In this paper, we set Lt = 3 and will demonstrate that this is a good choice later on. We call the vector a coded spectrum. Because the tags extracted from a spectrum can be from b ions or y ions (under the collisioninduced dissociation (CID) setting), PIPI cannot determine the direction of a tag. Thus PIPI treats a tag and its reversed version as the same. PIPI also treats amino acids “I” and “J” equally because they have an identical mass. There are in total 22
Table 1. Relative Frequencies of the Cross-Correlation Coefficients from 0 to 0.5
0−0.1 0.1−0.2 0.2−0.3 0.3−0.4 0.4−0.5 4425
Tag 2
Tag 3
Tag 4
0.6496 0.2168 0.1016 0.0233 0.0067
0.9735 0.0206 0.0050 0.0000 0.0000
0.9982 0.0013 0.0000 0.0000 0.0000
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research
Figure 4. (a) Two histograms of the cross-correlation coefficients by comparing coded vectors in a pairwise manner. Homologous pairs and heterologous pairs are labeled with different colors. (b) ROC curve. (c) PR curve. We also added baselines corresponding to random guess. Because the number of the heterologous pairs is much larger than that of the homologous pairs (more than 1000:1), the baseline of the PR curve is very close to the x axis.
show that two kinds of pairs are separated well in terms of the cross-correlation coefficient. 2.2.2. Database Search. After coding all spectra and peptide sequences, PIPI finds at most 20 coded sequences for the coded spectrum. Concretely, PIPI first finds all possible coded sequences whose corresponding peptides’ masses are within the range [Sm − ν, Sm + ν], where Sm is the spectrum’s precursor mass and ν is a predefined value. Then, it uses eq 5 to measure the similarity between a coded sequence and the coded spectrum. Because PIPI does not know if a spectrum contains any PTM, it maintains two lists for the coded spectrum. The first list contains up to 10 coded sequences whose masses are in the spectrum’s mass tolerance range. The second list contains up to 10 coded sequences whose masses are not in the spectrum’s mass tolerance range. Clearly, the first list corresponds to PTM-free peptide candidates, and the second list corresponds to PTM-containing peptide candidates. Thus PIPI only needs to perform PTM localization and characterization for peptides in the second list.
coefficient of two identical vectors equals 1. Most of the crosscorrelation coefficients under the “tag 3” and “tag 4” settings are smaller than 0.1, which means that PIPI can separate coded vectors from different peptide sequences well. Because a longer tag requires a higher spectrum quality, which is not always satisfied, we decided to use tags of length of three amino acids. We also used a real data set to investigate the discriminant power of coded vectors coupled to the cross-correlation coefficient measure. We chose 18 757 MS/MS spectra from a data set published by Chick et al.34 There are 14 843 PTM-free spectra and 3 914 PTM-containing spectra. All of them were identified by Comet7 with q-values ≤0.01. Comet is an opensource implementation of SEQUEST’s algorithm. We set five variable modifications (i.e., oxidation, phosphorylation, acetylation, methylation, and deamidation) when using Comet. We let PIPI code them and calculate the cross-correlation coefficients using pairs of coded vectors. Without considering PTM difference, if a pair of two different vectors is from the same peptide, it is called a homologous pair; if a pair of two different vectors is from different peptides, it is called a heterologous pair. The allowed peptide mass difference was also from −250 to 250 Da. Because we allowed a wide mass difference and did not consider PTM difference in determining the homologous pairs and heterologous pairs, the comparison was PTM-invariant. We plotted two histograms of crosscorrelation coefficients from two kinds of pairs, a receiver operation characteristic (ROC) curve, and a precision-recall (PR) curve in Figure 4. Each histogram was normalized by its total count. ROC curve is widely used in evaluating a method. Because the number of the heterologous pairs is much larger than that of the homologous pairs (more than 1000:1), PR curve is more suitable in evaluating a method with imbalanced data.49 In these two curves, we also added baselines corresponding to random guess. Please note that the baseline of the PR curve is very close to the x axis. After studying the data, we observed that there were spectra whose peptide sequences only differ by one or two amino acids. Although pairs of those spectra are heterologous, they tend to have high crosscorrelation coefficients. On the contrary, missing peaks and noisy peaks may result in missed true tags and additional fake tags; different PTM patterns also results in different tags because a modified amino acid “destroys” tags containing it. Thus some homologous pairs have lower cross-correlation coefficients than those of some heterologous pairs. However, these two situations are minor. All of the histograms and curves
2.3. PTM Localization and Characterization
Here we describe how PIPI locates and characterizes modified amino acids given a spectrum and a peptide sequence. Researchers have used dynamic programming to infer the locations and mass shifts of modified amino acids. MSAlignment aligns an experimental spectrum’s peaks against those from a theoretical spectrum, while MODa aligns tags of different lengths against a peptide sequence. Because PIPI’s coding procedure has already extracted length-3 tags, it aligns these tags against a peptide sequence using dynamic programming. Before the alignment, PIPI compares each tag’s m/z value in the experimental spectrum with that in the PTM-free theoretical spectrum. It only keeps those tags whose experimental m/z values are within the range [Tmz − ν, Tmz + ν], where Tmz is the m/z value in the PTM-free theoretical spectrum. We call this process tag cleaning (Figure 5). After tag cleaning, PIPI adds the N-terminal and the C-terminal as two special tags. We denote a tag as ti, where i is its index. We define t(1) as i the location of the first amino acid in the peptide sequence and I(ti) as the summation of the peak intensities of the tag. The dynamic programming matrix is D|{ti}|×(n+2), where |{ti}| is the number of tags and n equals the length of the peptide sequence. During the dynamic programming there are two kinds of jumps: jumps within a tag and jumps between two tags. 4426
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research
on average more than 100 modifications for each amino acid. Thus the scoring rules are as follows: ⎧ di , t (1)− 1 + I(ti) jump within a tag i ⎪ ⎪ non‐PTM jump from i′ to i di , j = ⎨ di′, ti(1) + I(ti) ⎪ ⎪ di′, t (1) + I(ti) − p PTM jump from i′ to i ⎩ i (6)
where di,j is an element of D|{ti}|×(n+2) and p is a penalty for a PTM jump. 2.4. Final Scoring and q-Value Estimation
Figure 5. Illustration of tag cleaning. PIPI compares tags against a peptide sequence to obtain the relative shifts from the corresponding PTM-free location. The diagonal green or red bars indicate tags. PIPI only keeps those tags whose relative shifts are within a predefined range (green bars in the figure).
For peptide identification, restricted tools (e.g., SEQUEST, Comet, and MS-GF+8) have a higher sensitivity than unrestricted tools (e.g., MS-Alignment, ProteinProspector, and MODa) if PTM patterns are included in the theoretical spectra. Besides, the original spectrum contains more information than the coded vector. As PIPI has already known each spectrum-peptide pair’s PTM pattern after the PTM localization and characterization step (Section 2.3), it calculates scores using the original spectrum. After the previous two steps (Sections 2.2 and 2.3), each spectrum has at most 20 peptide sequences as candidates. PIPI calculates a score for each spectrum-peptide pair using the XCorr2
Because tags have sequence and peak location information, not all jumps between tags are meaningful. Thus we define the following jumping rules (Figure 6):
XCorr(t, e) = tT e −
1 150
75
tT eδ
∑ δ =−75, δ ≠ 0
(7)
where t is a vector of the digitized theoretical spectrum, e is a vector of the digitized experimental spectrum, and δ is an m/z shift. XCorr is a score function used by popular tools such as SEQUEST and Comet. For each spectrum, the top-scored peptide is kept as the final result. For each peptide-spectrum match (PSM), PIPI also estimates its corresponding e-value using the linear tail-fit method.51 Finally, PIPI estimates q-values for PSMs. In most cases,52−55 people convert false discovery rate (FDR) to q-value that is monotonically decreasing with respect to the score. Without specific description, we always convert FDR to q-value and use it as cutoff. To achieve a high sensitivity, PIPI uses Percolator52 to estimate q-values for PSMs. Briefly speaking, Percolator uses a machine learning method to reorder all PSMs based on multiple features (e.g., XCorr, e-value, and mass error). Those reordered PSMs are used to estimate q-values. We refer this approach as “the Percolator approach”. To demonstrate the sensitivity of PIPI’s score function, PIPI also uses each PSM’s evalue to estimate FDR55−57
Figure 6. Illustration of tag alignment. Tags are aligned against a peptide sequence. “n” and “c” indicate two special tags: N-terminal and C-terminal. There is a modification on “M” in the peptide sequence. Jumps between or within tags are labeled with different colors corresponding to different jumping types. Numbers on the jumping arrows indicate whether the jump is a non-PTM jump (circled 1) or a PTM jump (circled 2).
1. Jumps within a tag are allowed (the green arrows in Figure 6). 2. Jumps from the end of a tag to the start of another tag are allowed (the black arrows in Figure 6). 3. Jumps from the middle of a tag to the start of another tag are allowed only if the end of the former tag overlaps with the start of the latter tag (the blue arrow in Figure 6). Overlapping means that they have the same substring and the same peak locations. 4. Jumps from the end of a tag to the end of another tag are not allowed (the red arrow in Figure 6). Jumps between tags can be classified into two categories: 1. There is no modified amino acid between two tags, which is called a non-PTM jump (circled 1 in Figure 6). 2. There are modified amino acids between two tags, which is called a PTM jump (circled 2 in Figure 6). To make sure that all PTM jumps are meaningful, PIPI uses the PTM types in the Unimod database50 as reference. This database covers almost all known PTM types. There are
⎛ d (s ) , FDR(s) = min⎜ ⎝ max(t(s), 1)
⎞ 1⎟ ⎠
(8)
where s is an e-value threshold, d(s) is the number of decoy PSMs whose e-values are smaller than or equal to s, and t(s) is the number of target PSMs whose e-values are smaller than or equal to s. Then, it converts FDR into q-value using q(t ) = min FDR(s) s≥t
(9)
where t is a FDR threshold. We refer this approach as “the basic target-decoy approach”. 4427
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research
Figure 7. Bar plots showing the percentages of positive PSMs identified by Mascot, Comet, MS-GF+, Mascot_ET, MS-Alignment, ProteinProspector, MODa, and PIPI, respectively. “Mascot_ET” stands for Mascot with error tolerant search, “MSA” stands for MS-Alignment, and “PP” stands for ProteinProspector. For each bar, the blue part corresponds to true-positive PTM-free PSMs, the orange part corresponds to truepositive PTM-containing PSMs, and the gray part corresponds to false-positive PSMs. The values in the blue and orange parts are the corresponding percentages with respect to the number of all spectra. The value at the top of each bar is the percentage of all positive PSMs with respect to the number of all spectra.
3. EXPERIMENTAL RESULTS We used three sets of experiments to evaluate the performance of PIPI. Section 3.1 shows four simulation experiments. Each of them uses 12 064 high-quality MS/MS spectra and a custom database. Because we knew the ground truth, we evaluated PIPI according to four criteria: sensitivity, false discovery proportion (FDP), PTM characterization accuracy, and running speed. Section 3.2 uses five public data sets from the standard protein mixture samples.58 Section 3.3 uses 24 public data sets published by Chick et al.34 In these two sections, we did not know the ground truth. Thus we evaluated PIPI according to two criteria: sensitivity and running speed. Please refer to Klimek et al.58 and Chick et al.34 for details of sample preparation and data acquisition. In these three sets of experiments, we used Mascot (version: 2.5), Comet (version: 2016.01 rev. 2), MS-GF+ (version: 10072), MS-Alignment (version: 20120109), ProteinProspector (version: 5.16.0), MODa (version: 1.51), and PIPI (version: 1.2.4). Because Mascot supports unrestricted search by turning on error tolerant search option, we ran Mascot twice for each data file: one for restricted search and the other for unrestricted search. We refer the Mascot with error tolerant search as Mascot_ET for simplicity. For restricted tools (i.e., Mascot, Comet, and MS-GF+), we specified five variable modifications (i.e., oxidation on “M”; phosphorylation on “S”, “T”, and “Y”; acetylation on “K” and peptide N-terminal; methylation on “K”; deamidation on “N” and “Q”). For unrestricted tools (i.e., MSAlignment, ProteinProspector, MODa, and PIPI), we set the modification delta mass from −250 to 250 Da. Mascot_ET does not have the modification delta mass setting. We allowed all amino acids, peptide N-terminal, and peptide C-terminal to
be modified. Mascot_ET only supports up to one unspecified modification or one type of amino acids being substituted in each peptide. MS-Alignment needs to specify the maximum number of each peptide’s modified amino acids. The default value is 1. We set it to 2 in the second to fourth simulation experiments and 1 in the other experiments. ProteinProspector works in either restricted or unrestricted manner. We used it in the unrestricted manner by allowing mass modifications on all amino acids. We found that under the unrestricted manner, ProteinProspector only considers up to one modified amino acid in each peptide. MODa and PIPI do not limit the number of modified amino acids in each peptide. All of these tools’ precursor mass tolerance is 10 ppm, and MS/MS mass tolerance is 0.02 Da. We only considered MS/ MS spectra whose precursor masses were from 600 to 5000 Da. This common range has been used by many tools.1,2,4,6,7 All of the databases contain both target proteins and decoy proteins generated by reversing or shuffling.8,57,59,60 For simplicity, the number of proteins mentioned in this paper only refers to the number of target proteins. We allowed no missed cleavage. To our knowledge, MS-Alignment, ProteinProspector, and PIPI do not support C13 correction.7,8,39 Thus we turned off all other tools’ corresponding function for a fair comparison. In the simulation experiments (Section 3.1), we used “the basic target-decoy approach” to estimate the q-values. We tried to use Percolator. However, it failed with an error “the input data has too good separation between target and decoy PSMs”. This is because all spectra have high quality, which results in insufficient decoy PSMs for Percolator to use. In the real experiments (Sections 3.2 and 3.3), we used both of “the basic target-decoy approach” and “the Percolator approach” to 4428
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research Table 2. Percentages of the Scenarios Causing Missed Findings Scenario 3 Simulation Simulation Simulation Simulation
Scenario 1
Scenario 2
top scored
not top scored
Scenario 4
0.023 0.014 0.010 0.011
0.663 0.726 0.771 0.816
0.082 0.115 0.103 0.089
0.189 0.136 0.114 0.084
0.043 0.010 0.001 0.000
1 2 3 4
estimate the q-values. All of these experiments’ cutoff is q-value ≤0.01.
PTM-containing PSMs identified by restricted tools. The reason is that some PSMs’ PTM masses are close to the masses of the specified variable modifications by accident. PIPI identifies more true PSMs than other unrestricted tools except for ProteinProspector. ProteinProspector simplifies the task by only considering up to one modified amino acid in each peptide. Such simplification makes the sensitivity higher but has difficulty in handling multiple modified amino acids. Because we did not consider the correctness of PTM patterns in PSM comparison, the bars corresponding to ProteinProspector still have true-positive PTM-containing PSMs in “Simulations 2−4”. However, the percentage drops very quickly. This is because there is more than one modified amino acid in “Simulations 2− 4”. The observation is similar for MS-Alignment, which supports up to two modified amino acids in each peptide. It is easy to see that PIPI is the most robust one when the number of modified amino acids increases. It is worth mentioning that Mascot_ET identifies only a few PTM-containing PSMs with a large number of false-positive PSMs. Mascot_ET is supposed to identify PTM-containing PSMs, but unfortunately it failed. The reason is unclear at the moment. Please refer to the Supplementary Results in the Supporting Information for details. To find the reasons of low sensitivity in identifying PTMcontaining PSMs, we analyzed the results based on the ground truth and the candidate list of each spectrum generated by PIPI. According to the step in which PIPI misses the correct peptide, we list the scenarios as the following: 1. Missing the peptide in step 1 (Section 2.1.2): PIPI does not find enough tags to generate coded spectrum. 2. Missing the peptide in step 2 (Section 2.2): PIPI does not include the correct peptide sequence in the candidate list. 3. Missing the peptide in step 3 (Section 2.3): The correct peptide sequence is in the candidate list, but the dynamic programming infers an incorrect PTM pattern. It has two outcomes: the peptide sequence is or is not the top scored one after step 4. 4. Missing the peptide in step 4 (Section 2.4): The correct peptide sequence along with the correct PTM pattern is the top scored one, but it does not pass the q-value cutoff. Table 2 shows the percentages of these scenarios. We list the percentages of Scenario 3’s two outcomes separately. We can see that the second and the third scenarios contribute most of the missed findings. In spite of this pitfall, PIPI still has the highest sensitivity and the highest robustness among all unrestricted tools in most cases. 3.1.3. False Discovery Proportion and PTM Characterization Accuracy. We calculated FDP for these results using
3.1. Simulation Experiments
3.1.1. Data Generation. We picked 12 056 PTM-free MS/ MS spectra from the data sets in Chick et al.34 All of them have e-values ≤0.01. These spectra correspond to 6746 nonredundant peptides. We randomly selected half of these peptides and randomly replaced one amino acid in each selected peptide. This is the approach used in Na et al.31 To maintain the number of peptides and avoid ambiguity, we also considered the following factors: • “K”, “R”, and “P” are related to trypsin digestion site. • “I” and “L” have the same monoisotopic mass. • “C” has a fix modification. Thus these amino acids are not replaced and the replacements also do not contain them. After modifying half of these peptides, the corresponding spectra become PTM-containing. There are in total 5952 PTM-containing spectra and 6104 PTM-free spectra corresponding to 6746 peptides as a custom database. To mimic the real situation, we added 116 common contaminant proteins from the common repository of adventitious proteins (cRAP)61 to the database. We call this simulation data set “Simulation 1”. We repeated the above procedure to generate three more simulation data sets. In the second simulation data set, half of the peptides contain two modified amino acids and the other half contain no modified amino acid. There are in total 5933 PTM-containing spectra and 6123 PTM-free spectra. In the third simulation data set, half of the peptides contain three modified amino acids and the other half contain no modified amino acid. There are in total 5923 PTM-containing spectra and 6133 PTM-free spectra. In the fourth simulation data set, half of the peptides contain four modified amino acids and the other half contain no modified amino acid. There are in total 6013 PTM-containing spectra and 6043 PTM-free spectra. We call these data sets “Simulation 2”, “Simulation 3”, and “Simulation 4”, respectively. The spectra files and databases can be found in the Supporting Information. 3.1.2. Sensitivity. Because we knew the ground truth, we could label the true-positive PSMs and false-positive PSMs. Figure 7 shows the stacked bars of the results. For each bar, the blue part corresponds to true-positive PTM-free PSMs, the orange part corresponds to true-positive PTM-containing PSMs, and the gray part corresponds to false-positive PSMs. We did not consider the difference of PTM patterns during PSM comparison. The values in the blue and orange parts are the corresponding percentages with respect to the number of all spectra. The value at the top of each bar is the percentage of all positive PSMs with respect to the number of all spectra. We can see that restricted tools (including Mascot, Comet, and MS-GF+) almost identify none of PTM-containing PSMs. This is because the number of simulated PTM types is large and there is no way to specify all of them during the restricted search. A closer look reveals that there are is small number of
FDP = 4429
V max(V + S , 1)
(10) DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research Table 3. FDP of Four Simulation Experiments Mascot Comet MS-GF+ Mascot_ET MS-Alignment ProteinProspector MODa PIPI
Simulation 1
Simulation 2
Simulation 3
Simulation 4
0.035 0.093 0.090 0.197 0.037 0.035 0.062 0.021
0.039 0.078 0.071 0.192 0.048 0.057 0.065 0.025
0.034 0.072 0.068 0.192 0.043 0.060 0.072 0.030
0.035 0.068 0.063 0.176 0.040 0.062 0.074 0.033
Simulation 1
Simulation 2
Simulation 3
Simulation 4
0.700 0.713 0.732 0.736 0.812
0.000* 0.099 0.000* 0.259 0.419
0.000* 0.000* 0.000* 0.046 0.166
0.000* 0.000* 0.000* 0.000 0.046
Table 4. Accuracy of PTM Localization and Characterizationa Mascot_ET MS-Alignment ProteinProspector MODa PIPI
Values are the percentages of PSMs having correct PTM patterns. “Mascot_ET” stands for Mascot with error tolerant search. Because Mascot_ET and ProteinProspector only consider up to one modified amino acid or one type of amino acids being substituted in each peptide, we marked “*” on the corresponding values of “Simulations 2−4”, whose PTM-containing peptides contain more than one modified amino acid. Also, because MSAlignment only considers up to two modified amino acids in each peptide, we marked “*” on the corresponding values of “Simulations 3 and 4”, whose PTM-containing peptides contain more than two modified amino acids. a
where V is the number of false-positive PSMs and S is the number of true-positive PSMs. The results (Table 3) show that PIPI has the lowest FDP among all tools. Please note that, during the calculation of the FDP, we did not consider the correctness of the PTM pattern. Actually, the accuracy of PTM localization and characterization is quite low. With the number of PTM-containing PSMs whose sequences are correct, we calculated the percentages of PSMs having correct PTM patterns. We list the values in Table 4. Because Mascot_ET and ProteinProspector only consider up to one modified amino acid or one type of amino acid being substituted in each peptide, we marked “*” on the corresponding values of “Simulations 2−4”, whose PTMcontaining peptides contain more than one modified amino acid. Also, because MS-Alignment only considers up to two modified amino acids in each peptide, we marked “*” on the corresponding values of “Simulations 3 and 4”, whose PTMcontaining peptides contain more than two modified amino acids. The table shows that PIPI has the highest PTM localization and characterization accuracy. To find the reasons of getting false positives in identifying PTM-containing peptides, we analyzed the results based on the ground truth and the candidate lists generated by PIPI. According to the step in which PIPI gets the false peptide, we list the scenarios as the following: 1. Getting a false peptide in step 2 (Section 2.2): PIPI does not include the correct peptide sequence in the candidate list. 2. Getting a false peptide in step 3 (Section 2.3): The correct peptide sequence is in the candidate list, but the dynamic programming infers an incorrect PTM pattern. It has two outcomes: the correct peptide sequence is or is not the top scored one after step 4. We also calculated two XCorr scores (eq 7) for each spectrum: One corresponds to the ground truth and the other corresponds to the false-positive PSM. Table 5 shows the percentages of both scenarios and the percentages of a false-
Table 5. Percentages of Scenarios Causing False-Positive PSMs and the Percentages of the False-Positive PSM Having a Higher XCorr Score than the Ground Truth scenario 2
Simulation Simulation Simulation Simulation
1 2 3 4
scenario 1
top scored
not top scored
the false positive has a higher XCorr
0.040 0.058 0.109 0.172
0.805 0.881 0.850 0.770
0.155 0.061 0.041 0.058
0.153 0.159 0.175 0.241
positive PSM having a higher XCorr score than the ground truth. We list scenario 2’s two outcomes separately. We can see that the inaccuracy of dynamic programming is the major reason for getting false-positive PSMs. It is also interesting to find that a considerable fraction of false-positive PSMs having a higher XCorr scores than their ground truth. This means that there is no way to find these spectra’s correct results based on XCorr score function. It raises another issue of unrestricted search: There is a possibility that we can always find a higher scored peptide given a spectrum by adding more PTM types to match the peaks. A further discussion of this issue and the inaccuracy of dynamic programming is beyond the scope of this paper. We will address these issues in the future. 3.2. Experiments with Standard Protein Mixture Samples
We used five public data sets from the standard protein mixture samples58 to demonstrate the performance of PIPI in real data. We used the database published along with the data sets, which contains 18 standard proteins and 1801 contaminant proteins. We used restricted tools (i.e., Mascot, Comet, and MS-GF+) and unrestricted tools (i.e., Mascot_ET, MS-Alignment, ProteinProspector, MODa, and PIPI) to do the identification. Figure 8 plots the number of positive PSMs. The value at the top of each bar is the total number of positive PSMs. We used three different colors to indicate different categories of tools: Yellow indicates restricted tools, green indicates unrestricted 4430
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research
Figure 8. Bar plot showing the number of positive PSMs from standard protein mixture samples. “Mascot_ET” stands for Mascot with error tolerant search, “MSA” stands for MS-Alignment, and “PP” stands for ProteinProspector. The value at the top of the bar is the number of positive PSMs. Yellow bars correspond to restricted tools, green bars correspond to unrestricted tools considering up to one modified amino acid in each peptide, and blue bars correspond to unrestricted tools considering unlimited number of modified amino acids in each peptide. Under these bars, a single tool’s name indicates that its corresponding q-values are estimated using “the basic target-decoy approach”, and a tool’s name “+ Percolator” indicates that its corresponding q-values are estimated using “the Percolator approach”.
tools considering up to one modified amino acid in each peptide, and blue indicates unrestricted tools considering unlimited number of modified amino acids in each peptide. Under each bar, a single tool’s name indicates that its corresponding q-values are estimated using “the basic targetdecoy approach”, and a tool’s name “+ Percolator” indicates that its corresponding q-values are estimated using “the Percolator approach”. An error was reported when Percolator analyzed Mascot_ET’s results. Thus there is no result for “Mascot_ET + Percolator”. Figure 8 shows that PIPI identifies more PSMs than Mascot, MS-Alignment, and MODa but fewer PSMs than Comet, MSGF+, Mascot_ET, and ProteinProspector. Comet and MS-GF+ are restricted tools that achieve a higher sensitivity by only considering specified PTM types. Mascot_ET and ProteinProspector simplify the task by only considering up to one modified amino acid or one type of amino acids being substituted in each peptide. The performance of PIPI is poor under the “the basic target-decoy approach” setting. This is because the accuracy of the estimated e-value suffers from the small size of the database. When the database size is small, there are not enough peptides for generating random-match scores. In Section 3.3, we see that this issue disappears when the database is large enough. PIPI plus Percolator approach also does not have this issue. This is because Percolator uses multiple scores to estimate q-value, which reduces the effect of the inaccurate e-value. It is worth mentioning that Section 3.1 also does not have this issue because all of those spectra have a high quality and there are few decoy PSMs in the result. Please refer to Supplementary Results in the Supporting Information for details.
we generated a small database based on the procedure proposed by MS-Alignment.17 We first searched these data sets against the whole human database using InsPecT, which is a restricted tool. Then, we picked all proteins that had at least 2 peptides and 10 spectra being identified. We used these proteins to generate a small database that, without considering decoy proteins, contains 4125 proteins. This approach was recommended by the authors of MS-Alignment. We tried Mascot_ET, but it was still running after 2 weeks. Considering that the Mascot is installed in another lab and we are not the only user, we decided to stop the job. Thus we do not show the corresponding results here. Figure 9 shows the number of positive PSMs. Yellow bars correspond to restricted tools, green bars correspond to unrestricted tools considering up to one modified amino acid in each peptide, and blue bars correspond to unrestricted tools
3.3. Experiments with 24 Real Data Sets from Human Tissue
Figure 9. Bar plot shows the number of positive PSMs identified from 24 human data sets. Mascot, Comet, MS-GF+, MS-Alignment, ProteinProspector, MODa, and PIPI were used. “MSA” stands for MS-Alignment and “PP” stands for ProteinProspector. The value at the top of the bar is the number of positive PSMs. Yellow bars correspond to restricted tools, green bars correspond to unrestricted tools considering up to one modified amino acid in each peptide, and blue bars correspond to unrestricted tools considering unlimited number of modified amino acids in each peptide. Under each bar, a single tool’s name indicates that its corresponding q-values are estimated using “the basic target-decoy approach”, and a tool’s name “+ Percolator” indicates that its corresponding values are estimated using “the Percolator approach”.
We used 24 data sets published by Chick et al.34 There are in total 1 309 561 MS/MS spectra whose precursor charges are from 1 to 7, precursor masses are from 600 to 5000 Da, and peak numbers are larger than 10. Because the samples were from HEK293 cells, we used all proteins of Homo sapiens from UniProtKB/Swiss-Prot (20 205 proteins, 2015-11 release) as the database for Mascot, Comet, MS-GF+, MODa, and PIPI. MS-Alignment and ProteinProspector would need years to search these data sets against the whole human database. Thus 4431
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research
Figure 10. Histograms of PTM masses identified by PIPI and Chick et al.,34 respectively. We marked high peaks with the corresponding masses.
Table 6. Average Running Time of Analyzing One Data Seta simulation experiments restricted tools unrestricted tools
Comet MS-GF+ MSA PP MODa PIPI
1
2
3
4
standard protein experiments
human sample experiments
0.004 0.024 0.325 0.300 0.062 0.022
0.004 0.024 36.100 0.300 0.065 0.022
0.005 0.024 36.329 0.300 0.065 0.021
0.004 0.024 36.372 0.300 0.064 0.021
0.018 0.007 0.317 0.224 0.018 0.004
0.326 0.334 56.871* 13.866* 15.400 0.491
The unit is hours. The running time of MS-Alignment and ProteinProspector is marked with an “*” for analyzing human samples. This indicates that they used a custom database that is much smaller than the database used by other tools.
a
sulfonylation (183.03 Da). Please refer to the Supplementary Results in the Supporting Information for details
considering unlimited number of modified amino acids in each peptide. The value at the top of each bar is the number of positive PSMs. Under each bar, a single tool’s name indicates that its corresponding q-values are estimated using “the basic target-decoy approach”, and a tool’s name “+ Percolator” indicates that its corresponding q-values are estimated using “the Percolator approach”. The results show that PIPI identifies more PSMs than Mascot, Comet, MS-GF+, MS-Alignment, and MODa but fewer PSMs than ProteinProspector. This observation is consistent with what we observed in Sections 3.1 and 3.2. Please refer to the Supplementary Results in the Supporting Information for details. We summarized the histograms of PTM masses identified by PIPI and published by Chick et al.,34 respectively (Figure 10). It is easy to see that these two histograms are similar to each other, which implies the correctness of PIPI’s results. Please note that in Chick et al.34 most peptides have up to one modified amino acid, while PIPI allows multiple. We also marked high peaks with their masses. Most of the high peaks’ masses are close to those of common PTM types, such as oxidation (15.99 Da), deamidation (0.98 Da), carbamylation (43.00 Da), formylation (27.99 Da), and aminoethylbenzene-
3.4. Running Time
In this section, we compare the running time of different tools. We ran Mascot in an outside server from that we could not obtain the precise running time. We ran Comet, MS-GF+, MSAlignment, MODa, and PIPI on our computers with i7-6700 CPU (3.40 GHz, eight cores) and 32 GB RAM. Because Comet, MS-GF+, and PIPI support multithread computing, we ran them with eight cores. MS-Alignment and MODa only support single-thread computing. We ran ProteinProspector on the web server provided by its developers.62 As discussed in Section 3.3, we let MS-Alignment and ProteinProspector search against a small database, while the other tools searched against the Homo species’ whole database. Table 6 shows the average running time of analyzing one data set. The unit is hours. PIPI is much faster than all unrestricted tools. It is also comparable to restricted tools. It is worth mentioning that restricted tools’ search space is much smaller than unrestricted tools including PIPI. 4432
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research
4. DISCUSSION We can classify peptide identification methods into two main categories: de novo sequencing36−38 and database search.1−3,6,7 De novo sequencing infers a spectrum’s sequence without using any database. It checks pairs of peaks and labels them if their distances are within the tolerance ranges of the amino acids’ masses. Then, it links the labeled peak pairs into paths and scores them. Finally, it finds a high-scored path and interprets the path into a peptide sequence. Clearly, de novo sequencing requires a high-quality spectrum. Missing peaks and unspecified PTM types are disasters for de novo sequencing. Database search infers a spectrum’s sequence by finding the most similar candidate from a database. After defining a scoring scheme, it compares each experimental spectrum with all possible theoretical spectra. The top-scored candidate is the final result. Clearly, this approach is tolerant to missing peaks and noisy peaks, but unspecified PTM types still cause trouble. PIPI extracts local sequence information by inferring substrings (a.k.a. tags) from a spectrum. The process of extracting tags is similar to de novo sequencing, but the key difference is that PIPI does not infer the whole sequence. Instead PIPI codes all tags into a feature vector and uses the feature vector for identification purposes. This procedure is similar to database search. The subtle difference is that database search compares an experimental spectrum with theoretical spectra, while PIPI compares a vector coded from a spectrum with vectors coded from peptide sequences. The former is sensitive to PTM, while the latter is invariant to PTM. There are tools (e.g., MS-Alignment and MODa) trying to identify peptides without specifying PTM types beforehand. The major difference between these tools and PIPI is that the former performs alignment during the database search, while PIPI performs alignment after the database search. During the database search, MS-Alignment aligns an experimental spectrum against every possible theoretical spectrum, and MODa aligns a spectrum’s tags against every possible peptide sequence. These two tools use their alignment results in their scoring procedures. In contrast, PIPI represents experimental spectra and peptide sequences with coded vectors and uses them to find each spectrum’s 20 peptide sequence candidates. After narrowing down the candidates, PIPI aligns a spectrum’s tags against each peptide sequence and calculates a final score for q-value estimation. There are also differences in the dynamic programming among these three tools. MS-Alignment aligns an experimental spectrum against a theoretical spectrum, while PIPI aligns tags against a peptide sequence. Because there are many peaks in an experimental spectrum, MS-Alignment is much slower than PIPI, as presented in Section 3.4. MODa aligns variant lengths of tags against a peptide sequence, while PIPI aligns length-3 tags against a peptide sequence. There are open questions in unrestricted search. The low accuracy of PTM localization and characterization45−47 is one of them. Another issue is about the criteria of similarity. As we mentioned in Section 3.1.3, we may make any peptide highly similar to a spectrum by adding enough modifications. We will address these issues in the future.
■
■
Simulation Data: “Simulation 1−4” data sets, including spectra file (mgf format) and database (fasta format). (ZIP) Supplementary Results: Detailed results of all experiments in the paper. (ZIP) Parameter and Log Files: Parameters and logs of all experiments in the paper. (ZIP)
AUTHOR INFORMATION
Corresponding Authors
*N.L.: E-mail:
[email protected]. Tel: +852 2358 7335. *W.Y.: E-mail:
[email protected]. Tel: +852 2358 7054. Notes
The authors declare no competing financial interest. The source code and executable file can be found at http:// bioinformatics.ust.hk/pipi.html.
■
ACKNOWLEDGMENTS This work is partially supported by theme-based project T12402/13N from the Research Grant Council (RGC) of the Hong Kong S.A.R. government. We thank Jiaan Dai for his critical questions and suggestions.
■
REFERENCES
(1) Cottrell, J.; London, U. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551−3567. (2) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5, 976−989. (3) Craig, R.; Beavis, R. TANDEM: Matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466−1467. (4) Wang, L.-h.; Li, D.-Q.; Fu, Y.; Wang, H.-P.; Zhang, J.-F.; Yuan, Z.-F.; Sun, R.-X.; Zeng, R.; He, S.-M.; Gao, W. pFind 2.0: A software package for peptide and protein identification via tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2007, 21, 2985−2991. (5) Chalkley, R. J.; Baker, P. R.; Huang, L.; Hansen, K. C.; Allen, N. P.; Rexach, M.; Burlingame, A. L. Comprehensive analysis of a multidimensional liquid chromatography mass spectrometry dataset acquired on a quadrupole selecting, quadrupole collision cell, time-offlight mass spectrometer II. New developments in ProteinProspector allow for reliable and comprehensive automatic analysis of large datasets. Mol. Cell. Proteomics 2005, 4, 1194−1204. (6) McIlwain, S.; Tamura, K.; Kertesz-Farkas, A.; Grant, C. E.; Diament, B.; Frewen, B.; Howbert, J. J.; Hoopmann, M. R.; Kall, L.; Eng, J. K.; MacCoss, M. J.; Noble, W. S. Crux: Rapid open source protein tandem mass spectrometry analysis. J. Proteome Res. 2014, 13, 4488−4491. (7) Eng, J. K.; Jahan, T. A.; Hoopmann, M. R. Comet: An opensource MS/MS sequence database search tool. Proteomics 2013, 13, 22−24. (8) Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 2014, 5, 5277. (9) Mann, M.; Wilm, M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 1994, 66, 4390−4399. (10) Tabb, D. L.; Saraf, A.; Yates, J. R. GutenTag: High-throughput sequence tagging via an empirically derived fragmentation model. Anal. Chem. 2003, 75, 6415−6421. (11) Tanner, S.; Shu, H.; Frank, A.; Wang, L.-C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V. InsPecT: Identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 2005, 77, 4626−4639.
ASSOCIATED CONTENT
S Supporting Information *
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jproteome.6b00485. 4433
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research (12) Frank, A.; Tanner, S.; Bafna, V.; Pevzner, P. Peptide sequence tags for fast database search in mass-spectrometry. J. Proteome Res. 2005, 4, 1287−1295. (13) Cao, X.; Nesvizhskii, A. I. Improved sequence tag generation method for peptide identification in tandem mass spectrometry. J. Proteome Res. 2008, 7, 4422−4434. (14) Kim, S.; Bandeira, N.; Pevzner, P. A. Spectral profiles, a novel representation of tandem mass spectra and their applications for de novo peptide sequencing and identification. Mol. Cell. Proteomics 2009, 8, 1391−1400. (15) Kim, S.; Gupta, N.; Bandeira, N.; Pevzner, P. A. Spectral dictionaries integrating de novo peptide sequencing with database search of tandem mass spectra. Mol. Cell. Proteomics 2009, 8, 53−69. (16) Searle, B. C.; Dasari, S.; Turner, M.; Reddy, A. P.; Choi, D.; Wilmarth, P. A.; McCormack, A. L.; David, L. L.; Nagalla, S. R. Highthroughput identification of proteins and unanticipated sequence modifications using a mass-based alignment algorithm for MS/MS de novo sequencing results. Anal. Chem. 2004, 76, 2220−2230. (17) Tsur, D.; Tanner, S.; Zandi, E.; Bafna, V.; Pevzner, P. A. Identification of post-translational modifications by blind search of mass spectra. Nat. Biotechnol. 2005, 23, 1562−1567. (18) Han, Y.; Ma, B.; Zhang, K. SPIDER: Software for protein identification from sequence tags with de novo sequencing error. J. Bioinf. Comput. Biol. 2005, 3, 697−716. (19) Hansen, B. T.; Davey, S. W.; Ham, A.-J. L.; Liebler, D. C. PMod: An algorithm and software to map modifications to peptide sequences using tandem MS data. J. Proteome Res. 2005, 4, 358−368. (20) Searle, B. C.; Dasari, S.; Wilmarth, P. A.; Turner, M.; Reddy, A. P.; David, L. L.; Nagalla, S. R. Identification of protein modifications using MS/MS de novo sequencing and the OpenSea alignment algorithm. J. Proteome Res. 2005, 4, 546−554. (21) Savitski, M. M.; Nielsen, M. L.; Zubarev, R. A. ModifiComb, a new proteomic tool for mapping substoichiometric post-translational modifications, finding novel types of modifications, and fingerprinting complex protein mixtures. Mol. Cell. Proteomics 2006, 5, 935−948. (22) Liu, C.; Yan, B.; Song, Y.; Xu, Y.; Cai, L. Peptide sequence tagbased blind identification of post-translational modifications with point process model. Bioinformatics 2006, 22, e307−e313. (23) Bandeira, N.; Tsur, D.; Frank, A.; Pevzner, P. A. Protein identification by spectral networks analysis. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 6140−6145. (24) Havilio, M.; Wool, A. Large-scale unrestricted identification of post-translation modifications using tandem mass spectrometry. Anal. Chem. 2007, 79, 1362−1368. (25) Baumgartner, C.; Rejtar, T.; Kullolli, M.; Akella, L. M.; Karger, B. L. SeMoP: A new computational strategy for the unrestricted search for modified peptides using LC- MS/MS data. J. Proteome Res. 2008, 7, 4199−4208. (26) Falkner, J. A.; Falkner, J. W.; Yocum, A. K.; Andrews, P. C. A spectral clustering approach to MS/MS identification of posttranslational modifications. J. Proteome Res. 2008, 7, 4614−4622. (27) Chalkley, R. J.; Baker, P. R.; Medzihradszky, K. F.; Lynn, A. J.; Burlingame, A. In-depth analysis of tandem mass spectrometry data from disparate instrument types. Mol. Cell. Proteomics 2008, 7, 2386− 2398. (28) Na, S.; Jeong, J.; Park, H.; Lee, K.-J.; Paek, E. Unrestrictive identification of multiple post-translational modifications from tandem mass spectrometry using an error-tolerant algorithm based on an extended sequence tag approach. Mol. Cell. Proteomics 2008, 7, 2452− 2463. (29) Chen, Y.; Chen, W.; Cobb, M. H.; Zhao, Y. PTMap-A sequence alignment software for unrestricted, accurate, and full-spectrum identification of post-translational modification sites. Proc. Natl. Acad. Sci. U. S. A. 2009, 106, 761−766. (30) Ahrné, E.; Nikitin, F.; Lisacek, F.; Müller, M. QuickMod: A tool for open modification spectrum library searches. J. Proteome Res. 2011, 10, 2913−2921.
(31) Na, S.; Bandeira, N.; Paek, E. Fast multi-blind modification search through tandem mass spectrometry. Mol. Cell. Proteomics 2012, 11, M111.010199. (32) Huang, X.; Huang, L.; Peng, H.; Guru, A.; Xue, W.; Hong, S. Y.; Liu, M.; Sharma, S.; Fu, K.; Caprez, A. P.; Swanson, D. R.; Zhang, Z.; Ding, S.-J. ISPTM: An iterative search algorithm for systematic identification of post-translational modifications from complex proteome mixtures. J. Proteome Res. 2013, 12, 3831−3842. (33) Kertész-Farkas, A.; Reiz, B.; Vera, R.; Myers, M. P.; Pongor, S. PTMTreeSearch: A novel two-stage tree-search algorithm with pruning rules for the identification of post-translational modification of proteins in MS/MS spectra. Bioinformatics 2014, 30, 234−241. (34) Chick, J. M.; Kolippakkam, D.; Nusinow, D. P.; Zhai, B.; Rad, R.; Huttlin, E. L.; Gygi, S. P. A mass-tolerant database search identifies a large proportion of unassigned spectra in shotgun proteomics as modified peptides. Nat. Biotechnol. 2015, 33, 743−749. (35) Shortreed, M. R.; Wenger, C. D.; Frey, B. L.; Sheynkman, G. M.; Scalf, M.; Keller, M. P.; Attie, A. D.; Smith, L. M. Global Identification of Protein Post-translational Modifications in a Single-Pass Database Search. J. Proteome Res. 2015, 14, 4714−4720. (36) Dancik, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. A. De novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 1999, 6, 327−342. (37) Chen, T.; Kao, M.-Y.; Tepel, M.; Rush, J.; Church, G. M. A dynamic programming approach to de novo peptide sequencing via tandem mass spectrometry. J. Comput. Biol. 2001, 8, 325−337. (38) Frank, A.; Pevzner, P. PepNovo: De novo peptide sequencing via probabilistic network modeling. Anal. Chem. 2005, 77, 964−973. (39) Mascot Database Search: Error Tolerant Search. http://www. matrixscience.com/help/error_tolerant_help.html (accessed: 2016-0930). (40) Shilov, I. V.; Seymour, S. L.; Patel, A. A.; Loboda, A.; Tang, W. W. H.; Keating, S. P.; Hunter, C. L.; Nuwaysir, L. M.; Schaeffer, D. A. The Paragon Algorithm, a next generation search engine that uses sequence temperature values sequence temperature values and feature probabilities to identify peptides from tandem mass spectra. Mol. Cell. Proteomics 2007, 6, 1638−1655. (41) Han, X.; He, L.; Xin, L.; Shan, B.; Ma, B. PeaksPTM: mass spectrometry-based identification of peptides with unspecified modifications. J. Proteome Res. 2011, 10, 2930−2936. (42) Beausoleil, S. A.; Villén, J.; Gerber, S. A.; Rush, J.; Gygi, S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24, 1285− 1292. (43) Cox, J.; Neuhauser, N.; Michalski, A.; Scheltema, R. A.; Olsen, J. V.; Mann, M. Andromeda: A peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 2011, 10, 1794−1805. (44) Baker, P. R.; Trinidad, J. C.; Chalkley, R. J. Modification site localization scoring integrated into a search engine. Mol. Cell. Proteomics 2011, 10, M111.008078. (45) Savitski, M. M.; Lemeer, S.; Boesche, M.; Lang, M.; Mathieson, T.; Bantscheff, M.; Kuster, B. Confident phosphorylation site localization using the Mascot Delta Score. Mol. Cell. Proteomics 2011, 10, M110.003830. (46) Chalkley, R. J.; Clauser, K. R. Modification site localization scoring: Strategies and performance. Mol. Cell. Proteomics 2012, 11, 3− 14. (47) Fermin, D.; Walmsley, S. J.; Gingras, A.-C.; Choi, H.; Nesvizhskii, A. I. LuciPHOr: Algorithm for phosphorylation site localization with false localization rate estimation using modified target-decoy approach. Mol. Cell. Proteomics 2013, 12, 3409−3419. (48) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. Automated reduction and interpretation of high resolution electrospray mass spectra of large molecules. J. Am. Soc. Mass Spectrom. 2000, 11, 320− 332. (49) Davis, J.; Goadrich, M. The Relationship between PrecisionRecall and ROC Curves. Proceedings of the 23rd International Conference on Machine Learning, 2006; pp 233−240. 4434
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435
Article
Journal of Proteome Research (50) UNIMOD Protein Modifications for Mass Spectrometry. http://www.unimod.org (accessed 2016-03-30). (51) Lynn, A. J.; Chalkley, R. J.; Baker, P. R.; Segal, M. R.; Burlingame, A. L. Protein Prospector and Ways of Calculating Expectation Values. Proceedings of the 54th ASMS Conference on Mass Spectrometry, Seattle, WA, 2006; p 351. (52) Kall, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 2007, 4, 923−925. (53) Choi, H.; Nesvizhskii, A. I. False discovery rates and related statistical concepts in mass spectrometry-based proteomics. J. Proteome Res. 2008, 7, 47−50. (54) Kall, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Posterior error probabilities and false discovery rates: Two sides of the same coin. J. Proteome Res. 2008, 7, 40−44. (55) Keich, U.; Kertesz-Farkas, A.; Noble, W. S. Improved False Discovery Rate Estimation Procedure for Shotgun Proteomics. J. Proteome Res. 2015, 14, 3148−3161. (56) Nesvizhskii, A. I. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J. Proteomics 2010, 73, 2092−2123. (57) Elias, J.; Gygi, S. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207−214. (58) Klimek, J.; Eddes, J. S.; Hohmann, L.; Jackson, J.; Peterson, A.; Letarte, S.; Gafken, P. R.; Katz, J. E.; Mallick, P.; Lee, H.; Schmidt, A.; Ossola, R.; Eng, J. K.; Aebersold, R.; Martin, D. B. The standard protein mix database: A diverse data set to assist in the production of improved peptide and protein identification software tools. J. Proteome Res. 2008, 7, 96−103. (59) Eng, J. K.; Hoopmann, M. R.; Jahan, T. A.; Egertson, J. D.; Noble, W. S.; MacCoss, M. J. A Deeper Look into CometImplementation and Features. J. Am. Soc. Mass Spectrom. 2015, 26, 1865−1874. (60) Wang, G.; Wu, W. W.; Zhang, Z.; Masilamani, S.; Shen, R.-F. Decoy methods for assessing false positives and false discovery rates in shotgun proteomics. Anal. Chem. 2009, 81, 146−159. (61) cRAP Protein Sequences. http://www.thegpm.org/crap/ (accessed: 2016-03-22). (62) ProteinProspector. http://prospector.ucsf.edu/prospector/ mshome.htm (accessed: 2016-04-30).
4435
DOI: 10.1021/acs.jproteome.6b00485 J. Proteome Res. 2016, 15, 4423−4435