QuickMod: A Tool for Open Modification Spectrum Library Searches

Apr 18, 2011 - Synopsis. The QuickMod MS2 data analysis tool is designed to identify modified variants of peptides listed in a spectral library. The s...
ARTICLE pubs.acs.org/jpr

Erik Ahrne,* Frederic Nikitin, Frederique Lisacek, and Markus Müller*
Swiss Institute of Bioinformatics, Proteome Informatics Group, Geneva, Switzerland

Supporting Information

ABSTRACT: MS2 library spectra are rich in reproducible information about peptide fragmentation patterns compared to theoretical spectra modeled by a sequence search tool. So far, spectrum library searches are mostly applied to detect peptides as they are present in the library. However, they also make it possible to find modified variants of the library peptides if the search is done with a large precursor mass window and an adapted Spectrum-Spectrum Match (SSM) scoring algorithm. We perform a thorough evaluation of the use of library spectra as opposed to theoretical peptide spectra for the identification of PTMs, analyzing spectra of a well-annotated, modification-rich test data set compiled from public data repositories. These initial studies motivate the development of our modification-tolerant spectrum library search tool QuickMod, designed to identify modified variants of the peptides listed in the spectrum library without any prior input from the user estimating the modifications present in the sample. We built the search algorithm of QuickMod after carefully testing different SSM similarity scores. The final spectrum scoring scheme uses a support vector machine (SVM) on a selection of scoring features to classify correct and incorrect SSMs. After identification of a list of modified peptides at a given False Discovery Rate (FDR), the modifications need to be positioned on the peptide sequence. We present a rapid modification site assignment algorithm and evaluate its positioning accuracy. Finally, we demonstrate that QuickMod performs favorably in terms of speed and identification rate when compared to other software solutions for PTM analysis.

KEYWORDS: proteomics, MS2, peptide identification, spectral library search, PTM, spectrum match score

Received: February 22, 2011
Published: April 18, 2011
© 2011 American Chemical Society
Journal of Proteome Research
dx.doi.org/10.1021/pr200152g | J. Proteome Res. 2011, 10, 2913–2921

INTRODUCTION

Proteins undergo post-translational modification (PTM), which modulates their structure and function, and the identification of protein modifications is of paramount importance for understanding the regulation and dynamics of a proteome. There is tremendous interest in a variety of biologically significant PTMs such as phosphorylation, acetylation, glycosylation, and methylation. Mass spectrometry induced peptide fragmentation (MS2) is a central technology for protein characterization [1], and numerous targeted PTM studies have shown promising results. For instance, protein phosphorylation, which plays a major role in signaling networks, was extensively mapped in large-scale MS studies [2–4]. Similarly, the role of glycosylation as a functional modulation of secreted or membrane proteins has been investigated using MS2 [5]. However, exhaustive identification of protein modifications is challenging for a number of reasons, including the large number of possible PTMs, substoichiometric amounts of modified proteins, and the fact that peptides carrying certain PTMs display MS2 fragmentation patterns that can be difficult to interpret. Therefore, successful protein modification studies rely on carefully designed experimental setups applying extensive sample fractionation/enrichment protocols for the detection of low-abundance protein species. MS data need to be accurate and contain information-rich fragmentation patterns. Furthermore, sensitive MS2 data analysis tools capable of exploring large data volumes for modifications in a reasonable computational time need to be available to the proteomics community.

For MS2-based identification of modified peptides with classical sequence search tools such as Sequest [6], Mascot [7], X!Tandem [8], or Phenyx [9], known but unpredictable modifications have to be configured as variable modifications; that is, for each potentially modified amino acid, the modified and unmodified forms are accounted for. Considering all potential modifications can lead to longer search times as well as an increased overlap between the Peptide Spectrum Match (PSM) score distributions of incorrect and correct matches. Open Modification Search tools (OMS, also referred to as blind search tools) are designed to mitigate this problem and screen the query spectra for peptide modifications in a more or less unsupervised manner, allowing for all known modifications listed in a database or all possible modification masses up to a user-defined maximum value. Such tools typically deal with the large search space problem by dividing the data analysis into two steps. In a first step the initial database is reduced to a list of proteins or peptides likely to be present in the sample. This can be done in two ways. One way is to perform an initial sequence search where no or a very limited number of variable modifications are specified. On the basis of the valid identifications returned from this search, a small protein or peptide database is compiled. Next, this smaller database is exhaustively screened for modifications. Another database reduction approach is based on spectrum sequence tag extraction of peptide subsequences of 3–4 amino acids, used to narrow down the list of candidate peptides for each spectrum. The InsPecT platform [10,11] has been used to analyze very large CID data sets in OMS mode.

Several groups have already combined OMS and spectral libraries [12–14]. Ahrne et al. first searched spectra against a sequence database and compiled a spectrum library from the confidently identified PSMs. Subsequent OMS of this library using SpectraST [15] improved identification rates, as additional spectra of both unmodified and modified peptides were matched to entries of the spectral library.

In this paper, we present QuickMod, a spectrum library search tool especially adapted to identify peptides bearing one modification. First, we show that library spectra of unmodified peptides are better reference spectra than theoretical peptide spectra when trying to identify query spectra of modified peptides. Second, we investigate how SSMs can be efficiently scored in OMS mode and present a rapid algorithm for positioning a modification on the peptide sequence. Third, we apply QuickMod to a set of spectra from a human plasma sample and compare it to InsPecT and SpectraST in terms of identification rate and speed.

Table 1. Theoretical Spectra versus Library Spectra

For each modification type, "Theo" is the number of PSMs against theoretical spectra (PSM Theo, footnote a) and "SL" is the number of PSMs against the spectrum library (PSM SL, footnote b).

| test data, ion types | oxidation Theo | oxidation SL | phosphorylation Theo | phosphorylation SL | carbamidomethyl Theo | carbamidomethyl SL | acetylation Theo | acetylation SL | pyro-glu Theo | pyro-glu SL | total Theo | total SL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| z = 2: b,y +1 | 181 | 262 | 141 | 216 | 263 | 317 | 197 | 240 | 244 | 342 | 1026 | 1377 |
| z = 2: b,y +2 | 142 | 263 | 106 | 247 | 255 | 332 | 174 | 234 | 231 | 345 | 908 | 1421 |
| z = 2: a,b,y +1 | 109 | 267 | 71 | 242 | 185 | 329 | 95 | 250 | 156 | 351 | 616 | 1439 |
| z = 2: a,b,y +2 | 44 | 257 | 16 | 238 | 85 | 328 | 38 | 219 | 70 | 341 | 253 | 1383 |
| z = 3: b,y +1 | 104 | 276 | 74 | 262 | 165 | 292 | 87 | 270 | 119 | 321 | 549 | 1421 |
| z = 3: b,y +2 | 78 | 253 | 60 | 239 | 205 | 290 | 103 | 232 | 151 | 305 | 597 | 1319 |
| z = 3: a,b,y +1 | 60 | 275 | 34 | 256 | 96 | 295 | 43 | 260 | 79 | 324 | 312 | 1410 |
| z = 3: a,b,y +2 | 65 | 255 | 43 | 240 | 173 | 290 | 83 | 232 | 130 | 302 | 494 | 1319 |

a Valid PSMs at FDR 0.05 when searching a database containing full theoretical spectra of each peptide, including the ion types specified in column 1. +2 means that both singly and doubly charged fragment ions are considered.
b Valid identifications at FDR 0.05 when searching the library database with the same number of peaks per spectrum. +2 means that both singly and doubly charged fragment ions are considered.

METHODS

Data Sets

We compiled two test data sets containing ion trap CID spectra of modified peptides (only one modification per peptide, all distinct peptides). A total of 2500 spectra of doubly charged (Mod_z2) and 2500 spectra of triply charged (Mod_z3) precursor ions were extracted from spectral libraries publicly available at NIST (http://peptide.nist.gov/) and ISB (http://www.peptideatlas.org) [16]. Each test data set included equal numbers of five common protein modifications: oxidation of methionine; phosphorylation of serine, threonine, or tyrosine; carbamidomethylation of cysteine; N-terminal acetylation; and pyro-glu on N-terminal glutamic acid or glutamine. We only extracted modified peptide spectra where the nonmodified peptide spectrum of the same charge state was present in the same library. The nonmodified peptide spectra were used to build a spectrum library of doubly charged precursor spectra (SL_z2, 2500 entries) and a second spectrum library of triply charged precursor spectra (SL_z3, 2500 entries). For each peptide in SL_z2 and SL_z3 we compiled databases of theoretical spectra (Theo_z2 and Theo_z3), where each database contained different combinations of ion types (see Table 1).

An additional data set (Exp_HP) analyzed in this study consists of 55 640 spectra (precursor charge states 2+ and 3+) from 15 different OGE fractions of a human blood plasma sample, produced on an Orbitrap in ion trap CID fragmentation mode. The Lib_HP spectral library consists of 4168 PSMs confidently identified from X!Tandem search results employing an FDR cutoff of 0.01, where Cys-CAM was specified as a fixed modification in the search parameters.

QuickMod Scoring Algorithm

Several spectrum library search tools have been developed for targeted analysis of MS2 data sets [15–18]. Even though the spectrum of a nonmodified peptide and the spectrum of a modified variant of the same peptide often share important similarities, these spectrum library search tools have not been designed to match such spectral pairs. QuickMod is a modification-tolerant spectrum library search tool, where the query data are explored for both unmodified and modified variants of the peptide entries listed in the spectrum library. Instead of matching the query spectra against theoretical spectra created in accordance with some simple peptide fragmentation rules, QuickMod makes use of the experimentally observed fragmentation pattern of the unmodified peptide spectrum when attempting to identify modified variants of this peptide.

A number of scoring algorithms have been developed to determine the similarity between an experimental spectrum and a theoretical spectrum. In its simplest form when analyzing CID MS2 data, the theoretical spectrum contains the calculated b- and y-ion fragments and the similarity score is based on the shared peak count between the compared spectra. In contrast, many scoring schemes include a multitude of fragment types, including a-ions, immonium ions, and neutral loss ions, and extract several features in addition to the shared peak count, such as the ratio of experimental to theoretical b- and y-ions, the length of continuous ion series, or matching peak intensity. A discriminant score based on the combined measure of these features is then derived.

Spectrum library search tools typically score SSMs taking into account the intensity of the matching spectrum peaks. The normalized dot product score (DPS), that is, the scalar product of normalized spectral vectors, is commonly used for this purpose. SpectraST combines it with a delta score measuring the difference between the highest DPS of a spectrum and its runner-up, and a term that prevents the final score from being dominated by a few high-intensity peaks. The SpectraST score is also made more robust with regard to random matches of high-intensity fragments by transforming the raw intensities to their square root values. Bonanza [13] and pMatch [14] are spectrum library search tools designed for OMS, which assume that the precursor mass difference $\Delta M$ can be explained by one modified amino acid. Based on this assumption query and library spectra are matched, considering that a query peak and a library peak of the same ion type may be separated by the mass $\Delta M$. Bonanza scores the match with a DPS. pMatch combines the DPS with an intensity-independent binomial score, which calculates the probability that the matching peaks occur by chance. In order to control the false discovery rate (FDR) in the final list of SSMs, SpectraST creates decoy spectra by randomly shuffling the peptide sequence and shifting all annotated peaks in accordance with the shuffled sequence [19].

QuickMod data analysis starts off with a spectrum preprocessing step. In this study, only peaks that are the highest within a ±5 m/z window around the peak's m/z value are retained.
Then, for spectra of doubly (triply) charged precursors only the 40 (60) most intense peaks are kept. After this filtering step, the intensities of the peaks are divided by the maximal intensity in the spectrum. All preprocessing parameters are user defined and can be extended to different charge states.

QuickMod matches each query spectrum against each library spectrum with a precursor mass difference $\Delta M$ within a user-defined range (usually ±200 Da). The matching algorithm accounts for the peak annotation of each library peak. For a peak with several possible annotations only the first one is considered, where the ordering is provided by the spectrum library annotation. A library peak of m/z value $m_L$ matches a query peak of m/z value $m_Q$ if $|m_L - m_Q| < \delta$ or $|m_L - m_Q - \Delta M / Z_F| < \delta$, where $\delta$ is the fragment m/z tolerance and $Z_F$ is the charge of the library peak. If a peak in the library spectrum is not annotated, the shifts corresponding to all possible charge states $Z_F$ that are compatible with the m/z value $m_F$ of the fragment ion (i.e., $\{ Z_F \mid Z_F \in \{1, \dots, Z_P\},\ m_F < M_P / Z_F \}$, where $Z_P$ and $M_P$ are the precursor charge and neutral mass) are taken into account. In cases where a library peak matches several query peaks, the query peak with the closest normalized intensity is matched.

The SSM scoring algorithm of QuickMod was developed after careful evaluation of a list of features that reflect various properties of the similarity between an experimental spectrum and a library spectrum, such as the intensity of the matched peaks, the number of matched peaks, and the peptide sequence coverage. To limit the computational time required to evaluate the match between a query spectrum and a library spectrum, we wanted to reduce the list of scoring features included in the QuickMod scoring scheme to those that are expected to contribute significantly to the number of identified peptides. This feature selection was carried out using the annotated modification-rich test data sets described in the Data Sets section. The features were ranked by their partial area under the ROC curve (AUC). The n highest ranked features were combined in a support vector machine (SVM) [20], for n going from 1 to the number of features. For every number n of features the SVM was evaluated, and the smallest value of n with a performance comparable to the best performance was used for feature selection (see below).

Next we describe all similarity features taken into account. First we calculate the DPS between the experimental spectrum and the aligned library spectrum using two types of fragment ion peak intensity transformations: a simple square root transformation (dpSqrt) and a rank transformation (dpRank), where the intensity of each peak is replaced by its intensity rank, ordered from lowest to highest intensity. For both square root and rank transformations the transformed peak intensities are divided by the transformed intensity of the most intense peak. Next we calculate the DPS of the query and library spectra neglecting shifted peaks (dpSqrtNoAlign and dpRankNoAlign). Then, the hyperProb feature neglects peak intensities and reflects the hypergeometric probability that the fragment peak matches between two spectra occur by chance. The hyperProbNoAlign feature is similar, but without allowing modification shifts in fragment m/z values. Finally, we add the rate of matched b- and y-ions (byIonCoverage) as well as a score giving the difference between the best and worst modification site scores (PosScore).

As mentioned above, spectrum library search tools give important weight to the peak intensities of matching and nonmatching peaks when scoring SSMs.
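The preprocessing, shifted peak matching, and rank-transformed dot product described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the QuickMod Java implementation: the function names, the tuple layout of the peaks, the fragment tolerance of 0.5 m/z, and the sign of `delta_m` (taken literally from the matching rule, so that a modification-carrying fragment in the query sits at m_L - ΔM/Z_F) are all choices made for the example.

```python
from math import sqrt


def preprocess(peaks, window=5.0, top_n=40):
    """Keep local maxima within +/- window m/z, then the top_n most intense
    peaks, and scale intensities by the maximum. peaks: list of (mz, intensity)."""
    local_max = [
        (mz, inten)
        for mz, inten in peaks
        if all(inten >= other for omz, other in peaks if abs(omz - mz) <= window)
    ]
    kept = sorted(local_max, key=lambda p: p[1], reverse=True)[:top_n]
    if not kept:
        return []
    top = kept[0][1]
    return sorted((mz, inten / top) for mz, inten in kept)


def compatible_charges(m_f, precursor_mass, precursor_charge):
    """Charge states tried for a non-annotated library peak:
    Z_F in 1..Z_P with m_F < M_P / Z_F."""
    return [z for z in range(1, precursor_charge + 1) if m_f < precursor_mass / z]


def match_peaks(library, query, delta_m, tol=0.5):
    """library: list of (mz, intensity, charge); query: list of (mz, intensity).
    Returns (library_index, query_index, shifted) for each matched library peak."""
    matches = []
    for i, (m_l, i_l, z_f) in enumerate(library):
        candidates = []
        for j, (m_q, i_q) in enumerate(query):
            direct = abs(m_l - m_q) < tol
            shifted = abs(m_l - m_q - delta_m / z_f) < tol
            if direct or shifted:
                candidates.append((abs(i_l - i_q), j, shifted and not direct))
        if candidates:
            # Several query peaks may fit: keep the closest normalized intensity.
            _, j, was_shifted = min(candidates)
            matches.append((i, j, was_shifted))
    return matches


def rank_transform(intensities):
    """Replace each intensity by its rank (lowest = 1), divided by the top rank."""
    order = sorted(range(len(intensities)), key=lambda k: intensities[k])
    ranks = [0.0] * len(intensities)
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank / len(intensities)
    return ranks


def dp_rank(library, query, matches):
    """Normalized dot product of rank-transformed intensities over matched peaks."""
    lib_r = rank_transform([p[1] for p in library])
    qry_r = rank_transform([p[1] for p in query])
    dot = sum(lib_r[i] * qry_r[j] for i, j, _ in matches)
    norm = sqrt(sum(x * x for x in lib_r)) * sqrt(sum(x * x for x in qry_r))
    return dot / norm if norm else 0.0


# Toy spectra (hypothetical values): two of the three fragments carry the shift.
library = [(100.0, 0.2, 1), (200.0, 1.0, 1), (300.0, 0.5, 2)]
query = [(100.0, 0.3), (216.0, 1.0), (308.0, 0.4)]
delta_m = 1000.0 - 1016.0  # library minus query precursor mass (assumed convention)
matches = match_peaks(library, query, delta_m)
score = dp_rank(library, query, matches)
```

The sketch scans all query peaks for every library peak, which is quadratic in the number of peaks; with at most 40 or 60 peaks per spectrum after preprocessing this stays cheap, although a real implementation would likely use sorted peak lists.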
The presence of a PTM on the peptide may, however, affect its fragmentation propensities, changing the spectrum intensity envelope of the modified compared to the nonmodified peptide spectrum. With this in mind, we studied the difference in intensity distribution between modified and nonmodified peptide spectra of the five distinct modification types included in our test data set (Mod_z2) and evaluated the discriminatory power of the final QuickMod scoring scheme for each modification type.

Modification Positioning

QuickMod uses a simple and fast algorithm to position a single modification on the peptide sequence. In an OMS all peptide residues are considered as possible modification anchoring positions. Sequence ions of all charge states matched between a query spectrum and a library spectrum (matchedAnnotatedIonList) are used as evidence to determine the optimal modification site. We exclude sequence ions with neutral losses or higher isotopes from matchedAnnotatedIonList, since their annotation is often erroneous. Iterating over all ions in matchedAnnotatedIonList, the algorithm builds a modification site evidence histogram, where each histogram bin corresponds to the position of a residue in the peptide sequence. If a fragment ion in matchedAnnotatedIonList matches a query peak when shifted by $\Delta M / Z_F$ m/z units (see definition in previous section), this fragment ion is assumed to carry the modification and all histogram bins corresponding to residues covered by this fragment ion are incremented by 1. For example, if a b3 ion matches a query peak when shifted by $\Delta M / Z_F$ m/z units, the counts of histogram bins 1 through 3 are incremented. Similarly, if a y3 ion is matched when shifted, the counts of histogram bins n-2 through n are incremented, where n corresponds to the peptide length. Whenever a nonshifted library peak, listed in matchedAnnotatedIonList, matches a query peak all histogram bins covered by its annotation are incremented by 1 (more details can be found in Scheme 1). The final count at bin i reflects the evidence that a modification sits at position i. Often, several sites give a similar count and it is difficult to predict the right position, for example when crucial fragments are missing, when there is more than one modification, or when the spectrum contains a mixture of different modification positions. In any case, we export the evidence histogram in the result file. We also provide the possibility to include knowledge from the UNIMOD database (www.unimod.org) [21]. All modifications matching the precursor mass shift are extracted and the modification site evidence histogram is reduced to those amino acids that are compatible with the UNIMOD annotation.

Scheme 1. Algorithm 1: Modification Positioning

QuickMod Workflow

We have integrated the QuickMod search tool in an identification workflow where a spectrum library is built from the identification results of an initial sequence search with one or multiple sequence search tools [12]. Our in-house developed library creation tool is compatible with pepxml search result files accompanied by the corresponding peak-list file in .mgf, .mzxml, or .mzml format and builds a spectrum library of consensus spectra from all PSMs meeting a user-defined similarity score or FDR criterion. To allow for an accurate estimation of the FDR associated with the output of the subsequent spectral library search, a decoy version of the spectral library is created using the DeLiberator software (Ahrne et al., submitted). DeLiberator works similarly to the SpectraST decoy library creation approach, but differs in the handling of nonannotated peaks and the shuffling of peptide sequences. Next, the fraction of the experimental data that the sequence search tool failed to identify is extracted. Under the assumption that part of these spectra can be explained by substoichiometric peptide modifications not considered in the sequence search, these spectra are reanalyzed in an open modification spectral library search using QuickMod. Note that QuickMod can also be used to screen experimental data sets against publicly available spectrum libraries in the .sptxt or .msp format.

Availability

QuickMod was developed in Java based on an in-house Java proteomics class library. It is available under an open source license and can be downloaded from http://javaprotlib.sourceforge.net/packages/tools/quickmod/index.html, and an online beta version is available at http://www.expasy.org/cgi-bin/quickmod/form.cgi. The test data sets are available upon request.

RESULTS

Identifying Peptide Modifications in a Database of Theoretical Spectra and a Spectrum Library

Before we started the development of QuickMod, we investigated the validity of a modification-tolerant spectrum library search approach by screening two modification-rich test data sets (Mod_z2 and Mod_z3) against two types of reference spectrum databases of unmodified peptides: a spectrum library (SL_z2 and SL_z3) and multiple variants of the theoretical spectrum databases (Theo_z2 and Theo_z3; different peptide fragmentation models were used to generate the theoretical spectrum databases, see Table 1). To make a fair comparison, library spectra and theoretical spectra of a given peptide contained the same number of peaks. The library spectra were filtered to include only the N most intense peaks, where N corresponds to the number of peaks in the corresponding theoretical spectrum. All searches were performed in OMS mode, where the maximal precursor mass difference between a query and database spectrum was set to 200 Da. A simple scoring scheme based on the shared peak count, calculating the hypergeometric probability of matching a given number of peaks by chance [22], was applied to evaluate the match between a query and a library spectrum as well as between a query and a theoretical spectrum. Note that the intensities of the matching peaks were not considered in the similarity score. Each database was complemented with a decoy database (peptide sequences were reversed while leaving the C-terminal amino acid in place) to estimate the FDR associated with each match score listed in the results output.

Table 1 shows the number of valid identifications per modification type (FDR cutoff 0.05) when screening different theoretical databases (applying different peptide fragmentation criteria) and spectral libraries. For any variant of the theoretical database, substantially more modified peptides were identified when screening a spectrum library containing the same number of peaks.
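The three ingredients of this comparison (reversed decoy sequences, a hypergeometric shared peak score, and a decoy-based FDR cutoff) can be sketched in a few lines of Python. The sketch is illustrative only: the function names are hypothetical, the binning of the m/z range into `n_bins` tolerance-wide bins is an assumed way of setting up the hypergeometric model (the exact formulation is the one of ref 22), and estimating the FDR as the ratio of decoy to target matches above a score is likewise an assumption.

```python
from math import comb


def reversed_decoy(peptide):
    """Reverse the sequence while leaving the C-terminal residue in place."""
    return peptide[:-1][::-1] + peptide[-1]


def shared_peak_pvalue(n_bins, n_ref, n_query, n_shared):
    """Hypergeometric tail probability of seeing at least n_shared shared peaks
    by chance, when n_ref of n_bins m/z bins are occupied by reference peaks
    and n_query query peaks fall into distinct bins at random."""
    upper = min(n_ref, n_query)
    favorable = sum(
        comb(n_ref, k) * comb(n_bins - n_ref, n_query - k)
        for k in range(n_shared, upper + 1)
    )
    return favorable / comb(n_bins, n_query)


def score_cutoff(target_scores, decoy_scores, max_fdr=0.05):
    """Lowest target score t such that (#decoys >= t) / (#targets >= t) <= max_fdr.
    Higher scores are assumed to be better; returns None if no cutoff qualifies."""
    cutoff = None
    for t in sorted(set(target_scores), reverse=True):
        n_target = sum(s >= t for s in target_scores)
        n_decoy = sum(s >= t for s in decoy_scores)
        if n_decoy / n_target <= max_fdr:
            cutoff = t
    return cutoff


decoy = reversed_decoy("PEPTIDEK")
p_all_three = shared_peak_pvalue(n_bins=10, n_ref=3, n_query=3, n_shared=3)
cutoff = score_cutoff([10, 9, 8, 7, 6, 5, 4, 3, 2, 1], [2.5, 1.5])
```

Since a smaller chance probability indicates a better match, a score such as the negative logarithm of `shared_peak_pvalue` would be passed to `score_cutoff`.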
These results indicate that the selection of peaks contained in a library spectrum allows for more discriminative SSM scoring of modified peptide spectra than the full set of possible sequence ions of a given ion type included in the theoretical spectra. Note that the difference between the results obtained with the two types of reference spectra is especially large when analyzing triply charged spectra. We consequently conclude that it is meaningful to make use of the fragmentation pattern of the unmodified peptide when trying to identify modified variants of this peptide.

QuickMod Scoring, Feature Selection, Training and Testing

Table 2 displays the performance of the initial list of scoring features considered when searching the modified spectra in Mod_z2 and Mod_z3 against their corresponding spectral libraries (SL_z2 and SL_z3). The partial AUC, which is the sensitivity of a feature integrated over a specificity range of 0.8–1.0, and the number of identifications at an FDR of 0.05 reflect the individual discriminatory power of each score. After the scoring features were ordered by decreasing partial AUC, we evaluated the discriminatory power of an SVM with a linear kernel while iteratively including one more scoring feature at a time. The scoring model was trained on a random selection of the spectra containing half the data and tested on the other half, and this process was repeated 10 times. Figure 1 shows that for both the triply and doubly charged data sets, combining more than three features (the same three features for both doubly and triply charged SSMs) does not lead to a significant increase in the number of identified spectra. For three features the number of confidently identified spectra falls within one standard deviation of the number of identified spectra when combining 7 or more features. Further, we compared the SVM with a linear kernel to the radial and polynomial kernels. Supplemental Figure S1 (Supporting Information) shows that all three kernel methods give similar results. Therefore, the final scoring algorithm of QuickMod is based on the linear combination of the byIonCoverage, dpRank, and hyperProb scores.

For the identification of nonmodified peptides, QuickMod employs the same feature selection with a scoring function trained on nonmodified peptide SSMs. The QuickMod software tool includes the option to retrain the SVM weights. This can be done on annotated reference data sets as well as on not yet identified experimental data. In the latter case decoy library spectra serve as negative examples, while the highest scoring target library spectra form the positive examples.

Table 2. Scoring Features

| score feature | partial AUC (a), test data z = 2+ | partial AUC (a), test data z = 3+ | number of SSM (b), test data z = 2+ | number of SSM (b), test data z = 3+ |
|---|---|---|---|---|
| dpSqrt | 0.85 | 0.89 | 784 | 1093 |
| dpSqrtNoAlign | 0.81 | 0.85 | 179 | 238 |
| dpRank | 0.92 | 0.95 | 1645 | 1901 |
| dpRankNoAlign | 0.86 | 0.87 | 1067 | 1021 |
| hyperProb | 0.9 | 0.91 | 1388 | 1429 |
| hyperProbNoAlign | 0.89 | 0.9 | 1233 | 1501 |
| byIonCoverage | 0.96 | 0.93 | 1565 | 1138 |
| PosScore | 0.85 | 0.77 | 344 | 277 |

a Specificity interval 0.8–1.0.
b At FDR 0.05.

Evaluating the QuickMod Scoring Scheme for Each Modification Type

Figure 1. Number of identified modified peptide spectra (Mod_z2/3 searched against SL_z2/3) at an FDR of 0.05, displayed as a function of the feature set. The feature just below the box plot and all features to its left are included in the linear SVM classifier. (A) Analysis of spectra with precursor charge 2+. (B) Analysis of spectra with precursor charge 3+.

As described earlier, spectral library search tools typically employ a square root transformation of peak intensities, and SSMs are evaluated using a normalized dot-product score. However, our analysis of the discriminatory power of various scoring features (see Table 2) and the outcome of our feature selection suggest that similarity measures giving little weight to peak intensities are better suited when evaluating the SSM of a nonmodified and modified peptide spectral pair. As an additional evaluation, we investigated how the intensity distributions of nonmodified and modified spectra of the same peptide differ for the five modification types included in our test data set. Given these results, we studied how well the QuickMod scoring classifies incorrect and correct SSMs for the different modification types.

We split the test data set of doubly charged spectra (Mod_z2) into two batches. One batch (1250 spectra, 250 spectra of each modification type) was used to train the QuickMod scoring algorithm and the other batch was used for testing. Next, the 250 spectra of each modification type selected for testing were separately screened against the spectral library SL_z2, allowing for a modification mass tolerance of 200 Da (to simulate an OMS). Three searches were performed per modification, two of which employed a dot-product scoring scheme applied to raw (dpRaw) and rank-transformed (dpRank) intensities, respectively, and the third search used the QuickMod scoring. The discriminatory power of a dpRaw score, where no transformation of peak intensities is performed, is highly dependent on the similarity in peak intensities of the matching peaks between spectra. In contrast, an intensity rank transformation strongly de-emphasizes the importance of intense spectral features. Thus, the discriminatory power of the second type of search is less dependent on the similarity in original peak intensity of matching spectral peaks. The 20 highest scoring library candidates for each spectrum were selected, and ROC curves were calculated reflecting the discriminatory power in the results output of each search.

Comparing the dpRaw and dpRank ROC curves suggests that relying on a high intensity similarity of matching peaks generally leads to poor separation between scores assigned to correct and incorrect SSMs. Notably, the analysis of the phosphorylation data with no peak intensity transformation has particularly low discriminatory power (see Figure 2). This is not surprising since phosphorylation is a labile modification. Labile modifications such as O-glycosylation, sulphation, and phosphorylation often produce spectra that are tricky to interpret. For these modifications, we find a mixture of both modified and unmodified neutral loss fragment ions, an altered intensity pattern, as well as intense signals corresponding to neutral loss precursor peaks. CID spectra of phosphorylated peptides are commonly dominated by an intense peak corresponding to the loss of phosphoric acid from the precursor ion. Consequently, an SSM similarity score giving little weight to peak intensities is better suited for these modification types. The QuickMod scoring scheme, combining the dpRank score with two intensity-independent scoring features (byIonCoverage, hyperProb), leads to better sensitivity at any given specificity (see Figure 2) for all modification types. Still, the QuickMod scoring scheme proves to be slightly less discriminant for phosphorylation compared to other modifications.

Figure 2. ROC curves represent the discriminatory power of three SSM similarity scores (dpRaw, red; dpRank, blue; and the QuickMod score, black) when separately analyzing the spectral data of each modification type included in the test data set Mod_z2 (carbamidomethylation, “longdash”; acetylation, “dotdash”; oxidation, “solid”; pyro-glu, “shortdash”; phosphorylation, “dot”). dpRaw represents a dot-product score with no transformation of the original peak intensities. dpRank represents a dot-product score where peak intensities are rank transformed. QuickMod score corresponds to the default QuickMod score for CID data.

A principal component analysis (PCA) further illustrates how different modifications display somewhat distinct score characteristics; that is, the feature vectors of three selected modifications (carbamidomethylation, phosphorylation, and pyro-glu) cluster in different regions of the space of QuickMod scoring features (see Figure 3). However, the PCA also shows that feature scores assigned to phosphorylated peptide SSMs generally display less separation from the distribution of incorrect feature scores compared to SSMs of carbamidomethylated and pyro-glu peptides.

Modification Site Assignment

The results listed in Table 1 show the number of so-called delta correct peptide identifications, meaning that the FDR associated with a given number of spectrum match reflects identification accuracy of a peptide with the assigned modification mass. It does not reflect the modification site accuracy. For


Figure 3. Test data set Mod_z2 was searched against SL_z2. The plot displays the results of a Principal Component Analysis (PCA) of the three feature vectors included in the QuickMod scoring scheme (byIonCoverage, dpRank, and hyperProb), grouped by peptide identification type (carbamidomethylation, red diamonds; phosphorylation, black solid triangles; pyro-glu, blue solid circles; and incorrect SSMs, gray triangles).

modification site assignment, QuickMod uses Algorithm 1 shown in Scheme 1. We evaluated the performance of the modification site assignment algorithm by matching the query data sets (Mod_z2 and Mod_z3) against a database of theoretical spectra (Theo_z2/3) and a spectrum library (SL_z2/3). In each analysis, a modified query spectrum was only evaluated against the unmodified reference spectrum of the same peptide sequence. Doubly charged precursor theoretical spectra in Theo_z2/3 included all singly charged b- and y-ions. Triply charged spectra in Theo_z2/3 included all singly and doubly charged b- and y-ions. Library spectra of doubly (triply) charged precursor ions in SL_z2/3 were filtered to include the 40 (60) most intense peaks. The accuracy of the modification site assignments is shown in Table 3. The first two columns of Table 3 display the percentage of correct site assignments when all residues of the peptide sequence were considered as possible modification sites. Library and full theoretical spectra give equivalent modification site accuracy when analyzing the query data set of doubly charged precursor spectra. In the case of triply charged precursor spectra, the modification site accuracy drops substantially. Here, a slightly higher fraction of the peptide modifications is accurately positioned when running the positioning algorithm on library spectra. Even though the sequence ions of library spectra are not expected to cover the full peptide sequence, the additional peaks present in the full theoretical spectra of Theo_z2/3 do not improve the global modification positioning accuracy.
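To make the positioning step concrete, the following is a minimal sketch of one generic way to localize an unknown mass shift. It is not a reproduction of Algorithm 1 in Scheme 1. It assumes singly charged b- and y-ions only, a fixed fragment mass tolerance, and a plain count of matched ions as the score. The function names (`top_n_peaks`, `by_ions`, `count_matches`, `assign_site`), the `allowed` argument for restricting candidate sites, and the tolerance default are all hypothetical choices made for illustration.

```python
from bisect import bisect_left

PROTON = 1.007276   # mass of a proton (Da)
WATER = 18.010565   # monoisotopic mass of H2O (Da)

# Monoisotopic residue masses (Da) for a subset of amino acids; extend as needed.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "R": 156.10111,
}

def top_n_peaks(peaks, n):
    """Keep the n most intense (mz, intensity) peaks, returned sorted by m/z."""
    most_intense = sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
    return sorted(most_intense)

def by_ions(sequence, site, delta_mass):
    """Singly charged b- and y-ion m/z values with delta_mass on residue index `site`."""
    masses = [RESIDUE_MASS[aa] for aa in sequence]
    masses[site] += delta_mass
    total = sum(masses)
    ions, prefix = [], 0.0
    for i in range(len(masses) - 1):
        prefix += masses[i]
        ions.append(prefix + PROTON)                  # b-ion of length i + 1
        ions.append(total - prefix + WATER + PROTON)  # complementary y-ion
    return ions

def count_matches(ions, query_mz, tol):
    """Count ions that have a query peak within +/- tol Da (query_mz must be sorted)."""
    hits = 0
    for mz in ions:
        k = bisect_left(query_mz, mz - tol)
        if k < len(query_mz) and query_mz[k] <= mz + tol:
            hits += 1
    return hits

def assign_site(sequence, delta_mass, query_mz, tol=0.5, allowed=None):
    """Score every candidate site (or only those in `allowed`) and return the best one.

    Ties are resolved in favor of the first candidate examined.
    """
    candidates = range(len(sequence)) if allowed is None else allowed
    scores = {
        site: count_matches(by_ions(sequence, site, delta_mass), query_mz, tol)
        for site in candidates
    }
    best = max(scores, key=scores.get)
    return best, scores

# Usage: a synthetic query spectrum for SAMPLER with an oxidation-like shift on M (index 2).
true_ions = by_ions("SAMPLER", 2, 15.9949)
peaks = [(mz, 1.0) for mz in true_ions] + [(100.0, 0.1), (250.0, 0.2)]  # two weak noise peaks
query_mz = [mz for mz, _ in top_n_peaks(peaks, 12)]
best_site, site_scores = assign_site("SAMPLER", 15.9949, query_mz, tol=0.01)
```

In this synthetic case the true site explains all 12 fragment ions, while the neighboring sites explain fewer. The tight tolerance in the usage example is deliberate: with a wide tolerance, shifted b-ions can coincide with unrelated y-ions and produce ties between adjacent sites, which is one reason positioning accuracy is often reported within a window of residues as well as for the exact position.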
The third column of Table 3 shows the modification site accuracy when typical sequence search tool restraints were imposed on the possible modification sites of each modification type: acetylation and pyro-glu mass shifts are always N-terminal, oxidation mass shifts on methionine, carbamidomethyl mass shifts on cysteine, and phosphorylation on serine, tyrosine, or threonine. Obviously, these rules will position the modifications correctly if only one candidate site is available, and they lead to a

dx.doi.org/10.1021/pr200152g |J. Proteome Res. 2011, 10, 2913–2921


Table 3. Modification Site Positioning Accuracy

                     exact position      position agreement    position agreement
                     agreement           (±2 residues)         (restricted possibilities)a
                     Theo      SL        Theo      SL          Theo      SL
Test Data z = 2+     75%       74%       87%       86%         94%       97%
Test Data z = 3+     55%       58%       67%       72%         92%       92%

a A restricted number of positioning options are considered for each peptide; for example, oxidation can only be positioned on methionine.

dramatic improvement of the modification site accuracy. Generally, one can expect a reasonably high accuracy of the modification site prediction for confident SSMs when screening a data set of tryptic peptide spectra allowing for a limited list of highly site-specific modifications. However, it is important to keep in mind that MS2 spectra alone often do not carry enough information, and only a proper reduction of the search space can come to the rescue. An extended version of Table 3 displaying the modification site accuracy per modification type is available in the Supporting Information (Table S1).

Benchmarking

The spectrum-spectrum matching and scoring algorithm of QuickMod has been optimized to identify modified variants of the spectral library entries. In order to evaluate the relevance of this effort, we compared the performance of QuickMod to a standard spectrum library search tool, SpectraST, which was not developed for OMS. The search parameters of SpectraST were set to allow for the matching of library and query spectra with a large precursor mass difference. We also compared QuickMod to InsPecT combined with the PTMFinder11 postprocessing tool, a popular sequence search based open modification data analysis pipeline. The three tools were benchmarked by analyzing the Exp_HP data set, which includes approximately 56 000 spectra. In the QuickMod and SpectraST analyses, the query data set was screened against the spectral library Lib_HP concatenated to a decoy spectral library of the same size. InsPecT was run in blind mode against a peptide sequence database including the same peptide entries as the spectral library (Cys-Cam was specified as a fixed modification). The search parameters of all search tools were configured to allow for the matching of PTMs with a maximum mass of 200 Da (see Supporting Information for more details). The QuickMod search was based on the scoring model trained on the test data described in the data set section. Figure 4a displays the number of modified peptide spectra identified per modification mass bin (bins of 1 Da) for all three search tools (SSMs with a delta mass lower than 20 Da were discarded, as modifications corresponding to the loss of C- or N-terminal amino acids are reported differently by the three tools). For QuickMod and SpectraST an FDR cutoff of 0.05 was employed and the InsPecT-PTMFinder results were filtered for identifications with a p-value