MUDE: A New Approach for Optimizing Sensitivity in the Target-Decoy Search Strategy for Large-Scale Peptide/Protein Identification Fabio R. Cerqueira,*,†,‡ Armin Graber,§ Benno Schwikowski,| and Christian Baumgartner*,† Institute of Electrical, Electronic and Bioengineering (iebe), University for Health Sciences, Medical Informatics and Technology (UMIT), Hall in Tirol, Austria, Department of Informatics (DPI), Federal University of Vic¸osa (UFV), Minas Gerais, Brazil, Institute for Bioinformatics (bioinf), UMIT, Hall in Tirol, Austria, and Institut Pasteur, Systems Biology Lab (Sysbio), Department of Genomes and Genetics/CNRS, URA 2171, F-75015 Paris, France Received November 10, 2009
The target-decoy search strategy has been successfully applied in shotgun proteomics for validating peptide and protein identifications. If, on one hand, this method has proven to be very efficient for error estimation, on the other hand, little attention has been paid to the resulting sensitivity. Only two scores are normally used and thresholds are explored in a very simplistic way. In this work, a multivariate decoy analysis is described, where many quality parameters are considered. This analysis is treated in our approach as an optimization problem for sensitivity maximization. Furthermore, an efficient heuristic is proposed to solve this problem. Experiments comparing our method, termed MUDE (multivariate decoy database analysis), with traditional bivariate decoy analysis and with Peptide/ProteinProphet showed that our procedure significantly enhances the retrieved number of identifications when comparing the same false discovery rates. Particularly for phosphopeptide/protein identifications, we could demonstrate more than a two-fold increase in sensitivity compared with the Trans-Proteomic Pipeline tools. Keywords: tandem mass spectrometry • shotgun proteomics • peptide/protein identification • targetdecoy search strategy • phosphoproteomics • data mining
Introduction Liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has become the de facto method for protein identification in complex mixtures. SEQUEST1 and MASCOT2 are currently the most widely used computational tools for assigning peptide sequence to MS/MS spectra generated by this approach. Both tools take a protein sequence database (DB) to generate putative solutions. The protein sequences are virtually digested using properties of the protease(s) used in sample preparation. As a result, given a MS/ MS spectrum, the best peptide candidates to be assigned are those having theoretical mass close (a tolerance range) to the precursor mass reported by the mass spectrometer. The theoretical fragment ions of the obtained candidates are then compared to the observed peaks. The best peptide-spectrum matches according to some score are thus reported as most likely results. The huge number of MS/MS spectra generated in a single run of the LC-MS/MS approach, however, makes the assess* To whom correspondence should be addressed. E-mail: fabio.cerqueira@ ufv.br;
[email protected]. Phone: +55 31 38991776; +43 50 8648-3827. Fax: +55 31 38992394; +43 50 8648-67 3827. † iebe, UMIT. ‡ DPI, UFV. § bioinf, UMIT. | Sysbio, Institut Pasteur. 10.1021/pr901023v
2010 American Chemical Society
ment of results a hard task. The amount of false positives is too high in general, which means an expressive difficulty in mining correct spectrum assignments. In former procedures to filter out wrong hits, the application of fixed cutoff values were commonly reported in the literature.3 This simple method was rapidly considered inappropriate as distinct data sets might present completely different score distributions. Therefore, a certain cutoff applied to a given data set is not necessarily suitable for different data generated under distinct conditions (sample preparation, mass spectrometer, protein nature, etc.). When coping with phosphoproteins, for instance, scores are typically lower than usual due to the fact that the precursor fragmentation process in mass spectrometry using low energy dissociation presents a bias toward phosphate groups. Consequently, important fragment ions are frequently suppressed.4-6 A more sophisticated technique called PeptideProphet was proposed by Keller et al.7 in which standard statistical distributions are used to fit the data. Mixture models found by the expectation maximization algorithm represent the underlying score distributions of false and true positives, so that a probability can be associated to each spectrum interpretation. As long as the data follows the assumed distribution, PeptideProphet is proven to be very efficient. The calculated probabilities are very precise, being employed to predict the number of correct and incorrect assignments.7 On the other hand, the method may fail for data presenting very different Journal of Proteome Research 2010, 9, 2265–2277 2265 Published on Web 03/03/2010
research articles characteristics. When applying SEQUEST on phosphodata, for example, in addition to low Xcorrs, ∆Cn values (the relative difference between the top and the second hits) are frequently null due to the fact that the produced hits for a spectrum present often the same sequence, differing only in the phosphorylation sites.5,8 Note that Xcorr and ∆Cn are key scores in the PeptideProphet approach. As a result, the target-decoy search strategy for false discovery rate (FDR) estimation introduced by Peng et al.9 has been broadly applied in protein identification pipelines for its simplicity and impartiality, specially on phosphodata.5,8,10,11 In this technique, fake (decoy) protein sequences are also considered in the search. Assuming that a wrong hit has the same probability of coming from a normal (target) or a decoy sequence, the number of decoy hits is taken as an estimate to the number of wrong assignments among normal hits. This strategy has been also applied for constructing training sets (TS) in studies that use machine learning techniques in shotgun proteomics.10 Note that the TS construction is a fundamental step in learning procedures. In shotgun proteomics, it is particularly critical as distinct data sets can present completely different characteristics, that is, the TS has to be formed ideally in a dynamic manner (a specific TS for each specific data).6,10,12 However, in all works cited here using decoy approaches, only one or two scores are used to find thresholds leading to FDRs less than 1-2%. Furthermore, thresholds are pursued in a very simplistic way. Consequently, even finding score values producing the desired precision, the resulting sensitivity is expected to be very poor. Considering the increasing importance of the target-decoy search strategy, noticed in many relevant works, we call attention to the necessity of further improving this method. We here introduce a new method termed MUDE (multivariate decoy database analysis), which aims at optimizing the sensitivity or identification success rate of target-decoy search approaches. For a given target FDR, our method searches for the combination of thresholds producing the highest number of peptide identifications, and thus an advanced proteome coverage. In this paper, we used SEQUEST in our experiments. The identification success rate could be significantly increased because: First, we consider many more quality parameters than usual (normally two), namely, Xcorr, ∆Cn, ∆M, SpRank, percentage of ions found, and deviation between predicted and observed retention times. Hence, we perform a multivariate decoy DB analysis. Note that retention time (RT) is neglected in many approaches, but this information has proven to significantly enhance the discriminatory power in peptide identification assessment.13 Second, the problem of finding min/max score values yielding the desired FDR is treated here as an optimization problem, which is solved using a modified hill-climbing approach.14 It makes possible a more efficient exploitation of the threshold search space in contrast with methods that simply fix a value for one parameter, while varying the value of another score in an unidirectional manner as previously described.3,5 In the course of the paper, we refer to this usual approach as BIDE (bivariate decoy database analysis). On peptide level, we compared our approach with BIDE and with PeptideProphet. Considering FDR ) 0.01, our method revealed an increase in sensitivity of 23 and 55% for nonphosphodata of +2 charged (CH2) precursor ions, when compared with BIDE and PeptideProphet, respectively. For spectra of +3 (CH3) charged precursor ions, the enhancements were approximately 33 and 56%, respectively. Performing the 2266
Journal of Proteome Research • Vol. 9, No. 5, 2010
Cerqueira et al. same analysis in phosphodata, the improvements relative to PeptideProphet were even more substantial. While our method demonstrated an augmentation of approximately 17 and 55% in sensitivity compared with BIDE for CH2 and CH3 data, respectively, the advances relative to PeptideProphet were about 163 and 157% for CH2 and CH3, respectively. Comparisons with ProteinProphet15 were also performed (protein level) and similar enhancements could be observed.
Material and Methods MS/MS Data. To verify the performance of our method on phosphodata, we used three data sets corresponding to phospho-enriched samples of three independent experiments using cytosol of Mek1-/- MEFs described in Cerqueira et al.6 In this work, after sample preparation, the resulting peptides were separated by reversed-phase high-performance liquid chromatography (RP-HPLC), which was coupled to a LTQ FT mass spectrometer (hybrid linear ion trap/Fourier transform ion cyclotron resonance (Thermo Electron, Bremen)) using a multistage activation method.16 The three produced data sets were converted to dta files (SEQUEST text-file format to represent MS/MS spectra), comprising 24405 (S1), 23668 (S2), and 18996 (S3) spectra, respectively. SEQUEST (Bioworks v 3.3, Thermo Electron) was run on these three data sets for spectrum sequence assignment. As the phospho-enrichment process is not very efficient, each data set was split into two parts. One part is composed by spectra whose top hit is reported as a phosphopeptide, while the other part is made up of spectra whose top hit indicates a nonphosphopeptide. In this way, we could analyze phosphodata and nonphosphodata separately. Each obtained part was further split according to the precursor charge state. This is also necessary, since the score distributions differ significantly among identifications of distinct charge states.7,17 Here, we tested only for charges +2 and +3. Therefore, instead of three, twelve datasetswereavailableforanalysis.WedefinethemasS1_PH_CH2, S1_PH_CH3 S1_NPH_CH2, S1_NPH_CH3, S2_PH_CH2, S2_ PH_CH3, S2_NPH_CH2, S2_NPH_CH3, S3_PH_CH2, S3_PH_CH3, S3_NPH_CH2, and S3_NPH_CH3, where “PH” and “NPH” indicate phosphodata and nonphosphodata, respectively, while “CH2” and “CH3” denote +2 and +3 charge states, respectively. After verifying the results for S3_NPH_CH3 by a decoy DB analysis and Trans-Proteomic Pipeline v 4.2 (tool containing PeptideProphet and ProteinProphet),18 this set was eliminated from the experiments as it presents fewer than 10 correct spectrum assignments. We refer to these 11 data sets as DATA1. Three additional data sets described in the work of Pfeifer et al.,13 whose constituent proteins are known a priori, were used for running a second experiment series. In their work, peptides were separated by RP-HPLC as well. Online ESI-MS detection was performed with a quadrupole ion-trap mass spectrometer (Model esquire HCT, Bruker Daltonics, Bremen, Germany). These data sets are defined as DATA2. The mixtures were composed by the following proteins: • Mixture 1: β-casein (bovine milk), conalbumin (chicken egg white), myelin basic protein (bovine), hemoglobin (human, divided in subunits alpha and beta in the DB), leptin (human), creatine phosphokinase (rabbit muscle), R1-acid-glycoprotein (human plasma, appearing in two distinct versions in the DB), and albumin (bovine serum). • Mixture 2: cytochrome C (bovine heart), β-lactoglobulin A (bovine), carbonic anhydrase (bovine erythrocytes), catalase (bovine liver), myoglobin (horse heart), lysozyme (chicken egg
research articles
MUDE: A New Approach for Optimizing Sensitivity white), ribonuclease A (bovine pancreas), transferrin (bovine), R-lactalbumin (bovine), and albumin (bovine serum). • Mixture 3: thyroglobulin (bovine thyroid), and albumin (bovine serum). To use RT information for peptide identification assessment in our method, the out files (produced by SEQUEST) of each data set were converted into a unique IdXML (v 1.1) file, which is the format used by the algorithm (OpenMS v 1.4) for retention time prediction.13 Database Search Details. The databases used in the searches are a composition of target protein sequences taken from the DBs described below appended to the same sequences after reversion (decoy proteins) as recommended by Elias et al.19 For the searches on DATA1, we used the mouse IPI database (v 3.18).20 The search parameters were set the same for all runs. Enzyme: trypsin; missed cleavages: up to 2; fixed modifications: carbamidomethyl (C), methyl (C-term), Methyl (DE); variable modifications: oxidation (M), phosphorylation (ST), phosphorylation (Y); protein mass: unrestricted; mass values: monoisotopic; peptide mass tolerance: (10 ppm; fragment mass tolerance: (0.6 Da. For the searches on DATA2, the sequences of the proteins present in the mixtures listed in the previous section were appended to the whole bacteria database delivered from swissprot (v 51.6). In this way, counting true and false positives (hits pertaining or not pertaining to the expected proteins) becomes trivial as homologies are minimized. The search parameters were as follows. Enzyme: trypsin, missed cleavages: up to 1, fixed modifications: Carboxymethyl (C), protein mass: unrestricted, mass values: monoisotopic, peptide mass tolerance: (1.3 Da, fragment mass tolerance: (0.3 Da. Implementation Details. The algorithms developed in this work were implemented in Java (JDK 1. 6.0) using a standard Windows XP platform. Considering the cross-platform characteristics of Java and that the resulting programs present a command-line interface, our codes can be easily deployed to other operating systems. The software is open-source and is available under the URL http://sourceforge.net/projects/mude/. The experiments described in section “Results and Discussion” were performed using a stand-alone PC: Intel CORE2 DUO CPU 2.40 GHz, 4GB RAM, under Windows XP. Decoy DB Analysis. Decoy DBs are broadly applied in shotgun proteomics for FDR estimation. A clear strength of this approach is the fact that no assumption on data is needed. This is a major reason why FDR estimation by decoy DB analysis is the method of choice in phosphoproteomics.5,8,10 In this technique, given the target protein sequences, decoy sequences are produced preserving the general composition of the targets. A common way to make it possible is reversing the real sequences. Other methods reshuffle the amino acids of the original sequences.21,22 Next, one can use real and decoy sequences in an independent way. In this case, the search is performed twice, one run for each DB separately, using the same parameters. Assuming that a wrong hit has the same probability of coming from target or decoy databases, the FDR in the search using the real DB is estimated by: ˆ) FDR
DT HT
(1)
where DT is the number of decoy hits filtered through a set of thresholds T, and HT is the number of target hits obtained also by T.
Another possibility is to create a composite target-decoy DB (our case) by appending the decoy sequences to the real ones. In this case, only one search is performed and the FDR is estimated by:
ˆ) FDR
2DT NT
(2)
where DT is the number of decoy hits, and NT is the total number of peptide identifications (decoys and targets) using thresholds in T. Since decoy hits are obviously wrong, we prefer to disregard them, that is, only normal hits are considered in the FDR estimation:
ˆ) FDR
DT N T - DT
(3)
As already mentioned, decoy DB methods have been widely applied to find score thresholds leading to a given FDR, particularly in the case of phosphodata to which score distributions are odd. However, to the best of our knowledge, this method has been used without any concern to maximize sensitivity, where sensitivity here means the proportion of true identifications captured by the chosen thresholds. Only one or two quality parameters are normally explored. Moreover, once the user finds thresholds producing the desired FDR, no other score combination that might provide a higher number of identifications is verified. As a result, we suggest the use of additional parameters in the analysis as well as a new procedure to explore them. Additional Quality Parameters in the Decoy DB Analysis. Reviewing previous works, we could see that one or two parameters are usually exploited to achieve a number of normal ˆ e ε, where ε is a and decoy hits corresponding to FDR predefined maximum FDR. We first propose the inclusion of further parameters to the decoy DB analysis. Figure 1 shows an example to illustrate the benefit of adding other parameters. The example lists Xcorr and ∆Cn (the most significant SEQUEST scores) values of twelve peptide hits, where a composite targetdecoy DB was used for the search. The list is sorted by Xcorr in descending order. If one tries to find a Xcorr threshold ˆ ) 0, the only possibility is to set Xcorr g 3.3, leading to FDR resulting in only three identifications. If, on the other hand, Xcorr and ∆Cn are combined, it is possible to find thresholds ˆ ) 0, leading to five for both scores also providing FDR identifications. This result is possible for Xcorr g 2.1 and ∆Cn g 0.03 as well as for Xcorr g 1.0 and ∆Cn g 0.15. Thus, the incorporation of another quality parameter made possible an increase in sensitivity for the same FDR. The application of Xcorr and ∆Cn in the decoy DB analysis is very common in the literature.5,23 However, several papers have already demonstrated that other quality parameters, notably, ∆M, SpRank, and percentage of ions found, are worthwhile to explore for assessing peptide identifications found by SEQUEST.7,8,17 Retention time is also very useful in the process of curation as demonstrated by Pfeifer et al.13 In this work, a new kernel function for support vector regression (SVR) is proposed. Giving the amino acid sequence of a reported hit, the SVR model is used to predict the peptide RT. Next, it is compared to the observed RT (given in the HPLC Journal of Proteome Research • Vol. 9, No. 5, 2010 2267
research articles
Cerqueira et al.
ˆ ) 0, Figure 1. Example showing how the inclusion of additional parameters can improve sensitivity in a decoy DB approach. For FDR only three identifications can be obtained when using Xcorr thresholds. The inclusion of ∆Cn enables threshold combinations leading to five identifications for the same error rate. (a) Textual representation. (b) Graphical representation (∆Cn vs Xcorr), where crosses represent decoy hits and circles denote normal hits.
process) and a p-value is produced by calculating the probability that a correct identification has a higher deviation between observed and predicted retention times. The same authors have recently reported an extension of this method to peptides undergoing post-translational modifications.24 The RT p-value is demonstrated to be a very powerful score to eliminate false positives. However, the RT information is frequently neglected in significant contributions to petide/ protein identification assessment.5,8,17,22,23 Particularly on methods using decoy DB approaches, we have not yet found studies ˆ calculation. including RT deviation in the FDR We introduce here MUDE, a multivariate decoy DB analysis strategy, rather than the traditional uni- or bivariate procedure. Six important quality parameters are proposed in our approach: Xcorr, ∆Cn, ∆M, SpRank, percentage of ions found, and RT p-value according to the method described in Pfeifer et al.13 As a result, multiple threshold scenarios can now be evaluated to achieve higher sensitivities (Figure 1). Decoy DB Analysis Viewed As an Optimization Problem. When performing BIDE, a common approach is the consultation of the literature to define a value in which one threshold 2268
Journal of Proteome Research • Vol. 9, No. 5, 2010
can be fixed and then the second parameter is varied. As an example, when using Xcorr and ∆Cn, the latter is frequently fixed as follows: ∆Cn g 0.10, while Xcorr starts with a small value and it is gradually increased to the point where ˆ FDR e ε.5 In contrast to such a simplistic method, we propose a more sophisticated way to explore the threshold search space. Here, this procedure is treated as an optimization problem of the form: max f(x1, x2, ..., xm) subject to g(x1, x2, ..., xm) e ε, li e xi e ui, i ) 1, 2, ..., m, xi ∈ Di, i ) 1, 2, ..., m
(4)
where x1, x2, ..., xm are the m scores being investigated, li and ui are, respectively, lower and upper bounds for score xi, D i is the domain of score xi, f(x) provides the number of assignments, and g(x) is a function for FDR estimation. Therefore,
research articles
MUDE: A New Approach for Optimizing Sensitivity this formulation aims at finding the set of thresholds that leads to the maximum number of assignments with FDR not greater than ε, according to the domains and limits of the given parameters. Formulation 4 is defined here as ε-masp (maximization of sensitivity problem). Considering the quality parameters proposed in the last section, our specific ε-masp formulation is given as follows: maxf(minXcorr, min∆Cn, max∆M, maxlnSpRank, minPercIons, minRTp-value) subject to g(minXcorr, min∆Cn, max∆M, maxlnSpRank, minPercIons, minRTp-value) e ε, a1 e minXcorr e a2, minXcorr ∈ R+, b1 e min∆Cn e b2, min∆Cn ∈ [0, 1),
c1 e max∆M e c2, max∆M ∈ R+, d1 e maxlnSpRank e d2, maxlnSpRank ∈ [0, ln 500], e1 e minPercIons e e2, minPercIons ∈ [0, 100], f1 e minRTp-value e f2, minRTp-value ∈ [0, 1] (5)
Note that we use the natural log of SpRank as proposed by Keller et al.7 Function f counts hits presenting: Xcorr g minXcorr, ∆Cn g min∆Cn, ∆M e max∆M, ln SpRank e maxlnSpRank, PercIons g minPercIons, and RT p-value g minRTp-value. Function g counts the number of decoy hits passing through the same threshold filtering above and applies ˆ calculation. eq 3 to FDR Consequently, in addition to the estimation of FDRs, the method pursues the maximum sensitivity so that more peptides, and therefore more proteins, can be identified. Modified Hill-Climbing Algorithm. For N reported assignments with m scores, the obtainment of an exact solution for the general problem of Formulation 4 has time complexity given by O(Nm). Hence, even though having a polynomial complexity, finding a solution for a problem with m ) 3 would be already a very slow process. For m > 3, it would be infeasible. In our case, m ) 6, which makes an exact solution definitely out of question. For this reason, the ε-masp is here solved in an approximate manner, that is, a heuristic strategy is applied to solve this problem. We recommend a modification of the hill-climbing algorithm to this end. This algorithm was chosen because of its easy implementation, low memory requirements, and fast convergence. Algorithm 1 denotes the pseudocode of the original hill-climbing strategy.
case, then the procedure is finished returning the current node state as a solution. Otherwise, the search proceeds, that is, the found successor becomes the current node and a new iteration is started. When current represents a final state, then its highest-valued successor will be current itself, meaning that neighbor.VALUE ) current.VALUE, which will also cause the procedure to stop. One important advantage of this method is that it is very memory-efficient as the search tree does not need to be kept. Only the current node and its successors are verified at a time. The main weakness of hill-climbing, on the other hand, is that it stops when it finds the first local maximum, meaning a very premature end of search. To make the above method more versatile, yet keeping its efficiency in terms of memory, we propose a modified hillclimbing method (Algorithm 2). In its modified form, the algorithm’s stop criteria (aside reaching the final state) are based on the number of iterations (nIt) and the number of iterations without improvements (nUnchIt). While the expansions occur, the best state is stored along the path being traced through the search tree, but the algorithm does not stop when finding a worse state. Therefore, it is now possible to expand a node to a worse successor, escaping from local maxima and continuing the search. In this way, as long as nIt and nUnchIt do not exceed their maximum allowed values, many local maxima may be visited during a run. Hence, the chances to obtain a local maximum being also global, or at least close to the optimum value, is dramatically increased.
Particularly to ε-masp, a state is composed by a set of score thresholds and the resulting assignments after filtering the original list of identifications. The value of a state is given by the following HF: ˆ) HF(NT, DT) ) (NT - 2DT)(1 - FDR
In each iteration, the hill-climbing algorithm takes the best among the current node successors. The value of a node is measured by a heuristic function (HF), which is problem specific. Next, the algorithm tests whether the chosen successor has a HF value not greater than the current node. If this is the
(6)
Therefore, the HF value depends on the estimated number of correct hits (NT - 2DT) and the estimated precision (1 ˆ). As a consequence, consider two states s1 and s2 with FDR 1000 and 100 assignments, respectively. If s1 and s2 have both 100 estimated correct hits, then s2 will be preferable as it has a ˆ () 0). If, conversely, s1 and s2 have both FDR ˆ ) 0.01 lower FDR (10 incorrect hits in the former and 1 incorrect in the latter), then state s1 is favored since it contains a much higher number Journal of Proteome Research • Vol. 9, No. 5, 2010 2269
research articles
Cerqueira et al.
ˆ ) 0. Bold nodes form the path along the tree according to Figure 2. Example of a search tree using only Xcorr and ∆Cn to reach FDR Function 6.
of supposed correct identifications. In other words, the HF calculation is based on the number of estimated correct hits weighted by estimated precision. The initial state is built by filtering the assignments using the initial thresholds: a1, b1, c1, d1, e1, and f1 as shown in Formulation 5. To expand nodes for the search tree exploration, a step value is defined to each parameter so that the thresholds can be varied. For scores that have to be increased (e.g., Xcorr), the step value is positive. Conversely, negative step values are defined to scores being decreased (e.g., ∆M). Thus, the successors of a node are generated using all possible modifications of the given parameters: one threshold at a time, all possible pairs of thresholds, all threshold triplets, and so on, until the successor whose generation is obtained by altering all thresholds at once. Therefore, the number of successors are given by:
∑ (mi ) m
(7)
i)1
where m is the number of parameters taken into account. In our case m ) 6, which means 63 successors for each iteration. ˆ e ε or all thresholds reached A state is considered final if FDR their final values: a2, b2, c2, d2, e2, and f2. Figure 2 shows an example considering only Xcorr and ∆Cn for simplicity. This example uses the hits shown in Figure 1. The initial thresholds for Xcorr and ∆Cn are 1.0 and 0.01, respectively, while their upper bounds are 4.0 and 0.30, respectively. The step values to increase minXcorr and min∆Cn are 0.5 and 0.05, respectively, and ε ) 0. In this case, a final state was reached due ˆ ) 0. to FDR 2270
Journal of Proteome Research • Vol. 9, No. 5, 2010
As a last practical observation, it is possible to produce a ˆ e ε but with a HF value lower than the value state where FDR of the best state so far. In this case, the algorithm stops and returns the state presenting the desired FDR instead of the best state. This is the only situation where the best state is not returned. Algorithm 2 has time complexity O(S × H), where S is the time required to expand a node and H is the tree height. The time to expand a node is the node’s number of children multiplied by the time required to filter the node’s spectra using the obtained thresholds. Note that at each iteration the filtering process decreases the number of spectra, which means that for deep levels the time required to expand a node is quite fast. The height H depends on what stop criterion is reached first. However, this value is normally small because the condition ˆ e ε is often achieved on shallow levels in the tree, that is, FDR much sooner than the other stop criteria. Data Mining Framework for Identification Assessment. The whole process proposed here is performed in four stages and is illustrated as a data mining framework in Figure 3. First, to obtain a good quality set of assignments to perform the support vector regression, a 0-masp is solved without RT p-values, that ˆ is, using only five scores. The resulting assignments with FDR ) 0 constitute the TS for the algorithm presented by Pfeifer et al.13 Next, the SVR is applied on this data to generate a model for RT prediction. The model is subsequently used to predict retention times of the originally reported sequences and p-values are calculated based on the deviation between predicted and observed RTs. Finally, a solution for ε-masp using all proposed quality parameters is found. It should be noted that the user has the chance of solving ε-masp several times selecting distinct settings (different initial/final values as well as step sizes), so that various distinct paths can be traced in
MUDE: A New Approach for Optimizing Sensitivity
research articles
Figure 3. Data mining framework for peptide/protein identification assessment. A 0-masp with only five scores is first solved to generate a training set. The resulting data is then used in the SVR to generate a RT prediction model. Next, the model is used to calculate RT p-values for each hit. Finally, k (user-defined) ε-masp’s are solved, using the six recommended scores. The k results are merged in a final solution.
the search tree by the modified hill-climbing strategy. Our procedure automatically merges all obtained results in a final unique solution. It is also important to note the possibility of fixing values for any score or even eliminating scores from the optimization process. In this way, the user can adapt the parameter settings to specific necessities of particular data sets.
Results and Discussion In this paper, we have introduced a multivariate decoy DB analysis in contrast to the commonly used bivariate approach. The goal of using many quality parameters is the maximization of sensitivity, that is, the obtainment of a higher number of assignments for a given target FDR. We yet pose MUDE as an optimization problem for which a good solution can be quickly established by the use of a modified hill-climbing strategy. Experiments on peptide and protein level using several data sets showed that our method could notably potentiate the application of the target-decoy search strategy in identification assessment. We started the experiments using DATA1 to verify the impact of each proposed additional score. Next, comparisons on peptide and protein level were conducted between MUDE and BIDE as well as MUDE and Trans-Proteomic Pipeline tools. In a second computational run, we used DATA2, which is composed of data sets of known protein mixtures, to show the accuracy of FDRs estimated by our method. Note that the running time of our programs to solve a ε-masp can largely variate depending on the user settings. On average, a run for a set of 4.000 spectra took approximately 10 s. Impact of Scores. Xcorr and ∆Cn are definitely the quality parameters of choice when using a bivariate target-decoy search strategy to estimate FDR.3,9,19,25 In this section, we show how the other four scores: ∆M, SpRank, percentage of ions found, and RT p-value, individually, affect the optimization process. To explore several scenarios, we combined various initial, final, and step values of each score, resulting in 45 different parameter configurations. For each data set, we used these settings to solve 45 ε-masp’s five times. In the first time, we used all six scores. In the other four set of runs, we removed one score, out of the four extra scores, at a time. In this way, we could compare results before and after the removal of each quality parameter (Figure 4). Such comparisons were carried out using the number of peptide identifications obtained for ˆ ) 0.01. According to Elias et al. and Balgley et al.,23,26 a FDR 1% FDR represents the best trade-off between sensitivity and precision when assessing MS/MS spectrum interpretations. In fact, this is the most pursued error rate reported in related
literature.5,8,10,11 The parameters used in all runs can be seen in Supporting Information. Figure 4 shows the impact of each score by means of boxplots. Three main measurements are considered in this case: the interquartile range (defined by the upper and lower edges of a box), the median value (line in the box), and the maximum value (the upper end of the vertical dashed line). It can be seen that percentage of ions found, SpRank, and RT p-value have a positive influence on the optimization. When removing each one at a time, it can be observed that the interquartile range, the median value, and the maximum value are in general lower. RT p-value is particularly noteworthy. Its influence is clearly more prominent compared to the other scores. This is a valuable observation as this score has been disregarded in other methods for evaluating MS/MS spectrum assignments. The removal of ∆M, on the other hand, was not as impacting as we could see for the other three scores. This can be explained by the fact that we used a very tight tolerance in SEQUEST runs ((10 ppm), which restricts the gain that this score can provide. One can thus use less tight tolerances to strengthen the use of ∆M, as previously described.7,8,17 Nonetheless, notice that the maximum values were mostly lower for runs without ∆M compared to results using all scores. This is an important fact since the maximum value (the highest number of identifications) for a certain FDR is the value of interest at the end, which makes the inclusion of ∆M also interesting for this data. Peptide Identifications. For comparisons on peptide level, we started contrasting our method to BIDE with Xcorr and ∆Cn (BIDEXcorr,∆Cn). However, unlike the way these scores have been used, we tested all possible combinations of values reported in the SEQUEST results to get a minimum Xcorr and a ˆe minimum ∆Cn that lead to the highest sensitivity with FDR ε. Note that it costs O(n2), that is, there is no necessity to look for an approximate solution in BIDE, as described in previous studies, since an exact solution (one giving the best sensitivity) can be obtained very quickly. Particularly to phosphodata, Beausoleil et al.8 and Jiang et al.5 call attention to the fact that the ∆Cn score is normally suppressed due to Xcorr similarities when a phosphopeptide presents more than one potential phosphorylation site. Hence, comparisons with a bivariate analysis using Xcorr and ∆M (BIDEXcorr,∆M) were additionally performed for phosphopeptide identifications.8 PeptideProphet evaluations were also contrasted to our results. It should be noted that PeptideProphet contains its own model to predicting RTs. It provides yet the option of using decoy hits to pin down the negative distribution. We used both options when setting PeptideProphet parameters so that this program could be run Journal of Proteome Research • Vol. 9, No. 5, 2010 2271
research articles
Cerqueira et al.
Figure 4. Boxplots to evaluate the impact of the newly proposed scores. Each boxplot was constructed using 45 values obtained from ˆ ≈ 0.01. (a) Nonphosphodata. (b) 45 ε-masp runs. Each value corresponds to the number of peptide identifications for FDR Phosphodata. For each data set, the runs were performed 5 times. In the first time, all six scores were applied. In the other cases, one of the four extra scores was removed at a time. All means the use of the six recommended scores, -Pe means the removal of percentage of ions found, -Sp represents the elimination of SpRank, -RT denotes the absence of RT p-value, and -∆M is the case where ∆M is ignored.
using its full abilities. The MUDE results shown below correspond to the various runs mentioned in the previous section, using the six proposed scores. For each data set, the 45 results were merged by our programs in a final solution taking the ˆs. highest sensitivities for the reported FDR ˆs Figure 5 contains plots of number of assignments vs FDR for nonphosphodata. The plots allow the comparison of the sensitivity of each method for the same error rates. Figure 5 demonstrates that MUDE curves dominate BIDE and Pepˆs considered. The point where tideProphet curves for all FDR ˆ FDR ≈ 0.01 is indicated by a small circle on MUDE curves. Taking this FDR, we achieved an increase in sensitivity of 31, 20, and 18% for data sets S1_NPH_CH2, S2_NPH_CH2, and S3_NPH_CH2, respectively, when contrasting our method to BIDE. Comparing with PeptideProphet for the same data and the same FDR, the improvements were 31, 102, and 31%, 2272
Journal of Proteome Research • Vol. 9, No. 5, 2010
respectively. For triply charged peptides, under the same conditions above, our method could deliver again a higher number of identifications. An augmentation of 52 and 14% could be observed from BIDE to MUDE in S1_NPH_CH3 and S2_NPH_CH3, respectively. Analyzing PeptideProphet evaluations in these data sets, we could raise the sensitivity in at least 32 and 80%, respectively. For phosphorylated peptides (Figure 6), the enhancements provided by our method were substantial compared with PeptideProphet. It should be emphasized that the MUDE curves present again a clear dominance over the other curves. Moreover, the difference between MUDE and PeptideProphet curves is impressive. Even comparing PeptideProphet with BIDE, PeptideProphet had a worse performance in general. It reinforces the caveat of Jiang et al.27 about using PeptideProphet on data sets where score
MUDE: A New Approach for Optimizing Sensitivity
research articles
ˆs for nonphosphodata. (a) S1_NPH_CH2. (b) S2_NPH_CH2. (c) S3_NPH_CH2. (d) Figure 5. Plots of number of assignments vs FDR ˆ ≈ 0.01. MUDE curves (blue) clearly dominates S1_NPH_CH3. (e) S2_NPH_CH3. A small circle on the MUDE curve highlights where FDR BIDEXcorr,∆Cn (black) and PeptideProphet (red) curves.
ˆs for phosphodata. (a) S1_PH_CH2. (b) S2_PH_CH2. (c) S3_PH_CH2. (d) S1_PH_CH3. Figure 6. Plots of number of assignments vs FDR ˆ ≈ 0.01. MUDE (blue) demonstrated once more (e) S2_PH_CH3. (f) S3_PH_CH3. A small circle on the MUDE curve highlights where FDR its dominance over BIDEXcorr,∆Cn (black) and PeptideProphet (red) as well as BIDEXcorr,∆M (green). The difference between MUDE and PeptideProphet is particularly worth noting.
distributions are very different from assumed. This fact also explains why the target-decoy search strategy is normally preferred when evaluating phosphopeptide/protein identifications.5,8,10 Considering again a 1% FDR, it can be noticed an average increase in sensitivity of 18 and 62% when comparing MUDE with BIDEXcorr,∆Cn for CH2 and CH3 data, respectively. Analyzing sensitivity of BIDEXcorr,∆M at the same
FDR, the increments provided by our approach are of 15 and 47% on average for the respective charges. Finally, contrasting the number of identifications reported by PeptideProphet and our method under the same error rate above, we could observe a general improvement of approximately 163 and 157% for doubly and triply charged peptides, respectively. Journal of Proteome Research • Vol. 9, No. 5, 2010 2273
research articles
Cerqueira et al.
ˆ. In each diagram, the left set represents assignments retrieved by MUDE, Figure 7. Venn diagrams of nonphosphodata for a 1% FDR while the right set indicates identifications found by the approach being compared with. (a) Comparisons with PeptideProphet. (b) Comparisons with BIDEXcorr,∆Cn. The number of exclusive assignments reported by MUDE is significantly higher in all cases.
ˆ. In each diagram, the left set represents assignments retrieved by MUDE, while Figure 8. Venn diagrams of phosphodata for a 1% FDR the right set indicates identifications found by the approach being compared with. (a) Comparisons with PeptideProphet. (b) Comparisons with BIDEXcorr,∆Cn. (c) Comparisons with BIDEXcorr,∆M. MUDE showed its superiority, reporting many more exclusive identifications when contrasting to the other procedures. The discrepancy between MUDE and PeptideProphet for phosphopeptides is remarkable.
We further compared BIDE and PeptideProphet with our procedure by means of venn diagrams to demonstrate the ˆ. Figures 7 number of exclusive identifications using a 1% FDR and 8 show this analysis for nonphosphodata and phosphodata, respectively. Note that in both cases the number of exclusive assignments mined by MUDE was revealingly higher compared with the other methods. This means that MUDE identifies exclusive proteins as well, or at least that it provides a higher coverage and number of matches for proteins identified in 2274
Journal of Proteome Research • Vol. 9, No. 5, 2010
common with the other approaches. Figure 9 illustrates an example of a spectrum from S1_PH_CH2 detected uniquely by MUDE. The assigned peptide sequence belongs to a phosphoprotein called DNA replication licensing factor MCM2. This is a mini-chromosome maintenance protein, vital for the initiation of eukaryotic genome replication. MCM2 is necessary for the entry in synthesis phase as well as cell division, and its malfunction is associated with colon cancer.28 Note that the spectrum is in agreement with the reported sequence. This is
research articles
MUDE: A New Approach for Optimizing Sensitivity
Figure 9. Example of an assignment detected only by MUDE. Phosphorylations are indicated by underscores. The precursor is reported as a doubly charged ion with m/z ) 1062.44.
ˆprot for nonphosphodata. (a) S1_NPH_CH2. (b) S2_NPH_CH2. (c) S3_NPH_CH2. (d) Figure 10. Plots of number of protein hits vs FDR S1_NPH_CH3. (e) S2_NPH_CH3. On protein level, MUDE (blue) shows its superiority over ProteinProphet (red) and BIDEXcorr,∆Cn (black).
a typical phosphopeptide spectrum containing a central prominent peak and suppressed b/y series. The highest peak at m/z ) 964.7 is a consequence of a double neutral loss of H3PO4 (approximately -196 Da corresponding to an offset of ≈ 98 ) 1062.44 - 964.7) undergone by the doubly charged precursor ion. Besides, the phosphorylation sites SER 139 and SER 140 are very well-known in literature,29-33 which reinforces the correctness of this assignment. The corresponding protein had only two peptide matches (one of them is shown in Figure 9), both detected exclusively by MUDE. Protein Identifications. While expanding the search tree, our ˆ of each newly generated state. In method records the FDR this way, the user can see at the end not only the final ˆ e ε, but also the thresholds thresholds corresponding to FDR
leading to other error rates (2%, 3%, etc.) of intermediary states. During this procedure, the algorithm computes the protein FDR estimate for each state, exactly as done on peptide level (eq 3). Note that the target-decoy search strategy has been traditionally employed in peptide identification assessment. However, Bianco et al.11 revealed that the same approach can be successfully applied on protein level to estimate error rates. Therefore, the protein FDR is computed here using: ˆprot ) FDR
DPT NPT - DPT
(8)
where DPT is the number of decoy proteins filtered through the same set of thresholds T during the peptide level analysis Journal of Proteome Research • Vol. 9, No. 5, 2010 2275
research articles
Cerqueira et al.
ˆprot for phosphodata. (a) S1_PH_CH2. (b) S2_PH_CH2. (c) S3_PH_CH2. (d) S1_PH_CH3. Figure 11. Plots of number of protein hits vs FDR (e) S2_NH_CH3. (f) S3_NH_CH3. Again MUDE (blue) overcomes the other approaches: ProteinProphet (red), BIDEXcorr,∆Cn (black), and BIDEXcorr,∆M (green).
and NPT is the total number of protein identifications (decoys and targets) using T. As a result, thresholds for several protein FDRs are reported by our algorithm as a byproduct. We call it a byproduct because no optimization is performed for proteins. The idea presented in Formulation 5 is based on peptide hits. What we do is simply counting the proteins reported for each state in the search tree and recording the corresponding ˆprot. On the other hand, it is expected that the optimizaFDR tion of sensitivity on peptide level results in enhancements on protein level as well. Figures 10 and 11 show that the above affirmation is ˆprot are reasonable. Plots of number of protein hits vs FDR shown comparing MUDE with ProteinProphet and BIDE. Again it can be seen a significant dominance of MUDE curves for both charges in all data sets. Therefore, it demonstrates that our approach indeed provides a higher proteome coverage. Evaluation on a Known Protein Mixture. Even though the target-decoy search approach has been already widely verified in several previous applications as a very good FDR estimator, we tested our method on a data set whose constituent proteins are known a priori. We could confirm the accuracy of the reported errors. The set of spectra used in the experiment was obtained by merging the three data sets (DATA2) described in Pfeifer et al.13 (see section “MS/MS data” for details). The merge procedure was performed in order to produce a data set corresponding to a pseudomixture of 20 proteins. Note that we used the decoy DB approach on protein level as well. Hence, taking each data set separately, that is, using spectra of few proteins would not be appropriate to our experiments since decoy DB approaches may be inaccurate for small sets. Figure 12 contains plots of the predicted number of wrong identifications (number of decoy hits) vs the observed number of wrong identifications (hits not related to the expected proteins) recorded for various thresholds. On protein level, we ˆ values so that used thresholds to obtain a wider range of FDR more data points could be produced for regression. On both 2276
Journal of Proteome Research • Vol. 9, No. 5, 2010
levels, it can be seen a very good agreement between predicted and observed results, ratifying the accuracy in FDR estimation by the use of decoy DBs as proposed here.
Conclusion The target-decoy search strategy is established as a firstchoice method to FDR estimation in MS/MS spectrum interpretation assessment. However, the concern about sensitivity has been broadly neglected. In this work, besides taking advantage of decoy DB approaches for precision calculation, as traditionally done, we propose a multivariate analysis approach in form of an optimization problem to maximize sensitivity in peptide/ protein identification assessment. To obtain a fast convergence in the optimization procedure, we use a modified version of the hill-climbing heuristic. Our adjustments allow for the visitation of many local optima in contrast to the conventional algorithm. As a result, the chances to reach global optima, or at least get closer, are highly advanced. Comparisons with the classical (bivariate) decoy DB analysis and with the TransProteomic Pipeline (TPP) tools demonstrated that our method could present a significant higher sensitivity. Particularly for phosphorylated peptide and proteins, the differences between the MUDE approach and the TPP tools were impacting. The experiments shown for the eleven data sets of DATA1 were conducted using exactly the same parameter settings. In practice, however, the user has the chance of using suitable parameter values for particular data sets with specific characteristics, such that the best of our algorithms can be extracted. Still, although our experiments were performed using SEQUEST, our proposed approach is easily extendable to the output of other DB search algorithms such as X!Tandem34 and MASCOT. It is just a question of adapting the algorithms for the specific scores. Even though our initial motivation was the particular problem of assessing phosphoprotein identifications, the results
research articles
MUDE: A New Approach for Optimizing Sensitivity
Figure 12. Linear regression on peptide (a) and protein (b) levels. The x axis represents the observed number of wrong hits, while the y axis designates the predicted number of wrong hits. The regression demonstrates a very good correspondence between predicted and observed values.
have shown that the proposed method can be successfully applied to other kinds of data sets, demonstrating that the presented procedure constitutes a powerful tool in shotgun proteomics.
Acknowledgment. This work was supported by the Austrian Genome Program (GEN-AU), project Bioinformatics Integration Network (BIN III). Supporting Information Available: Supplemental tables. This material is available free of charge via the Internet at http://pubs.acs.org. References (1) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
(2) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551–3567. (3) Washburn, M. P.; Wolters, D.; Yates, J. R., III. Nat. Biotechnol. 2001, 19, 242–247. (4) Imanishi, S. Y.; Kochin, V.; Ferraris, S. E.; Thonel, A.; Pallari, H. M.; Corthals, G. L.; Eriksson, J. E. Mol. Cell. Proteomics 2007, 6, 1380– 1391. (5) Jiang, X.; Han, G.; Feng, S.; Jiang, X.; Ye, M.; Yao, X.; Zou, H. J. Proteome Res. 2008, 7, 1640–1649. (6) Cerqueira, F. R.; Morandell, S.; Ascher, S.; Mechtler, K.; Huber, L. A.; Pfeifer, B.; Graber, A.; Tilg, B.; Baumgartner, C. J. Proteomics Bioinform. 2009, 2, 150–164. (7) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383–5392. (8) Beausoleil, S. A.; Ville´n, J.; Gerber, S. A.; Rush, J.; Gygi, S. P. Nat. Biotechnol. 2006, 24, 1285–1292. (9) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43–50. (10) Lu, B.; Ruse, C.; Xu, T.; Park, S. K.; Yates, J. R., III. J. Anal. Chem. 2007, 4, 1301–1310. (11) Bianco, L.; Mead, J. A.; Bessant, C. J. Proteome Res. 2009, 8, 1782– 1791. (12) Nesvizhskii, A. I.; Roos, F. F.; Grossmann, J.; Vogelzang, M.; Eddes, J. S.; Gruissem, W.; Baginsky, S.; Aebersold, R. Mol. Cell. Proteomics 2006, 4, 652–670. (13) Pfeifer, N.; Leinenbach, A.; Huber, C. G.; Kohlbacher, O. BMC Bioinformatics 2007, 8, 468. (14) Russell, S. J.; Norvig, P. Artificial intelligence: A modern approach, 2nd ed.; Prentice Hall: Upper Saddle River, NJ, 2003. (15) Nesvizhskii, A. I.; Keller, A.; kolker, E.; Aebersold, R. Anal. Chem. 2003, 75, 4646–4658. (16) Schroeder, M. J.; Shabanowitz, J.; Schwartz, J. C.; Hunt, D. F.; Coon, J. J. Anal. Chem. 2004, 76, 3590–3598. (17) Dworzanski, J. P.; Snyder, A. P.; Chen, R.; Zhang, H.; Wishart, D.; Li, L. Anal. Chem. 2004, 76, 2355–2366. (18) Keller, A.; Eng, J.; Zhang, N.; Li, X.; Aebersold, R. Mol. Syst. Biol. 2005, 1–8. (19) Elias, J. E.; Gygi, S. P. Nat. Methods 2007, 4, 207–214. (20) Kersey, P. J.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.; Apweiler, R. Proteomics 2004, 4, 1985–1988. (21) Baumgartner, C.; Rejtar, T.; Kullolli, M.; Akella, L. M.; Karger, B. L. J. Proteome Res. 2008, 7, 4199–4208. (22) Bianco, L.; Mead, J. A.; Bessant, C. J. Proteome Res. 2009, 8, 1782– 1791. (23) Balgley, B. M.; Laudeman, T.; Yang, L.; Song, T.; Lee, C. S. Mol. Cell. Proteomics 2007, 6, 1599–1608. (24) Pfeifer, N.; Leinenbach, A.; Huber, C. G.; Kohlbacher, O. J. Proteome Res. 2009, 8, 4109–4115. (25) Elias, J. E.; Haas, W.; Faherty, B. K.; Gygi, S. P. Nat. Methods 2005, 2, 667–675. (26) Elias, J. E.; Gibbons, F. D.; King, O. D.; Roth, F. P.; Gygi, S. P. Nat. Biotechnol. 2004, 22, 214–219. (27) iang, X.; Dong, X.; Ye, M.; Zou, H. Anal. Chem. 2008, 80, 9326– 9335. (28) Giaginis, C.; Georgiadou, M.; Dimakopoulou, K.; Tsourouflis, G.; Gatzidou, E.; Kouraklis, G.; Theocharis, S. Dig. Dis. Sci. 2009, 54, 282–191. (29) Ballif, B. A.; Ville´n, J.; Beausoleil, S. A.; Schwartz, D.; Gygi, S. P. Mol. Cell. Proteomics 2004, 3, 1093–1101. (30) Ville´n, J.; Beausoleil, S. A.; Gerber, S. A.; Gygi, S. P. Proc. Natl. Acad. Sci. U.S.A. 2007, 104, 1488–1493. (31) Zanivan, S.; Gnad, F.; Wickstro¨m, S. A.; Geiger, T.; Macek, B.; Cox, J.; Fa¨ssler, R.; Mann, M. J. Proteome Res. 2008, 7, 5314–5326. (32) Pan, C.; Gnad, F.; Olsen, J. V.; Mann, M. Proteomics 2008, 8, 4534– 4546. (33) Li, H.; Xing, X.; Ding, G.; Li, Q.; Wang, C.; Xie, L.; Zeng, R.; Li, Y. Mol. Cell. Proteomics 2009, 8, 1839–1849. (34) Craig, R.; Beavis, R. C. Rapid Commun. Mass Spectrom. 2003, 17, 2310–2316.
PR901023V
Journal of Proteome Research • Vol. 9, No. 5, 2010 2277