A Refined Method To Calculate False Discovery Rates for Peptide Identification Using Decoy Databases Pedro Navarro and Jesu ´ s Va´zquez* Centro de Biologı´a Molecular Severo Ochoa (CSIC), Universidad Auto´noma de Madrid, 28049 Madrid, Spain Received May 19, 2008
Using decoy databases to estimate the number of false positive assignations is one of the most widely used methods to calculate false discovery rates in large-scale peptide identification studies. However, in spite of their widespread use, the decoy approach has not been fully standardized. In conjunction with target databases, decoy databases may be used separately or in the form of concatenated databases, allowing a competition strategy; depending on the method used, two alternative formulations are possible to calculate error rates. Although both methods are conservative, the separate database approach overestimates the number of false positive assignations due to the presence of MS/MS spectra produced by true peptides, while the concatenated approach calculates the error rate in a population that has a higher size than that obtained after searching a target database. In this work, we demonstrate that by analyzing as a whole the joint distribution of matches obtained after performing a separate database search, and applying the competition strategy, it is possible to make a more accurate calculation of false discovery rates. We show that both separate and concatenated approaches clearly overestimate error rates with respect to those calculated by the new algorithm, using several kinds of scores. We conclude that the new indicator provides a more sensitive alternative, and establishes for the first time a unique and integrated framework to calculate error rates in large-scale peptide identification studies Keywords: Peptide identification • false discovery rate • tandem mass spectrometry • SEQUEST • Mascot
Introduction One of the key techniques in modern Proteomics is the analysis of peptides by tandem mass spectrometry and their subsequent identification in databases. Although there are powerful algorithms for interrogating databases in search of sequences that correspond to the fragment spectra, interpreting the results of a database search is a subject that is still open to considerable controversy. In clear analogy with the microarray field, the use of false discovery rates (FDR) to determine the statistical significance associated to a list of identified peptides is becoming one of the most widely accepted approaches in the Proteomics community. One of the most robust and used methods to control for error rates, although not the only one, is based on the use of decoy databases to estimate the number of false peptide assignments.1 However, and in spite of the inherent simplicity of the decoy/target approach, it has been recently pointed out that the language for presenting and discussing these results has not been highly standardized nor formalized.2 Particularly, there is no clear consensus about the correct method to compute FDRs, the nature of the sequences that could serve as decoys, or whether the FDR should characterize the data set in aggregate or be associated with each item on the list.3 * To whom correspondence should be addressed. Phone: +34 91 196 4628. Fax: +34 91 196 4420. E-mail:
[email protected].
1792 Journal of Proteome Research 2009, 8, 1792–1796 Published on Web 01/22/2009
The simplest way to calculate the FDR associated to a given score threshold is to search separately the collection of MS/ MS spectra against a target and a decoy database of exactly the same size, and count out the number of decoy peptidespectrum matches (PSM) and of target PSM above threshold, and estimate the FDR by simply computing the ratio of these two values.4 Although this simple criterion is conservative and has been widely used, it does not take into account two major issues: (a) the collection of target PSM includes both correct and incorrect assignments, and hence, the number of decoy PSM that is considered for FDR calculation is overestimated; (b) MS/MS spectra corresponding to correct PSM tend to give higher scores (with most database searching schemes) against decoy sequences than those corresponding to incorrect PSM, and hence, the distribution of decoy scores does not accurately reflect the behavior of false assignments. An interesting discussion about the first issue was recently made by Ka¨ll et al.4 These authors provide a novel interpretation of decoy database approaches based on formal statistical procedures taken from the microarray field and suggest a method to correct the percentage of target PSM that are incorrect (PIT). This is conservatively made by considering the population of scores close to zero and calculating the ratio of decoy to target PSM within that set.4 The rationale behind this approach is that the majority of target PSM with small score values should correspond to incorrect PSM.4 In this context, a 10.1021/pr800362h CCC: $40.75
2009 American Chemical Society
research articles
Calculating False Discovery Rates plot of PIT versus score can help to estimate the PIT, and the FDR can then be adjusted by multiplying by this factor.4 A widely used method to deal with the second issue is a competition strategy whereby decoy and target sequences compete for the MS/MS spectra.1 This method is based on the idea that when a given PSM is correct, the target sequence is expected to produce a higher score than the decoy sequence; however, when the PSM is not correct, there is an equal probability of matching a target and a decoy sequence. Hence, if a composite or “concatenated” database containing both target and decoy sequences is used, the number of false assignments can be accurately estimated by doubling up the number of decoy PSM obtained above threshold.1 The differences and analogies between these two approaches have been recently pointed out by Fitzgibbon et al.2 The main difference is that in the competition strategy the total number of spectra above threshold contains both target and decoy PSM, while in the other approach, only the population of target PSM is taken into account for FDR calculation. A very interesting observation made by these authors is that in these two approaches the joint behavior of the results obtained against the two databases, containing potentially relevant information, is ignored.2 An immediate question that arises is whether it would be possible to design an FDR-calculation scheme that could integrate and conceal these two different viewpoints. In this work, we start from the joint behavior concept proposed by Fitzgibbon et al.2 and propose a refined algorithm that integrates the concepts from the two previous methods. The new method is more sensitive than the others and also calculates the FDR in the population of matches above threshold in the target database.
Experimental Section The data set used to test the method, containing more than 40 000 MS/MS spectra was taken from a previous work,5 and corresponded to the analysis of a total proteome from Jurkat cell extracts. SEQUEST data were processed using the probability ratio method as described.6 Database searches using Mascot were performed against the same decoy and target databases and using identical precursor mass windows and the same fixed and variable modifications.
this database where the multiple hypothesis is tested that all these peptides have been correctly identified. Therefore, the reference population where the FDR should be calculated is the collection of peptides identified in the target database. Rather than defining what is the correct reference population (which is unambiguously defined), the matter to be resolved is to determine the best approach to estimate the FDR within this population. A conclusion that follows from this point is that, if there are two error rate estimates of equal accuracy, we will consider as a better estimate the one that calculates the error rate in the correct reference population. Let us now assume that with the sole purpose of making a certain estimation of the FDR we make two separate database searches and analyze the joint structure of the results obtained against the decoy database (DDB) and the target database (TDB), as proposed by Fitzgibbon et al.2 This is done by plotting the best scores obtained against the DDB versus those obtained against the TDB, obtaining graphs like those in Figure 1. Since the points above the diagonal line in these graphs correspond to PSM yielding a better score in DDB, and those below the line to PSM with better scores in TDB, and since it is assumed that a spectrum which does not hit its true sequence will hit either a DDB or TDB entry with equal probability, it follows that the cloud of false positive PSM must be symmetrically distributed with respect to the diagonal line. Let us now assume that we establish a t score threshold and want to calculate the FDR within the population of PSM above this threshold. Although, as stated before, this reference population should only include PSM in TDB, in general and considering as a whole the results obtained against the two databases, the PSM having scores above t may be classified into four categories: those yielding a score above threshold in both TDB and DDB and being better in TDB (tb or “target better”) or in DDB (db or “decoy better”), and those yielding a score above threshold in one database but not in the other (do or “decoy only” and to or “target only”) (see Figure 1). In the conventional method, spectra are searched against TDB and DDB as separate databases and the number of PSM in DDB is used as an estimate for the number of false positive PSM in TDB. Hence, only the projection of the points onto the target and decoy axes are considered, and it is quite obvious that the FDR estimated by the separate database (SD) search method is given by
Results and Discussion We will start by establishing the two basic assumptions upon which we will base this work. The first one is the validity of the decoy approach for estimating false positive assignments, which we will presume by stating that false positive PSM have the same probability of being assigned to a decoy or to a target sequence. This point has been mainly analyzed in the Elias and Gygi paper,1 where evidence is presented supporting its validity, and will not be further discussed here. We will assume that the decoy databases are properly constructed, and that the number of peptide identifications is high enough to expect that the data will follow this behavior. The second assumption is that there is only one valid FDR definition, which may be stated as the proportion of false positive assignments in the collection of peptides matching the “real” database (i.e., the target database). The purpose of searching engines is to identify as many peptides as possible in a database of peptides that are expected to exist in nature. It is within the collection of peptides potentially identified in
FDRSD )
db + tb + do db + tb + to
(1)
Note that, although this method estimates the FDR in the correct reference population (the denominator), the tb region in the numerator may contain a certain proportion of PSM that may correspond to true positive PSM (Figure 1). As a consequence, the number of false positive PSM is overvalued, and therefore, this method tends to overestimate the FDR. Let us now assume that the spectra are searched against a composite database containing entries from both TDB and DDB, and that the competition strategy as described by Elias and Gygi1 is used to calculate FDR. This method includes in the reference population all the points that lie above threshold in either TDB or DDB (do + db + to + tb), and estimates the number of false positive PSM in this population as the double of assignments which get best score in DDB [2(do + db)]. Hence, the FDR calculated by the composite database (CD) search method is given by Journal of Proteome Research • Vol. 8, No. 4, 2009 1793
research articles 2(do + db) (2) do + db + tb + to It is evident that we may arrive to the same formula by exploiting the symmetry of the cloud of false positives with respect to the diagonal line, from which it is immediately seen that the collection of PSM yielding a better score in TDB (to + tb) contains a number of false positives equivalent to the size of the opposite region (do + db). The advantage of this strategy is that no PSM belonging to the cloud of true positives is included in the estimation of false positives, and thus, the main drawback of the separate database search method is avoided. Hence, this method seems to bring forth an improvement in comparison with the SD method. However, at this point, we cannot truly conclude that the CD method is better than the SD method, because the FDRCD )
Navarro and Va´zquez improvement in accuracy of false positive PSM estimation in the former is made at the expense of inflating the reference population by including PSM belonging to the do region (which correspond to peptide sequences that do not exist in nature) in the latter. The two FDR estimates are done in populations of different nature, and hence, we cannot conclusively affirm that an FDR estimate in the reference population is worse than a more accurate FDR estimate in an inflated population. This situation may be resolved by establishing a proper comparative framework on the basis of the joint distribution of Figure 1. The cloud of true positive PSM (TP) is common to the reference populations used by the two methods and can be directly compared. In the CD method, the number of TP may be estimated by subtracting to the size of the reference population (the denominators in eq 2) the number of estimated false positive PSM (the numerator), obtaining TPCD ) (tb + to) - (db + do). The FDR in the correct reference population is then calculated by estimating the false positive PSM by subtracting TPCD from this population (db + tb + to), and dividing the result by the same number: FDRTD )
Figure 1. Representation of the joint distribution of three different kind of scores among the decoy and target databases. The scores used are (A) SEQUEST Xcorr score; (B) probability ratio computed from SEQUEST scores as described,6 and (C) Mascot E-values. In B and C, the scores are presented in a co-logarithmic scale. Letters refer to the six regions of the joint distribution delimited by the score threshold (dashed lines) and the diagonal line, as described in the text. Figures indicate the number of PSM contained in the corresponding regions. Numbers in parentheses indicate the PSM obtained when the score is increased until the inflated population in the CD method becomes of the same size than the reference population in SD and TD methods. 1794
Journal of Proteome Research • Vol. 8, No. 4, 2009
do + 2db db + tb + to
(3)
Again, eq 3 could have been directly deduced from the joint distribution in Figure 1 by considering the symmetry with respect to the diagonal line of the three regions which are enclosed in the reference population: the db region, containing only false positive PSM, and the tb and to regions, which contain a number of false positives equivalent to the opposite db and do regions, respectively. We will call this refined estimate the target/decoy (TD) method. Comparing eq 3 with eq 1, it is now numerically evident why the competition strategy produces a better FDR estimate than the separate database search strategy. In comparison with the SD method, the TD method does not overestimate the number of false positive PSM in the numerator, and therefore, it is possible to affirm that the TD method is better than the SD method. Concerning to the CD and TD methods, they both are based on the same competition (or diagonal symmetry) strategy and, hence, have the same accuracy; the TD method may thus be viewed as a refined version of the CD method. However, and according to the second of our starting premises, we should consider that the TD method is better than the CD method, since it calculates the FDR in the correct reference population. A further point should be noted here: to calculate FDRTD, it is necessary to determine the size of the correct reference population, or, more precisely, of the db region, which cannot be calculated from the analysis of PSM obtained against a composite database. Therefore, to use the refined method, it is necessary to search separately against TDB and DDB. Accordingly, the TD method may also be viewed as a refinement of the separate database search method where the improvement comes from exploiting the symmetry along the diagonal line in the joint distribution to calculate the number of false positive PSM. As stated before, CD and SD methods use a different reference population, and therefore, their FDR estimates should not be directly compared in absence of an adequate framework, as proposed here. However, these estimates have been directly compared in other works, and it would be very illustrative to analyze here, on the basis of the joint distribution, the results that would be obtained by directly comparing the three
Calculating False Discovery Rates methods. If a fixed score threshold is set and the FDR is computed in the resulting populations, then we can compare the estimates by analyzing the results obtained from eqs 1, 2, and 3. Since tb g db and do g 0, we can conclude that, in the most general cases, FDRSD g FDRTD and FDRCD g FDRTD, and therefore, the TD method will always give the lowest FDR rate. The relative performance of FDRSD and FDRCD cannot, however, be generally predicted and mainly depends on the size of the do and tb regions. In the SEQUEST case (Figure 1A), and due to the correlation existing between decoy and target scores corresponding to true positive PSM, the tb region tends to have a high proportion of PSM and this penalizes the SD method, which produces consistently higher FDR estimates than the CD method, as reported previously.1,6 Figure 1A shows a representative result obtained from the analysis of a large collection of MS/MS spectra, and includes the number of PSM in each one of the four categories for a certain Xcorr threshold. From these numbers and eqs 1-3, we get FDRSD ) 0.19, FDRCD ) 0.15 and FDRTD ) 0.10, so that FDRSD is higher than FDRCD, and is about 50% higher than FDRTD. The correlation between target and decoy scores is expected to be less remarked when probability indicators such as the probability ratio,6 or the Mascot expectation values are used as scores (Figure 1B,C). In the Mascot case, the number of PSM in do increases (Figure 1C), and it penalizes the CD method, which yields an FDR estimate similar to that of the SD method (FDRSD ) 0.16; FDRCD ) 0.15); however, these two methods clearly overestimate the FDR in comparison with the TD method (FDRTD ) 0.10). When the probability ratio is used as score, the proportion of PSM in the tb region is lower than in the other methods; consistently, the SD method now produces a lower FDR estimate than the CD method (FDRSD ) 0.12; FDRCD ) 0.17); this result demonstrates in practice why the CD method cannot be, in general, claimed as better than the SD method. In this case, the TD method also gives a lower estimate (FDRTD ) 0.10) than the other two, as expected. The comparative behavior of all these estimates is maintained in a large range of FDR, as shown in Figure 2. The estimated error rates for the CD method may be different if the number of peptide identifications is taken as the reference to compare FDR values instead of the score threshold. This procedure would be followed to compare, for instance, identification performance curves, where the number of identified peptides is plotted against the FDR. For this purpose, the t-threshold used in the CD method has to be increased until its reference population becomes of the same size as that of the reference population used by the SD and TD methods (see numbers in parentheses in Figure 1). Note that in doing so we are implicitly assuming that the inflated population in the CD method would have the same error rate that a real reference population of the same size. Since error rates decrease when the size of the reference population decreases, in this scenario, the FDR estimates given by the CD method are always lower than those obtained using as a reference the score threshold (compare green and blue lines in Figure 2), although in all cases the FDRTD estimates clearly give lower values. As shown, FDRCD values now become similar to those of FDRSD when the probability ratio is used (Figure 2B), while in the Mascot case, FDRCD gives clearly lower estimates than FDRSD. All these results not only illustrate the complexity that may arise when FDR estimates of different nature are directly compared in the absence of a common framework, but also show how error rates estimated by the refined method proposed in this work
research articles
Figure 2. Comparison of false discovery rates obtained using the conventional separated database approach (FDRSD, red lines) and the concatenated database approach (FDRCD, green and blue lines) with those of the refined target/decoy approach proposed in this work (FDRTD). (A), (B), and (C) refer to the three scores indicated in Figure 1. Note that there are two FDRCD curves in each plot; one (FDRCD1) is obtained using the same score threshold (green lines), and the other (FDRCD2) is obtained using the same number of positive peptide identifications (blue lines). For further details see the main text.
are consistently and considerably lower than those estimated by the other methods in a wide range of situations. Finally, it is important to note that the symmetry of false PSM around the diagonal line of the joint distribution may also be exploited to estimate the number of true and false negative PSM and henceforth all the statistical measurements, including sensitivity, specificity and accuracy. For this purpose, it is necessary to consider the two remaining regions (tu or “target under” and du or “decoy under”) that lie below threshold at the two sides of the diagonal line (Figure 1). For instance, the difference in the size of tu and du serves as an estimate for the number of false negative PSM. The rest of measurement estimates are easily deduced and are described in Table 1. In practice, to calculate FDRTD, it is necessary not only to make a separate search against a target and a decoy database, but also to compare the results obtained in the two searches, in order to determine whether the scores associated to each one of the MS/MS spectra are better in TDB or in DDB, and Journal of Proteome Research • Vol. 8, No. 4, 2009 1795
research articles
Navarro and Va´zquez
Table 1. Estimation of Statistical Measurements from the Joint Distribution measurement
formulation
estimate
False positive True positive False negative True negative False discovery rate Precision
FP TP FN TN FP/(TP + FP)
2db + do tb - db + to - do tu - du do + 2du (2db + do)/(tb + to + db)
TP/(TP + FP)
Sensitivity
TP/(TP + FN)
Specificity
TN/(TN + FP)
Accuracy
(TP + TN)/(TP + TN + FP + FN)
(tb - db + to - do)/ (tb + to + db) (tb - db + to - do)/(tb db + to - do + tu - du) (do + 2du)/(2db + 2do + 2du) (tb - db + to + 2du)/(tb + to + tu + db + do + du)
thus calculate the size of the regions in Figure 1. This process is relatively fast and does not consume appreciable computational time in comparison with that needed to make the database searches. In conclusion, in this work, we propose a new algorithm to calculate error rates on the basis of the results obtained after performing separate searches against decoy and target databases. The new method integrates concepts developed in previously reported approaches. The refined algorithm is based on the identical behavior that is expected between false positive matches in the target database and the matches in the decoy database and also exploits the competition strategy of methods that search against concatenated databases. Although our algorithm does not obviate the current need of developing methods to construct decoy databases that reflect more accurately the behavior of false positive peptide identifications,7
1796
Journal of Proteome Research • Vol. 8, No. 4, 2009
it efficiently resolves some of the main issues existing with current methods. The refined method provides a superior and more sensitive alternative, and establishes for the first time a unique and integrated framework to calculate error rates in large-scale peptide identification studies.
Acknowledgment. This work was supported by grants BIO2003-01805, BIO2006-10085, GR/SAL/0141/2004 (CAM), CAM BIO/0194/2006, the Fondo de Investigaciones Sanitarias (Ministerio de Sanidad y Consumo, Instituto Salud Carlos III, RECAVA) and by an institutional grant by Fundacio´n Ramo´n Areces to CBMSO. References (1) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4 (3), 207–14. (2) Fitzgibbon, M.; Li, Q.; McIntosh, M. Modes of inference for evaluating the confidence of peptide identifications. J. Proteome Res. 2008, 7 (1), 35–9. (3) Tabb, D. L. What’s driving false discovery rates. J. Proteome Res. 2008, 7 (1), 45–6. (4) Kall, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 2008, 7 (1), 29–34. (5) Lopez-Ferrer, D.; Martinez-Bartolome, S.; Villar, M.; Campillos, M.; Martin-Maroto, F.; Vazquez, J. Statistical model for large-scale peptide identification in databases from tandem mass spectra using SEQUEST. Anal. Chem. 2004, 76 (23), 6853–60. (6) Martinez-Bartolome, S.; Navarro, P.; Martin-Maroto, F.; LopezFerrer, D.; Ramos-Fernandez, A.; Villar, M.; Garcia-Ruiz, J. P.; Vazquez, J. Properties of average score distributions of SEQUEST: the Probability Ratio method. Mol. Cell Proteomics 2008, 7, 1135– 45. (7) Choi, H.; Nesvizhskii, A. I. False discovery rates and related statistical concepts in mass spectrometry-based proteomics. J. Proteome Res. 2008, 7 (1), 47–50.
PR800362H