False Discovery Rates and Related Statistical Concepts in Mass Spectrometry-Based Proteomics

Hyungwon Choi†,‡ and Alexey I. Nesvizhskii*,†,§

Departments of Pathology and Biostatistics, and Center for Computational Medicine and Biology, University of Michigan, Ann Arbor, Michigan 48109

Received November 15, 2007

Development of statistical methods for assessing the significance of peptide assignments to tandem mass spectra obtained using database searching remains an important problem. In the past several years, a number of different approaches have emerged, including the concept of expectation values, the target-decoy strategy, and the probability mixture modeling approach of PeptideProphet. In this work, we provide a background on statistical significance analysis in the field of mass spectrometry-based proteomics, and present our perspective on current and future developments in this area.

Keywords: mass spectrometry • peptide identification • database searching • statistical validation • decoy sequences • false discovery rates

Introduction

Development of methods for assessing the statistical confidence of tandem mass spectrometry (MS/MS)-derived peptide and protein identifications is an active area of research. The manuscript by Käll et al.1 presents an introduction to statistical significance analysis as applied to the mass spectrometry-based peptide identification problem, in language clear and readable to nonstatisticians. In this Perspective, we would like to supplement the background on statistical significance analysis presented in ref 1, and discuss it in parallel with our own work on the development of statistical methods for assessing the confidence of peptide and protein identifications. The work by Käll et al.1 walks the reader through "significance analysis" in a progressive manner: from the basic measure, the p-value, to the multiple testing corrected false-discovery rate (FDR) and the q-value. It also highlights the importance of the decoy database as a generator of the score distribution of incorrect peptide matches, that is, what statisticians call the "null distribution" in the hypothesis testing framework.

From p-Value to Multiple Testing Correction

* To whom correspondence should be addressed. Email: nesvi@med.umich.edu. † Department of Pathology, University of Michigan. ‡ Department of Biostatistics, University of Michigan. § Center for Computational Medicine and Biology, University of Michigan. © 2008 American Chemical Society.

Statistical significance analysis is a straightforward procedure. In mass spectrometry-based proteomics, for each best matching peptide assignment to an MS/MS spectrum, the null distribution is often estimated by constructing the histogram of scores from all peptides (except the best scoring peptide that is being evaluated) in the searched sequence database that were scored against that particular spectrum. One can reference an observed search score (e.g., the hyperscore in X! Tandem) to the distribution of those random matches, and assign a significance


measure to the match. The farther away the observed score is located from the core of the null distribution, the more significant (i.e., the more likely to be correct) the match should be. This is the basic concept of the p-value, which is essentially the tail probability in the distribution generated from random matches. In practice, a related statistical measure, the expectation value or E-value, is used more often (it is reported by MASCOT, X! Tandem, OMSSA, and several other peptide identification tools). The E-value is the expected number of peptides with scores equal to or better than the observed score under the assumption that peptides match the experimental spectrum by random chance. We refer to p-values and E-values as single-spectrum statistical confidence measures.2 The p-value (or E-value, for that matter) alone, however, is not sufficient to keep the number of incorrect matches low among all accepted peptide matches. Multiple testing correction becomes an important issue when many peptide assignments must be validated "simultaneously" as correct or incorrect identifications. Even if one finds an extremely low p-value indicating high statistical significance for a particular match, when there are many MS/MS spectra in the data set, and thus many matches with similarly low p-values, by random chance alone there can be many incorrect matches among all the matches called correct. Allowing a large number of incorrect peptides is detrimental when protein-level inference is based on the peptide-level identifications. Therefore, it is crucial to find a reasonable error control procedure that provides a more conservative measure of significance than the p-value. There are numerous ways to perform multiple testing correction. One can start by adjusting the threshold p-value to achieve a specified overall error rate. The most classical adjustment is known as the "Bonferroni correction".
If one were to call all the matches with p-value less than 0.05, for example, the Bonferroni correction would make the criterion more stringent in proportion to the number of matches being validated. That is, if one had 10 000 matches to validate, the significance threshold p-value of 0.05 would be adjusted to 0.05/10 000 = 0.000005. As one can easily see, this adjustment makes the entire selection criterion extremely conservative; the adjustment was clearly not designed for the large data sets generated in genomics and proteomics, but for more modest problems.

[Journal of Proteome Research 2008, 7, 47–50. Published on Web 12/08/2007.]
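As a concrete illustration of the two quantities above, a minimal sketch of the empirical p-value (tail probability against a null score histogram) and the Bonferroni-adjusted threshold; the score values here are invented inputs, not data from any actual search:

```python
from bisect import bisect_left

def empirical_p_value(observed_score, null_scores):
    """Tail probability: fraction of null (random-match) scores >= observed."""
    null_sorted = sorted(null_scores)
    n_at_or_above = len(null_sorted) - bisect_left(null_sorted, observed_score)
    return n_at_or_above / len(null_sorted)

def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that controls the family-wise error rate at alpha."""
    return alpha / n_tests

# Hypothetical null (random-match) score distribution for one spectrum.
null_scores = [0.5, 1.0, 1.2, 1.5, 1.8, 2.0, 2.2, 2.5, 3.0, 4.0]
p = empirical_p_value(2.4, null_scores)    # 3 of 10 null scores are >= 2.4
print(p)                                   # 0.3
# An E-value is roughly this p-value times the number of candidate peptides scored.
print(bonferroni_threshold(0.05, 10_000))  # on the order of 5e-06
```

With 10 000 simultaneous tests, the per-match cutoff shrinks by four orders of magnitude, which is exactly why the correction is too conservative for proteomics-scale data.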

False-Discovery Rates

Therefore, a more (but not too) liberal measure of significance was introduced, the so-called false-discovery rate (FDR). The FDR is defined as the "expected" proportion of incorrect assignments among the accepted assignments at the global level. A pioneering work in this area was the procedure developed by Benjamini and Hochberg,3 which operates on ordered p-values under a set of reasonable assumptions, most importantly that all p-values are statistically independent; these assumptions have been relaxed in many subsequent works. Turning to the literature on mass spectrometry-based peptide identification, there have been previous proposals to control global error rates. A simple but popular approach is based on the use of the target-decoy database search strategy.4 The FDR for a classification rule that calls a peptide assignment correct if the search score, S, is above a certain threshold, ST, is defined as the expected proportion of incorrect target peptide matches, Ninc, among all target peptide matches passing that threshold, Nt, so that FDR ≈ Ninc/Nt. Since Ninc is not known, in most target-decoy search strategies Ninc is estimated as Nd, and thus the FDR as Nd/Nt, where Nd is the number of matches to decoy peptides passing the same score threshold. Alternatively, this is sometimes computed as 2Nd/(Nt + Nd). Regardless of the choice, the important underlying assumption is that the distribution of incorrect matches to target sequences is identical to that of matches to decoy peptides; thus, the total number of incorrect matches (to target and decoy sequences combined) is roughly double the number of decoy matches passing the given cutoff. This is obviously an approximation that may not hold in general, as discussed below. The accuracy of the simple FDR estimates obtained as described above depends on how the target-decoy database search was conducted.
In general, there are two options for performing such searches: either a single search against a concatenated target plus decoy database, or two separate searches against the target and the decoy databases. The work by Käll et al.1 discusses only the target-decoy search strategy that involves two separate searches. The authors correctly point out that the FDR computed simply as Nd/Nt is a conservative estimate. This is so because the number of decoy matches, Nd, overestimates Ninc when all spectra are allowed to match decoy database sequences, including those spectra that are identified correctly in the target database search. Thus, the global error control is more accurate when the validation process takes into account the proportion of peptide assignments in the data set that are incorrect (called PIT in ref 1). This correction was not applied in the previous work by Olsen and Mann,5 Higgs et al.,6 Weatherly et al.,7 or in other manuscripts cited in ref 1 that used the separate search strategy. We would like to point out, however, that in practice, in many if not most studies, the search is performed using the first option, that is, against the concatenated target-decoy database, which to some degree corrects for this proportion; see, for example, Elias et al.4 and references therein.
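The two estimates above reduce to simple counting. A minimal sketch (the score lists are hypothetical; only the Nd/Nt and 2Nd/(Nt + Nd) logic follows the text):

```python
def target_decoy_fdr(target_scores, decoy_scores, threshold, concatenated=True):
    """Estimate FDR at a score threshold from a target-decoy search.

    concatenated=True  -> FDR ~ 2*Nd / (Nt + Nd)  (single search, combined database)
    concatenated=False -> FDR ~ Nd / Nt           (two separate searches; conservative)
    """
    n_t = sum(s >= threshold for s in target_scores)  # accepted target matches
    n_d = sum(s >= threshold for s in decoy_scores)   # accepted decoy matches
    if concatenated:
        return 2 * n_d / (n_t + n_d) if (n_t + n_d) else 0.0
    return n_d / n_t if n_t else 0.0

# Hypothetical search scores.
target = [4.1, 3.8, 3.5, 3.2, 2.9, 2.6, 2.2, 1.9, 1.5, 1.1]
decoy = [2.8, 2.1, 1.8, 1.4, 1.0]
print(target_decoy_fdr(target, decoy, threshold=2.5))                      # 2*1/(6+1)
print(target_decoy_fdr(target, decoy, threshold=2.5, concatenated=False))  # 1/6
```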



Model-Based Approach to FDR

Efron et al.8 introduced a model-based method for calculating the FDR using mixture models. A mixture model is a statistical model that explains the distribution of interest as a mix of more than one component distribution, where the mixing proportion is a parameter of interest. In peptide identification, this proportion is the fraction of incorrect peptide assignments among all target peptide matches (dubbed PIT in ref 1 and p(-) in ref 10, but more conventionally denoted π0). The FDR calculated taking π0 into account is closer to the truth than the simple FDR, which makes perfect sense, since not all target matches are correct, as explained above. Efron et al.8 and Storey9 provide a way to conservatively estimate π0, which may or may not be close to the true proportion. In Käll et al.,1 this procedure for estimating π0 is illustrated in the context of the peptide identification problem. Since it may not be apparent from Käll et al.,1 we would like to point out that mixture model-based error control has already been in use in the area of mass spectrometry-based proteomics. The work by Keller et al.10 and Nesvizhskii et al.11 presented a version of the model-based calculation of FDR in the context of the peptide and protein identification problem; for a recent review, see Nesvizhskii et al.2 The software implementation, PeptideProphet, is in use across many laboratories conducting proteomics research. In Keller et al.,10 the probability of correct identification is estimated by exactly the same kind of mixture modeling as the one presented in Efron et al.,8 and as a result, probability-based filtering (e.g., requiring that the probability that a peptide assignment is correct satisfies P > 0.9) can be performed. In the original work,10 and in the PeptideProphet software implementation, however, the classification performance was presented in terms of conventional receiver operating characteristic (ROC) curves.
It has not been explicitly emphasized that the estimated probability of incorrect identification is connected to the local false-discovery rate (fdr), a numerical derivative of the FDR at a particular score. The FDR itself was described as the false positive identification error rate. The difference in terminology simply reflects the fact that the terms fdr and FDR were not yet widely used in the field at the time. This connection has recently been made explicit in Choi and Nesvizhskii.12
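To make the mixture idea concrete, here is a minimal sketch of a two-component mixture fit by expectation-maximization, with the posterior probability of incorrectness playing the role of the local fdr. The Gaussian component families and all numbers are illustrative assumptions only; the actual PeptideProphet model uses different component distributions:

```python
import math
import random

def _pdf(x, mu, sd):
    """Gaussian density."""
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def fit_mixture(scores, n_iter=100):
    """EM fit of a two-Gaussian mixture; component 0 models incorrect matches."""
    s = sorted(scores)
    n = len(s)
    mu0 = sum(s[: n // 2]) / (n // 2)        # initialize on the low-scoring half
    mu1 = sum(s[n // 2 :]) / (n - n // 2)    # initialize on the high-scoring half
    sd0 = sd1 = (s[-1] - s[0]) / 4
    pi0 = 0.5
    for _ in range(n_iter):
        # E-step: responsibility of the incorrect component for each score.
        r0 = [pi0 * _pdf(x, mu0, sd0) /
              (pi0 * _pdf(x, mu0, sd0) + (1 - pi0) * _pdf(x, mu1, sd1)) for x in s]
        # M-step: update mixing proportion, means, standard deviations.
        w0 = sum(r0)
        pi0 = w0 / n
        mu0 = sum(r * x for r, x in zip(r0, s)) / w0
        mu1 = sum((1 - r) * x for r, x in zip(r0, s)) / (n - w0)
        sd0 = max(math.sqrt(sum(r * (x - mu0) ** 2 for r, x in zip(r0, s)) / w0), 1e-3)
        sd1 = max(math.sqrt(sum((1 - r) * (x - mu1) ** 2
                                for r, x in zip(r0, s)) / (n - w0)), 1e-3)
    return pi0, (mu0, sd0), (mu1, sd1)

def local_fdr(x, pi0, c0, c1):
    """Posterior probability that the match at score x is incorrect."""
    a = pi0 * _pdf(x, *c0)
    b = (1 - pi0) * _pdf(x, *c1)
    return a / (a + b)

# Simulated search scores: 60% incorrect ~ N(1, 0.5), 40% correct ~ N(4, 0.7).
random.seed(0)
scores = ([random.gauss(1.0, 0.5) for _ in range(600)]
          + [random.gauss(4.0, 0.7) for _ in range(400)])
pi0, c0, c1 = fit_mixture(scores)
print(round(pi0, 2))                       # close to the true 0.6
print(local_fdr(4.5, pi0, c0, c1) < 0.05)  # high score -> small local fdr
```

The estimated π0 recovers the fraction of incorrect matches without any decoy database, which is the sense in which the unsupervised approach determines the null component "completely from the data".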

Multiple Testing Correction via q-Value

The manuscript by Käll et al.1 also presents a discussion, using mass spectrometry terminology, of a particular multiple testing correction method, the q-value. Despite the theoretical advantages of the q-value over the FDR, it remains unclear how crucial the difference is in practical terms. According to the authors, first, the q-value is defined for individual matches, not for a set of peptide matches above a threshold. Second, as a consequence, the q-value is a monotone function of the database search score (the greater the score, the smaller the q-value), which makes it more interpretable than the FDR. These properties are clearly desirable in the sense that the set of matches ordered by the search score will match the set ordered by q-values. The advantage nonetheless diminishes when one considers the peptide identification process in a larger context. First, most database search tools produce many other companion features, such as the mass difference between the measured and calculated peptide mass, the number of peptide termini not consistent with the enzymatic digestion (tryptic versus nontryptic), and so forth. These auxiliary parameters have been found helpful in many previous works; see Nesvizhskii et al.2 for a review. In the presence of this extra information, ordering by search score alone becomes less meaningful. For example, if a high mass accuracy instrument is used, a peptide match with, say, a SEQUEST Xcorr score of 3.1 and a mass error of 0.05 Da may be more likely to be correct than another match with a score of 3.5 but a larger mass error of 0.2 Da. Therefore, although the use of the q-value may still retain good control of the overall false-discovery rate, its power to detect correct identifications is essentially the same as that of the regular FDR. While, in general, the q-value estimation procedure could be extended to the case where a composite (and more discriminative) score is created from the individual discriminant features, this is not discussed in Käll et al.1 Even more importantly, q-values should still be considered a global error rate measure, albeit one with better properties than the FDR, whereas local estimates such as the fdr are more relevant in the practical setting, as we argue in the following section.
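The monotonicity property has a simple computational form: the q-value of a match is the minimum FDR over all score thresholds at which that match is still accepted. A sketch, using the simple Nd/Nt estimate from a separate target/decoy search and hypothetical score lists:

```python
from bisect import bisect_left

def q_values(target_scores, decoy_scores):
    """Per-match q-values from a separate target/decoy search (FDR ~ Nd/Nt).

    q(match) = min FDR over all thresholds at which the match is accepted,
    which makes the q-value monotone in the search score.
    """
    decoys = sorted(decoy_scores)
    order = sorted(range(len(target_scores)), key=lambda i: -target_scores[i])
    fdrs = []
    for rank, i in enumerate(order, start=1):  # rank = Nt at this match's threshold
        n_d = len(decoys) - bisect_left(decoys, target_scores[i])  # decoys >= score
        fdrs.append((i, n_d / rank))
    qs = [0.0] * len(target_scores)
    running_min = float("inf")
    for i, fdr in reversed(fdrs):              # cumulative minimum, worst to best
        running_min = min(running_min, fdr)
        qs[i] = running_min
    return qs

# Hypothetical target and decoy search scores.
target = [4.0, 3.0, 2.0, 1.0]
decoy = [2.5, 1.5, 0.5]
print(q_values(target, decoy))   # [0.0, 0.0, 0.3333333333333333, 0.5]
```

Note that the q-values never decrease as the score decreases, unlike raw threshold-level FDR estimates, which can fluctuate.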

FDR at the Level of Distinct Peptides and Proteins

The ability to estimate the global FDR at the spectrum level is hardly of any practical significance by itself. We would like to argue that the more useful statistical measures at the level of peptide assignments to spectra are not the global measures but the individual probabilities (or local false-discovery rates) computed for each assignment in the data set. Assignment of peptides to MS/MS spectra is an intermediate step in the analysis leading to the identification of the proteins present in the sample prior to digestion. As discussed in detail in Nesvizhskii and Aebersold,13 protein identification via mass spectrometry is complicated by a number of factors. First, in shotgun proteomics, the connectivity between peptides and proteins is lost, leading to ambiguities in assigning peptides to entries in the protein sequence database (the protein inference problem). Second, there is a nonrandom grouping of peptides to proteins, resulting in an amplification of error rates in going from the spectrum level to the distinct peptide and protein levels, especially in the case of proteins identified by a single peptide. Several solutions have been presented, including the method implemented in ProteinProphet and other approaches, as reviewed in Nesvizhskii et al.2 Still, the protein-level models based on peptide identification are clearly far from complete.
The ideal protein identification model would consider (1) individual probabilities of peptide matches to MS/MS spectra (spectrum-level properties); (2) peptide-level properties, such as the number of MS/MS spectra supporting the peptide identification, and evidence from different peptide ion charge states and sequence modification states; and (3) protein-level properties, such as relative protein abundance in the cell, protein length, and the number of expected peptides (which in turn depends on the physicochemical properties of the peptides and the sequence digestion properties). As an outcome, the model would calculate the probability of correct protein identification jointly for all identified proteins, addressing the multiple mapping issue between peptides and proteins. One may also naively assume that, at least as far as error rate estimation is concerned, the protein-level problem can be addressed by the same target-decoy search strategy discussed above in the context of the peptide identification problem. However, we would like to argue against this simplification,

which brings the discussion back to the choice of the proper null distribution, both at the peptide and at the protein level.
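The amplification of error rates from the spectrum level to the protein level, driven by single-peptide protein identifications, can be illustrated with a toy simulation (all numbers here are invented for illustration, not measurements):

```python
import random

random.seed(1)

# 10 000 spectrum-level matches accepted at a 10% PSM-level FDR.
n_psms, psm_fdr = 10_000, 0.10
n_incorrect = int(n_psms * psm_fdr)

# Correct matches concentrate on a few hundred true proteins (many peptides each),
# while incorrect matches scatter across a large database, one peptide per protein.
correct = [f"P{random.randrange(500)}" for _ in range(n_psms - n_incorrect)]
incorrect = [f"X{random.randrange(50_000)}" for _ in range(n_incorrect)]

true_proteins, false_proteins = set(correct), set(incorrect)
protein_fdr = len(false_proteins) / (len(true_proteins) + len(false_proteins))
print(round(protein_fdr, 2))   # far larger than the 0.10 spectrum-level FDR
```

Because the 1000 incorrect spectrum matches spread over nearly 1000 distinct (mostly single-hit) false proteins, while the 9000 correct matches collapse onto only ~500 true proteins, the protein-level FDR is several times the spectrum-level FDR.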

Choice of Proper Null Distribution

Obtaining a proper null distribution (the score distribution of incorrect peptide matches) is crucial for significance assessment.1,12 This point follows directly from the context: in FDR estimation methods that rely completely on decoy peptides, the accuracy of the estimates depends on how "representative" the decoy matches are of incorrect identifications. In practice, especially in a database search where decoys are formed simply by reversing the sequences of the target database itself (the most commonly used approach), it is difficult to assess whether some underlying factor(s) may have driven the null distribution away from the true negative distribution, even slightly. Such deviations may well affect the validation of peptide assignments near the score cutoff. In this regard, it is worth pointing out that the mixture modeling approach of ref 10 possibly has an advantage in that the distribution of incorrect peptide matches is allowed to deviate from the distribution of decoy peptides. In the unsupervised implementation of Keller et al.,10 it is determined completely from the data (without the use of decoys), but the model can also benefit from the presence of decoy peptides, as described in refs 12 and 14. Finally, we would like to discuss what we believe is a fundamental difficulty in the process of assessing the significance of peptide and protein identifications, using the target-decoy search strategy or any strategy for that matter. On the basis of our experience with mass spectrometry data analysis, we propose that incorrect peptide assignments (and protein identifications) should be considered as coming from two different sources. The first kind of incorrect matches are truly random, in the sense that the matched peptide sequence has no significant homology to the true peptide that produced the MS/MS spectrum. The second kind includes incorrect matches to sequences homologous to the true peptides.
When creating decoys by reversing or randomizing target protein sequences, one can at best derive an accurate representation of the distribution of the first type of incorrect peptide matches (truly random matches). The second source of false positives remains underestimated. While the problem may be less severe at the level of spectral matches, it can produce a significant bias in the error rate estimates derived at the protein level. Several examples of erroneous peptide matches that would clearly be assigned less-than-accurate probabilities (or q-values, for that matter) can be found in Nesvizhskii et al.15 In particular, that work presents examples of high-scoring spectrum matches to sequences in a searched expressed sequence tag (EST) database that are likely to be false positives, in cases where the true sequences were highly abundant peptides chemically modified during sample preparation. Recently, an attempt was made to generate decoy databases using more sophisticated rules that preserve some of the sequence homology of the target database.16 Ultimately, this problem can only be addressed by accurate analysis and a complete understanding of the sources of false-positive peptide and protein identifications. This, in turn, should open the possibility of designing methods for more realistic generation of decoy protein sequences.
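For reference, the two common decoy constructions mentioned above are simple transformations; this sketch uses a made-up sequence:

```python
import random

def reversed_decoy(protein_seq):
    """Decoy by full sequence reversal (the most commonly used construction)."""
    return protein_seq[::-1]

def shuffled_decoy(protein_seq, seed=0):
    """Decoy by random shuffling; preserves only amino acid composition."""
    residues = list(protein_seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)

target = "MKWVTFISLLFLFSSAYS"   # hypothetical protein sequence
print(reversed_decoy(target))    # SYASSFLFLLSIFTVWKM
print(shuffled_decoy(target))
```

Both constructions preserve sequence length and amino acid composition, but neither reproduces matches to sequences homologous to the true peptides, which is precisely the second source of false positives discussed above.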

Acknowledgment. This work was supported in part by NIH/NCI Grant CA-126239.

References

(1) Käll, L.; Storey, J.; MacCoss, M.; Noble, W. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 2008, 7, 29–34.
(2) Nesvizhskii, A. I.; Vitek, O.; Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 2007, 4, 787–797.
(3) Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc., Ser. B 1995, 57, 289–300.
(4) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207–214.
(5) Olsen, J. V.; Mann, M. Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (37), 13417–13422.
(6) Higgs, R. E.; Knierman, M. D.; Bonner-Freeman, A.; Gelbert, L. M.; Patil, S. T.; Hale, J. E. Estimating the statistical significance of peptide identifications from shotgun proteomics experiments. J. Proteome Res. 2007, 6, 1758–1767.
(7) Weatherly, D. B.; Atwood, J. A.; Minning, T. A.; Cavola, C.; Tarleton, R. L.; Orlando, R. A heuristic method for assigning a false discovery rate for protein identifications from Mascot database search results. Mol. Cell. Proteomics 2005, 4, 762–772.
(8) Efron, B.; Tibshirani, R.; Storey, J. D.; Tusher, V. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 2001, 96 (456), 1151–1161.
(9) Storey, J. D. A direct approach to false discovery rates. J. R. Stat. Soc., Ser. B 2002, 64, 479–498.
(10) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–5392.
(11) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 2003, 75, 4646–4658.
(12) Choi, H.; Nesvizhskii, A. I. Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. J. Proteome Res. 2008, 7, 254–265.
(13) Nesvizhskii, A. I.; Aebersold, R. Interpretation of shotgun proteomic data: the protein inference problem. Mol. Cell. Proteomics 2005, 4, 1419–1440.
(14) Choi, H.; Ghosh, D.; Nesvizhskii, A. I. Statistical validation of peptide identifications in large-scale proteomics using target-decoy database search strategy and flexible mixture modeling. J. Proteome Res. 2008, 7, 286–292.
(15) Nesvizhskii, A. I.; Roos, F. F.; Grossmann, J.; Vogelzang, M.; Eddes, J. S.; Gruissem, W.; Baginsky, S.; Aebersold, R. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Mol. Cell. Proteomics 2006, 5, 652–670.
(16) Feng, J.; Naiman, D. Q.; Cooper, B. Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies. Bioinformatics 2007, 23, 2210–2217.
