Unbiased False Discovery Rate Estimation for Shotgun Proteomics


Article

Unbiased False Discovery Rate Estimation for Shotgun Proteomics Based on Target-Decoy Approach
Lev I. Levitsky, Mark V. Ivanov, Anna A. Lobas, and Mikhail V. Gorshkov
J. Proteome Res., Just Accepted Manuscript • DOI: 10.1021/acs.jproteome.6b00144 • Publication Date (Web): 28 Nov 2016


Journal of Proteome Research is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


Journal of Proteome Research

Unbiased False Discovery Rate Estimation for Shotgun Proteomics Based on Target-Decoy Approach

Lev I. Levitsky,†,‡ Mark V. Ivanov,†,‡ Anna A. Lobas,†,‡ and Mikhail V. Gorshkov∗,†,‡

†Moscow Institute of Physics and Technology, Dolgoprudny, Moscow Region, Russia
‡V.L. Talrose Institute for Energy Problems of Chemical Physics, Russian Academy of Sciences, Moscow, Russia

E-mail: [email protected]
Phone: +7 499 1378257

Abstract

The target-decoy approach (TDA) is the dominant strategy for false discovery rate (FDR) estimation in mass-spectrometry-based proteomics. One of its main applications is direct FDR estimation based on counting decoy matches above a certain score threshold. The corresponding equations are widely employed for filtering peptide or protein identifications. In this work we consider a probability model describing the filtering process and find that, when decoy counting is used for q value estimation and subsequent filtering, a correction has to be introduced into these common equations for TDA-based FDR estimation. We also discuss the scale of variance of the false discovery proportion (FDP) and propose using confidence intervals for more conservative FDP estimation in shotgun proteomics. The necessity of both the correction and the use of confidence intervals is especially pronounced when filtering small sets (such as in proteogenomics experiments) and when using very low FDR thresholds.

Keywords

proteomics, false discovery rate, target-decoy approach

Introduction

In shotgun proteomics, a search engine typically produces a large set of peptide-spectrum matches (PSMs), which raises the problem of their quality assessment. 1 While the p value has traditionally been central to determining the statistical significance of experimental measurements, 1 it does not by itself account for the multiple testing scenario of database search and thus must be corrected. 2 For this reason, the notion of false discovery rate (FDR) 3–5 was readily adopted, as it is inherently well-suited for shotgun proteomics. 2 A typical FDR controlling procedure starts with a list of PSMs sorted by search score and considers all possible score thresholds, while estimating the corresponding values of FDR. A number of estimation procedures have been proposed, including ones based on empirical p values 4 and posterior error probabilities (PEPs). 6,7

The introduction of the target-decoy approach (TDA) 8–10 provided new means of estimating the statistical significance of PSMs. While major post-search validation algorithms assume certain score distributions and employ decoy PSMs for model training and subsequent estimation of PEPs, 6,7,11 a number of popular search engines 12–16 implement a simpler algorithm based solely on TDA. As of version 2.10, Percolator 6 also uses the “target-decoy competition” mode by default. This algorithm is agnostic of score distributions 17 and uses PSM scores for ranking only. 18 The decoy counting approach has been shown to be consistent with PEP-based FDR estimation using several major scoring schemes. 11 In this work we discuss the naïve method of FDR estimation and filtering, based on PSM sorting and
decoy counting.

Throughout this paper, we use the following terminology adopted from earlier works. 19,20 By false discovery proportion (FDP) we mean the actual proportion of false positive matches above a certain score threshold. 19 By false discovery rate (FDR) we mean the expected value of FDP: $FDR = E(FDP)$, 19 where the expectation is taken with regard to a fixed score threshold. 20 To denote the estimated value of FDR, we use $\widehat{FDR}$, which can be considered as a function of the score threshold. We refer to the q value of a PSM as the minimal FDR threshold at which that PSM is accepted, following the definition by Käll et al. 21,22 Note that making a distinction between “real” and estimated FDR naturally leads to making the same distinction between “real” and estimated q values, and any method of q value estimation is based on the corresponding method of FDR estimation.

The main assumption behind TDA is that the probability of a false match coming from the target or decoy database is proportional to their relative sizes. If the size of the decoy database is equal to that of the target database, then a false PSM is equally likely to originate from the target or the decoy database. With this assumption in mind, the numbers of decoy and target false PSMs are usually considered equal, and FDR is estimated using the following formula: 18,23

$$\widehat{FDR} = \frac{d}{t} \tag{1}$$

where $d$ and $t$ are the numbers of decoy and target PSMs in the set, respectively. In the more general case, when the sizes of the decoy and target databases are not equal, 18 Eq. 1 takes the form:

$$\widehat{FDR} = \frac{d}{r t} \tag{2}$$

where $r$ is the size ratio between the decoy and target databases.

The procedure of filtering a set of PSMs to a desired FDR level can be modelled in the
following way. First, the list of PSMs is sorted according to their scores. Then, FDR is estimated for each sublist $S_i$ containing the top $i$ PSMs:

$$\hat{f}_i = \widehat{FDR}(S_i) \tag{3}$$

After that, q values for all PSMs can be estimated using the above-mentioned definition as follows:

$$\hat{q}_i = \min_{j \geq i} \hat{f}_j \tag{4}$$

By definition, the calculated q values grow monotonically in the list of sorted PSMs. The last step in the filtering procedure is to locate the last position in the list at which the estimated q value does not exceed the desired FDR level $F$:

$$n = \max \{ j : \hat{q}_j \leq F \} \tag{5}$$
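The procedure of Eqs. 3–5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names `estimate_q_values` and `filter_position` are hypothetical, the input is assumed to be a list of decoy flags already sorted from best to worst score, and Eq. 2 is assumed as the estimator in Eq. 3.

```python
from typing import List, Sequence, Tuple


def estimate_q_values(is_decoy: Sequence[bool],
                      ratio: float = 1.0) -> Tuple[List[float], List[float]]:
    """Return (f_hat, q_hat) for PSMs sorted from best to worst score.

    Hypothetical helper: `is_decoy[i]` is True for a decoy PSM,
    `ratio` is the decoy-to-target database size ratio r.
    """
    f_hat = []
    d = t = 0
    for decoy in is_decoy:
        if decoy:
            d += 1
        else:
            t += 1
        # Eq. 2 applied to the top-i sublist (Eq. 3);
        # a sublist without targets gets an infinite estimate.
        f_hat.append(d / (ratio * t) if t else float("inf"))
    # Eq. 4: running minimum taken from the end of the list.
    q_hat = f_hat[:]
    for i in range(len(q_hat) - 2, -1, -1):
        q_hat[i] = min(q_hat[i], q_hat[i + 1])
    return f_hat, q_hat


def filter_position(q_hat: Sequence[float], fdr: float) -> int:
    """Eq. 5: number of PSMs kept, n = max{j : q_hat_j <= fdr} (0 if none)."""
    n = 0
    for j, q in enumerate(q_hat, start=1):
        if q <= fdr:
            n = j
    return n


# Toy list: 4 targets, 1 decoy, 5 targets, 2 decoys, 1 target.
labels = [False] * 4 + [True] + [False] * 5 + [True, True] + [False]
f_hat, q_hat = estimate_q_values(labels)
n = filter_position(q_hat, fdr=0.15)
```

In this toy list the filtered sublist contains the top 10 PSMs, and the 11th PSM is a decoy.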

The sublist $S_n$ is the filtered PSM list corresponding to the desired FDR threshold $F$.

The described procedure is abstract, in the sense that it does not specify how FDR is estimated in Eq. 3. If the estimation is performed using Eq. 2, the sequence $\hat{f}_i$ is not monotonic, because $\hat{f}_i$ decreases with each target PSM and increases with each decoy PSM. In the subsequent step (Eq. 4) the sequence is “monotonized”, but the estimated q values still increase at decoy PSMs only. As a result, the score threshold is always set so that a decoy PSM occurs next in the list, immediately below the threshold. Intuitively, this fact must introduce an imbalance between target and decoy false matches above the score threshold, which is not accounted for in the equations above. This effect may contribute to certain drawbacks of TDA pointed out elsewhere, 24 such as the estimated zero FDR value for PSM sets without any decoy matches. It was also mentioned in the cited work that TDA yields inaccurate results for data sets with small numbers of spectra. 24 It should also be
emphasized that, because of the random nature of false matches, the number of false target PSMs above the threshold (and, accordingly, the FDP) is a random variable, and as such should not be characterized exclusively by its expected value. The variance of FDP typically remains unaddressed in proteomics studies, although error estimations have been performed, both theoretically 1,25 and using simulations. 8

Herein we analyze the filtering process and propose a probability model of TDA-based filtering, which allows direct calculation of the expected value and variance of FDP, and introduce the necessary correction to Eq. 2 as an attempt to address the above-mentioned drawbacks of TDA. We also propose using confidence intervals for conservative FDR estimation and present a freely available implementation of the proposed algorithm. Although we use PSM-level FDR for the following discussion, the same reasoning applies equally to peptide- and protein-level FDR filtering, if it is based on TDA.

Results and discussion

Probability model

During the spectrum matching process, each spectrum is considered and assigned independently of the others. According to TDA, false PSMs are distributed uniformly across the target and decoy databases, so for any given spectrum which is not assigned the correct peptide sequence, the probability, $p$, of this spectrum being assigned a peptide from the target database is determined by the decoy-to-target database size ratio, $r$, as follows:

$$p = \frac{1}{1 + r} \tag{6}$$

Thus, each false PSM can be represented by a Bernoulli trial, and if one iterates over the filtered PSM list, $S_n$, then all occurring false PSMs can be viewed as a Bernoulli process, i.e. a series of independent Bernoulli trials, or black and white balls being drawn from an urn
(with replacement). Black and white balls correspond to decoy PSMs and false PSMs from the target database, respectively. According to the filtering procedure described above, if the stopping criterion is tied to a specific FDR value estimated using Eq. 2, the FDR threshold is always set immediately before a decoy PSM. Let $d$ be the number of decoy PSMs above the FDR threshold, and $T$ be the random variable denoting the number of false target PSMs above the same threshold. Although $T$ has been modelled before using the binomial distribution, 25 that approach requires artificial normalization to obtain the probability distribution of $T$. Moreover, it does not seem to take into account that the iteration stops when the $(d+1)$-th decoy PSM is found (and not included in the output). This dictates the following formulation of our urn problem: how many white balls will be drawn from the urn before the $(d+1)$-th black ball is drawn (given the constant probability $p$ of drawing a white ball in a trial)? In this case, the number of false PSMs from the target database above the threshold, $T$, follows the negative binomial distribution with parameters $d+1$ and $p$: $T \sim NB(d+1; p)$. Its probability mass function is then:

$$P(T = k) = \binom{k+d}{k} p^k (1-p)^{d+1}, \tag{7}$$

where $P$ denotes probability. Then, the expected value of $T$, $E(T)$, is given by:

$$E(T) = (d+1) \cdot \frac{p}{1-p} = \frac{d+1}{r} \tag{8}$$

This means that, since we consider a sublist of PSMs that is “cut” just before a decoy, we should expect that, on average, it contains $\frac{d+1}{r}$ false targets, rather than $\frac{d}{r}$, as implied by the common formula. Interestingly, this is consistent with the results reported by He et al. 26 and Huttlin et al. 25 Indeed, the normalization procedure used by Huttlin et al. effectively converts the binomial distribution into negative binomial (see Supporting Information for formal proof).
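Eqs. 7 and 8 can be checked numerically with the standard library alone. The sketch below is only an illustration: the helper names `nb_pmf` and `nb_mean_numeric` are hypothetical, and the infinite sum defining the expected value is truncated at an assumed cutoff `kmax`, which is adequate for the small values of $d$ considered here.

```python
from math import comb


def nb_pmf(k: int, d: int, p: float) -> float:
    """Eq. 7: P(T = k) for T ~ NB(d + 1; p)."""
    return comb(k + d, k) * p ** k * (1.0 - p) ** (d + 1)


def nb_mean_numeric(d: int, ratio: float = 1.0, kmax: int = 500) -> float:
    """Expected value of T obtained by direct (truncated) summation of Eq. 7."""
    p = 1.0 / (1.0 + ratio)  # Eq. 6
    return sum(k * nb_pmf(k, d, p) for k in range(kmax + 1))


# Compare the direct summation with the closed form (d + 1) / r of Eq. 8.
for d, ratio in [(1, 1.0), (9, 1.0), (4, 2.0)]:
    print(d, ratio, nb_mean_numeric(d, ratio), (d + 1) / ratio)
```

For each pair of parameters the two printed values agree up to floating-point and truncation error.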

However, Eqs. 7 and 8 above imply that $T$ can assume any value from 0 to infinity. In fact, the range of possible values for $T$ is limited by the number of all target matches above the applied FDR threshold, $t$. The corresponding corrections are discussed in Supporting Information (S1). We show that, in the typically used range of $FDR \leq 5\%$, Eq. 8 can be safely used for FDR estimation (see Supplementary figure S1). By replacing $\frac{d}{r}$ with the corrected estimation from Eq. 8, we obtain the more correct form of Eq. 2:

$$\widehat{FDR} \approx \frac{E(T)}{t} = \frac{d+1}{r \cdot t} \tag{9}$$

Eq. 9 represents the main outcome of the considered model. In the most common case, $r = 1$, it takes the form:

$$\widehat{FDR} \approx \frac{d+1}{t} \tag{10}$$

Thus, with $r = 1$ the model predicts that the expected number of false target PSMs above the FDR threshold does not equal the number of decoys, $d$, but is actually closer to $d + 1$. This means that Eqs. 1 and 2 have an optimistic bias and should be corrected if employed for FDR filtering. Eq. 10 can also be obtained if FDR is estimated using decoy-based empirical p values, adjusted to get the correct type I error rate. 2,27
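As a small numerical illustration of Eqs. 2 and 9, the sketch below compares the uncorrected and corrected estimates for a few decoy and target counts. The function names are hypothetical and chosen for illustration only; the counts are arbitrary examples, not data from this work.

```python
def fdr_uncorrected(d: int, t: int, ratio: float = 1.0) -> float:
    """Eq. 2: decoy count divided by (r * target count)."""
    return d / (ratio * t)


def fdr_corrected(d: int, t: int, ratio: float = 1.0) -> float:
    """Eq. 9: the same estimate with the "+1" correction."""
    return (d + 1) / (ratio * t)


if __name__ == "__main__":
    # (d, t) pairs: no decoys in a small set, one decoy, and a larger set.
    for d, t in [(0, 50), (1, 100), (10, 1000)]:
        print(d, t, fdr_uncorrected(d, t), fdr_corrected(d, t))
```

The relative effect of the correction is largest when $d$ is small. In particular, a filtered set without decoys no longer receives a zero FDR estimate.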

Calculation of confidence intervals

If the number of decoy matches above the threshold is low, the effect of the correction becomes more apparent. This is often the case in proteogenomics studies and other settings aimed at detecting rare events. However, even with the correction, the relative deviation of FDP from the unbiased FDR estimate may be very significant. For example, if there is one decoy PSM above the threshold (corresponding to 100 PSMs at “1% FDR” or 1000 PSMs at “0.1% FDR”, based on the uncorrected estimation), then the uncorrected expected number
of false target PSMs in the filtered set is 1, while the correct expected value is 2. Direct calculation of $P(T \leq 2)$ with $d = 1$ and $r = 1$ using Eq. 7 shows that there is a ∼31% probability that 3 or more false target PSMs are in the filtered set, which means a 3-fold deviation of FDP from the FDR estimation without correction, or a 1.5-fold deviation if the correct expected value is used. For this reason, it may be more practical to consider confidence intervals of $T$, rather than just its expected value. Although Huttlin et al. propose using pseudo-symmetric confidence intervals, 25 we argue that it may be more sensible to consider intervals with their left boundary fixed at 0, so that they correspond to the probability of FDP exceeding a certain value. Given the already calculated probability distribution (Eq. 7), we can introduce the “confidence value”, $C_\alpha(T)$, as the right border of such an interval for confidence level $\alpha$:

$$C_\alpha(T) \equiv \inf \{ k \in \mathbb{Z}_0^+ : P(T \leq k) \geq \alpha \}, \tag{11}$$

where $\mathbb{Z}_0^+$ is the set of all non-negative integers. Thus, $C_\alpha(T)$ is the lowest non-negative integer $k$ such that $P(0 \leq T \leq k) \geq \alpha$. The confidence value is an upper bound, at confidence level $\alpha$, for the number of false targets above a certain threshold. In the example above ($d = 1$), the 95% confidence value for $T$ is $C_{0.95}(T) = 6$, i.e. with probability ≥ 95% there are no more than 6 false targets in the filtered set when there is 1 decoy above the threshold. Here, 6 is a more conservative and meaningful estimate of the possible number of false targets than the expected number 2 and especially the uncorrected expected number 1.

Direct implementation of Eq. 11 implies calculation and summing of the probabilities $P(T = k)$ (e.g. using Eq. 7) for consecutive integers $k$ starting from zero, until the sum reaches or exceeds $\alpha$. The last $k$ then is the confidence value $C_\alpha(T)$. This procedure, as well as the corrected FDR estimation, is implemented in the Pyteomics library. 28

In order to obtain a simple estimate for the scale of FDP variance without convoluted iterative calculations, one can use the known expression for standard deviation of the negative
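binomial distribution $NB(r; p)$, $\sigma = \sqrt{\frac{pr}{(1-p)^2}}$ (here $r$ denotes the first parameter of the distribution, not the database size ratio):

$$\sigma(T) = \sqrt{\frac{(d+1)\,p}{(1-p)^2}}, \tag{12}$$

or $\sigma(T) = \sqrt{2(d+1)}$ for $p = \frac{1}{2}$. Eq. 12 is based on the negative binomial distribution and thus only applicable when FDR does not significantly exceed 5% (see Supporting Information for details).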
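The direct summation described for Eq. 11 and the standard deviation of Eq. 12 can be sketched as follows. This is an illustrative, standard-library-only sketch and not the Pyteomics code: the function names, the `max_k` safety limit and the helper `conservative_fdr` (which divides the confidence value by the number of targets) are assumptions introduced here.

```python
from math import comb, sqrt


def nb_pmf(k: int, d: int, p: float) -> float:
    """Eq. 7: P(T = k) for T ~ NB(d + 1; p)."""
    return comb(k + d, k) * p ** k * (1.0 - p) ** (d + 1)


def confidence_value(d: int, alpha: float = 0.95, ratio: float = 1.0,
                     max_k: int = 100_000) -> int:
    """Eq. 11: smallest k with P(T <= k) >= alpha."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    p = 1.0 / (1.0 + ratio)  # Eq. 6
    cumulative = 0.0
    for k in range(max_k + 1):
        cumulative += nb_pmf(k, d, p)
        if cumulative >= alpha:
            return k
    raise RuntimeError("confidence level not reached within max_k terms")


def nb_std(d: int, ratio: float = 1.0) -> float:
    """Eq. 12: standard deviation of T."""
    p = 1.0 / (1.0 + ratio)
    return sqrt((d + 1) * p) / (1.0 - p)


def conservative_fdr(d: int, t: int, alpha: float = 0.95,
                     ratio: float = 1.0) -> float:
    """Hypothetical helper: C_alpha(T) / t in place of E(T) / t."""
    return confidence_value(d, alpha, ratio) / t


# Example from the text: one decoy above the threshold, r = 1.
tail = 1.0 - sum(nb_pmf(k, 1, 0.5) for k in range(3))  # P(T >= 3) = 0.3125
c95 = confidence_value(1, 0.95)                        # 6
sigma = nb_std(1)                                      # 2.0
```

With 100 target PSMs and one decoy, `conservative_fdr(1, 100)` gives 6/100, compared with 2/100 from Eq. 10 and 1/100 from Eq. 1.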

Effect of monotonization

Equation 4 can be rewritten as: $\hat{q}_i = \min(\hat{f}_i, \hat{q}_{i+1})$. This has the following implications: (1) $\hat{q}_i \leq \hat{q}_{i+1}$ (monotonicity) and (2) $\hat{q}_i \leq \hat{f}_i$. The latter means that, in general, q values will have an optimistic bias relative to the FDR estimation for the corresponding subset of PSMs. Thus, if q values are reported for a list of PSMs, they generally present a biased estimation of the corresponding FDR levels. This holds as long as Eq. 4 is used for q value calculation, regardless of the FDR estimation method. The scale of this bias depends on the score distributions (more specifically, on local FDR). However, if the PSM list is filtered using q values, then the threshold position $k$ is chosen so that $\hat{q}_k < \hat{q}_{k+1}$, which means that $\hat{q}_k = \hat{f}_k$. Thus, filtering based on q values does not introduce any additional bias into FDR estimation. Figure 1 shows the difference between $\hat{q}_i$ and $\hat{f}_i$ in the case when $\hat{f}_i$ are calculated using Eq. 1.

The correction discussed in this work applies when filtering is based on a predefined FDR value, and Eq. 1 or similar is used for FDR estimation (which results in a decoy PSM directly following the calculated threshold). In other cases, e.g. when using a predefined search score threshold or simply considering a certain number of top matches, the “+1” correction is not applicable. For instance, Percolator v. 2.10 uses Eqs. 1 and 4 by default for q value estimation (see Supporting Information for demonstration). In this case, if one considers the top 500 results and the 501st PSM is not a decoy, $\hat{q}_{500}$ reported by Percolator will be a
biased estimation of FDR, while a more correct estimation would be $\hat{f}_{500} = \left(\frac{d}{t}\right)_{500}$ (the “+1” correction is not needed). However, if one filters the list by q value, e.g. $\hat{q}_i < 0.01$ (the case described by the proposed model), then the threshold q value will be $\hat{q}_k = \hat{f}_k = \left(\frac{d}{t}\right)_k$, while the correct FDR estimation for the filtered list will be $\left(\frac{d+1}{t}\right)_k$. For another example, MaxQuant v. 1.5.2.8 reports “q values” for identified protein groups. Analysis shows that these values are in fact not monotonic and correspond to FDR estimations, $\hat{f}_i = \left(\frac{d}{t}\right)_i$, rather than to q values as defined by Eq. 4. This means that they do not have the monotonization bias, but the correction proposed in this work still applies.

Simulated experiments

The following experiment was performed in order to test the above-mentioned theoretical conclusions. Two score distributions were chosen arbitrarily to model “true” and “false” PSMs; normal distributions were used for convenience. 1000 “true” PSMs were generated by drawing random numbers from the “true” score distribution. They were all labeled as “true” and “target”. Then, 5000 “false” PSMs were also generated by drawing random numbers from the “false” distribution. They were all labeled as “false”; labels “target” or “decoy” were assigned randomly to each “false” PSM with equal probabilities, corresponding to $p = \frac{1}{2}$, or $r = 1$. The resulting set of PSMs was then filtered to 1% FDR using the standard TDA approach. This corresponds to $\frac{t}{d} = 100$, which allows using Eq. 10 for FDR estimation, as mentioned above. After filtering, the number of “false” PSMs from the “target database” (i.e. the actual value of $T$) was calculated and compared to the number of “decoy” PSMs, $d$. The actual value of $FDP = \frac{T}{t}$ and the estimated $\widehat{FDR} = \frac{d}{t}$ were averaged over the repeated experiments. The results are shown in Figure 2. As expected, the mean difference tends to $\frac{1}{t}$ after a sufficient number of repeats.

The experimental workflow described above was also employed to evaluate the effect of using confidence values instead of expected values of FDP. For this purpose, a set of PSMs was generated once. Apart from regular q values, a more conservative estimation of q values
was made by using $C_{0.95}(T)$ instead of $E(T)$ in FDR estimation. The results are shown in Figure 3. The q values calculated using expected values oscillate around the “true” values, but the curve calculated using the confidence values provides a more conservative estimate. ROC curves plotted using both sets of q values are shown in Supplementary figure S2.

The results of the simulation do not depend on the shape of the score distributions or the extent of overlap between them, because the employed filtering method only considers PSM ranking. Thus, the overlap only affects the number of PSMs above the threshold corresponding to a certain FDR setting, but the error in the number of false positive target PSMs always tends to 1. However, in case of high discrimination between “true” and “false” PSMs, the sorting-based procedure itself can be substantially suboptimal. This effect is illustrated in Supporting Information (S2).
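The first simulated experiment can be approximated with the following standard-library sketch. It is not the notebook from the Supporting Information: the function names are hypothetical, and the score distributions (unit-variance normal distributions centered at 0 for “false” and at an assumed `true_mean` of 4 for “true” PSMs) are arbitrary choices made here, in the same spirit as in the text.

```python
import random


def simulate_once(rng, n_true=1000, n_false=5000, fdr=0.01, true_mean=4.0):
    """Generate one PSM set and filter it with Eq. 1; return (T, d, t)."""
    # Each PSM is a tuple (score, is_decoy, is_false).
    psms = [(rng.gauss(true_mean, 1.0), False, False) for _ in range(n_true)]
    # "False" PSMs are labeled target or decoy with equal probability (r = 1).
    psms += [(rng.gauss(0.0, 1.0), rng.random() < 0.5, True)
             for _ in range(n_false)]
    psms.sort(key=lambda psm: psm[0], reverse=True)

    d = t = false_targets = 0
    kept = (0, 0, 0)  # (T, d, t) of the filtered list; stays empty if nothing passes
    for _, is_decoy, is_false in psms:
        if is_decoy:
            d += 1
        else:
            t += 1
            if is_false:
                false_targets += 1
        # Filtering by q value (Eqs. 4 and 5) keeps the longest
        # top sublist with d / t <= fdr.
        if t > 0 and d / t <= fdr:
            kept = (false_targets, d, t)
    return kept


def mean_excess(repeats=200, seed=0, **kwargs):
    """Average of T - d over repeated simulations."""
    rng = random.Random(seed)
    total = 0
    for _ in range(repeats):
        false_targets, d, _t = simulate_once(rng, **kwargs)
        total += false_targets - d
    return total / repeats


if __name__ == "__main__":
    print(mean_excess(repeats=200))
```

Under the model described above, the printed average is expected to approach 1 as the number of repeats grows; the value obtained in any single run fluctuates with the seed.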

Conclusions

Several major search engines and post-search algorithms employ a naïve approach to FDR filtering, using only ranked lists of PSMs, and not score values, for q value calculation. If these q values are used for FDR filtering, a decoy identification is always immediately below the calculated threshold. Careful examination of the corresponding filtering process shows that, in this case, an important correction to the commonly used equation is needed for unbiased estimation of FDR in filtered sets of peptide and protein identifications. When filtering small sets or using very low FDR thresholds, random deviation of the actual FDP from the FDR threshold becomes very significant. Confidence values of FDP can be effectively used instead of point estimations of FDR. The amended FDR calculation procedures proposed in this work have been implemented in the open-source Pyteomics library, 28 allowing more accurate TDA-based filtering, especially in the case of small sets of PSMs.

Acknowledgement

The work was supported by the Russian Science Foundation, grant #14-14-00971. The authors thank Anton Goloborodko and Julia Bubis for useful discussion of the presented results.

Supporting Information Available

Corrected form of Eq. 8 (S1) and visualization of the effect of correction (Fig. S1). Source code for the simulated experiments and for generation of all figures in this work, in the form of an IPython Notebook (S2). Comparison of ROC curves based on expected value and confidence value (Fig. S2). Proof of equivalence between the equations used by Huttlin et al. 25 and the negative binomial distribution used in this work (S3). Analysis of q values reported by Percolator v. 2.10 (S4).

References

(1) Granholm, V.; Käll, L. Quality assessments of peptide-spectrum matches in shotgun proteomics. Proteomics 2011, 11, 1086–1093.
(2) Granholm, V.; Navarro, J. F.; Noble, W. S.; Käll, L. Determining the calibration of confidence estimation procedures for unique peptides in shotgun proteomics. J. Proteomics 2013, 80, 123–131.
(3) Soric, B. Statistical “Discoveries” and Effect-Size Estimation. J. Am. Stat. Assoc. 1989, 84, 608–610.
(4) Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B: Statistical Methodology 1995, 57, 289–300.
(5) Storey, J. D.; Tibshirani, R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U. S. A. 2003, 100, 9440–9445.
(6) Käll, L.; Canterbury, J. D.; Weston, J.; Noble, W. S.; MacCoss, M. J. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nat. Methods 2007, 4, 923–925.
(7) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–92.
(8) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207–214.
(9) Moore, R. E.; Young, M. K.; Lee, T. D. Qscore: An algorithm for evaluating SEQUEST database search results. J. Am. Soc. Mass Spectrom. 2002, 13, 378–386.
(10) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: The yeast proteome. J. Proteome Res. 2003, 2, 43–50.
(11) Käll, L.; Storey, J. D.; Noble, W. S. Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry. Bioinformatics 2008, 24, 42–48.
(12) Wenger, C. D.; Coon, J. J. A proteomics search algorithm specifically designed for high-resolution tandem mass spectra. J. Proteome Res. 2013, 12, 1377–86.
(13) Cox, J.; Neuhauser, N.; Michalski, A.; Scheltema, R. A.; Olsen, J. V.; Mann, M. Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 2011, 10, 1794–805.
(14) Cox, J.; Mann, M. MaxQuant enables high peptide identification rates, individualized p.p.b.-range mass accuracies and proteome-wide protein quantification. Nat. Biotechnol. 2008, 26, 1367–72.
(15) Dorfer, V.; Pichler, P.; Stranzl, T.; Stadlmann, J.; Taus, T.; Winkler, S.; Mechtler, K. MS Amanda, a universal identification algorithm optimized for high accuracy tandem mass spectra. J. Proteome Res. 2014, 13, 3679–3684.
(16) Kim, S.; Pevzner, P. A. MS-GF+ makes progress towards a universal database search tool for proteomics. Nat. Commun. 2014, 5, 5277.
(17) Nesvizhskii, A.; Vitek, O.; Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 2007, 4, 787–797.
(18) Elias, J. E.; Gygi, S. P. In Methods in Molecular Biology; Hubbard, S. J., Jones, A. R., Eds.; Methods in Molecular Biology; Humana Press: Totowa, NJ, 2010; Vol. 604; pp 55–71.
(19) Pawitan, Y.; Calza, S.; Ploner, A. Estimation of false discovery proportion under general dependence. Bioinformatics 2006, 22, 3025–3031.
(20) Fan, J.; Han, X.; Gu, W. Estimating false discovery proportion under arbitrary covariance dependence. J. Am. Stat. Assoc. 2012, 107, 1019–1035.
(21) Käll, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 2008, 7, 29–34.
(22) Käll, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Posterior error probabilities and false discovery rates: Two sides of the same coin. J. Proteome Res. 2008, 7, 40–44.
(23) Jeong, K.; Kim, S.; Bandeira, N. False discovery rates in spectral identification. BMC Bioinf. 2012, 13, S2.
(24) Gupta, N.; Bandeira, N.; Keich, U.; Pevzner, P. A. Target-decoy approach and false discovery rate: when things may go wrong. J. Am. Soc. Mass Spectrom. 2011, 22, 1111–20.
(25) Huttlin, E. L.; Hegeman, A. D.; Harms, A. C.; Sussman, M. R. Prediction of error associated with false-positive rate determination for peptide identification in large-scale proteomics experiments using a combined reverse and forward peptide sequence database strategy. J. Proteome Res. 2007, 6, 392–8.
(26) He, K.; Fu, Y.; Zeng, W.-F.; Luo, L.; Chi, H.; Liu, C.; Qing, L.-Y.; Sun, R.-X.; He, S.-M. A theoretical foundation of the target-decoy search strategy for false discovery rate control in proteomics. 2015; http://arxiv.org/abs/1501.00537, arXiv:1501.00537 [stat.AP].
(27) Davison, A.; Hinkley, D. Bootstrap Methods and their Application; Cambridge University Press: Cambridge, UK, 1997; pp 175–180.
(28) Goloborodko, A. A.; Levitsky, L. I.; Ivanov, M. V.; Gorshkov, M. V. Pyteomics – a Python framework for exploratory data analysis and rapid software prototyping in proteomics. J. Am. Soc. Mass Spectrom. 2013, 24, 301–304.