
Nov 4, 2008 - National Chromatographic R&A Center, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China, Graduate ...
Anal. Chem. 2008, 80, 9326–9335

Instance Based Algorithm for Posterior Probability Calculation by Target-Decoy Strategy to Improve Protein Identifications

Xinning Jiang,†,‡ Xiaoli Dong,†,§ Mingliang Ye,*,† and Hanfa Zou*,†

National Chromatographic R&A Center, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China, Graduate School of Chinese Academy of Sciences, Beijing 100049, China, and Department of Chemistry, Xixi Campus, Zhejiang University, Hangzhou 310028, China

The target-decoy database search strategy is often applied to determine the global false-discovery rate (FDR) of peptide identifications in proteome research. However, the confidence of individual peptide identification is typically not determined. In this study, we introduced an approach for the calculation of posterior probability of individual peptide identification from the “local false-discovery rate” (local FDR), which is also determined based on a target-decoy database search. The peptide identification scores output by the database search algorithm were weighted by their discriminating power using a Shannon information entropy based strategy. Then the local FDR of a peptide identification was calculated based on the fraction of decoy identifications among its nearest neighbors within a small space defined by these weighted scores. It was demonstrated that the calculated probability matched the actual probability precisely, and it provided powerful discriminating performance between true positive and false positive identifications. Hence, the sensitivity for peptide identification as well as protein identification was significantly improved when the calculated probability was used to process different proteome data sets. As an instance based strategy, this algorithm provides a safe way for the posterior probability calculation and should work well for data sets with different characteristics.
Coupled with liquid chromatography, tandem mass spectrometry based proteomic technologies provide a rapid and sensitive way for analysis of biologically complex protein mixtures.1-5 In this method, proteins are first digested by a protease, and then the resulting peptide mixtures are separated by high-performance liquid chromatography and sprayed into a mass spectrometer to generate a collection of MS and MS/MS spectra. The acquired MS/MS spectra are then interpreted by database search algorithms, such as SEQUEST,6 Mascot,7 X!Tandem,8 and OMSSA.9 These algorithms commonly predict peptide sequences by comparing the similarity between the observed spectra and theoretical spectra and then calculate scores to evaluate how well they match. These scores help discriminate between correct and incorrect assignments. One of the most fundamental steps in shotgun experiments is assigning an interpretable measure of confidence to the identifications after a database search, as all subsequent results depend highly on these confidence measures. Compared with other database search algorithms, it is especially difficult for SEQUEST,6 one of the first and most popular database search algorithms, to determine the proper filter criteria, because more than one discriminating score is determined for a single peptide identification, such as the cross correlation score (Xcorr) between the experimental and theoretical spectra and the delta Xcorr score (∆Cn) between the best and second best matches.6 The conventional way of filtering SEQUEST outputs is to set criteria which are established empirically using the parameters Xcorr and ∆Cn, e.g., Xcorr filters of 1.9, 2.2, and 3.75 for singly, doubly, and triply charged peptides and a ∆Cn filter of at least 0.1;5 these thresholds are typically published as criteria for which the correctness is defined. Other efforts have been made to establish statistical analysis methods10-15 to determine the probability of

* To whom correspondence should be addressed. Prof. Dr. Mingliang Ye, National Chromatographic R&A Center, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, China. Phone: +86-411-84379620. Fax: +86-411-84379620. E-mail: [email protected]. Prof. Dr. Hanfa Zou, National Chromatographic R&A Center, Dalian Institute of Chemical Physics, The Chinese Academy of Sciences, Dalian 116023, China. Phone: +86-41184379610. Fax: +86-411-84379620. E-mail: [email protected]. † Dalian Institute of Chemical Physics, Chinese Academy of Sciences. ‡ Graduate School of Chinese Academy of Sciences. § Zhejiang University. (1) Aebersold, R.; Mann, M. Nature 2003, 422, 198–207. (2) Chen, E. I.; Hewel, J.; Felding-Habermann, B.; Yates, J. R. Mol. Cell. Proteomics 2006, 5, 53–56. (3) Florens, L.; Washburn, M. P.; Raine, J. D.; Anthony, R. M.; Grainger, M.; Haynes, J. D.; Moch, J. K.; Muster, N.; Sacci, J. B.; Tabb, D. L.; Witney, A. A.; Wolters, D.; Wu, Y. M.; Gardner, M. J.; Holder, A. A.; Sinden, R. E.; Yates, J. R.; Carucci, D. J. Nature 2002, 419, 520–526.

(4) Koller, A.; Washburn, M. P.; Lange, B. M.; Andon, N. L.; Deciu, C.; Haynes, P. A.; Hays, L.; Schieltz, D.; Ulaszek, R.; Wei, J.; Wolters, D.; Yates, J. R. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 11969–11974. (5) Washburn, M. P.; Wolters, D.; Yates, J. R. Nat. Biotechnol. 2001, 19, 242– 247. (6) Eng, J. K.; McCormack, A. L.; Yates, J. R., III J. Am. Soc. Mass Spectrom. 1994, 5, 976–989. (7) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551–3567. (8) Craig, R.; Beavis, R. C. Bioinformatics 2004, 20, 1466–1467. (9) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X. Y.; Shi, W. Y.; Bryant, S. H. J. Proteome Res. 2004, 3, 958–964. (10) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383–5392. (11) Moore, R. E.; Young, M. K.; Lee, T. D. J. Am. Soc. Mass Spectrom. 2002, 13, 378–386.


Analytical Chemistry, Vol. 80, No. 23, December 1, 2008

10.1021/ac8017229 CCC: $40.75  2008 American Chemical Society Published on Web 11/05/2008

positive identifications, e.g., PeptideProphet.10,16 However, these methods commonly rely on underlying models that make various assumptions about the score distributions, which are usually determined based on simple standard proteins. Therefore, there may be high risk when these models are used to process data sets with very different characteristics. Another approach, the target-decoy database search strategy, is independent of the variations introduced by samples, analytical techniques, data processing, and search parameters.17-20 In this strategy, mass spectra are searched against a composite database containing both target (original) and decoy (shuffled) sequences, and then the final peptide identifications can be determined based on a specific false-discovery rate (FDR). As the target and decoy databases are of the same size, the probability that a peptide is identified incorrectly from the decoy database is expected to be the same as the probability that it is identified incorrectly from the target protein database. Therefore, the FDR of the final identified data set can be easily calculated.17-19 In this strategy, the global FDR, rather than the posterior probability of the individual peptide, is determined to evaluate the confidence level. As FDR directly reflects the confidence level of the finally identified data set and can be easily evaluated by the target-decoy strategy, it has been widely used. The use of decoy matches for the simulation of null distributions makes the statistical models safer and more precise than parametric models trained on a small number of standard protein data sets.16,21,22 Recently, several studies have focused on estimating the statistical significance (p-value) of peptide identifications from a target-decoy database search.23,24 This can help to distinguish which peptide identification is more likely to be a random match in a data set with a specific FDR.
(12) Anderson, D. C.; Li, W. Q.; Payan, D. G.; Noble, W. S. J. Proteome Res. 2003, 2, 137–146.
(13) Baczek, T.; Bucinski, A.; Ivanov, A. R.; Kaliszan, R. Anal. Chem. 2004, 76, 1726–1732.
(14) Ulintz, P. J.; Zhu, J.; Qin, Z. H. S.; Andrews, P. C. Mol. Cell. Proteomics 2006, 5, 497–509.
(15) Zhang, J. Y.; Li, J. Q.; Xie, H. W.; Zhu, Y. P.; He, F. C. Proteomics 2007, 7, 4036–4044.
(16) Choi, H.; Nesvizhskii, A. I. J. Proteome Res. 2008, 7, 254–265.
(17) Peng, J. M.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43–50.
(18) Elias, J. E.; Gygi, S. P. Nat. Methods 2007, 4, 207–214.
(19) Jiang, X. N.; Jiang, X. G.; Han, G. H.; Ye, M. L.; Zou, H. F. BMC Bioinformatics 2007, 8, 323.
(20) Jiang, X. N.; Han, G. H.; Feng, S.; Jiang, X. G.; Ye, M. L.; Yao, X. B.; Zou, H. F. J. Proteome Res. 2008, 7, 1640–1649.
(21) Zhang, J. Y.; Li, J. Q.; Liu, X.; Xie, H. W.; Zhu, Y. P.; He, F. C. BMC Bioinf. 2008, 9–18.
(22) Choi, H.; Ghosh, D.; Nesvizhskii, A. I. J. Proteome Res. 2008, 7, 286–292.
(23) Higgs, R. E.; Knierman, M. D.; Freeman, A. B.; Gelbert, L. M.; Patil, S. T.; Hale, J. E. J. Proteome Res. 2007, 6, 1758–1767.
(24) Käll, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. J. Proteome Res. 2008, 7, 29–34.
(25) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. Anal. Chem. 2003, 75, 4646–4658.
(26) Feng, J.; Naiman, D. Q.; Cooper, B. Anal. Chem. 2007, 79, 3901–3911.
(27) Price, T. S.; Lucitt, M. B.; Wu, W. C.; Austin, D. J.; Pizarro, A.; Yocum, A. K.; Ian, A. B.; FitzGerald, G. A.; Grosser, T. Mol. Cell. Proteomics 2007, 6, 527–536.

While FDR focuses on the evaluation of the overall confidence level for the data set, posterior probability is the probability that the observed peptide match for a spectrum is correct. Compared with FDR, the peptide probability can be easily assembled to generate a probability for protein assignment using statistical models.25-27 It also enables the inclusion of potentially valid low-scoring peptide matches in the final protein assignment. However, as the posterior probability reported in previous works was commonly calculated using statistical models based on the characteristics and score distributions of a particular data set, there may be a large bias when the models are applied to another data set with very different characteristics. Instead, estimation of posterior probability by local FDR (fdr) is a safer way.28,29 Because the probability is evaluated from the false positives within the local area regardless of the overall distributions, it should be applicable to different proteome samples using different database search algorithms. Peptide identifications with similar scores should have a similar confidence and should be confined in a “local area”. The fdr could be easily determined by the fraction of decoy identifications among all peptide identifications within this area. The question is how to define which peptide identifications should be considered in the “local area” of a peptide identification for the calculation of its fdr. In this study, we used k nearest neighbors (KNN) methods to solve this problem. In spite of their simplicity, KNN methods are among the best performers in a large number of classification problems. Because KNN methods do not make any assumption about the underlying data, they are especially successful when the decision boundaries are irregular or the data distribution is unknown. To integrate the scores output by the database search, a Shannon information entropy30 based attribute weighting strategy was used to evaluate the relative discriminating weights. Then the fdr of a peptide identification was calculated from its k nearest neighbors, which were determined by Euclidean distance to this peptide. It was demonstrated that the calculated probability matched the actual probability precisely and showed powerful discriminating performance.
As an instance based strategy, this algorithm provides a safe way for the posterior probability estimation and should work well for data sets with different characteristics.

EXPERIMENTAL PROCEDURES

MS/MS Spectra and Database Search. Eleven replicate runs of tandem mass spectra on a control sample composed of seven purified proteins and 2D LC-MS/MS data sets from human plasma and human liver tissue were collected as previously described.19 The acquired MS/MS spectra were searched using SEQUEST in the BioWorks 3.2 software suite (Thermo Finnigan, San Jose, CA). For the seven standard proteins, the database was a composite of protein sequences from yeast (9 492 entries) in forward (original) and reverse orientation as well as the forward and reversed sequences of all standard proteins plus trypsin and α-s2-casein (for the impurity of α-casein). The database used for the two human proteome samples was a composite of the normal IPI human database (v3.04, 49 078 entries) from the European Bioinformatics Institute with a reversed version of the same database attached at the end. MS/MS spectra were searched using fully tryptic cleavage constraints, and up to two missed cleavage sites were allowed. Cysteine residues were set as a static modification of +57.02 Da and methionine residues were set as a variable modification of +16 Da. Mass tolerances were 2 Da for the peptide and 1 Da for the fragment.

(28) Efron, B.; Tibshirani, R.; Storey, J. D.; Tusher, V. J. Am. Stat. Assoc. 2001, 96, 1151–1160.
(29) Käll, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. J. Proteome Res. 2008, 7, 40–44.
(30) Shannon, C. E.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Champaign, IL, 1963.

Table 1. The Attributes Used in KNN Algorithm

Attribute name: Description
Xcorr’ (ln(Xcorr)/ln(len)): cross-correlation score between experimental and theoretical spectra with peptide length correction (see ref 10)
∆Cn: difference in normalized correlation scores between next best and best hits
Sp: preliminary score for a peptide match
ln(Rsp): initial peptide rank based on preliminary score
Delta MH+: parent ion mass error between observed and theoretical one
ions: fraction of theoretical peaks matched in the preliminary score
proton mobility factor (PMF): measure of the ratio of basic amino acids to free protons for a peptide (see ref 14)
similarity score (Sim): similarity score between the experimental spectrum and the spectrum predicted by the kinetic model (see refs 31-33)

Calculation of Posterior Probability for Peptide Identifications. FDR Evaluation by Target-Decoy Strategy. In the target-decoy database search strategy, the probability that a random peptide is identified incorrectly from the decoy database is expected to be the same as the probability that it is identified incorrectly from the original protein database, as the sizes of the decoy database and the original database are the same. If we define the function

$$
f(p) =
\begin{cases}
1 & \text{if a target peptide} \\
-1 & \text{if a decoy peptide}
\end{cases}
\tag{1}
$$

to indicate whether a peptide identification is from the original database, then the FDR for a given data set with N peptide identifications can be calculated as

$$
\mathrm{FDR} = \frac{N - \sum_{p=1}^{N} f(p)}{N} = 1 - \frac{\sum_{p=1}^{N} f(p)}{N}
\tag{2}
$$
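As a minimal sketch of eqs 1 and 2, the global FDR can be computed directly from target/decoy labels. The function names `f` and `global_fdr` and the boolean `is_decoy` encoding are illustrative assumptions, not part of the published software.

```python
from typing import Sequence


def f(is_decoy: bool) -> int:
    """Eq 1: +1 for a target identification, -1 for a decoy identification."""
    return -1 if is_decoy else 1


def global_fdr(is_decoy: Sequence[bool]) -> float:
    """Eq 2: FDR = 1 - sum(f(p)) / N, which equals 2 * Nd / N."""
    n = len(is_decoy)
    if n == 0:
        raise ValueError("at least one peptide identification is required")
    return 1.0 - sum(f(d) for d in is_decoy) / n


# Hypothetical data set: 95 target hits and 5 decoy hits.
labels = [False] * 95 + [True] * 5
print(global_fdr(labels))
```

With 95 target and 5 decoy identifications, the sum of f(p) is 90, so the estimate is 1 - 90/100, i.e. 2Nd/N = 0.10 up to floating-point rounding.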

(31) Zhang, Z. Q. Anal. Chem. 2004, 76, 3908–3922.
(32) Sun, S. J.; Meyer-Arendt, K.; Eichelberger, B.; Brown, R.; Yen, C. Y.; Old, W. M.; Pierce, K.; Cios, K. J.; Ahn, N. G.; Resing, K. A. Mol. Cell. Proteomics 2007, 6, 1–17.
(33) Zhang, Z. Q. Anal. Chem. 2005, 77, 6364–6373.

Construction of Attribute Matrix. After the database search, several scores are calculated by SEQUEST to indicate how well a peptide matches the spectrum. The attribute vector S = {s1, s2, ..., sn}, where sj denotes the jth score for this peptide identification (which can be Xcorr, ∆Cn, and so on), is constructed to represent a peptide identification. In addition to the SEQUEST scores, two additional calculated scores, the mobile proton factor (MPF)14 and the similarity score between the predicted spectrum and the experimental spectrum (Sim),31-33 were also used. All the attributes used in this study and their descriptions are listed in Table 1. If there are m peptides identified with n score attributes in a data set, then the matrix can be expressed as S = (sij)m×n, where sij is the jth score attribute of the ith peptide identification. Because different score attributes are measured with different scales, the score matrix was normalized to form a new matrix R = (rij)m×n, so that all score attributes lie between 0 and 1. The normalization of the score attributes was calculated by

$$
r_{ij} = \frac{s_{ij} - \min_{j}\{s_{ij}\}}{\max_{j}\{s_{ij}\} - \min_{j}\{s_{ij}\}}
\tag{3}
$$

where sij and rij are the actual and normalized values of attribute score j for the identification of peptide i, and the maximum and minimum values are taken over the jth attribute of all peptides in the data set. Thus, instances in the matrix R represent the normalized score information for all peptide identifications. Estimation of Posterior Probability from fdr Using k Nearest Neighbors Algorithm with Distance Weighting (KNNDW). The KNN algorithm assumes that all instances correspond to points in n-dimensional space, and an instance can be approximately represented by its k nearest neighbors. The proximity, or similarity, between the query instance rq and the instance i in the evaluation matrix R can be measured as the Euclidean distance between the corresponding points in the attribute space,

$$
d(r_q, r_i) = \sqrt{\sum_{j=1}^{n} (r_{qj} - r_{ij})^2}
\tag{4}
$$

where rqj and rij are values of the jth attribute of instances rq and ri, respectively. And the k nearest neighbors are the k instances with smallest distances against the query instance. However, the classification powers of different attributes are commonly not the same, e.g., Xcorr and ∆Cn usually provide more discriminating performance than matched ion percentage.14 To reflect the classification relevance, the importance weights are assigned to all attributes so that important attributes are assigned with bigger weights. In this way, closeness or similarity for important attributes becomes more critical than that for other attributes. The distance metric with the consideration of attribute discriminating weights, wj, is defined as

$$
d(r_q, r_i) = \sqrt{\sum_{j=1}^{n} w_j (r_{qj} - r_{ij})^2}
\tag{5}
$$
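The normalization of eq 3 and the weighted distance of eq 5 can be sketched as follows. This is an illustrative sketch, not the published Java implementation: the function names are hypothetical, and mapping a constant-valued attribute to 0 is an assumption made here to avoid division by zero.

```python
import math
from typing import List, Sequence


def normalize_columns(scores: Sequence[Sequence[float]]) -> List[List[float]]:
    """Eq 3: min-max scale each attribute (column) over all peptides (rows)."""
    columns = list(zip(*scores))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            # Assumption: an attribute with a single value carries no information.
            (s - lo) / (hi - lo) if hi > lo else 0.0
            for s, lo, hi in zip(row, lows, highs)
        ]
        for row in scores
    ]


def weighted_distance(rq: Sequence[float], ri: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Eq 5: attribute-weighted Euclidean distance (eq 4 when all weights are 1)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, rq, ri)))


# Hypothetical score matrix S: three peptide identifications, two attributes.
S = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
R = normalize_columns(S)
print(R)
print(weighted_distance(R[0], R[2], [1.0, 1.0]))
```

Setting a weight to 0 removes the corresponding attribute from the distance, which is how low-weight attributes stop influencing which neighbors count as "near".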

Shannon information entropy is commonly used to measure the degree of disorder of a system.30 Here, we used a Shannon entropy (SE) based strategy similar to that described by Ayan34 and Zou et al.35 to evaluate the weight vector. For a given data set with N peptide identifications in which Nd are from the decoy database and Nt from the target database, under the assumption that spectra have the same chance of randomly matching peptides with target and decoy sequences when the combined database contains the same number of proteins with target and decoy sequences,17,18 the number of correct identifications can be estimated as Nt - Nd, and there will be 2Nd incorrect identifications. Thus the Shannon entropy (SEo) can be calculated as

$$
SE_o = -\frac{2N_d}{N}\log_2\!\left(\frac{2N_d}{N}\right) - \frac{N_t - N_d}{N}\log_2\!\left(\frac{N_t - N_d}{N}\right)
\tag{6}
$$

If Nd > N/2, all the peptide identifications in the data set will be considered as false positive identifications and SEo will be set as 0. When a score filter of attribute i is applied, two sub data sets are generated. Peptide identifications in one of the data sets have scores bigger than the threshold score, and those in the other have smaller scores. Therefore, the total information entropy after filtering is

$$
SE_i = \frac{N_1}{N} SE_1 + \frac{N_2}{N} SE_2
\tag{7}
$$

where N1 and N2 are the numbers of instances in these two sub data sets and SE1 and SE2 are the corresponding entropies calculated from the function in eq 6. The loss of Shannon entropy (LSE), which is also known as information gain, can be calculated as

$$
LSE_i = SE_o - SE_i
\tag{8}
$$

where SEo is the information entropy before setting of any filter, and SEi is the total entropy of the two data sets after setting of the criterion. The more entropy is lost, the greater the discriminating performance of this criterion. Therefore, the relative discriminating weight for the ith attribute can be evaluated as

$$
w_i = \frac{\mathrm{Max}(LSE_i)}{\sum_{i=1}^{m} \mathrm{Max}(LSE_i)}
\tag{9}
$$

where Max(LSEi) is the maximum LSE obtained when the ith attribute score filter is set after traversing all the values of the ith attribute, and the sum in the denominator runs over all attributes. The weight wi indicates the relative discriminating performance of the ith attribute. After that, attributes with very weak discriminating power (weights less than 0.01) were removed from the KNN space and were not used for the probability calculation. In KNN, the k nearest neighbors around the query instance are used to estimate the approximate values of the query instance. Therefore, the fdr of a peptide identification can be determined from its k nearest neighbors. However, one problem is to determine the number of nearest neighbors, k. If k is too small, the fdr may have a large deviation, and if it is too large, the calculated FDR cannot be considered as a “local FDR” because far-away instances are less relevant. One proper way to solve this problem is to use the distance weighted k-nearest-neighbor algorithm,36 in which a nearer instance carries more weight as it is more “similar” to the query instance. Considering the distance weighting dwi, the fdr for peptide q, fdrq, can be calculated as

$$
fdr_q = 1 - \frac{\sum_{i=1}^{k} dw_i \, f(i)}{\sum_{i=1}^{k} dw_i}
\tag{10}
$$

Several methods have been developed for weighting evaluation in KNN.37-39 Here we used a simple strategy which defined the distance weighting as

$$
dw_i = \frac{b}{b + d(r_q, r_i)}
\tag{11}
$$

where b is a constant boundary value which is defined as follows. An initial local FDR, fdr(k), was calculated based on the local space occupied by the query peptide itself and its k = 199 nearest neighbors. We defined the boundary value b in eq 11 as the distance between the query peptide and its 199th nearest neighbor. Thereby, fdr(k) can be calculated using eq 10. Starting with fdr(k), fdr(k + 1) was calculated from the k + 1 nearest neighbors, and this iteration was heuristically repeated until the fdr reached a specific precision. Here the precision value was set as 0.001. The peptide posterior probability p can then be calculated by the following equation:

$$
p = 1 - fdr
\tag{12}
$$

(34) Ayan, N. F. In 8th Turkish Symposium on Artificial Intelligence and Neural Networks (TAINN’99), Istanbul, Turkey, June 23-25, 1999.
(35) Zou, Z. H.; Yun, Y.; Sun, J. N. J. Environ. Sci. 2006, 18, 1020–1023.
(36) Mitchell, T. M. In Machine Learning; McGraw-Hill: Boston, MA, 1997; pp 230-236.
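The distance-weighted local FDR of eqs 10-12 can be sketched as below. Several details are assumptions made for illustration, since the text does not fix them: the query is counted with weight 1 (distance 0), "reaching the precision" is read as two successive estimates differing by less than 0.001, zero distances are given weight 1, and the probability is clamped to [0, 1]. All function names are hypothetical.

```python
import math
from typing import Sequence


def weighted_distance(rq, ri, weights) -> float:
    """Eq 5: attribute-weighted Euclidean distance."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, rq, ri)))


def local_fdr(query: Sequence[float], query_is_decoy: bool,
              instances: Sequence[Sequence[float]], is_decoy: Sequence[bool],
              weights: Sequence[float], k: int = 199,
              precision: float = 0.001) -> float:
    """Distance-weighted local FDR of `query`; `instances` excludes the query."""
    neighbors = sorted(
        (weighted_distance(query, r, weights), decoy)
        for r, decoy in zip(instances, is_decoy)
    )
    if not neighbors:
        raise ValueError("at least one neighbor is required")
    k = min(k, len(neighbors))
    b = neighbors[k - 1][0]  # boundary: distance to the initial k-th neighbor

    def dw(dist: float) -> float:
        # Eq 11; the zero-distance case is an assumption to avoid 0/0.
        return 1.0 if b + dist == 0 else b / (b + dist)

    def fdr_with(n: int) -> float:
        # Eq 10 over the n nearest neighbors plus the query itself (assumption).
        num = -1.0 if query_is_decoy else 1.0
        den = 1.0
        for dist, decoy in neighbors[:n]:
            num += dw(dist) * (-1.0 if decoy else 1.0)
            den += dw(dist)
        return 1.0 - num / den

    fdr = fdr_with(k)
    while k < len(neighbors):  # grow the neighborhood one instance at a time
        k += 1
        updated = fdr_with(k)
        converged = abs(updated - fdr) < precision
        fdr = updated
        if converged:
            break
    return fdr


def posterior_probability(fdr: float) -> float:
    """Eq 12, clamped to [0, 1] (the clamp is an assumption of this sketch)."""
    return min(1.0, max(0.0, 1.0 - fdr))
```

Because f is +1 or -1, eq 10 ranges from 0 (only target neighbors) to 2 (only decoy neighbors), which is why this sketch clamps the resulting probability.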

In this study, the calculation of probability was performed separately for each charge state. After that, all the peptides were pooled together for the calculation of protein probability. Since a low-resolution mass spectrometer (e.g., an ion trap mass spectrometer) cannot distinguish between [M + 2H]2+ and [M + 3H]3+ precursor ions, each spectrum was searched by SEQUEST against the database and assigned a peptide separately for each precursor ion charge. Here, if a spectrum is identified by peptides with both 2+ and 3+ charge states, only the peptide with the larger probability is retained; if they have the same probability, the peptide identified in the doubly charged state is retained. In this way, only the single most probable peptide match for each spectrum is retained.

To evaluate the performance of our new strategy, PeptideProphet10,16 and ProteinProphet,25 which were downloaded as parts of the Trans-Proteomics Pipeline (TPP, v3.5, rev.4)40 from The Seattle Proteome Center, were also used to process these data sets. All peptides assigned from database searching were processed through the Trans-Proteomics Pipeline to generate peptide probabilities by the original unsupervised model10 using default parameters. Then the pepXML formatted files were used to calculate the protein probabilities by ProteinProphet.

Software. PENN, probability estimator by nearest neighbors algorithm, was implemented in Java using Java 2 Standard Edition development kit 6.0. It is freely available to academic users from http://bioanalysis.dicp.ac.cn/proteomics/software/PENN.html.

(37) Atiya, A. F. Neural Comput. 2005, 17, 731–740.
(38) Li, C. Q.; Jiang, L. X. In Pricai 2006 Trends in Artificial Intelligence, Proceedings; Springer-Verlag Berlin: Berlin, Germany, 2006; Vol. 4099, pp 375-384.
(39) Dudani, S. A. IEEE Trans. Syst. Man Cybern. 1976, 6, 327–327.
(40) Chepanoske, C. L.; Richardson, B. E.; von Rechenberg, M.; Peltier, J. M. Rapid Commun. Mass Spectrom. 2005, 19, 9–14.

Figure 1. Overview of the computational method.

RESULTS

Overview of the Computational Method. In this study, we developed an instance based probability evaluation method for peptide identifications based on the local FDR determined by the target-decoy strategy. The flowchart is shown in Figure 1. Briefly, peptides which are digested from proteins are first separated by LC and sprayed into the mass spectrometer for the collection of MS/MS spectra. Then these MS/MS spectra, which constitute the primary source of information used to identify the peptides in the original samples, are searched against a composite protein database containing both the original and decoy versions of sequences of the corresponding organism using SEQUEST or a similar tool. Only the best matching peptide to each MS/MS spectrum is selected for further analysis. Four commonly used scores, Xcorr’ (cross-correlation value after log transform and normalization for peptide length),10 ∆Cn, Sp, and ln(Rsp), as well as delta MH+ and other computed scores, MPF14 and Sim,31,33 were used to construct the score matrix S. Then the normalized score matrix R was used to determine the relative discriminating weights for different attributes by the loss of Shannon entropy after setting of filters of different attributes. The k nearest neighbors around the query peptide were used for the calculation of fdr. The peptide posterior probability was calculated after the iteration of fdr to a


specific precision. The probability can be used to filter the peptide identifications to an acceptable confidence level (e.g., overall FDR < 0.01) or to generate the protein identification probability using protein level statistical models.25,27

Discriminating Weights for Different Attributes. Compared with other database searching algorithms, SEQUEST uses a cross-validation strategy for the assignment of spectra to peptides and outputs several scores to determine how well the actual and theoretical spectra match. Because all of these scores assigned by SEQUEST provide discriminating power, determining the relative discriminating power of the different scores and setting proper filters is very complicated. Therefore, most researchers using SEQUEST as a database search engine adopt empirical filters with the target-decoy strategy to generate peptide identifications at a specific confidence level.41 Keller et al. calculated a discriminating score from several of the scores using linear discriminant analysis. This score was further used to calculate posterior probability.10,16 As the computed discriminating score contains all the information of the scores used for the calculation, it can be used to represent all of these scores. However, because the coefficients in the linear discriminant function used for the discriminating score calculation were trained from a single standard protein data set, there may be a high risk when it is applied to other data sets with different characteristics. How to determine the relative discriminating weights and incorporate these scores properly in a safe way is still a problem. With this issue in mind, we employed the strategy based on Shannon entropy, which is commonly used in decision trees for the selection of important attributes, to evaluate the relative discriminating performance of different score attributes.
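A rough sketch of the entropy-based weighting (eqs 6-9) is given below for illustration. The function names are hypothetical, the split convention (scores at or above the threshold versus below it) and the treatment of empty subsets are assumptions, and the quadratic threshold scan is kept simple rather than efficient.

```python
import math
from typing import List, Sequence


def shannon_entropy(n_target: int, n_decoy: int) -> float:
    """Eq 6, with the text's rule that the entropy is 0 when Nd > N/2."""
    n = n_target + n_decoy
    if n == 0 or n_decoy > n / 2:
        return 0.0
    entropy = 0.0
    for fraction in (2 * n_decoy / n, (n_target - n_decoy) / n):
        if fraction > 0:  # convention: 0 * log2(0) = 0
            entropy -= fraction * math.log2(fraction)
    return entropy


def max_entropy_loss(values: Sequence[float], is_decoy: Sequence[bool]) -> float:
    """Max(LSE_i): largest entropy loss (eq 8) over all thresholds of one attribute."""
    n = len(values)
    n_decoy = sum(is_decoy)
    se_o = shannon_entropy(n - n_decoy, n_decoy)
    best = 0.0
    for threshold in set(values):
        high = [d for v, d in zip(values, is_decoy) if v >= threshold]
        low = [d for v, d in zip(values, is_decoy) if v < threshold]
        # Eq 7: size-weighted entropy of the two sub data sets.
        se_i = sum(
            len(part) / n * shannon_entropy(len(part) - sum(part), sum(part))
            for part in (high, low)
        )
        best = max(best, se_o - se_i)
    return best


def attribute_weights(columns: Sequence[Sequence[float]],
                      is_decoy: Sequence[bool]) -> List[float]:
    """Eq 9: each attribute's maximum entropy loss, normalized over attributes."""
    losses = [max_entropy_loss(col, is_decoy) for col in columns]
    total = sum(losses)
    return [loss / total if total > 0 else 0.0 for loss in losses]


# Hypothetical data: six target and two decoy identifications, two attributes.
decoy = [False] * 6 + [True] * 2
informative = [10.0, 9.0, 8.0, 7.0, 2.0, 1.0, 1.5, 0.5]
constant = [5.0] * 8
print(attribute_weights([informative, constant], decoy))
```

In this toy data set the first attribute separates four confident targets from a mixed low-scoring group, so it receives all of the weight, while the constant attribute receives none and would be dropped by the 0.01 cutoff.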
Incorporated with the target-decoy strategy, this method needs no prior knowledge about the data sets, and the discriminating weights are automatically generated based on the characteristics of the data set. Therefore, this is a safer strategy and is applicable for the determination of tailored discriminating weights for data sets with very different characteristics. Actually, a similar strategy has also been used by Ulintz et al. for the evaluation of the relative importance of discrimination for SEQUEST scores in boost tree and random tree.14 The discriminating weights for different attributes on different data sets are shown in Figure 2. Evidently, the discriminating weights of the same score attributes are different for different samples and different charge states. Recent works also showed that filter scores were very different for data sets from samples with different characteristics even though the final identifications had the same confidence level.19,42 Thus tailored filters for different data sets can improve the positive identification without an increase of FDR at the peptide level.19 Therefore, tailored weights of the different score attributes should be used to incorporate these scores properly and improve the true positive identifications.

(41) Xie, H. W.; Griffin, T. J. J. Proteome Res. 2006, 5, 1003–1009.
(42) Qian, W. J.; Liu, T.; Monroe, M. E.; Strittmatter, E. F.; Jacobs, J. M.; Kangas, L. J.; Petritis, K.; Camp, D. G., II; Smith, R. D. J. Proteome Res. 2005, 4, 53–62.

Figure 2. The relative discriminating weights of the attributes for data sets from three different samples (standard protein, liver tissue, and plasma) of different charge states.

Figure 3. Actual probability plotted against the calculated probability. The probabilities calculated by PENN (dotted) and PeptideProphet (solid) are shown. The expected probability is also plotted as a 45° line (dashes).

Validation of This Strategy Using Standard Protein Mixture. To evaluate the performance of the probability estimation for this instance based algorithm, a model system of a standard protein mixture was used. In this system, spectra generated from protein standards are searched against a composite database constructed from the sequences of the standard proteins, a large enough inhomogeneous database (such as the yeast database), and the decoy version of the above sequences. When a “blind test” is performed on a newly developed algorithm based on the target-decoy strategy, both peptides identified from standard proteins and those from the original version of the inhomogeneous database are considered as “correct” by the algorithm, and peptides from the decoy database are considered as incorrect. This evaluation system is very similar to actual large scale proteome identification, in which the original database and a decoy database of the same size are mixed to perform the target-decoy database search. When this system is used for the evaluation of a newly developed algorithm, because the size of the incorrect sequence database (both the inhomogeneous database and the decoy database) is much larger than the size of the standard protein database, all peptide assignments corresponding to standard proteins can be considered as correct identifications and those with sequences from the inhomogeneous forward or decoy database as incorrect. Therefore, the actual situation of this data set can be easily investigated. The model system in this study was built using a control protein mixture of seven standard proteins generated from the LTQ mass spectrometer. In total, there were 6 388 spectra of [M + H]+ ions, 105 211 [M + 2H]2+ spectra, and 98 601 [M + 3H]3+ spectra collected by the mass spectrometer. After a database search by SEQUEST, there were 275, 2869, and 1445 peptides identified with sequences of standard proteins or trypsin for singly, doubly, and triply charged ions, respectively. All of these peptide identifications with sequences of standard proteins or trypsin were considered as correct identifications, while those with sequences from the yeast or decoy database were considered as incorrect for the calculation of actual probability. Then this data set was used to evaluate the instance based algorithm. To demonstrate the accuracy of the calculated probability, which is the most important issue for a probability based algorithm, the actual probability that peptide assignments are correct was plotted as a function of the calculated probability for the model data set in Figure 3. Peptide assignments were first sorted by the calculated probability, and then the mean calculated probability and actual probability were determined within a sliding window of 100 spectra. As the correct and incorrect identifications were known, the actual probability in each bin can be easily estimated as the proportion of correct identifications.
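The windowed comparison of calculated and actual probability described above can be sketched as follows. The function name and the `step` parameter are assumptions: the text specifies a window of 100 spectra but not how far the window advances between points.

```python
from typing import List, Sequence, Tuple


def calibration_points(calculated: Sequence[float], is_correct: Sequence[bool],
                       window: int = 100, step: int = 1) -> List[Tuple[float, float]]:
    """Return (mean calculated probability, actual probability) per window.

    Assignments are sorted by calculated probability; the actual probability
    of a window is the fraction of assignments in it known to be correct.
    """
    ranked = sorted(zip(calculated, is_correct))
    points = []
    for start in range(0, len(ranked) - window + 1, step):
        chunk = ranked[start:start + window]
        mean_calculated = sum(p for p, _ in chunk) / window
        actual = sum(1 for _, correct in chunk if correct) / window
        points.append((mean_calculated, actual))
    return points


# Hypothetical, perfectly separated toy data with a window of 4 assignments.
probabilities = [0.0] * 4 + [1.0] * 4
correct = [False] * 4 + [True] * 4
print(calibration_points(probabilities, correct, window=4, step=4))
```

Points lying on the diagonal, as in this toy case, correspond to calculated probabilities that agree with the observed fraction of correct assignments.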
When the actual probability is plotted against the calculated probability within the same window, points close to the 45° line indicate good agreement between the calculated and actual probabilities. As shown in Figure 3, the probability estimates from PENN are more accurate than those from PeptideProphet except in the region of probability below 0.1. It should be noted that peptide assignments in this low probability region are of very low confidence and hardly improve protein identification. The calculated probabilities by PeptideProphet were

Analytical Chemistry, Vol. 80, No. 23, December 1, 2008

Figure 4. Number of correct peptides of standard proteins as a function of false discovery rate (FDR) for PENN and PeptideProphet.

always smaller than the actual probabilities, indicating that PeptideProphet may underestimate the probability of peptide identifications for this data set. Therefore, the probabilities that peptides are correctly assigned to spectra, calculated from the local FDR of the peptides “near” each assignment, precisely reflect the actual probabilities and can be used to effectively estimate the likelihood that the proteins corresponding to those peptides are present in the sample.

In addition to the precision of probability estimation, the discriminating performance of the probability calculated by PENN was also evaluated using the model data set of standard proteins. Figure 4 shows the number of correct identifications as a function of FDR. Various probability thresholds were set to generate peptide identifications with different overall confidence levels, and the numbers of correct identifications with standard protein sequences were then plotted against the global FDR of the peptide identifications obtained at each probability threshold. As can be seen, PENN allowed better separation between correct and incorrect identifications than PeptideProphet, indicating that PENN can identify more true positive peptides at the same false positive rate. For example, at an FDR of 1%, 3437 correct peptides were identified by PENN, while 2798 correct peptides were identified by PeptideProphet. Thus as many as 22.8% more peptides could be identified by PENN without any increase in global FDR, indicating the improved discriminating performance of PENN. Compared with PeptideProphet, PENN provides more precise probability estimation and significantly improves the discriminating performance.
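The threshold sweep behind a curve such as Figure 4 can be illustrated with the following Python sketch. It is not taken from the paper: the names `fdr_curve` and `relative_gain` and the source labels are hypothetical, and the global FDR is assumed here to be estimated as 2 × decoys / (all accepted identifications), one common convention for concatenated target-decoy searches that may differ from the formula actually used.

```python
def fdr_curve(identifications, thresholds):
    """Sweep probability thresholds over a model data set.

    identifications : list of (probability, source) pairs, where source is
                      "standard", "yeast", or "decoy" (hypothetical labels)
    thresholds      : minimum probabilities to test
    Returns a list of (threshold, estimated global FDR, number correct).
    """
    curve = []
    for t in thresholds:
        accepted = [src for prob, src in identifications if prob >= t]
        n_decoy = sum(1 for src in accepted if src == "decoy")
        # Only matches to the standard proteins count as correct.
        n_correct = sum(1 for src in accepted if src == "standard")
        # Assumed estimator: each decoy hit implies one false target hit.
        fdr = 2.0 * n_decoy / len(accepted) if accepted else 0.0
        curve.append((t, fdr, n_correct))
    return curve


def relative_gain(n_new, n_ref):
    """Percentage of additional identifications relative to a reference."""
    return 100.0 * (n_new - n_ref) / n_ref


# Gain quoted in the text at 1% FDR: 3437 (PENN) vs 2798 (PeptideProphet).
gain_at_1pct = relative_gain(3437, 2798)
```

With the numbers quoted above, `relative_gain(3437, 2798)` evaluates to about 22.8, in agreement with the reported improvement.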
It was observed that the accuracy of the probability calculated by PeptideProphet is lower than that reported for the training data set on which the mixture model was founded.10 This may be because the data set of the model system used in this study is very different from the one on which PeptideProphet was founded. This data set may be a “challenging dataset” on which PeptideProphet does not work so well, as has been observed by Choi et al.16 As no training data set and no assumption is required, PENN can determine tailored parameters for different data sets regardless of the different distributions of true and false assignments. To show the benefits of integrating other score attributes for the improvement of discriminating performance,

Figure 5. Number of correct peptides of standard proteins as a function of false discovery rate (FDR) for PENN using different combinations of score attributes.

peptide identifications by PENN using different combinations of score attributes were also investigated. The first combination was Xcorr and ∆Cn, the two most widely used score attributes for the filtration of peptide identifications. The second combination was Xcorr, ∆Cn, Sp, ln(Rsp), and ions, the five SEQUEST match scores. The last combination comprised all of the score attributes used in this study, as listed in Table 1, including the MPF and Sim scores. The number of correct peptides of standard proteins identified at different FDR levels is shown in Figure 5. As can be seen, the last combination, with all of the score attributes, identified the most peptides at the same FDR levels. Clearly, the discriminating performance of PENN was significantly improved when additional score attributes were integrated.

Application to Peptide Identifications from Complex Human Samples. To further evaluate the performance of PENN in peptide identification, data sets of two complex human samples, liver tissue and plasma, were processed by PENN. MS/MS spectra collected from these two samples were searched against the composite database for the assignment of peptides by SEQUEST. The identified peptides were then processed by both PENN and PeptideProphet. The numbers of peptide identifications with target sequences obtained by PENN and PeptideProphet at different FDR levels are shown in Figure 6. Each point along a curve corresponds to the number of peptide identifications with target sequences after filtering with the minimum probability threshold that achieves a specific global FDR. PENN consistently identified more peptides than PeptideProphet for both complex samples at the different FDR levels, showing its better discriminating performance. At an FDR of 1%, 19% and 12.7% more peptides were identified by PENN than by PeptideProphet for the liver tissue sample and the plasma sample, respectively.
A conventional criteria based strategy was also applied to these two data sets. The conventional criteria were determined similarly to those described by Xie et al.:41 Xcorr cutoffs were set as 2.0, 2.5, and 3.8 for singly, doubly, and triply charged peptides, and the ∆Cn cutoff was determined by increasing its value until the FDR for peptide identifications was less than 1%. The resulting ∆Cn cutoffs were 0.164 and 0.265 for the data sets of liver tissue and plasma, respectively. The number of identified

Figure 6. Number of target peptide identifications from data sets of human liver tissue and plasma as a function of false discovery rate (FDR) for PENN and PeptideProphet.

Table 2. Number of Peptides Identified with Global FDR < 1% by Three Strategies for Two Complex Samples

                 PENN(a)     PeptideProphet(a)     conventional criteria(a)
liver tissue     37 512      31 526                25 437
plasma           16 308      14 472                12 391

(a) The numbers listed in the table include peptides from the decoy database.

peptides by these strategies is listed in Table 2. Compared with the conventional criteria based strategy, PENN identified 47.5% and 31.6% more peptides for the tissue data set and the plasma data set, and PeptideProphet identified 23.9% and 16.8% more peptides for these two data sets. A closer look at the peptides identified by these three strategies is shown in Figure 7. Most of the peptides were identified by all three strategies. Taking the liver tissue data set as an example, 41 071 peptides were identified by at least one of the three methods. Among them, 91.3%, 76.8%, and 61.9% could be identified by PENN, PeptideProphet, and the conventional criteria, respectively, when each method was used on its own. In addition, 78.7%, 96.1%, and 98.2% of the peptides identified by PENN, PeptideProphet, or the conventional criteria could also be identified by one or both of the other two strategies. This means that a significant fraction

Figure 7. Overlap of the peptides identified by PENN, PeptideProphet, and the conventional criteria based strategy for two complex samples, human liver tissue (A) and human plasma (B). The number of peptides identified by one or more strategies is indicated, and the FDR of these peptides is given in parentheses below it; e.g., for liver tissue, 21 935 peptides are identified by all three strategies (intersection), and the FDR of these 21 935 peptides is 0.2%.

of peptides identified by PENN cannot be identified by the other methods. The FDRs of the identified peptides in the different regions of the Venn diagram in Figure 7 are also indicated. Peptides identified by all three strategies are the most reliable, followed by the peptides identified by two of the strategies, and peptides identified by only one strategy are the least reliable. Among the peptides identified by only one of the three strategies, those from PENN had the lowest FDR. These results show that the peptides identified by PENN have a high confidence level.

Application to Protein Identifications. One of the major advantages of generating a probability for each peptide assignment is that it facilitates the determination of protein probability. Therefore, PENN also outputs the peptide results with probabilities in pepXML format43 so that they can be used directly by ProteinProphet for the calculation of protein probabilities. The numbers of

(43) Keller, A.; Eng, J.; Zhang, N.; Li, X. J.; Aebersold, R. Mol. Syst. Biol. 2005, 1, E1-E8.

Figure 8. Comparison between the numbers of proteins identified by ProteinProphet from the original database vs the decoy database for PENN and PeptideProphet. Two complex data sets from human samples were processed: (A) for human liver tissue and (B) for human plasma. Every point in the curves represents the number of proteins identified in an original database search (y-axis) and the number of proteins identified in the decoy database search (x-axis) corresponding to the same probability threshold.

proteins identified by ProteinProphet with probabilities ranging from 1.0 to 0.2 using the peptide probabilities computed by PENN and PeptideProphet are shown in Figure 8. With the peptide probabilities determined by PENN, ProteinProphet identified more target proteins than with the probabilities determined by PeptideProphet when the numbers of decoy proteins were the same, indicating that PENN also improved the discriminating performance at the protein level.

DISCUSSION

It is reasonable to expect that peptides identified with similar scores should have similar probabilities of being true positive identifications. In the multidimensional score attribute space, peptides with similar scores occupy a local area. The probability for the peptide at the center of the local area approximates the local probability for this area. If the target-decoy database search strategy is applied, the local probability can be easily assessed based on the fraction of peptides from the forward database among all identified peptides in this area. This provides a way to compute the probability for each individual peptide

identification. In this study, the k nearest neighbor algorithm was applied in PENN to calculate the local FDR for the determination of the posterior probability of individual peptide identification. It is an instance based algorithm, in which the estimator is built separately for every individual data set. Therefore, compared with the model based probability evaluation strategy, PENN provides a more precise probability estimation and should be applicable to data sets with different characteristics, regardless of which sample the data set is generated from or even which type of mass spectrometer the data set is collected by. Because a model based algorithm usually needs assumptions of independence between score attributes or about the distributions of data sets, its accuracy relies heavily on the fit between the empirical model and the experimental data set. If the distribution of the data set matches the model precisely, the generative approach works well even with a small training data set; however, if the model is used on data sets whose characteristics differ from those on which the model is based, a high risk of error may be introduced. For example, Lopez-Ferrer et al. introduced a statistical model with the assumptions that XCc (= ln(Xcorr)) and DCc (= (∆Cn)^{1/2}) independently followed normal distributions.44 However, Zhang et al. found that the data sets used in their study did not fit these assumptions well.21 There may be unexpected errors if the model is applied to their data set. Because the instance based algorithm developed in this study does not rely on any assumptions about the data characteristics, it should be equally applicable to various types of data sets as long as the target-decoy database search is performed. To properly integrate different score attributes, PENN used the loss of Shannon entropy to determine the relative discriminating weights.
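The idea of estimating a posterior probability from the decoy fraction among nearest neighbors can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the PENN implementation: the function name `knn_posterior` is hypothetical, the scores are assumed to be already normalized, the weights are taken as given inputs (the entropy based weighting is not reproduced here), each identification is counted among its own neighbors, and the local FDR is assumed to be the ratio of decoy to target neighbors, capped at 1.

```python
import math


def knn_posterior(scores, is_decoy, weights, k=5):
    """Estimate p = 1 - local FDR for every identification.

    scores   : list of score vectors (one per identification), pre-normalized
    is_decoy : True if the identification comes from the decoy database
    weights  : one discriminating weight per score attribute (assumed given)
    k        : number of nearest neighbors, including the point itself
    """
    n = len(scores)
    if len(is_decoy) != n:
        raise ValueError("scores and is_decoy must have the same length")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of identifications")

    # Scale every attribute by its weight before measuring distances.
    scaled = [[w * x for w, x in zip(weights, row)] for row in scores]

    probabilities = []
    for i in range(n):
        # Brute-force neighbor search; adequate for a small illustration.
        nearest = sorted(range(n), key=lambda j: math.dist(scaled[i], scaled[j]))[:k]
        n_decoy = sum(1 for j in nearest if is_decoy[j])
        n_target = k - n_decoy
        if n_target == 0:
            probabilities.append(0.0)
            continue
        # Assumption: each decoy neighbor implies one false target neighbor.
        local_fdr = min(1.0, n_decoy / n_target)
        probabilities.append(1.0 - local_fdr)
    return probabilities
```

The brute-force search costs on the order of n² distance evaluations, so a spatial index would be preferable for data sets of the size discussed here; the sketch favors clarity over speed.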
Because the weights are automatically determined for data sets with different characteristics, additional score attributes can be easily integrated into PENN to improve the peptide identifications. For example, with the Sim score incorporated,31-33 PENN showed improved discriminating performance with significantly more peptide identifications (Figure 5). Other properties, such as the pI values obtained from isoelectric focusing (IEF),45 hydrophobicity, or elution times obtained from reversed-phase LC separation,42 can also be used in the same way to improve the discriminating performance and confidence level. However, the situation for model based algorithms such as PeptideProphet is different. The addition of score attributes may alter the discriminating score and thus the shape of the score distribution.14 Therefore, models may need to be retrained to reach an acceptable performance. Besides allowing easy integration of new score attributes, PENN itself can improve the discriminating performance using only the scores originally defined by the database search algorithm (Supporting Information, Figures S1-S3). As can be seen, when the FDR criterion was set to the commonly used value of 1%, PENN achieved better performance for all three data sets. When the FDR of the final identified peptides was between 5% and 20%, PeptideProphet achieved performance similar to that of PENN for the data sets of standard protein mixtures and human liver tissue sample; however, PENN performed better on the

(44) Lopez-Ferrer, D.; Martinez-Bartolome, S.; Villar, M.; Campillos, M.; Martin-Maroto, F.; Vazquez, J. Anal. Chem. 2004, 76, 6853–6860.
(45) Krijgsveld, J.; Gauci, S.; Dormeyer, W.; Heck, A. J. R. J. Proteome Res. 2006, 5, 1721–1730.

human plasma data set. This may be because the data sets on which the PeptideProphet model was trained had characteristics similar to those of the standard protein mixture and liver tissue data sets. However, plasma is a very different sample in which only 22 high abundance proteins contribute ∼99% of the total protein mass, while thousands of relatively low-abundance proteins make up to