Kolmogorov-Smirnov Scores and Intrinsic Mass Tolerances for Peptide Mass Fingerprinting

Rachana Jain† and Michael Wagner*,†,‡,§

Department of Biomedical Engineering, University of Cincinnati, Cincinnati, Ohio 45219; Division of Biomedical Informatics, Cincinnati Children's Hospital Research Foundation, Cincinnati, Ohio 45229; and Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio 45229

Received June 24, 2009

Peptide Mass Fingerprinting (PMF) uses proteolytic peptide masses and a prespecified search database to identify proteins. At the core of a PMF database search algorithm lies a quality statistic that gauges the level to which an experimentally obtained peak list agrees with a list of theoretically observable mass-to-charge ratios for a protein in a database. In this paper, we propose, implement, and evaluate a statistical (Kolmogorov-Smirnov-based) test computed for a large mass error threshold, thereby avoiding the need for the user to choose an appropriate mass tolerance. We use the mass tolerance identified by the Kolmogorov-Smirnov test for computing other quality measures. The results of our careful and extensive benchmarks on publicly available gold-standard data sets suggest that this new method of computing the quality statistics, which does not require the end user to select a mass tolerance, is competitive. We investigate the similarity measures in terms of their information content and conclude that they are complementary and can be combined into a scoring function to possibly improve upon the overall accuracy of PMF-based identification methods.

Keywords: Peptide Mass Fingerprinting • Protein Identification • Mass Tolerance

* To whom correspondence should be addressed. E-mail: michael.wagner@cchmc.org. † University of Cincinnati. ‡ Cincinnati Children's Hospital Research Foundation. § University of Cincinnati College of Medicine.
DOI: 10.1021/pr9005525. Journal of Proteome Research 2010, 9, 737-742. Published on Web 12/08/2009. © 2010 American Chemical Society.

Introduction

Mass spectrometry (MS) has become a ubiquitous tool for protein identification, with proteomics core facilities employing MS data obtained for protein fragments (MS1 data) as well as tandem MS data obtained for peptide fragments (MS2 or MS/MS data). Protein identification using MS1 data is also known as Peptide Mass Fingerprinting (PMF).1-5 While the state of the art has arguably moved primarily to tandem mass spectrometry, two recent large-scale proteomics investigations published in Nature used PMF as the protein identification method, indicating that PMF is indeed still used routinely in many laboratories: Gavin et al.6 used PMF for protein identification, while Krogan et al.7 employed both PMF and tandem mass spectrometry data. PMF is often used in conjunction with MS2 methods such as Peptide Fragment Fingerprinting (PFF) or de novo sequencing, both to identify a larger range of proteins and to gain more confidence in the identifications.7 All these approaches rely heavily on computational methods as well as protein sequence databases to retrieve the most likely protein matches.

Protein identification using PMF involves purification of the sample protein using methods such as liquid chromatography or 2D-gel electrophoresis, followed by its digestion into peptides using a proteolytic enzyme such as trypsin. Mass-to-charge (m/z) ratios of the peptides are then measured by mass spectrometry (typically MALDI-TOF or Electrospray Ionization (ESI)) and searched against a database of mass-to-charge ratios of theoretical peptides obtained by in silico digestion of the proteins in the database (candidate proteins). The candidate protein that best "fits" the peptide masses derived from the sample protein is then reported by the database search algorithm. Several database search tools are commercially or publicly available, such as MASCOT,8 MS-Fit,9,10 ProFound,11 and Aldente.12,13 Several other algorithms for PMF database searching have also been proposed in the literature.14-23

The performance of database search algorithms for PMF depends on several parameters (such as the mass accuracy to be used in the search and the number of allowed missed cleavages24,25) which typically need to be set a priori by the end user. The optimal choice of mass tolerance and allowed missed cleavages is usually not obvious to the end user and is often made based on experience with the data or by trial and error. In this article, we introduce a novel quality measure based on the Kolmogorov-Smirnov (KS) test26 which does not require the specification of a mass tolerance by the end user. We discuss using the KS Test as a method to automatically identify an appropriate ("intrinsic") mass tolerance value, which can then be used for computing other similarity measures such as the number of hits. Our motivation for this work was to attempt to relieve the user from needing to specify a mass tolerance to

be used as the cutoff when determining whether a particular observed peak can be attributed to a peptide or not. We implemented and tested the new similarity measure for protein identification using PMF on gold-standard data sets as a first step, with the intention of later generalizing our work to MS/MS data. Our findings indicate that the KS Test is competitive with other statistics such as the number of hits. We have also contrasted the performance of the similarity measures when computed with the KS Test based mass tolerance against the same measures computed for a user-defined mass tolerance; our results indicate that similarity measures computed using the KS Test based mass tolerance outperform those computed for a fixed mass tolerance. We further investigate the complementarity of the various quality measures and show that the correlation between the KS statistic and a measure based on the number of hits is not perfect, indicating that it may hold complementary information which may be exploited by combining statistics into an overall score.
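As background for the database searches discussed above, the in silico tryptic digestion step can be sketched in a few lines. This toy Python function (cleave after K or R unless followed by P, with a simple missed-cleavage expansion) is illustrative only and is not the implementation used by any of the cited search engines:

```python
def tryptic_peptides(sequence, missed_cleavages=0):
    """Toy in silico tryptic digest: cleave after K or R unless followed by P."""
    # indices just after each cleavage site
    sites = [i + 1 for i, aa in enumerate(sequence)
             if aa in "KR" and not (i + 1 < len(sequence) and sequence[i + 1] == "P")]
    bounds = [0] + sites
    if not sites or sites[-1] != len(sequence):
        bounds.append(len(sequence))
    fragments = [sequence[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
    peptides = set()
    # allow up to `missed_cleavages` internal uncut sites per peptide
    for mc in range(missed_cleavages + 1):
        for i in range(len(fragments) - mc):
            peptides.add("".join(fragments[i:i + mc + 1]))
    return sorted(peptides)
```

The `missed_cleavages` parameter mirrors the search setting discussed later; a real digest would additionally convert each peptide to an m/z value and apply modification rules.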

Methods

Data Sets. To realistically evaluate and/or compare algorithms for any application, it is critical to obtain high-confidence ("gold standard") data sets that are sufficiently large and diverse to allow the collection of meaningful performance statistics. For PMF, a gold standard data set consists of mass spectra of (say, tryptically) digested single proteins whose identities have been positively confirmed by orthogonal methods, such as an antibody test, an ELISA assay, MS/MS-based identification, and so forth. While Chamrad et al.27 and Joo et al.28 report benchmarking studies of various search engines on single data sets, unfortunately, these data sets are either very small or not available to the public. We report results on validated data from two different sources. The first data set (the "Aurum" data set by Falkner et al.29) comprises MS1 and MS2 spectra for 246 human proteins. To generate the data, the authors tested each protein sample for purity using SDS-PAGE and then digested the samples using trypsin. MS1 spectra were obtained for four replicates of each protein using a MALDI-TOF/TOF instrument. From this data, we used MS1 peak lists for 239 proteins (1 replicate per protein). Our other high-quality data set (the "Krogan" data set, Krogan et al.7) stems from a large-scale proteomic study of yeast protein complexes. Here, spectra were acquired for bait and/or prey proteins purified using Tandem Affinity Purification (TAP).30,31 From this data, we extracted the MS1 peak lists for 416 bait proteins identified by ProFound with high confidence, assuming that the identity of the bait proteins can be trusted with reasonable certainty. Of the extracted data, spectra for 313 nonredundant proteins (no sequence homology) were used for algorithm development, while 103 were used for testing.

Finally, we also downloaded a third data set from Gavin et al.,6 which consists of MS1 mass spectra generated for 52 000 samples purified using TAP. Around 36 000 identifications were made by the authors using PMF with the ProFound/Knexus software.11 Since no orthogonal validation was performed and the identification information relies solely on a single database search algorithm (ProFound), thus reflecting that particular software's biases as well as its false positive rate, this is clearly not as reliable a data set as the other two mentioned above. From the identifications made available by the authors, we obtained spectra for 878 proteins, making this the largest of the three data sets. Our criteria for the selection of peak lists were that each included spectrum should account for only one identification and that the proteins in the data set should be nonredundant with respect to each other. Given the limited reliability of this data, but also because it is relatively large, we used the Gavin data set only for the derivation of background mass error distributions (see Generation of Background Error Distributions).

Database Searching Using Standard Similarity Statistics. For each mass fingerprint in the data sets, we ranked the database proteins according to four different similarity measures that are typically computed in the context of PMF, either as inputs to a scoring function or merely as informative measures displayed with the results of a search. Of these similarity measures, the "number of hits" lies at the core of database search tools such as MASCOT and ProFound. Given a sample protein s and a database protein p, we define the "number of hits" H as the number of peaks in the experimental peak list dE(s) for which there is a peptide mass in dT(p) (the theoretical digest of p) within the prespecified error tolerance:


H = hits(dE(s), dT(p))

where hits is the (mass tolerance-dependent) function that computes the number of matching peaks. We define the "Digest Fraction" as the ratio of the number of hits to the size of the theoretical digest of the candidate protein, that is,

Df = H / |dT(p)|

The "Sequence Coverage" is defined as the number of residues in the matching peptides, that is,

C = Σ_{i=1}^{H} length(Hi)

where length is the number of residues of the peptide that corresponds to the hit. Finally, we define the "Coverage Fraction" as the ratio of the "Sequence Coverage" to the number of residues in the database protein:

Cf = C / length(p)
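Under the definitions above, the four similarity measures can be sketched as follows. The nearest-match-within-tolerance rule and the (mass, residue length) representation of the digest are simplifying assumptions for illustration, not the exact matching logic of any particular search engine:

```python
from bisect import bisect_left

def similarity_measures(expt_peaks, theo_peptides, protein_len, tol):
    """Compute H, Df, C, Cf for one candidate protein.
    theo_peptides: list of (mass, residue_length) pairs from the in silico digest;
    expt_peaks: observed m/z values; tol: mass tolerance in Da."""
    theo = sorted(theo_peptides)                 # sorted by mass
    masses = [m for m, _ in theo]
    matched = []
    for mz in expt_peaks:
        i = bisect_left(masses, mz)
        # the closest theoretical mass is at index i or i - 1
        best = None
        for j in (i - 1, i):
            if 0 <= j < len(masses):
                d = abs(masses[j] - mz)
                if best is None or d < best[0]:
                    best = (d, j)
        if best is not None and best[0] <= tol:
            matched.append(theo[best[1]])
    H = len(matched)                             # number of hits
    Df = H / len(theo)                           # digest fraction
    C = sum(length for _, length in matched)     # sequence coverage (residues)
    Cf = C / protein_len                         # coverage fraction
    return H, Df, C, Cf
```

The peak and mass values used in testing are illustrative; real peak lists would also need deisotoping and charge-state handling before matching.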

A common observation with the similarity measures described above is their tendency to be biased toward proteins of extreme sizes. Therefore, raw values of the individual similarity measures were normalized with respect to the size of the protein (see Supporting Information). All searches for the peak lists from the Krogan and Gavin data sets were performed using the NCBInr yeast database consisting of 9672 proteins. The Aurum data set was searched against the human subset of the NCBInr database (140 562 proteins). The similarity measures were computed for no missed cleavages as well as for up to one missed cleavage. Since the modifications used for determining the identities of the proteins in the data sets were not described in the respective articles, we made the standard assumption of variable oxidation of methionine and complete carbamidomethylation of cysteine for our searches. Methionine oxidation is often observed in samples purified using SDS gels, and sample proteins are often treated with iodoacetamide during the PMF process to cleave any disulfide bonds between two cysteine residues.

For the database searches, we assume that only one protein is present in the sample. We believe this to be a reasonable assumption for our gold standard data sets, since the sample proteins in the Krogan and Gavin data sets were purified using gel electrophoresis and can therefore be assumed to be purified proteins. As a consequence, we consider a top match to be a true match only if its assignment is unique, that is, there is no other match with the same score. This is a more stringent criterion than would typically be used in practice, as search engine users quite often consider highly ranked proteins with similar scores as viable candidates for further studies.

Figure 1. Example of mass error distributions. (A) Empirical PDFs of absolute mass errors for true (BSA) and false (CPVL) matches. (B) Empirical cumulative distribution functions (ECDF) of absolute error values for the BSA and CPVL matches and the background distribution.

Kolmogorov-Smirnov-Based Similarity Measure. As previously mentioned, tools such as MASCOT and ProFound, as well as the similarity measures described in the previous section, require a user-specified mass tolerance to compute their scores. Varying the mass tolerance value results in significant changes in database search success (see Supporting Information). While mass tolerance values are critical to the database search, their choice is often not straightforward. With the idea of avoiding the difficult choice of mass tolerance altogether, we propose to compute the empirical distribution of mass errors (absolute differences between the experimental m/z values and their closest matches in the theoretical digest under consideration).
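Computing this empirical distribution of mass errors amounts to a nearest-neighbor search against the sorted theoretical digest; a minimal sketch (the peak and mass values in the usage example are illustrative):

```python
from bisect import bisect_left

def mass_error_distribution(expt_peaks, theo_masses):
    """Empirical error distribution: for each observed m/z, the absolute
    difference to the closest mass in the theoretical digest."""
    theo = sorted(theo_masses)
    errors = []
    for mz in expt_peaks:
        i = bisect_left(theo, mz)
        # the nearest theoretical mass is at index i or i - 1
        errors.append(min(abs(theo[j] - mz)
                          for j in (i - 1, i) if 0 <= j < len(theo)))
    return sorted(errors)
```

A true match should concentrate these errors near zero, while an unrelated protein yields a broad spread, which is exactly the contrast the KS-based score below exploits.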
We now expect the empirical error distribution (EED) of a true match (the pairing of a given experimental peak list with the theoretical digest of the protein it was derived from) to exhibit significantly increased density close to zero compared to that of a false match (the pairing of an experimental peak list with an unrelated protein), just as one would expect more hits for a true match than for a false match. An example of this can be seen in Figure 1A, where the EED for the true match (BSA [GenBank: NP_851335]) has a sharp peak close to zero, while the EED for the false match (CPVL [GenBank: AAQ88913]) has only a small peak at low mass errors, due to random matches, and a heavy tail, that is, a larger number of high mass errors. However, these EEDs are highly dependent on the number of theoretically possible tryptic peptides in a protein: the larger the theoretical digest of a protein, the higher the probability that an experimentally determined peak is close to the m/z value of an unrelated peptide. A direct comparison of EEDs is thus clearly not appropriate. We therefore compare a particular EED with a background distribution of mass errors generated from known false matches of similar size, the idea being that the EEDs for false matches will tend to follow the background error distribution closely, while the EED for a true match will differ significantly from it. Figure 1B illustrates one such example, where the cumulative error distribution for the true match (BSA) lies significantly above the background distribution, while that for the false match (CPVL) stays closer to the background error distribution.

One way of quantifying the difference between two distributions is a goodness-of-fit test such as the Kolmogorov-Smirnov test (KS Test).26 The KS Test is a nonparametric statistical test which computes as its test statistic the maximum distance between the empirical cumulative distribution functions of two given samples. Since we are interested only in those matches where the experimental error distribution lies above the background error distribution, we perform a one-sided KS Test and compute the D value as follows:

D = max{F(x) - Fn(x), 0}

where Fn(x) is the background distribution and F(x) is the error distribution for the candidate protein. Further, since we are interested only in matches with low mass errors, we limit the computation of the D statistic to an error range with some reasonable upper bound xmax on the mass accuracy (say, 0.5 Da). The D statistic obtained as described above is used as a similarity measure indicating how much the experimental mass error distribution deviates from the background distribution. Our hypothesis is that larger values of D at small absolute errors are an indication of a true match. Furthermore, as we will show in later sections, the location of the maximum deviation from the background distribution,

x̂ = arg max_{0 ≤ x ≤ xmax} {F(x) - Fn(x), 0}     (1)

can be used as an indication of an "intrinsic" mass tolerance.

Figure 2. Empirical cumulative distributions for True (BSA) and False (NP_003487) matches as compared to their respective background distributions.

Generation of Background Error Distributions. To compute background error distributions, we used the 878 peak lists from the Gavin data set.6 For each peak list in the data set, we computed the distances to the closest peaks in the theoretical digests of false matches. An empirical background error distribution can then be computed by combining the mass errors

for all false matches and all peak lists. We observed that the error distributions for proteins of different sizes are significantly different. For example, Figure 2 shows the cumulative mass error distributions for the matches between the example peak list for BSA (583 residues) and the theoretical digests of BSA [GenBank: NP_851335] and NP_003487 [GenBank: NP_003487] (3830 residues). Given the size difference, it is not surprising that the number of hits is significantly larger for NP_003487 at any mass tolerance cutoff. This example illustrates that a single background distribution does not represent the variability observed in the error distributions across the entire range of candidate protein sizes. One way of accounting for this size-dependent variation is to generate different background distributions for different candidate protein size ranges; the KS Test statistic for each match is then computed using the background distribution corresponding to the candidate protein's size, effectively normalizing the D statistic against protein size. In the example of Figure 2, BSA is compared with the background distribution for proteins with theoretical digest sizes in the range of 225-275, while NP_003487 is compared with the background distribution for proteins with theoretical digest sizes in the range of 1800-2000; the resulting D statistics with respect to these background distributions are 0.22 for BSA and 0.09 for NP_003487. The ranges of theoretical digest sizes for generating background distributions were chosen by visual inspection of the distributions. Overlap was maintained between adjacent ranges to ensure a smooth transition between adjacent distributions.
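The size-binned background construction described above can be sketched as follows. The bin edges and the midpoint-based bin selection rule are illustrative assumptions (the paper chose its ranges by visual inspection), and the error values in the usage example are hypothetical:

```python
def build_backgrounds(false_match_errors, size_ranges):
    """Pool mass errors from known false matches into (possibly overlapping)
    bins keyed by theoretical digest size.
    false_match_errors: list of (digest_size, errors) pairs."""
    backgrounds = {}
    for lo, hi in size_ranges:
        pooled = []
        for size, errors in false_match_errors:
            if lo <= size <= hi:
                pooled.extend(errors)
        backgrounds[(lo, hi)] = sorted(pooled)
    return backgrounds

def background_for(backgrounds, digest_size):
    """Pick the background whose bin midpoint is closest to the candidate's digest size."""
    lo, hi = min(backgrounds, key=lambda b: abs((b[0] + b[1]) / 2 - digest_size))
    return backgrounds[(lo, hi)]
```

Because the bins overlap, a false match near a bin boundary contributes to both neighboring backgrounds, which smooths the transition between adjacent distributions.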
Finally, we also checked the stability of the background distributions by randomly splitting the data sets into two. Quantile-Quantile plots and visual inspection of the subsequently generated distributions indicated that the differences in distributions were insignificant and stable even with half the data (see Supporting Information). Computing Similarity Measures without a Fixed Mass Tolerance. Since the KS Test is computed at a mass error value where the difference between the error distributions for a prospective match and a false match is the greatest, we propose to use this mass error (xˆ from eq 1) as the threshold level for computing the peptide matches or hits. The hits thus computed are then used for computing the similarity measures such as the number of hits, coverage, and so forth.
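The one-sided D statistic and the intrinsic tolerance x̂ of eq 1 can be sketched as follows; evaluating the ECDF difference on a fixed grid up to xmax is a simplification of the exact KS computation, and the sample values in the usage example are illustrative:

```python
from bisect import bisect_right

def ecdf(sorted_sample, x):
    """Empirical CDF of a sorted sample evaluated at x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)

def ks_statistic(errors, background, x_max=0.5, grid=500):
    """One-sided KS statistic D = max over 0 <= x <= x_max of (F(x) - Fn(x), 0),
    where F is the candidate's error ECDF and Fn the background ECDF,
    together with the arg-max location x_hat (the "intrinsic" mass tolerance)."""
    errs, bg = sorted(errors), sorted(background)
    best_d, x_hat = 0.0, 0.0
    for k in range(grid + 1):
        x = x_max * k / grid
        d = ecdf(errs, x) - ecdf(bg, x)
        if d > best_d:
            best_d, x_hat = d, x
    return best_d, x_hat
```

A candidate whose errors cluster near zero relative to the background yields a large D at a small x_hat, and that x_hat can then be reused as the tolerance when counting hits, coverage, and so forth.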

Results

Performance of Similarity Measures. To assess the performance of each similarity measure computed using the mass tolerance cutoff determined by the KS Test, we counted the number of proteins each similarity measure was able to identify correctly. We then contrasted the performance of similarity measures computed with the KS Test defined mass tolerance against similarity measures computed for a fixed mass tolerance. For the user-defined mass tolerance, we chose a value of 0.5 Da, since a majority of the similarity measures, as well as the database search tools MASCOT and ProFound, identified the maximum number of proteins at this mass tolerance value (see Supporting Information). It must be noted that, while we iteratively searched over various mass tolerances and selected the one with the best results for our comparison, such onerous iterative searches are not normally performed during the protein identification process. The results for the Aurum and Krogan test data


Table 1. Number of Proteins Correctly Detected at the Topmost Position for the 5 Different Ranking Statistics Discussed and Different Settings of Missed Cleavages and Mass Tolerancesa

(a) User-Defined Mass Tolerance of 0.5 Da

missed cleavages | data set | KS Test | no. hits | digest frac. | cov. | cov. frac.
mc ≤ 1 | Krogan | 102 | 100 | 93 | 93 | 79
mc ≤ 1 | Aurum | 177 | 182 | 162 | 132 | 97
mc = 0 | Krogan | 103 | 103 | 95 | 94 | 81
mc = 0 | Aurum | 194 | 205 | 191 | 180 | 160

(b) KS Test Determined Intrinsic Mass Tolerance

missed cleavages | data set | KS Test | no. hits | digest frac. | cov. | cov. frac.
mc ≤ 1 | Krogan | 102 | 100 | 96 | 96 | 97
mc ≤ 1 | Aurum | 177 | 189 | 167 | 170 | 173
mc = 0 | Krogan | 103 | 103 | 100 | 99 | 99
mc = 0 | Aurum | 194 | 204 | 198 | 201 | 205

a Note the competitive performance of the KS Test as well as the superior performance of the intrinsic mass tolerance.

Figure 3. The distribution of “intrinsic” mass tolerance values as computed by the KS-Test. The distributions were computed for true matches in the Aurum data set.

sets are shown in Table 1. Table 1a contains the performance statistics for similarity measures computed for a 0.5 Da mass tolerance, while Table 1b contains the results for similarity measures computed for the mass tolerance determined by the KS Test. The table also contains the results for the KS Test as a similarity measure; note that the KS Test results are identical in both sections of the table, since the KS Test is computed for a maximum cutoff level of 0.5 Da that is fixed in the algorithm. Figure 3 shows the distributions of the mass error thresholds obtained for the Aurum data set using the KS Test for missed cleavages of 0 and 1. The distributions indicate that the average mass tolerance for missed cleavages of 0 is 0.13 Da, while for missed cleavages of up to 1 it is 0.15 Da. The median mass tolerances are 0.07 and 0.1 Da for missed cleavages of 0 and 1, respectively. The distributions indicate that, while most peak lists have true matches within a mass error threshold of less than 0.1 Da, a significant number of peak lists have higher cutoff mass errors, and identifications for these peak lists may be missed if low mass tolerance values are used. We also investigated the performance of the commonly used tools MASCOT and ProFound on these data sets. As expected, we observed that MASCOT and ProFound perform better than the suggested features alone, although the similarity measures are fairly competitive. This evaluation further motivates our belief that a scoring function combining the suggested features may not only remove the requirement of a user-defined mass tolerance, but may also improve upon the performance of database searches using PMF. Further investigations of the complementarity of the features are discussed in the next section.

Table 2. Number of True Positive Identifications at p < 0.05a

(a) User-Defined Mass Tolerance of 0.5 Da

missed cleavages | data set | KS Test | no. hits | digest frac. | cov. | cov. frac.
mc ≤ 1 | Krogan | 77 | 60 | 41 | 84 | 11
mc ≤ 1 | Aurum | 116 | 154 | 151 | 132 | 77
mc = 0 | Krogan | 78 | 91 | 96 | 62 | 86
mc = 0 | Aurum | 119 | 189 | 187 | 181 | 153

(b) KS Test Determined Intrinsic Mass Tolerance

missed cleavages | data set | KS Test | no. hits | digest frac. | cov. | cov. frac.
mc ≤ 1 | Krogan | 77 | 75 | 77 | 76 | 74
mc ≤ 1 | Aurum | 116 | 159 | 155 | 149 | 148
mc = 0 | Krogan | 78 | 84 | 84 | 88 | 88
mc = 0 | Aurum | 119 | 187 | 190 | 193 | 194

a Significance estimates for individual quality measures, where normalized similarity measure values are used to compute the number of true and false positives. Matches with scores (p-values) less than 0.05 were considered significant.
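The empirical p-values underlying Table 2 (a normalized score compared against a null distribution of top-scoring false-match scores, as described in the text and Supporting Information) can be sketched generically. The function name and the null scores in the usage example are hypothetical, not the paper's exact normalization:

```python
from bisect import bisect_left

def empirical_p_value(score, null_scores):
    """Fraction of null (top-scoring false match) scores at least as large
    as the observed score; small values indicate a significant match."""
    nulls = sorted(null_scores)
    n_at_least = len(nulls) - bisect_left(nulls, score)
    return n_at_least / len(nulls)
```

In practice a pseudocount correction such as (1 + count)/(1 + N) is often used so that no match receives a p-value of exactly zero.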

As discussed previously, the normalized feature values are effectively p-values computed using null distributions generated from a large number of top-scoring false matches. We also investigated the efficacy of each feature in terms of identifying a protein with a p-value less than 0.05. To compute the efficacy of the features, we computed the number of true positives, that is, the proteins correctly identified at the top position with p-values less than 0.05. The results are given in Table 2.

Complementarity of Similarity Measures. Next, we investigated whether the similarity measures computed with mass tolerances determined by the KS Test capture orthogonal information from the experimental data. Two example Venn diagrams of identification results for the Aurum data, illustrated in Figure 4, indicate that each similarity measure captures information which may not be captured by the others. Figure 4A illustrates that, for missed cleavages of 0, the KS Test and the "number of hits" identify 1 and 10 proteins, respectively, which the other similarity measure could not identify. The coverage features, which perform better when normalized with the length of the protein, were also investigated with respect to the KS Test and the number of hits. We also observed that different proteins were identified by the same features for different numbers of missed cleavages, indicating complementarity between the same features computed using different missed cleavage settings. The difference in the information content of the features was also investigated by computing the correlation between each pair of similarity measures for true matches, shown in Table 3: Table 3a gives the correlations between similarity measures computed for zero missed cleavages, and Table 3b gives the correlations between similarity measures computed for zero missed cleavages and for up to one missed cleavage. These results provide motivation for the final goal of our project, which will be to combine the complementary features into a single scoring function for more accurate protein identification.

Figure 4. Venn diagrams illustrating the complementarity between the KS Test, number of hits, and coverage fraction similarity measures computed for missed cleavages of 0 and 1.

Discussion

A well-designed PMF database search tool should be user-friendly and require users to input only information that is actually known to them. Parameters such as the mass tolerance are important, yet often unknown to the users of database search tools, and ad hoc values input by users may dramatically affect the performance of the search. In this article, we have introduced a method that eliminates the need for a user-defined mass error cutoff for determining peptide matches. As we demonstrate using careful validation methods on gold standard data sets, the similarity measures, when computed using the mass tolerance determined by this method, identify more proteins at the topmost position than the same measures computed using a fixed mass tolerance. We also show our novel quality statistic to be complementary to other features such as the "number of hits". The results are encouraging and suggest that, by incorporating the Kolmogorov-Smirnov similarity measure with other measures such as sequence coverage and the number of hits into a single scoring function (e.g., using machine learning tools), one can potentially improve the overall accuracy of PMF database search tools. Further, by using the mass tolerance level determined by the Kolmogorov-Smirnov test, not only can we design a search tool that avoids the difficult choice of mass tolerance, we can also potentially improve the overall efficacy of the algorithm.

Table 3. Correlation Coefficients for Pairs of the 5 Similarity Measures Computed Using Different Missed Cleavage Settings

(a) Correlation between Similarity Measures for Zero Missed Cleavages

 | KS Test | no. hits | digest frac. | coverage | cov. frac.
KS Test | 1 | 0.85 | 0.85 | 0.62 | 0.61
no. hits |  | 1 | 1 | 0.84 | 0.74
digest frac. |  |  | 1 | 0.83 | 0.72
coverage |  |  |  | 1 | 0.79
cov. frac. |  |  |  |  | 1

(b) Correlation between Similarity Measures Computed for Allowed Missed Cleavages of 0 and of up to 1 (rows: missed cleavages = 0; columns: missed cleavages ≤ 1)

 | KS Test | no. hits | digest frac. | coverage | cov. frac.
KS Test | 0.92 | 0.69 | 0.70 | 0.62 | 0.64
no. hits | 0.71 | 0.94 | 0.93 | 0.82 | 0.80
digest frac. | 0.71 | 0.93 | 0.93 | 0.81 | 0.79
coverage | 0.62 | 0.81 | 0.81 | 0.93 | 0.85
cov. frac. | 0.58 | 0.66 | 0.66 | 0.68 | 0.87

Acknowledgment. The authors gratefully acknowledge expert technical support from Prakash Velayutham (Cincinnati Children's Division of Biomedical Informatics) as well as support from the Division of Biomedical Informatics, which enabled the use of its Linux cluster for all computations related to this work. We thank Drs. Patrick Limbach and Ken Greis for valuable discussions and insights. Finally, we would like to thank the 2 anonymous referees for insightful comments that have led to significant improvements in the presentation of the material.

Supporting Information Available: A description of the normalization of individual similarity measures (supplementary Section A); an investigation of the effect of varying mass tolerance values on database search success (supplementary Section B); and stability investigations for the background error distributions (supplementary Section C). This material is available free of charge via the Internet at http://pubs.acs.org.

References

(1) Henzel, W. J.; Billeci, T. M.; Stults, J. T.; Wong, S. C.; Grimley, C.; Watanabe, C. Identifying proteins from two-dimensional gels by molecular mass searching of peptide fragments in protein sequence databases. Proc. Natl. Acad. Sci. U.S.A. 1993, 90 (11), 5011–5015.
(2) James, P.; Quadroni, M.; Carafoli, E.; Gonnet, G. Protein identification by mass profile fingerprinting. Biochem. Biophys. Res. Commun. 1993, 195 (1), 58–64.
(3) Mann, M.; Hojrup, P.; Roepstorff, P. Use of mass spectrometric molecular weight information to identify proteins in sequence databases. Biol. Mass Spectrom. 1993, 22 (6), 338–345.
(4) Pappin, D. J.; Hojrup, P.; Bleasby, A. J. Rapid identification of proteins by peptide-mass fingerprinting. Curr. Biol. 1993, 3 (6), 327–332.
(5) Yates, J. R., III; Speicher, S.; Griffin, P. R.; Hunkapiller, T. Peptide mass maps: a highly informative approach to protein identification. Anal. Biochem. 1993, 214 (2), 397–408.
(6) Gavin, A.-C.; Aloy, P.; Grandi, P.; Krause, R.; Boesche, M.; Marzioch, M.; Rau, C.; Jensen, L. J.; Bastuck, S.; Dumpelfeld, B.; et al. Proteome survey reveals modularity of the yeast cell machinery. Nature 2006, 440 (7084), 631–636.
(7) Krogan, N. J.; Cagney, G.; Yu, H.; Zhong, G.; Guo, X.; Ignatchenko, A.; Li, J.; Pu, S.; Datta, N.; Tikuisis, A. P.; et al. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature 2006, 440 (7084), 637–643.
(8) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551–3567.
(9) Clauser, K. R.; Baker, P.; Burlingame, A. L. Role of accurate mass measurement (±10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal. Chem. 1999, 71 (14), 2871–2882.


Journal of Proteome Research • Vol. 9, No. 2, 2010

Jain and Wagner

(10) Baker, P. R.; Clauser, K. R. http://prospector.ucsf.edu/.
(11) Zhang, W.; Chait, B. T. ProFound: an expert system for protein identification using mass spectrometric peptide mapping information. Anal. Chem. 2000, 72 (11), 2482–2489.
(12) Tuloup, M.; Hernandez, C.; Coro, I.; Hoogland, C.; Binz, P.-A.; Appel, R. D. Aldente and BioGraph: An improved peptide mass fingerprinting protein identification environment. In Swiss Proteomics Society 2003 Congress; FontisMedia: Basel, Switzerland, 2003.
(13) Gasteiger, E.; Hoogland, C.; Gattiker, A.; Duvaud, S.; Wilkins, M. R.; Appel, R. D.; Bairoch, A. Protein Identification and Analysis Tools on the ExPASy Server. In The Proteomics Protocols Handbook; Walker, J. M., Ed.; Humana Press: Totowa, NJ, 2005.
(14) PepMapper. http://www.bioinf.manchester.ac.uk/mapper/.
(15) PeptideSearch. http://www.narrador.embl-heidelberg.de/GroupPages/PageLink/peptidesearchpage.html.
(16) Berndt, P.; Hobohm, U.; Langen, H. Reliable automatic protein identification from matrix-assisted laser desorption/ionization mass spectrometric peptide fingerprints. Electrophoresis 1999, 20 (18), 3521–3526.
(17) Egelhofer, V.; Bussow, K.; Luebbert, C.; Lehrach, H.; Nordhoff, E. Improvements in protein identification by MALDI-TOF-MS peptide mapping. Anal. Chem. 2000, 72 (13), 2741–2750.
(18) Egelhofer, V.; Gobom, J.; Seitz, H.; Giavalisco, P.; Lehrach, H.; Nordhoff, E. Protein identification by MALDI-TOF-MS peptide mapping: a new strategy. Anal. Chem. 2002, 74 (8), 1760–1771.
(19) Kaltenbach, H.-M.; Wilke, A.; Bocker, S. SAMPI: protein identification with mass spectra alignment. BMC Bioinf. 2007, 8 (1), 102.
(20) Magnin, J.; Masselot, A.; Menzel, C.; Colinge, J. OLAV-PMF: a novel scoring scheme for high-throughput peptide mass fingerprinting. J. Proteome Res. 2004, 3 (1), 55–60.
(21) Palagi, P. M.; Hernandez, P.; Walther, D.; Appel, R. D. Proteome informatics I: Bioinformatics tools for processing experimental data. Proteomics 2006, 6 (20), 5435–5444.
(22) Parker, K. Scoring methods in MALDI peptide mass fingerprinting: ChemScore, and the ChemApplex program. J. Am. Soc. Mass Spectrom. 2002, 13 (1), 22–39.
(23) Siepen, J. A.; Keevil, E. J.; Knight, D.; Hubbard, S. J. Prediction of missed cleavage sites in tryptic peptides aids protein identification in proteomics. J. Proteome Res. 2007, 6 (1), 399–408.
(24) Ossipova, E.; Fenyö, D.; Eriksson, J. Optimizing search conditions for the mass fingerprint-based identification of proteins. Proteomics 2006, 6 (7), 2079–2085.
(25) Yong-Bo, P.; Yong-Ping, M.; Yong-Peng, X.; Zong-Yin, Q. Optimization of bioinformatics analysis conditions by peptide mass fingerprint identify proteins. Chin. J. Anal. Chem. 2008, 36 (4), 467–472.
(26) Daniel, W. W. Biostatistics: A Foundation for Analysis in the Health Sciences, 8th ed.; J. Wiley: Hoboken, NJ, 2004.
(27) Chamrad, D. C.; Korting, G.; Stuhler, K.; Meyer, H. E.; Klose, J.; Bluggel, M. Evaluation of algorithms for protein identification from sequence databases using mass spectrometry data. Proteomics 2004, 4 (3), 619–628.
(28) Joo, W.-A.; Lee, J.-B.; Park, M.; Lee, J.-W.; Kim, H.-J.; Kim, C.-W. Comparison of search engine contributions in protein mass fingerprinting for protein identification. Biotechnol. Bioprocess Eng. 2007, 12 (2), 125–130.
(29) Falkner, J. A.; Kachman, M.; Veine, D. M.; Walker, A.; Strahler, J. R.; Andrews, P. C. Validated MALDI-TOF/TOF mass spectra for protein standards. J. Am. Soc. Mass Spectrom. 2007, 18 (5), 850–855.
(30) Puig, O.; Caspary, F.; Rigaut, G.; Rutz, B.; Bouveret, E.; Bragado-Nilsson, E.; Wilm, M.; Seraphin, B. The tandem affinity purification (TAP) method: a general procedure of protein complex purification. Methods 2001, 24 (3), 218–229.
(31) Rigaut, G.; Shevchenko, A.; Rutz, B.; Wilm, M.; Mann, M.; Seraphin, B. A generic protein purification method for protein complex characterization and proteome exploration. Nat. Biotechnol. 1999, 17 (10), 1030–1032.
