Estimating the Statistical Significance of Peptide Identifications from Shotgun Proteomics Experiments Richard E. Higgs,* Michael D. Knierman, Angela Bonner Freeman, Lawrence M. Gelbert, Sandeep T. Patil, and John E. Hale Lilly Research Laboratories, MS 1533, Lilly Corporate Center, Indianapolis, Indiana 46285 Received October 10, 2006
We present a wrapper-based approach to estimate and control the false discovery rate for peptide identifications using the outputs from multiple commercially available MS/MS search engines. Features of the approach include the flexibility to combine output from multiple search engines with sequence- and spectrum-derived features in a classification model to produce a score associated with correct peptide identifications. The distribution of this classification model score from a reversed database search is taken as the null distribution for estimating p-values and false discovery rates using a simple and established statistical procedure. Results from 10 analyses of rat sera on an LTQ-FT mass spectrometer indicate that the method is well calibrated for controlling the proportion of false positives in a set of reported peptide identifications while correctly identifying more peptides than rule-based methods using one search engine alone.
Keywords: peptide identification, false discovery rate, Sequest, X! Tandem, statistical significance, proteomics
* To whom correspondence should be addressed. E-mail: [email protected].
Journal of Proteome Research 2007, 6, 1758-1767. Published on Web 03/31/2007. 10.1021/pr0605320 CCC: $37.00 © 2007 American Chemical Society
Introduction
A frequently stated goal of proteomics is the identification and quantification of proteins contained in a biological sample. Large-scale experiments with highly complex protein mixtures are now routinely done using LC-MS/MS methods for both peptide/protein identification and quantification. Protein digests are separated in a single- or multidimensional chromatography step that is coupled to a mass spectrometer capable of dynamically generating peptide fragmentation spectra. Peptide identifications are typically made by comparing measured MS/MS spectra with predicted MS/MS spectra derived from a protein sequence database using programs like MASCOT, Sequest, or X! Tandem.1-4 One of the many challenges with these large-scale experiments is to find the correctly identified peptides while maintaining some control over false positive identifications. The determination of a correct identification is a subjective assessment based on parameters like the number of MS/MS ions explained by the proposed peptide sequence, biochemistry rules (e.g., the “proline rule”), delta mass values between the measured peptide and the proposed peptide sequence, predicted versus measured peptide retention, and so forth.5 This determination becomes less subjective with high mass accuracy measurements using Fourier transform ion cyclotron resonance (FT-ICR) instruments, but for most investigators, nominal mass accuracy from ion trap mass spectrometers is typical. Many rules for interpreting the scores produced by programs like Sequest have been reported, but in general, there has been a lack of consensus within the broader scientific community as to which set of identification rules to use.6-9 Furthermore, it is not clear that the rules developed under one specific scenario of sample complexity, instrumentation, protein database, allowable peptide modifications, and so forth will be applicable to other experimental conditions. Eriksson and Fenyo demonstrated that the number of protein sequences in the database used for identification has a significant effect on estimates of statistical significance for identifications.10 A more recent report also highlights inconsistent performance of rules for Sequest when identifications from human plasma samples are compared with those from human cell culture samples.11 The lack of control for false positive identifications has led to a proliferation of reports that list incorrectly identified peptides, provide no statistical estimate of false positive error rates, and omit details of the search parameters used for the identifications. This perception prompted a meeting of opinion leaders and journal editors in May 2005 (the Paris meeting) to discuss guidelines for the publication of identified peptides.12 Several recent reports have discussed the estimation of various probabilities as a guide for peptide identification.8,13 Some approaches have sought to control the false positive rate (FPR), Pr(correct identification predicted given truly incorrect identification), whereas we advocate here the false discovery rate (FDR), roughly defined as the proportion of significant results that are expected to be false discoveries, as the more natural error rate to control.14 The FDR is now routinely used in highly multiplexed assays (e.g., microarrays and imaging) to control the proportion of false discoveries in a claimed set of findings. The FDR is similar to one minus the positive predictive value (PPV), Pr(correct identification given correct identification predicted), from the classification and diagnostic literature. The FPR and FDR are
fundamentally different probabilities that are conditioned on different events. The FDR, which represents the proportion of false discoveries within a set of peptide identifications, is arguably the most interesting error rate to an investigator: “What proportion of these identifications is correct?” Weatherly et al. have taken an FDR based approach with MASCOT results while Peng et al. have reported on a fixed set of rules to control the number of false identifications for Sequest results.8,13 In this report, we describe a general framework and statistical strategy to estimate and control false positive peptide identifications using output from multiple MS/MS search engines and other relevant features (e.g., |∆[M + H]+|) in a data-driven manner that does not directly rely on a predefined set of rules. We take a hypothesis testing approach with a null hypothesis, H0: the match between the spectrum and peptide is due to chance versus an alternative hypothesis, HA: the match between the spectrum and peptide is causal. In contrast to previously reported hypothesis testing approaches like SCOPE,15 ProbID,16 Probity,17 and the more recently reported work by Cannon et al.18 and Fridman et al.,19 we are not deriving probabilistic models of MS/MS peaks but rather are utilizing existing spectral search engines in a manner that may be considered an extension of PeptideProphet.21 The motivation for this decision is based on the fact that these search engines are rapidly evolving and becoming quite specialized with encoded biochemical knowledge regarding peptide fragmentation, chemical modifications, mutations, and so forth, and therefore our interest is in a modular and flexible strategy that incorporates multiple search engines as generators of scores for MS/MS identifications. The results presented in this report were generated using Sequest and X! Tandem MS/MS searching tools, although the described approach is amenable to any number of different searching algorithms. 
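The distinction between the FPR and the FDR drawn in this Introduction can be made concrete with a small numerical sketch. The counts below are hypothetical and chosen only for illustration; the function name `error_rates` is likewise an assumption, and the observed proportion of false discoveries is used here as a stand-in for the FDR, which is strictly an expected value.

```python
def error_rates(true_pos, false_pos, true_neg):
    """Return (FPR, FDR, PPV) for a set of accept/reject decisions.

    true_pos  : correct identifications that were accepted
    false_pos : incorrect identifications that were accepted
    true_neg  : incorrect identifications that were rejected
    """
    # FPR is conditioned on the identification being truly incorrect.
    fpr = false_pos / (false_pos + true_neg)
    # FDR (observed proportion) is conditioned on the identification being accepted.
    fdr = false_pos / (false_pos + true_pos)
    # PPV is the complement of the observed false discovery proportion.
    ppv = true_pos / (true_pos + false_pos)
    return fpr, fdr, ppv


# Hypothetical counts, not taken from the study.
fpr, fdr, ppv = error_rates(true_pos=900, false_pos=100, true_neg=9900)
print(f"FPR = {fpr:.3f}, FDR = {fdr:.3f}, PPV = {ppv:.3f}")
```

With these made-up counts, a false positive rate of 1% coexists with a false discovery proportion of 10%, which is why the two quantities answer different questions about a reported list of peptides.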
The combination of Sequest and X! Tandem provides redundancy for correct identifications that may be missed by one of the algorithms, and the addition of X! Tandem allows for the identification of mutations as well as chemical modifications that would be impractical or impossible to identify using Sequest alone. Similar to previously reported work on postprocessing classifiers like support vector machines, artificial neural networks, and random forests of trees, we also use a supervised approach by training a random forest tree model to recognize correct and incorrect peptide identifications by combining the outputs from Sequest, X! Tandem, as well as other features like charge state, the number of missed cleavages, features of the MS/MS spectrum, and so forth.20,22,23 We use classification model output scores from a reverse database search using the identical searching parameters as the proper search to obtain a null distribution of scores from which p-values for the identifications can be estimated. Last, FDR estimates (q-values) can be made directly from the p-values using any of the methods described in the statistical literature.24,25,26 Peptides with a q-value less than some user-defined threshold, for example, 0.05, can then be taken as a set of identifications for which approximately, for example, 95%, would on average prove to be correct upon confirmation. Features of this strategy include (1) the statistical assessment is decoupled from the MS/MS scoring algorithm, (2) the flexibility to accommodate any number of peptide identification algorithms, (3) p-values are estimated in a data-dependent manner from a null distribution tied directly to the experimental conditions, protein database, and searching parameters used for the study and not from an a priori defined set of rules derived on a potentially very
different experimental system, (4) all identifications can be rank ordered on the basis of the estimated p-values, and (5) FDR estimates (q-values) are used to control the proportion of false discoveries in a reported set of peptide identifications from a study. Using FT-ICR accurate mass data and deviations from predicted peptide retention values from a set of rat serum digests, we estimate the degree to which the proposed method of estimating FDRs is calibrated and how it compares to an a priori defined rule-based procedure designed to control false discoveries. The advantages of using multiple search engines, Sequest and X! Tandem in this example, are demonstrated with an analysis of human cerebrospinal fluid from 16 human subjects in which we compare the results of a peptide mutation identified by X! Tandem to DNA genotyping results obtained from each subject. The potential for combining the identification of peptide mutations with relative, label-free, peptide quantification is discussed.
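The p-value and q-value steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the add-one correction in the empirical p-value is a common convention assumed here, and the Benjamini-Hochberg step-up adjustment (ref 24) is used as one of several q-value estimators the text allows. Higher classifier scores are assumed to indicate more confident identifications.

```python
from bisect import bisect_left


def empirical_p_values(forward_scores, reversed_scores):
    """Estimate p-values for forward-search scores from a null score set.

    The null distribution is the set of classifier scores obtained from the
    reversed database search. The +1 terms (an assumption of this sketch)
    keep every p-value strictly greater than zero.
    """
    null = sorted(reversed_scores)
    n = len(null)
    p_values = []
    for score in forward_scores:
        # Number of null scores greater than or equal to this score.
        n_ge = n - bisect_left(null, score)
        p_values.append((n_ge + 1) / (n + 1))
    return p_values


def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical classifier scores, for illustration only.
forward = [0.95, 0.80, 0.40, 0.10]
reverse = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.50, 0.85]
p = empirical_p_values(forward, reverse)
q = bh_q_values(p)
accepted = [s for s, qv in zip(forward, q) if qv <= 0.05]
```

In practice the null set would contain many thousands of reversed-database scores, so the smallest attainable p-value, 1/(n + 1), would be far smaller than in this toy example, where no identification passes a 0.05 threshold.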
Materials and Methods
Human Cerebrospinal Fluid (CSF) Samples. Clinical procedures were carried out at the California Clinical Trials Phase I Unit, Glendale, CA, under a protocol approved by the California IRB, Inc., Beverly Hills, CA institutional review board. After obtaining informed consent and subsequent screening, baseline cerebrospinal fluid (CSF) samples were collected by lumbar puncture (LP) from 16 mixed race males, aged 20-46, at 7-8 a.m. following a nutritionist-supplied evening meal and overnight inpatient stay. Subjects were fasted for approximately 12 h following dinner with water permitted until morning sampling procedures and remained nonambulatory for a minimum of 1 h prior to procedures. Fourteen days after the baseline procedures, the subjects were again admitted to the inpatient unit on the afternoon prior to sample collection and were provided with an identical evening meal, and samples were collected the next morning at the same time of day as the baseline collection (±30 min).
CSF and Serum Sample Preparation. Aliquots of CSF (50 µg protein) and rat serum (1.2 mg protein) were diluted with Montage equilibration buffer (Montage Albumin Deplete Kit, Millipore) to a volume of 300 µL. Protein G Sepharose bead suspension (20 µL of a 50% suspension) was added (Amersham Biosciences), and the mixture was rocked for 1 h at room temperature (RT). Protein G Sepharose beads were pelleted at 2000 rpm for 2 min, and 280 µL of the effluent was transferred to a pre-equilibrated Montage column (Montage Albumin Deplete Kit, Millipore). Pre-equilibration was performed according to the manufacturer's specifications. The Montage column was then centrifuged at 500g for 2 min, and the flow-through was reapplied to the column and was centrifuged again. Two consecutive 100-µL washes of Montage wash buffer (Montage Albumin Deplete Kit, Millipore) were passed over the column via 500g centrifugation for 2 min (final volume approximately 500 µL).
Samples were speed vacuumed to approximately 30-50 µL. Volumes were monitored during speed vacuuming to prevent evaporation to dryness. Albumin- and immunoglobulin-depleted CSF or serum was spiked with chicken lysozyme (approximately 1% of the initial total protein concentration) and was mixed with 40 µL of 8 M urea in 100 mM ammonium carbonate pH 11.0. Then, 100 µL of reduction/alkylation cocktail (2% iodoethanol and 0.5% triethylphosphine in 97.5% acetonitrile) was added. The solutions were capped and incubated for 1 h at 37 °C after which the solutions were speed vacuumed to dryness, at least 3 h. The pellet was then redissolved in 200 µL of a trypsin solution (Worthington TPCK treated, in 100 mM ammonium bicarbonate pH 8.0) to produce a 1.6 M urea solution and an enzyme:substrate ratio of 1:50 (w/w), assuming a 50% reduction in protein concentration after IgG/albumin depletion. The digestion was carried out at 37 °C overnight. Typically, 100 µL of this digest was injected onto the mass spectrometer.
LTQ-FT Mass Spectrometer Conditions. Tryptic digests (10-20 µL) were injected onto a Zorbax SB300 1 × 50 mm C-18 reversed-phase column (Agilent) at a flow rate of 50 µL/min on a Surveyor high-performance liquid chromatography (HPLC) system (ThermoFinnigan). The gradient conditions were 10-95% B (90-5% A) over 120 min, followed by a 0.1-min ramp to 100% C, followed by 5 min at 100% C, followed by a 0.1-min ramp to 10% B (90% A), and hold for 17 min at 10% B (90% A), where A was 0.1% formic acid in water, B was 50% acetonitrile 0.1% formic acid in water, and C was 80% acetonitrile 0.1% formic acid in water. The effluent was diverted to waste for the first 5 min to keep the source clean. The total column effluent was connected to the electrospray interface of an LTQ-FT ion trap mass spectrometer (ThermoFinnigan). The source was operated in positive ion mode with 4.8 kV electrospray potential, a sheath gas flow of 20 arbitrary units, and a capillary temperature of 225 °C. The source lenses were set by maximizing the ion current for the 2+ charge state of angiotensin. Data were collected with a triple-play method using the following parameters: ion trap centroid parent scan was set to 1 microscan and 50-ms maximum injection time, FT centroid zoom scan was set to 3 microscans and 500-ms maximum injection time, and an ion trap centroid MS/MS scan was set to 1 microscan and 500-ms maximum injection time. Dynamic exclusion settings were set to a repeat count of one, exclusion list duration of 2 min, and rejection widths of -0.75 m/z and +1.5 m/z.
Collisional activation was carried out at a relative collision energy of 35% and an exclusion width of 3 m/z.
DNA Genotyping. DNA genotyping of 16 X001 clinical trial patients and 47 normal individuals (Coriell Repository, Camden, NJ) for the R335G SNP (rs3206824) was done to verify a single amino acid polymorphism identified by MS/MS searching with X! Tandem. The clinical patient DNA was prepared and sent directly from Covance, Inc. (Princeton, NJ). A PCR reaction volume of 25 µL was composed of 0.2 µM primers (F: 5′-GCCTGGTGTATGTGTGCAAG-3′, R: 5′-CTCTTCCCCTCCCAGCAG-3′), 25 ng DNA, 0.6 u of AmpliTaq Gold Master Mix (Applied Biosystems, Foster City, CA), 4% DMSO, and 0.13 u of PfuTurbo Hotstart Polymerase (Stratagene, La Jolla, CA). The reactions were amplified (PTC-225 programmable thermocycler; MJ Research, Inc., Waltham, MA) using the following PCR cycling conditions: 94 °C for 10 min (denaturation), (1) 94 °C for 30 s, 67 °C for 30 s with -0.5 °C per cycle, 72 °C for 1 min and repeated 15× (extension), (2) 94 °C for 30 s, 60 °C for 30 s, and 72 °C for 1 min, repeated 30× (amplification), and (3) 72 °C for 5 min (final extension). PCR products were immediately frozen and sent to Agencourt Bioscience Corporation (Beverly, MA) for sequencing with the same primers used for amplification. DNA sequence chromatogram data were analyzed using the PolyPhredPhrap software program to identify and genotype the R335G polymorphism.
Preprocessing and MS/MS Filtering. The first step in our proposed framework is filtering. The results presented here were obtained by eliminating triple-play scan events with a zoom scan quality index 0.08.
Table 3. Number of Peptides Correctly and Incorrectly Identified by a Sequest Rule-Based (SRB) Method Using a Trypsin-Specific Search (Top), a No Enzyme Specificity Search (Bottom), and the Proposed Model-Based Approach (Model) Which Controls the False Discovery Rate (a)
SRB entries are listed as top value / bottom value.

sample   SRB incorrectly identified   SRB correctly identified   SRB PPV     model incorrectly identified   model correctly identified   model PPV
1        52 / 43                      542 / 497                  91% / 92%   51                             709                          93%
2        55 / 43                      538 / 495                  91% / 92%   52                             727                          93%
3        63 / 43                      616 / 569                  91% / 93%   55                             774                          93%
4        61 / 45                      610 / 573                  91% / 93%   48                             784                          94%
5        51 / 37                      578 / 534                  92% / 94%   50                             761                          94%
6        52 / 37                      572 / 531                  92% / 93%   37                             755                          95%
7        63 / 37                      601 / 560                  91% / 94%   39                             791                          95%
8        64 / 46                      605 / 574                  90% / 93%   38                             756                          95%
9        56 / 38                      640 / 596                  92% / 94%   42                             793                          95%
10       54 / 41                      584 / 537                  92% / 93%   50                             746                          94%

(a) Positive predictive value rates (PPV) are estimated by the number of correctly identified peptides divided by the total number of peptides identified by each method. The model-based approach results are reported using a q-value threshold of 0.10. Correct identification is defined by |∆[M + H]+| < 6 ppm and |∆ACN| < 10%.
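As a check on the PPV definition given in the Table 3 footnote, the short sketch below recomputes the reported percentages for sample 1 from its counts. The function name `ppv` is hypothetical, and the pairing of the first SRB value with the trypsin-specific search and the second with the no enzyme specificity search is an assumption based on the top/bottom order in the table caption.

```python
def ppv(correct, incorrect):
    """Positive predictive value: correct / (correct + incorrect)."""
    return correct / (correct + incorrect)


# Sample 1 counts from Table 3 as (incorrectly identified, correctly identified).
# The assignment of SRB values to the two searches is an assumption (see lead-in).
sample_1 = {
    "SRB, trypsin-specific": (52, 542),
    "SRB, no enzyme": (43, 497),
    "model, q <= 0.10": (51, 709),
}

ppv_percent = {
    name: round(100 * ppv(correct, incorrect))
    for name, (incorrect, correct) in sample_1.items()
}
for name, value in ppv_percent.items():
    print(f"{name}: {value}%")
```

The recomputed values agree with the tabulated 91%, 92%, and 93% for this sample.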
presumably correct identification for this peptide required a nontryptic search and inclusion of more than the top scoring Sequest match in some samples. Using the described approach, we have not observed a significant increase in false positive identifications by including multiple matches from Sequest. This specific example illustrates the inherent challenges associated with peptide identification from shotgun proteomics experiments, the value of accurate mass information provided by FT-ICR instruments, and the value of a priori biological knowledge regarding the likelihood of detecting certain low-level proteins directly from sera.
Identifying Single Amino Acid Polymorphisms. A feature of the proposed peptide identification method is the framework for combining the results from multiple MS/MS search engines, including tools that can identify peptide polymorphisms. For example, a single amino acid polymorphism of the DKK3 protein, R335G SNP (rs3206824), was identified by X! Tandem using the proposed approach in an analysis of cerebrospinal fluid from 16 healthy volunteers. The DKK3 genotype of this single nucleotide polymorphism (SNP) for each subject could be inferred by an examination of the peptide ion current specific to the Arg335 and Gly335 containing peptides (Figure 5). DNA genotyping of this SNP resulted in 100% agreement with the inferred genotype on the basis of the observed peptides in the cerebrospinal fluid. While in some cases it is possible to infer genotype from proteotype, it is important to confirm proteotyping results as several factors can result in discrepancies when comparing proteomic and genomic data. Factors that can result in such discrepancies include structural changes in the DNA sequence that cause altered protein expression and epigenetics.33-38 The identification of peptide mutations using tools like X! Tandem in the proposed approach coupled with label-free relative quantification enables the testing of relative peptide levels that may be modulated by epigenetic, genetic plus epigenetic, or other factors associated with a specific disease state and thus could serve as another source of biomarkers.
Figure 3. Estimated proportion of incorrect IDs (|∆[M + H]+| > 6 ppm or |∆ACN| > 10%) vs estimated q-value for 10 replicate injections of a rat serum digest (a). Proportion of total correct identifications (|∆[M + H]+| ≤ 6 ppm and |∆ACN| ≤ 10%) included vs estimated q-value (b).
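The calibration assessment summarized in Figure 3 compares each q-value threshold with the proportion of accepted identifications that fail the accurate-mass and retention criteria. A minimal sketch of that comparison is given below. The record layout, function names, and example values are all hypothetical, and ∆ACN is assumed here to be expressed in percent; only the 6 ppm and 10% tolerances come from the text.

```python
def mass_error_ppm(measured_mh, theoretical_mh):
    """Signed [M + H]+ mass deviation in parts per million."""
    return (measured_mh - theoretical_mh) / theoretical_mh * 1e6


def is_correct(ppm_error, acn_error_pct, ppm_tol=6.0, acn_tol=10.0):
    """Apply the validation criteria: |delta mass| <= 6 ppm and |delta ACN| <= 10%."""
    return abs(ppm_error) <= ppm_tol and abs(acn_error_pct) <= acn_tol


def observed_incorrect_fraction(identifications, q_threshold):
    """Fraction of identifications with q <= threshold that fail validation."""
    accepted = [r for r in identifications if r["q"] <= q_threshold]
    if not accepted:
        return None  # nothing accepted at this threshold
    n_bad = sum(not is_correct(r["ppm"], r["acn"]) for r in accepted)
    return n_bad / len(accepted)


# Hypothetical identifications, for illustration only.
ids = [
    {"q": 0.01, "ppm": 1.2, "acn": 2.0},
    {"q": 0.02, "ppm": -3.5, "acn": -4.0},
    {"q": 0.05, "ppm": 0.8, "acn": 12.0},   # fails the retention criterion
    {"q": 0.08, "ppm": 25.0, "acn": 1.0},   # fails the accurate-mass criterion
    {"q": 0.20, "ppm": 2.0, "acn": 3.0},
]

for threshold in (0.02, 0.10):
    fraction = observed_incorrect_fraction(ids, threshold)
    print(f"q <= {threshold:.2f}: observed incorrect fraction = {fraction:.2f}")
```

For a well-calibrated procedure applied to a realistically large data set, the observed fraction at each threshold would be expected to track the threshold itself; the five made-up records here are far too few to show that behavior and serve only to demonstrate the bookkeeping.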
Conclusions
A straightforward modular approach for combining information generated by multiple peptide search engines and assessing their statistical significance has been described. Advantages of this reported approach include the flexibility to leverage multiple search engines, to add or replace search engines without learning a new set of rules for acceptable identifications, to incorporate orthogonal identification information (e.g., predicted vs actual elution time), to generate outputs that are calibrated and interpretable probabilities (p and q values) rather than abstract scoring parameters, and to control false positive identifications explicitly at a specified false discovery rate. The proposed method builds upon previously reported approaches to peptide identification that use reversed database results as references and machine learning approaches to combine multiple parameters. The proposed method is most similar to PeptideProphet21 but with several important distinctions: (1) multiple peptide search engines are used, (2) additional peptide identification information (e.g., |∆[M + H]+|, charge state, etc.) is used in the scoring model, (3) a more flexible modeling method (RandomForest vs linear discriminant analysis) is used to model correct versus incorrect identifications, and (4) a simple permutation test-like p-value is estimated from the reversed database search results and is used to estimate a false discovery rate for each identification. We have used the distribution of classifier scores from the reversed database search as a null distribution to estimate p-values in a manner similar to a permutation test. With p-values estimated, we then estimate the false discovery rate (q-value) to account for the multiple testing occurring in these high throughput experiments. As a result of this general approach, investigators have an interpretable probability, a q-value, which can be used to rank order and threshold a set of peptide identifications. The approach described here is modular in nature and can be adapted by different laboratories using different search engines and different peptide identification classifiers. This approach also provides a mechanism to report identification statistics, p-values and q-values, which could conceivably be compared across different laboratories, search engines and parameters, and experimental conditions. While mapping peptide identifications to protein identifications and estimating the statistical significance of protein identifications is out of scope for this report, we anticipate that the p- and q-values derived from the proposed procedure could be used in a similar manner to that previously described by Nesvizhskii and co-workers.39,40 Using accurate mass deviations from an FT-ICR measurement to assess the validity of peptide identifications, we have shown that the q-values estimated from this procedure appear to be well calibrated and hence may be used to estimate the degree of false positive identifications in a set of reported peptides. A comparison of peptide identification results between the proposed method and a Sequest rule-based (SRB) method indicated that while the rule-based method was designed on a specific data set to yield a positive predictive value of 99%, we observed positive predictive values of approximately 93% for a series of rat serum digests. Using the proposed method with a q-value cutoff of 0.10, we estimated a positive predictive value of 94% with 170-210 additional peptides identified relative to the SRB method. We have demonstrated the feasibility of identifying peptide mutations by incorporating tools designed for their identification (e.g., X! Tandem) as well as the validation of a specific example (DKK3) via genotyping. While epigenetic and other factors are potential causes for errors in a genotype inferred by proteotyping, the ability to not only identify but quantify these peptides could be a source of biomarkers that reflect not only the genetic but also the epigenetic effects on relative levels of peptides.
Figure 4. Incorrect match to FGF-11 (IPI00201918.1) peptide [.M*AALASSLIR.Q (a) and correct match to α-1 Inhibitor III (IPI00201262.1) peptide Y.AFALAGNQEK.R (b) identified from the same MS/MS spectrum. Both matches are plausible on the basis of the MS/MS y- and b-ion matches but the FGF-11 match is likely incorrect on the basis of a priori biological knowledge of the samples and the accurate mass difference. * = oxidized methionine.
Figure 5. Extracted ion chromatograms for the Arg335 DKK3 peptide SLTEEMALREPAAAAAALLGGEEI (top panels) and the Gly335 peptide SLTEEMALGEPAAAAAALLGGEEI (bottom panels). Peptide chromatograms from a homozygous Arg335/Arg335 subject are shown in panels (a) and (d), from a heterozygous Gly335/Arg335 subject in panels (b) and (e), and from a homozygous Gly335/Gly335 subject in panels (c) and (f). Ion current areas for these peptides can be used to infer genotype as well as relative expression ratios of the different alleles.
Acknowledgment. We thank Drs. Jude Onyia, Gary Sullivan, and W. Scott Clark for supporting us in the development of these methods. We are also grateful to Dr. Jimmy Eng for supplying the MS/MS viewer used in our research and to Dr. Olga Vitek for stimulating discussions on statistical approaches to peptide identification.
Supporting Information Available: A PDF file containing the search parameters used for Sequest and X! Tandem. This material is available free via the Internet at http://pubs.acs.org.
References
(1) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(2) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(3) Craig, R.; Beavis, R. C. Rapid Commun. Mass Spectrom. 2003, 17, 2310-2316.
(4) Craig, R.; Beavis, R. C. Bioinformatics 2004, 20, 1466-1467.
(5) Wang, Y.; Zhang, J.; Gu, X.; Zhang, X. M. J. Chromatogr., B 2005, 826, 122-128.
(6) Washburn, M. P.; Wolters, D.; Yates, J. R. Nat. Biotechnol. 2001, 19, 242-247.
(7) Florens, L.; Washburn, M. P.; Raine, J. D.; Anthony, R. M.; Grainger, M.; Haynes, J. D.; Moch, J. K.; Muster, N.; Sacci, J. B.; Tabb, D. L.; Witney, A. A.; Wolters, D.; Wu, Y.; Gardner, M. J.; Holder, A. A.; Sinden, R. E.; Yates, J. R.; Carucci, D. J. Nature 2002, 419, 520-526.
(8) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. J. Proteome Res. 2003, 2, 43-50.
(9) Elias, J. E.; Haas, W.; Faherty, B. K.; Gygi, S. P. Nat. Methods 2005, 2, 667-675.
(10) Eriksson, J.; Fenyo, D. J. Proteome Res. 2004, 3, 979-982.
(11) Qian, W. J.; Liu, T.; Monroe, M. E.; Strittmatter, E. F.; Jacobs, J. M.; Kangas, L. J.; Petritis, K.; Camp, D. G., II; Smith, R. D. J. Proteome Res. 2005, 4, 53-62.
(12) Bradshaw, R. A.; Burlingame, A. L.; Carr, S.; Aebersold, R. Mol. Cell. Proteomics 2006, 5, 787-788.
(13) Weatherly, D. B.; Atwood, J. A.; Minning, T. A.; Cavola, C.; Tarleton, R. L.; Orlando, R. Mol. Cell. Proteomics 2005, 4, 762-772.
(14) Havilio, M.; Haddad, Y.; Smilansky, Z. Anal. Chem. 2003, 75, 435-444.
(15) Bafna, V.; Edwards, N. Bioinformatics 2001, 17, S13-S21.
(16) Zhang, N.; Aebersold, R.; Schwikowski, B. Proteomics 2002, 2, 1406-1412.
(17) Eriksson, J.; Fenyo, D. J. Proteome Res. 2004, 3, 32-36.
(18) Cannon, W. R.; Jarman, K. H.; Webb-Robertson, B. J.; Baxter, D. J.; Oehmen, C. S.; Jarman, K. D.; Heredia-Langner, A.; Auberry, K. J.; Anderson, G. A. J. Proteome Res. 2005, 4, 1687-1698.
(19) Fridman, T.; Razumovskaya, J.; Verberkmoes, N.; Hurst, G.; Protopopescu, V.; Xu, Y. J. Bioinf. Comput. Biol. 2005, 3, 455-476.
(20) Anderson, D. C.; Li, W.; Payan, D. G.; Noble, W. S. J. Proteome Res. 2003, 2, 137-146.
(21) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal. Chem. 2002, 74, 5383-5392.
(22) Baczek, T.; Bucinski, A.; Ivanov, A. R.; Kaliszan, R. Anal. Chem. 2004, 76, 1726-1732.
(23) Ulintz, P. J.; Zhu, J.; Qin, Z. S.; Andrews, P. C. Mol. Cell. Proteomics 2006, 5, 497-509.
(24) Benjamini, Y.; Hochberg, Y. J. R. Stat. Soc. B 1995, 57, 289-300.
(25) Storey, J. D. Ann. Statist. 2003, 31, 2013-2035.
(26) Efron, B.; Tibshirani, R.; Storey, J.; Tusher, V. J. J. Am. Stat. Assoc. 2001, 96, 1151-1160.
(27) Higgs, R. E.; Knierman, M. D.; Gelfanova, V.; Butler, J. P.; Hale, J. E. J. Proteome Res. 2005, 4, 1442-1450.
(28) Bern, M.; Goldberg, D.; McDonald, W. H.; Yates, J. R. Bioinformatics 2004, 20, i49-i54.
(29) Strittmatter, E. F.; Kangas, L. J.; Petritis, K.; Mottaz, H. M.; Anderson, G. A.; Shen, Y.; Jacobs, J. M.; Camp, D. G., II; Smith, R. D. J. Proteome Res. 2004, 3, 760-769.
(30) Norbeck, A. D.; Monroe, M. E.; Adkins, J. N.; Anderson, K. K.; Daly, D. S.; Smith, R. D. J. Am. Soc. Mass Spectrom. 2005, 16, 1239-1249.
(31) R Development Core Team, R: A language and environment for statistical computing, R Foundation for Statistical Computing. http://www.R-project.org (accessed 2004).
(32) Breiman, L. Mach. Learning 2001, 45, 5-32.
(33) Feinberg, A. P.; Tycko, B. Nat. Rev. Cancer 2004, 4, 143-153.
(34) Rocco, J. W.; Sidransky, D. Exp. Cell Res. 2001, 264, 42-55.
(35) Claus, R.; Lubbert, M. Oncogene 2003, 22, 6489-6496.
(36) Yan, H.; Yuan, W.; Velculescu, V. E.; Vogelstein, B.; Kinzler, K. W. Science 2002, 297, 1143.
(37) Shahbazian, M. D.; Zoghbi, H. Y. Am. J. Hum. Genet. 2002, 71, 1259-1272.
(38) Sharma, R. P. Schizophr. Res. 2005, 72, 79-90.
(39) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. Anal. Chem. 2003, 75, 4646-4658.
(40) Nesvizhskii, A. I.; Aebersold, R. Mol. Cell. Proteomics 2005, 4, 1419-1440.
PR0605320