Improving Sensitivity by Probabilistically Combining Results from Multiple MS/MS Search Methodologies
Brian C. Searle,*,† Mark Turner,† and Alexey I. Nesvizhskii§
Proteome Software Inc., 1340 S.W. Bertha Boulevard, Suite 201, Portland, Oregon 97219-2039, and Department of Pathology and Center for Computational Medicine and Biology, University of Michigan, 1301 Catherine Road, Ann Arbor, Michigan 48109-0602
Received August 17, 2007
Database-searching programs generally identify only a fraction of the spectra acquired in a standard LC/MS/MS study of digested proteins. Subtle variations in database-searching algorithms for assigning peptides to MS/MS spectra have been known to provide different identification results. To leverage this variation, a probabilistic framework is developed for combining the results of multiple search engines. The scores for each search engine are first independently converted into peptide probabilities. These probabilities can then be readily combined across search engines using Bayesian rules and the expectation maximization learning algorithm. A significant gain in the number of peptides identified with high confidence with each additional search engine is demonstrated using several data sets of increasing complexity, from a control protein mixture to a human plasma sample, searched using the SEQUEST, Mascot, and X! Tandem database-searching programs. The increased rate of peptide assignments also translates into a substantially larger number of protein identifications in LC/MS/MS studies compared to a typical analysis using a single database-search tool.
Keywords: Proteomics • mass spectrometry • peptide identification • protein identification • bioinformatics • database searching • SEQUEST • Mascot • X! Tandem • probability
* Author to whom correspondence should be addressed [e-mail [email protected]; telephone (503) 244-6027; fax (503) 245-4910].
† Proteome Software Inc.
§ University of Michigan.
10.1021/pr070540w CCC: $40.75 © 2008 American Chemical Society

Introduction
Mass spectrometry has quickly become the method of choice for high-throughput identification of proteins in complex biological samples. The most commonly used strategy, shotgun proteomics, involves enzymatic digestion of sample proteins into shorter peptides, followed by peptide sequencing using tandem mass spectrometry (MS/MS).1 The resulting peptides commonly fragment in such a way that their amino acid sequences can be determined from the acquired MS/MS spectra, and a number of computational tools have been developed to automate this process. In a typical large-scale study, peptides are assigned to spectra by sequence database searching using one of the commercial (e.g., SEQUEST, Mascot) or open source (e.g., X! Tandem) database-searching algorithms.2–8 These search engines operate in a similar way: they pick candidate peptide sequences from the searched protein sequence database and generate theoretical spectra on the basis of their most likely fragmentation in a mass spectrometer. Each experimental spectrum is compared to these theoretical spectra, and the best scoring match is selected as the most likely peptide sequence to explain the observed spectrum. The sequences of the identified peptides from all spectra in the experiment are then used to infer the proteins present in the original sample.9
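The database-search procedure outlined in the Introduction (select candidate peptides, generate theoretical spectra, score each against the experimental spectrum, keep the best match) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the text: candidates are filtered by precursor mass, only singly charged b- and y-ions are generated, and the score is a simple shared-peak count. The function names (`peptide_mass`, `theoretical_fragments`, `shared_peak_count`, `search`) and the tolerances are hypothetical. This is not the scoring function of SEQUEST, Mascot, or X! Tandem, each of which uses its own, more elaborate scheme.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def peptide_mass(seq):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in seq) + WATER


def theoretical_fragments(seq):
    """Singly charged b- and y-ion m/z values (illustrative model only)."""
    masses = [RESIDUE_MASS[aa] for aa in seq]
    ions = []
    for i in range(1, len(seq)):
        ions.append(sum(masses[:i]) + PROTON)          # b ion: N-terminal prefix
        ions.append(sum(masses[i:]) + WATER + PROTON)  # y ion: C-terminal suffix
    return ions


def shared_peak_count(observed, theoretical, tol=0.5):
    """Number of theoretical ions with an observed peak within tol (m/z)."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)


def search(observed, precursor_mass, candidates,
           precursor_tol=1.0, fragment_tol=0.5):
    """Return (best_sequence, score), or None if no candidate fits the mass."""
    best = None
    for seq in candidates:
        # Keep only candidates whose mass matches the precursor.
        if abs(peptide_mass(seq) - precursor_mass) > precursor_tol:
            continue
        score = shared_peak_count(
            observed, theoretical_fragments(seq), fragment_tol)
        if best is None or score > best[1]:
            best = (seq, score)
    return best


if __name__ == "__main__":
    # Toy "experimental" spectrum: the ideal fragments of PEPTIDE.
    spectrum = theoretical_fragments("PEPTIDE")
    database = ["PEPTIDA", "EPPTIDE", "PEPTIDE", "GGGG"]
    print(search(spectrum, peptide_mass("PEPTIDE"), database))
```

In this toy example the isobaric candidate EPPTIDE passes the precursor filter but matches fewer fragment ions than PEPTIDE, so the latter is reported as the top hit. Real search engines differ in how candidates are chosen and scored, which is one reason they can return different identifications for the same spectrum.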
It has been recognized, however, that computational analysis of MS/MS spectra represents a significant challenge.10 In a typical analysis, only a fraction (typically 50% peptide probability from the complementary search engine) or by 0.5 in the case of low search engine agreement (