ARTICLE pubs.acs.org/jpr
Probabilistic Consensus Scoring Improves Tandem Mass Spectrometry Peptide Identification
Sven Nahnsen,*,† Andreas Bertsch,† Jörg Rahnenführer,‡ Alfred Nordheim,§ and Oliver Kohlbacher†
† Center for Bioinformatics, Eberhard Karls University, 72076 Tübingen, Germany
‡ Department of Statistics, Dortmund University of Technology, 44221 Dortmund, Germany
§ Department of Molecular Biology, Eberhard Karls University, 72076 Tübingen, Germany
Supporting Information

ABSTRACT: Database search is a standard technique for identifying peptides from their tandem mass spectra. To increase the number of correctly identified peptides, we suggest a probabilistic framework that allows the combination of scores from different search engines into a joint consensus score. Central to the approach is a novel method to estimate scores for peptides not found by an individual search engine. This approach allows the estimation of p-values for each candidate peptide and their combination across all search engines. The consensus approach works better than any single search engine across all different instrument types considered in this study. Improvements vary strongly from platform to platform and from search engine to search engine. Compared to the industry standard MASCOT, our approach can identify up to 60% more peptides. The software for consensus predictions is implemented in C++ as part of OpenMS, a software framework for mass spectrometry. The source code is available in the current development version of OpenMS and can easily be used as a command line application or via a graphical pipeline designer TOPPAS.

KEYWORDS: peptide identification, multiple search engines, sequence similarity, database search, OpenMS
Received: February 24, 2010. Published: June 06, 2011. © 2011 American Chemical Society.

■ INTRODUCTION
Over the past decade, the importance of mass spectrometry (MS)-based proteomics has grown immensely as a method for identifying and quantifying proteins. Recent technological advances form the basis for highly sensitive characterizations of complex protein mixtures, such as whole cell lysates. New mass spectrometers enable data acquisitions with high dynamic range and allow high-throughput analyses.1 Multidimensional chromatographic peptide separation followed by mass spectrometry, usually referred to as shotgun proteomics, is the method of choice in large-scale proteomics studies.2 This continuously growing research area was made possible by the availability of sequence databases.3 Apart from the experimental efforts made all over the world, bioinformatics is becoming more and more important in supporting proteomic research. Current research achievements in this field are far-ranging and include approaches for peptide and protein identification based on de novo4,5 and database retrieval algorithms,6–12 data sharing in repositories,13,14 and downstream analysis tools.15

MS-based peptide identification is one of the computationally most intensive steps in the whole workflow. Peptide identifications are usually inferred from matching observed and theoretically calculated spectra.16 This matching is performed by various tools that have been made available in recent years. The most frequently used search engines are the commercial tools Mascot,9 Sequest,8 and Phenyx.7 Popular noncommercial solutions include X!Tandem,10 OMSSA,11 and InsPect,12 among many others. The methods used by these algorithms are diverse and have recently been reviewed.17

Typically, all database search algorithms for spectral assignments produce a list of peptides that are ranked according to their scores. Due to shortcomings of the scoring, the first sequence in the list does not necessarily correspond to the correct identification. In fact, there are many cases where the correct sequence is not even contained in the result list of the search engine. In addition, there is a high variation in search results from different search engines.18

Scores produced by search engines are often difficult to interpret and to compare. There have been several approaches for converting the search engine scores into more easily interpretable numbers, for example, probabilities. Keller et al.19 offered one of the first statistical approaches for converting Sequest scores into probabilities. Their algorithm is based on maximum likelihood estimation of empirically assumed probability distributions and an Expectation Maximization (EM)20 framework to assess optimal parameters for the mixture model deconvolution. The PeptideProphet method has been widely accepted for statistical assessment and has recently been extended to support not only Sequest but also Mascot and X!Tandem results.21 Complementary approaches to EM-based mixture modeling include target-decoy database searches.16 Peptides are searched against a concatenated database consisting of the usual forward
database and a reversed (randomized or shuffled) version of the original database. Essentially, the null hypothesis is the same as in the mixture model approach. All peptide scores that fall into the distribution described by the decoy results or into the first component in the mixture model, respectively, are assigned to the null hypothesis, that is, random chance identification.

Despite all the effort put into the conversion of opaque search engine scores into probabilities, or the assignment of false discovery rates (FDR),22 the advantages gained by using different search engines remain largely unused. According to Kapp et al.,18 only one-third of all peptides in an experiment are identified by all engines. Simple consensus identification by voting can enhance the reliability of the identification, but at the cost of a lower number of identified peptides. Combining the results of search engines becomes difficult, however, if peptides are not scored by all search engines, that is, if the candidate sequence is not reported by all search engines.

There have been several approaches describing methods for the combination of different search engine results. In 2008, Searle et al.23 suggested a combination method based on Mascot, X!Tandem, and Sequest scores. This approach is implemented in the commercial software Scaffold (http://www.proteomesoftware.com/). Peptide probabilities are individually estimated for the search engines. To combine the results from the different search engines, an agreement score is used to account for differences in the significance of the individual scores if the same peptide was assigned by several engines. Another search engine combiner is the PepArML tool.24 PepArML uses machine learning approaches to account for both the statistical significance and the combination of the scores from the different engines. This approach relies on an iterative process of a random forest learning method.

Journal of Proteome Research | dx.doi.org/10.1021/pr2002879 | J. Proteome Res. 2011, 10, 3332–3343
Another machine learning based tool is iProphet,25 which is integrated in the Trans-Proteomic Pipeline (TPP) (http://tools.proteomecenter.org). PeptideProphet19 forms the basis of iProphet for estimating peptide probabilities. iProphet performs additional EM estimation to derive a common probability for multiple search runs. Common to most combination approaches is the analysis of the results from the same search conducted in parallel using different search engines. The combination of the search results is usually done under the assumption that search engine agreement increases the likelihood of correctness.26

Another approach that has recently gained attention in the postprocessing of tandem MS search results is the "multipass analysis".26 Multipass methods combine multiple searches conducted with one engine. The subsequent searches use information from previous searches and adjust parameters for more sensitive peptide identifications. A commonly used multipass strategy is the refinement mode of X!Tandem. The refinement function of X!Tandem automatically constructs a new database, using proteins that have already been identified in a primary run. This database is then searched with an increased number of potential modifications, missed cleavages, and even polymorphisms.

Here we describe a new generic framework to integrate results from various search engines. The final consensus identification results are written in a standardized XML format and each peptide hit is annotated by a q-value indicating its statistical significance. The algorithm consists of three steps: first, we apply mixture modeling to convert search engine scores into probabilities. Second, we account for missing peptide sequences in the search engine output by estimating the corresponding peptide score. This score estimation is based on the similarity between sequences that are suggested by different engines. Applying these
steps, we obtain probability-like scores for all search engines and each peptide candidate. The third step is the combination into a joint consensus score. The calculated consensus score is a weighted average-like score integrating true and/or estimated information. The weights are either one, if the same sequences were assigned by different engines, or they correspond to the similarity, if only similar sequences were assigned by the different engines.

The performance of the method is evaluated on mixtures of known proteins that have been measured on different instruments and on a complex mixture resulting from a whole proteome digest of Escherichia coli. Besides increasing confidence in the peptide identifications, our novel consensus scoring significantly improves peptide identification rates. We show that the identification rates are consistently improved over the identification rates of the individual search engines on data sets from Orbitrap, FT Ultra, and LCQ instruments. Furthermore, we demonstrate better performance compared to other search engine combiner methods using the same database and modification settings. Our new method is generic, can easily integrate novel tandem MS search engines, and is not limited in the number of different search engines. It is well suited for application to multipass search strategies, and it allows easy integration of further information that can be used in a probability-like manner for peptide identification, such as retention time prediction and accurate mass.
■ EXPERIMENTAL PROCEDURES
The overlap between search results from different engines is rather poor and a high percentage of true peptides are not scored by all search engines. Therefore, combining different engines holds the promise to increase sensitivity and specificity in peptide identification by tandem mass spectrometry. Furthermore, we argue that spectra that have been correctly identified by search engine k, but not by the others, should at least be assigned to sequences with similarity to the correct peptide, if the spectral quality is good enough to trust the identification by engine k.

Our new strategy to make use of several search engines for peptide identification via tandem mass spectrometry relies on the similarity of peptide sequences. Peptide similarity scoring is applied in cases of sequences missing in result lists of search engines. The quality of tandem mass spectra varies, and often a single or a few missing peaks can lead to the loss of a candidate. Wrong peptide assignments are also often due to bad spectral quality. Spectral quality is related to the similarity of the corresponding peptide sequences. Partial (prefix and/or suffix) sequence identity leads to partial identity of the spectra (partial matches of the b/y ion series). Sequence similarity thus implies spectral similarity and integrates additional evidence if only partially correct sequences are suggested by any search engine. Note that the likelihood for partial correctness increases if spectra are searched against large target/decoy databases.

More formally, if a sequence s that has been assigned to a spectrum by at least one search engine is not assigned by search engine k, the sequence with the highest similarity to s from the list of candidates suggested by engine k is determined. Global pairwise sequence alignment (Needleman–Wunsch) is used to determine the similarity, and the score of the most similar sequence is used as a substitute.
For the final consensus score, these estimated scores will be weighted according to the similarity of their corresponding peptide sequences. The influence of an estimated score thus scales with the sequence similarity. This method allows assigning scores to each peptide sequence per search engine and ultimately combining these scores into a consensus score. For the sequence alignment, gap opening and extension are stringently penalized.

In summary, we implemented a mixture modeling approach in our own freely available software framework OpenMS. Scores for peptide sequences not appearing in engine m, but in engine n, are imputed for engine m by peptide similarity scoring. We combine the search results by calculating the similarity-weighted average score for each peptide.

Figure 1. Three search engines assign peptide sequences to a given spectrum. This spectrum is taken from the E. coli data set (measured on an LTQ-Orbitrap) and corresponds to a doubly charged ion with RT: 2233.105 and MZ: 695.3646. Mascot and X!Tandem suggest the same peptide sequence as their top hit. The consensus score is calculated by the weighted combination of the real and estimated scores for the given sequences. The formula on the right side shows the way consensus scores are calculated; α and β correspond to the similarity of a given sequence from one engine with the most similar sequence from another engine. r is the rank of the peptides according to their score.
Mixture Modeling of Score Distributions

For each search engine, we consider n spectra. The scores from engine k, $x^k = (x^k_1, \ldots, x^k_n)$, can be modeled as n independent and identically distributed (i.i.d.) random variables. The distribution of these scores is modeled by a two-component mixture model with the function f given by

$$f(x; \Theta_1, \Theta_2) = \pi f_1(x, \Theta_1) + (1 - \pi) f_2(x, \Theta_2) \tag{1}$$

where π corresponds to the prior probability of the scores being incorrect. Here, incorrect means that the spectrum is assigned to a wrong peptide sequence. The functions f1 and f2 are the densities for incorrectly and correctly assigned sequences, respectively. The parameters Θ1 and Θ2 are used to specify the exact shape of the densities. The function f1 is modeled as a density of a Gumbel distribution. The use of extreme value distributions as a model for the function f1 has been introduced as a generic method for the statistical assessment of peptide–spectrum matching scores27 and successfully applied to X!Tandem and Mascot searches.21,23 An extreme value distribution is a natural candidate for modeling maximized scores from incorrectly assigned sequences. The function f2 is modeled as a Gaussian density.

To perform the estimation of the parameters, an EM framework was implemented. The Expectation step (E-step) comprises the estimation of posterior probabilities, as formalized in eq 2, using initial guesses for $\hat{\pi}$, $\hat{\Theta}_1$ and $\hat{\Theta}_2$. This step is followed by the Maximization step (M-step), where the estimated posterior probabilities are used to refit the distributions $f_i$. With this iteration, the log-likelihood function (eq 3) is maximized, and the algorithm terminates if there is no further improvement of the log-likelihood function.

$$\hat{p}_i(x) = \frac{\hat{\pi}_i f_i(x, \hat{\Theta}_i)}{\hat{\pi}_1 f_1(x, \hat{\Theta}_1) + \hat{\pi}_2 f_2(x, \hat{\Theta}_2)}, \quad i \in \{1, 2\} \tag{2}$$

$$\log L = \sum_{i=1}^{n} \log\left(\hat{\pi}_1 f_1(x_i; \hat{\Theta}_1) + (1 - \hat{\pi}_1) f_2(x_i; \hat{\Theta}_2)\right) \tag{3}$$
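The E-step of eq 2 and the log-likelihood of eq 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the OpenMS implementation: the Gumbel density is taken in its maximum-type form, the function and parameter names (`gumbel_pdf`, `gaussian_pdf`, `e_step`, `pi_incorrect`) are hypothetical, and the M-step that refits the two densities is omitted.

```python
import math


def gumbel_pdf(x, alpha, beta):
    """Gumbel (maximum-type) density with location alpha and scale beta (assumed form of f1)."""
    z = (x - alpha) / beta
    return math.exp(-(z + math.exp(-z))) / beta


def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma (f2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def e_step(scores, pi_incorrect, alpha, beta, mu, sigma):
    """Return (posterior probabilities of being incorrect, log-likelihood).

    Implements eq 2 for component i = 1 and eq 3 for fixed parameter guesses.
    Scores far outside both components could underflow to a zero mixture
    density; this sketch does not guard against that case.
    """
    posteriors = []
    log_likelihood = 0.0
    for x in scores:
        incorrect = pi_incorrect * gumbel_pdf(x, alpha, beta)
        correct = (1.0 - pi_incorrect) * gaussian_pdf(x, mu, sigma)
        mixture = incorrect + correct
        posteriors.append(incorrect / mixture)
        log_likelihood += math.log(mixture)
    return posteriors, log_likelihood


# Hypothetical discriminant scores (e.g. -log10 of e-values) and parameter guesses.
example_scores = [0.5, 1.0, 6.0, 7.0]
example_posteriors, example_log_l = e_step(
    example_scores, pi_incorrect=0.7, alpha=1.0, beta=0.5, mu=6.0, sigma=1.0
)
```

In a full EM loop, the posteriors would be used as weights to refit the Gumbel and Gaussian parameters, and the loop would stop once the log-likelihood no longer improves.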
Here, Θ1 is the set of all parameters for the probability distribution f1, which corresponds in our case to an extreme value distribution with the location parameter α and the scale parameter β. Θ2 includes the parameters for the f2 function (the mean μ and variance σ² for a Gaussian distribution). Initial parameters for our model are found by applying an ordinary Gaussian mixture model (two Gaussian distributions), as implemented in the flexmix function,28 to the scores for each search engine. This method allows accurate conversion of different search engine scores into probabilities. The input for the mixture modeling is given by discriminant scores from the search engine output. For all search engines, we used the negative common logarithms of the search engines' e-values as discriminant scores. Routinely, we do not calculate new models for different charge states; however, the command line tool easily allows charge state
separation. For the data sets used here, charge state separation did not reveal significant improvements.

Consensus Scoring. Different ways of calculating consensus scores for peptide candidates are evaluated. Figure 1 describes the typical workflow for generating consensus scores on the basis of three single search engines and the similarity-based consensus measures (peptide sequence similarity or spc). All engines suggested the QRESTATDILQK peptide. While X!Tandem and Mascot suggested it as their top hit, OMSSA only suggested this peptide on rank four. The Supporting Information contains all similarity matrices and a detailed description of the calculation of the consensus scores. In the following, the different combination methods will be defined. We will evaluate results based on peptide sequence similarity, shared peak count, and average scoring methods.

Peptide Sequence Similarity Scoring (SeqSim). Comparing search results from different engines, we observe that for some spectra peptide sequences occur only in a subset of the search engine results. Often, these peptides are true hits. To account for this, we implemented a method to estimate missing scores for peptides that appear in the candidate list of at least one engine but not in all. This method ensures that we have an estimated value for each peptide occurring in any of the search engines' candidate lists. Tandem MS spectra characterize peptide sequences based on their fragmentation patterns, for example, y and b ions for CID (collision-induced dissociation),29 which are used by database search engines for the assignment of peptides to spectra. Sequences that contain isobaric amino acids, such as I and L, are not distinguishable based on their tandem MS spectra. We use the similarity of peptide sequences to determine the probability that a "missing" peptide can be assigned to a spectrum.
The converted score (from mixture modeling) of the peptide showing the highest similarity is used to assign a value to the missing sequence, and the estimated score is weighted according to its similarity. As a similarity measure, we used a global alignment with either the identity matrix adapted for I/L and Q/K ambiguity or the PAM30MS substitution matrix. The Needleman–Wunsch algorithm30 is used to calculate the sequence similarity. Other PAM scoring matrices are also evaluated. Results from different scoring matrices are shown in Figure 3. The PAM30MS matrix was previously introduced and intended for cross-species proteomics.31 This matrix is based on the PAM30 matrix and modified to account for Ile/Leu and Gln/Lys ambiguities associated with the determination of peptide sequences using tandem mass spectrometry. The alignment score for two peptide sequences $p_i$ and $p_j$ is then normalized as follows:

$$\mathrm{sim}(p_i, p_j) = \max\left\{ \frac{\mathrm{score}(p_i, p_j)}{\min(\mathrm{score}(p_i, p_i), \mathrm{score}(p_j, p_j))},\; 0 \right\} \tag{4}$$

The adapted identity matrix was used for all analyses, except for the low-accuracy LTQ data. Here, the results gained by using the PAM30MS matrix were superior to those gained from using the identity matrix. In order to penalize different sequence lengths, several gap opening and gap extension penalties were evaluated. For the calculation of the consensus results for a given spectrum, all candidate peptide sequences suggested by a given engine are used. E corresponds to the set of search engines in use and $C^S_k$ corresponds to the set of all peptide candidates that are suggested by the engine k for the spectrum S. The consensus score for the
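peptide sequence $p_i$ assigned by engine e for the spectrum S is calculated as follows:

$$\mathrm{SeqSim}_e(p_i) = \frac{s_e(p_i) + \sum_{k \in E \setminus \{e\}} \hat{s}_k(p_i)}{\left(1 + \sum_{k \in E \setminus \{e\}} \dfrac{\hat{s}_k(p_i)}{s_k(p_i)}\right)^{2}}$$

with $\hat{s}_k(p_i) = s_k(p_j) \cdot \mathrm{sim}(p_i, p_j)$ and $p_j = \arg\max_{m \in C^S_k} \mathrm{sim}(p_i, p_m)$.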
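The normalized similarity of eq 4 and the score estimate $\hat{s}_k(p_i)$ can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the article: the identity matrix scores 1 for a match (treating I/L and Q/K as equal) and 0 otherwise, a single linear gap penalty of 5 stands in for the unspecified "stringent" gap opening and extension penalties, peptides are non-empty, and all function names are hypothetical.

```python
AMBIGUOUS = {"I": "L", "Q": "K"}  # map I onto L and Q onto K before comparing


def residue_score(a, b):
    """Identity scoring adapted for I/L and Q/K ambiguity (assumed 1/0 scheme)."""
    return 1.0 if AMBIGUOUS.get(a, a) == AMBIGUOUS.get(b, b) else 0.0


def alignment_score(p, q, gap=5.0):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [-gap * j for j in range(len(q) + 1)]
    for i in range(1, len(p) + 1):
        cur = [-gap * i]
        for j in range(1, len(q) + 1):
            cur.append(max(
                prev[j - 1] + residue_score(p[i - 1], q[j - 1]),  # (mis)match
                prev[j] - gap,                                    # gap in q
                cur[j - 1] - gap,                                 # gap in p
            ))
        prev = cur
    return prev[-1]


def similarity(p, q, gap=5.0):
    """Eq 4: alignment score divided by the smaller self-score, floored at zero."""
    norm = min(alignment_score(p, p, gap), alignment_score(q, q, gap))
    return max(alignment_score(p, q, gap) / norm, 0.0)


def impute_score(peptide, candidates, gap=5.0):
    """Estimate the score of a peptide missing from engine k.

    candidates maps each peptide suggested by engine k to its converted score.
    Returns (estimated score, similarity to the most similar candidate).
    """
    best = max(candidates, key=lambda c: similarity(peptide, c, gap))
    sim = similarity(peptide, best, gap)
    return candidates[best] * sim, sim


# Hypothetical candidate list of one engine that did not report PEPTIDEK.
estimate, weight = impute_score("PEPTIDEK", {"PEPTADEK": 0.8, "AAAAAAAK": 0.9})
```

With these toy inputs, the most similar candidate differs from the query in one of eight residues, so its score is scaled by a similarity of 7/8.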
The list of candidates for spectrum S consists of a ranked list of all consensus results.

Shared Peak Count (spc). As a comparable measure for peptide candidate similarity, we compared the theoretical b and y ion ladders of different sequences. In a similar way as outlined above for the sequence similarity, we calculate the ion ladder similarity. Fragment ions from two sequences are shared if they have the same mass within a given tolerance window (the tolerance window was set to 0.5 Da). The shared peak count (spc) is then calculated as follows:

$$\mathrm{spc}(p_i, p_j) = \frac{\#\,\text{shared peaks}(p_i, p_j)}{\min(\#\,\text{ions } p_i,\; \#\,\text{ions } p_j)}$$
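A minimal Python sketch of the shared peak count follows. It assumes unmodified peptides, singly charged b and y ions only, standard monoisotopic residue masses, and counts shared peaks from the side of the sequence with fewer ions so that the ratio cannot exceed one; these choices and the function names are illustrative assumptions rather than details given in the text.

```python
# Monoisotopic residue masses in Da (unmodified amino acids).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_masses(peptide):
    """Theoretical m/z values of the singly charged b and y ion ladders."""
    residues = [MONO[a] for a in peptide]
    ions = []
    for i in range(1, len(residues)):
        ions.append(sum(residues[:i]) + PROTON)          # b ion with i residues
        ions.append(sum(residues[i:]) + WATER + PROTON)  # y ion with n - i residues
    return ions


def shared_peak_count(p, q, tol=0.5):
    """Fraction of theoretical fragments shared within tol Da, normalized by the smaller ion count."""
    fewer, more = sorted((fragment_masses(p), fragment_masses(q)), key=len)
    if not fewer:
        return 0.0
    shared = sum(1 for m in fewer if any(abs(m - x) <= tol for x in more))
    return shared / len(fewer)
```

Because I and L are isobaric, `shared_peak_count("PEPTIDEK", "PEPTLDEK")` evaluates to 1.0, mirroring the I/L ambiguity handled by the adapted identity matrix.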
The final consensus score is calculated in the same way as described for the peptide sequence similarity. The spc method uses the number of overlapping theoretical fragments as a measure of similarity. Note that for $p_i = p_j$ we obtain for both similarity measures $\mathrm{sim}(p_i, p_j) = \mathrm{spc}(p_i, p_j) = 1$.

Average Scoring. The Average method calculates the average score if the same peptide is suggested by several search engines. L corresponds to the number of search engines that suggest the peptide $p_i$ as a candidate. The score $s_k(p_i)$ is zero if the engine $k \in E$ does not suggest the peptide sequence $p_i$.

$$\mathrm{Average}_e(p_i) = \frac{s_e(p_i) + \sum_{k \in E \setminus \{e\}} s_k(p_i)}{L}$$
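As a small illustration of the Average method, the following Python sketch computes the score for one peptide from per-engine candidate lists. Since missing scores count as zero and L counts only the engines that report the peptide, the formula reduces to the mean over the reporting engines. The data layout, engine labels, and function name are assumptions made for this example.

```python
def average_score(peptide, engine_scores):
    """Average the scores of all engines that suggest the peptide.

    engine_scores maps an engine name to a dict of peptide -> converted score.
    Engines that do not report the peptide contribute zero to the sum and are
    not counted in L.
    """
    reported = [scores[peptide] for scores in engine_scores.values() if peptide in scores]
    if not reported:
        raise ValueError("peptide was not suggested by any engine")
    return sum(reported) / len(reported)


# Hypothetical converted scores for one spectrum.
engines = {
    "Mascot": {"PEPTIDEK": 0.9},
    "OMSSA": {"PEPTIDEK": 0.7, "PEPTADEK": 0.4},
    "XTandem": {"PEPTADEK": 0.2},
}
```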
Data Sets
To properly assess the performance of the method, we used data sets of known protein mixtures and a complex mixture. The first data set is the ISB data set,32 which is a mixture of 18 proteins acquired on an LCQ DECA XP instrument (ThermoFinnigan, San Jose, CA). Additionally, we used two data sets from the newer ISB collection.33 To capture a variety of instruments, two high-accuracy FT instruments were included, namely the LTQ-Orbitrap (Thermo Finnigan) and an FT Ultra (Thermo Finnigan) mass spectrometer. To further evaluate the performance of the new method, we additionally included a complex data set from an E. coli lysate. This data set was generated in-house. The peptides were separated on an easyLC HPLC (Proxeon) system, coupled online to an LTQ-Orbitrap (Thermo). The peptide mixture was eluted from the column with a 224-min segmented gradient of 5% to 80% HPLC solvent B (80% ACN in 0.5% acetic acid) at a flow rate of 200 nL/min.
For the generation of peptide spectrum matches, we used Mascot (version 2.2), OMSSA (version 2.1.4), and X!Tandem (version 2008.02.01.3). X!Tandem's refinement mode was disabled. The modification settings were carbamidomethylation of cysteine as fixed and oxidation of methionine as variable modification. The peptide identification of our analysis was done in a two-step process. First, the identifications were performed using a precursor mass tolerance of 3.0 Da and a fragment mass tolerance of 0.5 Da. These settings are wide enough to also cover the needs of the low-resolution instruments and thus appear to be the most appropriate settings for an instrument-independent identification pipeline. After the first identification run, the optimal tolerance values were estimated by using the peptide identifications to calculate the distribution of the errors of the precursor and fragment masses. Precursor masses were compared to the m/z values contained in the precursor information of the tandem MS spectra. Fragment mass errors were calculated using singly charged b and y ions derived from the peptide sequence and the nearest peak within the mass tolerance (in our case 0.5 Da) in the experimental tandem mass spectra, if available. To avoid distorted error distributions, only peptide spectrum matches with an FDR (q-value)34 of 0.01 or better were used to estimate the optimal tolerance settings. Results from the primary runs are shown in Supplemental Figure 2 (Supporting Information). The final tolerances were estimated manually using the error distributions. Precursor mass correction or broader mass windows were not necessary for the data used in this study. On high-resolution instruments, relative tolerances in ppm were preferred over absolute tolerances. Except for OMSSA, the search engines allow precursor settings in ppm.
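The relative precursor error underlying this estimate can be sketched as follows. The article states that the final tolerances were chosen manually from the error distributions; the `suggest_tolerance` helper below is therefore only a hypothetical way to summarize such a distribution, and both function names are assumptions.

```python
import math


def ppm_error(observed_mz, theoretical_mz):
    """Relative mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def suggest_tolerance(errors, coverage=0.99):
    """Smallest symmetric window that contains the given fraction of absolute errors.

    Hypothetical helper: it only summarizes the error distribution of
    high-confidence matches and does not reproduce the manual choice
    described in the text.
    """
    ranked = sorted(abs(e) for e in errors)
    index = min(len(ranked) - 1, math.ceil(coverage * len(ranked)) - 1)
    return ranked[max(index, 0)]
```

For example, `suggest_tolerance([-2.0, 1.0, 3.0, -8.0, 0.5], coverage=1.0)` returns 8.0, the largest absolute error in the list.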
Fragment tolerances were always set in Da, because all tested instruments record tandem mass spectra in low-resolution mode. The tolerance settings used in the final identification runs are listed in Table 1.

The whole ISB1 data set contained 22 LC–MS/MS runs (two sets of technical replicates), resulting in 18 999 spectra. The Orbitrap data set contained 10 LC–MS/MS runs, and 4 LC–MS/MS runs were included from the FT Ultra data set. In total, we had 47 292 spectra for the Orbitrap data and 54 551 for the FT Ultra data. Search engine runs were performed against a concatenated protein database containing forward and reversed sequences of the 18 proteins, contaminants, and a whole organism proteome database from the bacterium Sorangium cellulosum.35 The contaminant proteins were trace level contaminants, as listed in Klimek et al.,33 and additional keratin and trypsin sequences. Altogether, the protein database contained 18 812 sequences.

For the E. coli data set, two different databases were used. The whole data set contained 30 358 spectra and was searched against a small organism-specific database, based on the recent genome sequences of E. coli K-12 bacteria.36 This concatenated database contained forward and reversed protein sequences of the 4136 E. coli proteins. The second database used for the E. coli data set was a concatenated (forward/reverse) version of the Swiss-Prot release 2010_06 database. This database contains 712 388 protein sequences in total. Trypsin was set as protease for all the search engines. We allowed one missed cleavage and no semitryptic peptides. All identifications were conducted using the respective search engine adapters available in TOPP.37

Comparison to PepArML
We evaluated our novel consensus scoring method against another state-of-the-art search engine combiner. For the PepArML24 searches, the PepArML web service was used (https://edwardslab.bmcb.georgetown.edu/pymsio/). PepArML was set to search the E. coli data using OMSSA, X!Tandem, and Mascot. We adjusted the instrument selection to "Thermo Fisher Scientific Orbi-LTQ", and trypsin was selected as the proteolytic agent. Carbamidomethylation of cysteines was selected as fixed and oxidized methionine as variable modification. The Swiss-Prot database (release 2010_06) was chosen. Further PepArML parameters were specific peptide candidate selection and two decoy replicates, and the search chunk size was set to 500. The web server's precursor mass tolerances are not adjustable, but for "specific peptide candidate selection" the precursor mass tolerance is set to 2 Da by default.

Table 1. Precursor and Fragment Mass Tolerances that Were Estimated by Analyzing the Error Distributions of High-Confidence Peptide Spectrum Matches before the Production Identification Runs Were Performed

                      Orbitrap   FT Ultra   LCQ      FT-FT
Precursor tolerance   10 ppm     30 ppm     1.5 Da   10 ppm
Fragment tolerance    0.5 Da     0.5 Da     0.5 Da   0.01 Da

Statistical Assessment
For the statistical assessment of our results, we use the notion of q-values, as described by Käll et al.34 The q-value in this case corresponds to the lowest false discovery rate (FDR) at which a given PSM can be accepted as a correct identification. For any given score threshold, the FDR can be calculated by treating the decoy hits with scores better than the threshold as false positives. For our analysis, the FDR for any given threshold t is calculated as follows:

$$\mathrm{FDR} = \frac{FP_t}{P_t}$$

where $FP_t$ corresponds to the false positives (number of decoy hits) and $P_t$ denotes the number of positives at threshold t. For the final evaluation and visualization, only the top-ranked peptides are used and the number of positives is plotted as a function of the FDR (q-value). This procedure allows evaluating the results of the individual search engines, as well as the consensus results, without previous knowledge of the peptide IDs.
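The q-value calculation described above can be sketched in Python. The sketch assumes that higher scores are better, that the number of positives at a threshold is the number of target hits above it (the text does not spell out whether decoys are included), that the FDR is capped at 1, and that score ties need no special treatment; the function name is hypothetical.

```python
def q_values(psms):
    """Compute q-values for a list of (score, is_decoy) top-ranked PSMs.

    Returns the q-values in the same order as the input list.
    """
    order = sorted(range(len(psms)), key=lambda i: psms[i][0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if psms[i][1]:
            decoys += 1
        else:
            targets += 1
        # FDR at the threshold that just accepts this PSM (assumed: decoys / targets).
        fdrs.append(min(1.0, decoys / targets) if targets else 1.0)
    result = [0.0] * len(psms)
    lowest = 1.0
    # The q-value is the lowest FDR among all thresholds that still accept the PSM.
    for rank in range(len(order) - 1, -1, -1):
        lowest = min(lowest, fdrs[rank])
        result[order[rank]] = lowest
    return result


# Hypothetical scores; True marks a decoy hit.
example = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (6.0, True)]
example_q = q_values(example)
```

Counting the PSMs whose q-value is at or below a cutoff such as 0.01 then gives the number of identifications at that FDR.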
■ RESULTS
Database Searching
For each data set, a presearch is performed using broad tolerance windows of 3 Da as a precursor tolerance and 0.5 Da as a fragment tolerance. The OMSSA presearch results are shown in the Supporting Information (Supplemental Figure 2). The results of the presearches are evaluated and the final error tolerances are chosen based on this evaluation. This evaluation revealed that stringent search windows of 10 ppm are appropriate for all high-accuracy data sets, except for the FT Ultra data set, where the analysis of the presearches revealed 30 ppm as an appropriate setting for the final search.

True Peptide Sequences on Different Ranks
The number of true peptides and their corresponding ranks in the individual search engine output is shown in the Supporting Information (Supplemental Figure 1). The lack of hits on positions 3–10 in the X!Tandem results is due to the implementation of X!Tandem. By default, the output only contains the top-ranked peptide, and the peptide on the second rank is only indicated if both peptides were given equal scores.

In general, we found expected and additional peptides. As expected peptides we denote the subset of peptides that theoretically result from a tryptic digest of the 18 proteins and known contaminants (including carbamidomethylated cysteines as well as optionally one missed cleavage and oxidized methionines). As additional peptides we denote subsequences of the 18 proteins or known contaminants, other than the subset of expected peptides. Additional peptides from the OMSSA and Mascot searches are peptides where the N-terminal methionine is missing. Additional peptides from X!Tandem searches fall into a variety of categories. Those peptides contain pyroglutamic acids, different N-acetylated amino acids, ammonia losses, or acetylations. The contribution of those additional peptides gained from X!Tandem searches amounts to an additional 14% for the LCQ and the FT Ultra data sets and even 27% for the Orbitrap data set. These peptides are partly due to the refinement mode of X!Tandem and to its ability to check for neutral losses on Q, E, or C residues. Note that the refinement mode was disabled for all subsequent analyses and consensus calculations.

Correct sequences may appear at the lower range of the rankings from individual searches. We observed that if the three search engines agree on a peptide identification (all three engines suggest the same sequence on rank one), approximately 94% of those peptides are correct sequences. However, this agreement only corresponds to approximately 10% of all annotated spectra. It can also be observed that some search engines are more conservative than others. A high number of spectra that are given candidates by OMSSA and X!Tandem are not annotated by Mascot (see Supplemental Figure 3, Supporting Information). Interestingly, the percentage of spectra that are annotated by all engines decreases for high-accuracy FT instruments. It can also be observed that the number of spectra that are only annotated by one engine is high for Mascot and OMSSA, but most peptides annotated by X!Tandem are also annotated by other engines. In the Supporting Information, a comparison of annotated and identified spectra is shown (see Supplemental Tables 1 and 2).

An example of successful consensus scoring, incorporating true hits at lower ranks, is shown in Figure 1, where the QRESTATDILQK peptide was found as the top hit in the list of consensus candidates. None of the single engine scores was good enough to yield significant results for the QRESTATDILQK peptide, but the consensus scoring reranks this sequence based on the newly calculated consensus scores.

Figure 2. (a) Comparison of majority voting (only peptides that are suggested by two or three engines are considered) versus the similarity-based consensus (peptides with summed similarity greater than one are considered). The height of the bars corresponds to the number of top hit sequences that are unambiguously assigned to target (forward) proteins. (b) Comparison of different similarities between the top hit and sequences at lower ranks.

Sequence Similarity
We observed that if search engines disagree on a spectrum, the peptide candidates suggested by the disagreeing search engines show sequence similarity, and the degree of similarity correlates with the reliability of the peptide assignments. This sequence similarity directly reflects spectral similarity, since partial sequence overlaps produce shared fragment peaks. Figure 2 shows a comparison of simple majority voting and the sequence similarity using the output of the three search engines. For the majority voting approach, a peptide was accepted if it was assigned by two or more engines; for the similarity-based method, all similarities for a given candidate peptide were added up. Note that a summed sequence similarity greater than one implies that at least one similar peptide is suggested by one of the other engines. A summed similarity of 2 may correspond to a scenario where two engines agreed and one suggested a peptide that had no similarity to the one suggested by the others, or to a scenario where three different peptides were suggested, showing sequence similarities of 30% and 70%, respectively. Sequence similarity can thus increase the number of target sequences among the top hit candidates (see Figure 2). Note that the difference between the two measures is given by spectra where the search engines disagree, but the similarity of their suggested candidates contributes to the reranking of sequences in the consensus list.
Consensus Scoring
We evaluated different methods for the combination of the scores. Figure 3 shows results for the comparison of different methods. The sequence similarity (SeqSim) has been calculated using either the identity matrix, the PAM30MS matrix, or other PAM matrices. The identity matrix and the PAM30MS matrix yielded better results than the normal PAM30 matrix or higher PAM matrices. The performance of the SPC method was only marginally below the matrix-based methods. The results gained from matrices with higher PAM numbers are generally worse than those from matrices with lower PAM numbers. We also evaluated BLOSUM matrices and found inferior performance compared to the PAM matrices (data not shown). If scores for the same sequence are simply averaged, the results are even below the result for the Mascot search. We further evaluated the product (posterior error probabilities (PEP scores) are multiplied if the same sequence is assigned) and the “minimal PEP” (the minimal PEP is used as the consensus score if engines agree on a spectrum). These methods were found to be worse than the SPC and the matrix-based methods (data not shown). This evaluation was done on the E. coli data set searched against the small organism-specific database. The Needleman–Wunsch algorithm uses parameters for gap penalization in the sequence alignment. As gaps in candidate sequences are very rare, we use stringent penalization for gap opening and gap extension for the global sequence alignment. We found that high gap
dx.doi.org/10.1021/pr2002879 |J. Proteome Res. 2011, 10, 3332–3343
Journal of Proteome Research
Figure 3. Comparison of different combination methods. All blue bars correspond to similarity-based methods. IDENT corresponds to the identity matrix.
penalties are more suitable than less stringent penalization. Our observations were found to be consistent at various error rates (1–10%). The shared peak count (SPC) scoring performs almost as well as the best matrix-based methods, and averaging is consistently below the matrix-based methods. All combination methods except the averaging method outperform the single engines. ROC (receiver operating characteristic) analysis was performed to compare the results of the single engines and to visualize the benefit of the consensus scoring. For our purposes we adapted this analysis and plotted the number of correctly identified spectra as a function of the error rate based on q-values. The results of this analysis are shown in Figure 4. The consensus scoring method was significantly better than the single engines for all data sets. The most significant improvements at low error rates are observed for the high accuracy FT instruments. For the Orbitrap data, the consensus scoring identified 30% more peptides than OMSSA, 23% more than Mascot, and 18% more than X!Tandem at a 1% FDR. For the FT Ultra data set the improvement rates were even higher: Mascot was outperformed by 60%, OMSSA by 56%, and X!Tandem by 16%. On the LCQ data at 1% FDR, Mascot was outperformed by 54%, OMSSA by 21%, and X!Tandem by 20%. The E. coli data set was the most complex among our four data sets. For the consensus calculation this data
set was searched against the large Swiss-Prot database that contained a considerable number of homologous protein sequences. For these data, the following improvements were achieved: X!Tandem was outperformed by 27%, Mascot by 20%, and 15% more peptides were found compared to a single OMSSA analysis. At higher error rates, the improvement rates are even higher than those obtained at 1% FDR. In database search, mass tolerance windows are known to be crucial parameters for peptide identification, and their settings are usually inferred from the accuracy of the mass spectrometer. Using wider tolerance settings, for example, for the annotation of low-accuracy ion trap data, the set of candidates is much larger compared to small tolerance windows and the risk of false positives is higher. The design of our scoring method does not depend on a specific choice of the scoring matrix and it does not require training data to build the models. Thus, the scoring scheme can easily be adapted to the most commonly used instrument types. We evaluated different values for the gap penalties and found that the performance tends to improve with the stringency of penalization for high accuracy data, whereas the results for low-accuracy data are usually better if moderate penalization and/or the PAM30MS matrix instead of the identity matrix is used. In summary, we applied mixture modeling to convert arbitrary search engine scores into probabilities and peptide similarity
Figure 4. ROC analysis demonstrates the performance of consensus scoring on data sets from different instrument types and data complexities.
scoring to assign scores to peptide sequences that were originally not assigned by a given search engine. The scores from the single engines are then combined by a weighted average. Other combination approaches, such as normal average or product scoring (where single scores are multiplied), were also evaluated. These approaches were consistently worse than the weighted average method and some of these approaches produce results worse than those obtained using only the original single search engines. We thus chose to use the similarity weighted average score for combining single scores. The average is proportional to the sum of the scores and can thus loosely be interpreted as the
accumulation of evidence obtained from the different single sources.
Comparison to Related Methods
Our method was compared to state-of-the-art combiner tools using the PepArML web server. We used the same parameter settings for all search engine combiners, except for the mass tolerance (which was not adjustable on the PepArML web server). Furthermore, the calculation of error rates is also slightly different in PepArML. It uses two independently shuffled decoy databases; one is used for FDR estimation and the other
Table 2. Consensus Scoring Yields Increased Identification Rates in Comparison to Other State-of-the-Art Combination Methods
FDR      OpenMS:ConsensusID   PepArML   PepArML:heuristic combiner
0.005    11393                11517     7370
0.01     12283                11913     8943
0.02     13263                12122     9625
0.05     15113                12291     11485
one for score recalibration.24 At error rates below 1%, the machine-learning-based method implemented in the PepArML program identifies more spectra than the consensus scoring, but at error rates greater than or equal to 1% FDR the consensus scoring outperformed both the PepArML combiner and the PepArML heuristic combiner. The improvements in peptide identification rates vary between 3 and 22% in comparison to PepArML and between 37 and 54% in comparison to the PepArML heuristic combiner for error rates from 1 to 10%. Interestingly, the number of peptides identified by PepArML increases rapidly at very low error rates, but when more false positives are allowed (higher error rates), PepArML only moderately increases its identification rates. The consensus scoring, in contrast, shows increasing numbers of identified spectra as a function of increasing q-value thresholds. Results from the comparative analysis are shown in Table 2 and the ROC analysis can be found in the Supporting Information (Supplemental Figure 5).
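The relative improvements can be recomputed directly from the counts in Table 2. The short sketch below does this arithmetic for the four FDR thresholds listed in the table; the variable and function names are illustrative only, and a negative value means that the other method identified more spectra.

```python
# Identified spectra from Table 2, keyed by FDR threshold:
# (OpenMS:ConsensusID, PepArML, PepArML heuristic combiner)
table2 = {
    0.005: (11393, 11517, 7370),
    0.01: (12283, 11913, 8943),
    0.02: (13263, 12122, 9625),
    0.05: (15113, 12291, 11485),
}

def relative_gain(consensus, other):
    """Percentage by which the consensus count exceeds the other count."""
    return 100.0 * (consensus - other) / other

# For each threshold: (gain over PepArML, gain over heuristic combiner) in %
gains = {
    fdr: (round(relative_gain(c, p), 1), round(relative_gain(c, h), 1))
    for fdr, (c, p, h) in table2.items()
}

for fdr, (vs_peparml, vs_heuristic) in gains.items():
    print(f"FDR {fdr}: {vs_peparml:+.1f}% vs PepArML, "
          f"{vs_heuristic:+.1f}% vs heuristic combiner")
```

At 1% FDR this gives about 3% over PepArML and about 37% over the heuristic combiner, and at 0.5% FDR the gain over PepArML is negative, in line with the text.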
’ DISCUSSION Proteomics has traditionally been a dynamic area with a broad spectrum of experimental techniques and rapidly evolving instrumentation. This is accompanied by an accumulation of computational tools, as an indispensable part in the analysis workflow. Our consensus scoring approach aims to take advantage of the large diversity of commonly used peptide identification tools. Our approach is designed to incorporate any number of different tandem MS search engines into the consensus scoring. For each tool, the scores need to be converted into probabilities in order to transform them to the same scale. Missing values can be estimated by our sequence similarity approach. The scores can then be combined, for example, by weighted averaging. We included data sets that were generated by a variety of MS instruments. The instrumentation in laboratories is changing rapidly and new MS instruments are continuously entering the market and more and more laboratories are equipped with several mass spectrometers to enable high-throughput and to benefit from special features available on specific instruments. This implies that software for peptide identification needs to cope with this rapid evolution. The method proposed in this work is very robust with respect to the origin of the data. Independent of which individual search engine performs best on any given data set, the consensus approach always yields a superior performance in our tests. Previous studies aiming at combining search engine scores focused on low mass accuracy ion traps.23 The importance of more accurate and more sensitive instruments is obvious. To our knowledge, our consensus scoring is the first approach that offers significant improvements in peptide identification rates using peptide or spectral similarity as an additional measure. We show improved identification rates on data generated by state-of-the-art mass spectrometry platforms,
e.g., on high resolution FT data. Similar to the approach suggested by Searle et al.,23 we convert the raw search engine scores into probabilities. For all search engines, the E-values are used as discriminant scores to fit the mixture model. We calculate only one model for each search engine. We evaluated the benefit of separate models for different charge states and found no significant improvements for the consensus score. Using the search engines’ E-values as discriminant scores, all suggested candidates can be fit by the model. This would not be possible if the discriminant score incorporated terms that are specific for top hits, such as the ΔCn score from Sequest8 results. Candidates that are found behind the top hit fall most frequently into the distribution of false hits, because their E-values are accordingly worse than the top hit E-values. However, if several hits were assigned significant E-values by the search engines, the converted PEP scores will reflect this significance and rerank this PSM among significant candidates. By integrating the information gained from the other engines, the assigned probability-like peptide scores become more accurate, and in cases where specific peptides were ranked best with poor scores, the information from the other engines helps to improve this score and ultimately to bring it into a range where it is accepted as correctly identified. Assuming a given mass spectrum has been recorded without any major technical bias, the likelihood that several search engines assign the same wrong peptide to rank one is smaller than the likelihood that a subset of engines fails for this spectrum. This putative failure is corrected by integrating information from other engines via similarity scoring. We evaluated different methods and found that the identity matrix, as well as methods based on different PAM matrices, perform best in most cases.
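The conversion of E-values into posterior error probabilities can be illustrated with a two-component mixture. The sketch below is a simplified illustration under stated assumptions rather than the model fitted in this work: both components are taken to be Gaussian on the -log10(E-value) scale, and the component parameters and the prior are hypothetical values chosen by hand instead of being estimated from data.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posterior_error_probability(e_value,
                                incorrect=(0.0, 1.0),   # hypothetical (mean, sd)
                                correct=(4.0, 1.0),     # hypothetical (mean, sd)
                                prior_incorrect=0.7):   # hypothetical mixing weight
    """PEP = P(incorrect | score) under a two-component mixture.

    The discriminant score is x = -log10(E-value), so that larger x
    means a more significant hit.
    """
    x = -math.log10(e_value)
    f_incorrect = prior_incorrect * normal_pdf(x, *incorrect)
    f_correct = (1.0 - prior_incorrect) * normal_pdf(x, *correct)
    return f_incorrect / (f_incorrect + f_correct)

# Every candidate of a spectrum, not only the top hit, can be scored this way.
pep_strong = posterior_error_probability(1e-6)
pep_borderline = posterior_error_probability(1e-2)
pep_weak = posterior_error_probability(1.0)
```

Because the score depends only on the E-value of the individual candidate, a second-ranked peptide with a significant E-value receives a low PEP in its own right, which is the property the text relies on.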
While the adapted identity matrix was found to be the most suitable matrix for high accuracy data, the PAM30MS matrix was found to be the most suitable scoring matrix for low accuracy data. The PAM30MS has previously been successfully applied to the identification of homologous proteasomal proteins from Trypanosoma brucei31 and as a method for peptide identification in unsequenced organisms.38 From our experiments it became evident that strong penalization of gaps in the alignment works better for high accuracy data sets. On low accuracy data sets it is advantageous to reduce the strength of the penalization. This can be explained by the increased number of candidates in larger search spaces: the likelihood that similar peptides are present as candidates is much higher than in the smaller sets defined for high accuracy searches. Although global sequence alignment methods classically use two different parameters for gap opening and extension, our experiments revealed that both penalties can be set to the same value. This implies that if gaps are opened, they are also extended in the majority of cases. Generally, the improvements gained by consensus scoring are more significant if the database search is conducted against a large database that contains a considerable number of homologous sequences. The E. coli data set was searched against both a small (organism-specific) and a large (Swiss-Prot) database. Using a large database, there is a higher chance that different search engines assign very similar (homologous) sequences. This misassignment is then corrected by the consensus approach, as homologous sequences will result in high sequence similarities. The SPC method performed almost as well as the matrix-based methods; however, for all error rates it was slightly below the results gained from the identity matrix. As shown, the sequence similarity correlates with the number of shared fragment ion
masses, although the extent of similarity is lower for the sequence-based measure. In a random alignment of arbitrary spectra, a basal number of shared peaks occurs much more often than a random similarity between two arbitrary peptide sequences. Another alternative for the combination of these data would be the product of the individual probabilities. The product is often useful in applications with a probability theory framework. Using the product as a consensus score, we would have to assume independent scoring of the individual search engines. It is generally accepted that the ultimate prerequisite for all database search algorithms is the presence of fragmentation-specific product ions; assuming independence is thus not realistic. Furthermore, a large score obtained by multiple search engines indicates a high probability of a correct hit and should not be heavily penalized by a low score from a single search engine. The average, in contrast to the product, accounts for this underlying property. We used a heuristic modification of the weighted average, further dividing it by the summed similarity. This modification favors the peptides that were scored by a high number of engines or showed high overall similarity to the other peptides suggested for the spectrum under consideration. Our suggested strategy outperformed state-of-the-art methods for the combination of search results. In comparison to a machine-learning-based method, we significantly improve the identification rates at error rates greater than 1%. The ROC results from PepArML show a very steep slope at very low error rates, but the increase in spectrum identifications is very small if an increasing number of false positives is accepted. The PepArML heuristic combiner was consistently below PepArML and our consensus scoring.
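The heuristic described above, a similarity-weighted average that is further divided by the summed similarity, can be written down compactly. The following sketch is one possible reading of that description and is not taken from the OpenMS source: the function name and the input layout are hypothetical, and scores are assumed to be PEP-like, so that lower values are better.

```python
def consensus_score(evidence):
    """Similarity-weighted average of PEP-like scores, divided once more
    by the summed similarity.

    evidence holds one (similarity, pep) pair per search engine. The
    similarity is 1.0 if the engine reported the candidate itself;
    otherwise it is the similarity to that engine's closest candidate,
    whose PEP is used in place of the missing score.
    """
    total_similarity = sum(sim for sim, _ in evidence)
    if total_similarity == 0:
        return 1.0  # no supporting evidence at all: worst possible score
    weighted_average = sum(sim * pep for sim, pep in evidence) / total_similarity
    return weighted_average / total_similarity

# Hypothetical candidate A: reported by two engines, and a 70%-similar
# peptide was suggested by the third engine.
score_a = consensus_score([(1.0, 0.05), (1.0, 0.10), (0.7, 0.20)])

# Hypothetical candidate B: reported by a single engine only.
score_b = consensus_score([(1.0, 0.05)])
```

Candidate A ends up with the lower (better) score even though candidate B has an equally good best individual PEP, which reflects the stated preference for peptides supported by several engines or by similar candidates.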
The search results that form the basis for the consensus scoring were obtained with a precursor mass tolerance of 10 ppm, whereas PepArML24 routinely uses 2 Da as precursor mass tolerance. High mass error settings allow the identification of spectra where the C13 peak was selected for fragmentation, whereas narrow search tolerances prevent the identification of such spectra. On the other hand, the advantage of well-calibrated high accuracy mass spectrometers is that they measure peptide masses at very high precision. Database searches with low error tolerances avoid false positive identifications. Low error tolerances were highly beneficial for the data used in this study and significantly contribute to the better performance of our consensus scoring in comparison to other methods. If X!Tandem’s refinement mode is used, the number of peptide sequences identified by X!Tandem exceeds the numbers that can be obtained for Mascot and OMSSA. The consensus approach generally allows enabling the refinement function, but it is routinely disabled for two reasons: first, it would result in an unfair comparison of the three search engines, as they would no longer compete for the same sets of peptides; second, it remains unclear how to treat the statistical significance of the peptides gained from the refinement mode. The bar plots in the supplementary material show that X!Tandem’s refinement identifies a significant number of peptides that are a priori not expected if one simply accepted the tryptic peptides of the 18 proteins or known contaminants (Supplemental Figure 1, Supporting Information). However, when every subsequence of the proteins known to be in the mixture is accepted, the number of identified sequences increases. The most pronounced improvements by X!Tandem’s refinement searches were observed on the high accuracy FT data. Recording precursor masses with very high accuracy results in significant improvements for peptide identification.
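The trade-off between the two precursor tolerance settings mentioned above (10 ppm versus 2 Da) can be made concrete with a small calculation. The sketch below uses a hypothetical peptide mass of 1500 Da and works on neutral monoisotopic masses; the helper name is illustrative, and the only physical constant is the mass difference between 13C and 12C.

```python
C13_C12_DELTA = 1.0033548  # Da, mass difference between 13C and 12C

def within_tolerance(observed, theoretical, tolerance, unit):
    """Check a precursor mass against a tolerance given in ppm or Da."""
    delta = abs(observed - theoretical)
    if unit == "ppm":
        return delta / theoretical * 1e6 <= tolerance
    if unit == "Da":
        return delta <= tolerance
    raise ValueError(f"unknown unit: {unit}")

theoretical = 1500.0                             # hypothetical peptide mass in Da
well_calibrated = theoretical + 0.003            # about 2 ppm measurement error
isotope_shifted = theoretical + C13_C12_DELTA    # C13 peak picked as precursor

narrow_accepts_calibrated = within_tolerance(well_calibrated, theoretical, 10, "ppm")
narrow_accepts_isotope = within_tolerance(isotope_shifted, theoretical, 10, "ppm")
wide_accepts_isotope = within_tolerance(isotope_shifted, theoretical, 2, "Da")
```

For this mass the isotope shift corresponds to roughly 670 ppm, so the 10 ppm window rejects the shifted precursor while the 2 Da window accepts it, at the price of admitting every other candidate within 2 Da as well.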
It can be concluded that the incorporation of refinement
searches might be beneficial for consensus approaches, as it allows additional peptide-dependent search runs that complement the primary run by including further modifications (e.g., acetylations, ammonia losses, Glu conversion to pyroGlu, etc.) and missed cleavages. Mascot and OMSSA do not, in their current versions, automatically allow similar approaches. However, to this end, future research will need to evaluate methods to incorporate statistical significance in cases where only a subset of engines is used in a multipass setting. The imputation procedure for peptide sequences is based on a pairwise global alignment and uses the Needleman–Wunsch algorithm. This algorithm uses substitution matrices to score the peptide alignment. If the identity matrix is used as a substitution matrix, the similarity corresponds to the percentage of identity. The PAM30MS matrix similarity measure is less stringent. Matrices such as the PAM matrices are used in evolutionary biology to determine the similarity of proteins based on mutation probabilities; substitution matrices are thus per se not constructed to account for spectral similarity. However, the calculation of sequence similarity favors similar residues, such as I and L, over unrelated residues and can thus be used as a method that strongly correlates with spectral similarity. Global alignments are, in general, used to align protein or nucleotide sequences if they are expected to be similar and have roughly the same length. The lengths of the peptide candidates should not differ greatly, since precursor masses are recorded and used to restrict the space of candidate hits. The quality of MS spectra strongly correlates with the presence of all expected fragment masses. Missing masses or imprecise precursor masses are reasons that contribute to incorrect assignments of peptide sequences by search engines.
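A global alignment with the identity matrix is simple enough to sketch in full. The code below is a minimal illustration and not the OpenMS implementation: it uses a single linear gap penalty (gap opening and extension set to the same value), and the normalization by the length of the longer peptide, with negative scores clipped to zero, is an assumption made here so that identical sequences obtain a similarity of 1.

```python
def needleman_wunsch_score(a, b, gap=-5, match=1, mismatch=0):
    """Optimal global alignment score with identity scoring and a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # prefix of a aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # prefix of b aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[-1][-1]

def sequence_similarity(a, b, gap=-5):
    """Alignment score scaled to [0, 1] by the longer sequence length (assumed normalization)."""
    raw = needleman_wunsch_score(a, b, gap=gap)
    return max(0.0, raw / max(len(a), len(b)))

# Hypothetical peptide pairs for illustration.
identical = sequence_similarity("PEPTIDEK", "PEPTIDEK")
one_substitution = sequence_similarity("PEPTLDEK", "PEPTIDEK")
strict_gap = sequence_similarity("PEPTIDEK", "PEPTIDE", gap=-5)
lenient_gap = sequence_similarity("PEPTIDEK", "PEPTIDE", gap=-1)
```

With the stringent penalty a single missing residue reduces the similarity far more than a substitution does, whereas the lenient penalty keeps such candidates close, which mirrors the different settings discussed for high and low accuracy data.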
With growing accuracy and sensitivity in MS instrumentation, those inadequacies are diminishing; however, they are still present to some extent. The alignment penalization parameter is a very suitable value for adjusting the scoring scheme to different instruments. In lists of putative candidate peptides, there are hardly any peptides that have very large gaps; however, we find peptide sequences that miss amino acid masses. State-of-the-art FT instruments allow acquiring tandem MS data with mass deviation below 1 ppm,39 which implies that the experimental spectra can be more easily correlated with theoretical spectra. Sequences that show only partial correctness are rare and need to be highly penalized. If the mass accuracy is not as high as for FT instruments, we can assume that peptides that show small deviations from the correct sequence are more frequent, since the mass tolerance in relation to the theoretical spectra allows more putative candidates. The proteomics community strongly relies on database search engines. Peptide identification is both the most fundamental and the most important step in a tandem MS based proteomics study. To fulfill the high demands attributed to proteomics, combining different strengths and reducing weaknesses of individual peptide identification approaches is strongly needed. The algorithms introduced in this manuscript are available through OpenMS,40 a framework for mass spectrometry, and also as a command line tool of the TOPP pipeline.37
’ ASSOCIATED CONTENT
Supporting Information Supplemental figures and tables. This material is available free of charge via the Internet at http://pubs.acs.org.
’ AUTHOR INFORMATION Corresponding Author
*E-mail:
[email protected]. Phone: +49-7071-2970461. Fax: +49-7071-29-705152.
’ ACKNOWLEDGMENT Part of this research has been funded by the BMBF/QuantPro 0313842.
’ REFERENCES
(1) Hu, Q.; Noll, R. J.; Li, H.; Makarov, A.; Hardman, M.; Cooks, R. G. The Orbitrap: a new mass spectrometer. J. Mass Spectrom. 2005, 40, 430–443.
(2) Wolters, D. A.; Washburn, M. P.; Yates, J. R. An automated multidimensional protein identification technology for shotgun proteomics. Anal. Chem. 2001, 73, 5683–5690.
(3) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422, 198–207.
(4) Bandeira, N. Spectral networks: a new approach to de novo discovery of protein sequences and posttranslational modifications. Biotechniques 2007, 42, 687, 689, 691 passim.
(5) Bertsch, A.; Leinenbach, A.; Pervukhin, A.; Lubeck, M.; Hartmer, R.; Baessmann, C.; Elnakady, Y. A.; Müller, R.; Böcker, S.; Huber, C. G.; Kohlbacher, O. De novo peptide sequencing by tandem MS using complementary CID and electron transfer dissociation. Electrophoresis 2009, 30, 3736–3747.
(6) Edwards, N. J. Novel peptide identification from tandem mass spectra using ESTs and sequence database compression. Mol. Syst. Biol. 2007, 3, 102.
(7) Colinge, J.; Masselot, A.; Giron, M.; Dessingy, T.; Magnin, J. OLAV: towards high-throughput tandem mass spectrometry data identification. Proteomics 2003, 3, 1454–1463.
(8) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. An Approach to Correlate Tandem Mass Spectral Data of Peptides with Amino Acid Sequences in a Protein Database. J. Am. Soc. Mass Spectrom. 1994, 5, 976–989.
(9) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567.
(10) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20, 1466–1467.
(11) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3, 958–964.
(12) Tanner, S.; Shu, H.; Frank, A.; Wang, L.-C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 2005, 77, 4626–4639.
(13) Jones, P.; Côté, R. G.; Martens, L.; Quinn, A. F.; Taylor, C. F.; Derache, W.; Hermjakob, H.; Apweiler, R. PRIDE: a public repository of protein and peptide identifications for the proteomics community. Nucleic Acids Res. 2006, 34, D659–D663.
(14) Martens, L.; Hermjakob, H.; Jones, P.; Adamski, M.; Taylor, C.; States, D.; Gevaert, K.; Vandekerckhove, J.; Apweiler, R. PRIDE: the proteomics identifications database. Proteomics 2005, 5, 3537–3545.
(15) Kumar, C.; Mann, M. Bioinformatics analysis of mass spectrometry-based proteomics data sets. FEBS Lett. 2009, 583, 1703–1712.
(16) Elias, J. E.; Gygi, S. P. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat. Methods 2007, 4, 207–214.
(17) Nesvizhskii, A. I.; Vitek, O.; Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 2007, 4, 787–797.
(18) Kapp, E. A.; Schütz, F.; Connolly, L. M.; Chakel, J. A.; Meza, J. E.; Miller, C. A.; Fenyo, D.; Eng, J. K.; Adkins, J. N.; Omenn, G. S.; Simpson, R. J. An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: sensitivity and specificity analysis. Proteomics 2005, 5, 3475–3490.
(19) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 2002, 74, 5383–5392.
(20) Dempster, A. P.; Laird, N. M.; Rubin, D. B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. B (Methodological) 1977, 39, 1–38.
(21) Choi, H.; Nesvizhskii, A. I. Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics. J. Proteome Res. 2008, 7, 254–265.
(22) Käll, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Posterior error probabilities and false discovery rates: two sides of the same coin. J. Proteome Res. 2008, 7, 40–44.
(23) Searle, B. C.; Turner, M.; Nesvizhskii, A. I. Improving sensitivity by probabilistically combining results from multiple MS/MS search methodologies. J. Proteome Res. 2008, 7, 245–253.
(24) Edwards, N. J.; Wu, X.; Tseng, C.-W. An Unsupervised, Model-Free, Machine-Learning Combiner for Peptide Identifications from Tandem Mass Spectra. Clin. Proteomics 2009, 5, 23–36.
(25) Shteynberg, D.; Deutsch, E.; Lam, H.; Aebersold, R.; Nesvizhskii, A. iProphet: Improved Validation of Peptide Identification in Shotgun Proteomics. HUPO World Congress, Amsterdam, The Netherlands, 2008.
(26) Tharakan, R.; Edwards, N.; Graham, D. R. M. Data maximization by multipass analysis of protein mass spectra. Proteomics 2010, 10, 1160–1171.
(27) Fenyö, D.; Beavis, R. C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 2003, 75, 768–774.
(28) Leisch, F. FlexMix: A general framework for finite mixture models and latent class regression in R. J. Stat. Softw. 2004, 11, 1–18.
(29) Steen, H.; Mann, M. The ABC’s (and XYZ’s) of peptide sequencing. Nat. Rev. Mol. Cell Biol. 2004, 5, 699–711.
(30) Needleman, S. B.; Wunsch, C. D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 1970, 48, 443–453.
(31) Huang, L.; Jacob, R. J.; Pegg, S. C.; Baldwin, M. A.; Wang, C. C.; Burlingame, A. L.; Babbitt, P. C. Functional assignment of the 20 S proteasome from Trypanosoma brucei using mass spectrometry and new bioinformatics approaches. J. Biol. Chem. 2001, 276, 28327–28339.
(32) Keller, A.; Purvine, S.; Nesvizhskii, A. I.; Stolyar, S.; Goodlett, D. R.; Kolker, E. Experimental protein mixture for validating tandem mass spectral analysis. OMICS 2002, 6, 207–212.
(33) Klimek, J.; Eddes, J. S.; Hohmann, L.; Jackson, J.; Peterson, A.; Letarte, S.; Gafken, P. R.; Katz, J. E.; Mallick, P.; Lee, H.; Schmidt, A.; Ossola, R.; Eng, J. K.; Aebersold, R.; Martin, D. B. The Standard Protein Mix Database: A Diverse Data Set To Assist in the Production of Improved Peptide and Protein Identification Software Tools. J. Proteome Res. 2008, 7, 96–103.
(34) Käll, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 2008, 7, 29–34.
(35) Schneiker, S.; et al. Complete genome sequence of the myxobacterium Sorangium cellulosum. Nat. Biotechnol. 2007, 25, 1281–1289.
(36) Riley, M.; et al. Escherichia coli K-12: a cooperatively developed annotation snapshot-2005. Nucleic Acids Res. 2006, 34, 1–9.
(37) Kohlbacher, O.; Reinert, K.; Gröpl, C.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Sturm, M. TOPP-the OpenMS proteomics pipeline. Bioinformatics 2007, 23, e191–e197.
(38) Shevchenko, A.; Sunyaev, S.; Loboda, A.; Shevchenko, A.; Bork, P.; Ens, W.; Standing, K. G. Charting the proteomes of organisms with unsequenced genomes by MALDI-quadrupole time-of-flight mass spectrometry and BLAST homology searching. Anal. Chem. 2001, 73, 1917–1926.
(39) Olsen, J. V.; de Godoy, L. M. F.; Li, G.; Macek, B.; Mortensen, P.; Pesch, R.; Makarov, A.; Lange, O.; Horning, S.; Mann, M. Parts per million mass accuracy on an Orbitrap mass spectrometer via lock mass injection into a C-trap. Mol. Cell. Proteomics 2005, 4, 2010–2021.
(40) Sturm, M.; Bertsch, A.; Gröpl, C.; Hildebrandt, A.; Hussong, R.; Lange, E.; Pfeifer, N.; Schulz-Trieglaff, O.; Zerck, A.; Reinert, K.; Kohlbacher, O. OpenMS - an open-source software framework for mass spectrometry. BMC Bioinform. 2008, 9, 163.