Binomial Probability Distribution Model-Based Protein Identification

School of Mathematics and Computer Science, Yunnan University of Nationalities, Kunming 650031, China. J. Proteome Res. , 2013, 12 (1), pp 328–335. ...
Article pubs.acs.org/jpr

Binomial Probability Distribution Model-Based Protein Identification Algorithm for Tandem Mass Spectrometry Utilizing Peak Intensity Information

Chuan-Le Xiao,†,§ Xiao-Zhou Chen,‡,§ Yang-Li Du,‡,§ Xuesong Sun,† Gong Zhang,*,† and Qing-Yu He*,†

† Institute of Life and Health Engineering, Key Laboratory of Functional Protein Research of Guangdong Higher Education Institutes, Jinan University, Guangzhou 510632, China
‡ School of Mathematics and Computer Science, Yunnan University of Nationalities, Kunming 650031, China

Supporting Information

ABSTRACT: Mass spectrometry has become one of the most important technologies in proteomic analysis. Tandem mass spectrometry (LC-MS/MS) is a major tool for the analysis of peptide mixtures from protein samples. The key step of MS data processing is the identification of peptides from experimental spectra by searching public sequence databases. Although a number of algorithms to identify peptides from MS/MS data have already been proposed, e.g. Sequest, OMSSA, X!Tandem, and Mascot, they are mainly based on statistical models that consider only peak matches between experimental and theoretical spectra, not peak intensity information. Moreover, different algorithms give different results from the same MS data, implying probable incompleteness and questionable reproducibility. We developed a novel peptide identification algorithm, ProVerB, based on a binomial probability distribution model of protein tandem mass spectrometry combined with a new scoring function that makes full use of peak intensity information and thus enhances the ability of identification. Compared with Mascot, Sequest, and SQID, ProVerB identified significantly more peptides from LC-MS/MS data sets at 1% False Discovery Rate (FDR) and provided more confident peptide identifications. ProVerB is also compatible with various platforms and experimental data sets, showing its robustness and versatility. The open-source program ProVerB is available at http://bioinformatics.jnu.edu.cn/software/proverb/.

KEYWORDS: protein identification algorithm, tandem mass spectrometry, statistical model



© 2012 American Chemical Society

INTRODUCTION

Soft ionization techniques, e.g. matrix-assisted laser desorption ionization (MALDI) and electrospray ionization (ESI),1,2 are able to maintain the integrity of peptides, thus empowering mass spectrometry (MS) methods to perform proteomic analysis.3−5 Protein identification is the most fundamental step in the data processing pipeline, since the sensitivity and accuracy of the identification algorithm are crucial for downstream analyses.6,7 Generally, a peptide identification algorithm selects some peaks from the spectra, evaluates the similarity between the experimental and theoretical spectra, and then assigns the best match within the peptide error window as the result.8 The scoring models that evaluate the similarity between experimental and theoretical spectra should consider three aspects: the number of peak matches, the number of consecutive peak matches, and the intensities of matched peaks.9 A number of peptide identification algorithms with various concepts for MS data are available, e.g. Mascot,10 Sequest,11 OMSSA,12 X!Tandem,13 MassWiz,14 Andromeda, and SQID.3−5 Mascot and Sequest are widely used commercial software and commonly adopted search tools in protein identification;15 however, only limited details of these algorithms are released. Mascot is based on a probability model, whereas Sequest is based on an empirical scoring model that computes the cross-correlation between experimental and theoretical spectra. Mascot selects the highest peak in each 14 Da mass interval and keeps the peaks with intensities above the threshold. Sequest takes consecutive matches of ions and intensity information into account; it preprocesses the spectrum by keeping the top 200 peaks and separates the spectrum into ten bins for normalization.15 X!Tandem uses a hypergeometric scoring model, while OMSSA is based on a Poisson scoring model to assess the significance of a peptide match. They select the 50 most intense peaks by default. MassWiz divides the spectrum dynamically and takes a maximum of 5 most intense peaks from each bin. SQID keeps the top 80 peaks after deleting parent-related peaks.9 However, none of these algorithms accurately uses the entire information in MS experiments.14,16,17 They share similar

Received: June 7, 2012 Published: November 19, 2012

dx.doi.org/10.1021/pr300781t | J. Proteome Res. 2013, 12, 328−335

Journal of Proteome Research

Article

200 °C; 35% normalized collision energy use for MS2; ion selection thresholds, 1000 counts for MS2; and activation q = 0.25 and activation time of 30 ms during MS2 acquisitions. The mass spectrometers were operated in positive ion mode with a data-dependent automatic switch between MS and MS/MS acquisition modes.19

methods to generate theoretical spectra. Considering six types of ions (b-, y-, b-H2O, b-NH3, y-H2O, and y-NH3) in CID (collision-induced dissociation) fragmentation mode, theoretical peak intensities are set to three artificial values: 50 (b- and y-ions), 25 (b- and y-ions that have lost H2O or NH3), and 10 (a-ions). The resulting theoretical spectrum does not fully reflect the intensity characteristics of experimental spectra.11 Therefore, once the peaks are selected, these algorithms do not use the peak intensity information obtained in the experiment when comparing the experimental and theoretical spectra.10,12,13,15 The incomplete use of MS information compromises the sensitivity, robustness, and confidence of most of these algorithms. A recent algorithm, SQID, attempts to address this issue by introducing the strength probability of pairwise amino acid fragments to consider the intensity match quality.9 To make full use of the MS information and to maximize universality, we present here a novel identification algorithm, the protein verification algorithm based on the binomial probability distribution (ProVerB), to enhance the accuracy, completeness, and robustness of peptide identification. We tested ProVerB against other algorithms using multiple MS data sets. At 1% FDR, ProVerB identified significantly more peptides, and with higher confidence, than the widely used Mascot and Sequest, and it did so consistently across data sets.



Mass Spectrometry Data Sets

The data sets (Mix 3) of standard mixtures of 18 proteins obtained from five types of instruments (Agilent XCT, Thermo Finnigan LTQ-FT, Thermo Finnigan LCQ DECA, Thermo Finnigan LTQ, and Micromass/Waters QTOF Ultima, abbreviated below as Agilent, FT, LCQ, LTQ, and QTOF, respectively) were obtained from https://regis-web.systemsbiology.net/PublicDatasets/ to test the accuracy and dynamic range of the algorithms.22,23 The LTQ-Orbitrap data obtained from the S. pneumoniae D39 protein identification, containing more than 270,000 spectra, served as the training data set for the parameters of the model. The data set of the E. coli proteome was obtained from http://marcottelab.org/MSdata/Data_03/.24

Data Preprocessing

For the S. pneumoniae D39 and E. coli data sets, the raw format files were converted to dta file format by Bioworks 3.31 (Thermo Finnigan, San Jose, CA), and the dta format files were merged to Mascot generic format (mgf) using the merge.pl program (http://www.matrixscience.com/downloads/merge.zip). For the 18 proteins data set, the downloaded dta format files were merged to Mascot generic format (mgf) by the merge.pl program. The dta format files were the input files of our method and the Sequest software.

MATERIALS AND METHODS

Cell Culture and Protein Extraction and Trypsin Digestion

Streptococcus pneumoniae D39 was cultivated in Todd−Hewitt broth with 0.5% yeast extract (THY) in a controlled incubator (37 °C, 5% CO2). Cells were harvested at OD600 ~0.6 by centrifugation at 5000g for 20 min at 4 °C, washed three times with prechilled PBS (10 mM, pH 7.4), and then resuspended in lysis buffer (15 mM Tris-HCl, pH 8.0).18 The mixture was frozen and thawed for three cycles and then sonicated 10 times for 30 s each. The lysate was centrifuged at 12000g for 10 min at 4 °C. Protein concentrations were determined using the Bradford assay, and the proteins were subjected to reduction with 10 mM DTT (37 °C, 3 h) and alkylation with 20 mM iodoacetamide (room temperature, 1 h in the dark). Proteins were precipitated with four volumes of ice-cold acetone, pelleted by centrifugation, and washed twice with ethanol. The pellet was resuspended in 25 mM Tris-HCl buffer (pH 7.6) and digested with sequencing grade modified trypsin (1:25 w/w; Promega, Madison, WI) at 37 °C for 20 h.19

MS/MS Database Search

For target-decoy based FDR calculation, the forward and reverse databases were built for the three data sets as in Table 1.

Table 1. Sequence Number of Forward and Reverse Databases Used for MS/MS Database Search

database                      seq no. of forward and reverse database
S. pneumoniae D39 database    3828
18 proteins database          3644
E. coli database              8558

The Mascot generic format (mgf) files were searched using Mascot 2.3 (Matrix Science, London, U.K.) against the forward and reverse database. The dta files were searched using Sequest 28.13 (Thermo Fisher Scientific, Waltham, MA) and our algorithm ProVerB. The following search criteria were applied for all three algorithms: full tryptic specificity; two missed cleavages were allowed; cysteine (+57.021464 Da, Carbamidomethylation) was set as fixed modification, whereas methionine (+15.994915 Da, oxidation) was considered as variable modification. The values of precursor ion mass tolerance and fragment ion mass tolerance were set as in Table 2 based on the instrument characteristics. The fragment ion tolerance of Sequest was set to 1.0 Da, since it requires an integer value for m/z in the preprocessing of MS data.11
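The search criteria above mix two kinds of mass tolerance, absolute (Da) and relative (ppm). As a minimal sketch of how such a tolerance check might look, the following Python function compares an observed and a theoretical mass value. The function name `within_tolerance` and its parameters are hypothetical and are not taken from ProVerB, Mascot, or Sequest.

```python
def within_tolerance(observed, theoretical, tol, unit="Da"):
    """Return True if two mass values agree within the given tolerance.

    unit="Da"  : absolute tolerance in daltons
    unit="ppm" : relative tolerance in parts per million of the theoretical value
    (Illustrative helper; not part of any search engine's API.)
    """
    diff = abs(observed - theoretical)
    if unit == "ppm":
        return diff <= theoretical * tol * 1e-6
    if unit == "Da":
        return diff <= tol
    raise ValueError("unit must be 'Da' or 'ppm'")
```

For a value near 1000, a 10 ppm window corresponds to about 0.01, far tighter than a 3.0 Da window, which is why the tolerance unit is chosen per instrument.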

SCX-RPLC-MS/MS Analysis

Dried peptides were reconstituted in 5% ACN/0.1% formic acid and analyzed with a Finnigan Surveyor HPLC system online coupled with a LTQ-Orbitrap XL (Thermo Fisher Scientific, Waltham, MA) equipped with a nanospray source.20 The peptide mixtures were loaded onto an SCX column and then eluted with 0, 0.02, 0.06, 0.15, 0.3, 0.5, and 1 M NH4Cl. Each fraction flowed in a C18 column (100 μm i.d., 10 cm length, 5 μm-size resin (Michrom Bioresources, Auburn, CA)) using an autosampler. Peptides were eluted with a 0−35% gradient (Buffer A, 0.1% formic acid, and 5% ACN; Buffer B, 0.1% formic acid and 95% ACN) over 90 min and analyzed online with the LTQ-Orbitrap MS using a data-dependent TOP10 method.21 The parameters used for the mass spectrometric analysis were as follows: spray voltage, 1.85 kV; no sheath and auxiliary gas flow; ion transfer tube temperature

False Discovery Rate (FDR)

The peptide spectrum matches (PSMs) were extracted from the Mascot’s data format file (.dat) with our in-house Matlab program, and PSMs with the highest rank were exported to calculate the FDR threshold. Sequest results were extracted

b-, y-fragment ions contained R, K, Q, or N residues, a loss of NH3 (b-NH3 or y-NH3) was considered.15 If the parent ion charge was +1 or +2, we considered +1/+2 fragment ion peaks. Only when the parent ion charge was not less than 2 and the fragment ions contained one of the R, K, H residues, were +2 fragment ion peaks considered.9
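The neutral-loss rules for the theoretical spectrum (H2O loss for fragments containing S, T, E, or D; NH3 loss for fragments containing R, K, Q, or N) can be sketched in Python as follows. This is an illustrative sketch, not the ProVerB source: it handles singly charged ions only, uses standard monoisotopic residue masses without modifications, and the function name `theoretical_ions` is hypothetical.

```python
# Standard monoisotopic residue masses (Da), unmodified amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, AMMONIA, PROTON = 18.010565, 17.026549, 1.007276


def theoretical_ions(peptide):
    """Return {label: m/z} for singly charged b/y ions and neutral losses."""
    ions = {}
    n = len(peptide)
    for i in range(1, n):
        prefix, suffix = peptide[:i], peptide[i:]
        b_mz = sum(MONO[a] for a in prefix) + PROTON
        y_mz = sum(MONO[a] for a in suffix) + WATER + PROTON
        for kind, frag, idx, mz in (("b", prefix, i, b_mz),
                                    ("y", suffix, n - i, y_mz)):
            ions[f"{kind}{idx}"] = mz
            if set(frag) & set("STED"):      # water loss considered
                ions[f"{kind}{idx}-H2O"] = mz - WATER
            if set(frag) & set("RKQN"):      # ammonia loss considered
                ions[f"{kind}{idx}-NH3"] = mz - AMMONIA
    return ions
```

For the tripeptide SAK, for example, the b1 ion (S) gets a water-loss companion but no ammonia-loss companion, whereas the y1 ion (K) gets only the ammonia-loss companion.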

from Sequest output files (.out), and PSMs with the highest rank and ΔCn ≥ 0.1 were exported to calculate the FDR threshold. ProVerB results and the extracted results of Mascot and Sequest were written to csv format files. All target and decoy scores of rank 1 PSMs were sorted in ascending order to calculate their FDR values by Kall’s method.25,26 The threshold was varied, and the FDR at each threshold was obtained from the following formula:

$$
\mathrm{FDR} = \frac{\text{no. of decoy PSMs above threshold}}{\text{no. of target PSMs above threshold}}
$$

The score threshold was tuned to reach FDR ≤ 1%. The scoring functions vary in different search algorithms: for Mascot, the ion scores were sorted to calculate FDR when peptide length ≥ 6; for Sequest and SQID, the Xcorr scores were sorted to calculate FDR by different precursor ion charge when peptide length ≥ 6 and ΔCn ≥ 0.1 and 0.05, respectively; for ProVerB, the S scores (the final score of each peptide, see below) were sorted to calculate FDR when peptide length ≥ 6.

Table 2. Parameters of Precursor and Fragment Ion Tolerance Settings

                 ProVerB and Mascot                               Sequest
instrument       precursor ion tolerance  fragment ion tolerance  precursor ion tolerance  fragment ion tolerance
AGILENT XCT      2.0 Da                   0.5 Da                  3.0 Da                   1.0 Da
LCQ_Deca         3.0 Da                   0.5 Da                  3.0 Da                   1.0 Da
LTQ              3.0 Da                   0.5 Da                  3.0 Da                   1.0 Da
LTQ-FT           10 ppm                   0.5 Da                  10 ppm                   1.0 Da
QTOF             0.2 Da                   0.2 Da                  10 ppm                   1.0 Da
LTQ-Orbitrap     10 ppm                   0.5 Da                  10 ppm                   1.0 Da

Scoring Function

The scoring function is a critical part of the MS peptide identification algorithm. In our algorithm we applied the binomial probability density function to consider three aspects: simple fragment ion matches, consecutive fragment ion matches, and the intensity of the b/y-ion peaks.

Scoring Function for Simple Fragment Matches. It is difficult to propose a universal scoring function to fit various types of instruments and strategies, the variability in the fragmentation patterns, as well as the extent of fragmentation and the intensities of the peaks.29,30 We solved this problem by establishing a binomial distribution statistical model based on the nature of matching itself, independent of all the experimental factors listed above. The match probability of experimental and theoretical fragment ions reflects the confidence of the match:

$$
\begin{cases}
p = p_0 + f \\
P = P(k \mid n, p) = \dbinom{n}{k} p^{k} (1 - p)^{n - k}
\end{cases}
$$

where p = the probability of a random match; p0 = 0.06 (from each 100 Da interval we selected the six highest peaks; therefore, the random match probability is 0.06); f = the ratio between the number of selected peaks in the spectrum and the m/z range of the experimental spectrum; n = the number of theoretical fragment peaks; k = the number of matched peaks in the experimental spectrum; and P = the probability that k peaks match among the n theoretical peaks, calculated by the binomial distribution probability density function.

Scoring Function for Consecutive Ion Matches. Multiple consecutive ion matches were converted into a series of ion pair matches: x consecutive ion matches were converted into x − 1 ion pairs, and the matching probability of each pair was calculated as above. For example, if the b1, b2, and b3 ions were consecutively matched, this consecutive ion match was converted into two consecutive pairs: b1−b2 and b2−b3. Additionally, the probability of consecutive fragment matches was calculated as follows:

Comparison of Algorithms

All algorithms were compared according to the number of identified MS/MS spectra and unique peptides at FDR ≤ 0.01. The overlap of unique peptides and MS/MS spectra among the identification results of the three algorithms was then analyzed.
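The target-decoy thresholding behind the FDR ≤ 0.01 criterion can be sketched as follows. This is a simplified illustration of the ratio of decoy to target PSMs above a threshold, not the in-house Matlab program; the function name `fdr_threshold` is hypothetical, and treating "above threshold" as "greater than or equal to" is an assumption.

```python
def fdr_threshold(target_scores, decoy_scores, max_fdr=0.01):
    """Return the lowest score threshold with decoy/target ratio <= max_fdr.

    Scores are rank-1 PSM scores from the forward (target) and reverse
    (decoy) database searches. Returns None if no threshold qualifies.
    """
    best = None
    candidates = sorted(set(target_scores) | set(decoy_scores), reverse=True)
    for t in candidates:
        n_target = sum(s >= t for s in target_scores)
        n_decoy = sum(s >= t for s in decoy_scores)
        if n_target and n_decoy / n_target <= max_fdr:
            best = t  # keep lowering the threshold while the FDR allows it
    return best
```

All target PSMs scoring at or above the returned threshold would then be counted as identifications for the comparison.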



RESULTS AND DISCUSSION

Peak Selection in Spectra

Peaks separated by 1 ± 0.25 Da were considered isotope peaks and were filtered out.9 The number of peaks used for the spectrum search was kept small in the algorithms to minimize random matches and enhance the accuracy. Sequest selected the highest 200 peaks from all fragment spectra.11 Mascot selected the highest peak in every 14 Da interval and kept the peaks above a certain threshold for subsequent analysis.10 A maximum of 50 peaks was used by X!Tandem.13 Also, many other algorithms select the 1−10 highest ion peaks from each 100 Da window on average for subsequent analysis.15,27,28 Our algorithm ProVerB selected the top six ion peaks in each 100 Da window, since we considered the matching condition of six types of fragment ions, namely b, y, b-H2O, y-H2O, b-NH3, and y-NH3. The fragment ions were selected only if their intensities were higher than 33% of the highest peak.9,15
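The window-based selection described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the ProVerB implementation: the 33% cutoff is taken relative to the most intense peak of the same 100 Da window (the text does not say whether the window or the whole spectrum is meant), isotope filtering is omitted, and the function name `select_peaks` is hypothetical.

```python
from collections import defaultdict


def select_peaks(peaks, window=100.0, top_n=6, min_rel=0.33):
    """Keep the top_n most intense peaks per m/z window.

    peaks: iterable of (mz, intensity) tuples.
    A peak is kept only if its intensity exceeds min_rel times the
    intensity of the highest peak in its window (assumed interpretation).
    Returns the selected peaks sorted by m/z.
    """
    bins = defaultdict(list)
    for mz, intensity in peaks:
        bins[int(mz // window)].append((mz, intensity))

    selected = []
    for members in bins.values():
        members.sort(key=lambda p: p[1], reverse=True)
        cutoff = min_rel * members[0][1]
        selected.extend(p for p in members[:top_n] if p[1] > cutoff)
    return sorted(selected)
```

Selecting six peaks per 100 Da is what motivates the baseline random match probability p0 = 0.06 used in the scoring function.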

$$
\begin{cases}
p_1 = \dfrac{r \cdot k}{n} \\
P_1 = \dbinom{n_1}{k_1} p_1^{k_1} (1 - p_1)^{n_1 - k_1}
\end{cases}
$$

where p1 = the probability of the consecutive fragment matches; P1 = the probability that k1 peaks match consecutively among the n1 consecutive theoretical peaks, calculated by the binomial distribution probability density function; n1 = the number of consecutive matches in the theoretical spectrum; k1 = the number of consecutive matches in the experimental spectrum; and r is the background constant. By training on a large number of identification results from the S. pneumoniae D39 data set, we derived r = 0.09083 using the following formula:

Theoretical Spectra

A theoretical spectrum was generated based on the chemistry of b/y-ion fragmentation. If the b-, y-fragment ions contained S, T, E, or D residues, a loss of H2O (b-H2O or y-H2O) was considered; if the

$$
r = \frac{\text{avg value of consecutive match of exp spectrum}}{\text{avg value of consecutive match of theor spectrum}}
$$

P1 reflects the probability of actual consecutive matching. It is necessary to add a background value for correction of the consecutive matches of more than two ions. Nevertheless, the probability of consecutive matches of three ions was far less than that of two ions, resulting in a small r value.

Scoring Function for Spectrum Intensity of b/y-Ion Peaks. Another novelty of our algorithm is to consider peak intensity quantitatively for identification. The peak intensities of b/y-ions generated from the same peptide were correlated based on their physical and chemical properties.9 This provides important additional information to filter the noise and increase the sensitivity of identification. We introduced matrices Bij and Yij based on the chemical properties of bonds between each amino acid pair (AAP). The matrices Bij and Yij were calculated using the S. pneumoniae D39 data set and listed in Supporting Information Table 1.

$$
Y_{ij} = \frac{M_I(y)/2}{M_E(y)/6}, \qquad B_{ij} = \frac{M_I(b)/2}{M_E(b)/6}
$$

where M_I = the number of AAP b-ion or y-ion matches of the highest two peaks in every 100 Da; M_E = the AAP b-ion or y-ion matching number of the top six peaks in every 100 Da; and i and j stand for amino acids, ranging from 1 to 20. The peptide score function is defined as follows:

necessitating a correction.15 A background value B was subtracted from PEP_S:

$$
S = \mathrm{PEP\_S} - B
$$

The correction values for different classes of peptides were derived from the S. pneumoniae D39 data set with the Bayesian learning method. The statistical probability = 0.5 of PEP_S from the Bayesian network means that the forward and reverse peptide cannot be distinguished, where we defined S = 0. In this case the background value B equals the PEP_S. The background values B in different classes of peptides are listed in Table 3. S is the final score of each peptide.

Table 3. Background Values Learned from Bayesian Networks

                            number
backgrnd values type        0     1     2     otherwise
modification sites          8     25    40    49
missed cleavage sites       8     15    35    35
peptide length              precursor ion mass × 0.018
parent ion charge state           16    30    30 (charge >2)

Comparison of ProVerB with Mascot, Sequest, and SQID

Number of Identified Peptides and Spectra. We compared the sensitivity of our algorithm ProVerB (Matlab version) with that of two widely used MS identification algorithms, Mascot and Sequest. The test data sets include the in-house generated S. pneumoniae D39 data set, the E. coli data set, and the data set from 18 standard proteins in the mixture. Under the criterion FDR ≤ 0.01,25,26 all three algorithms were able to identify more than 3000 peptides from the S. pneumoniae D39 data set (Figure 1). The Venn diagram shows that most of the peptides (2702) and spectra (81243) could be identified by all three algorithms. The overlap ratio of identified peptides and spectra from Mascot and ProVerB was as high as 91.0% and 97.9%, respectively, showing a good consistency with other algorithms. ProVerB identified more peptides and spectra than Mascot and Sequest. This advantage was also observed in the three E. coli data sets, showing consistent identification power (Figure 2).

We also compared ProVerB with SQID, which also considers the peak intensity information. Compared with the SQID result (3441 peptides and 96542 spectra), the overlap ratio of identified peptides and spectra from SQID and ProVerB was as high as 84.6% and 87.3%, respectively. The comparison plot of peptide identification number versus FDR for the four algorithms showed that ProVerB identifies the most peptides within the FDR range of 0.5−3% (Supporting Information Figure 2).

Next, we tested the adaptability of ProVerB to various types of MS instruments, including Agilent, FT, LCQ, LTQ, and QTOF, using the downloaded 18 standard protein MS spectra. Again, ProVerB identified significantly more peptides and spectra than Mascot (up to 45.7%) and Sequest (up to 41.7%) in all instruments except Agilent (Figure 3). These data indicate that ProVerB in most cases identified significantly more peptides and spectra than the other two identification algorithms and that it is applicable to a wide variety of MS instruments.

We used the background value r = 0.09083 in all analyses above. However, the precursor ion charge and peptide length

$$
\begin{cases}
p_2 = \dfrac{1 + c}{1 + T}\,(0.02 + f) \\
P_2 = \dbinom{n_2}{k_2} p_2^{k_2} (1 - p_2)^{n_2 - k_2}
\end{cases}
$$

where k2 = the number of the peaks matching b/y-ions; n2 = the number of b/y-ions in the theoretical spectra; T = the sum of Bij and Yij of the AAP b/y-ion peaks, which are the highest two peaks every 100 Da and matched to amino acids i and j; c = the number of the highest two peaks matching b/y-ions every 100 Da; and f = the ratio between the number of selected peaks and the m/z range of the experimental mass spectrometry. A constant 0.02 is added, since the random match probability of two ions in the 100 Da interval is 0.02. Here, p2 is the random match probability of b/y-ions match concerning the peak intensity. (1 + c)/(1 + T) indirectly reflects the peak intensity match quality of b/y-ions, and T should be greater than c. A detailed example is included in the Supporting Information.

Overall Scoring Function and Background Value. The three scores above were then used to calculate the overall peptide score PEP_S:

$$
\mathrm{PEP\_S} = -10 \cdot \lg(P \cdot P_1 \cdot P_2)
$$
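To make the combination of the three binomial terms concrete, the following Python sketch evaluates PEP_S from the quantities named in the formulas (n, k, n1, k1, n2, k2, f, c, T). It is a minimal illustration of the published equations under the assumption that lg denotes the base-10 logarithm; the function names `binom_pmf` and `pep_score` are hypothetical, and the background subtraction S = PEP_S − B is not included.

```python
from math import comb, log10


def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def pep_score(n, k, n1, k1, n2, k2, f, c, T, p0=0.06, r=0.09083):
    """PEP_S = -10 * lg(P * P1 * P2), following the three scoring terms.

    n, k   : theoretical fragment peaks / matched peaks
    n1, k1 : consecutive matches in theoretical / experimental spectrum
    n2, k2 : theoretical b/y-ions / peaks matching b/y-ions
    f      : selected peaks divided by the m/z range of the spectrum
    c, T   : intensity terms entering p2 = (1 + c) / (1 + T) * (0.02 + f)
    """
    p = p0 + f                        # simple fragment match probability
    p1 = r * k / n                    # consecutive match probability
    p2 = (1 + c) / (1 + T) * (0.02 + f)
    P = binom_pmf(k, n, p)
    P1 = binom_pmf(k1, n1, p1)
    P2 = binom_pmf(k2, n2, p2)
    return -10 * log10(P * P1 * P2)
```

With the other inputs held fixed, a peptide matching 12 of 20 theoretical peaks receives a higher PEP_S than one matching 4 of 20, because so many matches are very unlikely to arise from random matching alone.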

To investigate the influence of the P1 and P2, we plotted the peptide number against the FDR considering these three scores P, P1, and P2 progressively by applying three different scoring methods −10·lg(P), −10·lg(P·P1), and −10·lg(P·P1·P2), in the S. pneumoniae D39 data set (Supporting Information Figure 1). The curves showed that both the consecutive ion matches P1 and the intensity matches P2 contribute to the improvement of identification. The peptide score can be affected by additional information including peptide length, number of modifications, number of missed cleavages, and charge of precursor ions, thus

Figure 1. Comparison of Mascot, Sequest, and ProVerB using S. pneumoniae D39 data set: (A) number of identified peptides; (B) number of identified spectra.

Figure 2. (A) Number of identified peptides from the E. coli data sets using ProVerB, Mascot, and Sequest. (B) Number of identified spectra from the E. coli data sets using three algorithms.

Figure 3. (A) Number of identified peptides from the 18 standard protein data set obtained from five types of MS instruments using three algorithms. (B) Number of identified spectra from the 18 standard protein data set obtained from five types of MS instruments using three algorithms.

may influence the background value r slightly (Supporting Information Figure 3). To address how much the fluctuation of the r value influences the identification performance, we tested ProVerB using the two-dimensional r value matrix (Supporting Information Table 2). In this case ProVerB identified only one peptide more than using the average r value, and 98.8% of the peptides overlap under two settings. Therefore, the precursor ion charge and peptide length generate only trivial influence, if at all. The r values vary depending on the instrument type: Agilent, FT, LCQ, LTQ, and QTOF give the r values 0.1261, 0.1475, 0.1328, 0.1236, and 0.09006, respectively. We tested ProVerB using r = 0.1475 to identify the data set generated by FT, which deviates most from the average r value, resulting in

only one more peptide identified, with all the other identified peptides being the same. These results confirmed that the average value r = 0.09083 can be used universally in ProVerB, insensitive to the precursor ion charge, peptide length, and instrument type.

Number of Identified High-Confidence Peptides. Since different algorithms give different identification results, a crosscheck of results from different algorithms may reveal the confidence of identified peptides. The high-confidence peptides and spectra characterize the quality of identification of an algorithm.14 To calculate the number of high-confidence peptides, we first calculated the overlaps of the identified peptides of each two algorithms (Supporting Information Table

Table 4. Fractions of High-Confidence Peptides of the Three Algorithms

                        SUM                  Mascot                           Sequest                          ProVerB
instrument              peptides  spectra    peptides       spectra           peptides       spectra           peptides       spectra
D39 Data Set
LTQ-Orbitrap            3267      96575      3254 (99.60%)  95969 (99.37%)    2720 (83.26%)  82308 (85.23%)    3262 (99.85%)  96116 (99.52%)
18 Standard Protein Mixture
Agilent                 401       10352      373 (93.02%)   9324 (90.07%)     385 (96.01%)   9640 (93.12%)     388 (96.76%)   9781 (94.48%)
FT                      697       24336      691 (99.14%)   24080 (98.95%)    579 (83.07%)   18336 (75.35%)    694 (99.57%)   24238 (99.60%)
LCQ                     469       4831       423 (90.19%)   3708 (76.75%)     425 (90.62%)   4397 (91.02%)     468 (99.79%)   4794 (99.23%)
LTQ                     622       8277       609 (97.91%)   7819 (94.47%)     489 (78.62%)   6263 (75.67%)     621 (99.84%)   8207 (99.15%)
QTOF                    312       4072       310 (99.36%)   4014 (98.58%)     274 (87.82%)   3270 (80.30%)     311 (99.68%)   4037 (99.14%)
E. coli Data Set
LTQ-Orbitrap E. coli1   680       11231      668 (98.24%)   11153 (99.31%)    441 (64.85%)   7577 (67.47%)     677 (99.56%)   11146 (99.24%)
LTQ-Orbitrap E. coli2   576       10512      567 (98.44%)   10422 (99.14%)    399 (69.27%)   7859 (74.76%)     574 (99.65%)   10413 (99.06%)
LTQ-Orbitrap E. coli3   515       9228       504 (97.86%)   9127 (98.91%)     356 (69.13%)   6946 (75.27%)     513 (99.61%)   9128 (98.92%)

Figure 4. Scatter plot of ProVerB and Mascot scores identifying the S. pneumoniae D39 data set.

3). The high-confidence peptides can be calculated as (A ∩ B) ∪ (B ∩ C) ∪ (A ∩ C), where A, B, and C represent the identified peptides or spectra of ProVerB, Mascot, and Sequest, respectively. The fractions of high-confidence peptides identified by these three algorithms are listed in Table 4. In most cases, ProVerB undoubtedly exceeded Mascot and Sequest in identifying high-confidence peptides, showing its unmatched, robust, and instrument-/data set-independent identification power (Supporting Information Figure 4).
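The set expression (A ∩ B) ∪ (B ∩ C) ∪ (A ∩ C) selects every peptide or spectrum found by at least two of the three algorithms. A minimal Python sketch follows; the function names are hypothetical, and reading each fraction in Table 4 as the share of the high-confidence set (the SUM column) that a given algorithm identified is an assumption, although it agrees with the tabulated values (for example, 3254/3267 = 99.60%).

```python
def high_confidence(a, b, c):
    """Items identified by at least two of three algorithms."""
    a, b, c = set(a), set(b), set(c)
    return (a & b) | (b & c) | (a & c)


def high_confidence_fraction(identified, hc):
    """Share of the high-confidence set that one algorithm identified."""
    return len(set(identified) & hc) / len(hc)
```

For example, with A = {P1, P2, P3}, B = {P2, P3, P4}, and C = {P3, P5}, the high-confidence set is {P2, P3}; A and B each cover all of it, while C covers half.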

4). The Pearson correlation coefficient reached 0.8124 (p < 10^−16), showing a good correlation between the two algorithms. This indicates that ProVerB provides a scoring scheme compatible with Mascot.
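For readers who want to reproduce this kind of score comparison on their own PSM lists, the Pearson correlation coefficient can be computed as in the sketch below. It assumes two equal-length lists of scores for the same spectra (for example ProVerB and Mascot scores); the function name `pearson` is illustrative, and the significance test behind the reported p-value is not included.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

A value near 1 means the two scoring schemes rank the same spectra similarly, even when their absolute score scales differ.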



CONCLUSIONS

The boom of proteomics applications and the wide variety of mass spectrometry technologies for peptide identification necessitate a versatile and accurate peptide identification algorithm. In this paper, we present a new algorithm, ProVerB, based on a novel binomial distribution statistical model, and we validate its accuracy, robustness, and compatibility. ProVerB is an open source program, so that no algorithmic detail is hidden as in the commercial software packages. Users may tune

Correlation between ProVerB and Mascot Scores

The scores in the MS identification algorithms quantitatively reflect the significance of the identification. We then compared the score values of ProVerB and Mascot using the S. pneumoniae D39 data set (more than 270,000 spectra) (Figure

(9) Li, W.; Ji, L.; Goya, J.; Tan, G.; Wysocki, V. H. SQID: an intensity-incorporated protein identification algorithm for tandem mass spectrometry. J. Proteome Res. 2011, 10 (4), 1593−602.
(10) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20 (18), 3551−67.
(11) Eng, J. K.; McCormack, A. L.; Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 1994, 5 (11), 976−989.
(12) Geer, L. Y.; Markey, S. P.; Kowalak, J. A.; Wagner, L.; Xu, M.; Maynard, D. M.; Yang, X.; Shi, W.; Bryant, S. H. Open mass spectrometry search algorithm. J. Proteome Res. 2004, 3 (5), 958−64.
(13) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem mass spectra. Bioinformatics 2004, 20 (9), 1466−7.
(14) Yadav, A. K.; Kumar, D.; Dash, D. MassWiz: a novel scoring algorithm with target-decoy based analysis pipeline for tandem mass spectrometry. J. Proteome Res. 2011, 10 (5), 2154−60.
(15) Cox, J.; Neuhauser, N.; Michalski, A.; Scheltema, R. A.; Olsen, J. V.; Mann, M. Andromeda: a peptide search engine integrated into the MaxQuant environment. J. Proteome Res. 2011, 10 (4), 1794−805.
(16) Kapp, E. A.; Schutz, F.; Connolly, L. M.; Chakel, J. A.; Meza, J. E.; Miller, C. A.; Fenyo, D.; Eng, J. K.; Adkins, J. N.; Omenn, G. S. An evaluation, comparison, and accurate benchmarking of several publicly available MS/MS search algorithms: sensitivity and specificity analysis. Proteomics 2005, 5 (13), 3475−3490.
(17) Dagda, R. K.; Sultana, T.; Lyons-Weiler, J. Evaluation of the Consensus of Four Peptide Identification Algorithms for Tandem Mass Spectrometry Based Proteomics. J. Proteomics Bioinf. 2010, 3, 39−47.
(18) Song, J. H.; Ko, K. S. Detection of essential genes in Streptococcus pneumoniae using bioinformatics and allelic replacement mutagenesis. Methods Mol. Biol. 2008, 416, 401−8.
(19) Sun, X.; Jia, H. L.; Xiao, C. L.; Yin, X. F.; Yang, X. Y.; Lu, J.; He, X.; Li, N.; Li, H.; He, Q. Y. Bacterial proteome of streptococcus pneumoniae through multidimensional separations coupled with LC-MS/MS. OMICS 2011, 15 (7−8), 477−82.
(20) Sun, X.; Ge, F.; Xiao, C. L.; Yin, X. F.; Ge, R.; Zhang, L. H.; He, Q. Y. Phosphoproteomic analysis reveals the multiple roles of phosphorylation in pathogenic bacterium Streptococcus pneumoniae. J. Proteome Res. 2010, 9 (1), 275−82.
(21) Macek, B.; Mijakovic, I.; Olsen, J. V.; Gnad, F.; Kumar, C.; Jensen, P. R.; Mann, M. The serine/threonine/tyrosine phosphoproteome of the model bacterium Bacillus subtilis. Mol. Cell Proteomics 2007, 6 (4), 697−707.
(22) Klimek, J.; Eddes, J. S.; Hohmann, L.; Jackson, J.; Peterson, A.; Letarte, S.; Gafken, P. R.; Katz, J. E.; Mallick, P.; Lee, H.; Schmidt, A.; Ossola, R.; Eng, J. K.; Aebersold, R.; Martin, D. B. The standard protein mix database: a diverse data set to assist in the production of improved Peptide and protein identification software tools. J. Proteome Res. 2008, 7 (1), 96−103.
(23) Fu, Y.; Xiu, L. Y.; Jia, W.; Ye, D.; Sun, R. X.; Qian, X. H.; He, S. M. DeltAMT: a statistical algorithm for fast detection of protein modifications from LC-MS/MS data. Mol. Cell. Proteomics 2011, 10, 5.
(24) Ramakrishnan, S. R.; Vogel, C.; Prince, J. T.; Li, Z.; Penalva, L. O.; Myers, M.; Marcotte, E. M.; Miranker, D. P.; Wang, R. Integrating shotgun proteomics and mRNA expression data to improve protein identification. Bioinformatics 2009, 25 (11), 1397−403.
(25) Kall, L.; Storey, J. D.; MacCoss, M. J.; Noble, W. S. Assigning significance to peptides identified by tandem mass spectrometry using decoy databases. J. Proteome Res. 2008, 7 (1), 29−34.
(26) Elias, J. E.; Haas, W.; Faherty, B. K.; Gygi, S. P. Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nature Methods 2005, 2 (9), 667−675.
(27) Beausoleil, S. A.; Villén, J.; Gerber, S. A.; Rush, J.; Gygi, S. P. A probability-based approach for high-throughput protein phosphorylation analysis and site localization. Nat. Biotechnol. 2006, 24 (10), 1285−1292.

the parameters according to their specific experimental setup to optimize the results. In addition, ProVerB can be compiled on various operating systems and offers a user-friendly graphical user interface. Although ProVerB does not support ECD/ETD mass spectrometry data, we believe that it will find broad application in proteomics studies and provide more robust and accurate results than the currently available commercial algorithms, producing a more solid base of data for downstream analyses.



ASSOCIATED CONTENT

Supporting Information

Three supplementary tables and supplementary notes that support this article. This material is available free of charge via the Internet at http://pubs.acs.org. The ProVerB program, source code, and test data set are freely available at http://bioinformatics.jnu.edu.cn/software/proverb/.



AUTHOR INFORMATION

Corresponding Author

*G.Z.: phone/fax, +86-20-85224031; e-mail, zhanggong@jnu.edu.cn. Q.-Y.H.: phone/fax, +86-20-85227039; e-mail, tqyhe@jnu.edu.cn.

Author Contributions

§C.X., X.C., and Y.D. contributed equally to this work.

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

We are grateful to Shuai Liu and Chao Ma for their help with programming ProVerB and for technical hints on performance optimization. This work was jointly supported by the National “973” Projects of China (2011CB910700), the National Natural Science Foundation of China (20871057, 31000373, and 31200612), the Fundamental Research Funds for the Central Universities (11610101 and 21611201), the “211” Projects, and the Pearl River Rising Star of Science and Technology of Guangzhou City (2011048b).



REFERENCES

(1) Karas, M.; Hillenkamp, F. Laser desorption ionization of proteins with molecular masses exceeding 10,000 Da. Anal. Chem. 1988, 60 (20), 2299−301.
(2) Fenn, J. B.; Mann, M.; Meng, C. K.; Wong, S. F.; Whitehouse, C. M. Electrospray ionization for mass spectrometry of large biomolecules. Science 1989, 246 (4926), 64−71.
(3) Aebersold, R.; Mann, M. Mass spectrometry-based proteomics. Nature 2003, 422 (6928), 198−207.
(4) Matthiesen, R. Extracting monoisotopic single-charge peaks from liquid chromatography-electrospray ionization-mass spectrometry. Methods Mol. Biol. 2007, 367, 37−48.
(5) Washburn, M. P.; Wolters, D.; Yates, J. R., 3rd. Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol. 2001, 19 (3), 242−7.
(6) Matthiesen, R. Methods, algorithms and tools in computational proteomics: a practical point of view. Proteomics 2007, 7 (16), 2815−32.
(7) Colinge, J.; Bennett, K. L. Introduction to computational proteomics. PLoS Comput. Biol. 2007, 3 (7), e114.
(8) Nesvizhskii, A. I.; Vitek, O.; Aebersold, R. Analysis and validation of proteomic data generated by tandem mass spectrometry. Nat. Methods 2007, 4 (10), 787−97.

dx.doi.org/10.1021/pr300781t | J. Proteome Res. 2013, 12, 328−335


(28) Olsen, J. V.; Mann, M. Improved peptide identification in proteomics by two consecutive stages of mass spectrometric fragmentation. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (37), 13417−22.
(29) Khatun, J.; Ramkissoon, K.; Giddings, M. C. Fragmentation characteristics of collision-induced dissociation in MALDI TOF/TOF mass spectrometry. Anal. Chem. 2007, 79 (8), 3032−40.
(30) Kapp, E. A.; Schutz, F.; Reid, G. E.; Eddes, J. S.; Moritz, R. L.; O’Hair, R. A.; Speed, T. P.; Simpson, R. J. Mining a tandem mass spectrometry database to determine the trends and global factors influencing peptide fragmentation. Anal. Chem. 2003, 75 (22), 6251−64.
