Performance Evaluation of Existing De Novo Sequencing Algorithms

Sergey Pevtsov,†,§ Irina Fedulova,†,§ Hamid Mirzaei,‡ Charles Buck,‡ and Xiang Zhang*,‡
Department of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Moscow 119992, Russian Federation, and Bindley Bioscience Center, Purdue University, West Lafayette, Indiana 47907
Received May 9, 2006

Two methods have been developed for protein identification from tandem mass spectra: database searching and de novo sequencing. De novo sequencing identifies peptides directly from tandem mass spectra. Among the many proposed algorithms, we evaluated the performance of five de novo sequencing algorithms: AUDENS, Lutefisk, NovoHMM, PepNovo, and PEAKS. Our evaluation methods are based on calculation of relative sequence distance (RSD), algorithm sensitivity, and spectrum quality. We found that de novo sequencing algorithms perform differently on QSTAR and LCQ mass spectrometer data but, in general, perform better on QSTAR data than on LCQ data. For the QSTAR data, the performance order of the five algorithms is PEAKS > Lutefisk, PepNovo > AUDENS, NovoHMM. The performance of PEAKS, Lutefisk, and PepNovo strongly depends on spectrum quality and increases as spectrum quality increases. However, AUDENS and NovoHMM are not sensitive to spectrum quality. Compared with the other four algorithms, PEAKS has the best sensitivity and also the best performance over the entire range of spectrum quality. For the LCQ data, the performance order is NovoHMM > PepNovo, PEAKS > Lutefisk > AUDENS. NovoHMM has the best sensitivity, and its performance is the best over the entire range of spectrum quality. However, the overall performance of NovoHMM is not significantly different from that of PEAKS and PepNovo. AUDENS does not perform well on either QSTAR or LCQ data.
Keywords: mass spectrometry • peptide identification • de novo sequencing • mass spectral quality

* To whom correspondence should be addressed. Xiang Zhang, Bindley Bioscience Center, Purdue University, West Lafayette, IN 47906. Tel: (765) 496-1153. E-mail: [email protected].
† Lomonosov Moscow State University.
§ These authors contributed equally to this work.
‡ Purdue University.

Introduction

Protein identification is a key step for proteomics to contribute to the understanding of biological systems. Two approaches are generally deployed to interpret experimental MS/MS data: database searching and de novo sequencing. In database searching, which is the most widely employed, protein databases are used to find a peptide whose theoretically predicted spectrum best matches the experimental data. Different scoring schemes are used to evaluate the matches between candidate peptides and the given experimental spectrum.1-9 The determination of peptide sequences without the help of a protein database is called de novo sequencing. Identification of new proteins, proteins resulting from mutations, and proteins with chemical modifications not previously described (unexpected modifications) all require de novo sequencing. Many de novo sequencing algorithms based on different approaches have been developed; we will briefly describe some of them here. The exhaustive listing algorithm first finds all


Journal of Proteome Research 2006, 5, 3018-3028

Published on Web 10/06/2006

possible candidate sequences according to the mass of the parent ion.10,11 All generated candidate sequences are compared with the experimental spectrum to determine the best match. Because the parent mass is determined inexactly, the main difficulty of this approach is the exponential growth in the number of possible candidate sequences as the parent ion mass increases. As mass measurement accuracy improves, the composition-based sequencing approach becomes very promising. For example, there are only about 100 possible peptide candidates with a mass of 1347 Da when the mass accuracy of the mass spectrometer is better than ±1 ppm.12 Another approach examines only small parts of the sequence and then extends them by adding amino acids at both sides of the small sequence tag until the full mass is accumulated.13-16 A disadvantage of this approach is that good peptides may be discarded during subsequence extension because some ions are absent from the tandem spectrum. The main idea of the graph-theoretical approach is to represent the spectrum with a “spectrum graph”.17 Each peak in the spectrum corresponds to a vertex in the graph, and two vertices are connected by an edge if the mass difference between them equals the mass of one or several amino acids. Vertices corresponding to the N- and C-termini are usually added to the graph. Finding the path from the N- to the C-terminus provides a method to generate candidate sequences which are
10.1021/pr060222h CCC: $33.50
© 2006 American Chemical Society

De Novo Sequencing Algorithms Performance Evaluation

next evaluated for the best match to the experimental spectra.18-23 Thus, the de novo sequencing problem becomes an effort to find the longest path in a directed acyclic graph.7,24,25 Chen et al.26 proposed a dynamic programming approach to find the longest asymmetric path. Dynamic programming allows one to find an optimal solution, but because of complex peptide fragmentation behavior and noise, the optimal solution may not be the correct sequence. Therefore, Lu and Chen27 developed an algorithm to find all suboptimal solutions, which probably contain the sequence that produced the experimental spectrum. Details of this algorithm can be found at http://msms.usc.edu/sub. The dynamic programming approach, with variations, is also proposed in other studies.4,20,28 Another approach to the de novo sequencing problem is the Hidden Markov Model (HMM).29 A trained HMM defines a generative model for mass spectra which, for instance, is used for scoring observed spectra according to their likelihood of matching a given peptide sequence. Besides predicting the most likely sequence, the HMM framework allows one to specify the confidence in the predictions. Implementation of this algorithm gave rise to the NovoHMM program for de novo sequencing. Moreover, a Bayesian approach was used in MassSeq by Micromass, where candidate sequences are generated in a random procedure.30 Savitski et al. designed a linear sequencing algorithm which combines high speed with proteomics-grade efficiency and reliability.31 This algorithm does not utilize the graph-theoretical approach but is conceptually similar to the algorithm suggested by Horn et al.32 In addition to the solutions described above, several studies have approached this problem using the Genetic Algorithm (GA)33,34 and artificial neural networks.35 Stable isotope labeling,36 MS/MS/MS usage,37,38 and prediction of peptide fragmentation39 have also been used to increase confidence in peptide identification.
Although there have been many attempts to solve the de novo sequencing problem,4,17-21,26,28,29,40-42 accepted performance measures for de novo algorithms are not available.43 Performance comparisons are included in almost all articles devoted to algorithm development, but to our knowledge, no independent laboratory has systematically evaluated the performance of multiple de novo algorithms. In this work, we present such a performance evaluation of five existing de novo sequencing programs: Lutefisk,18,19 NovoHMM,29 PEAKS,20 PepNovo,40 and AUDENS,28 which were selected for validation based on their availability and popularity.
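The graph-theoretical approach described in the Introduction can be illustrated with a minimal sketch. It is not the implementation used by any of the evaluated programs. It assumes that the peaks have already been converted to prefix (b-ion-like) residue-mass sums, connects two vertices only when their mass difference matches a single residue (the text also allows several), and enumerates every N- to C-terminus path instead of scoring them. The function names `build_spectrum_graph` and `enumerate_paths` are hypothetical, and the residue masses are standard monoisotopic values rather than values taken from this paper.

```python
# Standard monoisotopic residue masses in Da (assumed, not from the paper).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def build_spectrum_graph(vertex_masses, tol=0.1):
    """Return sorted vertex masses and, per vertex, a list of (next_vertex, residue) edges.

    An edge i -> j is added when masses[j] - masses[i] equals one residue
    mass within `tol` Da (single-residue edges only: a simplification).
    """
    masses = sorted(vertex_masses)
    edges = {i: [] for i in range(len(masses))}
    for i, low in enumerate(masses):
        for j in range(i + 1, len(masses)):
            diff = masses[j] - low
            for residue, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges[i].append((j, residue))
    return masses, edges


def enumerate_paths(edges, start, end):
    """List every residue string spelled by a path from `start` to `end`."""
    if start == end:
        return [""]
    paths = []
    for nxt, residue in edges[start]:
        paths.extend(residue + rest for rest in enumerate_paths(edges, nxt, end))
    return paths


# Vertices for the toy peptide G-A-S: N-terminus (0), G, G+A, G+A+S (C-terminus).
masses, edges = build_spectrum_graph([0.0, 57.02146, 128.05857, 215.0906], tol=0.1)
candidates = enumerate_paths(edges, 0, len(masses) - 1)
```

With a 0.1 Da tolerance the toy graph yields the candidates GAS, QS, and KS, because the mass of G+A is indistinguishable from that of Q or K at this tolerance. This is the kind of mass-equivalent ambiguity that the sequence distance defined later in the paper is designed to tolerate.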

Experimental Methods

Materials. Serum albumin (human), lysozyme (chicken), cytochrome C (bovine), apotransferrin (human), ribonuclease A (bovine), dithiothreitol (DTT), serum (human), HEPES buffer, trypsin, and N-α-tosyl-L-lysine chloromethyl ketone (TLCK) were obtained from Sigma Chemical Co. (St. Louis, MO). An Agilent Zorbax SB-C18 (5 µm; 150 × 0.5 mm) reversed-phase column was purchased from Agilent. LC/MS analyses were performed using an Agilent 1100 series capillary HPLC system and a PE Sciex QSTAR hybrid LC/MS/MS quadrupole TOF mass spectrometer.

Proteolysis. Six molar urea and 10 mM dithiothreitol (DTT) were added to proteins dissolved in 50 mM HEPES (pH 8.0) at a final concentration of 1 mg/mL. After 1 h of incubation at 65 °C, samples were diluted 6-fold by addition of 50 mM HEPES (pH 8.0) in 10 mM CaCl2. Sequence-grade trypsin (2%) was added, and the reaction mixture was incubated at 37 °C for at least 8 h. Proteolysis was stopped by addition of TLCK (trypsin/TLCK ratio of 1:1 (w/w)).

LC/MS Analysis. Peptides from the protein tryptic digests were separated on an Agilent Zorbax SB-C18 (5 µm; 150 × 0.5 mm) reversed-phase column using an Agilent 1100 series capillary HPLC system at 4 µL/min. Solvent A was 0.01% TFA in deionized H2O (dI H2O), and solvent B was 95% CH3CN/0.01% TFA in dI H2O. The flow from the column was directed to the QSTAR workstation (Applied Biosystems, Framingham, MA) equipped with an ESI source. The peptides were separated in a 60-min linear gradient (from 0% B to 60% B). MS spectra were obtained in the positive ion mode at a sampling rate of one spectrum per second. Each protein digest was analyzed twice using the information-dependent acquisition (IDA) method, in which the three most abundant ions were selected for fragmentation. In one IDA experiment, an ion was excluded from fragmentation if it had been fragmented in the last 2 min. In the other IDA experiment, no ion was excluded from fragmentation, regardless of whether it had been fragmented before.

Protein Identification from QSTAR Data. Each spectrum was analyzed by SEQUEST to identify peptides. Trypsin was specified as the digestion enzyme, and the number of allowed missed cleavages was set to one. Carbamidomethylation of cysteine and oxidation of methionine were specified as variable chemical modifications. Monoisotopic mass values were used for the peptide search. The m/z tolerance was set to 0.1 Da for both parent and fragment ions. Default settings were used for all other variables. Peptide identification results were evaluated by comparison to traditional SEQUEST scores and statistical probabilities. We required a PeptideProphet probability higher than 0.95. It is known that SEQUEST can generate false-positive results in peptide identification.
For this reason, we further evaluated the peptide identification results at the protein level, inasmuch as the original proteins were known. For peptides identified from the serum sample, we further required that at least three peptides be identified from a single protein.

Keller’s Data. Besides the data set generated using the above-mentioned methods, we used a data set generated by Keller et al.44 Briefly, 18 pure proteins were first mixed. The protein mixture was then aliquoted into two mixtures, A and B. Each mixture, at approximately 1 mg/mL concentration, was digested using trypsin. The complex peptide mixtures were analyzed by µLC-MS on an ESI-ITMS (Thermo Finnigan, San Jose, CA) using a standard top-down data-dependent ion selection approach. In total, 14 LC/MS/MS runs were performed on mixture A, and eight LC/MS/MS runs were performed on mixture B. Peptides were identified using SEQUEST software. These data and analysis results were downloaded from http://www.systemsbiology.org/extra/protein_mixture.html with the authors’ permission.

Calculation of Relative Sequence Distance. Comparing a de novo sequence with a true peptide sequence can be treated as a similarity matching problem, in which both sequences are represented by strings where each amino acid residue is encoded by a letter from a 20-letter weighted alphabet Σ. To measure the similarity of two amino acid sequences, a distance function between strings must be defined. One frequent error in de novo sequence analysis is the swap of a neighboring pair of residues, which occurs because there is no fragment peak indicating that fragmentation occurred between

research articles

Pevtsov et al.

the two neighboring amino acids; such a peak can be missing either because the peptide did not fragment between them or because the fragments generated were not detectable in the mass spectrometer. Another problem in de novo sequencing is substitution due to mass equivalency, either identical-mass substitution (e.g., I/L) or substitution within the instrument mass accuracy (e.g., K/Q if the mass accuracy is 0.1 Da). To accommodate typical de novo substitutions of segments with equivalent masses, an extension to the edit distance needs to be defined that allows swaps of adjacent letters at zero cost.45 Moreover, such an extension should deal with substitutions in which one or more residues are replaced by one or more residues with equal total mass. Let $h$ denote the length of an aligned segment containing a possible de novo sequencing error that our edit distance tolerates; Algorithm 1 then describes our sequence distance computation. First, an optimal alignment of both strings $(\tilde{S}_1/\tilde{S}_2)$ is built, where $\tilde{S}_1$ is the peptide database sequence and $\tilde{S}_2$ is the de novo sequence. During alignment reconstruction, the substitution option should be checked first, because we expect strings of almost equal lengths in the peptide matching problem. Next, both strings are searched for corresponding aligned segments $\tilde{S}_1[i \ldots (i+h)]$ and $\tilde{S}_2[i \ldots (i+h)]$ of length $h$ which have a mismatch (nonzero edit distance $d_D$ between the segments) and whose masses agree within a predefined mass tolerance $\theta$ (a possible de novo error). If a possible de novo error is detected, its contribution to the total sequence distance is subtracted. Herein, we use the subscript P to denote our sequence distance $d_P$.

Algorithm 1.
1. $k = d_D(S_1, S_2)$
2. Build the optimal alignment $(\tilde{S}_1\ \tilde{S}_2)^T$
3. $L = |\tilde{S}_1|$ $(= |\tilde{S}_2|)$
4. $k_P = k$
5. for $i = 1$ to $(L - h)$
6. if $d_D(\tilde{S}_1[i \ldots (i+h)], \tilde{S}_2[i \ldots (i+h)]) > 0$ and $|\mathrm{mass}(\tilde{S}_1[i \ldots (i+h)]) - \mathrm{mass}(\tilde{S}_2[i \ldots (i+h)])| < \theta$
7. $k_P = k_P - d_D(\tilde{S}_1[i \ldots (i+h)], \tilde{S}_2[i \ldots (i+h)])$
8. $d_P(S_1, S_2) = k_P$

An important feature of our sequence distance is its nonmonotone behavior relative to the edit distance on which it is based. In some cases, a large edit distance can be reduced to zero because of replacements of segments of equal masses (de novo errors); in other cases, our distance remains the same as the ordinary edit distance if no such replacements are detected. When comparing peptide sequences of different lengths, it is more natural to use the relative sequence distance $R(S_1, S_2)$, which is defined as

$$R(S_1, S_2) = \frac{d(S_1, S_2)}{\max\{|S_1|, |S_2|\}} \qquad (1)$$

where $d$ is our sequence distance. This equation satisfies $0 \le R \le 1$.

Description of Analyzed Software. 1. PEAKS. PEAKS20 takes an approach that is different from the spectrum graph model. Because the spectrum graph model attempts to find a path connecting the N- and C-termini, the absence of some ions may preclude success. In PEAKS, a reward/penalty score is computed for every possible mass value, regardless of whether a peak is detected around that mass value. Thus, sequence identification can still be achieved in the absence of peaks. Also, the score accounts for many other

Journal of Proteome Research • Vol. 5, No. 11, 2006

factors such as the peak abundance, the rank of the peak, the mass errors, and the coexistence of other peaks, all of which noticeably improve the accuracy of de novo sequencing. PEAKS uses a modified version of a de novo sequencing algorithm based on dynamic programming46 to compute the 10 000 sequences with the best scores. PEAKS also includes a spectrum preprocessing algorithm. The output of the software gives amino acid sequences with confidence scores for the entire sequences, as well as an additional novel positional scoring scheme for portions of the sequences. PEAKS online version 1.1 was used in this work (available from http://www.bioinformaticssolutions.com/peaksonlin).

2. Lutefisk. This computer program uses a graph theory approach for de novo peptide sequence determination from low-energy collision-induced dissociation (CID) data of tryptic peptides.18,19 First, Lutefisk converts all of the ions into their corresponding b-ion masses by making N- and C-terminal “evidence lists” that contain evidence for cleavage at every possible b-ion mass. Once the sequence spectrum has been established, the program proceeds by tracing sequences starting at the N-terminus. The process begins by finding all of the b-ion values that differ from the N-terminal value by an amino acid residue mass. The completed sequences are scored based on the fraction of the fragment ion current in the mass spectrum that can be assigned to known fragment-ion types. This intensity-based score is attenuated depending on how often the cleavage evidence switches between N- and C-terminal ions. Next, the highest ranked sequences are subjected to a cross-correlation analysis,1,47,48 and both scores are combined and normalized to produce a final score and ranking. We used LutefiskXP v.1.0.5 for our evaluation (available from http://sourceforge.net/projects/lutefiskxp).

3. PepNovo. The PepNovo algorithm is also based on spectrum graph construction, with a special method of graph vertex determination, and uses dynamic programming to find the asymmetric path with the best score in the graph.40 The main part of PepNovo’s scoring function is a hypothesis test, which compares two competing hypotheses concerning a spectrum S and a mass m of a possible cleavage site. The first hypothesis is the CID hypothesis, which states that m is a genuine cleavage in the peptide that created S. Under this hypothesis, rules that govern the outcome of a fragmentation can be described. In particular, certain combinations of fragments and intensities are more probable than others. A probabilistic network models these fragmentation rules to determine the probability of detecting an observed set of fragment intensities, given that mass m is a cleavage site in the peptide that created S. A training data set is used to compute the probabilities. The competing hypothesis is the random peaks hypothesis (RAND), which assumes that the peaks in the spectrum are caused by a random process. The score given to a mass m and spectrum S is the logarithm of the likelihood ratio of these two hypotheses. For each vertex, several scores are computed according to the different combinations of flanking amino acids. In the search for a high-scoring path, the PepNovo search algorithm selects a score for the vertex according to the combination of edges it uses in the path that goes through that vertex. We employed PepNovo version 1.01 (this program is available from http://www-cse.ucsd.edu/groups/bioinformatics/software.html#pepnovo).

4. AUDENS. AUDENS is a new platform-independent open source tool for automated de novo sequencing of peptides from MS/MS data.28 The authors of this tool implemented a dynamic


programming algorithm and combined it with a flexible preprocessing module which is designed to distinguish between signal and other peaks. In a preprocessing step, several filters are applied to all measured peaks occurring in a spectrum, and a relevance value is assigned to each peak. AUDENS utilizes some of the characteristics of peptide fragmentation in CID to identify b- and y-ions, as well as additional information contributed by other peaks, before the de novo sequencing algorithm is applied. The algorithm constructs a sequence path through the MS/MS spectrum using the peak relevances to score each suggested sequence path, that is, the corresponding amino acid sequence. The score for each sequence is calculated as the sum of the relevances of all peaks, that is, of the nodes in the spectrum graph, which contribute to the path as either b- or y-ions to construct the suggested sequence. AUDENS v.1 was used for these studies (http://www.ti.inf.ethz.ch/pw/software/audens).

5. NovoHMM. NovoHMM is a novel and completely generative statistical model for peptide mass spectra.29 The software derives a hidden Markov model that generates mass spectra as a finite automaton over states that correspond to masses. The proposed hidden Markov model emulates the generation process of mass spectra in a fully probabilistic way, which supports a clean separation between signal and noise in the complex mass spectra. There are two main parts in the model. In the first, the transition probabilities between the individual model states are derived. The second part describes the emission probabilities, which specify the process of generating peaks of certain heights, given the individual states. Finally, the algorithm combines both sets of probabilities into a factorial hidden Markov model. We evaluated NovoHMM downloaded from http://people.inf.ethz.ch/befische/proteomics.

Parameters Used in De Novo Sequencing. Here, we summarize all parameters used for de novo sequencing. For PEAKS, we specified trypsin as the enzyme and considered carbamidomethylation and oxidation of methionine as variable chemical modifications. The m/z tolerance for LCQ data was 2.0 Da for parent ions and 0.3 Da for fragment ions. The m/z tolerance for QSTAR data was 0.1 Da for both parent and fragment ions. We did not include spectrum merging in the analysis. We used the default parameters for Lutefisk, which can be found in the files Lutefisk.lcq_params and Lutefisk.qtof_params delivered with the program. We also used the default parameters for AUDENS described in refs 28 and 48. The default parameters of NovoHMM were used, with LCQ data processed with grouping and QSTAR data processed without grouping. Default parameters of PepNovo were also employed, with a fixed modification of cysteine (carbamidomethylation) and an optional modification of methionine (oxidation).

Results and Discussion

We acquired 10 858 tandem spectra on the QSTAR instrument; a positive peptide identification was obtained from 1405 of these. Keller’s data include 37 044 tandem spectra, from which 1788 peptides were identified.44 All of the tandem spectra that produced a peptide hit by database searching with SEQUEST were processed with the de novo sequencing programs. The sequences with the highest rank from each de novo sequencing program were extracted from the program output files and compared with the SEQUEST results.

Relative Sequence Distance as a Performance Measure. The relative sequence distance (RSD) between a de novo sequence and the true peptide sequence was calculated using a dynamic programming approach. The value of RSD is between 0 and 1. Zero means that the de novo sequence is identical to the true peptide sequence; one means that the de novo sequence is completely different from the true peptide sequence. There are two types of common errors in de novo sequencing: swaps of a neighboring pair of residues and amino acid substitutions due to mass equivalency. The algorithm developed in this work for the calculation of RSD assigns the amino acid swap a cost of zero. The reason for this design is that we believe the amino acid swap is an intrinsic challenge to any de novo sequencing method, except when protein database information is used. For the second type of de novo error, amino acid substitution, we allowed substitutions up to a given maximum length (h) of substituted amino acid residues. We also set a mass tolerance (θ), which is the mass accuracy of the mass spectrometer. We considered four values of h (0, 2, 3, and 4) and defined θ = 0.3 Da for LCQ data and θ = 0.1 Da for QSTAR data. If an amino acid or a combination of amino acids in a de novo sequence has the same mass (within mass tolerance θ) as an amino acid or a combination of amino acids in the database sequence, the de novo algorithm is considered to have correctly identified the amino acid or the combination of amino acids. The cost of such a substitution was treated as zero. For example, with the allowed maximum length of amino acid substitutions set to 2 (h = 2), a de novo sequence will be considered identical to the database sequence if (i) two or fewer amino acid substitutions are required to make the de novo sequence identical to the database sequence and (ii) the sum of the masses of the substituting amino acids is equal to the sum of the masses of the amino acids to be substituted.
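The RSD computation described above can be sketched as follows. This is a simplified reading of Algorithm 1, not the authors' code, and it rests on several assumptions: the base distance is the plain Levenshtein distance, the distance within an aligned window is counted as its number of mismatched columns, a gap contributes zero mass, windows of every length from 1 up to h are tried, and a forgiven window is skipped so that no mismatch is subtracted twice. The names `align`, `sequence_distance`, and `rsd` are hypothetical, both sequences are assumed non-empty, and the residue masses are standard monoisotopic values.

```python
# Standard monoisotopic residue masses in Da (assumed, not from the paper).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
GAP = "-"


def align(s1, s2):
    """Levenshtein distance plus one optimal alignment (substitution tried first)."""
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = s1[i - 1] != s2[j - 1]
            d[i][j] = min(d[i - 1][j - 1] + cost, d[i - 1][j] + 1, d[i][j - 1] + 1)
    a1, a2, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            a1.append(s1[i - 1]); a2.append(GAP); i -= 1
        else:
            a1.append(GAP); a2.append(s2[j - 1]); j -= 1
    return d[n][m], "".join(reversed(a1)), "".join(reversed(a2))


def segment_mass(segment):
    """Total residue mass of an aligned segment; gaps weigh nothing."""
    return sum(RESIDUE_MASS[c] for c in segment if c != GAP)


def sequence_distance(s1, s2, h, theta):
    """Edit distance minus mismatches inside mass-equivalent windows of length <= h."""
    k, a1, a2 = align(s1, s2)
    kp, i, n = k, 0, len(a1)
    while i < n:
        forgiven = 0
        for w in range(1, h + 1):
            if i + w > n:
                break
            w1, w2 = a1[i:i + w], a2[i:i + w]
            mismatches = sum(x != y for x, y in zip(w1, w2))
            if mismatches > 0 and abs(segment_mass(w1) - segment_mass(w2)) < theta:
                kp -= mismatches
                forgiven = w
                break
        i += forgiven if forgiven else 1
    return kp


def rsd(s1, s2, h, theta):
    """Relative sequence distance: distance divided by the longer sequence length."""
    return sequence_distance(s1, s2, h, theta) / max(len(s1), len(s2))
```

Under these assumptions, `rsd("AGKR", "AKGR", 0, 0.1)` keeps the full edit distance of 2 and returns 0.5, whereas `rsd("AGKR", "AKGR", 2, 0.1)` forgives the swapped pair and returns 0. A substitution that changes the mass, as in `sequence_distance("AGKR", "AWKR", 4, 0.1)`, is not forgiven at any h.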
Figure 1 shows the results of peptide sequence identification for different lengths of allowed substitutions. Calculating RSD for h = 0, 2, 3, and 4 helps to show how informative a de novo sequence is. If RSD decreases with increasing h for one sequence but stays constant (RSD > 0) for another, the first sequence is more informative. As mentioned above, there are two common types of de novo sequencing errors: inversions and equivalent-mass substitutions. Inversion of two amino acids and arbitrary replacement of any two amino acids with the same total mass both make the edit distance equal to 2 when h = 0. However, the relative sequence distance (RSD) was introduced precisely to distinguish these situations. Thus, the sequence distance for a sequence with an inversion will be 0 when h = 2, while it will remain equal to 2 when two arbitrary amino acids were substituted. In general, all de novo sequencing algorithms perform about twice as well when analyzing QSTAR data as when analyzing LCQ data. A relative distance of zero means exact identification, where the peptide sequence derived by the de novo sequencing algorithm is identical to the true peptide sequence. Figure 1A shows that the best exact identification rate was achieved by PEAKS (49.7%) for the QSTAR data set and by NovoHMM (18.3%) for the LCQ data set. PEAKS was far better at exact sequence identification for QSTAR spectra, while PepNovo and PEAKS show only slightly poorer accuracy than NovoHMM for the LCQ data set. We varied the maximum length of allowed substitutions to study the effect of amino acid substitution during de novo sequencing. Increasing the maximum length of allowed substitutions increases the chance of matching the de novo sequences to the database sequence with zero RSD. Figure 1B-D shows the


Figure 1. Percent of identified peptides for allowed equal substring substitutions with different lengths for evaluated programs from QSTAR (top) and LCQ (bottom) data sets. (A) h = 0, exact identification; (B) h = 2; (C) h = 3; (D) h = 4.

identification results for maximum amino acid substitution lengths of two, three, and four. It is interesting that most de novo sequences that are very similar to the true peptide sequences have one or more sequence segments that can be replaced by other amino acid combinations within a certain mass tolerance. The relative sequence distances of these de novo sequences are less than 0.3 (see Figure 1A,B). Increasing the length of the allowed amino acid substitution increases the success rate of identical identification, where the sequence distance between the identified de novo sequence and the true peptide sequence is zero. For identification with these allowed substitutions, PEAKS remains most effective, although Lutefisk and PepNovo are comparable. However, AUDENS and NovoHMM are ineffective on the QSTAR data set. NovoHMM outperforms all other programs in the analysis of LCQ spectra. PEAKS, PepNovo, and Lutefisk still significantly outperform AUDENS on the LCQ data. The overall identification quality of the studied software packages is not as good as expected. With the maximum length of allowed substitutions set to 4, PEAKS gives an 80% success rate of identification for QSTAR data, while NovoHMM has a 58% success rate for LCQ data. The remaining identified sequences have intolerable levels of sequence errors. This indicates that the current software packages for de novo sequencing have limited capability for protein identification. The peptide sequences derived directly from de novo sequencing should be matched to a protein database sequence, if possible.

Dependency of Algorithm Performance on the Mass of Parent Ion. To study the dependency of algorithm performance on the mass of the parent ion, we considered four ranges of relative sequence distance: [0.0, 0.1], [0.1, 0.2], [0.2, 0.3], and [0.3, 1.0]. Peptide masses were also divided into categories: 2.5 kDa. Most of the analyzed peptides have masses in the range of 1-2 kDa.
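The tally over RSD ranges and parent-mass categories can be sketched as below. This is an illustrative sketch only. The RSD ranges follow the four ranges stated in the text, with each shared boundary value assigned to the lower range (an assumption, since the stated ranges overlap at their endpoints). The mass-category boundaries passed in the example are placeholders rather than the paper's values, and the names `rsd_bin` and `percent_by_bin` are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Upper bounds of the first three RSD ranges; the last range is (0.3, 1.0].
RSD_UPPER_BOUNDS = [0.1, 0.2, 0.3]


def rsd_bin(rsd_value):
    """Index of the RSD range: 0 for [0, 0.1], 1 for (0.1, 0.2], 2 for (0.2, 0.3], 3 above."""
    return bisect_left(RSD_UPPER_BOUNDS, rsd_value)


def percent_by_bin(records, mass_edges):
    """Percent of spectra per (mass category, RSD range).

    `records` is an iterable of (parent_mass_kDa, rsd) pairs. Each count is
    divided by the number of spectra in the same mass category.
    """
    totals = Counter()
    counts = Counter()
    for mass, rsd_value in records:
        mass_category = bisect_left(mass_edges, mass)
        totals[mass_category] += 1
        counts[(mass_category, rsd_bin(rsd_value))] += 1
    return {key: 100.0 * n / totals[key[0]] for key, n in counts.items()}


# Placeholder mass boundaries in kDa (hypothetical values for illustration).
example = percent_by_bin(
    [(1.2, 0.0), (1.2, 0.5), (1.3, 0.05), (1.4, 0.25), (2.7, 1.0)],
    mass_edges=[1.0, 1.5, 2.0, 2.5],
)
```

In this toy input, four spectra fall into the second mass category; two of them have an RSD of at most 0.1, so that cell receives 50%.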
We calculated the percent of identification with different RSD values in each mass range for h = 0 by dividing the number of correctly identified peptides by the total number of spectra in the corresponding mass category. Results are presented in Figures 2 and 3. By definition, the value of RSD indicates identification accuracy. A large value of RSD means that the de novo sequence is dissimilar to the true peptide sequence. Figure 2 shows that AUDENS and NovoHMM performed poorly on QSTAR data in all mass regions. PEAKS has the best accuracy among all five algorithms. The accuracy of PepNovo is slightly better than that of Lutefisk. For LCQ data (Figure 3), AUDENS still has the worst accuracy. The performance of Lutefisk is slightly better than that of AUDENS. NovoHMM has the best accuracy, although PepNovo and PEAKS perform very similarly. Figures 2 and 3 show that the five de novo sequencing algorithms perform better when analyzing tandem spectra with a parent ion mass of less than 2 kDa. The performance decreases significantly for tandem spectra generated by large peptides. For tandem spectra whose parent ion mass is less than 2 kDa, the dependency of performance on the parent ion mass is not significant for LCQ data (Figure 3). However, these algorithms perform best when analyzing spectra with parent ion masses between 1 and 1.5 kDa (Figure 2).

Algorithms’ Performance Correlation Analysis. We also carried out a correlation analysis of the identification abilities of the de novo sequencing programs. We focused only on identification cases with RSD = 0. Three values of the parameter h were considered: h = 0, h = 2, and h = 4. We calculated the percent of


Figure 4. Percent of simultaneous identification by one, two, three, four, or five algorithms for different lengths of aligned segment with possible de novo sequencing error which is tolerable by RSD for QSTAR (top) and LCQ (bottom) data sets.

Table 1. Percent of Spectra Correctly and Incorrectly Interpreted by Each Pair of Algorithms for the QSTAR Data Set (h = 2)a

Figure 2. Percent of peptide identification with different values of RSD for various parent ion mass ranges in the analysis of the QSTAR data set.

a Subdiagonal elements colored in gray correspond to correct identifications (RSD = 0); superdiagonal elements colored in orange correspond to incorrect identifications (RSD > 0.3).

Figure 3. Percent of peptide identification with different values of RSD for various parent ion mass ranges in the analysis of the LCQ data set.

spectra which were identified correctly by all five algorithms, by 4 of 5, 3 of 5, 2 of 5, and by 1 of 5 (see Figure 4). For exact identification (h = 0), most spectra were interpreted correctly by only one algorithm (35% for the QSTAR data set and 12% for the LCQ data set). Increasing h increases the number of simultaneous identifications. Thus, for the QSTAR data set and h = 2, 31% of spectra were identified correctly by 3 algorithms simultaneously, and for h = 4, this value reached 42%. Moreover, 8.5% of spectra were identified correctly by 4 of 5 algorithms. This indicates that different de novo algorithms in fact succeed on distinct groups of spectra, especially for exact identification (h = 0, RSD = 0). For this reason, we recommend using multiple de novo sequencing algorithms for peptide identification. As with QSTAR, there was a general tendency toward simultaneous identifications for the LCQ data set; the only exception was that 9% of spectra were identified correctly by all five algorithms for h = 4. Notably, despite the low identification rate for LCQ spectra, a spectrum interpreted correctly with h = 4 could often be confirmed by four or even five algorithms simultaneously. Percents of correctly (RSD = 0) and incorrectly (RSD > 0.3) identified spectra for each pair of algorithms were also calculated to see the correlation of correct/incorrect calls. The correlation value between two algorithms was calculated by counting the number of spectra identified with RSD = 0 by both algorithms of the pair and the number of spectra identified with RSD > 0.3 by both, respectively. The obtained values were divided by the total number of spectra in the corresponding data set (1405 for QSTAR data and 1788 for LCQ data). Tables 1 and 2 display the analysis results for h = 2 (full information on the correct/incorrect call correlation values for each pair of algorithms can be found in the Supporting Information). AUDENS


Table 2. Percent of Spectra Correctly and Incorrectly Interpreted by Each Pair of Algorithms for the LCQ Data Set (h = 2)a

a Subdiagonal elements colored in gray correspond to correct identifications (RSD = 0); superdiagonal elements colored in orange correspond to incorrect identifications (RSD > 0.3).

Table 3. Sensitivity (Sn) and Positive Predictive Value (PPV) for Evaluated Algorithms, QSTAR Data Set

       NovoHMM   AUDENS   PEAKS   Lutefisk   PepNovo
Sn     0.316     0.373    0.8     0.679      0.677
PPV    0.218     0.22     0.575   0.563      0.522

Table 4. Sensitivity (Sn) and Positive Predictive Value (PPV) for Evaluated Algorithms, LCQ Data Set

shows zero correlation with any of the other four algorithms in analyzing the QSTAR data set (Table 1). It also shows poor correlation with the other algorithms in analyzing LCQ data (Table 2). AUDENS and NovoHMM show high correlation of incorrect calls on QSTAR spectra for all considered values of parameter h (88% for h = 0, 75% for h = 2, and 50% for h = 4). For the LCQ data set, Lutefisk and AUDENS show a high level of correlation for incorrect calls. There were no tandem spectra correctly identified by two de novo algorithms when RSD = 0 and h = 0. When h = 2, peptides were correctly identified in 49% of tandem spectra by both PEAKS and Lutefisk (Table 1). This value increases to 65% when h = 4, which indicates that the correct calls of PEAKS and Lutefisk are highly correlated on QSTAR data. On LCQ data with h = 2, NovoHMM and PepNovo both correctly identified peptides from 24% of tandem spectra, the same share as NovoHMM and PEAKS. When h = 4, however, the correct calls shared by NovoHMM and PEAKS increased to 40%, while those shared by NovoHMM and PepNovo reached 36%.

Sensitivity and Positive Predictive Value as Performance Measures. A methodology for comprehensive performance evaluation of de novo sequencing algorithms is not available.43 We implemented two performance measures, sensitivity and positive predictive value,50,51 calculated as follows:

$$\mathrm{Sn} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \tag{2}$$

$$\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \tag{3}$$

where Sn is the sensitivity, PPV is the positive predictive value, TP is the number of true-positive identifications, FN is the number of false-negative identifications, and FP is the number of false-positive identifications. We employed the following method to calculate TP, FN, and FP:
(1) Select an experimental spectrum from the data set and its SEQUEST match (i.e., the peptide database sequence).
(2) Generate a theoretical spectrum according to fragmentation rules for a, b, c, x, y, and b/y-H2O/NH3 ions present in the experimental spectrum.
(3) For each peak in the theoretical spectrum, identify a mass-matched peak in the experimental spectrum. If the experimental spectrum does not contain this peak, eliminate it from the theoretical spectrum. This procedure creates a list of peaks that are both predicted and present in the experimental spectrum. Denote this peak list as ‘LIST A’.


       NovoHMM   AUDENS   PEAKS   Lutefisk   PepNovo
Sn     0.664     0.312    0.542   0.417      0.523
PPV    0.386     0.193    0.312   0.269      0.356

(4) Select a peptide sequence generated from the same experimental spectrum by the de novo sequencing program.
(5) Generate a theoretical spectrum for the de novo sequence (using the same ion types as in step 2). The resulting peak list is denoted ‘LIST B’.
(6) If a peak is present in both LIST A and LIST B, then TP = TP + 1. If a peak is present in LIST A but not in LIST B, then FN = FN + 1. If a peak is present in LIST B but not in LIST A, then FP = FP + 1. TP, FN, and FP are summed over all peaks in LIST A and LIST B.
(7) Repeat for each experimental spectrum.

By definition, Sn approaches its maximum value (1.0) as FN approaches 0. In this case, all peaks that could be used in peptide sequence identification are used by the de novo sequencing algorithm. Hence, the more useful peaks from an experimental spectrum a de novo sequencing algorithm uses, the larger the value of Sn for that algorithm. This indicates that sensitivity estimates the quality of a de novo sequencing algorithm regardless of other factors.

Positive predictive value (PPV), in contrast, does not depend only on the de novo algorithm. FP increases if a peak is present in LIST B and absent from LIST A, and a peak may be absent from LIST A because of poor quality of the tandem spectrum. In this situation, FP increases because of the poor quality of the experimental data, independent of the merit of the de novo sequencing algorithm. Therefore, we conclude that sensitivity Sn is an appropriate performance measure for de novo sequencing algorithm evaluation. PPV depends on spectrum quality and fragmentation behavior and can be used as an additional performance measure.

Each de novo sequencing algorithm uses a different scoring scheme, but all scoring functions are based on the presence or absence of peaks in the experimental spectrum. These programs also utilize different ion types.
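As an illustration, the peak-counting procedure in steps 1-7 can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used in the study: the function names, the representation of LIST A and LIST B as plain lists of m/z values, and the mass tolerance `tol` (0.5 Da here) are all hypothetical, and the theoretical spectra are assumed to have been generated beforehand.

```python
from typing import Iterable, List, Sequence, Tuple


def has_match(mz: float, peaks: Sequence[float], tol: float) -> bool:
    """Return True if any peak in `peaks` lies within `tol` of `mz`."""
    return any(abs(mz - p) <= tol for p in peaks)


def build_list_a(theoretical: Sequence[float], experimental: Sequence[float],
                 tol: float = 0.5) -> List[float]:
    """Step 3: keep theoretical peaks that also occur in the experimental spectrum."""
    return [mz for mz in theoretical if has_match(mz, experimental, tol)]


def count_peaks(list_a: Sequence[float], list_b: Sequence[float],
                tol: float = 0.5) -> Tuple[int, int, int]:
    """Step 6: return (TP, FN, FP) for one spectrum.

    Matching is by tolerance only and is not forced to be one-to-one
    (a simplification assumed for this sketch).
    """
    tp = sum(1 for mz in list_a if has_match(mz, list_b, tol))
    fn = len(list_a) - tp
    fp = sum(1 for mz in list_b if not has_match(mz, list_a, tol))
    return tp, fn, fp


def sn_ppv(counts: Iterable[Tuple[int, int, int]]) -> Tuple[float, float]:
    """Step 7 plus eqs 2 and 3: sum the counts over all spectra, then form Sn and PPV."""
    tp = fn = fp = 0
    for t, n, p in counts:
        tp, fn, fp = tp + t, fn + n, fp + p
    sn = tp / (tp + fn) if tp + fn else 0.0    # 0.0 when undefined (assumption)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sn, ppv


# Toy m/z values, invented for illustration only.
list_a = build_list_a([100.0, 200.0, 300.0, 350.0], [100.1, 199.9, 300.2, 420.0])
list_b = [100.2, 300.1, 400.0]   # theoretical peaks of a hypothetical de novo sequence
sn, ppv = sn_ppv([count_peaks(list_a, list_b)])
```

In this toy case, LIST A retains three peaks, two of which are reproduced by LIST B, so one peak counts as a false negative and the unmatched LIST B peak counts as a false positive.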
We used the a, b, c, x, y, and b/y-H2O/NH3 ion types because the probability of their presence in an experimental spectrum is high. Sn and PPV for the QSTAR and LCQ data sets were calculated and are presented in Tables 3 and 4. Table 3 shows that PEAKS outperforms all other programs in the analysis of the QSTAR data set. The performance rank order of the de novo algorithms is PEAKS > Lutefisk, PepNovo > AUDENS, NovoHMM. Surprisingly, the five de novo algorithms rank differently in the analysis of the LCQ data set. In this case, NovoHMM


Figure 5. Dependence of relative sequence distance on spectrum quality for QSTAR (left) and LCQ (right) data sets with evaluated algorithms according to the spectrum quality estimation scheme. For QSTAR spectra, relative distance decreases with spectrum quality growth for PEAKS, PepNovo, and Lutefisk; relative distance value for AUDENS and NovoHMM is insensitive to spectrum quality. For LCQ spectra, RSD decreases with increasing spectrum quality for all evaluated programs.

is the best performer (Table 4), and the performance rank order is NovoHMM > PEAKS, PepNovo > Lutefisk > AUDENS. It should be noted that the performance rank order of the five de novo sequencing algorithms derived from the sensitivity study is consistent with the results of the relative distance calculation (Figure 1).

Dependence of Relative Sequence Distance on Spectrum Quality. The quality of tandem spectra is one of many factors that can affect the performance of a de novo sequencing algorithm. We chose the quality evaluation scheme described in ref 53 to investigate the relationship between the performance of each de novo algorithm and spectrum quality. Calculated spectrum quality values were normalized to [0, 1]. The evaluation scheme utilizes the spatial distribution pattern of spectral peaks. The spectral quality score Y is calculated according to eqs 4 and 5:

$$Y = \log\left\{ -k_0 + \sum_{i=1}^{4} k_i X_i + \sum_{i=1}^{4} k'_i X_i^{2} \right\} \tag{4}$$

$$X_1 = \frac{C_2}{C_1}, \quad X_2 = \frac{C_3}{C_1}, \quad X_3 = C_4, \quad X_4 = C_5 \tag{5}$$

where C1 is the number of peaks larger than a given peak intensity threshold; C2 and C3 are the numbers of peaks larger than 3% TIC (total ion current) and 2% TIC, respectively; and C4 and C5 are the average peak distances along m/z for the peaks larger than 2% TIC and within 1.0-1.5% TIC, respectively. The values ki and k′i, i = 1, ..., 4, are the calibrating coefficients.53 We calculated the spectrum quality of every spectrum in both the data generated in our laboratory and the data from Keller et al.44 and sorted the spectra in order of increasing spectrum quality. To discover global tendencies of sequence distance and spectrum quality, we used a smoothing spline to approximate the sequence distance information, where an average value of sequence distance was taken to represent the sequence distance distribution at a given spectrum quality.
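The peak counts C1-C5 and the features X1-X4 of eq 5 can be illustrated with a short Python sketch. It rests on several assumptions that are not stated in the text: TIC is taken as the sum of all peak intensities in the list, "within 1.0-1.5% TIC" is read as an inclusive intensity interval, and the average peak distance is the mean gap between consecutive m/z values. The function names are hypothetical, and the score Y itself is not computed because the calibrating coefficients of ref 53 are not reproduced here.

```python
from typing import Dict, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def mean_gap(mz_values: Sequence[float]) -> float:
    """Average distance along m/z between consecutive peaks (0.0 if fewer than two)."""
    mzs = sorted(mz_values)
    if len(mzs) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(mzs, mzs[1:])]
    return sum(gaps) / len(gaps)


def quality_features(peaks: Sequence[Peak], intensity_threshold: float) -> Dict[str, float]:
    """Compute X1..X4 of eq 5 from a peak list (sketch under the stated assumptions)."""
    tic = sum(intensity for _, intensity in peaks)  # assumption: TIC = summed intensity
    c1 = sum(1 for _, i in peaks if i > intensity_threshold)
    c2 = sum(1 for _, i in peaks if i > 0.03 * tic)
    c3 = sum(1 for _, i in peaks if i > 0.02 * tic)
    c4 = mean_gap([mz for mz, i in peaks if i > 0.02 * tic])
    c5 = mean_gap([mz for mz, i in peaks if 0.010 * tic <= i <= 0.015 * tic])
    return {
        "X1": c2 / c1 if c1 else 0.0,  # 0.0 when C1 is zero (assumption)
        "X2": c3 / c1 if c1 else 0.0,
        "X3": c4,
        "X4": c5,
    }


# Toy spectrum, invented for illustration; intensities sum to 1000.
toy_peaks = [(100.0, 500.0), (200.0, 250.0), (300.0, 100.0), (400.0, 50.0),
             (500.0, 40.0), (600.0, 25.0), (700.0, 14.0), (800.0, 12.0), (900.0, 9.0)]
features = quality_features(toy_peaks, intensity_threshold=10.0)
```

For the toy spectrum, eight peaks exceed the intensity threshold, five exceed 3% TIC, and six exceed 2% TIC, so X1 = 5/8 and X2 = 6/8.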

Figure 5 shows the interdependence of relative sequence distance and tandem spectrum quality for both analyzed data sets (QSTAR and LCQ). For QSTAR spectra, relative sequence distance decreases with increasing spectrum quality for PEAKS, Lutefisk, and PepNovo, whereas it is insensitive to spectrum quality for AUDENS and NovoHMM. PEAKS, Lutefisk, and PepNovo perform much better than AUDENS and NovoHMM on high-quality tandem spectra. Among the five algorithms, PEAKS is the top performer regardless of spectrum quality. Lutefisk and PepNovo perform similarly on spectra of any quality. The performance rank order of these five algorithms is PEAKS > Lutefisk, PepNovo > NovoHMM, AUDENS.

For LCQ spectra, as with the QSTAR data, all algorithms perform poorly on low-quality tandem spectra, and performance improves as spectrum quality increases. However, all programs are insensitive to spectrum quality once it reaches the range 0.17-0.4. NovoHMM has the best performance across the entire range of spectrum quality, although PepNovo and PEAKS have similar performance characteristics. The performance rank order of these five algorithms is NovoHMM > PepNovo, PEAKS > Lutefisk > AUDENS. It should be noted that the performance of all de novo sequencing algorithms decreases for spectrum quality higher than 0.35 in the LCQ data set. This range includes about 350-400 tandem spectra. Manual review of these spectra revealed clusters of peaks with abnormally high intensities of 600-700 counts, compared with the usual intensities of no more than 200 counts for significant peaks. At the same time, the total number of peaks in these spectra is large, even though the majority of peaks have an intensity of 10 counts or less. It is likely that the variable and poor performance of the de novo algorithms in this range is mainly caused by the method of calculating spectrum quality, particularly for the LCQ data set. In the suggested quality evaluation scheme, a simple peak intensity


Figure 6. Distribution of the number of spectra in the QSTAR data set by spectrum quality (top) and percent of correct calls (RSD ≤ 0.2) as a function of spectrum quality for the QSTAR data set, with the length of aligned segment with possible de novo sequencing error tolerable by RSD set to h = 2 (bottom).

threshold was used to filter noise peaks. Some intense noise peaks were therefore likely to have been treated as peaks generated from fragment ions. Although such noise peaks may increase the spectrum quality value, they contribute nothing to peptide identification during de novo sequencing. Instead, they mislead the de novo sequencing algorithms into identifying a peptide sequence with larger sequence variations and therefore increase the relative sequence distance of the de novo sequence.

We also studied the distribution of spectrum quality and the number of correct calls each algorithm makes. Spectra in both data sets were divided into 20 groups according to spectrum quality, so that each spectrum in group i has quality in the range [0.05·(i - 1), 0.05·i], i = 1, ..., 20. For each group, we calculated the number of spectra (shown as gray bars in Figures 6 and 7) and the percent of correct calls by each algorithm for h = 2 (shown as colored lines with markers in Figures 6 and 7). As shown in Figures 6 and 7, most spectra in the QSTAR data set have quality in the range 0.3-0.4, and most spectra in the LCQ data set have quality in the range 0.2-0.4. For the QSTAR data set, the number of correct calls tends strongly to increase with spectrum quality for the leading algorithms PEAKS, Lutefisk, and PepNovo, whereas AUDENS and NovoHMM are not sensitive to QSTAR spectrum quality. For the LCQ data set, a majority of spectra have quality between 0.2 and 0.5. Figure 7 shows that all five algorithms have the best identification rate for spectra with quality in range


Figure 7. Distribution of the number of spectra in the LCQ data set by spectrum quality (top) and percent of correct calls (RSD ≤ 0.2) as a function of spectrum quality for the LCQ data set, with the length of aligned segment with possible de novo sequencing error tolerable by RSD set to h = 2 (bottom).

of 0.25-0.5. However, the rate of correct identification decreases as spectrum quality increases further. This indicates a weakness of the suggested spectrum quality evaluation scheme as applied to spectra obtained on the LCQ instrument.

A number of model proteins were used here to study the performance of de novo sequencing algorithms. Even though we included a serum sample in the QSTAR data, only the most abundant proteins were identified. None of the proteins identified by SEQUEST and used in this study has extensive chemical modifications. A key emerging problem in proteomics is that many proteins carry post-translational modifications (PTMs) at multiple sites, which dramatically complicates the already complex situation presented by the tens of thousands of proteins expressed by most cells. Although all de novo sequencing algorithms investigated in this study show limited analysis power for model proteins, de novo sequencing methods will continue to play a significant role in the analysis of peptide spectra derived from new proteins, peptides with various PTMs, or peptides containing mutations, because it would be computationally expensive to include all PTMs in database searching methods. Even if this were computationally achievable, excessive false-positive identifications might be introduced if all possible chemical modifications were employed in a database search.
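The grouping of spectra into 20 quality intervals that underlies Figures 6 and 7 can be sketched as follows. This is an illustrative sketch only: the record layout (a quality value paired with a dictionary of RSD values per algorithm), the function names, and the handling of interval boundaries (a boundary value is assigned to the upper group, except that quality 1.0 falls in group 20) are assumptions, since the text gives closed intervals. The correct-call cutoff RSD ≤ 0.2 follows the figure captions.

```python
from typing import Dict, List, Sequence, Tuple

Record = Tuple[float, Dict[str, float]]  # (spectrum quality, {algorithm: RSD})


def quality_group(quality: float, n_groups: int = 20) -> int:
    """Return the 1-based group index i for a quality value in [0, 1]."""
    index = int(quality * n_groups) + 1
    return min(index, n_groups)  # quality == 1.0 belongs to the last group


def summarize(records: Sequence[Record], n_groups: int = 20,
              max_rsd: float = 0.2) -> Tuple[List[int], Dict[str, List[float]]]:
    """Count spectra per group and compute percent of correct calls per algorithm."""
    counts = [0] * n_groups
    correct: Dict[str, List[int]] = {}
    for quality, rsd_by_algorithm in records:
        g = quality_group(quality, n_groups) - 1
        counts[g] += 1
        for algorithm, rsd in rsd_by_algorithm.items():
            per_group = correct.setdefault(algorithm, [0] * n_groups)
            if rsd <= max_rsd:
                per_group[g] += 1
    percent = {
        algorithm: [100.0 * c / n if n else 0.0 for c, n in zip(per_group, counts)]
        for algorithm, per_group in correct.items()
    }
    return counts, percent


# Toy records, invented for illustration only.
toy_records = [
    (0.32, {"PEAKS": 0.0, "AUDENS": 0.5}),
    (0.34, {"PEAKS": 0.4, "AUDENS": 0.6}),
    (0.72, {"PEAKS": 0.1, "AUDENS": 0.1}),
]
group_counts, percent_correct = summarize(toy_records)
```

Here `group_counts` corresponds to the gray bars and each list in `percent_correct` to one of the colored lines in Figures 6 and 7.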

Conclusions

More than 15 de novo sequencing algorithms have been developed in the last two decades, but these have never been systematically compared in the same study. We have evaluated


the performance of five of the most popular and readily available de novo algorithms: AUDENS, Lutefisk, NovoHMM, PEAKS, and PepNovo. We calculated the relative sequence distance (RSD) value for each de novo sequence and used this value to study peptide identification quality. The RSD value indicates the similarity between the identified de novo sequence and the true peptide sequence. Comparing RSD values obtained with different values of parameter h shows which algorithm produces more informative sequences. In general, each algorithm performed better in the analysis of QSTAR data than of LCQ data. PEAKS demonstrated the best exact identification with a success rate of 49.7% for QSTAR data, while NovoHMM showed the best exact identification with a success rate of 18.3% for LCQ data. We further introduced sensitivity, which does not depend on spectral quality, as a universal performance measure. The PEAKS and NovoHMM de novo sequencing algorithms also demonstrate the best sensitivity in the analysis of QSTAR data and LCQ data, respectively. Even though positive predictive value depends on spectral quality, it can be utilized as an additional performance measure. The quality of tandem mass spectra makes a substantial contribution to the success of de novo sequencing. In general, de novo algorithms have a better chance of identifying the true peptide sequence from a tandem spectrum of high quality. Across the range of spectrum quality, PEAKS and NovoHMM also perform best in the analysis of QSTAR and LCQ data, respectively. Nevertheless, spectrum quality evaluation schemes for LCQ spectra require further improvement. Similar results were obtained from the three analysis methods: relative sequence distance (RSD), algorithm sensitivity, and spectrum quality. These results indicate that there is not yet a general solution to the de novo sequencing problem in proteomics. All analyzed algorithms failed to exceed a 50% threshold of exact peptide sequence identification for both QSTAR and LCQ data sets. Enlarging the standard spectral data sets will allow more robust algorithm comparison and evaluation, and such data sets will be an important standard for newly developed de novo sequencing programs.

Acknowledgment. This project was funded by RFBR grant 05-07-90238 and project seed funding from Bindley Bioscience Center, Discovery Park, Purdue University. We thank Richard Johnson, Jonas Grossmann, and John Morey for helpful explanations.

Supporting Information Available: Table describing percent of spectra correctly and incorrectly interpreted by each of the pairs of algorithms for QSTAR and LCQ data sets for different values of lengths of aligned segment with possible de novo sequencing error which is tolerable by relative sequence distance. This material is available free of charge via the Internet at http://pubs.acs.org.

References
(1) Eng, J. K.; McCormack, A. L.; Yates, J. R., III J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(2) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(3) Colinge, J.; Masselot, A.; Giron, M.; Dessingy, T.; Magnin, J. Proteomics 2003, 3, 1454-1463.
(4) Bafna, V.; Edwards, N. Bioinformatics 2001, 17, 13-21.
(5) Field, H. I.; Fenyo, D.; Beavis, R. C. Proteomics 2002, 2, 36-47.
(6) Havilio, H.; Haddad, Y.; Smilansky, Z. Anal. Chem. 2003, 75, 435-444.
(7) Dancik, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. J. Comput. Biol. 1999, 6, 327-342.
(8) Elias, J. E.; Gibbons, F. D.; King, O. D.; Roth, F. P.; Gygi, S. Nat. Biotechnol. 2004, 22, 214-219.
(9) Clauser, K. R.; Baker, P.; Burlingame, A. L. Anal. Chem. 1999, 71 (14), 2871-2882.
(10) Sakurai, T.; Matsuo, T.; Matsuda, H.; Katakuse, I. Biomed. Mass Spectrom. 1984, 11 (8), 396-399.
(11) Hamm, C. W.; Wilson, W. E.; Harvan, D. J. Comput. Appl. Biosci. 1986, 2, 115-118.
(12) Spengler, B. J. Am. Soc. Mass Spectrom. 2004, 15 (5), 703-714.
(13) Biemann, K. Biomed. Environ. Mass Spectrom. 1988, 16 (1-12), 99-111.
(14) Johnson, R. S.; Martin, S. A.; Biemann, K.; Stults, J. T.; Watson, J. T. Anal. Chem. 1987, 59 (21), 2621-2625.
(15) Ishikawa, K.; Niva, Y. Biomed. Environ. Mass Spectrom. 1986, 13, 373-380.
(16) Siegel, M. M.; Bauman, N. Biomed. Environ. Mass Spectrom. 1988, 15, 333-343.
(17) Bartels, C. Biomed. Environ. Mass Spectrom. 1990, 19, 363-368.
(18) Taylor, A.; Johnson, R. Rapid Commun. Mass Spectrom. 1997, 11, 1067-1075.
(19) Taylor, A.; Johnson, R. Anal. Chem. 2001, 73, 2594-2604.
(20) Ma, B.; Zhang, K.; Hendrie, C.; Liang, C.; Li, M.; Doherty-Kirby, A.; Lajoie, G. Rapid Commun. Mass Spectrom. 2003, 17, 2337-2342.
(21) F.-Cossio, J.; Gonzalez, J.; Betancourt, L.; Besada, V.; Padron, G.; Shimonishi, Y.; Takao, T. Rapid Commun. Mass Spectrom. 1998, 12, 1867-1878.
(22) Scigelova, M.; Maroto, F.; Dufresne, C.; Vazquez, J. Proceedings of the 50th ASMS Conference on Mass Spectrometry and Allied Topics, Orlando, FL, June 2-6, 2002.
(23) Zhong, H.; Li, L. Rapid Commun. Mass Spectrom. 2005, 19, 1084-1096.
(24) Cormen, T. H.; Leiserson, C. E.; Rivest, R. L.; Stein, C. Introduction to Algorithms, 2nd ed.; The MIT Press: Cambridge, MA, 2001.
(25) Garey, M. R.; Johnson, D. S. Computers and Intractability; Freeman: New York, 1979.
(26) Chen, T.; Kao, M. Y.; Tepel, M.; Rush, J.; Church, G. M. J. Comput. Biol. 2001, 8 (3), 325-337.
(27) Lu, B.; Chen, T. J. Comput. Biol. 2003, 10 (1), 1-12.
(28) Grossmann, J.; Roos, F. F.; Cieliebak, M.; Liptak, Z.; Mathis, L. K.; Muller, M.; Gruissem, W.; Baginsky, S. J. Proteome Res. 2005, 4 (5), 1768-1774.
(29) Fischer, B.; Roth, V.; Roos, F.; Grossmann, J.; Baginsky, S.; Widmayer, P.; Gruissem, W.; Buhmann, J. M. Anal. Chem. 2005, 77 (22), 7265-7273.
(30) Skilling, J.; Cottrell, J.; Green, B.; Hoves, J.; Kapp, E.; Landgridge, J.; Bordoli, B. Proceedings of the 47th ASMS Conference on Mass Spectrometry and Allied Topics, Dallas, TX, June 13-17, 1999.
(31) Savitski, M. M.; Nielsen, M. L.; Kjeldsen, F.; Zubarev, R. A. J. Proteome Res. 2005, 4, 2348-2354.
(32) Horn, D. M.; Zubarev, R. A.; McLafferty, F. W. Proc. Natl. Acad. Sci. U.S.A. 2000, 97 (19), 10313-10317.
(33) Malard, J. M.; Heredia-Langner, A.; Baxter, D. J.; Jarman, K. H.; Cannon, W. R. Third IEEE International Workshop on High Performance Computational Biology (HiCOMB), Santa Fe, NM, April 26, 2004.
(34) Stranz, D. D.; Martin, L. B., III J. Biomol. Techniques 1998, 9 (4), 19-23.
(35) Scarberry, R. E.; Zhang, Z.; Knapp, D. R. J. Am. Soc. Mass Spectrom. 1995, 6, 947-961.
(36) Qin, J.; Herring, C. J.; Zhang, X. Rapid Commun. Mass Spectrom. 1998, 12, 209-216.
(37) Zhang, Z.; McElvain, J. S. Anal. Chem. 2000, 72, 2337-2350.
(38) Olsen, J. V.; Mann, M. Proc. Natl. Acad. Sci. U.S.A. 2004, 101 (37), 13417-13422.
(39) Zhang, Z. Anal. Chem. 2004, 76, 3908-3922.
(40) Frank, A.; Pevzner, P. Anal. Chem. 2005, 77, 964-973.
(41) Lubeck, O.; Sewell, C.; Gu, S.; Chen, X.; Cai, D. M. Proc. IEEE 2002, 90, 1868-1874.
(42) Fenyo, D.; Qin, J.; Chait, B. T. Electrophoresis 1998, 19, 998-1005.
(43) Shadforth, I.; Crowther, D.; Bessant, C. Proteomics 2005, 5, 4082-4095.
(44) Keller, A.; Purvine, S.; Nesvizhskii, A. I.; Stolyar, S.; Goodlett, D. R.; Kolker, E. OMICS 2002, 6, 207-212.
(45) Fedulova, I. A.; Mirzaei, H.; Pevtsov, S. E.; Ouyang, Z.; Zhang, X. Proceedings of the 54th ASMS Conference on Mass Spectrometry and Allied Topics, Seattle, WA, May 28-June 1, 2006.
(46) Ma, B.; Zhang, K.; Liang, C. J. Comput. Syst. Sci. 2005, 70, 418-430.
(47) West, D. Introduction to Graph Theory; Prentice Hall: Englewood Cliffs, NJ, 1996.
(48) Owens, K. Appl. Spectrosc. Rev. 1992, 27, 1-49.
(49) Baginsky, S.; Cieliebak, M.; Gruissem, W.; Kleffmann, T.; Liptak, Z.; Muller, M.; Penna, P. Technical Report No. 383; ETH Zurich, Department of Computer Science, 2002.
(50) Fedulova, I.; Zheng, O.; Zhang, X. Manuscript in preparation.
(51) Gay, S.; Binz, P. A.; Hochstrasser, D. F.; Appel, R. D. Proteomics 2002, 2, 1374-1391.
(52) Blom, N.; Sicheritz-Ponten, T.; Gupta, R.; Gammeltoft, S.; Brunak, S. Proteomics 2004, 4, 1633-1649.
(53) Xu, M.; Geer, L. Y.; Bryant, S. H.; Roth, J. S.; Kowalak, K. A.; Maynard, D. M.; Markey, S. P. J. Proteome Res. 2005, 4, 300-305.

PR060222H