A Hybrid Method for Peptide Identification Using Integer Linear Optimization, Local Database Search, and Quadrupole Time-of-Flight or OrbiTrap Tandem Mass Spectrometry

Peter A. DiMaggio, Jr. and Christodoulos A. Floudas*
Department of Chemical Engineering, Princeton University, Princeton, New Jersey 08544-5263

Bingwen Lu and John R. Yates, III
Department of Cell Biology, The Scripps Research Institute, SR11, La Jolla, California 92037

Received September 5, 2007

A novel hybrid methodology for the automated identification of peptides via de novo integer linear optimization, local database search, and tandem mass spectrometry is presented in this article. A modified version of the de novo identification algorithm PILOT1,2 is utilized to construct accurate de novo peptide sequences. A modified version of the local database search tool FASTA3 is used to query these de novo predictions against the nonredundant protein database to resolve any low-confidence amino acids in the candidate sequences. The computational burden associated with performing several alignments is alleviated with the use of distributed computing. Extensive computational studies are presented for this new hybrid methodology, as well as comparisons with MASCOT4 for a set of 38 quadrupole time-of-flight (QTOF) and 380 OrbiTrap tandem mass spectra. The results for our proposed hybrid method for the OrbiTrap spectra are also compared with a modified version of PepNovo,5 which was trained for use on high-precision tandem mass spectra, and the tag-based method InsPecT.6 The de novo sequences of PILOT and PepNovo are also searched against the nonredundant protein database using CIDentify7 to compare with the alignments achieved by our modifications of FASTA. The comparative studies demonstrate the excellent peptide identification accuracy gained from combining the strengths of our de novo method, which is based on integer linear optimization, and database-driven search methods.

Keywords: hybrid peptide identification • high-precision tandem mass spectrometry • integer linear optimization (ILP)

* Author to whom all correspondence should be addressed. Tel.: (609) 258-4595. E-mail address: [email protected].

Journal of Proteome Research 2008, 7, 1584–1593. Published on Web 03/07/2008. 10.1021/pr700577z CCC: $40.75 © 2008 American Chemical Society

1. Introduction and Background

Peptide and protein identification is of fundamental importance in the study of proteomics. Tandem mass spectrometry (MS/MS) coupled with high-performance liquid chromatography (HPLC) has emerged as a powerful protocol for high-throughput and high-sensitivity peptide and protein identification experiments. In recognition of the extensive amount of sequence information embedded in a single mass spectrum, tandem MS has served as an impetus for the recent development of numerous computational approaches formulated to sequence peptides robustly and efficiently, with particular emphasis on the integration of these algorithms into a high-throughput computational framework for proteomics. The two most frequent computational approaches reported in the literature are (a) de novo methods and (b) database search methods, both of which can utilize deterministic, probabilistic, and/or stochastic solution techniques. De novo methods have distinct advantages over database methods in that they can analyze peptides not present in a protein database and are more amenable to identifying post-translational modifications.

Recent work has been done to combine the strengths of de novo and database techniques to improve peptide identification accuracy. Traditional hybrid approaches find a middle ground between de novo and database methods.8–10 A set of small subsequences, known as “sequence tags”, are typically generated from high-intensity, high m/z peaks in the tandem mass spectrum. Each sequence tag, along with the known masses flanking the N- and C-terminus of the sequence tag, is used to query a protein database and extract full sequences that are consistent in mass and sequence composition. The theoretical tandem mass spectra of these extracted database sequences are then compared to the experimental tandem mass spectrum to determine the best peptide match.

Other approaches to hybrid peptide identification have combined the results of de novo peptide identification and local database search. One of the first attempts at querying de novo sequences against a protein database using local search programs was done by Lutefisk.7,11 De novo sequences generated by Lutefisk were queried against a protein database using CIDentify, which is a modified version of the FASTA program. Dipeptides corresponding to unknown amino acid permutations are encoded as wild card residues in the query sequence. The source code of FASTA was modified to include a loop over each query sequence for a given database sequence, and the sum of each individual alignment score comprises the total score for that database sequence. Gaps in the sequence alignment are not allowed, and the ktup value for FASTA is set to one in CIDentify. Another method uses a modified version of BLAST, denoted as MS BLAST, to verify database identifications that are statistically unreliable or of borderline confidence.12,13 In MS BLAST, all candidate peptide sequences are merged in arbitrary order to create a “chimeric sequence” that is queried in the protein database only once.12 This single query is effective since the total score for a protein match, which is derived by adding up each of the high-scoring pair (HSP) scores that exceed a given threshold, is independent of the order of the segment matches. Matches corresponding to segments from two separate sequences are prevented by introducing a gap symbol between each sequence that is associated with a large negative score. A modified version of the PAM30 scoring matrix is used to accommodate MS/MS issues such as isobaric single amino acid residues and tryptic specificity. Similarly, the algorithm FASTS14 uses a modified version of FASTA to search multiple sequence tags for identifying homologous proteins. In this method, the FASTA lookup technique is complemented by a joining strategy to align unordered peptides, and the Smith-Waterman optimal gapped alignment algorithm is not used.

Several techniques have been developed specifically for searching a protein database using peptide sequences generated from de novo algorithms. The OpenSea15,16 method utilizes a mass-based alignment that performs a greedy local alignment around short sequence tags.
Both homology mutations and de novo sequencing errors are allowed but cannot occur in the same position. OpenSea can also be used for the identification of post-translational modifications in candidate peptide sequences. The SPIDER algorithm17 allows for both mutations and de novo sequencing errors to occur in the same position. This approach uses seeding heuristics as a first pass for culling likely database matches and then solves a simplified alignment model using dynamic programming to find the best alignment. Recent outstanding work has highlighted the potential of de novo sequencing for high-precision mass spectrometry data.5 A modified version of PepNovo18 for precision mass spectrometry data (i.e., generated by QTOF or OrbiTrap instruments) is used to generate several suboptimal candidate sequences which are then queried against a protein database using fast pattern matching, such as a hash table or suffix tree. Such a method depends on the quality of the de novo sequences generated since a direct database lookup requires an exact match in the residue composition between the query and database sequences.

We have recently developed a novel integer linear optimization (ILP) approach to efficiently address the de novo peptide identification problem so as to form a basis for a high-throughput computational framework for peptide identification.1,2 This framework is denoted as PILOT, which stands for Peptide identification via Integer Linear Optimization and Tandem mass spectrometry. The overall algorithm PILOT is comprised of: (1) a preprocessing algorithm used to identify certain peaks and to validate boundary conditions, (2) a two-stage integer linear optimization framework to address missing ion peaks due to residue-dependent fragmentation characteristics, and (3) a postprocessing technique for selecting the most probable sequence by cross-correlating the theoretical spectra of the candidate sequences with the experimental tandem mass spectrum. An integer linear optimization (ILP) problem is a mathematical programming problem in which the decision variables are discrete and the objective function and constraint equations are linear. These problems can be solved rigorously to global optimality using existing software, such as CPLEX,19 and provide a natural way for generating a rank-ordered list of candidate sequences by incorporating integer cuts. A thorough treatment of integer optimization can be found elsewhere.20,21

In this work, we propose a hybrid methodology which utilizes the rank-ordered list of de novo predictions provided by PILOT1,2 to query the nonredundant protein database using a modified version of FASTA.3 The hybrid framework is denoted as PILOT_SEQUEL, which stands for Peptide identification via Integer Linear Optimization, Tandem mass spectrometry and local SEQUEnce aLignment. High-confidence residues from the de novo predictions are identified based on peak intensities, the existence of complementary ions, and conservation of subsequences over all possible candidate sequences. A modified BLOSUM scoring matrix is constructed to bias exact residue matches in the alignment with an additional reward for high-confidence residues in the query sequences. The computational burden associated with performing several sequence alignment calculations is circumvented by use of distributed computing. The individual run time for each sequence alignment is also reduced by reformatting the nonredundant protein database. Results for the hybrid methodology for peptide identification are presented on experimentally validated quadrupole time-of-flight (QTOF) tandem MS22 and a test set of annotated OrbiTrap tandem MS.
PILOT_SEQUEL generates de novo sequences using a modified version of the PILOT1,2 algorithm, which uses preprocessing to elucidate key spectral features that can be exploited in the sequencing calculations and then solves a two-stage ILP model to address missing peaks in the tandem mass spectrum. Twenty candidate sequences are generated from this de novo stage, and high-confidence amino acids are identified. The weights in these candidate sequences are replaced by permutations of amino acids that are consistent with these masses. The resulting sequences are then queried against a protein database using a modified version of FASTA, and the best-scoring sequence is reported as the peptide responsible for generating the tandem mass spectrum. The overall framework for PILOT_SEQUEL is summarized in Figure 1.

Figure 1. Flow diagram representing the framework for the hybrid method for peptide identification.

2. Mathematical Model for the De Novo Sequencing Stage

In this section, we will review the integer linear optimization framework for PILOT.1,2 The essential components of this formulation are the parameters, sets, binary variables, constraints, and objective function.

2.1. Parameters. The tandem mass spectrum of a peptide is comprised of the mass of the parent peptide, its corresponding charge state, and a mass-ordered list of the mass-to-charge ratios of the ion peaks and their corresponding intensities. One should note that the measured mass-to-charge ratios are subject to a certain degree of experimental error, depending on the resolution of the instrument.23 The parameters are defined as follows:

$m_P$ = mass of parent peptide
mass(ion peak $i$) = mass of ion peak $i$
$\lambda_i$ = intensity of ion peak $i$

2.2. Set Definitions. The first set we define is comprised of the mass difference between each pair of the peaks in a tandem mass spectrum, which we denote by the matrix $M$ as shown in eq 1.

$$M = \{M_{i,j} = \text{mass(ion peak } j) - \text{mass(ion peak } i) : \text{mass(ion peak } j) > \text{mass(ion peak } i)\} \tag{1}$$

The index $i$ denotes the rows, and the index $j$ denotes the columns of the matrix $M_{i,j}$. The candidate sequence should be derived by connecting peaks in the tandem mass spectrum by the weights of amino acids. Thus, we define a set that contains all of the pairs of peaks in the tandem mass spectrum whose mass difference is equal to that of an amino acid, which we define to be the set $S$ as shown in eq 2.

$$S = \{S_{i,j} = (i, j) : M_{i,j} = \text{mass of an amino acid}\} \tag{2}$$

Therefore, the mass difference between peak $j$ and peak $i$ is equal to the weight of some amino acid for every $(i,j) \in S_{i,j}$. The subsequent problem formulation will be restricted over this set, $S_{i,j}$. We can exploit the mathematical relationships between different ion series to define additional sets. For instance, the derivation of the candidate peptide sequence must be done by connecting ion peaks from the same ion series using the weights of amino acids. This is typically done by using only either the b- or y-ion series. In an experimental tandem mass spectrum, it is not known a priori of what ion type a mass peak is. However, we can eliminate certain ions based on known relationships. When the protonated peptide undergoes collision-induced dissociation (CID), it primarily fragments into two ion pairs: either a and x, b and y, or c and z, where all three pairs are complementary ions by definition. That is, the sum of the masses of these ions is equal to the weight of the parent peptide, $m_P$, as determined experimentally. We define the set $C$ to contain the pairs of peaks that are complementary ions, as shown in eq 3.

$$C = \{C_{i,j} = (i, j) : \text{mass(ion peak } i) + \text{mass(ion peak } j) = m_P + 2;\ i \neq j\} \tag{3}$$

Thus, we can use this set to eliminate certain ions in the derivation of the candidate sequences. It is important to note that the number of complementary ions found in practice can be limited by what instrument is used for tandem MS and the amount of fragmentation observed. An additional set can be constructed based on the fact that different types of ion series begin and end at different m/z values in the tandem mass spectrum. For instance, a candidate peptide derived using the y-ion series must begin at the weight of water (19 Da) and terminate at the weight of the parent peptide ($m_P + 1$), whereas deriving the same sequence using the b-ion series, the appropriate bounds become zero mass (1 Da) and the weight of the parent peptide subtracted by the weight of water ($m_P - 17$), respectively. Thus, we define two sets, $BC_i^{\text{head}}$ and $BC_j^{\text{tail}}$, which contain the peaks that can be used for beginning and terminating the sequencing calculations, respectively. Note that the sets in eqs 4 and 5 consider only the possibility for b- or y-ions in the candidate sequence.

$$BC_i^{\text{head}} = \{1, 19\}\ \text{Da} \tag{4}$$

$$BC_j^{\text{tail}} = \{m_P - 17,\ m_P + 1\}\ \text{Da} \tag{5}$$

2.3. Binary Variables. We utilize binary {0–1} variables in the problem formulation to model the peaks ($p_i$) and paths connecting peaks ($w_{i,j}$) that are used in the construction of the candidate sequence. Binary variables facilitate the use of logical inference when formulating the model constraints.

$$p_i = \begin{cases} 1, & \text{if peak } (i) \text{ is selected} \\ 0, & \text{otherwise} \end{cases}$$

$$w_{i,j} = \begin{cases} 1, & \text{if peaks } (i) \text{ and } (j) \text{ are connected by a path (i.e., } p_i = p_j = 1) \\ 0, & \text{otherwise} \end{cases}$$
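The set definitions in eqs 1–5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PILOT implementation: the function name `build_sets`, the nominal integer residue masses, the six-residue alphabet, and the toy spectrum (the y-ions of a hypothetical peptide SGVK plus one b-ion at mass 88) are all invented for the example.

```python
from itertools import combinations

# Nominal (integer) residue masses in Da for a handful of amino acids.
# Illustrative only: a real run would use all 20 residues and accurate masses.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "V": 99, "L": 113, "K": 128}


def build_sets(peaks, m_p, tol=0):
    """Return (S, C, bc_head, bc_tail) for a mass-ordered list of peak masses."""
    S, C = {}, set()
    for i, j in combinations(range(len(peaks)), 2):
        diff = peaks[j] - peaks[i]                 # M_ij, with mass_j > mass_i (eq 1)
        for residue, mass in RESIDUE_MASS.items():
            if abs(diff - mass) <= tol:
                S[(i, j)] = residue                # eq 2
        if abs(peaks[i] + peaks[j] - (m_p + 2)) <= tol:
            C.add((i, j))                          # eq 3
    head_masses = (1, 19)                          # eq 4
    tail_masses = (m_p - 17, m_p + 1)              # eq 5
    bc_head = {i for i, m in enumerate(peaks)
               if any(abs(m - h) <= tol for h in head_masses)}
    bc_tail = {j for j, m in enumerate(peaks)
               if any(abs(m - t) <= tol for t in tail_masses)}
    return S, C, bc_head, bc_tail


# Toy spectrum: y-ion ladder 19, 147, 246, 303, 390 and the b-ion 88.
peaks = [19, 88, 147, 246, 303, 390]
S, C, bc_head, bc_tail = build_sets(peaks, m_p=389)
```

In this toy case the pair (1, 4) lands in C because 88 + 303 = 391 = 389 + 2, so the two peaks cannot both belong to the same ion series.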

2.4. Constraints. The mathematical constraints of the problem formulation are derived from ion properties and graph theory. The first constraint is based on the fact that the candidate peptide must be derived using ions of the same type and that complementary ions are of different type by definition. In other words, if peak $i$ is used in the candidate sequence and the peak pair $(i,j)$ belongs to the complementary ion set defined in eq 3, then peak $j$ should be eliminated from consideration in the sequencing calculations. This is modeled mathematically by the constraint in eq 6.

$$p_i + p_j \le 1 \quad \forall\, (i, j) \in C_{i,j} \tag{6}$$

Another important constraint to consider is that the mass of the derived candidate sequence is equal to that of the parent peptide ($m_P$). It is well-known that the experimentally measured parent peptide mass is subject to a certain degree of experimental error, which is in turn dependent on the resolution of the mass spectrometer used. Thus, exact conservation of mass cannot be achieved but must be relaxed by some tolerance of error. This is represented by constraint eqs 7 and 8.

$$\sum_{(i,j) \in S_{i,j}} M_{i,j} \cdot w_{i,j} \le (m_P - 18) + \text{tolerance} \tag{7}$$

$$\sum_{(i,j) \in S_{i,j}} M_{i,j} \cdot w_{i,j} \ge (m_P - 18) - \text{tolerance} \tag{8}$$

The parameter tolerance corresponds to the tolerance of error, for which we typically use a value of 2 Da in the algorithm. One should note that we could also declare the tolerance term to be a positive variable and then minimize its value in the objective function. The derivation of the candidate peptide sequence can be thought of as connecting peaks in the tandem mass spectrum with paths, which correspond to weights of amino acids. To ensure that the paths selected are continuous and nondegenerate, we use the flow conservation law from graph theory which has been used extensively in process synthesis problems,24–30 as shown in eq 9.

$$\sum_{j \in S_{j,i}} w_{j,i} - \sum_{k \in S_{i,k}} w_{i,k} = 0 \quad \forall\, i,\ i \notin BC_i^{\text{head}},\ i \notin BC_i^{\text{tail}} \tag{9}$$

The above constraints enforce that the number of paths entering a peak is equal to the number of paths leaving a peak. That is, if the peak is used in the candidate sequence, then the above constraints enforce that it is only used once. To ensure that the beginning and end of the candidate sequence are the appropriate m/z values, we enforce that the peaks denoted as the boundary conditions are activated, as shown in eqs 10 and 11.

$$\sum_{i \in BC_i^{\text{head}}} \sum_{j \in S_{i,j}} w_{i,j} = 1 \tag{10}$$

$$\sum_{j \in BC_j^{\text{tail}}} \sum_{i \in S_{i,j}} w_{i,j} = 1 \tag{11}$$

It is important to note that the existence of these boundary condition ions (and ions related to them) is checked by a preprocessing algorithm. If it is determined that the tandem mass spectrum is missing a subset of these ions, then these sets can be adjusted to restrict the derivation of the candidate peptide to take place over only a subsequence. Furthermore, these constraints enforce the nondegeneracy of paths since only one path can initiate and terminate the sequence, respectively. The final set of constraints represents the mathematical relationship between the binary variables representing the peaks, $p_i$, and the paths connecting the peaks, $w_{i,j}$.

$$\sum_{j \in S_{i,j}} w_{i,j} = p_i \quad \forall\, i \in BC_i^{\text{head}} \tag{12}$$

$$\sum_{j \in S_{j,i}} w_{j,i} = p_i \quad \forall\, i \notin BC_i^{\text{head}} \tag{13}$$

These constraints ensure that if there exists a path entering and leaving a peak $i$ (i.e., $w_{j,i} = 1$ and $w_{i,j} = 1$), then peak $i$ will be activated in the construction of the candidate sequence (i.e., $p_i = 1$). These constraints also allow us to remove certain peaks (and the paths connected to these peaks) from the sequencing calculations by simply deactivating the binary variables that represent those peaks (i.e., $p_i = 0$). This feature is useful for eliminating the precursor ion and multiply charged ions from consideration.

2.5. Objective Function. For low-energy CID spectra, it is commonly observed that the b- and y-ion peaks are on average the most abundant in intensity throughout the entire m/z range.31 Based on this observation, we propose to maximize the explicit intensity of the peaks used in the construction of the candidate sequence in an attempt to maximize the number of y-ions used.

$$\max_{p_k,\, w_{i,j}} \sum_{(i,j) \in S_{i,j}} \lambda_j \cdot w_{i,j} \tag{14}$$

The entire integer linear optimization model for PILOT1,2 is summarized below.

$$
\begin{aligned}
\max_{p_k,\, w_{i,j}} \quad & \sum_{(i,j) \in S_{i,j}} \lambda_j \cdot w_{i,j} \\
\text{s.t.} \quad & \sum_{(i,j) \in S_{i,j}} M_{i,j} \cdot w_{i,j} \le (m_P - 18) + \text{tolerance} \\
& \sum_{(i,j) \in S_{i,j}} M_{i,j} \cdot w_{i,j} \ge (m_P - 18) - \text{tolerance} \\
& p_i + p_j \le 1 \quad \forall\, (i, j) \in C_{i,j} \\
& \sum_{j \in S_{i,j}} w_{i,j} = p_i \quad \forall\, i \in BC_i^{\text{head}} \\
& \sum_{j \in S_{j,i}} w_{j,i} = p_i \quad \forall\, i \notin BC_i^{\text{head}} \\
& \sum_{i \in BC_i^{\text{head}}} \sum_{j \in S_{i,j}} w_{i,j} = 1 \\
& \sum_{j \in BC_j^{\text{tail}}} \sum_{i \in S_{i,j}} w_{i,j} = 1 \\
& \sum_{j \in S_{j,i}} w_{j,i} - \sum_{k \in S_{i,k}} w_{i,k} = 0 \quad \forall\, i,\ i \notin BC_i^{\text{head}},\ i \notin BC_i^{\text{tail}} \\
& w_{i,j},\, p_k = \{0, 1\} \quad \forall\, (i, j),\ (k)
\end{aligned}
$$
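On a toy spectrum the model summarized above is small enough to solve by exhaustive enumeration. The sketch below does exactly that, substituting a depth-first search for an ILP solver such as CPLEX; it is not how PILOT works internally. The function name `enumerate_candidates`, the peak list, the intensities, and the nominal integer masses are assumptions made for the example (a hypothetical peptide AGVK with parent mass 373). Because a simple head-to-tail path enters and leaves every interior peak exactly once, eqs 9–13 hold by construction; eqs 6–8 are checked explicitly.

```python
def enumerate_candidates(peaks, intensity, S, C, bc_head, bc_tail, m_p, tolerance=2):
    """List every feasible head-to-tail path, best objective value (eq 14) first."""
    successors = {}
    for i, j in S:
        successors.setdefault(i, []).append(j)
    results = []

    def extend(path):
        last = path[-1]
        if last in bc_tail:
            mass = sum(peaks[b] - peaks[a] for a, b in zip(path, path[1:]))
            if abs(mass - (m_p - 18)) <= tolerance:            # eqs 7 and 8
                score = sum(intensity[j] for j in path[1:])    # eq 14
                results.append((score, path))
            return
        for nxt in successors.get(last, []):
            # eq 6: never select both members of a complementary pair
            if any((min(nxt, p), max(nxt, p)) in C for p in path):
                continue
            extend(path + [nxt])

    for head in sorted(bc_head):
        extend([head])
    return sorted(results, key=lambda r: r[0], reverse=True)


# Toy y-ion ladder 19, 147, 246, 303, 374 plus the b-ion 72 (complement of 303).
peaks = [19, 72, 147, 246, 303, 374]
intensity = [10, 20, 100, 80, 60, 10]
S = {(0, 2): "K", (2, 3): "V", (3, 4): "G", (3, 5): "K", (4, 5): "A"}
C = {(1, 4)}

ranked = enumerate_candidates(peaks, intensity, S, C, bc_head={0}, bc_tail={5}, m_p=373)
# A y-ion path runs from the C-terminus to the N-terminus, so reverse the residues.
sequences = ["".join(S[edge] for edge in zip(path, path[1:]))[::-1] for _, path in ranked]
```

The second-ranked path skips the peak at 303 and reads a single K where the best path reads G followed by A, because the nominal masses of G + A and K coincide at 128 Da. A rank-ordered list like `ranked` is what integer cuts would produce one solution at a time in a real ILP solve.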

This problem can be solved to global optimality using existing methods such as CPLEX,19 and integer cuts20 can be used to generate a rank-ordered list of candidate sequences according to the objective function in eq 14.

2.6. Modifications in the De Novo Sequencing Model. Several changes to the above integer linear optimization model have been made for use in the hybrid methodology for OrbiTrap spectra since these spectra have an abundance in both the b- and y-ion series and have a high mass precision, which allows for a less-constrained problem formulation.5 In the original de novo formulation for PILOT,1,2 the preprocessing algorithm examines the raw tandem mass spectrum for the peaks which are the y-ion boundary conditions for the N-terminus (see eq 5) and C-terminus (see eq 4) of the peptide. If no ion peaks corresponding to the N-terminus boundary condition for the y-ion series are found, then a presequencing calculation is performed, where several high intensity peaks from the high-mass region of the spectrum are used as potential N-terminus boundary conditions (i.e., the sequencing of the peptide is only done up to these peaks). The peak which results in the greatest objective function value from the presequencing calculations is then used as the N-terminus boundary condition for the subsequent stage one and stage two de novo computations.1 In the hybrid methodology, the use of only one peak as the N-terminal boundary condition for the y-ion series is changed to using the top three peaks with the greatest objective function value from the presequencing stage. This allows for more flexibility in the de novo sequences being generated, and it was observed that this leads to higher quality de novo sequences.

The variability of the N-terminus boundary condition requires dynamic flexibility in the mass conservation law (in eqs 7 and 8). This is accomplished by replacing the parameter $m_P$ (the mass of the peptide) by the summation of the product of all potential N-terminus boundary condition masses and their corresponding binary variables, as shown in eqs 15 and 16.

$$\sum_{(i,j) \in S_{i,j}} M_{i,j} \cdot w_{i,j} \le \sum_{i \in BC_i^{\text{tail}}} m/z_i \cdot p_i + \text{tolerance} \tag{15}$$

$$\sum_{(i,j) \in S_{i,j}} M_{i,j} \cdot w_{i,j} \ge \sum_{i \in BC_i^{\text{tail}}} m/z_i \cdot p_i - \text{tolerance} \tag{16}$$

These constraints allow the mass of the peptide to vary with the peak that is selected as the N-terminal boundary condition, and eq 11 ensures that only one of these peaks is selected. The summation on the right-hand side of eqs 15 and 16, $\sum_{i \in BC_i^{\text{tail}}} m/z_i \cdot p_i$, is a summation of each peak that is a potential N-terminal boundary condition multiplied by its corresponding binary variable to indicate whether or not it is used in the construction of the candidate sequence. Equation 11 enforces that only one of these peaks can be used as the N-terminus boundary condition for the y-ion series. Thus, only one N-terminus boundary condition is used in the construction of the candidate sequence, and this is optimally selected by the algorithm at execution time.

Based on the observation that y- and b-ions are typically the most abundant in intensity in a tandem mass spectrum, the objective function in eq 14 maximizes the intensities of the peaks used in the construction of the candidate sequence so as to maximize the number of b- or y-ions used. For OrbiTrap spectra, a large percentage of y-ions as well as their complementary b-ions are present in the tandem mass spectrum. Thus, if we were to emphasize the construction of the peptide sequence to use high-intensity peaks that are complementary ions, then we would increase the likelihood that the peaks used are b- or y-ions. This can be represented mathematically by the objective function defined in eq 17.

$$\max_{p_k,\, w_{i,j}} \left( \frac{1}{\omega} \sum_{i \in S_{i,j}} \lambda_i \cdot w_{i,j} + \sum_{i \in C_{i,j'},\, S_{i,j}} (\lambda_i + \lambda_{j'}) \cdot p_i \right) \tag{17}$$
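To see how the two terms of eq 17 trade off, the objective can be evaluated for a fixed candidate path. The sketch below is an illustration only: the function name `hybrid_objective`, the peak indices, the intensities, and the complementary pair are invented for the example, and the first term sums $\lambda_i$ over the lower-mass peak of each selected edge, following eq 17 as printed.

```python
def hybrid_objective(path, intensity, C, omega=10):
    """Evaluate eq 17 for one candidate path given as a list of peak indices."""
    partner = {}
    for i, j in C:
        partner[i], partner[j] = j, i
    # Term one: intensity of the starting peak of every selected edge, scaled by 1/omega.
    term_one = sum(intensity[i] for i in path[:-1]) / omega
    # Term two: selected peaks that have a complementary partner in the spectrum.
    term_two = sum(intensity[i] + intensity[partner[i]] for i in path if i in partner)
    return term_one + term_two


intensity = [10, 20, 100, 80, 60, 10]   # hypothetical peak intensities
C = {(1, 4)}                            # peaks 1 and 4 are complementary ions

with_complement = hybrid_objective([0, 2, 3, 4, 5], intensity, C)
without_complement = hybrid_objective([0, 2, 3, 5], intensity, C)
```

With these made-up numbers the path through peak 4 scores 105.0 against 19.0 for the path that skips it, so a single validated complementary pair outweighs the whole intensity term.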

The first term in eq 17 is similar to the original objective function (see eq 14), except that its contribution is reduced by the factor $1/\omega$. The second term in eq 17 emphasizes the use of complementary ions in the derivation of the candidate peptide sequences since OrbiTrap tandem mass spectra are typically populated with both the y- and b-ion series. The first term in the objective function is penalized so that the candidate sequences are derived primarily by y-ions which have been validated by the existence of their complementary b-ions. However, this first term enforces that the remaining ions used in the construction of the candidate sequence are of high intensity, which is a common characteristic of the y-ion series.31 $\omega$ must be selected so that the contributions from term one are less than term two in eq 17. We use a value of $\omega = 10$ in our algorithm, but other low values produce the same results.

An additional constraint is introduced into the model for the analysis of tryptic peptides since the y1-ion must be either a C-terminal lysine (K), identified by an m/z peak at 147.17, or a C-terminal arginine (R), identified by an m/z peak at 175.18.31 The variable indices corresponding to these ion peaks are represented by the set $TP$, as defined in eq 18.

$$TP = \{i : m/z_i = 147.17 \ \text{or}\ m/z_i = 175.18\} \tag{18}$$

The constraint in eq 19 then ensures that only one C-terminal peak is selected and corresponds to a lysine or arginine.

$$\sum_{i \in TP_i} p_i = 1 \tag{19}$$
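Eqs 18 and 19 amount to a lookup followed by a counting check, sketched below. The function names, the 0.02 m/z matching tolerance, and the example peak list are assumptions; the source gives only the two target m/z values.

```python
Y1_LYSINE, Y1_ARGININE = 147.17, 175.18   # y1-ion m/z values quoted in the text


def tryptic_set(peaks, tol=0.02):
    """Indices of peaks that match a C-terminal K or R y1-ion (eq 18)."""
    return {i for i, mz in enumerate(peaks)
            if min(abs(mz - Y1_LYSINE), abs(mz - Y1_ARGININE)) <= tol}


def satisfies_eq19(selected_peaks, tp):
    """True when exactly one peak from TP is selected (eq 19)."""
    return len(set(selected_peaks) & tp) == 1


peaks = [19.02, 147.17, 175.18, 246.20]   # hypothetical spectrum containing both y1 peaks
tp = tryptic_set(peaks)
```

A candidate that selects peaks {0, 1, 3} passes the check, while one that selects both index 1 and index 2 does not.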

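The complete integer linear formulation for constructing the de novo sequences from OrbiTrap tandem MS to be queried in the protein database is presented in Problem (P). This is an integer linear optimization model which can be solved efficiently to global optimality using CPLEX.19 Furthermore, a rank-ordered list of solutions can be generated through the use of integer cuts.20

$$
\begin{aligned}
\max_{p_k,\, w_{i,j}} \quad & \left( \frac{1}{\omega} \sum_{i \in S_{i,j}} \lambda_i \cdot w_{i,j} + \sum_{i \in C_{i,j'},\, S_{i,j}} (\lambda_i + \lambda_{j'}) \cdot p_i \right) \\
\text{s.t.} \quad & \sum_{(i,j) \in S_{i,j}} M_{i,j} \cdot w_{i,j} \le \sum_{i \in BC_i^{\text{tail}}} m/z_i \cdot p_i + \text{tolerance} \\
& \sum_{(i,j) \in S_{i,j}} M_{i,j} \cdot w_{i,j} \ge \sum_{i \in BC_i^{\text{tail}}} m/z_i \cdot p_i - \text{tolerance} \\
& p_i + p_j \le 1 \quad \forall\, (i, j) \in C_{i,j} \\
& \sum_{j \in S_{i,j}} w_{i,j} = p_i \quad \forall\, i \in BC_i^{\text{head}} \\
& \sum_{j \in S_{j,i}} w_{j,i} = p_i \quad \forall\, i \notin BC_i^{\text{head}} \\
& \sum_{i \in BC_i^{\text{head}}} \sum_{j \in S_{i,j}} w_{i,j} = 1 \\
& \sum_{j \in BC_j^{\text{tail}}} \sum_{i \in S_{i,j}} w_{i,j} = 1 \\
& \sum_{i \in TP_i} p_i = 1 \\
& \sum_{j \in S_{j,i}} w_{j,i} - \sum_{k \in S_{i,k}} w_{i,k} = 0 \quad \forall\, i,\ i \notin BC_i^{\text{head}},\ i \notin BC_i^{\text{tail}} \\
& w_{i,j},\, p_k = \{0, 1\} \quad \forall\, (i, j),\ (k)
\end{aligned}
\tag{P}
$$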
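The integer cuts used to rank solutions are not written out in this section. One standard form, given here as an illustration and not necessarily the exact cut used in PILOT, excludes a previously found solution $w^{*}$ by adding the constraint below, where $B = \{(i,j) : w^{*}_{i,j} = 1\}$ and $N = \{(i,j) : w^{*}_{i,j} = 0\}$ are assumed names for the active and inactive paths of that solution:

```latex
\sum_{(i,j) \in B} w_{i,j} \;-\; \sum_{(i,j) \in N} w_{i,j} \;\le\; |B| - 1
```

The left-hand side equals $|B|$ only when $w = w^{*}$, so the cut removes exactly that solution. Re-solving Problem (P) after each added cut returns the next-best candidate, which builds the rank-ordered list one sequence at a time.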

The method proposed in this article is intended for doubly charged tryptic peptides that were ionized using electrospray ionization. In general, multiply charged peptides are more difficult to interpret due to their lack of fragmentation, which is believed to result from limited migration of the proton that was initially associated with the N-terminal amine moiety.31 Singly charged peptides result in fewer product ions since during fragmentation only one group will retain the proton and hence be detected in the tandem mass spectrum. In this article, only doubly charged peptides are studied since they result in the most unambiguous fragmentation characteristics. However, the proposed approach can be extended to address the other charge states for tryptic peptides by using a de novo method that was developed for multiply charged ions to generate candidate peptide sequences.

3. FASTA Algorithm

The de novo sequences generated from the mathematical model presented in Problem (P) are then queried against the nonredundant protein database to resolve ambiguous amino acid assignments and determine missing N-terminal amino acids. The FASTA3 program is selected to perform the alignment between these de novo sequences and the protein sequences in the database.

3.1. Scoring Matrices. Traditional scoring matrices used to score FASTA or BLAST alignments are constructed based on evolutionary distances. However, when querying the de novo sequences from PILOT1,2 against a protein database, the conservation of mass between the template and query sequence is of great importance. To assist the conservation of mass in the alignment, a scoring matrix was constructed to emphasize exact residue matches, where a residue match to itself is given a constant reward of +5 and a match to any other residue is given a constant penalty of -5. This results in a matrix with +5 for every diagonal element and -5 for every off-diagonal element. The alignment can also be biased toward exact matches for high-confidence residues identified in the candidate peptide sequences. In this work, we define high-confidence residues from the de novo sequencing results based on the following criterion: the amino acid is verified by both the b- and y-ion series and has an N-terminal intensity greater than that of the C-terminal y1-ion intensity. Exact matches between the query and template for a high-confidence residue are given a reward of +15 so that the alignment favors these residues. It should be noted here that isoleucine and leucine are considered to be equivalent residues, and matches between these residues are given a score of +5.

3.2. Isobaric Residues. An important issue regarding conservation of mass between the query and template sequence is how to incorporate isobaric residues into the alignment. A list of common isobaric residues found in de novo sequences is shown in Table 1. An isobaric residue can exist on the query or template sequence, and de novo sequences often contain isobaric residues due to incomplete fragmentation, especially in lower quality spectra. In the traditional alignment algorithm, a gap penalty is used to extend the sequence which contains the isobaric residue. This reduces the overall score and could potentially result in an alternate local alignment that is less desirable between the query and template sequence.

Table 1. List of Isobaric Residues

isobaric residue    doublet pairs of same mass
N                   GG
Q, K                GA
R                   GV
W                   AD, SV, GE
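The scoring rules of Section 3.1 can be written as a small function. This is a sketch of the stated rules only, not the modified FASTA code: the function names are invented, and scoring an isoleucine/leucine pair at +5 even when the query residue is high-confidence is an assumption, since the source reserves +15 for exact matches.

```python
MATCH, MISMATCH, HIGH_CONFIDENCE_MATCH = 5, -5, 15


def residue_score(query, template, high_confidence=False):
    """Score one aligned residue pair under the exact-match scoring scheme."""
    if query == template:
        return HIGH_CONFIDENCE_MATCH if high_confidence else MATCH
    if {query, template} == {"I", "L"}:
        return MATCH        # isoleucine and leucine are treated as equivalent
    return MISMATCH


def ungapped_score(query, template, confident_positions=frozenset()):
    """Sum the residue scores of two equal-length sequences aligned without gaps."""
    return sum(
        residue_score(q, t, k in confident_positions)
        for k, (q, t) in enumerate(zip(query, template))
    )


# Position 0 of the query is flagged as a high-confidence residue.
example = ungapped_score("AGIK", "AGLR", confident_positions={0})
```

For this pair the four positions contribute +15, +5, +5, and -5. The isobaric doublets of Table 1 are not handled here; they require the modified Smith-Waterman step described next.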
To address the existence of isobaric residues in the alignment, we modified the final step in the FASTA algorithm where the method of Smith and Waterman32 is used to compute an optimal alignment score between the query and library sequence. This is incorporated by replacing the penalty for the residue mismatch and sequence gap with a reward for the isobaric alignment and then extending the sequence without incurring the associated gap penalty. The alignments to isobaric residues are allowed on both the query and library sequences. 3.3. Conservation of Mass. Even with the above modifications to FASTA, the library sequence with the optimal alignment score could still violate the conservation of mass for the parent peptide as determined experimentally. For every possible alignment between the query and library sequence, the mass difference between these two sequences up to the point of alignment is measured. If there is a substantial mass difference at the end of the alignment (i.e., greater than 2.0 Da), then the final score is penalized by reducing it to 10% of its nominal value. The nonredundant database was reformatted to create several smaller databases consisting of tryptic peptides within

a mass window of 5 Da, where the range of these databases is from 450-455 Da to 2730–2735 Da. For each spectrum, the mass of the parent peptide is used to select the appropriate database to perform the alignment. If the mass of the parent peptide is within 2.0 Da of the mass window for the database, then the next closest database in the mass range is also searched. Reformatting the database in this fashion reduces the CPU to 1/5 the time required to query the entire nonredundant database. 3.4. Distributed Computing Framework. The approximate time for querying a single candidate sequence against the nonredundant protein database is on the order of 10 s. This overhead is predominately the time required to read in the protein database. Since the de novo sequences derived from QTOF and OrbiTrap spectra are of very high quality and accuracy, we only perform the modified Smith-Waterman alignment on those sequences which have four consecutive correct residues (which is accomplished by setting the ktup parameter in FASTA equal to 4). Even with these efficiency gains, it is not practical to query each sequence one at a time in a serial fashion. To alleviate the computational burden associated with performing several independent sequence alignments, we have implemented an algorithm for distributing these jobs over multiple processors. A centralized control strategy33 is implemented where a “master” processor manages and distributes alignment jobs to several “slave” processors. These slave processors perform the alignment calculations for the given sequence(s) and then communicate the results back to the master processor, where they are stored and ranked. The communication time between the master and slave processors is negligible in comparison to the time required for alignment. The computing facilities available for performing the distributed sequence alignment are an 80 node Beowulf cluster with dual Intel Xeon 3.0 GHz processors. 
The average time required for querying a set of de novo sequences derived from a tandem mass spectrum is about 18 s.
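The mass bookkeeping of Section 3.3 lends itself to a short sketch. The constants below are the values quoted in the text (5 Da windows from 450 to 2735 Da, a 2.0 Da edge margin, a 2.0 Da mass tolerance, and a penalty to 10% of the nominal score). The function names are hypothetical, and the penalty is applied here to total peptide masses, which simplifies the alignment-position bookkeeping described above.

```python
WINDOW = 5.0                 # width of each peptide database, in Da
LOW, HIGH = 450.0, 2735.0    # overall mass range covered by the databases
EDGE = 2.0                   # distance to a window edge that adds a second search
MASS_TOL = 2.0               # allowed mass difference at the end of an alignment
PENALTY = 0.10               # a penalized score keeps 10% of its nominal value


def select_windows(parent_mass):
    """Return the (low, high) mass windows to search for a parent mass."""
    if not LOW <= parent_mass < HIGH:
        return []
    idx = int((parent_mass - LOW) // WINDOW)
    lo = LOW + idx * WINDOW
    windows = [(lo, lo + WINDOW)]
    if parent_mass - lo <= EDGE and lo > LOW:
        # Close to the lower edge: also search the window below.
        windows.insert(0, (lo - WINDOW, lo))
    elif (lo + WINDOW) - parent_mass <= EDGE and lo + WINDOW < HIGH:
        # Close to the upper edge: also search the window above.
        windows.append((lo + WINDOW, lo + 2 * WINDOW))
    return windows


def penalize(score, query_mass, library_mass):
    """Reduce the score to 10% if the masses differ by more than 2.0 Da."""
    if abs(query_mass - library_mass) > MASS_TOL:
        return PENALTY * score
    return score


print(select_windows(1001.0))   # [(995.0, 1000.0), (1000.0, 1005.0)]
print(select_windows(1002.5))   # [(1000.0, 1005.0)]
```

A parent mass near the middle of a window triggers a single database search, while one near an edge triggers two, so at most two of the small databases are read per spectrum.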

4. Computational Studies

We applied the proposed hybrid methodology to 38 experimentally validated quadrupole time-of-flight (QTOF) tandem mass spectra (ref 22) and a test set of 380 annotated OrbiTrap tandem mass spectra. The results from the proposed method are compared with Mascot (ref 4) for both sets of data and with the modified version of PepNovo (ref 5), InsPecT (ref 6), and CIDentify (ref 7) for the OrbiTrap spectra. PILOT_SEQUEL, InsPecT, and CIDentify were searched against the nonredundant database, and Mascot was searched against the NCBI protein database. We also attempted to compare against the methods OpenSea (ref 15) and SPIDER (ref 17), but both methods were unavailable at the time this article was written.

4.1. QTOF Spectra: Comparisons with Mascot (ref 4). The proposed hybrid method was tested on a benchmark set of 38 experimentally validated, doubly charged quadrupole time-of-flight spectra that were previously analyzed by PILOT (ref 2). These spectra were collected with Q-TOF2 and Q-TOF-Global mass spectrometers for a control mixture of four known proteins: alcohol dehydrogenase (yeast), myoglobin (horse), albumin (bovine, BSA), and cytochrome C (horse). In the case study presented in ref 2, PILOT correctly identified 65.8% of the peptides, both exactly and when allowing for one incorrect residue (see Table 2). The accuracy is 86.8% when allowing for two incorrect residues and 92.1% when allowing for three. The overall residue prediction accuracy for PILOT on this test set was 91.1%. To test the ability of the proposed framework, we queried the de novo sequences generated by the original version of PILOT (ref 2) against the nonredundant protein database. The results for PILOT, PILOT_SEQUEL, and Mascot for these 38 spectra are shown in Table 2.

Table 2. Identification Rates for Quadrupole Time-of-Flight Spectra

| | Mascot | Mascot (semi trypsin) | PILOT De Novo | PILOT_SEQUEL |
|---|---|---|---|---|
| correct identifications | 26 (0.684) | 28 (0.737) | 25 (0.658) | 36 (0.947) |
| within 1 residue | 27 (0.711) | 29 (0.763) | 25 (0.658) | 36 (0.947) |
| within 2 residues | 28 (0.737) | 31 (0.816) | 33 (0.868) | 38 (1.00) |
| within 3 residues | 28 (0.737) | 31 (0.816) | 35 (0.921) | 38 (1.00) |
| total correct residues | 332 (0.794) | 368 (0.885) | 381 (0.911) | 414 (0.990) |

Table 3. Identification Rates for OrbiTrap Spectra

| | Mascot | InsPecT | InsPecT (L = 6) | CIDentify (PILOT) | PILOT_SEQUEL |
|---|---|---|---|---|---|
| correct identifications | 286 (0.753) | 280 (0.737) | 264 (0.695) | 298 (0.784) | 352 (0.926) |
| within 1 residue | 287 (0.755) | 294 (0.774) | 267 (0.703) | 299 (0.787) | 352 (0.926) |
| within 2 residues | 289 (0.760) | 351 (0.924) | 322 (0.847) | 313 (0.824) | 356 (0.936) |
| within 3 residues | 289 (0.760) | 352 (0.926) | 323 (0.850) | 318 (0.837) | 357 (0.939) |
| total correct residues | 3638 (0.834) | 4045 (0.927) | 3806 (0.872) | 3841 (0.880) | 4159 (0.953) |

From Table 2, we observe that the predictions for the hybrid methodology were almost perfect: only two predictions were incorrect, and both were due to the incorrect assignment of the N-terminal amino acid pairs. The predictions for Mascot on the same set of tandem mass spectra resulted in only 26 out of the 38 peptides being correctly identified. Visual inspection of the sequences of the four proteins in the control mixture shows that 9 of the 12 peptides that were incorrectly predicted by Mascot did not have tryptic N-termini. This set of tandem mass spectra illustrates how database lookup is limited by the constraints of tryptic specificity. To address this issue in the database predictions, we also performed the Mascot database search with the semiTrypsin option, which allows one of the termini to be a nontryptic cleavage. The results for the semiTrypsin Mascot search are included in column 2 of Table 2, where it is shown that this option only increases the identification rate from 68.4% to 73.7%. The inherent tradeoff in using this option is that it allows for half-tryptic peptides but also introduces a larger number of potential sequences into the search space, which can introduce false-positive identifications.

We also examined how many identifications were within 1, 2, and 3 incorrect amino acids since short sequences can be highly homologous in large protein databases. Within two incorrect amino acids, PILOT_SEQUEL has a perfect identification rate. Mascot’s identification rate increases by two predictions to 73.7%, and the half-tryptic Mascot search increases the number of correct identifications from 28 to 31 (81.6%). PILOT_SEQUEL has a residue prediction accuracy of 99% for the 418 residues in the QTOF set, and Mascot correctly predicts 79.4% and 88.5% of these amino acids for the tryptic and half-tryptic searches, respectively.
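The "within 1, 2, and 3 residues" rows of the identification tables can be reproduced from per-spectrum error counts. The sketch below is an assumption about how such counts might be obtained, not the scoring script used for the tables: it compares sequences position by position, treats leucine and isoleucine as equivalent, and counts any length difference as additional errors. The function names and the example peptides are illustrative.

```python
def normalize(seq):
    # Leucine and isoleucine have identical masses, so treat them as one residue.
    return seq.replace("I", "L")


def incorrect_residues(predicted, correct):
    """Count position-wise mismatches; a length difference counts as errors."""
    p, c = normalize(predicted), normalize(correct)
    mismatches = sum(a != b for a, b in zip(p, c))
    return mismatches + abs(len(p) - len(c))


def identification_rates(pairs, max_errors=3):
    """Fraction of predictions with at most k incorrect residues, k = 0..max_errors."""
    errors = [incorrect_residues(p, c) for p, c in pairs]
    return [sum(e <= k for e in errors) / len(errors) for k in range(max_errors + 1)]


# Illustrative (predicted, correct) pairs; not taken from the test sets.
pairs = [
    ("HLVDEPQNLIK", "HLVDEPQNLIK"),   # exact match
    ("LVNELTEFAK", "VLNELTEFAK"),     # swapped N-terminal pair: 2 errors
    ("YIYEIAR", "YLYELAR"),           # I/L differences only: 0 errors
    ("GGADEK", "NADEKK"),             # shifted sequence: 5 errors
]
rates = identification_rates(pairs)
print(rates)   # [0.5, 0.5, 0.75, 0.75]
```

Note that a swapped pair of adjacent residues costs two errors under this counting, which is why a prediction with a misassigned N-terminal pair only appears in the "within 2 residues" row.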

4.2. OrbiTrap Spectra: Comparison with Database and Hybrid Methods. To validate the proposed approach on a large test set, we applied our hybrid methodology and other database and hybrid algorithms to the OrbiTrap tandem mass spectra of 380 doubly charged tryptic peptides. Stock solutions were prepared for a 16-protein mixture by including equal protein amounts from R-casein (bovine, Sigma C8032), R-casein (bovine, Sigma C6905), albumin (bovine serum, Sigma A4503), ovalbumin (chicken, A5503), R-lactalbumin (bovine, Sigma L5385), carbonic anhydrase (bovine, Sigma C3934), lysozyme (chicken, Sigma L6876), cytochrome c (horse, Sigma C2037), R-galactosidase (E. coli, Biochemika 48274), glycogen phosphorylase (rabbit, Sigma P6635), catalase (bovine, Sigma C1345), actin (bovine, Sigma A3653), glyceraldehyde-3-phosphate dehydrogenase (rabbit, Sigma G2267), R-A-Crystallin (bovine, Sigma C8991), myoglobin (horse, Sigma M0630), and transferrin (bovine, Sigma T1408). Urea was added to the protein mixture to denature the proteins. Proteins were then reduced with TCEP, alkylated using iodoacetamide (IAM), and subsequently digested with trypsin. The peptide mixture was analyzed by automated microcapillary liquid chromatography and an LTQ-Orbitrap hybrid mass spectrometer (ThermoFinnigan, San Jose, CA). Both MS and MS/MS spectra were recorded in the Orbitrap instrument.

The tandem mass spectra generated were searched against a target protein database containing the Schizosaccharomyces pombe protein database (downloaded from ftp://ftp.sanger.ac.uk/pub/yeast/pombe/Protein_data/pompep) and the 16 control proteins for a total of 5009 proteins. To calculate confidence levels and false-positive rates, we used a decoy database containing the reverse sequences of the 5009 proteins appended to the target database (ref 34) and the SEQUEST algorithm to find the best matching sequences from the combined database. No enzymatic cleavage conditions were imposed on the database search, and no differential modifications were considered. The validity of peptide/spectrum matches was assessed in DTASelect (ref 35) using SEQUEST-defined parameters, the cross-correlation score (XCorr), and the normalized difference in cross-correlation scores (DeltaCN). The distribution of XCorr and DeltaCN values for (a) direct and (b) decoy database hits was obtained, and the two subsets were separated by quadratic discriminant analysis. The discriminant score was set such that a false-positive rate of 0% was determined based on the number of accepted decoy database peptides. The 380 tandem mass spectra, the identified peptide sequences, and the predictions of all the methods are provided in the Supporting Information.

The results for PILOT_SEQUEL, InsPecT, and Mascot for the 380 OrbiTrap spectra are presented in Table 3. In Table 3, we observe that PILOT_SEQUEL has a 92.6% rate of correct identification for this set of spectra, while Mascot and InsPecT correctly identify 75.3% and 73.7% of the spectra, respectively. It should be noted here that for each identification, Mascot provides a scoring threshold to indicate positive identification. Out of the 380 tandem mass spectra, only 177 of the predictions exceeded their respective scoring thresholds. One of these confident predictions resulted in a false-positive identification. The remaining 110 spectra that were correctly predicted by Mascot exhibited scores less than the thresholds provided.

To investigate the influence of homologous proteins, we examined the effect of allowing for up to three incorrect residues in the identification. Table 3 reveals that allowing up to three incorrect residues only increases the identification rate by 1% for both Mascot and PILOT_SEQUEL, indicating that homologous peptides did not significantly contribute to the identifications. However, allowing up to three incorrect amino acid assignments significantly increases the identification rate for InsPecT to 92.6%. This dramatic increase indicates that homologous peptides are resulting in false identifications. In an attempt to improve the predictions of InsPecT, we specified a minimum peptide sequence tag length of six residues (the default value is three amino acids) since more accurate de novo coverage should in theory help to eliminate homologous proteins. However, forcing the InsPecT algorithm to generate sequence tags of length 6 decreased the identification rate to 69.5%, and only 85.0% were correct for up to three incorrect amino acids, which implies that the quality of their sequence tags decreases with tag length. It also demonstrates that accurate de novo sequences are extremely useful for distinguishing among homologous proteins in the database search. The residue prediction accuracy for Mascot, InsPecT, InsPecT for sequence tags of length 6, and PILOT_SEQUEL was 83.4%, 92.7%, 87.2%, and 95.3% for the 4364 residues, respectively.

It is also of interest to examine the quality of the de novo predictions necessary to achieve these results. Table 4 presents the statistics for the de novo sequences that correspond to the best database match for PILOT_SEQUEL reported against the correct sequences.

Table 4. De Novo Identification Rates for OrbiTrap Spectra

| | de novo PILOT_SEQUEL | allowing for missing N-term |
|---|---|---|
| correct identifications | 242 (0.637) | 269 (0.707) |
| within 1 residue | 252 (0.663) | 279 (0.734) |
| within 2 residues | 297 (0.782) | 324 (0.852) |
| within 3 residues | 306 (0.805) | 333 (0.876) |
| total correct residues | 3781/4364 (0.866) | 3781/4249 (0.890) |
The 242 correct identifications in Table 4 do not reflect the actual accuracy of the de novo sequences. These candidate sequences were not refined to address isobaric residues, which are accommodated in the alignment calculations. To examine the influence of isobaric residues in the de novo sequences, we also reported how many predictions were within 1, 2, and 3 incorrect amino acids. As seen in Table 4, the identification rate increases significantly from 63.7% to 80.5% when allowing for up to three incorrect residues. Even allowing for two incorrect residues in the prediction increases the identification rate by about 15% (from 63.7% to 78.2%). Another issue not represented in the 242 correct identifications is that several de novo sequences were missing entire N-terminal segments, which occurs when the upper bounding calculations chose to sequence after the first four or five N-terminal amino acids. The residue combinations corresponding to these missing N-terminal weights are not appended to the candidate sequences prior to the FASTA alignment as this would result in an unnecessarily large number of database queries. However, it is misleading to label these predictions as “incorrect” identifications since the portion of the sequence predicted by the de novo algorithm is exactly correct. Thus, we adjusted the statistics in the second column of Table 4 to count these 27 spectra, for which no N-terminal segment was predicted, as “correct” identifications. This results in a 7% increase in identification accuracy and corresponds to 87.6% of the peptide identifications being correct within three amino acids.

To compare the proposed method against a similar technique, we applied the modified version of the PepNovo algorithm (ref 5) that was trained on OrbiTrap tandem mass spectra to our 380 OrbiTrap tandem mass spectra. In their peptide identification method (ref 5), de novo sequences are used to perform direct table look-up in a protein database, using hashing or a suffix tree, for high-precision mass spectra. The parameters for their de novo algorithm were adjusted to generate 10 candidate sequences (at least 5 amino acids and fewer than 10 amino acids in length) for tandem MS acquired from an LTQ-Orbitrap instrument for tryptic peptides. The de novo sequences reported by PepNovo are considerably shorter than the ones predicted by PILOT_SEQUEL. PepNovo predicted only 2958 amino acids, whereas PILOT_SEQUEL predicted 4249 amino acids, which corresponds to 67.8% versus 97.4% of the 4364 residues in the test set. The database lookup algorithm used to query the PepNovo de novo sequences is not publicly available, so we directly compared each of the 10 de novo sequences per peptide against the actual sequence and reported the best possible match. The results for this analysis are presented in the second column of Table 5. In column 3, the accuracy over only the subsequences predicted by PepNovo is presented since the average de novo sequence length per peptide generated by PepNovo is 7.8 residues and the average length of the actual peptides is 11.5 amino acids. Furthermore, the full sequence was predicted by PepNovo for only 56 out of the 380 peptides. In Table 5, we see that only 69.7% of the best subsequences predicted by PepNovo are exactly correct.
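The direct comparison of several candidate subsequences against the actual peptide can be sketched as follows. The text does not specify how each candidate was positioned against the actual sequence, so this sketch makes an assumption: each candidate is slid along the actual peptide, matching residues are counted at every offset, and the best placement over all candidates is kept. The function name and the example sequences are hypothetical.

```python
def best_match(candidates, actual):
    """Return (matches, candidate, offset) for the best-placed candidate.

    Leucine and isoleucine are treated as equivalent. Ties keep the
    first candidate and offset encountered.
    """
    def norm(seq):
        return seq.replace("I", "L")

    target = norm(actual)
    best = (-1, None, None)
    for cand in candidates:
        c = norm(cand)
        # Slide the candidate along the actual peptide without gaps.
        for offset in range(len(target) - len(c) + 1):
            window = target[offset:offset + len(c)]
            matches = sum(a == b for a, b in zip(c, window))
            if matches > best[0]:
                best = (matches, cand, offset)
    return best


# Illustrative candidates for one peptide; not PepNovo output.
actual = "HLVDEPQNLIK"
candidates = ["LVDAPQ", "VDEPQNL", "DEPQNII"]
result = best_match(candidates, actual)
print(result)   # (7, 'VDEPQNL', 2)
```

Dividing the best match count by the candidate length gives a per-subsequence accuracy, while dividing by the length of the actual peptide gives a full-sequence accuracy, mirroring the distinction between the two de novo columns of Table 5.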
The identification rates increase considerably when allowing for 1, 2, and 3 incorrect amino acids because the predicted peptides are very short (with an average and median of 7.8 and 8 residues, respectively). To provide a fair basis of comparison, we implemented a database search algorithm that uses FASTA to align the sequence tag reported by PepNovo with a peptide in the protein database and also ensures that the N- and C-terminal masses flanking the sequence tag are consistent. One should note that this algorithm is more flexible than the direct database lookup method usually employed for these de novo sequences since the alignment method tolerates isobaric and incorrect amino acids. The corresponding predictions from using the PepNovo de novo sequences to search the protein database are presented in the final column in Table 5. An identification rate of 76.8% is achieved, which is a 7.1% increase over what would have resulted from using direct sequence comparisons. This rate increases to 78.4% when allowing for 1, 2, and 3 incorrect amino acids, which is consistent with the increase in rates for PILOT_SEQUEL and Mascot. It is interesting to note that the peptide subsequences generated by PepNovo have a residue accuracy of 89% for the 2958 residues predicted (out of a total of 4364 amino acids). When evaluating the corresponding database sequences, the amino acid accuracy of the peptide predictions is 82%.

Table 5. Identification Rates for PepNovo Algorithm for OrbiTrap Spectra

| | best de novo sequence | best de novo subsequence | CIDentify (PepNovo) | database match using FASTA |
|---|---|---|---|---|
| correct identifications | 48 (0.126) | 265 (0.697) | 287 (0.755) | 292 (0.768) |
| within 1 residue | 76 (0.200) | 293 (0.771) | 287 (0.755) | 292 (0.768) |
| within 2 residues | 132 (0.347) | 326 (0.858) | 291 (0.766) | 296 (0.780) |
| within 3 residues | 164 (0.432) | 344 (0.905) | 291 (0.766) | 298 (0.784) |
| total correct residues | 2633/4364 (0.603) | 2633/2958 (0.890) | 3544/4364 (0.812) | 3564/4364 (0.820) |

To compare the alignments achieved by PILOT_SEQUEL to existing techniques for de novo sequences, we used CIDentify to query the de novo sequences generated by PILOT and PepNovo against the nonredundant protein database. The results of CIDentify for the PILOT and PepNovo de novo sequences are presented in Tables 3 and 5, respectively. For the PILOT de novo sequences, we see in Table 3 that the identification accuracy of CIDentify is 14% less than that of PILOT_SEQUEL (i.e., 78.4% versus 92.6%) as it correctly identifies 298 out of the 380 tandem mass spectra. Allowing for up to three incorrect amino acids only increases the identification rate from 78.4% to 83.7%, and the residue composition accuracy is 88% for CIDentify using the PILOT de novo sequences. This difference in prediction accuracy between PILOT_SEQUEL and CIDentify clearly demonstrates that our modifications to FASTA are more effective in aligning the de novo sequences of PILOT and are different from those implemented by CIDentify. In Table 5, it is shown that CIDentify and the modified FASTA tag-algorithm, which was implemented specifically for the PepNovo de novo sequences, result in similar identification accuracies (75.5% and 76.8%, respectively), with the latter method performing slightly better. The similarity of these results is due to the fact that the majority of the misaligned de novo sequences generated by PepNovo were of poor sequence quality. That is, there was not enough correct de novo information to make a confident alignment, and incorrectly predicted amino acid subsequences exhibited high-scoring matches to different proteins in the nonredundant database.

The residue prediction accuracy over subsequences of a specified length for all the methods is presented in Figure 2. To complement the database predictions, we have also reported the corresponding de novo sequences for PILOT and PepNovo in Figure 2 (see dashed lines). PILOT_SEQUEL exhibits a peptide identification accuracy of 99% for subsequence lengths of three amino acids, and this identification rate levels off to 93% for subsequence lengths of 10 residues. The differences in the curves between PILOT_SEQUEL and PILOT are a measure of the accuracy gained by utilizing FASTA to perform local database search for peptide identification. It is interesting to examine the curve in Figure 2 for PepNovo’s best de novo sequence, which corresponds to the most accurate de novo sequence out of the 10 that were reported per peptide. This curve attains a maximum identification rate of 86.8% for subsequence lengths of three amino acids, which implies that using a direct lookup method like hashing would result in a maximum peptide identification accuracy of 86.8% since exact residue matches would be required (one should note that in the analysis presented leucine and isoleucine were treated as equivalent residues). The FASTA database matches corresponding to the PepNovo de novo sequences outperform the predictions from Mascot for subsequence lengths greater than five amino acids and are consistently close to 80% accuracy. One can see from Figure 2 that the alignment results for CIDentify using the PILOT de novo sequences are approximately 8 to 12% worse than PILOT_SEQUEL over the entire subsequence domain. The results for InsPecT achieve a maximum of 99% for a subsequence length of 3 but then steadily decrease to 82.4% for a subsequence length of 10. This trend also suggests that the smaller sequence tags generated by InsPecT are not enough to distinguish between homologous peptides in the protein database search. However, when forcing InsPecT to generate sequence tags of length 6, it is shown in Figure 2 that the method performs worse due to inaccuracies in the longer sequence tags.
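A curve of accuracy versus subsequence length can be computed along the following lines. The exact metric behind Figure 2 is not spelled out here, so this sketch adopts one plausible reading as an explicit assumption: a prediction counts as correct at length k if it shares a contiguous run of at least k residues with the correct peptide, with leucine and isoleucine treated as equivalent. The function names and example pairs are illustrative.

```python
def longest_common_run(a, b):
    """Length of the longest contiguous run shared by a and b (I/L equivalent)."""
    a, b = a.replace("I", "L"), b.replace("I", "L")
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous pair of positions.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def subsequence_accuracy(pairs, lengths=range(3, 11)):
    """Fraction of predictions sharing a run of at least k residues, per k."""
    runs = [longest_common_run(p, c) for p, c in pairs]
    return {k: sum(r >= k for r in runs) / len(runs) for k in lengths}


# Illustrative (predicted, correct) pairs; not taken from the test set.
pairs = [
    ("HLVDEPQNLIK", "HLVDEPQNLIK"),   # shared run of 11
    ("LVNEAAEFAK", "LVNELTEFAK"),     # shared run of 4
    ("GGYEIAR", "YLYELAR"),           # shared run of 5
    ("AAAA", "YLYELAR"),              # shared run of 1
]
curve = subsequence_accuracy(pairs)
print(curve[3], curve[5], curve[10])   # 0.75 0.5 0.25
```

Under this reading the curve can only stay flat or fall as k grows, which matches the monotone decline reported for InsPecT and the plateau reported for PILOT_SEQUEL.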

Figure 2. Comparison of correct subsequences of varying lengths for OrbiTrap spectra. All predictions are compared to the correct peptide sequence. The “PILOT De Novo” sequences are the de novo sequences generated by PILOT that correspond to the best database match. The “PepNovo Best De Novo” sequences correspond to the most accurate de novo sequence out of the 10 that were reported per peptide by PepNovo.

5. Conclusions

A novel method for peptide identification via integer linear optimization and local database search, PILOT_SEQUEL, was proposed for the identification of peptides using QTOF or OrbiTrap tandem mass spectrometry. Candidate sequences generated from a modified version of the de novo algorithm PILOT (refs 1 and 2) are used to query the nonredundant protein database to resolve low-confidence amino acids. The source code for the local alignment database search program, FASTA, was modified to emphasize the conservation of mass between the query and the template sequence and to allow isobaric alignments without incurring the associated gap penalties. A diagonal scoring matrix enforces exact residue matches, and high-confidence residues are given a larger reward to bias the database search. The computational efficiency of the alignment was improved via distributed computing and by reformatting the nonredundant protein database into smaller peptide databases. It was demonstrated that the proposed hybrid method exhibits excellent prediction accuracy for a test set of 38 quadrupole time-of-flight and 380 OrbiTrap tandem mass spectra and is very useful for validating identifications from Mascot that are not statistically significant.

Acknowledgment. C.A.F. gratefully acknowledges financial support from the National Institutes of Health (R01LM009338) and the US Environmental Protection Agency, EPA (R 832721-010). B.L. is supported by a CFFT computational fellowship (BALCH05 × 5), and J.R.Y. gratefully acknowledges financial support from the National Institutes of Health (5R01 MH067880 and P41 RR11823). Although the research described in the article has been funded in part by the U.S. Environmental Protection Agency’s STAR program through grant (R 832721-010), it has not been subjected to any EPA review and therefore does not necessarily reflect the views of the Agency, and no official endorsement should be inferred.

Supporting Information Available: The 380 annotated OrbiTrap tandem mass spectra and corresponding algorithmic results for the quadrupole time-of-flight and OrbiTrap tandem mass spectra. This material is available free of charge via the Internet at http://pubs.acs.org.

References
(1) DiMaggio, P. A.; Floudas, C. A. A mixed-integer optimization framework for de novo peptide identification. AIChE J. 2007, 53 (1), 160–173.
(2) DiMaggio, P. A.; Floudas, C. A. De novo peptide identification via tandem mass spectrometry and integer linear optimization. Anal. Chem. 2007, 79, 1433–1446.
(3) Pearson, W.; Lipman, D. Improved tools for biological sequence comparison. PNAS 1988, 85, 2444–2448.
(4) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 1999, 20, 3551–3567.
(5) Frank, A.; Savitski, M.; Nielsen, M.; Zubarev, R.; Pevzner, P. De novo peptide sequencing and identification with precision mass spectrometry. J. Proteome Res. 2007, 6, 114–123.
(6) Tanner, S.; Shu, H.; Frank, A.; Wang, L.-C.; Zandi, E.; Mumby, M.; Pevzner, P. A.; Bafna, V. Identification of posttranslationally modified peptides from tandem mass spectra. Anal. Chem. 2005, 77, 4626–4639.
(7) Taylor, J. A.; Johnson, R. S. Sequence database searches via de novo peptide sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 1997, 11, 1067–1075.
(8) Mann, M.; Wilm, M. Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 1994, 66, 4390–4399.
(9) Tabb, D. L.; Saraf, A.; Yates, J. R. High-throughput sequence tagging via an empirically derived fragmentation model. Anal. Chem. 2003, 75, 6415–6421.
(10) Sunyaev, S.; Liska, A. J.; Golod, A.; Shevchenko, A.; Shevchenko, A. Multiple error-tolerant sequence tag search for the sequence-similarity identification of proteins by mass spectrometry. Anal. Chem. 2003, 75, 1307–1315.
(11) Taylor, J. A.; Johnson, R. S. Implementation and uses of automated de novo peptide sequencing by tandem mass spectrometry. Anal. Chem. 2001, 73, 2594–2604.
(12) Shevchenko, A.; Sunyaev, S.; Loboda, A.; Shevchenko, A.; Bork, P.; Ens, W.; Standing, K. Charting the proteomes of organisms with unsequenced genomes by MALDI-quadrupole time-of-flight mass spectrometry and BLAST homology searching. Anal. Chem. 2001, 73, 1917–1926.
(13) Wielsch, N.; Thomas, H.; Surendranath, V.; Waridel, P.; Frank, A.; Pevzner, P.; Shevchenko, A. Rapid validation of protein identifications with borderline statistical confidence via de novo sequencing and MS BLAST searches. J. Proteome Res. 2006, 5, 2448–2456.
(14) Mackey, A. J.; Haystead, T. A. J.; Pearson, W. R. Getting more for less: algorithms for rapid protein identification with multiple short peptide sequences. Mol. Cell Proteomics 2002, 1, 139–147.
(15) Searle, B. C.; Dasari, S.; Turner, M.; Reddy, A. P.; Choi, D.; Wilmarth, P. A.; McCormack, A. L.; David, L. L.; Nagalla, S. R. High-throughput identification of proteins and unanticipated sequence modifications using a mass-based alignment algorithm for MS/MS de novo sequencing results. Anal. Chem. 2004, 76, 2220–2230.
(16) Searle, B. C.; Dasari, S.; Wilmarth, P. A.; Turner, M.; Reddy, A. P.; David, L. L.; Nagalla, S. R. Identification of protein modifications using MS/MS de novo sequencing and the OpenSea alignment algorithm. J. Proteome Res. 2005, 4, 546–554.
(17) Han, Y.; Ma, B.; Zhang, K. Software for protein identification from sequence tags with de novo sequencing errors. J. Bioinf. Comp. Bio. 2005, 3 (3), 697–716.
(18) Frank, A.; Pevzner, P. De novo peptide sequencing via probabilistic network modeling. Anal. Chem. 2005, 77 (4), 964–973.
(19) CPLEX. ILOG CPLEX 9.0 User’s Manual; 2005.
(20) Floudas, C. A. Nonlinear and Mixed-Integer Optimization; Oxford University Press: New York, 1995.
(21) Nemhauser, G. L.; Wolsey, L. A. Integer and Combinatorial Optimization; John Wiley and Sons, Inc.: New York, 1988.
(22) Ma, B.; Zhang, K. Z.; Hendrie, C.; Liang, C.; Li, M.; Doherty-Kirby, A.; Lajoie, G. Powerful software for peptide de novo sequencing by tandem mass spectrometry. Rapid Commun. Mass Spectrom. 2003, 17, 2337–2342.
(23) Dancik, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. A. De novo peptide sequencing via tandem mass spectrometry. J. Comp. Biol. 1999, 6 (3), 327–342.
(24) Schrijver, A. Theory of Linear and Integer Programming; John Wiley and Sons: New York, 1986.
(25) Floudas, C. A.; Grossmann, I. E. Synthesis of flexible heat exchanger networks with uncertain flowrates and temperatures. Comput. Chem. Eng. 1987, 11 (4), 319–336.
(26) Floudas, C. A.; Anastasiadis, S. H. Synthesis of general distillation sequences with several multicomponent feeds and products. Chem. Eng. Sci. 1988, 43 (9), 2407–2419.
(27) Paules, G. E., IV; Floudas, C. A. Algorithmic development methodology for discrete-continuous optimization problems. Oper. Res. J. 1989, 37 (6), 902–915.
(28) Floudas, C. A.; Ciric, A. R. A retrofit approach of heat exchanger networks. Comput. Chem. Eng. 1989, 13 (6), 703–715.
(29) Aggarwal, A.; Floudas, C. A. Synthesis of general separation sequences - nonsharp separations. Comput. Chem. Eng. 1990, 14 (6), 631–653.
(30) Kokossis, A. C.; Floudas, C. A. Optimization of complex reactor networks-II: nonisothermal operation. Chem. Eng. Sci. 1994, 49 (7), 1037–1051.
(31) Kinter, M.; Sherman, N. E. Protein Sequencing and Identification using Tandem Mass Spectrometry; John Wiley and Sons Inc.: New York, NY, 2000.
(32) Smith, T.; Waterman, M. S. Identification of common molecular subsequences. J. Mol. Biol. 1981, 147, 195–197.
(33) Crainic, T. G.; Le Cun, B.; Roucairol, C. Parallel Combinatorial Optimization; John Wiley and Sons: New York, 2006.
(34) Peng, J.; Elias, J. E.; Thoreen, C. C.; Licklider, L. J.; Gygi, S. P. Evaluation of multidimensional chromatography coupled with tandem mass spectrometry (LC/LC-MS/MS) for large-scale protein analysis: the yeast proteome. J. Proteome Res. 2003, 2 (1), 43–50.
(35) Tabb, D. L.; McDonald, W. H.; Yates, J. R., III. Contrast: tools for assembling and comparing protein identification from shotgun proteomics. J. Proteome Res. 2002, 1, 21–26.

PR700577Z

Journal of Proteome Research • Vol. 7, No. 4, 2008 1593