Anal. Chem. 2004, 76, 6374-6383

De Novo Peptide Sequencing Based on a Divide-and-Conquer Algorithm and Peptide Tandem Spectrum Simulation

Zhongqi Zhang*

Analytical Sciences, Amgen Inc., One Amgen Center Drive, Thousand Oaks, California 91320

Mass spectrometry-based de novo peptide sequencing is generally more reliable on high-resolution instruments owing to their resolving power and mass accuracy. On a lower resolution instrument such as the more widely used quadrupole ion trap, de novo peptide sequencing is less reliable or requires additional MS3 experiments. However, peptide CID spectra have been demonstrated to be quite reproducible on an ion trap instrument and can be predicted with good accuracy. A new de novo peptide sequencing technique, DACSIM, which combines a divide-and-conquer algorithm for deriving sequence candidates with spectrum simulation for sequence refinement, has been developed for spectra acquired on an ion trap instrument. When DACSIM was used to sequence peptides of 500-1900 u generated from proteolytic digests of hemoglobin and myoglobin, the success rate was 70% with a false positive rate of only 6% when isoleucine and leucine residues were not distinguished.

Peptide tandem mass spectra generated by low-energy collision-induced dissociation (CID) have been widely used for peptide sequence confirmation and protein identification by matching these spectra to sequences in a protein database (refs 1-6). When a database search fails to identify a peptide, de novo sequencing is often performed to derive the sequence or partial sequence of the peptide (ref 7). Although some peptide modification schemes, either chemical derivatization (refs 8-13) or isotopic labeling (refs 14-20), have been used to simplify spectrum interpretation, de novo sequencing from tandem spectra of unmodified peptides is often preferred. First, the extra wet chemistry step is undesirable, especially when the sample is limited. Second, de novo sequencing is often wanted only after a database search has failed to identify the precursor ion. In that case, the tandem mass spectrum of the unmodified peptide is already available, and it is more desirable to derive the de novo sequence directly from this spectrum.

A number of algorithms have been reported to derive the peptide sequence from tandem spectra of unmodified peptides (refs 21-34). Most of these algorithms generate high-score sequences directly from the spectrum using graph theory approaches (refs 23-31). The reliability of these peptide sequences depends highly on the scoring schemes used, and most algorithms use limited, if any, information regarding the chemical properties of the peptide in their scoring schemes. For reliable de novo sequencing, many algorithms require a high-resolution mass spectrometer with high mass accuracy (refs 31-33). With high-resolution mass analyzers, such as time-of-flight, magnetic sector, or Fourier transform ion cyclotron resonance mass spectrometers, the charge states of all fragment ions can be unambiguously determined. In addition, the accurate mass of precursor and product ions determined on these types of instrument makes amino acid assignments more reliable.

On a low-resolution instrument, such as the widely used ion trap, the de novo sequencing result is usually less satisfactory because charge state and accurate mass information are lacking. However, two important advantages of an ion trap instrument, if utilized properly, may significantly improve its capability to sequence a peptide. The first advantage is its capability of performing multiple stages of mass spectrometry (MSn). It has been shown that de novo peptide sequencing can be greatly improved by performing several MS3 experiments on some fragment ions, followed by a two-dimensional fragment correlation analysis (ref 35). However, the requirement of MS3 experiments limits the application of this technique. The second important advantage, which is often neglected, is that peptide CID spectra generated on these instruments are often quite reproducible. That is, the relative intensities of fragment ions are relatively insensitive to instrument parameters and vary little even from instrument to instrument. This property makes it possible to predict peptide fragmentation patterns.

* E-mail: [email protected]. Fax: (805)480-1752.
(1) Yates, J. R. Electrophoresis 1998, 19, 893-900.
(2) Yates, J. R. J. Mass Spectrom. 1998, 33, 1-19.
(3) Chalmers, M. J.; Gaskell, S. J. Curr. Opin. Biotechnol. 2000, 11, 384-390.
(4) Aebersold, R.; Goodlett, D. R. Chem. Rev. 2001, 101, 269-295.
(5) Mann, M.; Hendrickson, R. C.; Pandey, A. Annu. Rev. Biochem. 2001, 70, 437-473.
(6) Aebersold, R.; Mann, M. Nature 2003, 422, 198-207.
(7) Standing, K. G. Curr. Opin. Struct. Biol. 2003, 13, 595-601.
(8) Roth, K. D. W.; Huang, Z. H.; Sadagopan, N.; Watson, J. T. Mass Spectrom. Rev. 1998, 17, 255-274.
(9) Keough, T.; Youngquist, R. S.; Lacey, M. P. Anal. Chem. 2003, 75, 156A-165A.
(10) Shen, T. L.; Huang, Z. H.; Laivenieks, M.; Zeikus, J. G.; Gage, D. A.; Allison, J. J. Mass Spectrom. 1999, 34, 1154-1165.
(11) Lindh, I.; Hjelmqvist, L.; Bergman, T.; Sjovall, A.; Griffiths, W. J. J. Am. Soc. Mass Spectrom. 2000, 11, 673-686.
(12) Munchbach, M.; Quadroni, M.; Miotto, G.; James, P. Anal. Chem. 2000, 72, 4047-4057.
(13) Sonsmann, G.; Romer, A.; Schomburg, D. J. Am. Soc. Mass Spectrom. 2002, 13, 47-58.
(14) Gaskell, S. J.; Haroldsen, P. E.; Reilly, M. H. Biomed. Environ. Mass Spectrom. 1988, 16, 31-3.
(15) Takao, T.; Hori, H.; Okamoto, K.; Harada, A.; Kamachi, M.; Shimonishi, Y. Rapid Commun. Mass Spectrom. 1991, 5, 312-5.
(16) Schnoelzer, M.; Jedrzejewski, P.; Lehmann, W. D. Electrophoresis 1996, 17, 945-53.
(17) Shevchenko, A.; Chernushevich, I.; Ens, W.; Standing, K. G.; Thomson, B.; Wilm, M.; Mann, M. Rapid Commun. Mass Spectrom. 1997, 11, 1015-1024.
(18) Uttenweiler-Joseph, S.; Neubauer, G.; Christoforidis, A.; Zerial, M.; Wilm, M. Proteomics 2001, 1, 668-682.
(19) Gu, S.; Pan, S. Q.; Bradbury, E. M.; Chen, X. Anal. Chem. 2002, 74, 5774-5785.
(20) Gu, S.; Pan, S. Q.; Bradbury, E. M.; Chen, X. A. J. Am. Soc. Mass Spectrom. 2003, 14, 1-7.
(21) Hamm, C. W.; Wilson, W. E.; Harvan, D. J. Comput. Appl. Biosci. 1986, 2, 115-118.
(22) Hines, W. M.; Falick, A. M.; Burlingame, A. L.; Gibson, B. W. J. Am. Soc. Mass Spectrom. 1992, 3, 326-336.
(23) Fernandez-de-Cossio, J.; Gonzalez, J.; Besada, V. Comput. Appl. Biosci. 1995, 11, 427-434.
(24) Taylor, J. A.; Johnson, R. S. Rapid Commun. Mass Spectrom. 1997, 11, 1067-1075.
(25) Fernandez-de-Cossio, J.; Gonzalez, J.; Betancourt, L.; Besada, V.; Padron, G.; Shimonishi, Y.; Takao, T. Rapid Commun. Mass Spectrom. 1998, 12, 1867-1878.
(26) Dancik, V.; Addona, T. A.; Clauser, K. R.; Vath, J. E.; Pevzner, P. A. J. Comput. Biol. 1999, 6, 327-342.
(27) Chen, T.; Kao, M. Y.; Tepel, M.; Rush, J.; Church, G. M. J. Comput. Biol. 2001, 8, 325-337.
(28) Taylor, J. A.; Johnson, R. S. Anal. Chem. 2001, 73, 2594-2604.
(29) Lubeck, O.; Sewell, C.; Gu, S.; Chen, X. A.; Cai, D. M. Proc. IEEE 2002, 90, 1868-1874.
(30) Lu, B. W.; Chen, T. J. Comput. Biol. 2003, 10, 1-12.
(31) Fernandez-de-Cossio, J.; Gonzalez, J.; Satomi, Y.; Shima, T.; Okumura, N.; Besada, V.; Betancourt, L.; Padron, G.; Shimonishi, Y.; Takao, T. Electrophoresis 2000, 21, 1694-1699.
(32) Ma, B.; Zhang, K. Z.; Hendrie, C.; Liang, C. Z.; Li, M.; Doherty-Kirby, A.; Lajoie, G. Rapid Commun. Mass Spectrom. 2003, 17, 2337-2342.
(33) Spengler, B. J. Am. Soc. Mass Spectrom. 2004, 15, 703-714.
(34) Demine, R.; Walden, P. Rapid Commun. Mass Spectrom. 2004, 18, 907-913.

Analytical Chemistry, Vol. 76, No. 21, November 1, 2004. 10.1021/ac0491206. © 2004 American Chemical Society. Published on Web 10/05/2004.
Much of the ion intensity information is often neglected when de novo sequencing is performed. In an earlier work, the author developed a kinetic model to simulate CID spectra of any singly or doubly charged peptide in an ion trap mass spectrometer (ref 36). This model can potentially be used to evaluate de novo peptide sequences, regardless of the algorithm used to derive them. In this work, the author developed a simple yet effective divide-and-conquer (DAC) algorithm to generate a list of sequence candidates, followed by peptide tandem spectrum simulation (SIM) to refine these candidates. The new algorithm, DACSIM, has proved robust and reliable and has been used routinely at Amgen to characterize unknown peptides.

(35) Zhang, Z. Q.; McElvain, J. S. Anal. Chem. 2000, 72, 2337-2350.
(36) Zhang, Z. Q. Anal. Chem. 2004, 76, 3908-3922.

EXPERIMENTAL SECTION

Human hemoglobin, horse apomyoglobin, and methylamine hydrochloride were purchased from Sigma (St. Louis, MO). Trypsin and endoproteinase Glu-C were purchased from Roche Diagnostics (Indianapolis, IN). Ultrapure urea was purchased from ICN Biomedicals (Aurora, OH). Acetonitrile was purchased from Burdick & Jackson (Muskegon, MI). Trifluoroacetic acid (TFA) was purchased from Pierce (Rockford, IL). Water was purified with a Milli-Q water purification system (Millipore, Bedford, MA).

Hemoglobin was digested with trypsin in a 0.1 M sodium phosphate buffer (pH 7.0) containing 1 mg/mL hemoglobin, 2 M urea, and 10 mM methylamine hydrochloride with a substrate-to-enzyme ratio of 50:1. The digestion was carried out at 37 °C for 16 h. Apomyoglobin was digested by endoproteinase Glu-C in a 0.1 M sodium phosphate buffer (pH 7.0) containing 1 mg/mL protein, 2 M urea, and 10 mM methylamine hydrochloride with a substrate-to-enzyme ratio of 50:1. The digestion was carried out at 25 °C for 16 h. To create shorter peptides suitable for de novo peptide sequencing, the Glu-C-digested apomyoglobin was further digested with trypsin (50:1 substrate-to-enzyme ratio) in the same buffer at 37 °C for 8 h.

The tryptic digest of hemoglobin and the Glu-C/tryptic digest of apomyoglobin (10 µg each) were analyzed on an LC/MS/MS system with a minibore reversed-phase column (Phenomenex Jupiter C18, 300-Å pore, 5-µm particle size, 250 × 2.0 mm) at a flow rate of 0.2 mL/min. Peptides were eluted with an acetonitrile gradient with 0.1% TFA in the mobile phase. The LC/MS system was an Agilent 1100 HPLC system (Palo Alto, CA) connected to a Thermo Finnigan LCQ DECA ion trap mass spectrometer (San Jose, CA) equipped with an electrospray ionization interface. Mass spectra were acquired in data-dependent mode including a zoom scan and MS/MS of the most intense ion. Dynamic exclusion was applied to the tryptic digest of hemoglobin. For the Glu-C/tryptic digest of apomyoglobin, five additional data-dependent LC/MS3 analyses were also performed, with one to five MS3 scans on the most intense one to five product ions, respectively. All tandem mass spectra were acquired in centroid mode with an isolation width of 2 u, activation q of 0.25, activation time of 30 ms, and normalized collision energy of 35%.

A data set containing 1303 CID spectra of known peptides was used to optimize the preprocessing parameters applied to the spectra. These spectra were collected on two Thermo Finnigan LCQ DECA mass spectrometers with normalized collision energy of 30-40%.
These peptides, 300-1900 u in mass, were generated by proteolytic digestion of proteins made at Amgen (Thousand Oaks, CA) and commercial proteins purchased from Sigma. Proteases used for digestion included trypsin from Roche or Sigma; endoproteinase Lys-C from Roche or Wako (Osaka, Japan); Glu-C, Asp-N, and Arg-C from Roche; and pepsin from Sigma. Spectra were collected with reversed-phase LC/MS/MS at a flow rate of ∼0.2 mL/min using an acetonitrile gradient with ∼0.1% TFA in the mobile phase.

COMPUTATIONAL METHODS

The de novo sequencing algorithm (DACSIM) described in this paper involves two steps. In the first step, a collection of sequence candidates is generated from the experimental tandem mass spectrum using a divide-and-conquer algorithm. In the second step, these candidates are refined by comparing their simulated tandem mass spectra (ref 36) to the experimental spectrum. The algorithm is described in detail as follows. The technique is designed for singly and doubly charged precursor ions. It may also be used to sequence triply charged peptides, although the results are usually less satisfactory than for singly or doubly charged ions.

Spectrum Preprocessing To Maximize Singly Charged b Ions. When deriving a sequence directly from a tandem spectrum, ideally, we need to know the ion type (b, y, etc.) of each fragment ion. However, this is not possible in most cases. For the convenience of deriving sequence candidates, it is assumed that all ions in the spectrum are b ions. To ensure that a b ion will be observed for every backbone cleavage, the original experimental spectrum needs to be preprocessed; most importantly, the complementary spectrum of the original spectrum is added to the original spectrum (see below). Note that if a cleavage at the backbone amide bond generates only a y ion, the corresponding b ion can be calculated from its complementary mass. When a y ion is incorrectly assumed to be a b ion, many incorrect sequences will be generated. Most of these incorrect sequences will be excluded from the final list of possible sequences when their simulated spectra are compared to the experimental spectrum.

The following describes the spectrum preprocessing procedure for more efficient and reliable sequence generation. The purpose of this procedure is to maximize singly charged b ions. Intensities of some ions are decreased or increased depending on how likely each ion is to be a singly charged b ion. The extent of the intensity change in each step was roughly optimized by analyzing a data set containing 1303 tandem spectra of known peptides. An ion does not have to be present in the original experimental spectrum to have its intensity increased. For a triply charged precursor ion, it is assumed that most b and y ions in its fragment spectrum are singly or doubly charged. As a result, its spectrum is preprocessed in the same way as that of a doubly charged precursor ion.

1. If a fragment ion has an m/z value corresponding to water or ammonia loss from the doubly or triply charged precursor ion, its intensity is reduced by a factor of 10.
2. If a fragment ion is potentially doubly charged (m/z in the range of 300 to M/2, where M is the mass of the doubly charged peptide), its intensity is decreased by a factor of 3. In addition, 2% of the ion intensity is added to the corresponding singly charged ion.
3. The intensities of these potentially doubly charged ions are further decreased if they meet additional conditions indicating that they are more likely to be doubly charged. A fragment ion is more likely to be doubly charged when one or both of the following conditions are met.
(a) The corresponding singly charged ion is present in the raw spectrum. In this case, the intensity of the doubly charged ion is further decreased by a factor of 3 and the intensity of the singly charged ion is increased by the geometric mean of the two ion intensities.
(b) The fractional mass of the fragment ion is inconsistent with a singly charged ion. The fractional mass of a singly charged ion is predicted by examining possible singly charged ions in the entire spectrum. If the fractional mass of the ion differs from the predicted fractional mass of singly charged ions by more than 0.15 u, its intensity is further decreased by a factor of 20. If it differs by more than 0.3 u, its intensity is decreased by a factor of 200 and 5% of its intensity is added to the corresponding singly charged ion.
All ions in the spectrum are considered singly charged after this step.
4. If a fragment ion is likely to be an isotopic peak of another ion, its intensity is decreased by a factor of 25.
5. If a fragment ion is likely to be a water loss, ammonia loss, or CO loss from another ion, its intensity is decreased by a factor of 5.
6. The complementary spectrum of the current preprocessed spectrum is generated by converting the mass of each ion to its complementary mass, while keeping the ion intensities unchanged.


The complementary mass of an ion is calculated by the following equation

$$m_{\mathrm{complementary}} = M + 2m_{\mathrm{H^+}} - m$$

where $M$ is the mass of the peptide, $m_{\mathrm{H^+}}$ is the mass of a proton (1.0073 u), and $m$ is the m/z of the observed fragment ion. The complementary spectrum is then added to the current preprocessed spectrum.
7. For any ion in the original spectrum, if its complementary ion exists, the geometric mean of the two ion intensities is added to both ions in the preprocessed spectrum. In an ion trap mass spectrometer under low-energy CID conditions, commonly observed fragment ions include b ions, y ions, a ions, and water/ammonia losses from these ions. Among these fragment ions, only singly charged b and y ions are complementary to each other. Therefore, this procedure enhances the intensities of both singly charged b and y ions.
8. The intensities of ions at m/z = 1 (corresponding to the "b_0" ion) and m/z = M - 17 (corresponding to the "b_n" ion, where n is the number of residues in the peptide) are increased to the intensity of the most intense ion in the preprocessed spectrum.
9. A spectrum that contains b ions only is generated. For any particular ion, if both its complementary ion and the ion corresponding to loss of CO (a ion) exist in the original spectrum, it is considered a b ion and included in the b-ion-only spectrum. The intensity of the b ion is calculated as the geometric mean of the a-ion intensity and the sum of the two complementary ion intensities. The b-ion-only spectrum is added to the current preprocessed spectrum to enhance the intensities of b ions. It is also used to determine the ion type (b or y) of any MS2/MS3 intersection spectrum (see step 11) if MS3 scans are collected.
10. If the user specifies any possible terminal residues, as determined by the specificity of the protease used to generate the peptide, the corresponding b ion is increased to the intensity of the most intense ion in the preprocessed spectrum. For example, if the user specifies that the C-terminal residue is likely either a lysine or an arginine residue, the intensities of ions at m/z = M - 145 and M - 173 (b_(n-1) ions) are increased to the maximum ion intensity in the spectrum. In this way, the algorithm gives more weight to peptides with a C-terminal lysine or arginine. Similarly, if the user specifies a likely N-terminal aspartic acid residue, the intensity of a fragment ion at m/z = 116 (b_1 ion) is increased to the maximum intensity.
11. The ability to distinguish different ion types can be significantly improved if one or more MS3 experiments are performed on any fragment ions (ref 35). When such experiments are available, the symmetrized 2-D fragment correlation spectrum is generated according to the procedure described previously (ref 35). The ion type (b or y) of each scan in the 2-D fragment correlation spectrum is determined by comparing it to the b-ion-only spectrum (see step 9). For b-type scans, their spectra are added directly to the preprocessed spectrum. For y-type scans, their complementary spectra are added to the preprocessed spectrum.

Figure 1 shows a preprocessed spectrum of a peptide when no MS3 experiments were performed. The bottom of Figure 1 is the original CID spectrum of peptide VLGAFSDGLAHLDNLK (mass 1668.9 u, doubly charged, normalized collision energy 35%). The top of Figure 1 is the preprocessed spectrum. Some major fragment ions are labeled in both spectra. Many b and y ions are enhanced in the preprocessed spectrum. Note that addition of the two complementary spectra makes the intensity of any b ion the same as the intensity of its complementary y ion, with the exception that when an a ion is observed, the corresponding b ion is more intense than its complementary y ion.

Figure 1. Preprocessed spectrum (top) as compared to experimental tandem spectrum (bottom) of peptide VLGAFSDGLAHLDNLK (2+).

Divide-and-Conquer Algorithm To Derive Sequence Candidates. Because fragment ions corresponding to cleavages at some of the peptide bonds along the polypeptide chain are frequently absent in peptide fragmentation spectra (missing cleavages), an ideal way to derive the list of sequence candidates from a spectrum is to find all possible sequences that match the peptide mass. This approach would effectively solve the missing cleavage problem. However, it is not practical except for very short peptides because the number of possible sequences grows exponentially with the peptide mass. In this work, a divide-and-conquer algorithm was designed to derive sequence candidates from the preprocessed spectrum. A divide-and-conquer algorithm breaks a larger problem into two or more subproblems that are similar to the original problem but smaller in size, solves the subproblems recursively, and then combines these solutions to create a solution to the original problem. For de novo sequencing, the algorithm uses a recursive function to divide the spectrum into smaller and smaller subspectra until a subspectrum is small enough that an exhaustive search of all possible sequences can be practically performed. The input of the recursive function is a spectrum segment, and the output is a list of sequences with scores. The recursive function, summarized in Figure 2, is described in the following.
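As a rough orientation for the recursive function whose details follow, the recursion can be sketched in Python. This is only an illustration under stated assumptions, not the MassAnalyzer implementation: it uses nominal integer masses (so I/L and K/Q are merged, as in the DAC sequences), and the names `dac`, `exhaustive`, `MISSING_PENALTY`, `n_pivots`, and `keep` are hypothetical, as are the penalty value, the truncation of each candidate list to `keep` entries, and the choice to count expected b ions above `b_start` up to and including `b_end` so that a pivot is not scored twice. The 57 u and 372 u thresholds, the selection of the most intense ions as pivots, and the removal of each pivot's complementary ion are taken from the description in the text.

```python
from math import inf

# Nominal residue masses (u). "I" stands for I/L and "K" for K/Q.
RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "N": 114, "D": 115, "K": 128, "E": 129, "M": 131,
    "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}
MIN_RESIDUE = 57        # glycine: a smaller segment cannot hold a residue
EXHAUSTIVE_LIMIT = 372  # two tryptophans: search exhaustively at or below this
PROTON = 1              # nominal proton mass
MISSING_PENALTY = 0.5   # hypothetical penalty for an absent expected b ion


def exhaustive(spectrum, b_start, b_end, keep):
    """Score every residue sequence whose total mass equals b_end - b_start."""
    target = b_end - b_start
    found = []

    def extend(seq, mass, score):
        if mass == target:
            found.append((seq, score))
            return
        for residue, m in RESIDUE_MASS.items():
            rest = target - mass - m
            if rest < 0 or 0 < rest < MIN_RESIDUE:
                continue  # overshoots, or leaves a gap no residue can fill
            ion = b_start + mass + m  # expected b ion after adding this residue
            gain = spectrum.get(ion, -MISSING_PENALTY)
            extend(seq + residue, mass + m, score + gain)

    extend("", 0, 0.0)
    found.sort(key=lambda pair: pair[1], reverse=True)
    return found[:keep]


def dac(spectrum, b_start, b_end, peptide_mass, n_pivots=3, keep=50):
    """Return [(sequence, score)] for the segment (b_start, b_end), best first."""
    span = b_end - b_start
    if span < MIN_RESIDUE:
        return []
    if span <= EXHAUSTIVE_LIMIT:
        return exhaustive(spectrum, b_start, b_end, keep)
    inside = {mz: i for mz, i in spectrum.items() if b_start < mz < b_end}
    pivots = sorted(inside, key=inside.get, reverse=True)[:n_pivots]
    best = {}
    for pivot in pivots:
        # The pivot is assumed to be a b ion, so its complement must be a y ion
        # and is dropped before the two halves are searched.
        complement = peptide_mass + 2 * PROTON - pivot
        segment = {mz: i for mz, i in spectrum.items()
                   if mz != complement or mz == pivot}
        left = dac(segment, b_start, pivot, peptide_mass, n_pivots, keep)
        right = dac(segment, pivot, b_end, peptide_mass, n_pivots, keep)
        for left_seq, left_score in left:
            for right_seq, right_score in right:
                seq, score = left_seq + right_seq, left_score + right_score
                if score > best.get(seq, -inf):
                    best[seq] = score
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:keep]


if __name__ == "__main__":
    # Synthetic b-ion-only spectrum for VDPVNFK (nominal mass 817 u).
    demo = {1: 1.0, 100: 1.0, 215: 1.0, 312: 1.0,
            411: 1.0, 525: 1.0, 672: 1.0, 800: 1.0}
    print(dac(demo, 1, 817 - 17, 817)[0])
```

The top-level call passes the full range from b_0 (m/z 1) to b_n (m/z M - 17). The scores here are raw intensity sums; normalizing them to a fraction of the total ion intensity, as the DAC score is defined, is left out of the sketch.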
The input to the top-level function is the preprocessed spectrum with the mass range from b_0 (m/z = 1) to b_n (m/z = M - 17), represented here as spectrum (b_0, b_n), and the output of the top-level function is the list of full-length sequence candidates (DAC sequences) and their respective scores. The score for each sequence is the sum of the intensities of all expected b ions for this sequence. The final score (DAC score) of each full-length sequence is expressed as a fraction of the total ion intensity in the spectrum. For a spectrum segment (b_start, b_end) with a mass range from b_start to b_end, the following applies:
1. If the mass range (b_end - b_start) of the spectrum segment is smaller than the smallest residue mass (usually 57 u, the residue mass of glycine), the recursive function stops with no output.
2. If b_end - b_start is small enough (usually ≤372 u, corresponding to two tryptophan residues), an exhaustive search is performed to find all possible sequences whose total residue mass matches b_end - b_start. In an exhaustive search, all possible sequences are first generated from all permutations of amino acid residues by adding one residue at a time until the total residue mass exceeds b_end - b_start. For each sequence, if its total residue mass equals b_end - b_start, the sequence is recorded, together with a score for the sequence. The score for each sequence is calculated by summing the intensities of all expected b ions in the spectrum segment. If an expected b ion is not present in the spectrum segment, an appropriate penalty is subtracted from the score. These recorded sequences, sorted according to their scores, are the output of the function.
3. If b_end - b_start is too large (>372 u) for a practical exhaustive search, several ions in the spectrum segment are selected as the pivot ions (b_pivot). The most intense ions in the spectrum segment are usually selected as the pivot ions. Each pivot ion is assumed to be a b ion; thus, its complementary ion (which must be a y ion), if within the mass range of the spectrum segment, is removed from the spectrum segment. A pivot ion divides the spectrum segment into two smaller spectrum segments (b_start, b_pivot) and (b_pivot, b_end),


Figure 2. Divide-and-conquer algorithm to derive sequence candidates from a preprocessed tandem mass spectrum.

Table 1. Top Ten DAC Sequences and DACSIM Sequences and Their Respective Scores/Confidences of Peptide VLGAFSDGLAHLDNLK (a)

Top 10 DAC sequences

sequence candidate    DAC score
VIKFSDGIAHIDNIK       0.5139
PDKFSDGIAHIDNIK       0.5138
VIGAFSDGIAHIDNIK      0.5049
PDGAFSDGIAHIDNIK      0.5047
VIKFSDGIAHIDGGIK      0.5030
PDKFSDGIAHIDGGIK      0.5029
VIKFSDGIAHIDINK       0.5009
PDKFSDGIAHIDINK       0.5008
VIKFSDGIAHIDVKK       0.5005
PDKFSDGIAHIDVKK       0.5004

Top 10 DACSIM de novo sequences

sequence and positional confidences    DAC score    SIM score    DACSIM score    confidence (%)
V7I4G3A3F7S7D7G7L4A7H7L4D7N5L3K6       0.5049       0.7507       0.6278          10.7
V7I4G3A3F7S7D7G7L4A7H7I3D7N5L3K6       0.5049       0.7502       0.6275          2.7
V7I4G3A3F7S7D7G7I3A7H7L4D7N5L3K6       0.5049       0.7499       0.6274          2.4
V7I4G3A3F7S7D7G7L4A7H7L4D7N5I2K6       0.5049       0.7498       0.6273          2.3
V7L3G3A3F7S7D7G7L4A7H7L4D7N5L3K6       0.5049       0.7496       0.6272          2.1
V7I4G3A3F7S7D7G7I3A7H7I3D7N5L3K6       0.5049       0.7493       0.6271          1.9
V7I4G3A3F7S7D7G7L4A7H7I3D7N5I2K6       0.5049       0.7493       0.6271          1.8
V7L3G3A3F7S7D7G7L4A7H7I3D7N5L3K6       0.5049       0.7491       0.6270          1.7
V7I4K2F7S7D7G7L4A7H7I3D7A1R1Q1         0.5001       0.7539       0.6270          1.7
V7L3K2F7S7D7G7L4A7H7I3D7A1R1Q1         0.5001       0.7538       0.6270          1.7

(a) Note that K and Q as well as I and L are not distinguished in the DAC sequences (represented by K and I, respectively) but are distinguished in the DACSIM de novo sequences. The digit following each residue (originally a superscript) indicates the positional confidence of that residue. Incorrectly determined residues were underlined in the original table.

each of which is then submitted to the same recursive function. The same procedure is repeated for different pivot ions to ensure that at least one of these pivot ions is a true b ion. Typically, 3-10 pivot ions are selected, with fewer pivot ions when b_end - b_start is large. The reason for using fewer pivot ions with a large b_end - b_start is 2-fold. First, it saves computation time for large peptides. Second, it was found that the most intense ions in a large spectrum segment are more likely to be b or y ions than those in a small spectrum segment. The outputs of the recursive function for the two smaller spectrum segments are two lists of short sequences. These two lists are combined in all possible combinations to derive a list of longer sequences, using the sum of the two scores as the score for each combined sequence. This list of longer sequences with scores is the output of the current function.

Table 1 (left) shows the top 10 DAC sequences and their DAC scores derived from the preprocessed spectrum shown in Figure 1. Lysine and glutamine residues (represented as a K) as well as isoleucine and leucine residues (represented as an I) are not distinguished in these sequences due to their identical nominal masses. Residues that are determined incorrectly are underlined.

Spectrum Simulation for Sequence Refinement. The divide-and-conquer algorithm generates many sequence candidates, with the correct sequence frequently on top. However, in many cases, when y ions are selected as pivot ions, incorrect sequences may appear on top. Most notably, when a series of y ions are selected as pivot ions, some sequences that are similar to the correct sequence but read in the reverse direction will appear on top. Further sequence refinement is necessary to eliminate these incorrect sequences. The refinement is accomplished by comparing the simulated tandem mass spectrum of each DAC sequence candidate to the experimental spectrum. The simulation process for singly and doubly charged peptide ions has been described in detail elsewhere (ref 36). Each sequence candidate is scored by the similarity value (ref 36) between the experimental spectrum and

Table 2. De Novo Sequencing Results for Peptides in Tryptic Digest of Hemoglobin (a)

peptide mass    ion charge    peptide sequence        top DACSIM de novo sequence and positional confidence
531.3           1             AAWGK                   A8A8W7G8K7
728.4           1             VLSPADK                 V9L4S9P8A7D8K8
817.4           1             VDPVNFK                 V9D9P8V8N5F6K6
817.4           2             VDPVNFK                 V8D8P8V8G7G7F7K7
888.5           1             LLVVYPW                 P3E2V3Y3V5P8W8
931.5           1             SAVTALWGK               S1A1V2T4A4L2W3G4Q3
931.5           2             SAVTALWGK               S5A5V8T8A8L5W6G7K7
951.5           1             VHLTPEEK                V8H8L5T8P8E8E8K7
951.5           2             VHLTPEEK                V6H6I5T7P6E6E6K6
1070.6          1             MFLSFPTTK               M1F1L1S2F2P2N2C2L1
1070.6          2             MFLSFPTTK               M8F8L5S9F9S6R6A6L5
1086.6          2             LRVDPVNFK               R5I3T7I5P9V9N9F9Q6
1125.6          2             LHVDPENFR               L5H9V9D9P7G7W8F9R9
1148.7          1             VVAGVANALAHK            K4A2T2P3V2N2A3I2A4H5K3
1148.7          2             VVAGVANALAHK            V8V8A8G8V8A8N6A6I3A7H7K6
1170.7          2             VLSPADKTNVK             V8I5S8P8A8D4K4N3T4V6K8
1191.7          2             Cbm-VVAGVANALAHK (b)    L3Q6A8G8V8A8N8A8I4A7H7K5
1251.7          2             FLASVSTVLTSK            F5I3A5S5V5S5T5V5L2T4S4K4
1273.7          1             LLVVYPWTQR              I0N0D0V0Y0S0A0S0V0V0A0H0
1273.7          2             LLVVYPWTQR              P8E8V8V8Y8P8T7W6Q5R8
1313.7          1             VNVDEVGGEALGR           V3G2G2V4D7E7V7G6G6E7A6L3G7R7
1313.7          2             VNVDEVGGEALGR           V6N6V7D7E7G3V3G6E6A7I3G7R4
1377.7          1             EFTPPVQAAYQK            E8F8T8P8P8V8Q5A8A8Y7Q3K5
1377.7          2             EFTPPVQAAYQK            E3F3T8P8P8V8Q6A8A8Y7K5K7
1448.8          2             VVAGVANALAHKYH          V5V5A5G5V7A7N6A7L3A7H7K4Y7H7
1528.7          1             VGAHAGEYGAEALER         V4G3A4H4A4S2V2Y4Q2E4A4L2E4G4V4
1528.7          2             VGAHAGEYGAEALER         V9G7A7H9A9G6E6Y9G9A9E9A8L3E6R9
1668.9          2             VLGAFSDGLAHLDNLK        V7I4G3A3F7S7D7G7L4A7H7L4D7N5L3K6
1719.0          2             LLGNVLVCVLAHHFGK        I3L3G5N5V6I3V6C6V6L3A6H6H6F6G6Q5
1797.0          2             KVLGAFSDGLAHLDNLK       K3V3L3K3F5S5D5G3I2A5H5L3D5F5A4H4
1832.9          2             TYFPHFDLSHGSAQVK        T7Y7F8P8H8F8D8L4S8H8G7S7A7K6V8Q4

(a) The digit following each residue (originally a superscript) indicates the positional confidence of that residue. Incorrectly determined residues were underlined in the original table. (b) Cbm-, carbamylated.

its simulated spectrum (SIM score). The final score of the sequence (DACSIM score) is defined as the average of the DAC score and the SIM score. Table 1 (right) shows the top 10 DACSIM sequences, sorted by their DACSIM scores, derived from the spectra shown in Figure 1. Note that, in this case, isoleucine (I) and leucine (L) as well as lysine (K) and glutamine (Q) are treated as different residues. After the refinement, the correct sequence moved from number three in the DAC sequences (Table 1, left) to the top in the DACSIM sequences (Table 1, right); only the second residue, L, is incorrectly determined as an I. In fact, all of the top eight DACSIM sequences are derived from the third DAC sequence, with different I residues replaced by L residues.

Estimation of Positional Confidence. Ideally, one would like to know the confidence level for a specific sequence or partial sequence to be correct or, even better, the confidence level of the assigned residue for each position. To do that, DACSIM was applied to 1303 tandem spectra of different known peptides acquired under similar conditions. Analyses of these results revealed that the probability for the correct sequence to be included in the list of DACSIM sequences is related to the SIM score of the top DACSIM sequence, and the probability of each DACSIM sequence being correct (confidence of the sequence) is related to the difference in DACSIM score between that sequence and the top DACSIM sequence. Given the DACSIM scores of all the sequences and the SIM score of the top DACSIM sequence, the confidence of each sequence can be estimated. Table 1 (right) shows the confidences of the top 10 DACSIM sequences calculated this way.

The probability for each residue at a specific location (positional confidence) is calculated by adding the confidences of all sequences that have the specified residue at the specified location. In this paper, the positional confidence for each location is indicated as a superscript after each residue. For example, according to the result shown in Table 1, the confidence that the N-terminal residue is a V is 70-80% (represented as a superscript of 7), and the confidence that the residue following the N-terminal sequence corresponding to 269 u (e.g., the sum of the residue masses of VIG) is an A is 30-40% (represented as a superscript of 3), etc. It can be seen that all I and L residues have low confidences because DACSIM cannot distinguish I and L confidently in this case. The third and fourth residues, GA, have low confidence because GA may potentially be a K or Q residue. The C-terminal K is determined with rather high confidence, indicating that DACSIM determines with a fairly high degree of confidence that the C-terminus is a K instead of a Q residue.
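The two bookkeeping steps just described, averaging the DAC and SIM scores and summing sequence confidences per location, can be illustrated with a minimal Python sketch. It assumes that a "location" is identified by the nominal mass of the N-terminal sequence preceding the residue, as the 269 u example suggests; the function names `dacsim_score`, `positional_confidences`, and `annotate` are hypothetical, and the candidate list in the usage lines is made-up toy data rather than output of DACSIM. How the per-sequence confidences themselves are estimated from the scores is not reproduced here.

```python
from collections import defaultdict

# Nominal residue masses (u); I/L and K/Q share a nominal mass.
NOMINAL = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def dacsim_score(dac_score, sim_score):
    """DACSIM score: the average of the DAC score and the SIM score."""
    return (dac_score + sim_score) / 2.0


def positional_confidences(candidates, sequence):
    """Sum the confidences (%) of all candidates that place the same residue
    after the same N-terminal mass as each residue of `sequence`."""
    table = defaultdict(float)
    for seq, confidence in candidates:
        prefix = 0
        for residue in seq:
            table[(prefix, residue)] += confidence
            prefix += NOMINAL[residue]
    result = []
    prefix = 0
    for residue in sequence:
        result.append(table[(prefix, residue)])
        prefix += NOMINAL[residue]
    return result


def annotate(sequence, confidences):
    """Append the tens digit of each confidence, e.g. 70-80% becomes 7."""
    return "".join(f"{residue}{min(int(c // 10), 9)}"
                   for residue, c in zip(sequence, confidences))


if __name__ == "__main__":
    # Toy candidates (sequence, confidence in %), invented for illustration.
    toy = [("VIGAF", 40.0), ("VLGAF", 25.0), ("VIKF", 10.0)]
    conf = positional_confidences(toy, "VIGAF")
    print(annotate("VIGAF", conf))  # prints V7I5G6A6F7
```

In the toy data the candidate VIKF supports the terminal F of VIGAF because GA and K share the same nominal mass, but it does not support G or A, which mirrors the lower confidence the text reports for the GA pair.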


Table 3. De Novo Sequencing Results for Peptides in Glu-C/Tryptic Digest of Apomyoglobin^a

peptide   ion     peptide            top DACSIM de novo sequence
mass      charge  sequence           and positional confidence
502.3     1       DLKK               D8I4K5K8
520.3     1       LGFQG              L4G7F8K5G8
536.3     1       KFDK               K6F8D8K8
576.2     1       GLSDGE             G7L4S8D8G8E8
626.3     1       HLKTE              F6T6Q4C7E7
630.3     1       NDIAAK             N8D8L4A8A8K8
672.4     1       WQQVL              W7Q6Q7V9L5
678.3     1       GHHEAE             G3H3H3E3A4E4
734.5     1       HKIPIK             H6K4L4P7L4K6
734.5     2       HKIPIK             K0H1L0P0I0K0
771.5     1       TALGGILK           T5A5L3G4G4I4L3Q5
799.4     1       LFTGHPE            L3H3G3T3F4P9E9
806.4     1       KGHHEAE            K5G6H6H6E6A6E6
811.5     1       KFDKFK             K6F7D8K6F8K7
811.5     2       KFDKFK             Q6F7D7K6F7K6
826.4     1       HLKTEAE            H7I4K6T7E7A7E8
830.4     1       NVWGKVE            N6V7W6G8Q6V8E8
896.4     1       ADIAGHGQE          A6D6L4A7G7H8G8Q6E8
934.5     1       KKGHHEAE           A4G4K3G6H6H6E6A6E6
934.5     2       KKGHHEAE           G3A3Q3G6H5H6E6A6E6
1016.6    2       AIIHVLHSK          A8I5I4H9V9L5H9S9K8
1142.6    1       LFTGHPETLE         L3F7T7G8H8P8E6T6I5E8
1142.6    2       LFTGHPETLE         M6Q4G6T2H2R2A2T6I3E6
1192.7    2       LKPLAQSHATK        L4Q5P8L4A8Q6S8H8A8T6K4
1256.7    2       WQQVLNVWGK         W7Q6Q6V7L4N6V7W6G7K7
1263.7    2       FISDAIIHVLH        F8I6S9D9A9I5I5H8V8I5H9
1377.8    2       HGTVVLTALGGILK     H6G6T6V6V6I3T6A6L4G6G6I4L4K6
1478.8    2       FISDAIIHVLHSK      L3F5S8D8A8I5I4H8V8I5H8S8Q7
1484.8    2       WQQVLNVWGKVE       W6Q3Q5V7L4N7V7W7G7K7V7E7
1501.7    2       HPGDFGADAQGAMTK    H5P5G5D5T2C2A4D5A5Q3K4M4T4K4
1505.9    2       HGTVVLTALGGILKK    H8G8T8V8V8I4T8A8L5G7G7I4I4K6K7
1505.9    2       KHGTVVLTALGGILK    K2H3G3T3V3V3L1T3A3I1V2K2L2K2
1814.9    2       GLSDGEWQQVLNVWGK   L3G5S5D6G6E6W7Q7Q7V7L3N7V7W7G7K7

^a Superscript on each residue (shown here as the digit following the residue) indicates the positional confidence of that residue. Incorrectly determined residues are underlined.
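As a consistency check on the peptide mass column, the following minimal sketch computes neutral monoisotopic peptide masses from standard residue masses. The function name peptide_mass is hypothetical and the code is not part of MassAnalyzer. The same arithmetic illustrates why some substitutions are hard to avoid on a low-resolution instrument: the residue pair GA is isobaric with Q and within about 0.04 u of K, and the de novo result FTQCE differs from the correct sequence HLKTE by only about 0.1 u.

```python
# Standard monoisotopic residue masses (u).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide (terminal H and OH)


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide, in u."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER


print(round(peptide_mass("DLKK"), 1))              # 502.3
print(round(peptide_mass("GLSDGEWQQVLNVWGK"), 1))  # 1814.9

# Near-isobaric alternatives that mass alone cannot separate at unit resolution.
ga = RESIDUE_MASS["G"] + RESIDUE_MASS["A"]
print(abs(ga - RESIDUE_MASS["Q"]) < 0.001)  # True: GA is isobaric with Q
print(abs(ga - RESIDUE_MASS["K"]) < 0.05)   # True: GA is within 0.04 u of K
```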

Software Development. The algorithm was incorporated into MassAnalyzer, a program for automated “bottom-up” protein characterization.37 The program was written in Microsoft Visual C++. The current version of MassAnalyzer is able to read Thermo Electron Xcalibur raw files directly. Contact the author directly for a test of the program. For de novo sequencing, MassAnalyzer allows the user to set the maximum length of time to be used on each de novo sequencing task. All de novo sequencing results presented here used a maximum time of 1 min, although many small peptides can be done in a much shorter time. The software also allows the user to specify any uncommon amino acid residues or modified residues to be included in the sequencing. Because each residue has its own set of parameters for simulating the peptide tandem spectrum, the average parameters are used when simulating the tandem spectrum of a peptide containing user-defined residues with unknown properties. All de novo sequencing analyses described here were performed on a Compaq desktop computer with an Intel 1.7-GHz Pentium 4 processor.

(37) Zhang, Z. Q. Proc. 52nd ASMS Conf. Mass Spectrom. Allied Topics, Nashville, TN, 2004.

RESULTS AND DISCUSSION

Tryptic digest of hemoglobin and Glu-C/tryptic digest of apomyoglobin were analyzed by LC/MS/MS on a Thermo Finnigan LCQ DECA system. The tandem mass spectra of peptides in the tryptic digest of hemoglobin were not included in the data set used to optimize the model for tandem spectrum simulation.36 De novo sequencing was performed on all singly or doubly charged precursor ions with intensities above 10^7 counts and masses within 500-1900 u. Results for peptides larger than 1900 u are usually not acceptable, presumably because the first heavy isotope peak is more intense than the monoisotopic peak. Tables 2 and 3 show the de novo sequencing results of these peptides as compared to their correct sequences. Incorrectly determined residues are underlined. For each peptide, the maximum computation time was 1 min (average computation time ∼25 s on an Intel 1.7-GHz Pentium 4 processor). No terminal residues were specified by the user. It can be seen that most incorrectly determined residues have low confidence levels, and

Table 4. Success Rate (%) and False Positive Rate (%) of Determined Residues When DACSIM or DAC Was Tested on Peptides in Tryptic Digest of Hemoglobin and Glu-C/Tryptic Digest of Apomyoglobin for Various Confidence Thresholds

               DACSIM (25 s/spectrum)                                                  DAC only (1.7 s/spectrum)
               I/L distinguished       I/L not distinguished   I/L not distinguished   I/L not distinguished
               K/Q distinguished       K/Q distinguished       K/Q not distinguished   K/Q not distinguished
confidence     success     false pos.  success     false pos.  success     false pos.  success     false pos.
threshold (%)  rate        rate        rate        rate        rate        rate        rate        rate
0              80.1        19.9        84.3        15.7        86.4        13.6        82.8        17.2
10             79.6        17.6        83.5        13.6        85.8        11.8        81.7        15.5
20             78.6        17.2        82.8        13.4        85.0        11.3        74.0        14.6
30             76.9        13.8        81.2        10.0        83.7        9.1         67.6        10.0
40             70.2        9.9         76.4        8.3         80.1        7.1         49.8        5.4
50             62.8        6.3         69.7        5.7         74.6        4.5         31.8        2.8
60             54.4        3.6         62.6        3.7         67.2        3.4         23.1        1.8
70             39.5        1.6         45.1        1.6         50.3        1.9         15.1        1.3
80             24.6        0.5         29.6        0.8         33.5        0.8         4.9         1.0
90             5.5         0.0         7.1         0.2         7.4         0.2         1.0         0.0
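The two rates tabulated in Table 4 are defined under "Success Rate and False Positive Rate" in the text. A minimal sketch of one way to tally them is given here. The text does not spell out how a de novo residue is matched to the correct sequence, so several choices in the sketch are assumptions: residues are aligned by nominal N-terminal mass offset, the denominator is the number of residues in the correct sequences, and the confidence digit is compared with the threshold through the lower bound of its 10% bin. The function names and the merge argument are hypothetical. The example data are the two triply charged DAC results quoted later in the text, for which this counting gives the reported 64% of residues correct at a threshold of 0%; it is not claimed to reproduce the Table 4 values.

```python
import re

# Nominal residue masses (u); I/L and K/Q share a nominal mass.
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128,
    "E": 129, "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def by_offset(residues):
    """Map each N-terminal mass offset of the correct sequence to its residue."""
    table, offset = {}, 0
    for residue in residues:
        table[offset] = residue
        offset += NOMINAL_MASS[residue]
    return table


def tally(correct, annotated, threshold=0, merge=None):
    """Return (true positives, false positives, residues analyzed) for one peptide.

    annotated: de novo result such as "K1V1L1"; each digit is the positional
    confidence in tens of percent.
    merge: optional mapping such as {"I": "L"} for residues not distinguished.
    """
    merge = merge or {}
    truth = by_offset(correct)
    true_pos = false_pos = offset = 0
    for residue, digit in re.findall(r"([A-Z])(\d)", annotated):
        if int(digit) * 10 >= threshold:  # assumption: compare the bin's lower bound
            expected = truth.get(offset)
            if expected is not None and merge.get(expected, expected) == merge.get(residue, residue):
                true_pos += 1
            else:
                false_pos += 1
        offset += NOMINAL_MASS[residue]
    return true_pos, false_pos, len(correct)


def rates(peptides, threshold=0, merge=None):
    """Success rate and false positive rate (%) over a list of peptides."""
    tp = fp = total = 0
    for correct, annotated in peptides:
        t, f, n = tally(correct, annotated, threshold, merge)
        tp, fp, total = tp + t, fp + f, total + n
    return 100 * tp / total, 100 * fp / total


# The two triply charged DAC results quoted in the text.
triply_charged = [
    ("KVLGAFSDGLAHLDNLK", "K1V1L1G1A1F1S1D1G1S1P1F1S1M1G0K0G0L0"),
    ("TYFPHFDLSHGSAQVK", "T1Y1F1P1H1F1D1L1S1H1G1A0N0P0C0K0"),
]
success, _ = rates(triply_charged, threshold=0)
print(round(success))  # 64
```

Passing merge={"I": "L", "Q": "K"} treats I/L and K/Q as not distinguished, which corresponds to the third DACSIM column group of Table 4.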

Table 5. Success Rate (%) and False Positive Rate (%) for Distinguishing Residues I/L and K/Q on Peptides in Tryptic Digest of Hemoglobin and Glu-C/Tryptic Digest of Apomyoglobin for Various Confidence Thresholds

               I/L                     K/Q
confidence     success     false pos.  success     false pos.
threshold (%)  rate        rate        rate        rate
0              64.0        36.0        82.2        17.8
10             62.7        33.3        80.8        17.8
20             60.0        32.0        80.8        17.8
30             56.0        29.3        78.1        17.8
40             38.7        20.0        69.9        15.1
50             16.0        6.7         61.6        13.7
60             1.3         0.0         52.1        6.8
70             0.0         0.0         26.0        1.4
80             0.0         0.0         8.2         0.0
90             0.0         0.0         0.0         0.0

many of them (∼43%) are I, L, K, or Q because of the difficulty in distinguishing them on a low-resolution instrument.

Success Rate and False Positive Rate. To determine how successful this technique is, each de novo sequence is compared to the correct sequence. To take advantage of the positional confidence for each residue, a threshold of this confidence level needs to be defined. When the confidence level of a residue is above the threshold, a positive residue determination is confirmed. A true positive is defined as a correctly determined residue whose confidence is above the threshold, and a false positive is defined as an incorrectly determined residue whose confidence is above the threshold. The success rate is defined as the number of true positives divided by the total number of analyzed residues, and the false positive rate is defined as the number of false positives divided by the total number of analyzed residues. Table 4 (left) shows the success rate and false positive rate of determined residues for various confidence thresholds, for the results shown in Tables 2 and 3. It can be seen from Table 4 that when confidence is not considered (threshold 0%), 80% of the residues are determined correctly and 20% are determined incorrectly, when both I/L and K/Q are distinguished. A good choice of confidence threshold would maximize the success rate while keeping the false positive rate reasonably low. For example, when a confidence threshold of 50% is used, the success rate is 63% and the false positive rate is 6.3%. When I/L are not distinguished, the success rate and false positive rate for a 50% confidence threshold change to 70 and 5.7%, respectively. When neither I/L nor K/Q is distinguished, the success rate and false positive rate for a 50% confidence threshold change to 75 and 4.5%, respectively.

For comparison, the success rate and false positive rate of DAC-only (no refinement from spectrum simulation) are also shown in Table 4. Even though DACSIM greatly improves the result from DAC-only, DAC-only also provides reasonable accuracy for de novo sequencing, with a success rate of 32% and a false positive rate of 2.8% for a 50% confidence threshold. As with DACSIM, the positional confidences for sequences generated by DAC-only are calculated from analyzing the results obtained when DAC-only is applied to the 1303 spectra of known peptides. The advantage of using DAC-only is its speed. The time DAC spends on a spectrum increases exponentially with the mass of the peptide, from about 1 s for a 1000 u peptide to 10-15 s for an 1800 u peptide. The average time DAC spends on each spectrum is less than 2 s. In addition, when spectrum simulation cannot be readily performed, such as on instruments other than quadrupole ion traps, DAC-only may become a very fast technique for peptide de novo sequencing. Please note that the success rates shown in Table 4 may be overestimated for real-world situations in which the spectral quality is not as good as that of the spectra used here, because the LC/MS/MS experiments performed here are not sample-limited and only major ions in the runs are analyzed.

To compare the results obtained here to a different de novo sequencing program, the spectra that generated the results in Tables 2 and 3 were processed using Thermo Finnigan DeNovoX 1.0. Similar mass tolerance parameters and computation times were used in DeNovoX as in DACSIM.
Compiling the top-score full sequences (equivalent to a confidence threshold of 0% shown in Table 4) determined by DeNovoX showed that 54.3% of the residues were determined incorrectly, as compared to 17.2% by DAC and 13.6% by DACSIM when I/L and K/Q were not distinguished.

I/L and Q/K Assignments. Because the described technique uses fragment ion intensity information in determining peptide sequence, it is possible that I/L and Q/K can be distinguished


Table 6. De Novo Sequencing Results for Some Peptides in Glu-C/Tryptic Digest of Apomyoglobin When Different Numbers of MS3 Spectra Were Acquired

correct sequence         MS3 spectra  de novo sequence and positional confidence
HLKTE (1+)               0            F6T6Q4C7E7
                         1            H7I4K6T7E8
                         2            H8I5K6T8E8
                         3            n/a
HKIPIK (2+)              0            K0H1L0P0I0K0
                         1            K2P2M2M2T2K3
                         2            H2K1L1P2I1K2
                         3            K1H2L2P2I1K2
LFTGHPE (1+)             0            L3H3G3T3F4P9E9
                         1            L3H4G4T5F7P8E8
                         2            L3H4G4T4F6P8E8
                         3            L3H5G6T6F7P8E8
KKGHHEAE (2+)            0            G3A3Q3G6H5H6E6A6E6
                         1            Q6K6G6H6H6E7A7E7
                         2            Q5Q4G6H5H5E6A6E6
                         3            K2Q2G3H3H3E4A4E4
LFTGHPETLE (2+)          0            M6Q4G6T2H2R2A2T6I3E6
                         1            F5I3T5G4H4L3N4T4I2K4
                         2            n/a
                         3            n/a
FISDAIIHVLHSK (2+)       0            L3F5S8D8A8I5I4H8V8I5H8S8Q7
                         1            L3F4S8D8A8I4I5H8V8I4H8S8Q5
                         2            L3F5S8D8A8I5I5H8V8I4H8S7Q6
                         3            F4I3S8D8A8I5I5H8V8I4H8S7Q6
HPGDFGADAQGAMTK (2+)     0            H5P5G5D5T2C2A4D5A5Q3K4M4T4K4
                         1            H5P5G5D5F5G3A3D4A5K3T4G2T2T4K5
                         2            H5P5G5D5F4G3A3D4A4Q2K2M4T4K4
                         3            H4P4G4D4F4Q4D4S2T2E4D4V4R4
KHGTVVLTALGGILK (2+)     0            K2H3G3T3V3V3L1T3A3I1V2K2L2K2
                         1            A2G2H3G3T3V3V3I1T3A3I2V3K3L2K3
                         2            A2G2H3G3T3V3V3L1T3A3G2H2K2M2K
                         3            K0M0L0G0G0L0A0T0I0V0K0F0T0K0
GLSDGEWQQVLNVWGK (2+)    0            L3G5S5D6G6E6W7Q7Q7V7L3N7V7W7G7K7
                         1            Q2E3D5G6E6W6Q6Q6V6I4N6V6W6G6K6
                         2            A3G1E3D7G3E3W7Q7Q7V7L3N7V7W7G7K7
                         3            L1G2S5D5D5A5W5A2G2A5G5V5I3N5V5W5G5K5

due to the differences in their chemical properties, although the instrument used in this work is not able to distinguish them based on their masses. Table 5 shows the success and false positive rates for determining I from L and Q from K, based on the results presented in Tables 2 and 3. In Table 5, the denominator in the success and false positive rates is the total number of I/L or K/Q residues that are determined to be I/L or K/Q. Therefore, if the algorithm is not capable of distinguishing I/L or K/Q, the success rate should equal the false positive rate. It is seen that, for a 0% confidence threshold, 64% of I/L and 82% of Q/K are determined correctly; both significantly exceed the random value of 50%. For a confidence threshold of 50%, the success rate for distinguishing I and L is 16% with a false positive rate of 6.7%. For the same confidence threshold, the success rate for distinguishing K and Q is 62% with a false positive rate of 14%. It can be concluded that although the technique can distinguish I and L to some extent, the result is not reliable. The ability to distinguish K and Q is much more reliable. This result is expected because K and Q are quite different in their side-chain basicities, while I and L have very similar chemical properties. Please note that the results presented here did not take into account the frequency of occurrence of residues I, L, K, and Q. A better result may be achieved if the frequencies of occurrence for these residues are known.

Charges of the Precursor Ions. From Tables 2 and 3 it can be seen that the results for singly charged species are not significantly poorer than those for the doubly charged species. In many cases, singly charged species generate better results than the doubly charged species. Although singly charged species often generate fewer sequence ions, their tandem spectra are usually predicted more accurately. In addition, there are no ambiguities in the charge states of the fragment ions.
The advantage of sequencing singly charged peptides can be useful when sequencing MALDI-produced peptide ions.34,38-39 For triply charged precursor ions, the process for simulating tandem spectra is very tedious and time-consuming due to the complexity in calculating the proton distribution of a peptide and the charge distribution of b and y ions generated from a backbone cleavage.36 As a result, de novo sequencing of triply charged peptides based on a precise simulating procedure as described in ref 36 becomes impractical. However, DAC-only is found to be capable of providing useful sequence information for triply charged peptides. For example, two triply charged peptides with masses smaller than 1900 u, KVLGAFSDGLAHLDNLK (1797.0 u) and TYFPHFDLSHGSAQVK (1832.9 u), were observed in the tryptic digest of hemoglobin. When DAC was applied to the tandem spectra of the two triply charged peptides, a result of K1V1L1G1A1F1S1D1G1S1P1F1S1M1G0K0G0L0 was obtained for peptide KVLGAFSDGLAHLDNLK and T1Y1F1P1H1F1D1L1S1H1G1A0N0P0C0K0 was obtained for peptide TYFPHFDLSHGSAQVK. Although with low confidence, 64% of the residues were determined correctly. A simplified model is under development for faster simulation of tandem spectra of peptide ions with charge three or higher. The model is expected to increase the reliability of de novo sequencing of triply charged peptides.

(38) Yergey, A. L.; Coorssen, J. R.; Backlund, P. S.; Blank, P. S.; Humphrey, G. A.; Zimmerberg, J.; Campbell, J. M.; Vestal, M. L. J. Am. Soc. Mass Spectrom. 2002, 13, 784-791.
(39) Zhang, W. Z.; Krutchinsky, A. N.; Chait, B. T. J. Am. Soc. Mass Spectrom. 2003, 14, 1012-1021.

Modified Peptides. For modified peptides, if the modification is known, the modified residue can be added to the amino acid list and used in de novo sequencing. If the modification is not known in advance, de novo sequencing can still provide a useful result. For example, Table 2 has a modified peptide (1191.7 u) due to N-terminal carbamylation (cbm, +43 u). Obviously, the algorithm will not give the correct answer for the N-terminal residue because it is not aware of the existence of such a modification. However, it can be seen that, except for the N-terminal two residues, most of the remaining residues are determined correctly. The N-terminal two residues in the determined sequence happen to match the mass of the N-terminal two residues of the modified peptide, and they are determined with lower confidence.
The modified peptide can be easily identified by searching the partial sequence consisting of residues with high confidence.

Terminal Residues. For results shown in Tables 2 and 3, no terminal residues were specified. When the possible terminal residues were specified according to the specificity of the proteases used in the digestion, a better result was achieved (result not shown). Please note that the terminal residue information is used in DAC, but not in SIM. Therefore, when possible terminal residues are specified, the algorithm puts more weight on sequences containing the specified terminal residues, but it does not exclude other sequences that fit the spectrum better. For example, in the tryptic digest of hemoglobin (Table 2), when K and R are specified as possible C-terminal residues, the result for peptide KVLGAFSDGLAHLDNLK (1797.0 u) becomes K3V3L3Q4F6S6D6G6I3A6H6L4D6N6L4K6. However, the result for peptide VVAGVANALAHKYH (C-terminal of β-chain, 1448.8 u) is unchanged, V5V5A5G5V7A7N6A7L3A7H7K4Y7H7, and the result for peptide LLVVYPW (from chymotrypsin digest, 888.5 u) becomes P2E2V2Y2V4P5W5, each one still having its own correct C-terminal residue.

MS3 Experiments. In the introduction section the author mentioned two advantages of an ion trap mass spectrometer: its reproducibility/predictability in fragmentation pattern and its ability to perform multiple stages of MS. The technique described here takes advantage of the reproducibility and predictability of the fragmentation pattern. The technique that takes advantage of multiple stages of MS for de novo peptide sequencing has been described previously.35 How much improvement can we make by taking advantage of both? In this work, MS3 spectra were used to generate MS2/MS3 intersection spectra35 and then added to the preprocessed spectra for generating sequence candidates. Sequence refinements were performed the same way by comparing the predicted MS/MS spectrum to the experimental MS/MS spectrum. Table 6 shows some of the results when various numbers of data-dependent MS3 scans were collected after the MS/MS scan for the Glu-C/tryptic digest of apomyoglobin. Peptides listed in Table 6 are from those peptides shown in Table 3 that are not sequenced optimally. It can be seen that one or two MS3 scans do improve the accuracy of sequencing. However, when more than two MS3 scans are acquired, the results get worse (results for more than three MS3 scans are not shown).
The reason is that when the experiment is sample limited or time limited (such as in an LC run), spending more time acquiring many MS3 scans will significantly reduce the quality of the MS/MS spectrum due to lack of sufficient signal averaging. The conclusion is that one or two MS3 scans do improve the result when the experiment is sample limited or time limited. More MS3 scans only help when it is not sample or time limited. Please note that the extra MS3 experiments are only used to help derive sequence candidates in the DAC algorithm; it is potentially possible to develop a more sophisticated algorithm to take full advantage of these MS3 spectra for de novo peptide sequencing. These more sophisticated algorithms are not explored at this stage because extra MS3 experiments are not desirable for most users.

CONCLUSIONS

For improving the reliability of MS/MS-based peptide de novo sequencing, the focus has been on using high-resolution instruments for more accurate mass determination. This paper describes a technique that takes advantage of the reproducibility and predictability in fragment ion intensities of a less expensive and more widely accessible low-resolution ion trap instrument. The technique is fully automatic and can be used for sequencing random peptides up to 1900 u in mass. The technique proved quite robust and reliable. A disadvantage of this technique is that training is required to optimize the model for predicting fragment ion intensities when the peptide fragmentation pattern in a mass spectrometer differs significantly from the patterns generated from the instruments used in this work.

ACKNOWLEDGMENT

The author thanks Dr. Joseph Bordas-Nagy of Amgen for his continuing support and helpful discussions during this work. The author also thanks Dr. Joseph Bordas-Nagy and Dr. Hai Pan of Amgen for their help during preparation of the manuscript.

Received for review June 14, 2004. Accepted August 24, 2004. AC0491206
