Comparative Sequencing of Nucleic Acids by Liquid Chromatography

Anal. Chem. 2002, 74, 211-218

Comparative Sequencing of Nucleic Acids by Liquid Chromatography-Tandem Mass Spectrometry

Herbert Oberacher,† Bernd Wellenzohn,‡ and Christian G. Huber*,†

Institute of Analytical Chemistry and Radiochemistry and Institute of General, Inorganic, and Theoretical Chemistry, Leopold-Franzens-University, Innrain 52a, 6020 Innsbruck, Austria

An algorithm was developed for the computer-aided interpretation of fragment ion spectra from collision-induced dissociation of multiply charged oligodeoxynucleotide ions generated by electrospray ionization. The method compares the experimental spectrum to the m/z values predicted by employing established fragmentation pathways from a known reference sequence. The closeness of matching between the measured spectrum and the predicted set of fragment ions is characterized by the fitness, which takes into account the difference between measured and predicted m/z values, the intensity of the fragment ions, the number of fragments assigned, and the number of nucleotide positions not covered by fragment ions in the experimental spectrum. Smaller values for the fitness indicate a closer match between measured spectrum and predicted m/z values. To substantiate the identity of investigated sequence and reference sequence, or to identify point mutations or insertions/deletions, the reference sequence is systematically varied by incorporating all four possible nucleotides A, T, G, and C at each position in the sequence followed by identification of the correct sequence by the lowest fitness value. Collision energy was shown to have a major impact on the interpretability of the tandem mass spectra by the comparative sequencing algorithm, and the optimal collision energy depended on the length of the fragmented oligodeoxynucleotide. The analytical system was successfully applied to verify DNA sequences as well as to detect and localize point mutations or insertions/deletions in 5-51-mer oligodeoxynucleotides.

The past two decades have seen nucleic acid sequencing evolve from specialized, complicated, and labor-intensive procedures to fully automated, high-throughput, commonplace techniques.
* Corresponding author: (tel) +43 512 507 5176; (fax) +43 512 507 2767; (e-mail) [email protected].
† Institute of Analytical Chemistry and Radiochemistry.
‡ Institute of General, Inorganic, and Theoretical Chemistry.
10.1021/ac015595a CCC: $22.00 Published on Web 11/30/2001
© 2002 American Chemical Society

Today’s most powerful sequencing method is the Sanger chain termination method combined with capillary gel electrophoretic analysis, offering read lengths of more than 1000 bases, the possibility of multiplexing, and rapid, computer-based data interpretation, which altogether result in the very high throughput

necessary for de novo genomic sequencing.1 Nevertheless, in the “postgenomic era”, a lot of effort will be directed toward the comparison of small pieces of a genome against reference genomes in order to detect slight genetic variations or polymorphisms.2 Since polymorphisms occur only at a frequency of 1 in 800-62 000 base pairs, their discovery by Sanger sequencing would consume a lot of time, labor, and material for the determination of already known sequences. A plurality of studies, comprehensively reviewed by Murray,3 Nordhoff et al.,4 and Gross and Hillenkamp,5 have laid the groundwork for sequencing approaches by tandem mass spectrometry (MS/MS) based on the gas-phase collision-induced dissociation (CID) chemistry of multiply charged oligodeoxynucleotide ions. The fragmentation reactions of oligodeoxynucleotides in MS/MS have been studied extensively,6-14 and resulting product ion spectra are well predictable on the basis of known fragmentation pathways. Although mass spectrometric sequencing might not ever be able to offer the long read length of Sanger sequencing, it provides high speed and accurate mass measurements based on an intrinsic property of the investigated nucleic acid molecules,15 which make MS/MS a potent tool for the detection of nucleic acid sequence variation. Consequently, MS/MS of oligodeoxynucleotides has been applied to characterize or

(1) Dovichi, N. J.; Zhang, J. Angew. Chem. 2000, 112, 4635-40.
(2) Kristensen, V. N.; Kelefiotis, D.; Kristensen, T.; Borresen-Dale, A. L. Biotechniques 2001, 30, 318-22.
(3) Murray, K. K. J. Mass Spectrom. 1996, 31, 1203-15.
(4) Nordhoff, E.; Kirpekar, F.; Roepstorff, P. Mass Spectrom. Rev. 1996, 15, 76-138.
(5) Gross, J.; Hillenkamp, F. In Encyclopedia of Analytical Chemistry; Meyers, R. A., Ed.; John Wiley & Sons Ltd.: Chichester, U.K., 2000.
(6) McLuckey, S. A.; Berkel, G. J.; Glish, G. L. J. Am. Soc. Mass Spectrom. 1992, 3, 60-70.
(7) Kirpekar, F.; Krogh, T. N. Rapid Commun. Mass Spectrom. 2001, 15, 8-14.
(8) McLuckey, S. A.; Habibi-Goudarzi, S. J. Am. Soc. Mass Spectrom. 1994, 5, 740-7.
(9) Little, D. P.; Chorush, R. A.; Speir, J. P.; Senko, M. W.; Kelleher, N. L.; McLafferty, F. W. J. Am. Chem. Soc. 1994, 116, 4893-7.
(10) McLuckey, S. A.; Vaidayanathan, G.; Habibi-Goudarzi, S. J. Mass Spectrom. 1995, 30, 1222-9.
(11) Ni, J.; Pomerantz, S. C.; Rozenski, J.; Zhang, Y.; McCloskey, J. A. Anal. Chem. 1996, 68, 1989-99.
(12) Little, D. P.; Aaserud, D. J.; Valaskovic, G. A.; McLafferty, F. W. J. Am. Chem. Soc. 1996, 118, 9352-9.
(13) Vrkic, A. K.; O’Hair, R. A. J.; Foote, S.; Reid, G. E. Int. J. Mass Spectrom. 2000, 194, 145-64.
(14) Wan, K. X.; Gross, J.; Hillenkamp, F.; Gross, M. L. J. Am. Soc. Mass Spectrom. 2001, 12, 193-205.
(15) Henry, C. Anal. Chem. 1997, 69, 243A-6A.

Analytical Chemistry, Vol. 74, No. 1, January 1, 2002 211

Table 1. Sequences and Molecular Masses Mr of Oligodeoxynucleotides Analyzed in This Study

no.   sequence                                                   length (nt)   Mr
1     GACAGGAAAG ACATTCTGGC                                      20            6175.15
2     GACAGGAAAG ACTTTCTGGC                                      20            6166.13
3     TGATGATGAT GCGTGAAGAC AGTAGTTCCC TGACTCTGA                 39            12061.99
4     AAACCACATT CTGAGCATAC CCCCAAAAAA TTTCATGCCG AAGCTGTGGT C   51            15580.33

confirm structures,9 to identify sequence variations,16,17 and to detect chemical modifications.8,18,19 It is generally recognized that the information content of product ion spectra critically depends on instrument type and experimental conditions.20,21 Moreover, spectrum complexity dramatically increases with the size of the fragmented oligodeoxynucleotide, resulting in considerable difficulties in spectrum interpretation. Little et al. outlined a sequencing strategy for “manual” interpretation of MS/MS spectra,12 but deduction of sequence information is time-consuming, highly technical, lacking strict objectivity due to reliance on human interpretation, and can be performed only in laboratories with extensive experience in MS/MS. Hence, automation of procedures for interpretation of fragment ion spectra represents a prerequisite for the applicability of MS/MS in the routine sequence analysis of nucleic acids. A computer-based algorithm for sequencing of oligodeoxynucleotides of completely unknown sequence by electrospray ionization tandem mass spectrometry (ESI-MS/MS) was elaborated by McCloskey and co-workers.11 The algorithm works by extending from both the 5′ (a-B ions) and 3′ (w ions) end ion series encoding the complete DNA sequence. Mass ladders are identified by sequentially adding each of the four possible nucleotide masses and searching the spectrum for the best match of expected ions. Yates et al. developed a method for automatically correlating measured tandem mass spectra mainly of peptides but also of oligodeoxynucleotides with predicted spectra of sequences derived from DNA or protein sequence databases.22,23 Despite considerable success, the aforementioned algorithms are restricted to oligomers at approximately the 15-mer level and below. 
Nevertheless, sequences in the 50-mer and longer range are usually generated upon amplification by the polymerase chain reaction (PCR), which represents one of the most important molecular biological tools for investigations involving nucleic acids.24 On the basis of recent findings that the quite selective CID conditions in quadrupole ion trap mass spectrometers are suitable

(16) Genti, E.; Banoub, J. J. Mass Spectrom. 1996, 31, 83-94.
(17) Krahmer, M. T.; Walters, J. J.; Fox, K. F.; Fox, A.; Creek, K. E.; Pirisi, L.; Wunschel, D. S.; Smith, R. D.; Tabb, D. L.; Yates, J. R. Anal. Chem. 2000, 72, 4033-40.
(18) Premstaller, A.; Oberacher, H.; Huber, C. G. Anal. Chem. 2000, 72, 4386-93.
(19) Barry, J. P.; Vouros, P.; Van Schepdael, A.; Lay, S.-J. J. Mass Spectrom. 1995, 30, 993-1006.
(20) Premstaller, A.; Ongania, K.-H.; Huber, C. G. Rapid Commun. Mass Spectrom. 2001, 15, 1045-52.
(21) Premstaller, A.; Huber, C. G. Rapid Commun. Mass Spectrom. 2001, 15, 1053-60.
(22) Yates, J. R., III; Eng, J. K. Identification of Nucleotides, Amino Acids, or Carbohydrates by Mass Spectrometry. U.S. Patent 716256, University of Washington, 1994.
(23) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-89.
(24) Erlich, H., Ed. PCR Technology. Principles and Applications for DNA Amplification; Stockton Press: New York, 1989.


to generate fragment ions mainly of the a-B and w type from 20-mer oligodeoxynucleotides with high sequence coverage,21 we now investigated the possibility of extending the size range to longer nucleic acid molecules. A computer-aided algorithm that combines the correlation of measured spectra with MS/MS data predicted from a reference sequence with a systematic variation of the reference sequence is used both to confirm expected nucleic acid sequences and to identify unknown mutations in nucleic acids up to 50-mers.

EXPERIMENTAL SECTION

Chemicals and Oligodeoxynucleotides. Acetonitrile (HPLC gradient grade) was obtained from Merck (Darmstadt, Germany). Stock solutions (0.50 M) of triethylammonium bicarbonate (TEAB) or butyldimethylammonium bicarbonate (BDMAB) were prepared by passing carbon dioxide gas (AGA, Vienna, Austria) through a 0.50 M aqueous solution of triethylamine or butyldimethylamine (both from Fluka, Buchs, Switzerland) at 5 °C until pH 8.4 was reached. For preparation of all solutions, HPLC-grade water (Merck) was used. The synthetic oligodeoxynucleotides listed in Table 1 were ordered from Microsynth (Balgach, Switzerland) and used without further purification.

Capillary High-Performance Liquid Chromatography Coupled to Electrospray Ionization Mass Spectrometry. The HPLC system consisted of a low-pressure gradient micropump (model Rheos 2000, Flux Instruments, Basel, Switzerland) controlled by a personal computer, a vacuum degasser (Knauer, Berlin, Germany), a column thermostat made from 3.3-mm-o.d. copper tubing that was heated by means of a circulating water bath (model K 20 KP, Lauda, Lauda-Königshofen, Germany), and a microinjector (model C4-1004, Valco Instruments Co. Inc., Houston, TX) with 500-nL internal sample loop. ESI-MS was performed on a Finnigan MAT LCQ quadrupole ion trap mass spectrometer (Thermo Finnigan, San Jose, CA) equipped with an electrospray ion source. The 60 × 0.2 mm i.d.
monolithic capillary column was prepared according to the published protocol18 and connected directly to the spray capillary (fused silica, 90-µm o.d., 20-µm i.d., Polymicro Technologies) by means of a microtight union (Upchurch Scientific, Oak Harbor, WA). A syringe pump equipped with a 250-µL glass syringe (Unimetrics, Shorewood, IL) was used for adding a 3.0 µL/min flow of acetonitrile as sheath liquid through the triaxial electrospray probe. For analysis with pneumatically assisted ESI, an electrospray voltage of 3.4 kV and a nitrogen sheath gas flow of 40-60 arbitrary units was employed. The temperature of the heated capillary was set to 200 °C. Total ion chromatograms and mass spectra were recorded on a personal computer with the Xcalibur software version 1.1 (Thermo Finnigan). Mass calibration and coarse tuning were performed in the positive ion mode by direct infusion of a solution of caffeine (Sigma, St. Louis, MO), methionylarginylphenylalanylalanine

(Finnigan), and Ultramark 1621 (Finnigan). Fine-tuning for ESI-MS of oligodeoxynucleotides in the negative ion mode was performed by infusion of 3.0 µL/min of a 20 pmol/µL solution of (dT)24 in 25 mM aqueous TEAB containing 20% acetonitrile (v/v). Cations present in the oligodeoxynucleotide solution were removed by on-line cation exchange using a 20 × 0.50 mm i.d. cation-exchange microcolumn packed with 38-75-µm Dowex 50 WX8 particles (Serva, Heidelberg, Germany).25 For IP-RP-HPLC-ESI-MS/MS analysis, oligodeoxynucleotides were injected without prior cation removal. For MS/MS experiments, the isolation width and the relative collision energy were set to 4.0 mass units and 15-100%, respectively; 0-100% relative collision energy corresponds to a high-frequency alternating voltage for resonance excitation from 0 to 5 V peak to peak. Helium, present in the ion trap at a pressure of 0.1 Pa, served as collision gas.

Computer-Aided Data Interpretation. All calculations were performed on a personal computer under the Windows NT operating system (400 MHz Pentium, 128 MB RAM). m/z values for fragment ions were calculated from the sequences using Microsoft Excel for Windows 2000 (Microsoft, Redmond, WA). Measured MS/MS spectra of oligodeoxynucleotides were exported from the Xcalibur software (Thermo Finnigan) as ASCII files. Automated comparison of measured and predicted spectra was performed with a program written in Matlab (version 5.3, Mathworks Inc., Natick, MA). All data were collected, evaluated, and prepared for final output in Microsoft Excel for Windows 2000.

RESULTS AND DISCUSSION

Comparative Sequencing Strategy. The starting point of our sequencing strategy relies on the fact that for many genome-related problems there is no need to completely determine an unknown sequence but merely to confirm identity or to reveal small differences compared to a known sequence.
Sequencing by MS/MS is an attractive alternative to conventional Sanger sequencing, because sequence data can be obtained very rapidly in a time frame of several seconds upon CID and subsequent mass analysis of the fragments directly in a mass spectrometer. Moreover, standard PCR without expensive fluorescent tagging followed by on-line liquid chromatographic purification are the only sample preparation methods required to investigate genomic nucleic acid sequences.26 The de novo MS/MS sequencing method of McCloskey and co-workers11 cannot be extended to longer oligodeoxynucleotides because usually several missing fragments in the series of a-B or w ions prevent the successful passage through the whole sequence. Our method, on the contrary, involves the comparison of a measured MS/MS spectrum to a set of fragment ion m/z values predicted from a reference sequence, irrespective of the completeness of ion series. Gaps within the ion series can be tolerated since the m/z values of fragments after the gap also incorporate the positions where fragments are missing. Moreover, fragment ions from the complementary ion series are utilized to complete the set of sequence-relevant ions. The closeness of matching between the measured spectrum and the predicted set of ions is characterized by a value FS for the fitness, which takes into account the difference ∆ between

(25) Huber, C. G.; Buchmeiser, M. R. Anal. Chem. 1998, 70, 5288-95.
(26) Oberacher, H.; Parson, W.; Huber, C. G. Anal. Chem. 2001, 73, 5109-17.

measured and predicted m/z values, the relative intensity I% of the fragment ions, the number K of fragments assigned, and the number M of nucleotide positions not covered by fragment ions in the experimental spectrum. The smaller the value for FS, the closer the match between measured spectrum and predicted m/z values. Finally, to find a sequence most closely matching the experimental spectrum, the first reference sequence is sequentially varied by incorporating all four possible nucleotides A, T, G, and C at each position in the sequence. The correct sequence is then identified by that reference sequence having the lowest FS value.

Development of the Sequencing Algorithm. Figure 1 outlines the procedure for computer-aided interpretation of MS/MS spectra of oligodeoxynucleotides. The input parameters are the reference sequence together with the charge state of the precursor ion on one side and a list of m/z values and relative intensities I% of the fragments in the experimental spectrum on the other side. Subsequently, Microsoft Excel is used to calculate from the reference sequence a list of the monoisotopic m/z values for the a, a-B, w, and w-B ion series including all possible charge states from 1- up to the charge state of the precursor ion (step 1, Figure 1). Then, the predicted m/z values and those obtained from the experimental spectrum are compared using a Matlab program (step 2a, Figure 1). The comparison yields the number of fragment ions that could be assigned to predicted m/z values (K), and the sum of noncovered a and w positions (M). Noncovered a positions are those in the sequence for which neither a-B nor a fragments of any possible charge state could be identified in the spectrum, whereas the noncovered w positions are positions in the sequence lacking w or w-B fragments. A fragment is assigned if its measured m/z value falls within a tolerance of ± the mass deviation (∆) around the predicted value; otherwise it is not assigned.
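Steps 1 and 2a can be illustrated with a minimal Python sketch. It is not the authors' Excel/Matlab implementation: only the w series is predicted (the a, a-B, and w-B series are omitted), the monoisotopic masses are standard values assumed here rather than taken from the paper, and the function names `w_ions` and `match_fragments` as well as the toy peak list are hypothetical.

```python
# Monoisotopic masses in Da; assumed standard values, not taken from the paper.
PROTON = 1.007276
WATER = 18.010565
# Nucleotide residues (2'-deoxynucleoside monophosphate minus water).
RESIDUE = {"A": 313.0576, "C": 289.0464, "G": 329.0525, "T": 304.0460}


def w_ions(sequence, max_charge):
    """Predict m/z values of w_n ions (3'-terminal fragments).

    Returns a dict mapping (n, z) to m/z for fragment lengths
    n = 1 .. len(sequence) - 1 and negative charge states z = 1 .. max_charge.
    """
    predicted = {}
    for n in range(1, len(sequence)):
        neutral = sum(RESIDUE[base] for base in sequence[-n:]) + WATER
        for z in range(1, max_charge + 1):
            predicted[(n, z)] = (neutral - z * PROTON) / z
    return predicted


def match_fragments(predicted, peaks, tolerance):
    """Assign predicted m/z values to measured peaks within +/- tolerance.

    peaks is a list of (m/z, relative intensity in %) tuples.
    Returns (assignments, m_uncovered): assignments holds one
    (n, z, mass deviation, intensity) tuple per assigned fragment, so
    K = len(assignments); m_uncovered counts fragment lengths for which
    no charge state was found (here for the w series only).
    """
    assignments = []
    covered = set()
    for (n, z), mz in predicted.items():
        in_window = [p for p in peaks if abs(p[0] - mz) <= tolerance]
        if in_window:
            nearest = min(in_window, key=lambda p: abs(p[0] - mz))
            assignments.append((n, z, abs(nearest[0] - mz), nearest[1]))
            covered.add(n)
    all_lengths = {n for n, _ in predicted}
    return assignments, len(all_lengths - covered)


# Toy spectrum for the 4-mer TGGC; the peak values are invented for illustration.
toy_peaks = [(306.10, 40.0), (635.20, 10.0)]
assigned, m_uncovered = match_fragments(w_ions("TGGC", 1), toy_peaks, tolerance=0.3)
print(len(assigned), m_uncovered)
```

In this toy case w1 and w2 are assigned and w3 is not, so K = 2 and M = 1. A fuller version would add the complementary 5'-terminal series so that gaps in one series can be bridged by the other, as described in the text.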
Values for K and M are determined for ∆ values ranging from 0.2 to 0.8 in 0.1 steps (step 2b, Figure 1). Subsequently, the optimal ∆ value is selected on the basis that the number of false positive (∆ too large) and false negative (∆ too small) assignments has to be minimized (step 2c, Figure 1). The optimum ∆ is readily deduced from a plot of M versus ∆ as the ∆ value just before the M curve reaches a constant level (Figure 2). The next step of the algorithm involves the calculation of a match factor MF representing the quality of the assigned m/z signals in terms of mass deviation and intensity (step 3, Figure 1). The match factor is defined as the sum of all quotients of mass deviation ∆ and intensity I (since the input value for intensity is given in percent, the value is multiplied by 100), averaged over the total number of assignments K, giving the equation MF = (1/K) ∑(100 × ∆/I%) (Figure 1). The three parameters match factor MF, number of assigned fragments K, and number of noncovered positions M are combined to describe the fitness FS of predicted m/z values and measured spectrum according to the equation FS = aMF - bK + cM. It is obvious that the fitness becomes smaller with smaller mass differences, higher signal intensities, larger number of assigned fragments, and lower number of noncovered positions. The three coefficients a, b, and c are used to weight the terms related to the individual features of the spectrum. With the value for a chosen as 1 by definition, b and c are empirically determined. The coefficient b is selected in such a way that the ratio of the first and second terms, namely MF/(bK), equals 1.3:1; thus b = MF/K × 1.3 (step 4, Figure 1). A value of 0.1 for the


Figure 1. Outline of the steps involved in comparative sequencing by MS/MS.

coefficient c has been found to be appropriate for all calculations in this study. Insertion of the coefficients in the equation above yields FS as (step 5, Figure 1):

$$FS = \frac{1}{K}\sum\left(100 \times \frac{\Delta}{I_{\%}}\right) - bK + 0.1M$$
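The match factor and fitness can be written out as a short Python sketch. The function names are hypothetical, and one reading of the text is an assumption: the expression for b is evaluated left to right as b = 1.3 × MF/K, which makes the fitness of the reference sequence equal to -0.3 MF + 0.1 M and is therefore compatible with the negative FS values reported in this study. Passing `ratio=1/1.3` gives the other possible reading.

```python
def match_factor(assignments):
    """MF = (1/K) * sum(100 * delta / I%) over the K assigned fragments.

    assignments is a list of (mass deviation, relative intensity in %) pairs.
    """
    k = len(assignments)
    if k == 0:
        return float("inf")  # nothing assigned: worst possible match
    return sum(100.0 * delta / intensity for delta, intensity in assignments) / k


def weight_b(mf_reference, k_reference, ratio=1.3):
    """Weight b, computed once from the original reference sequence (step 4).

    Assumption: b = ratio * MF / K (left-to-right reading of the text).
    """
    return ratio * mf_reference / k_reference


def fitness(assignments, m_uncovered, b, a=1.0, c=0.1):
    """FS = a*MF - b*K + c*M; smaller values indicate a closer match."""
    return a * match_factor(assignments) - b * len(assignments) + c * m_uncovered


# Invented example: two assigned fragments and three noncovered positions.
reference_assignments = [(0.1, 50.0), (0.2, 100.0)]
b = weight_b(match_factor(reference_assignments), len(reference_assignments))
print(fitness(reference_assignments, 3, b))
```

Keeping `b` as an explicit argument of `fitness` mirrors the requirement that the weight calculated for the original reference sequence is reused unchanged for every varied sequence.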

This first calculated FS value represents a measure for the correspondence of the measured MS/MS spectrum and the set of fragments predicted from the reference sequence. To find or exclude alternative sequences showing a better correspondence, the reference sequence is systematically varied followed by calculation of new FS values for the mutated sequences. This is done by sequentially exchanging the base at each position by the other three possible nucleobases, while the rest of the sequence is kept constant (step 6a, Figure 1). Assuming only one base substitution, 3n calculations of FS for an n-mer sequence are performed (steps 6b-e, Figure 1). At this point, it is important to emphasize that the weight factor b is kept constant for all varied sequences and equals the one calculated in step 4 of the algorithm from the match factor and number of assigned fragments using the original reference sequence. The final output of the algorithm is a matrix showing the FS values of the reference sequence (boldface numbers, Figure 1) and those of the mutated sequences at the different positions in the sequence (Figure 1). For instance, the FS(T) value in line 1 of the matrix (-5.53) represents the FS of the reference sequence, TGGC in this case. FS(A) in the first line is the value for the mutated sequence obtained upon exchange of T for an A in position 1 (15.93). It can be seen that in this example the lowest FS values were always obtained for the reference sequence leading to the conclusion that the best correspondence is between experimental spectrum and reference

Figure 2. Selection of an optimal value for the mass deviation ∆ from the number of noncovered positions M. The curve was deduced from the MS/MS spectrum of the 20-mer of sequence 1 (Table 1).

sequence. Base substitutions would be indicated by FS values that are smaller for mutated sequences than those for the reference sequence as shown in the examples discussed below.

Comparative Sequencing of a 20-Mer Oligodeoxynucleotide. The success of MS/MS sequencing largely depends on the purity of the sample that is presented to the mass spectrometer. On-line hyphenation of MS/MS to chromatographic separation was chosen as the sample preparation method, because it not only efficiently removes adducted cations and other low molecular mass contaminants coming from, for example, solid-phase oligodeoxynucleotide synthesis or PCR but also fractionates mixtures of nucleic acids prior to their mass spectrometric investigation. Separation was accomplished by ion-pair reversed-phase high-performance liquid chromatography (IP-RP-HPLC) in a monolithic, 200-µm-i.d. poly(styrene/divinylbenzene) capillary column by application of gradients of acetonitrile in 25 mM triethylammonium or butyldimethylammonium bicarbonate.18,27 Figure 3 illustrates the MS/MS spectrum of sequence 2 that is derived from sequence 1 upon the mutation of an A to T at position 13. The spectrum was extracted as the average of 22 scans from the peak eluting at 5.4 min from the monolithic separation column, and the mass spectral data served as input for the comparative sequencing algorithm. Using the nonmutated sequence 1 as the reference sequence, the matrix of FS values was calculated as outlined in the previous section. The result is conveniently presented in a diagram, where the FS values for the different bases, distinguished by the letters A, C, G, and T, are plotted versus the position in the oligodeoxynucleotide sequence. It can be deduced from Figure 4a that, for the first six bases, the reference sequence represented the best fit for the experimental data. However, starting from position 7, the FS values for other bases indicated a better fit. In particular, T-for-A exchanges gave significantly smaller fitness values, strongly indicating an A-to-T mutation in the sequence. The exact position of the mutation was spotted by the minimum FS value (-6.75 at position 13, -6.72 at position 11) clearly identifying the mutation as A to T at position

(27) Huber, C. G.; Krajete, A. Anal. Chem. 1999, 71, 3730-9.

Figure 3. MS/MS spectrum extracted from the IP-RP-HPLC-MS/MS analysis of a 20-mer oligodeoxynucleotide at 35% collision energy. Column, monolithic PS-DVB, 60 × 0.20 mm i.d.; mobile phase, (A) 25 mM TEAB, pH 8.40, (B) 25 mM TEAB, pH 8.40, 20% acetonitrile; linear gradient, 5-40% B in 10.0 min; flow rate, 3.0 µL/min; temperature, 50 °C; product ions of the 4- charged species at m/z 1540.1, 4.0 amu isolation width, 35% relative collision energy; scan, 420-2000 amu; electrospray voltage, 3.4 kV; sheath gas, 40 units; sheath liquid, 3.0 µL/min acetonitrile; sample, 506 ng of raw product of sequence 2.

Figure 4. Sequence correlation diagrams of fitness versus position for a 20-mer oligodeoxynucleotide resulting from comparison of (a) sequence 1 and (b) sequence 2 to the spectrum of Figure 3.

13, which is in complete accordance with the theoretical base sequence (sequence 2, Table 1). If sequence 2 is used as the reference sequence in the algorithm together with the mass spectral data of Figure 3, the diagram of Figure 4b is obtained. As expected, the reference sequence yields the minimum FS values throughout the whole sequence, because sequence 2 already incorporates the A-to-T mutation. The diagram confirms the identity of reference sequence 2 and the sequence of the investigated oligodeoxynucleotide. Consequently, both previous examples demonstrate that the fully automatable algorithm can be used to easily confirm the identity of an unknown sequence with a reference sequence as well as to detect point mutations in comparison to a given reference sequence.
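The systematic variation of the reference sequence (3n single-base exchanges for an n-mer) and the search for the minimum FS value can be sketched in Python as follows. The helper names are hypothetical, and `toy_score` is only a stand-in for the real fitness calculation: it counts mismatches against sequence 2 so that the example runs without a spectrum.

```python
BASES = "ACGT"
SEQUENCE_1 = "GACAGGAAAG" "ACATTCTGGC"  # Table 1, sequence 1
SEQUENCE_2 = "GACAGGAAAG" "ACTTTCTGGC"  # Table 1, sequence 2 (A to T at position 13)


def single_substitutions(reference):
    """Yield (position, new base, mutated sequence) for all 3n exchanges."""
    for i, base in enumerate(reference):
        for alt in BASES:
            if alt != base:
                yield i + 1, alt, reference[:i] + alt + reference[i + 1:]


def fitness_matrix(reference, score):
    """Return {position: {base: FS}}; the reference base carries FS(reference)."""
    fs_reference = score(reference)
    matrix = {i + 1: {base: fs_reference} for i, base in enumerate(reference)}
    for position, alt, mutant in single_substitutions(reference):
        matrix[position][alt] = score(mutant)
    return matrix


def best_candidate(matrix):
    """Return (position, base, FS) of the lowest FS value in the matrix."""
    entries = ((p, b, fs) for p, row in matrix.items() for b, fs in row.items())
    return min(entries, key=lambda entry: entry[2])


def toy_score(sequence):
    """Stand-in for FS: number of mismatches to the 'measured' sequence 2."""
    return sum(x != y for x, y in zip(sequence, SEQUENCE_2))


matrix = fitness_matrix(SEQUENCE_1, toy_score)
print(len(list(single_substitutions(SEQUENCE_1))))  # 3n = 60 for the 20-mer
print(best_candidate(matrix))
```

With the stand-in score the minimum is found for T at position 13, matching the mutation discussed in the text. In a real application `score` would wrap the spectrum comparison and reuse the weight b determined for the original reference sequence.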


Figure 6. Selection of optimal collision energy based on the minimum energy necessary to effect complete fragmentation of oligodeoxynucleotides of varying length.

Figure 5. Sequence correlation diagrams for 39-mer oligodeoxynucleotides at relative collision energies of (a) 35 and (b, c) 15%. Experimental conditions: mobile phase, (A) 25 mM TEAB, pH 8.40, (B) 25 mM TEAB, pH 8.40, 20% acetonitrile; linear gradient, 10-60% B in 10.0 min; flow rate, 3.0 µL/min; product ions of the 7- charged species at m/z 1721.6, 4.0 amu isolation width, 35 and 15% relative collision energy; scan, 470-2000 amu; sample, 603 ng of raw product of sequence 3; other conditions as in Figure 3.

Dependence of Optimal Fragmentation Energy on Sequence Length. In the course of investigations about the applicability of the algorithm to sequences longer than 20 nucleotides, we found that collision energy had a major impact on the ability to interpret the MS/MS spectra. For MS/MS experiments in the ion trap, a relative collision energy of 35% was usually applied to induce fragmentation of oligodeoxynucleotides up to the 20-mer, because the number of fragments that could be assigned in the spectra reached a maximum between 30 and 40% relative collision energy. However, attempts to apply the sequencing algorithm to the MS/MS spectrum of a 39-mer were not successful until the relative collision energy was reduced from 35 to 15%. Figure 5 compares the fitness diagrams resulting from application of the sequencing algorithm to sequence 3 as reference sequence and the MS/MS spectra of sequence 3 obtained at 35 and 15% relative collision energy, respectively. The spectrum recorded at 35% relative collision energy did not match the reference sequence at all (Figure 5a) and implied a G-to-A mutation at position 11. However, recalculation of the fitness matrix after incorporation of the mutation into the reference sequence did not result in any match with the experimental spectrum (diagram not shown). At 15% relative collision energy, however, a clear correspondence between spectrum and reference sequence was obtained, proving that the sequencing algorithm can be successfully applied to spectrum interpretation of longer nucleic acid sequences under optimized fragmentation conditions (Figure 5b). Furthermore, an A-to-T mutation introduced into the reference sequence 3 at position 19 could be readily detected at 15% relative collision energy (Figure 5c). In this case, incorporation of the mutation into the reference sequence (basically giving the original sequence) and recalculation of the fitness matrix resulted in a perfect match between spectrum and reference sequence (Figure 5b), providing strong evidence for the correctness of the detected mutation. Accordingly, the general strategy for the detection of point mutations by the sequencing algorithm includes two stages: (1) calculation of the fitness matrix and identification of type and position of a mutation by the minimum FS value and (2) incorporation of the mutation into the reference sequence and recalculation of the matrix. Recalculation with the mutated sequence must yield a complete match in order to validate the mutation. Otherwise, mutations indicated by FS values ranked second, third, etc., lowest are considered for recalculating the fitness matrix until a perfect match between mutated reference sequence and MS/MS spectrum is obtained. The failure to interpret the MS/MS spectrum of the 39-mer recorded at 35% relative collision energy is most probably due to secondary or alternative fragmentation reactions viable at higher collision energy21 and giving rise to fragments that may be misinterpreted as regular a-B or w ions. Thus, the optimum collision energy has to be high enough to generate fragments covering the sequence as completely as possible by a-B and w ions but low enough to suppress secondary or alternative fragmentation pathways.
This compromise was empirically found to be in the region of the collision energy just sufficient for complete fragmentation of the precursor ion. The minimum relative collision energy (Emin) required to effect complete fragmentation of the precursor ions of oligodeoxynucleotides of different size was determined using IP-RP-HPLC-MS/MS, and the result is shown in Figure 6. It is seen that Emin decreases from 30 to 17% in the size range of 4-20 nucleotides (nt) and goes through a minimum between 20 and 30 nt to increase again for oligodeoxynucleotides longer than 30 nt. In all subsequent MS/MS experiments, the optimum relative collision energy was deduced from the graph in Figure 6 and set within a window of Emin ± 5% at the corresponding size of the oligodeoxynucleotide to be fragmented.
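The two-stage strategy for validating a point mutation described in this section (rank candidate mutations by FS, incorporate the best one into the reference, and require a complete match on recalculation) can be sketched in Python. This is an illustrative reading rather than the authors' program: the function names are hypothetical, a "complete match" is taken to mean that no single-base exchange scores lower than the sequence itself, and `toy_score` replaces the real fitness calculation with a mismatch count against sequence 2 of Table 1.

```python
BASES = "ACGT"
SEQUENCE_1 = "GACAGGAAAG" "ACATTCTGGC"  # Table 1, sequence 1
SEQUENCE_2 = "GACAGGAAAG" "ACTTTCTGGC"  # Table 1, sequence 2


def substitutions(sequence):
    """Yield (position, new base, mutated sequence) for all single-base exchanges."""
    for i, base in enumerate(sequence):
        for alt in BASES:
            if alt != base:
                yield i + 1, alt, sequence[:i] + alt + sequence[i + 1:]


def is_complete_match(sequence, score):
    """True if no single-base exchange yields a lower FS than the sequence."""
    fs = score(sequence)
    return all(score(mutant) >= fs for _, _, mutant in substitutions(sequence))


def validate(reference, score):
    """Return ('identical' | 'mutation' | 'unresolved', position, base)."""
    if is_complete_match(reference, score):
        return ("identical", None, None)
    fs_reference = score(reference)
    # Stage 1: rank all exchanges by FS, lowest first.
    ranked = sorted(
        (score(mutant), position, alt, mutant)
        for position, alt, mutant in substitutions(reference)
    )
    # Stage 2: accept the first candidate whose recalculated matrix matches completely.
    for fs, position, alt, mutant in ranked:
        if fs < fs_reference and is_complete_match(mutant, score):
            return ("mutation", position, alt)
    return ("unresolved", None, None)


def toy_score(sequence):
    """Stand-in for FS: number of mismatches to the 'measured' sequence 2."""
    return sum(x != y for x, y in zip(sequence, SEQUENCE_2))


print(validate(SEQUENCE_1, toy_score))
print(validate(SEQUENCE_2, toy_score))
```

The "unresolved" outcome corresponds to cases such as the 39-mer spectrum recorded at 35% relative collision energy, where incorporating the indicated mutation did not lead to a match.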

Figure 7. Fully automated IP-RP-HPLC-ESI-MS/MS analysis of failure sequences in the raw product of a 20-mer oligodeoxynucleotide prepared by solid-phase synthesis. Mobile phase, (A) 25 mM BDMAB, pH 8.40, (B) 25 mM BDMAB, pH 8.40, 40% acetonitrile; linear gradient, 5-12% B in 3.0 min, 12-20% B in 10 min; flow rate, 2.0 µL/min; temperature, 50 °C; precursor ions for MS/MS scans were automatically selected as the most intense ions from a full-scan spectrum in the data-dependent mode of operation of the mass spectrometer, 4.0 amu isolation width, 27-20% relative collision energy, programmed as indicated in the chromatogram; sample, 222 ng of raw product of sequence 1; other conditions as in Figure 3.

Application To Identify Failure Sequences in a Synthetic 20-Mer. A distinct advantage of IP-RP-HPLC as the sample preparation method is its ability to fractionate mixtures of nucleic acids on-line before structural investigation by MS/MS. Figure 7 illustrates, as an example, the analysis of the raw product from solid-phase synthesis of a 20-mer of sequence 1. In this run, the relative collision energy was programmed to change as a function of time in order to account for the different sizes of the eluting oligodeoxynucleotides. Moreover, the mass spectrometer was operated in the data-dependent mode, which means that the m/z values for candidate precursor ions to be fragmented were automatically selected from a full MS scan performed immediately before an MS/MS scan. Significant amounts of as many as 12 failure sequences ranging in size from 5- to 18-mers were readily detected under the fully automated analysis conditions. The MS/MS spectra extracted from the chromatogram were compared to truncated sequences of sequence 1 as reference sequence, and the algorithm was applied to establish the identity of the byproducts. All sequences were unambiguously correlated, and the byproducts were indeed failure sequences of the 20-mer. Figure 8 gives some examples for the fitness diagrams of the identified failure sequences. Neither mutations nor internal sequences were detected, indicating that low efficiencies in the coupling reactions during solid-phase synthesis were the reason for the occurrence of these byproducts.

Sequencing and Localization of a Deletion in a 52-Mer Oligodeoxynucleotide. As already outlined in the introduction, an interesting area of application of comparative sequencing is the detection of sequence variation in genomic DNA segments amplified with high specificity by the PCR. Since PCR primers usually have lengths around 20 nt, an amplified PCR product has to be longer than 40 bp in order to contain relevant sequence information.
We chose a 51-mer oligodeoxynucleotide to evaluate the performance of the comparative sequencing method for the sequencing of relatively long DNA molecules. Moreover, the 51-mer was used to demonstrate both the successful identification and the localization of a deletion by the sequencing algorithm. A hypothetical 52-mer sequence (theoretical molecular mass, 15909.54) was generated by introducing an additional G at position 26 of sequence 4. Measurement of the molecular mass of sequence 4 in a full-scan MS experiment yielded a mass of 15580.0, differing from the theoretical mass of the 52-mer by 329.54 mass units. Although this mass difference is consistent with the deletion of one deoxyguanosine, the exact position of the deletion could not be inferred from the intact molecular mass measurement. Consequently, in a second step, the 51-mer was subjected to fragmentation, and the resulting MS/MS spectrum was compared to the 52-mer as reference sequence, yielding a fitness value of -15.84, as indicated by the dotted line in Figure 9a. Subsequently, the 52-mer sequence was varied by sequentially deleting all G’s present in the sequence. The resulting 51-mer reference sequences were compared to the experimental spectrum, and the lowest fitness value of -27.47 located the deletion at position 26 in the 52-mer sequence (Figure 9a). The correct sequence of the 51-mer was corroborated by deletion of a G at position 26 in the 52-mer and comparison of the new 51-mer reference sequence with the MS/MS spectrum (Figure 9b).

Figure 8. Sequence correlation diagrams for the identification of failure sequences in the raw product of a synthetic 20-mer oligodeoxynucleotide.

Figure 9. Comparative sequencing for the detection and localization of a deletion: fitness diagrams with (a) the 52-mer as reference sequence and (b) the 51-mer with a deleted G at position 26. Experimental conditions: mobile phase, (A) 25 mM BDMAB, pH 8.40, (B) 25 mM BDMAB, pH 8.40, 40% acetonitrile; linear gradient, 5-70% B in 10.0 min; flow rate, 2.0 µL/min; product ions of the 9- charged species at m/z 1729.7, 4.0 amu isolation width, 17% relative collision energy; scan, 470-2000 amu; sample, 635 ng of raw product of sequence 4; other conditions as in Figure 3.

CONCLUSIONS Electrospray ionization tandem mass spectrometry under optimized fragmentation conditions in combination with a computer-aided comparative sequencing algorithm is applicable to the fully automatable sequence analysis of nucleic acids ranging in size from a few to more than 50 nucleotides. The comparison of the experimental tandem mass spectrum with a set of fragments predicted from a known reference sequence facilitates the proof of identity of the investigated sequence with the reference sequence or the detection of sequence alterations in the form of insertions, deletions, and point mutations. The major advantage of the comparative sequencing algorithm over methods based on the identification of complete ion series in the tandem mass spectrum (11) lies in its considerable tolerance for missing fragment ions, which significantly extends the applicable size range of tandem mass spectrometry for nucleic acid sequence analysis. At this stage of the study, only single point mutations and insertions/deletions have been successfully characterized, but preliminary results have shown that the utilization of additional information, such as partial sequences or base compositions compatible with the intact molecular mass (28), enables the characterization of multiple mutations and even de novo sequencing by tandem mass spectrometry. Future investigations will focus on a further extension of the size range to 80-100-mer sequences as well as on the incorporation of all steps of the sequencing algorithm into a single, fully automated software program.

ACKNOWLEDGMENT This work was supported by a grant from the Austrian Science Fund (P13442-PHY).

Received for review August 14, 2001. Accepted October 30, 2001. AC015595A


(28) Muddiman, D. C.; Anderson, G. A.; Hofstadler, S. A.; Smith, R. D. Anal. Chem. 1997, 69, 1543-9.