Anal. Chem. 2000, 72, 2331-2336
Accelerated Articles
STAT: A Saccharide Topology Analysis Tool Used in Combination with Tandem Mass Spectrometry
Sara P. Gaucher, Jeff Morrow, and Julie A. Leary*
Department of Chemistry, University of California, Berkeley, California 94720
Sequential stages of mass spectrometry (MSn) have the potential to provide a great deal of structural information in glycan analysis. The saccharide topology analysis tool (STAT) presented here is a Web-based computational program that can quickly extract sequence information from a set of MSn spectra for an oligosaccharide of up to 10 residues. After information such as precursor ion mass, possible monosaccharide moieties, charge carrier, and product ion mass has been input, all possible connectivities are generated and evaluated against the MSn data. The list of possible structures is given a rating based on the likelihood that it is the correct sequence. Examples are given to demonstrate the feasibility of applying STAT to MSn data generated from bacterial lipooligosaccharides and an N-linked glycan. The major advantage of STAT is that the list of possible structures is generated quickly and the rating system pushes the more likely structures to the top of the list. Combining the data generated by STAT with data on the branching patterns of the glycan serves to eliminate all but a handful of structures. These remaining structures could then be used to guide further structural analysis.
It is well established that oligosaccharide moieties play prime roles in biological processes.1 Unfortunately, there is no one analytical method that can be used exclusively to determine the structures of these important and complex molecules.
* Corresponding author: (phone) (510) 643-6499; (fax) (510) 642-9295; (email) [email protected].
(1) Varki, A., Cummings, R., Esko, J., Eds. Essentials of Glycobiology; Cold Spring Harbor Laboratory Press: Cold Spring Harbor, NY, 1999.
10.1021/ac000096f CCC: $19.00 Published on Web 04/29/2000
© 2000 American Chemical Society
Mass spectrometry, and in particular sequential stages of mass spectrometry (MSn), can be used in glycan analysis1,2 even when only microgram amounts of sample are available. Also, samples containing microheterogeneity within the glycan structure can be examined because the ion of interest is isolated in the gas phase prior to analysis. Furthermore, MSn spectra are convenient to obtain, particularly with the advent of ion trap instrumentation.
There is a wealth of information available within a set of MSn spectra. The loss of a single monosaccharide unit from the intact oligosaccharide indicates an “end” unit (either from the reducing or nonreducing end(s)). The product ions themselves are smaller oligosaccharides contained within the original structure. Therefore, there is great potential in MSn data for obtaining topological information for an unknown oligosaccharide. There are several programs available that generate probable connectivities of the amino acids in a peptide given CID data;3 however, this technology is largely absent in the analysis of oligosaccharides.
Our goal was to create a tool that could quickly extract sequence information from a set of MSn spectra. The saccharide topology analysis tool (STAT)4 is a Web-based computational program that can be accessed from any computer with an Internet browser once the software has been installed on the client’s Web server. After information such as precursor ion mass, charge carrier, and product ion mass has been input, all possible connectivities are generated and evaluated against the MSn data. Essentially, the composition of each product ion is calculated and is used to restrict possible structures; i.e., a product ion corresponding to Hex3 + Na+ implies that three hexoses5 must be connected in the original structure. For an oligosaccharide containing seven or eight monosaccharide units, a list of all the possible saccharide sequences can be generated in a few seconds. Such a list, quickly and easily obtained from routine MSn experiments, can guide subsequent analysis to further narrow the choices. In this paper, we describe STAT and demonstrate the feasibility of our approach.
(2) (a) Burlingame, A. L.; Boyd, R. K.; Gaskell, S. J. Anal. Chem. 1994, 66, 634R-638R. (b) Harvey, D. J. J. Chromatogr., A 1996, 720, 429-446. (c) König, S.; Leary, J. A. J. Am. Soc. Mass Spectrom. 1998, 9, 1125-1134 and references therein. (d) Costello, C. E. Curr. Opin. Biotechnol. 1999, 10, 22-28.
(3) Caporale, C. J. Pept. Res. 1998, 52, 421-429.
(4) STAT is available from the University of California at Berkeley, Office of Technology Licensing, 2150 Shattuck Ave., Suite 510, Berkeley, CA 94720-1620; (phone) (510) 643-7201; (e-mail) http://otl.berkeley.edu.
Analytical Chemistry, Vol. 72, No. 11, June 1, 2000
EXPERIMENTAL SECTION
Software. Algorithm. Possible compositions of the input molecule are determined given the component monomers,6 the total mass of the molecule, the mass of the charge carrier, a margin of error for the mass, and whether the oligosaccharide contains a reducing terminus. STAT first computes an adjusted total mass by subtracting the mass of the charge carrier and, if the saccharide is reducing, subtracting a further 18 Da (the mass of water). Then, using a version of the well-known “knapsack algorithm”,7 STAT computes all possible combinations of the selected monomers that sum to the adjusted total mass, plus or minus the specified margin of error. The possible compositions of each product ion mass input from MSn spectra are computed by essentially the same algorithm. The only difference is that for product ions the possible compositions are computed twice: once assuming that the masses correspond to oligomers with reducing termini and once assuming that the oligomers are “anhydro” species.
The final phase of the algorithm involves the computation of all possible topologies that contain all of the substructures implied by the product ions.8 In general terms, the algorithm first computes each possible topology given the molecule’s composition. It then computes a mathematical requirement corresponding to each substructure and eliminates any topologies that do not satisfy all of the restrictions.
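The composition step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not STAT's own code: it uses plain recursive enumeration in place of the cited knapsack variant, and the function name `compositions`, the nominal integer residue masses, and the 0.5 Da default tolerance are all choices made for the example.

```python
# Nominal residue masses in Da (monosaccharide minus water). These are
# assumed values for illustration; the paper does not list STAT's mass table.
RESIDUE_MASS = {"Hex": 162.0, "Hep": 192.0, "KDO": 220.0, "HexNAc": 203.0}
WATER = 18.0


def compositions(ion_mass, monomers, carrier=23.0, tol=0.5, reducing=False):
    """List every monomer combination whose residue masses sum to the
    adjusted mass (ion mass minus charge carrier, minus water if reducing),
    within +/- tol. Each result maps monomer name -> count."""
    target = ion_mass - carrier - (WATER if reducing else 0.0)
    names = sorted(monomers)
    found = []

    def extend(index, remaining, counts):
        # All monomer types assigned: keep the combination if it fits.
        if index == len(names):
            if abs(remaining) <= tol:
                found.append({n: c for n, c in zip(names, counts) if c})
            return
        mass = RESIDUE_MASS[names[index]]
        count = 0
        # Try 0, 1, 2, ... copies of this monomer while they still fit.
        while count * mass <= remaining + tol:
            extend(index + 1, remaining - count * mass, counts + [count])
            count += 1

    extend(0, target, [])
    return found


# Sodiated precursor at m/z 1143 (carrier = 23 for Na+), no water subtracted.
print(compositions(1143, ["Hex", "Hep", "KDO"]))
# Sodiated reducing glycan at m/z 1663 built from Hex and HexNAc.
print(compositions(1663, ["Hex", "HexNAc"], reducing=True))
```

With these nominal masses, the first call returns the single composition Hex2Hep3KDO and the second returns Hex5HexNAc4. A deprotonated negative ion would be handled by passing `carrier=-1`.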
More specifically, the algorithm first computes all possible topologies from the chosen composition information by combining algorithms for generating free trees9 and generating multiset permutations10 and removing duplicates by using a tree isomorphism algorithm.11 This generates a list of all possible vertex-colored tree structures,12 which is the mathematical equivalent of a list of all possible topologies of the given molecule.
There is one assumption implicit in the calculations: because other structures are not biologically relevant, all trees listed in the output have only vertexes of degree13 four or lower. Equally important is that, in general, the algorithm will not be able to calculate a monosaccharide composition for an ion resulting from cross-ring cleavage. Rather, a warning will appear to indicate that “[this] product ion...cannot be justified with the given structure and...margin of error,” and these ions will not be included when topologies for the ion of interest are calculated. However, the user must be vigilant in identifying possible cross-ring cleavage ions, as they may occasionally be isobaric with a possible monosaccharide composition. (In addition, these ions can be used later to obtain information on linkage position.14) An aid to interpretation is the fact that sequential stages of MSn on successive product ions will eventually identify and deconvolute the genesis of the cross-ring fragments. This also assists in eliminating misidentified ions, which might rarely be produced from monosaccharide rearrangement.15
In many cases, the problem of whether a given tree meets a specific requirement can be reduced to checks of adjacency between monomers. In graph-theoretical terms, these requirements may be expressed as the existence of a connected subgraph consisting of a specified color set. Thus, STAT enforces each requirement by checking every connected subgraph of the given tree for the correct color set.
Once a list of topologies meeting all the requirements has been found, each element in the list is given a numerical rating based on the likelihood that it is the correct topology, with a lower rating indicating a more likely structure. To find a particular topology’s rating, the algorithm begins by assigning a score of zero. It then examines every requirement derived from the product ions. For each requirement, the algorithm calculates the minimum number of bond cleavages needed to separate the corresponding subgraph from the rest of the topology and adds that value to the topology’s rating.
(5) The following abbreviations are used in this article: Hex, hexose; Hep, heptose; Hxn, N-acetylhexosamine; KDO, 2-keto-3-deoxyoctulosonic acid.
(6) The currently allowed monomers are hexose, fucose, N-acetylhexosamine, xylose, N-acetylneuraminic acid, heptose, KDO, and hexosamine. Other monomers could be added to the program easily.
(7) Winston, W. L. Introduction to Mathematical Programming: Applications and Algorithms; Duxbury Press: Belmont, CA, 1995.
(8) A more detailed description of the algorithm used is given by Joel Sokol and Jeff Morrow, manuscript in preparation.
(9) Wright, R. A.; Richmond, B.; Odlyzko, A.; McKay, B. D. SIAM J. Comput. 1986, 15, 540-548.
(10) Korsh, J.; Lipschutz, S. J. Algorithms 1997, 25, 321-335.
(11) Aho, A. V.; Hopcroft, J. E.; Ullman, J. D. The Design and Analysis of Computer Algorithms; Addison-Wesley: Reading, MA, 1974.
(12) For the purposes of this article, a graph is a collection of vertexes V and a set of undirected edges E. A tree is a connected graph with no cycles. Since oligosaccharide molecules can be represented as trees with colored vertexes (one color for each monosaccharide type in the molecule), many efficient algorithms are available for their study. See, for example: Anderson, S. S. Graph Theory and Finite Combinatorics; Markham: Chicago, 1970.
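The topology-generation step can be illustrated with a small brute-force sketch. It is not the method the authors cite (free-tree generation combined with multiset permutations); instead it assumes a simpler route: decode every Prüfer sequence into a labeled tree, fix one monosaccharide type per label, discard trees with a vertex of degree above four, and remove duplicates with a canonical string. The function names are hypothetical.

```python
from itertools import product


def prufer_to_edges(seq, n):
    """Decode a Prufer sequence of length n - 2 into the edges of a labeled tree."""
    degree = [1] * n
    for v in seq:
        degree[v] += 1
    edges = []
    for v in seq:
        # Attach the smallest remaining leaf to the next sequence entry.
        leaf = min(u for u in range(n) if degree[u] == 1)
        edges.append((leaf, v))
        degree[leaf] -= 1
        degree[v] -= 1
    u, w = [x for x in range(n) if degree[x] == 1]
    edges.append((u, w))
    return edges


def canonical(edges, colors):
    """Root the tree at every vertex, encode each rooted tree with sorted
    child encodings, and keep the smallest string. Isomorphic colored
    trees get the same string (color names must not contain parentheses)."""
    adj = [[] for _ in colors]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    def encode(v, parent):
        kids = sorted(encode(w, v) for w in adj[v] if w != parent)
        return "(" + colors[v] + "".join(kids) + ")"

    return min(encode(root, None) for root in range(len(colors)))


def topologies(colors, max_degree=4):
    """Map canonical form -> edge list for every distinct colored tree.
    colors[i] is the monosaccharide type of vertex i; needs len(colors) >= 2."""
    n = len(colors)
    distinct = {}
    for seq in product(range(n), repeat=n - 2):
        # A vertex appearing k times in the sequence has degree k + 1.
        if any(seq.count(v) + 1 > max_degree for v in set(seq)):
            continue
        edges = prufer_to_edges(seq, n)
        distinct.setdefault(canonical(edges, colors), edges)
    return distinct


# Two hexoses and one heptose: Hex-Hex-Hep and Hex-Hep-Hex.
print(len(topologies(["Hex", "Hex", "Hep"])))
```

The loop visits n^(n-2) labeled trees, so this sketch is only practical for small n; it is meant to show what a "vertex-colored tree" list is, not to match the efficiency of the algorithms in refs 9-11.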
Interface. The Web software utilizes the Apache Web server (http://www.apache.org) and the PHP scripting language (http://www.php.net), both of which are freely distributable, Open Source software. The STAT server software is written partly in PHP (for the Web front end) and partly in POSIX-compliant C++ (for the computation-intensive portions). The software has been tested on the Linux and Irix platforms.
Methods. Materials. Bacterial lipooligosaccharides from Haemophilus influenzae strains 2019 and 281.25 were obtained from Dr. Bradford Gibson, Department of Pharmaceutical Chemistry, University of California at San Francisco. The commercially available N-linked oligosaccharide NA2 was donated by Genetics Institute (Andover, MA). HPLC grade acetonitrile was purchased from Sigma (St. Louis, MO). Millipore H2O (18 MΩ) was obtained by using a MilliQ Ultrafiltration system.
Mass Spectrometry. Oligosaccharides were dissolved in 50/50 acetonitrile/water at a concentration of 5-50 pmol/µL for analysis by electrospray ionization (ESI). Samples were directly infused into the instrument at a flow rate of 0.5-2 µL/min, and MS and MSn spectra were recorded using a quadrupole ion trap (LCQ, Finnigan MAT, San Jose, CA). Typical instrument parameters were as follows: sheath gas 40 arbitrary units, auxiliary gas 0 arbitrary units, spray voltage 3-4.5 kV, heated capillary 175 °C, and capillary voltage 35-45 V. Precursor ions were isolated prior to collision-induced dissociation (CID) with the “isolation width” parameter set to 2-4 amu. CID was performed at a q value of 0.25 at the secular frequency of the precursor ion. The collision energy was applied for 30 ms and was set to 30% as defined by Xcalibur software version 1.0 or 1-1.5 V as defined by the LCQ Navigator software version 1.1.
(13) The degree of a vertex equals the number of edges originating from it.
(14) Hofmeister, G. E.; Zhou, Z.; Leary, J. A. J. Am. Chem. Soc. 1991, 113, 5964-5970.
(15) Brull, L. P.; Kovacik, V.; Thomas-Oates, J. E.; Heerma, W.; Haverkamp, J. Rapid Commun. Mass Spectrom. 1998, 12, 1520-1532.
RESULTS AND DISCUSSION
STAT Software. An advantage of STAT is that it runs on a Web server, thereby giving users the convenience of a cross-platform user interface that is probably already quite familiar to them. Once STAT is running on an organization’s Web server, users within the organization may connect to STAT from anywhere in the world using only their Internet browser. The system has been tested with Netscape Navigator versions 3 and 4, and Microsoft Internet Explorer versions 4 and 5 on the Macintosh, Windows, Irix, and Linux platforms.
A minimal amount of preliminary information is required prior to data analysis using STAT. Information about the monosaccharide composition is particularly important. The required information may be obtained by mass spectrometry, chemical assay, or knowledge of the sample origin. To use the program, the user indicates the monosaccharides present in the sample, the oligosaccharide precursor ion that is to be analyzed, the charge carrier, a margin of error, and whether the sample is a reducing sugar or an N-linked sugar (see Figure 1). Negative ions generated by deprotonation are accommodated by entering “-1” for the mass of the charge carrier. STAT then computes all possible combinations of the selected monosaccharides that could add up to the proper mass of the intact oligomer, plus or minus the margin of error.
Figure 1. Step one in data analysis using STAT: Input of preliminary data.
Figure 2. Step two in data analysis using STAT: Selection of precursor ion composition and input of product ion masses.
The user is then asked to choose one of the possible compositions and enter the mass-to-charge ratio (at this time STAT only recognizes singly charged species) for MSn product ions that are to be included in the final topology analysis (see Figure 2). STAT computes all possible compositions of each ion, given the selected composition for the intact oligosaccharide, as illustrated in Figure 3. Again, the user must choose between the possibilities if more than one composition is calculated for a given ion. STAT may also be instructed by the user to ignore an ion if an unlikely composition appears that could correspond to the ion. Finally, the query is submitted, and STAT generates a list of possible structures, which may be categorized and printed as a report such as the one shown in Figure 4. In addition, the user has the option to save computation time by setting an upper limit to the number of structures generated, in case the MSn data are insufficient to narrow the choices to a reasonable number.
Figure 3. Step three in data analysis using STAT: STAT generates possible compositions for product ions. The user then chooses between compositions if more than one is possible or indicates that a product ion should not be used in the subsequent structure calculation.
Figure 4. A report generated by STAT for data analysis of MS2 and MS3 spectra from H. influenzae 2019 lipooligosaccharide.
STAT has been tested on low-end Linux machines (an AMD K6-2 300-MHz processor with 64-MB RAM running Red Hat Linux 5.2, and a Pentium 300-MHz MMX laptop with 32-MB RAM running Red Hat Linux 6.1) and can compute possible structures of an octasaccharide nearly instantaneously. A sample nonasaccharide was computed in ∼1 min, and decasaccharides have been computed in ∼5-10 min. The reason for the increase in computation time is that as the number of monosaccharides within a given structure increases, i.e., from n to n + 1, the number of tree structures generated and requiring evaluation increases exponentially.
The rating mechanism works well, as will be discussed below; however, there are two main caveats. The first is that if a single bond cleavage results in two restrictions being satisfied, then that bond cleavage will be counted twice: once to satisfy the first restriction and then again to satisfy the second. The other caveat is that if the calculation is terminated after, for example, 100 structures, the structure with the “best” rating is only the closest fit to the data among those 100 structures, not among all possible structures.
Application to Real Samples. Bacterial Lipooligosaccharides (LOS). The first example is the major component of the heterogeneous LOS mixture embedded in the outer membrane of the respiratory pathogen H. influenzae strain 2019.16 The first step in data analysis is shown in Figure 1. The sodiated species appears at m/z 1143. This species was selected and allowed to undergo CID. The result of this MS2 experiment is shown in Figure 5A. The product ion at m/z 923 was further selected and subjected to CID as shown in Figure 5B. The information contained in these two spectra alone is enough to determine the most likely connectivity of the oligosaccharide at m/z 1143.
Figure 5. Tandem mass spectra for the oligosaccharide portion of a lipooligosaccharide from H. influenzae 2019: [Hex2Hep3KDO + Na]+. (A) is the MS2 spectrum m/z 1143 →, and (B) is the MS3 spectrum m/z 1143 → 923 →.
If nothing else were known about this sample other than the MS2 and MS3 product ions, all monosaccharides could be selected as possible components. In this case, a list of five possible compositions is generated. If the MS2 spectrum is studied briefly, however, losses of 192 (-heptose), 220 (-KDO), and 162 Da (-hexose) are seen from the precursor ion. Only one out of the five possible compositions corresponds to an oligosaccharide that contains at least one heptose, hexose, and KDO moiety. The composition Hex2Hep3KDO is therefore selected as shown in Figure 2. Alternatively, information about the monosaccharide composition could have been previously obtained through chemical methods. Each MS2 and MS3 product ion is then input (Figure 2), and all possible monosaccharide compositions for these ions are generated (Figure 3). Given these MSn data, there are only four possible structures for the original oligosaccharide, as shown in Figure 4. In fact, the correct structure (structure B in Figure 4) is tied with two other possible topologies as the most likely structure according to the rating system.
Biological data can also be used to eliminate unlikely structures from a list of calculated structures. For example, structures A and D are unlikely because they do not contain the correct KDOHep3 core which is known for LOS from H. influenzae.17 Thus, in this case, knowledge of the biological origins of the sample serves to eliminate half of the possible structures.
(16) Phillips, N. J.; Apicella, M. A.; Griffiss, J. M.; Gibson, B. W. Biochemistry 1992, 31, 4515-4526.
(17) Bernlind, C.; Oscarson, S. J. Org. Chem. 1998, 63, 7780-7788.
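The requirement check and rating described in the Algorithm section, together with the double-counting caveat noted above, can be made concrete with a toy example. The sketch below is an assumption-laden illustration rather than STAT's implementation: a topology is given as an edge list plus one residue name per vertex, each requirement is a composition (name to count), and the helper names `min_cleavages` and `rate` are hypothetical. It tests every vertex subset of the right size, which is adequate only for small molecules.

```python
from collections import Counter
from itertools import combinations


def is_connected(vertices, adj):
    """True if the given non-empty vertex set induces a connected subgraph."""
    vertices = set(vertices)
    start = next(iter(vertices))
    stack, seen = [start], {start}
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w in vertices and w not in seen:
                seen.add(w)
                stack.append(w)
    return seen == vertices


def min_cleavages(edges, colors, requirement):
    """Fewest bonds that must be cut to release a connected piece with the
    required composition, or None if no such piece exists."""
    want = Counter(requirement)
    size = sum(want.values())
    if size == 0:
        raise ValueError("requirement must name at least one residue")
    adj = [[] for _ in colors]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    best = None
    for subset in combinations(range(len(colors)), size):
        if Counter(colors[v] for v in subset) != want:
            continue
        if not is_connected(subset, adj):
            continue
        chosen = set(subset)
        # Bonds with exactly one end inside the piece must be cleaved.
        cuts = sum((u in chosen) != (v in chosen) for u, v in edges)
        if best is None or cuts < best:
            best = cuts
    return best


def rate(edges, colors, requirements):
    """Sum of minimum cleavages over all requirements (lower is better);
    None means the topology violates a requirement and is eliminated."""
    total = 0
    for requirement in requirements:
        cuts = min_cleavages(edges, colors, requirement)
        if cuts is None:
            return None
        total += cuts
    return total


colors = ["Hex", "Hep", "KDO"]
chain = [(0, 1), (1, 2)]       # Hex-Hep-KDO
hex_center = [(0, 1), (0, 2)]  # Hep-Hex-KDO
ions = [{"Hex": 1, "Hep": 1}, {"Hep": 1, "KDO": 1}]
print(rate(chain, colors, ions), rate(hex_center, colors, ions))
```

In this toy case the chain Hex-Hep-KDO satisfies both requirements with a rating of 2, while the topology with Hex in the middle is eliminated because Hep and KDO are not adjacent. Rating the chain against `[{"Hex": 1}, {"Hep": 1, "KDO": 1}]` also gives 2 even though one cleavage produces both pieces, which is the double counting mentioned in the text.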
Table 1. Possible Configurations for the Hex2Hep3 Substructure of the Major Oligosaccharide Component of Lipooligosaccharides from H. influenzae 2019 a
Table 3. Possible Structures Calculated for the Oligosaccharide Portion of a Lipooligosaccharide from Haemophilus influenzae 281.25 That Correlate to Methylation Linkage Analysis Data a
a A lower rating indicates a more likely structure (see text).
Table 2. Product Ions from MS2, MS3, and MS4 Spectra Used in Topology Computation for the Oligosaccharide Component of a Lipooligosaccharide from H. influenzae 281.25 a
product ion, m/z    composition
731                 hex2hep2
749                 hex2hep2
761                 hex1hep3
789                 hex1hep2kdo1
819                 hep3kdo1
923                 hex2hep3
941                 hex2hep3
951                 hex2hep2kdo1
981                 hex1hep3kdo1
1073                hex4hep2
1113                hex3hep2kdo1
1143                hex2hep3kdo1
1247                hex4hep3
1275                hex4hep2kdo1
1305                hex3hep3kdo1
a The precursor ion corresponding to [Hex4Hep3KDO + Na]+ appears at m/z 1467.
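The assignments in Table 2 can be checked by simple arithmetic, which also shows why product ion compositions are computed twice (with and without the mass of water). The sketch below assumes nominal integer masses (residues 162, 192, and 220 Da for Hex, Hep, and KDO; 23 for Na+; 18 for water) and uses hypothetical helper names; it is a consistency check on the published table, not part of STAT.

```python
import re

# Assumed nominal masses in Da: residues (monosaccharide minus water),
# the sodium charge carrier, and water.
RESIDUE = {"hex": 162, "hep": 192, "kdo": 220}
SODIUM, WATER = 23, 18

# m/z -> composition, transcribed from Table 2.
TABLE_2 = {
    731: "hex2hep2", 749: "hex2hep2", 761: "hex1hep3", 789: "hex1hep2kdo1",
    819: "hep3kdo1", 923: "hex2hep3", 941: "hex2hep3", 951: "hex2hep2kdo1",
    981: "hex1hep3kdo1", 1073: "hex4hep2", 1113: "hex3hep2kdo1",
    1143: "hex2hep3kdo1", 1247: "hex4hep3", 1275: "hex4hep2kdo1",
    1305: "hex3hep3kdo1",
}


def residue_sum(composition):
    """Sum of residue masses for a string such as 'hex2hep3kdo1'."""
    pairs = re.findall(r"([a-z]+)(\d+)", composition)
    return sum(RESIDUE[name] * int(count) for name, count in pairs)


def ion_form(mz, composition):
    """Say whether the sodiated ion matches the bare residue sum ('anhydro')
    or the residue sum plus one water ('+H2O'); None if neither fits."""
    base = residue_sum(composition) + SODIUM
    if mz == base:
        return "anhydro"
    if mz == base + WATER:
        return "+H2O"
    return None


for mz, composition in TABLE_2.items():
    print(mz, composition, ion_form(mz, composition))
```

Under these assumptions every entry in the table is accounted for: the ions at m/z 749, 941, and 1073 carry one additional water, and the remaining ions match the bare residue sum plus sodium. This is why m/z 731 and 749, and likewise 923 and 941, share a composition.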
Another method that can be used to narrow the list of possible structures involves performing the search first on the precursor ion (MS2) and then applying the algorithm again, consecutively, on the MS3 data from one of the MS2 substructures. For example, the ion at m/z 923 in the MS2 spectrum shown in Figure 5A corresponds to the composition Hex2Hep3. As shown in Table 1, there are only two configurations calculated using STAT for this Hex2Hep3 oligosaccharide that satisfy the MS3 (m/z 1143 → 923 →) restrictions. Furthermore, the top choice for this “core” (Table 1, rating value 9, structure 1) is contained in only structures A and B of Figure 4. In fact, previous studies16 have confirmed B as the correct structure.
A similar analysis can be performed on the oligosaccharide from an LOS species from H. influenzae strain 281.25.18 Here the sodiated oligosaccharide appears at m/z 1467. If Hex, Hep, and KDO are input as the possible component monosaccharides, the composition is determined to be Hex4Hep3KDO. Product ions from five MSn spectra are input into the program, and compositions are subsequently generated (see Table 2). At this stage, a total of 44 structures is returned (out of a possible 2290 if no restrictions on connectivity had been specified). The possible structures for this octasaccharide can be divided into groups based on the degree and type of each vertex12,13 in the molecule above degree 2. Data generated from a methylation linkage analysis could be used to eliminate most of these structures. In this particular case,18 methylation analysis indicates two types of heptoses having degree 3. Only five structures out of the original 44 match these data, and these are shown in Table 3; therefore, the structure must be one of these five choices. From biological considerations, the “core” structure of this species is known,17 so structures 3a and 4 can be considered unlikely. Finally, a prominent ion at m/z 749 has the composition Hex2Hep2 and most likely arises from a one-bond C cleavage from the precursor. Such a cleavage could only occur from structure 2 (see Table 3). Together these factors lead one to conclude that structure 2 is most likely. In fact, this topology has been confirmed in another publication.18
STAT can support calculations for structures of up to 10 monosaccharide units. At this point, the rating system becomes an even more important resource. For example, ions from a MALDI-PSD spectrum19 of a sodiated decasaccharide Hex4HexNAc2Hep3KDO at m/z 1874 were input into STAT. If no limit is placed on the number of structures generated, a total of 1277 topologies is returned. However, inspection of the first few structures shows that the proposed structure for the intact oligosaccharide appears tied with six other structures for second place (data not shown). Therefore, even though a very large number of structures is generated, the more likely ones are pushed to the top of the list.
N-Linked Glycans. To simplify the output list, the user may designate a particular glycan as N-linked before the calculation is performed. This will automatically eliminate all structures that do not contain the “core” shown in Figure 6. The arrows shown in the figure indicate possible points of attachment for other monosaccharides within the structure. Note that bisecting HexNAc and core fucosylation are not considered as possibilities in this version of the algorithm to preserve computational simplicity. An example computation was performed on the N-linked glycan NA2, a Hex5HexNAc4 nonasaccharide. The sodiated precursor ion appears at m/z 1663. Table 4 lists the MS2 product ions used in the computation. A total of 57 possible structures are returned, all of which contain the N-linked core. These structures were generated in under 1 min. The correct configuration, shown in Figure 7, was ranked as a second choice structure. However, this structure was tied with a number of other possible configurations. In instances such as this where other structural data are unavailable, methylation analysis could be used to further eliminate the possibilities. The inability of the rating mechanism alone to distinguish between the initial possibilities was most likely due to the high ratio of total monosaccharides present (i.e., nine) to number of monosaccharide types (i.e., two) in this molecule.
At this time, although the STAT program still requires further improvements, it has clearly provided a launching point for further computer-automated saccharide structure analysis.
(18) Phillips, N. J.; McLaughlin, R.; Miller, T. J.; Apicella, M. A.; Gibson, B. W. Biochemistry 1996, 35, 5937-5947.
(19) Currently unpublished data, obtained from Drs. Bradford Gibson and Birgit Schilling, University of California at San Francisco Department of Pharmaceutical Chemistry.
Figure 6. N-linked core used in computations with STAT. Arrows indicate allowed points of attachment for other monosaccharides or branches.
Table 4. MS2 Product Ions Used in Topology Calculation for the N-Linked Glycan NA2 a
product ion, m/z    composition
874                 hex4hxn1
933                 hex3hxn2
1077                hex4hxn2
1136                hex3hxn3
1239                hex5hxn2
1280                hex4hxn3
1442                hex5hxn3
1501                hex4hxn4
a The precursor ion [Hex5Hxn4 + Na]+ appears at m/z 1663.
Figure 7. Structure of the N-linked glycan “NA2”.
CONCLUSIONS
STAT, the saccharide topology analysis tool, can generate a list of all possible structures for a glycan of up to 10 monosaccharide units. Only those structures which conform to restrictions on monosaccharide connectivity implied by MSn product ions are produced. Such a list is quickly obtained (the calculation takes ∼1 min for a nonasaccharide) and utilizes data that are usually routinely obtained for these molecules. Data generated by CID using an ion trap and MALDI-PSD data from a time-of-flight instrument have been used as examples to demonstrate the operation and versatility of STAT. Data from other methods of generating smaller oligomers contained within the intact structure, such as partial acid hydrolysis or glycosidase array digestion, may be used as well. For structures derived from tandem mass spectrometry data, a rating system based on number of bond cleavages in MSn is used to indicate which structures in the list are more likely to be correct. This ensures that useful information may be extracted from the calculation even when the list of possible structures is extensive. In addition, although the user must be assured of the quality of the data entered, STAT has sufficient flexibility to ensure that it does not become a “black box”. Combining the data generated by STAT with data on the branching patterns of the glycan serves to eliminate all but a handful of structures. Alternatively, MSn data can be used consecutively to analyze smaller units within the intact structure to narrow the possibilities. In addition, knowledge of the biological origin of the sample can serve to refine this list of possible structures. These remaining structures could then be used to guide further structural analysis, for example, to create an enzyme array or further MSn experiments that would distinguish between them.
Improvements for the future include analysis of permethylated oligosaccharides and allowing the user to enter information such as number and type of vertexes present or known “core” structures. Improvements to algorithm efficiency are also being explored.
ACKNOWLEDGMENT
The authors thank Professor Joel Sokol, School of Industrial and Systems Engineering, Georgia Institute of Technology, for mathematical analysis of the algorithms used in STAT. Drs. Bradford Gibson, Nancy Phillips, and Birgit Schilling from the University of California at San Francisco Department of Pharmaceutical Chemistry provided the LOS samples and MALDI-PSD data. Dr. Hubert Scoble from Genetics Institute generously donated the NA2 sample. Funding for this research was provided by NIH Grant GM-47356.
Received for review February 1, 2000. Accepted April 3, 2000.
AC000096F