Anal. Chem. 2003, 75, 5362-5373
Small Molecules as Mathematical Partitions
Daniel L. Sweeney*
Pfizer Global R&D, Skokie, Illinois 60077
Small molecules can be represented as modular structures: small numbers of unbreakable cells, of known elemental composition, joined together at cleavable seams. The cells are a mathematical partition of the molecular weight. A systematic process is described here for converting mass spectral data into these simple modular structures; a computer program was then developed and tested using this process. On the basis of this preliminary work, it appears that this partitioning approach may be practicable for many compounds. Examples illustrating some of the limitations encountered with this approach are also presented.

Many aspects of LC/MS (sample preparation, chromatography, quantitation) have been streamlined and automated so that large numbers of samples can be rapidly analyzed. Methods for rapidly identifying unknown compounds from their corresponding mass spectra have also evolved.

The first approach is library or database matching. The combined NIST and Wiley libraries have hundreds of thousands of spectra. Algorithms such as probability based matching (PBM) have been developed to optimize searching these libraries.1 Library matching is especially powerful for electron impact spectra, but recently, much progress has also been made toward obtaining and using libraries of CID data.2

A second approach is predictive software, such as ACD’s Spectrum Manager3 and HighChem’s Mass Frontier.4 This predictive software is structure-based, starting with a proposed molecular structure and then assigning fragment ions to the spectrum by applying fragmentation rules to the structure. Specialized programs, such as SEQUEST,5 are extremely important for identifying proteins and peptides. These programs essentially work from the product ion spectra to chemical structures. SEQUEST also utilizes the database matching approach previously mentioned.
* Corresponding author. E-mail: [email protected].
(1) McLafferty, F. W.; Zhang, M.-Y.; Stauffer, D. B.; Loh, S. Y. J. Am. Soc. Mass Spectrom. 1998, 9 (1), 92-95.
(2) Hough, J. M.; Haney, C. A.; Voyksner, R. D.; Bereman, R. D. Anal. Chem. 2000, 72 (10), 2265-2270.
(3) http://www.acd.com/.
(4) http://www.highchem.com/mf.htm.
(5) Eng, J. K.; McCormack, A. L.; Yates, J. R. J. Am. Soc. Mass Spectrom. 1994, 5 (11), 976-989.
5362 Analytical Chemistry, Vol. 75, No. 20, October 15, 2003
Figure 1. Comparison of the molecular structure and modular structure of xemilofiban.
Developing a basic understanding of the fragmentation of peptides and other small organic compounds is an active area of
research,6,7 and consequently, these fundamental studies will influence the development of better software.

Many compounds can be described in a modular format that will account for most of the fragments observed in the product ion spectra. Essentially, a molecule can be represented in the form of unbreakable cells of known elemental composition joined together at cleavable seams. One such compound is xemilofiban. The compound is shown in Figure 1 in two formats; the modular structure is shown below the corresponding molecular structure. The modular structure in Figure 1 is a convenient way of summarizing product ion mass spectral data. On the basis of this modular structure, fragment ions are viewed as different groups of connected cells. This modular structure of xemilofiban was derived after detailed analysis of its spectrum, the spectra of its analogues, and correlation of spectral features with structural features (i.e., by a “mental” process). This report describes preliminary efforts to derive modular structures directly from product ion spectra, using a computerized mathematical process.

Modular structures very closely resemble the molecular structures. This resemblance would be very helpful for identifying unknown compounds for which background information is very limited (e.g., forensics). In addition, a simple change in the structure of a molecule (e.g., enzymatic oxidation) will often shift the masses of many fragment ions in the product ion mass spectrum. Using modular structures, these shifts can often be attributed to a change in the mass of a single cell, and so one can easily pinpoint the region where the original molecule was altered. In addition, if one could get both high accuracy and sufficiently small cells, only one formula of all the possible formulas for the whole compound would fit the elemental data for all of the cells. This is similar to the theory behind the “basket-in-the-basket” approach.8 Finally, a large amount of mass spectral data is often obtained on a metabolite, degradation product, or impurity because the same sample is analyzed on several types of instruments to yield both accurate mass data and MSn data;9 accurate mass MS/MS data can be easily related to MSn or CID-MS/MS data through the use of these modular structures.

Recent technological developments have made it practicable to generate modular structures from product ion mass spectra. First, the recent proliferation of both quadrupole-time-of-flight and Fourier transform ion cyclotron resonance mass spectrometers has resulted in readily obtainable accurate mass fragmentation data. Second, high-speed desktop computers can now do the very intensive calculations needed.

Three assumptions about the product ion spectra of protonated molecules are being made here. The first assumption is that fragment ions, like protonated molecular ions, are even-electron ions. Indeed, fragment ions, when neutralized by the removal of a proton, are assumed to be molecules. The cells can also be visualized as hypothetical molecules, as shown in Figure 2. (Some neutralized fragment ions, such as C7H6, mass 90, the neutralized benzyl carbocation, would be difficult to visualize as neutral molecules.) This even-electron assumption is usually correct for positively charged product ion spectra obtained under normal collision energies. Exceptions are known (e.g., substituted anilines, such as clenbuterol10), but even-electron fragment ions predominate in the product ion spectra of most protonated compounds. There are two consequences of this assumption: the nitrogen rule applies to the cells, and the number of rings and double bonds increases by one each time two cells are cleaved.

The second assumption is that no rearrangements occur. Although rearrangements are well-known,11-13 rearrangements are observable only when the chemical structure is known by some other means.

The third assumption is that the simplest solution to a spectrum is the most plausible solution. Like the two previous assumptions, this assumption is not always true.14

Figure 2. The cells visualized as small molecules.

(6) O’Hair, R. A. J. In Mass Spectrometry in Drug Discovery; Rossi, D. T., Sinz, M. W., Eds; Marcel Dekker: New York, 2002; Chapter 4.
(7) Cheng, X.; Gao, L.; Buko, A.; Miesbauer, L. Proc. 46th ASMS Conf., Orlando, Florida, 1998, 85.
(8) Wu, Q. Anal. Chem. 1998, 70, 865-872.
(9) Clarke, N. J.; Rindgen, D.; Korfmacher, W. A.; Cox, K. A. Anal. Chem. 2001, 73, 430A-439A.
(10) Willoughby, R.; Sheehan, E.; Mitrovich, S. A Global View of LCMS; Global View Publishing: Pittsburgh, PA, 1998; pp 554.
(11) Warrack, B. M.; Hail, M. E.; Triolo, A.; Animati, F.; Seraglia, R.; Traldi, P. J. Am. Soc. Mass Spectrom. 1998, 9, 710-715.
(12) Brull, L. P.; Heerma, W.; Thomas-Oates, J.; Haverkamp, J.; Kovacik, V.; Kovac, P. J. Am. Soc. Mass Spectrom. 1997, 8, 43-49.
(13) Tiller, P. R.; Raab, C.; Hop, C. E. C. A. J. Mass Spectrom. 2001, 36, 344-345.
10.1021/ac034446k CCC: $25.00 © 2003 American Chemical Society. Published on Web 09/19/2003
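The two consequences of the even-electron assumption (the nitrogen rule for cells, and one extra ring-or-double-bond equivalent per cleavage) can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the ring-plus-double-bond formula is the standard one for C, H, N, and halogens (O and S do not contribute), and the cell compositions are the ones reported for compound B in Table 2.

```python
def rings_plus_double_bonds(c, h, n, halogens=0):
    """Rings plus double bonds for a neutral formula (O and S do not count)."""
    return (2 * c + 2 + n - h - halogens) / 2


def obeys_nitrogen_rule(nominal_mass, n):
    """Even-electron neutral: odd mass needs odd N, even mass needs even N."""
    return nominal_mass % 2 == n % 2


# Whole molecule of compound B: C28H34N2O13S, nominal mass 638.
molecule_rdb = rings_plus_double_bonds(c=28, h=34, n=2)

# Cells from Table 2 as (nominal mass, C, H, N).
cells = [(314, 16, 14, 2), (162, 6, 10, 0), (162, 6, 10, 0)]
cell_rdb = [rings_plus_double_bonds(c, h, n) for _, c, h, n in cells]

# Three cells are separated by two cleavages, so the cell total should
# exceed the molecular value by two.
print(molecule_rdb, cell_rdb, sum(cell_rdb) - molecule_rdb)
print([obeys_nitrogen_rule(mass, n) for mass, _, _, n in cells])
```

For these formulas the cell values sum to two more than the molecular value, which is what the text predicts for a three-cell structure with two seams.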
EXPERIMENTAL SECTION Chemicals. HPLC grade water was obtained from a Millipore Simplicity system. Methanol was obtained from Burdick and Jackson (Catalogue no. UN1230); acetonitrile, from EM Science (Catalogue no. AX0142); and ammonium acetate (Catalogue no. A639-500), from Fisher Scientific. Acetic acid (Ultrex grade, Catalogue no. 6903-05) was purchased from Baker. Leucine enkephalin (L-9133) and sodium acetate trihydrate (S-7670) were obtained from Sigma. Acetamide was purchased from Aldrich (A105-3), and ESI tuning mix (G2421A) was obtained from Agilent. The other compounds are not commercially available, although the structures were well-established by both synthesis and NMR analyses. Solutions. All compounds were dissolved at 1 µg/µL in mobile phase B (95% methanol, 5% water, 12.5 mM ammonium acetate, 12.5 mM acetic acid), except compound C, which was dissolved in 14% mobile phase B and 86% mobile phase A (5% methanol, 95% water, 12.5 mM ammonium acetate, 12.5 mM acetic acid) at 140 ng/µL. Compounds B and C were used without further dilution. All other compounds were diluted 1:40 with additional mobile phase B to a final concentration of 25 ng/µL. Lock Spray Solution. Agilent ESI tuning mix was diluted 1:3 with acetonitrile and loaded into a 50-mL ISCO µLC500 syringe pump. Accurate Mass. All accurate mass spectra were obtained with a Micromass Q-TOF-2 mass spectrometer equipped with orthogonal Z-SPRAY and LockSpray. The lockmass compound used was the 622.0295 compound of the Lock Spray Solution above.15 An Agilent 1100 binary pump (Catalogue no. G1312A) with an Agilent degasser (G1322A) was used to deliver mobile phase B to the mass spectrometer at a flowrate of 100 µL/min. Samples were injected into the mass spectrometer directly using an Agilent 1100 autosampler (Catalogue no. G1329A). A Zorbax SB-Phenyl (3.5 µm, 2.1 × 100 mm) column (Agilent) was placed between the pump and the injector to provide some backpressure to the pump. 
The lockmass solution was infused at a flowrate of 10 µL/min using the ISCO µLC500 syringe pump. The instrument was scanned from 50 to 1020 Da in 2 s, with a 123-µs pusher time. W-mode was used in the +ESI mode with a resolution of ∼17 500 (peak width at half-height). Details about the instrument settings and calibration, chosen to maximize accuracy, are found in the Supporting Information. To obtain accurate mass MS/MS spectra, a collision energy was chosen for which the relative intensity of the parent ion was roughly 10%. Generally, this energy gave a good distribution of high- and low-mass fragment ions. Experimentally, three different collision energies were generally tried simultaneously for each compound. Details of conditions used to obtain the spectra are found in the Supporting Information. Only ions greater than 2% relative intensity were used in the calculations. The relative intensities of these ions were rounded to the nearest integer.

Low-Resolution CID-MS/MS. Low-resolution CID-MS/MS spectra were obtained on the same sample solutions using a Micromass Quattro II mass spectrometer equipped with orthogonal Z-SPRAY. Most sample solutions were injected with a cone voltage of 70 V (for CID) and a collision energy of 20 V. The labile compound B was fragmented at a cone voltage of 35 V and a collision energy of 12 V (Supporting Information). Argon at a pressure of 3 × 10⁻³ mbar was used as the collision gas. For the CID-MS/MS work, only two or three of the largest fragment ions that were detected in the accurate mass spectra obtained above were fragmented. Only product ions with the same integral mass as ions observed in the accurate mass spectrum and >10% relative intensity were considered present.

Computations. Accurate masses and predicted isotope ratios were obtained using MassLynx version 3.5 (Waters). All other programs were run on a Dell Dimension 340 with an Intel Pentium 4 clocked at 1.7 GHz. The operating system was Red Hat Linux release 7.1 (Seawolf), operating system version no. 1 Sun Apr 8 20:41:30 EDT 2001, release 2.4.2-2. The C compiler was gcc version 2.96. All programs were written in C.

Computer Parameters Used: minimum_number_of_ions = number of cells + 1; MaxDefect = 25 (equivalent to 2.5 milli-Daltons); least_squares_fit = 5; MinimumCoverage: initially set at 65%, but varied if necessary to increase or decrease the number of solutions.

Figure 3. Brief summary of the overall process showing the seven steps.

Table 1. Accurate Mass Product Ion Spectrum of Protonated Compound B Obtained at a Collision Energy of 12 V, Positive ESI Mode

ion   rel. inten.   mass found   mass neutralized
1     12            639.1859     638.1781
2     18            477.1330     476.1252
3     16            325.1135     324.1057
4     100           315.0807     314.0729
5     5             163.0614     162.0536

(14) Hoffman, R.; Minkin, V. I.; Carpenter, B. K. Int. J. Philos. Chem. 1997, 3, 3-28 (www.hyle.org/journal/issues/3/hoffman.htm).
(15) Flanagan. U.S. Patent 5872357; Feb 16, 1999.
DESCRIPTION OF THE PROCESS
The overall process is summarized briefly in Figure 3. Compound B, molecular weight 638 and having only five ions in its product ion spectrum, will be used to illustrate each individual step.

(1) Finding Partitions: finding the integral masses of the cells by systematically generating every partition of the integral molecular weight having a given number of cells, summing up every combination of those partitions, and comparing those sums to the integral masses of the neutralized fragment ions, looking for partitions that account for as many fragment ions as possible.

The integral molecular weight is partitioned. A partition of a positive integer is any collection of positive integers, repeats allowed, adding up to that number.16 The product ion spectrum of protonated compound B is tabulated in Table 1. First, all of the product ions (the protonated molecular ion is included) are neutralized prior to partitioning. Since this is a positive ion spectrum and the electron mass was ignored in the calibration, 1.0078 (the mass of a hydrogen atom) is subtracted from all of the masses.

The molecular weight of compound B is 638. The number 638 has 319 two-cell partitions (i.e., 319 different sets of two positive integers that add up to 638), 33 920 three-cell partitions, 1 811 911 four-cell partitions, and over 58 million five-cell partitions. A two-cell partition can account for a maximum of three ions in a spectrum, whereas a three-cell partition can account for as many as six ions (discussed in more detail later). Since compound B has five ions in its spectrum, the three-cell partition, being the simplest possible solution, is considered first. For each of the 33 920 three-cell partitions, there are seven combinations of three cells taken one, two, or three at a time. Each of the seven combinations is summed, and the resultant sum is compared to the masses of the neutralized fragment ions to look for matches.
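The neutralization step and the partition counts quoted above can be reproduced with a short sketch. The code below is not the author's C program; the names `neutralize` and `count_partitions` are hypothetical, and the counting uses the textbook recurrence p(m, j) = p(m - 1, j - 1) + p(m - j, j) for partitions of m into exactly j positive parts.

```python
PROTON_SHIFT = 1.0078  # hydrogen-atom mass subtracted in the text

# Table 1: measured masses of the protonated ions of compound B.
MEASURED = [639.1859, 477.1330, 325.1135, 315.0807, 163.0614]


def neutralize(mz):
    """Return the neutralized accurate mass, rounded to 4 decimals."""
    return round(mz - PROTON_SHIFT, 4)


def count_partitions(n, k):
    """Count partitions of n into exactly k positive integers."""
    # table[j][m] = number of partitions of m into exactly j parts
    table = [[0] * (n + 1) for _ in range(k + 1)]
    table[0][0] = 1
    for j in range(1, k + 1):
        for m in range(j, n + 1):
            # Either one part equals 1 (remove it), or every part is at
            # least 2 (subtract 1 from each of the j parts).
            table[j][m] = table[j - 1][m - 1] + table[j][m - j]
    return table[k][n]


neutral = [neutralize(mz) for mz in MEASURED]
integral = [int(mass) for mass in neutral]
print(neutral)
print(integral)
for k in (2, 3, 4):
    print(k, count_partitions(638, k))
```

The dynamic-programming table makes the four-cell count cheap to obtain; enumerating the partitions themselves, as the process requires, grows much faster with the number of cells.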
The number of neutralized fragment ions that match a sum is compared to the minimum_number_of_ions criterion; in this case, a value of four ions was chosen. Every partition whose sums can account for at least four ions is saved as a possible solution. The other partitions are all eliminated at this stage.

The problem of finding the better partitions can be simplified to some extent. As mentioned previously, the cells can be viewed as hypothetical molecules. For molecules composed of the common elements such as C, H, N, O, S, Cl, and F, there are no simple molecules of masses from 1 to 16 (loss of methane would be unexpected), except H2. In addition, there are no molecules between 21 and 25 made up of these common elements. Limiting the elements in this way reduces the number of three-cell partitions of the number 638 from 33 920 to 27 562.

(16) Biggs, N. Discrete Mathematics; Clarendon Press: Oxford, 1989.
In this simple example, there are only five neutralized fragment ions in the spectrum. If the minimum_number_of_ions criterion is set to four ions, it is found that only 2 of the 27 562 partitions can account for four or more ions in the spectrum. Both of these partitions can actually account for all five neutralized fragment ions, using the capital letters A, B, and C to represent individual cells.
A + B + C = 162 + 152 + 324
A + B + C = 162 + 162 + 314

The first partition can account for all five ions observed in the spectrum, and it also has two “silent” ions at 153 (B) and 487 (A + C) Da. This partition is summarized below:

combination        sum   + H+   cell assignment
152                152   153    B
162                162   163    A
324                324   325    C
152 + 162          314   315    A+B
152 + 324          476   477    B+C
162 + 324          486   487    A+C
152 + 162 + 324    638   639    A+B+C
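Step 1 for compound B can be sketched as a brute-force search. The code below is an illustration, not the author's program, and all names are hypothetical. One assumption deserves emphasis: the cell-mass filter used here (mass 17 to 20, or 26 and above) is inferred from the arithmetic, because it is the filter that reproduces the quoted count of 27 562; it also excludes mass 2, which the text mentions as an exception.

```python
from itertools import combinations

# Integral masses of the neutralized ions of compound B (Table 1).
NEUTRAL_IONS = {162, 314, 324, 476, 638}


def allowed_cell_mass(mass):
    """Assumed plausibility filter for a cell mass (see lead-in)."""
    return 17 <= mass <= 20 or mass >= 26


def three_cell_partitions(total):
    """Yield every (a, b, c) with a <= b <= c and a + b + c == total."""
    for a in range(1, total // 3 + 1):
        for b in range(a, (total - a) // 2 + 1):
            yield a, b, total - a - b


def matched_ions(cells, ions):
    """Return the ions equal to the sum of some non-empty group of cells."""
    sums = {
        sum(combo)
        for size in range(1, len(cells) + 1)
        for combo in combinations(cells, size)
    }
    return sums & ions


all_parts = list(three_cell_partitions(638))
plausible = [p for p in all_parts if all(allowed_cell_mass(m) for m in p)]
hits = [p for p in plausible if len(matched_ions(p, NEUTRAL_IONS)) >= 4]

print(len(all_parts), len(plausible))
print(hits)
```

With minimum_number_of_ions set to four, only the two partitions discussed in the text survive, and each of them matches all five neutralized ions.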
(2) Generating Systems of Equations: assigning a system or systems of simultaneous linear equations relating the masses of the cells to the observed neutralized fragment ions.

After a partition accounting for the minimum_number_of_ions is found, the fragments that match a sum of cells can then be written as a system of linear simultaneous equations. (Sums that do not correspond to an observed neutralized fragment ion, 152 and 486 in this example, and neutralized ions that are not assigned to cells are ignored.) In this case, where A = 162, B = 152, and C = 324, the linear simultaneous equations are

1*A + 0*B + 0*C = 162
1*A + 1*B + 0*C = 314
0*A + 0*B + 1*C = 324
0*A + 1*B + 1*C = 476
1*A + 1*B + 1*C = 638

In some cases, a particular fragment can be assigned in more than one way. This situation arises when two or more cells are equal or when a cell is the sum of two or more other cells. In these cases, all of the possible ways of accounting for the fragments are tested in the next steps until either all of the sets of multiple assignments are tested or until a solution is found with one set of assignments. As an example of multiple assignments for the same fragment ions, note that the first partition (A + B + C = 162 + 152 + 324) had a unique assignment for each neutralized fragment ion, whereas the second partition (A + B + C = 162 + 162 + 314) can have two assignments for two of the neutralized fragment ions, 162 (A or B) and 476 (A + C or B + C). The reason for this is that two of the cells are mass 162.
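The way a partition turns into one or more systems of 0/1 equations can be sketched as follows. This is an illustrative reading of step 2 with hypothetical function names; each system is represented as a list of (coefficient row, ion mass) pairs, with the coefficients ordered A, B, C.

```python
from itertools import combinations, product

NEUTRAL_IONS = [162, 314, 324, 476, 638]


def assignments_for(cells, ion):
    """Return every group of cell names whose masses sum to the ion."""
    names = sorted(cells)
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
        if sum(cells[name] for name in combo) == ion
    ]


def systems(cells, ions):
    """Yield one list of (coefficients, ion) rows per set of assignments."""
    names = sorted(cells)
    # Ions that match no group of cells are ignored, as in the text.
    assigned = [(ion, assignments_for(cells, ion)) for ion in ions]
    assigned = [(ion, options) for ion, options in assigned if options]
    for choice in product(*(options for _, options in assigned)):
        yield [
            (tuple(1 if name in combo else 0 for name in names), ion)
            for (ion, _), combo in zip(assigned, choice)
        ]


first = list(systems({"A": 162, "B": 152, "C": 324}, NEUTRAL_IONS))
second = list(systems({"A": 162, "B": 162, "C": 314}, NEUTRAL_IONS))

print(len(first), first[0])
print(len(second))
```

The first partition yields a single system, while the partition with two cells of mass 162 yields one system for each combination of the ambiguous assignments.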
As integers, these two cells of mass 162 are equal; thus, it would appear that this would not make any difference. However, it may be assumed that the two cells may have different mass defects. A second consideration is that the two cells may be exactly equal in mass, but not interchangeable spatially. For example, there could be two ammonia moieties in a molecule, and the assignments must be consistent with respect to the spatial configuration of the molecule. Since there are two neutralized fragment ions in the second partition that have duplicate assignments, four systems of linear equations derived from this partition must be considered in subsequent steps.

(3) Removing “Linked” Systems of Equations: removing any system of equations if two or more cells are always assigned together (“linked”).

If two cells are always found together, those cells are considered “linked” because that data can always be described with fewer cells. An example of linked cells is the four-cell partition for compound B shown below:

combination             sum   + H+   assignment
80 + 82                 162   163    A+D
324                     324   325    C
80 + 152 + 82           314   315    A+B+D
152 + 324               476   477    B+C
80 + 152 + 324 + 82     638   639    A+B+C+D
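The linked-cell test of step 3 amounts to asking whether two cells have identical presence/absence patterns across all assigned fragments. A minimal sketch, with a hypothetical function name and fragments written as strings of cell letters:

```python
def linked_pairs(fragments, cells):
    """Return pairs of cells that are present or absent together everywhere."""
    pairs = []
    for i, first in enumerate(cells):
        for second in cells[i + 1:]:
            if all((first in frag) == (second in frag) for frag in fragments):
                pairs.append((first, second))
    return pairs


# Four-cell partition 80 + 152 + 324 + 82 (cells A, B, C, D) from the table.
four_cell = ["AD", "C", "ABD", "BC", "ABCD"]
# Three-cell partition 162 + 152 + 324 (cells A, B, C) from step 2.
three_cell = ["A", "AB", "C", "BC", "ABC"]

print(linked_pairs(four_cell, "ABCD"))
print(linked_pairs(three_cell, "ABC"))
```

A non-empty result means the system can be described with fewer cells and would be discarded.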
Although this four-cell partition above works just as well as the two three-cell partitions and also accounts for all five ions, the 80 and 82 cells (A and D) are always assigned together, and these two cells should be replaced by a single cell of mass 162. These less simple partitions are therefore eliminated.

(4) Solving for Mass Defects: generating a system of simultaneous linear equations using the coefficients derived in step 2 above to relate the integerized mass defects of the cells to the integerized mass defects of the corresponding neutralized fragment ions and solving those simultaneous linear equations for the integerized mass defects of the cells.

As mentioned previously, the fragment assignments can be viewed as a system of linear simultaneous equations. Each equation is an assignment of a neutralized ion; the coefficients are all either 1 or 0. Basically, each cell is either present (1) or absent (0) in a neutralized fragment ion. The same set of coefficients must apply to the simultaneous equations relating the mass defects, because each cell represents an elemental composition contributing both an integral mass and a mass defect. For mathematical convenience, the mass defects of the neutralized ions are integerized by multiplying by 1000, and the mass defects are rounded to the nearest milli-Dalton. The mass defects of the cells are subsequently calculated to the nearest milli-Dalton.

In the simple three-cell example, for the partition A + B + C = 162 + 152 + 324, the corresponding system of simultaneous equations relating the integral cell masses to the integral masses of the neutralized ions was shown in step 2 above. Analogous equations (same coefficients) can be written in terms of the integerized mass defects, where the small letters a, b, and c are the unknown integerized mass defects of each cell, and the sum is the integerized mass defect of the corresponding neutralized fragment ion (in milli-Daltons).

1*a + 0*b + 0*c = 54 (e.g., the mass defect of the 162 ion was 0.0536)
1*a + 1*b + 0*c = 73
0*a + 0*b + 1*c = 106
0*a + 1*b + 1*c = 125
1*a + 1*b + 1*c = 178
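The overdetermined system above can be fitted in several ways. The sketch below deliberately swaps in a different technique from the paper's multistage Monte Carlo optimization: it solves the ordinary least-squares normal equations and then tests the integer neighbours of the rounded solution. Function names are hypothetical, and the approach assumes that linked systems have already been removed, so the normal equations are not singular.

```python
from itertools import product

# Coefficient rows (a, b, c) and integerized defects in milli-Daltons.
ROWS = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1)]
DEFECTS = [54, 73, 106, 125, 178]


def solve_normal_equations(rows, rhs):
    """Real-valued least-squares solution via Gauss-Jordan elimination."""
    n = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    m = [
        [float(sum(r[i] * r[j] for r in rows)) for j in range(n)]
        + [float(sum(r[i] * y for r, y in zip(rows, rhs)))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def squared_error(rows, rhs, cells):
    """Sum of squared differences between calculated and measured defects."""
    return sum(
        (sum(c * x for c, x in zip(row, cells)) - y) ** 2
        for row, y in zip(rows, rhs)
    )


def fit_cell_defects(rows, rhs):
    """Return (error, integer defects), searching +/-1 around the real fit."""
    base = [round(v) for v in solve_normal_equations(rows, rhs)]
    candidates = (
        tuple(b + d for b, d in zip(base, delta))
        for delta in product((-1, 0, 1), repeat=len(base))
    )
    return min((squared_error(rows, rhs, c), c) for c in candidates)


best = fit_cell_defects(ROWS, DEFECTS)
print(best)
```

The residual sum of squares plays the role of the least_squares_fit criterion: a large value would signal a contradiction among the equations.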
Three equations are required to solve for three variables. [The minimum_number_of_ions criterion must therefore be greater than or equal to the number of cells.] However, a multistage Monte Carlo optimization can be used to find the mass defects of the cells (a, b, and c) using all five equations and, thus, use all of the data points.17 In this case, a two-stage Monte Carlo algorithm was written to minimize the sum of the squares of the differences between the calculated defects and the actual defects (least_squares_fit). In many cases, the extra equations will reveal a contradiction (revealed in the form of a high least_squares_fit) that will rule out a partition or a set of assignments. In addition, the calculated mass defects of the cells should be more accurate than the fragment ion mass defects, since the cells are “weighed” in groups rather than one at a time18 and because there are usually more equations than variables. (In the C programs, the mass defects of the cells are calculated to the nearest milli-Dalton; the mass defects could be calculated to the nearest tenth of a milli-Dalton at the cost of additional calculation time.) In this case, there is a solution (a = 54, b = 19, and c = 106). Thus, the exact masses of the cells are 162.054, 152.019, and 324.106.

In the case of the other partition, A + B + C = 162 + 162 + 314, where there were four possible ways of assigning the fragments as a result of two identical cells, all four of the solutions are essentially identical (a = 53, b = 53, c = 72), and thus, cells A and B apparently are identical. The exact masses of the cells, calculated by the program, are 162.053, 162.053, and 314.072.

(5) Checking MS3 or CID-MS/MS Data: removing any system of equations having third (or higher)-generation product ion spectra inconsistent with the assigned equations. Logically, the cells of each product ion must be a subset of the cells of its parent ions.
Three product ions of compound B were generated by source CID fragmentation and then further fragmented with MS/MS. The 477 fragment ion gave a 315 product ion; the 325 fragment ion gave a 163 product ion, but the 315 fragment ion did not give the 163 product ion (the only ion smaller than 315 in the MS/MS spectrum). The first partition was A + B + C = 162 + 152 + 324. The 476 is assigned as 0*A + 1*B + 1*C; the 314 is assigned as 1*A + 1*B + 0*C. Thus B + C fragments into A + B; this is contradictory. This first partition must be eliminated from further consideration on the basis of this contradiction.

(17) Conley, W. Computer Optimization Techniques; Petrocelli Books: Princeton, NJ, 1984; pp 250.
(18) Sloane, N. J. A. In Fourier, Hadamard, and Hilbert Transforms in Chemistry; Marshall, A. G., Ed.; Plenum Press: New York, NY, 1982; pp 562.
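The subset rule of step 5 is easy to express with Python sets. In this sketch (hypothetical names; ions are keyed by their neutralized integral masses) each assignment maps an ion to the set of cells it contains, and each observed parent-to-product transition must satisfy "product cells are a subset of parent cells". For the second partition only one of its four equivalent assignment sets is shown.

```python
# Observed transitions for compound B as (parent, product), neutralized:
# 477 -> 315 and 325 -> 163.
TRANSITIONS = [(476, 314), (324, 162)]


def consistent(assignment, transitions):
    """True if every product ion's cells are a subset of its parent's cells."""
    return all(
        assignment[daughter] <= assignment[parent]
        for parent, daughter in transitions
    )


# Partition 162 + 152 + 324 (A, B, C).
first = {
    162: frozenset("A"),
    314: frozenset("AB"),
    324: frozenset("C"),
    476: frozenset("BC"),
    638: frozenset("ABC"),
}

# Partition 162 + 162 + 314 (A, B, C), one of its assignment sets.
second = {
    162: frozenset("A"),
    314: frozenset("C"),
    324: frozenset("AB"),
    476: frozenset("AC"),
    638: frozenset("ABC"),
}

print(consistent(first, TRANSITIONS), consistent(second, TRANSITIONS))
```

The first partition fails because A + B cannot arise from B + C, matching the contradiction described in the text.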
The second partition was A + B + C = 162 + 162 + 314, where cells A and B were found to be equal, since their mass defects were essentially the same. The 476 is assigned as 0*A + 1*B + 1*C or 1*A + 0*B + 1*C; the 314 is assigned as 0*A + 0*B + 1*C. B + C or A + C fragments into C, which is valid. The 324 is assigned as 1*A + 1*B + 0*C; the 162 is assigned as 0*A + 1*B + 0*C or 1*A + 0*B + 0*C. Thus, A + B breaks up into A or B, which is also consistent with the data.

In this simple example, all 5 ions in the spectrum were assigned. However, because of multiple fragmentation pathways or the presence of odd electron ions, it is seldom that all of the fragment ions in a spectrum are assigned. If there were 11 total ions and 6 ions were assigned, a solution that accounts for the molecular ion and the 5 largest fragment ions is probably better than a solution that accounts for the molecular ion and the 5 smallest fragment ions. The concept of coverage was developed to give more weight to the more intense fragments. The square root of the relative intensity of each fragment ion, rounded down to the nearest integer, is calculated and saved. The protonated molecular ion is given a value of 0. The sum of these square roots over all of the fragment ions is the total coverage. The coverage value of each fragment ion is its percent of the total coverage. For example, the relative intensity of the 325 ion in compound B is 16 (Table 1), so its square root is 4; the total for the four fragment ions is 4 + 4 + 10 + 2 = 20, and the coverage of the 325 ion is therefore 20%. In general, the better solutions have the highest values of coverage. Other factors to compare are the least_squares_fit, the difference between the calculated and experimentally determined neutralized fragment ion masses, and the number of ions that were assigned.

(6) Finding Configuration: finding configurations that are consistent with the fragment assignments by checking every permutation of the coefficients of the equations against truth tables for each configuration.
On the basis of the previous assumption of no rearrangements, no two cells can be attached to each other in a fragment ion if the cell or cells connecting the pair are not present. As a result, it is possible to make a connection table or “truth table” for every possible combination of cells that make up a modular structure. The fragment combination will either be “true” (1) if the fragment is made up of connected cells, or “false” (0) if the fragment is made up of cells that are missing the connecting cells. For example, every combination of cells is true for the three-cell modular structure except for the combination of the two end cells. The modular structures and connection tables for the three- and four-cell configurations are shown in Figure 4.

As shown in Figure 4, the cells of a three-cell modular structure can only be arranged in one way. Two configurations are needed to describe a four-cell modular structure. As the number of cells increases, the arrangements become more complex. Three configurations are required to describe all of the five-cell modular structures, as shown in Figure 5. (The connection table for five-cell modular structures is in Supporting Information.) The designations 1, 2, 3, 4, ... are positions in the modular structures where the cells (designated A, B, C, D, ...) are sequentially tested in a systematic way. For example, there are 3! = 6 ways to arrange three cells and 5! = 120 ways to arrange the cells in a five-cell modular structure.
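A connection table of this kind can be generated mechanically if a configuration is modelled as a small graph whose nodes are cell positions and whose edges are seams. That modelling choice, the edge lists below, and the function names are assumptions made for illustration; Figures 4 and 5 are not reproduced here. A combination of positions is "true" when the chosen positions form a connected piece of the graph.

```python
from itertools import combinations


def is_connected(positions, edges):
    """True if the chosen positions form one connected piece."""
    chosen = set(positions)
    start = next(iter(chosen))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in edges:
            for here, there in ((a, b), (b, a)):
                if here == node and there in chosen and there not in seen:
                    seen.add(there)
                    stack.append(there)
    return seen == chosen


def truth_table(n_positions, edges):
    """Map each 0/1 pattern (position 1 first) to True or False."""
    positions = range(1, n_positions + 1)
    table = {}
    for size in range(1, n_positions + 1):
        for combo in combinations(positions, size):
            key = "".join("1" if p in combo else "0" for p in positions)
            table[key] = is_connected(combo, edges)
    return table


chain3 = truth_table(3, [(1, 2), (2, 3)])          # linear, three cells
chain4 = truth_table(4, [(1, 2), (2, 3), (3, 4)])  # linear, four cells
star4 = truth_table(4, [(1, 2), (2, 3), (2, 4)])   # branched, four cells

print([key for key, ok in chain3.items() if not ok])
print(sum(chain3.values()), sum(chain4.values()), sum(star4.values()))
```

For the three-cell chain the only false pattern is 101 (the two end cells without the middle one), which leaves six allowed fragments, in line with the statement that a three-cell partition can account for at most six ions.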
For compound B, the only partition remaining is A + B + C = 162 + 162 + 314. There were four possible sets of assignments, but the multistage Monte Carlo solutions to the linear equations were essentially the same for all four. One system of the equations is arbitrarily chosen to check possible permutations.

1*A + 0*B + 0*C = 162
0*A + 0*B + 1*C = 314
1*A + 1*B + 0*C = 324
1*A + 0*B + 1*C = 476
1*A + 1*B + 1*C = 638

The coefficients of the equations can be written as a 1-0 matrix. The six permutations are then essentially obtained by shuffling the columns. Each color represents a particular cell (A = red; B = magenta; C = green) and the numbers 1, 2, and 3 represent the cell positions on the configurations. This is shown in Figure 6 for compound B.

Figure 4. Three- and four-cell configurations and connection tables.
Figure 5. Three configurations, designated W, X, and Y, are needed to describe five-cell partitions.
Figure 6. Permutations of the assigned fragments of compound B.
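For the three-cell case the permutation test can be sketched directly: place the cells in each of the six orders along a linear chain and keep the orders in which every assigned fragment occupies adjacent positions. The names are hypothetical, and the contiguity shortcut is valid only for a linear configuration.

```python
from itertools import permutations

# Assigned fragments for the partition 162 + 162 + 314, as cell letters:
# 162 = A, 314 = C, 324 = A+B, 476 = A+C, 638 = A+B+C.
FRAGMENTS = ["A", "C", "AB", "AC", "ABC"]


def contiguous(indices):
    """True if the positions form an unbroken run along the chain."""
    return max(indices) - min(indices) + 1 == len(indices)


def valid_orders(cells, fragments):
    """Return the chain orders in which every fragment is connected."""
    good = []
    for order in permutations(cells):
        index = {cell: i for i, cell in enumerate(order)}
        if all(contiguous([index[c] for c in frag]) for frag in fragments):
            good.append("".join(order))
    return good


orders = valid_orders("ABC", FRAGMENTS)
print(orders)
```

Only the two orders with cell A in the middle survive, and they are mirror images of each other.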
ABC and its rotational duplicate CBA can be ruled out, because both have assigned a 101 fragment (here, the 476 fragment ion). This fragment, as shown in Figure 4, is not compatible with the three-cell configuration. ACB, like its rotational duplicate BCA, also has a fragment assigned as 101 (here, the 324 fragment ion). The only permutation without a 101 fragment assigned is CAB and its duplicate BAC. Thus, this modular structure has cell A in the middle. (A and B are both the same mass in this case, but are not the same cell.)

(7) Distributing the Elements: for a given molecular formula, assigning elemental compositions to the cells in such a way that the maximum difference between the calculated mass of each cell and the theoretical mass of the elemental composition of each cell is less than the MaxDefect.

Now elemental compositions are assigned to each cell. For a given molecular formula, assigning elements to one cell will limit the choice of elements available to the other cells. The current software will test only one possible formula for the whole compound at a time. The maximum difference between the calculated mass of the cell and the theoretical mass of the elemental composition of the cell is the MaxDefect parameter, which is in units of tenths of milli-Daltons. The nitrogen rule19 is also applied to the cells. If the mass of the cell is odd, the number of nitrogens is forced to be odd. Conversely, if the mass of the cell is even, the number of nitrogens in the cell is forced to be even. The application of the nitrogen rule is based on the assumption, previously stated, that the fragment ions are even-electron species.

RESULTS AND DISCUSSION
Compound B. The results for compound B from the C program for a three-cell partition, which was used as the detailed example, are shown in Table 2. Only one solution was found using the parameters in the table. This printout will be used as an example of the output of the C programs.
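Step 7 can be illustrated with a brute-force sketch for compound B. Everything here is an assumption made for illustration rather than the author's implementation: the function names, the monoisotopic masses (standard values rounded to six decimals), the restriction to C, H, N, O, and S (the tested formula has no Cl or F), and the exhaustive enumeration. The tolerance of 2.5 mDa corresponds to MaxDefect = 25.

```python
from itertools import product

ELEMENTS = ("C", "H", "N", "O", "S")
MONO = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915, "S": 31.972071}
NOMINAL = {"C": 12, "H": 1, "N": 14, "O": 16, "S": 32}

# Formula tested for compound B and the cells as (integral mass, defect in mDa).
FORMULA = {"C": 28, "H": 34, "N": 2, "O": 13, "S": 1}
CELLS = [(162, 53), (162, 53), (314, 72)]


def candidates(formula, nominal, defect_mda, max_err_mda=2.5):
    """Compositions (counts in ELEMENTS order) that fit one cell."""
    target = nominal + defect_mda / 1000.0
    found = []
    for counts in product(*(range(formula[e] + 1) for e in ELEMENTS)):
        if sum(n * NOMINAL[e] for e, n in zip(ELEMENTS, counts)) != nominal:
            continue
        # Nitrogen rule: parity of N must match parity of the cell mass.
        if counts[ELEMENTS.index("N")] % 2 != nominal % 2:
            continue
        exact = sum(n * MONO[e] for e, n in zip(ELEMENTS, counts))
        if abs(exact - target) * 1000.0 <= max_err_mda:
            found.append(counts)
    return found


def distribute(formula, cells):
    """Combinations of cell compositions that add up to the whole formula."""
    total = tuple(formula[e] for e in ELEMENTS)
    per_cell = [candidates(formula, mass, defect) for mass, defect in cells]
    return [
        combo
        for combo in product(*per_cell)
        if tuple(map(sum, zip(*combo))) == total
    ]


solutions = distribute(FORMULA, CELLS)
for combo in solutions:
    print(combo)
```

Because the elements given to one cell are no longer available to the others, the final sum check removes most combinations of individually acceptable compositions.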
The top section of Table 2 lists the inputs: the parameters for MinimumCoverage, MaxDefect, minimum_number_of_ions, and the least_squares_fit criterion; the accurate mass data (after subtraction of the mass of a hydrogen); and the elemental formula being tested (the correct formula was used except where noted). The middle section shows the solution or solutions suggested by the C program. The first part of each solution is information about the cells: the elemental compositions, the integral masses, the calculated defects from the Monte Carlo optimization (defect), and the calculated defects based on the elemental composition (calcdefect). All mass defects are in tenths of milli-Daltons. [The cells, in the modular structures, are designated by the following colors: dark blue, cell E; light blue, cell D; green, cell C; magenta, cell B; and red, cell A.]

(19) McLafferty, F. W. Interpretation of Mass Spectra, 3rd ed.; University Science Books: Mill Valley, CA, 1980; p 303.

Analytical Chemistry, Vol. 75, No. 20, October 15, 2003

Table 2. Results for Compound B

elemental composition used:    C28H34N2O13S1Cl0F0
data file:                     CompoundB.dat
minimum hits required:         4
least square error per hit:    5
max mass error accepted:       25
min coverage, %:               65
compd’s mol wt:                638
tot no. ions:                  5

accurate mass    integral mass    defect
162.0536         162              54
314.0729         314              73
324.1057         324              106
476.1252         476              125
638.1781         638              178

Solution no. 1; Linear Fit 2; Coverage 100

          C     H     N     O     S     Cl    F     mass    defect    calcdefect
cell C    16    14    2     3     1     0     0     314     720       722
cell B    6     10    0     5     0     0     0     162     530       525
cell A    6     10    0     5     0     0     0     162     530       525

fragment    composition    mass    LSdefect    CalcDefect    MeasDefect
1           A              162     530         525           540
2           C              314     720         722           730
3           AB             324     1060        1050          1060
4           AC             476     1250        1247          1250
5           ABC            638     1780        1772          1780

best permutation: CAB
best permutation: BAC

tot. partitions of 3 cells:                                      27 562
partitions accounting for less than 4 fragments:                 27 560
no. partitions with required no. of fragments:                   2
no. of linked cells rejected:                                    0
no. of partitions failing least squares criterion:               0
no. of partitions not matching any configuration:                0
no. of partitions with contradictory CID-MS/MS data:             1
no. of partitions failing MinimumCoverage test:                  0
no. of partitions not fitting the elemental data:                0
no. of partitions duplicated by multiple assignments:            0
no. of partitions having multiple elemental compositions:        0

The fragment assignments are summarized below the cell data. The cellular composition is followed by the integral mass. The last three columns are again mass defects: the LSdefects (the sum of the defects of the cells assigned to the fragment that were calculated from the Monte Carlo optimization), the calculated defects (CalcDefect) from the elemental compositions, and the actual experimentally measured mass defects (MeasDefect) of the neutralized fragment ions. The units are tenths of milli-Daltons. Next the “best permutations” are listed, CAB and its rotational equivalent BAC. This compound is unusual because there is only one solution possible. Additional examples will make it evident that more commonly, there are multiple solutions and multiple configurations for each solution. The last section summarizes the statistics on the computations, for example, the total number of partitions of 638 into three cells (27 562) and how many partitions were rejected because of a conflict with the CID-MS/MS data (1). The modular structure (permutation BAC) listed in Table 2 is diagrammed in Figure 7, together with the partial molecular structure of the compound. Because two of the cells are equal, this is a case in which one might expect four solutions, since it was previously shown that the fragments of this particular partition could be assigned in four
different ways. However, the program, in the case of multiple fragment assignments generating multiple systems of equations, tests each system of equations sequentially until either a solution is found or all systems of equations have been tested.

By dividing a molecule into smaller cells, there is a much greater possibility that each cell will have a unique elemental composition, and thus, only one overall composition will work. This same principle is behind Wu’s “basket-in-a-basket” approach. It was tested here on compound B by generating all of the possible formulas having C(24-32), H(0-100), N(0-10), O(0-20), and S(0-2) for compound B, within 3 mDa of the experimental value, 639.1859. (The carbon range was set at 24 to 32 to be at least close to matching the intensity of the first isotope peak.) Seven possible formulas were found meeting these criteria. All seven formulas were entered, one at a time, into the three-cell program. All but one gave one solution; that one gave two solutions. Eight total solutions were found. Changing the MaxDefect from 25 to 10 narrowed the mass accuracy windows on the cell masses to about 1 mDa; now only two of the possible formulas gave one answer each. The results are very consistent with Wu’s hypothesis that accurately determining the masses of smaller pieces of a compound should limit the elemental composition of the whole compound.8 However, a 1-mDa window on the cell masses, although it did work well for this compound, is probably too narrow at this point; the program is presently doing a two-stage Monte Carlo optimization, and the mass defects of the cells are being calculated only to the nearest milli-Dalton. Except for this experiment, a 2.5-mDa window was used to generate all of the results reported in this study. The Q-TOF-2 instrument would appear to be capable of supporting a 1-mDa window for the cell mass defects if the Monte Carlo optimization were revised.

Figure 7. Modular assignment for compound B, which is similar to the molecular structure.
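The formula search described above for compound B (C 24-32, H 0-100, N 0-10, O 0-20, S 0-2, within 3 mDa of 639.1859) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name is hypothetical, standard monoisotopic masses are used, no electron-mass correction is applied to the protonated species, and no valence or nitrogen-rule filter is included, so the list it returns need not coincide with the seven formulas reported.

```python
# Monoisotopic masses of the most abundant isotopes (Da).
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
        "O": 15.9949146, "S": 31.9720707}


def candidate_formulas(target, tol, c_range=(24, 32), h_max=100,
                       n_max=10, o_max=20, s_max=2):
    """Return CHNOS formulas whose monoisotopic mass is within tol of target."""
    hits = []
    for c in range(c_range[0], c_range[1] + 1):
        for n in range(n_max + 1):
            for o in range(o_max + 1):
                for s in range(s_max + 1):
                    heavy = (c * MONO["C"] + n * MONO["N"]
                             + o * MONO["O"] + s * MONO["S"])
                    # The tolerance is far smaller than one hydrogen mass,
                    # so only the nearest hydrogen count can possibly match.
                    h = round((target - heavy) / MONO["H"])
                    if not 0 <= h <= h_max:
                        continue
                    mass = heavy + h * MONO["H"]
                    if abs(mass - target) <= tol:
                        hits.append({"C": c, "H": h, "N": n, "O": o, "S": s,
                                     "mass": round(mass, 4)})
    return hits


# Protonated compound B: experimental value 639.1859, 3-mDa window.
hits = candidate_formulas(639.1859, 0.003)
for f in hits:
    print(f)
```

Solving for the hydrogen count instead of looping over it keeps the search to a few thousand iterations. Each surviving formula could then be passed, one at a time, to a cell-level test such as the three-cell program described in the text.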
Figure 8. Two other xemilofiban solutions found. Left: solution no. 1. Right: solution no. 2.

Table 3. Computer Output for Solution No. 3, Xemilofiban

linear fit 3, coverage 90

          C    H    N    O    S    Cl   F    mass    Lsdef    calcdef
cell E    7    6    2    0    0    0    0    118     530      530
cell D    5    5    1    1    0    0    0    95      370      370
cell C    4    2    0    2    0    0    0    82      60       54
cell B    2    6    0    1    0    0    0    46      420      417
cell A    0    3    1    0    0    0    0    17      260      265

fragment    composition    mass    LSdefect    CalcDefect    MeasDefect
1           E              118     530         530           520
2           D              95      370         370           370
3           CE             200     590         584           590
4           CD             177     430         424           430
5           BD             141     790         787           790
6           BCD            223     850         841           850
7           AE             135     790         795           800
8           ACE            217     850         849           860
9           ABCDE          358     1640        1636          1640
10          (none)         124     0           0             0
11          (none)         175     0           0             0
12          (none)         216     0           0             0

configuration used, W; best permutation, AECDB
configuration used, W; best permutation, BDCEA

In the case of compound B, perhaps because of the presence of a sulfur atom, the isotope ratios of the protonated molecular ion would actually be as useful
as the narrower cell mass window in eliminating formulas for the whole compound. The isotope ratios were graphically generated for the seven possible formulas and compared to the isotopic ratios found experimentally. Only the formulas C28H35N2O13S1 and C24H31N8O11S1 appeared to have matching isotopic ratios (Supporting Information). The combination of the narrower cell window and the matching isotope ratios gives a unique (and correct) elemental composition for this 638-Da compound.

Xemilofiban. The modular structure of xemilofiban is shown in Figure 1, derived on the basis of knowledge of the fragmentation of this compound. Its spectral data were analyzed using the five-cell partition program. A large number of partitions were generated (1 506 841) and tested; only four solutions were found. The coverage values were 78, 78, 90, and 78%, respectively. The solution having 90% coverage (solution no. 3) is shown in
Table 3. The configurations for solution no. 3 are a rotationally equivalent pair. Basically, this is the same as the six-cell modular structure in Figure 1 but with the black ammonia cell combined with the light blue cell. Two of the other solutions, solution no. 1 and solution no. 2, gave similar results but divided the molecule into cells differently, as might be expected. Their modular structures are compared to the molecular structure in Figure 8. In solution no. 1, the ethanol cell has been incorporated into the alkyne, and an ammonia cell has been taken out of the alkyne. Solution no. 2 is similar to solution no. 1, but the ammonia of the amidine has been included with the aromatic amine, and the ethanol has been separated out again. Solution nos. 1 and 2 have ambiguity with respect to spatial orientation of the cells. Solution no. 1 works whether the C7H8O2 or the NH3 is on the end. Solution no. 2 works with either the Y or W configuration, likewise with ambiguous placement of groups on the right side as drawn.

Solution no. 4 is quite different. Here, it appears that the 17, 78, and 46 cells of Figure 1 have been replaced with 93 (C6H7N1) and 48 (C1H4O2) cells, allowing an assignment of the 175 ion. Solution no. 4 appears to be an unlikely solution; a C1H4O2 cell would be very unusual from a chemistry standpoint. However, solution no. 4 does appear to fit the fragmentation data as well as the other solutions.

Figure 9. Top: structure of compound C. Bottom: a solution/configuration for compound C shown overlaid on the accurate mass product ion spectrum of protonated compound C at a collision energy of 25 V, positive ESI mode.

Figure 10. Top: compound D. Bottom: highest coverage solution (X configuration illustrated) overlaid on the accurate mass product ion spectrum of protonated compound D at a collision energy of 25 V, positive ESI mode.
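The permutation test for a linear (W) configuration can be sketched in a few lines of Python. The sketch rests on one assumption drawn from the compound B discussion: a fragment is compatible with a linear arrangement only if its cells sit next to one another, so a "101" pattern (both end cells without the cell between them) rules the permutation out. The function names are hypothetical, and the branched X and Y configurations are not covered. The fragment assignments are those listed in Tables 2 and 3.

```python
from itertools import permutations


def is_contiguous(fragment, order):
    """True if the cells of `fragment` occupy adjacent positions in `order`."""
    positions = sorted(order.index(cell) for cell in fragment)
    return positions[-1] - positions[0] == len(positions) - 1


def linear_permutations(cells, fragments):
    """Return every linear ordering of `cells` in which all fragments are contiguous."""
    return ["".join(p) for p in permutations(cells)
            if all(is_contiguous(f, p) for f in fragments)]


# Compound B, three cells (fragment assignments from Table 2).
compound_b = linear_permutations("ABC", ["A", "C", "AB", "AC", "ABC"])

# Xemilofiban solution no. 3, five cells (assigned fragments from Table 3).
xemilofiban_3 = linear_permutations(
    "ABCDE", ["E", "D", "CE", "CD", "BD", "BCD", "AE", "ACE", "ABCDE"])

print(compound_b)     # ['BAC', 'CAB']
print(xemilofiban_3)  # ['AECDB', 'BDCEA']
```

Both results are rotationally equivalent pairs and match the best permutations printed in the two tables. With 5 cells there are only 120 orderings, so exhaustive testing is inexpensive.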
Figure 11. Top: compound A. Bottom: a solution found for compound A by the C program overlaid on the accurate mass product ion spectrum of protonated compound A at a collision energy of 25 V, positive ESI mode.

The statistics on xemilofiban indicate that most partitions are eliminated. For example, xemilofiban had 1 506 841 partitions since its mass was 358. Out of these, 12 545 partitions that accounted for six or more fragments were transposed into systems of equations, and 1132 more systems of equations were generated as a result of multiple assignments, giving a total of 13 677 systems of linear simultaneous equations that were tested further. The MinimumCoverage test (set at 65%) removed 9743 of these; another 1940 failed to fit any configuration; 926 failed the
least_squares_fit criterion; 851 had linked cells; 124 were not compatible with the molecular formula; and 89 had contradictory CID-MS/MS data. That left the four solutions that were found by the program.

Compound C, a Symmetrical Compound. Symmetry in a compound was not a problem for the program. Like compound B, compound C had one solution. Although the compound had only one solution in terms of cell masses and compositions, there were 12 possible permutations that included all three configurations. Figure 9 shows the annotated spectrum together with the known structure, the correct W configuration, and the correct permutation BECDA. Three of these configuration/permutations had two saturated cells (ammonia-ammonia) that were adjacent, and these three were eliminated (to connect two cells together always requires the loss of one ring or double bond). All nine remaining configurations (including BECDA) are shown in the Supporting Information. Generating plausible alternative modular structures and configurations is one of the objectives of the program.

Compound D, an X Configuration Compound with Multiple Cells Equal. The highest coverage solution for compound D (X configuration permutation with E as the middle cell) is shown in Figure 10. There were a total of four solutions. Compound D is definitely an X-type compound; however, the solutions all indicate that all configurations can be shown to fit the data, perhaps because three of the cells are identical here.

Three Examples of Unanticipated Problems Observed when Analyzing Six- and Seven-Cell Compounds with the Five-Cell Program: Compound A, Leucine Enkephalin, and Orbofiban. Xemilofiban is a six-cell organic compound (Figure 1); as previously noted, analyzing xemilofiban with the five-cell program gave three solutions that basically combine cells in different ways.
It was expected that most six-, seven-, and eight-cell compounds could be analyzed with the five-cell program, at the cost of lower coverage solutions. However, if a compound has two or more equal cells and the total number of cells in the compound exceeds the cells in the program, problems can occur. Compound A is similar to xemilofiban, but it has an acetyl group (mass 43) in place of the alkyne (mass 25) group. The five-cell program was run, and 11 solutions were found. Three solutions appeared to be reasonable answers: solution nos. 2, 3, and 6. Solution nos. 5 and 6 had the highest coverage at 76%. Solution no. 6, with the Y configuration, is shown in Figure 11, and it is a very acceptable solution. It was expected that one solution would be identical to solution no. 3 (Table 3) for xemilofiban, but with one cell increased in mass by 18 mass units. However, that expected solution was not found. A program that can trace a partition was run, and it was found that this expected solution did not fit any configuration, so it was eliminated. The NH3 cell was used with the 82-mass cell to account for the 100-Da fragment ion in the spectrum, and the same NH3 cell was also being used with the aromatic amine (118 cell) to account for the 136-Da fragment ion in the spectrum. As noted, there are two NH3 cells possible: xemilofiban and compound A are actually six- and seven-cell compounds, respectively. However, a solution having a single NH3 cell cannot use the same NH3 in two different places in the modular structure, so this solution for compound A was eliminated. In the case of xemilofiban, the 100-Da fragment ion was under the 2% relative intensity limit, so this contradiction was not observed (the 100-Da fragment ion has 5% relative intensity in compound A). Leucine enkephalin has the protonated molecular ion and 23 fragment ions in its product ion spectrum. The maximum number of ions that can be assigned with five cells is 20 ions, and therefore, the MinimumCoverage parameter was set to 50%. The five-cell computer program assigned 10 ions (solution no.
4, Supporting Information), and the cells were calculated at the predicted masses, but the expected W configuration was not found. A single CO (mass 28) cell was used in two locations, like the ammonia had been in compound A. However, in compound A, no configuration fit those assignments, so that system of equations was eliminated. In this case, the correct W configuration was eliminated, and a less plausible Y configuration was chosen that allowed the assignment of 10 ions. A third example is orbofiban, which is a six-cell compound. The best solution for orbofiban had 87% coverage and assigned 8 out of 11 ions (solution no. 6). However, this solution was not a W configuration modular structure, as expected for this linear compound, but an X or Y configuration. In orbofiban, there are two equal cells of mass 71, C3H5NO. To properly assign the same 8 ions using the W configuration, a six-cell modular structure would have to be used.

The object of partitioning is to find all of the most plausible solutions. Currently, once a set of neutralized fragment ion assignments is made by the C program, those assignments as a group will either pass or fail the subsequent screening steps. Contradictory assignments can occur when six- and seven-cell molecules that have two or more identical cells are treated mathematically as five-cell modular structures. The earlier examples (compounds B, C, and D) each had two or more identical cells. Configuration contradictions were not observed because compound B is a three-cell compound that was analyzed with a three-cell program, and compounds C and D are five-cell compounds that were analyzed with the five-cell program.

Compound E. Previous solutions that have been discussed have cells that are arithmetic differences between two fragments or between a fragment and a proton. For example, for compound C (Figure 9), the cells are 17, 118, and 82. There are pairs of fragment ions in the spectrum of compound C that have those differences (353-336; 336-218; 218-136). It would appear to be simpler and faster to use arithmetic differences as possible cells instead of partitioning and trying all possible integers as cells.

Table 4. Assignments for Compound E Showing That Cell C Is Not a Simple Difference

solution no. 1

fragment    composition    mass    LSdefect    CalcDefect    MeasDefect
1           CD             141     790         787           790
2           BCE            200     580         584           590
3           BCD            169     730         736           740
4           AE             135     800         795           800
5           ABCE           217     860         849           850
6           ABCDE          304     1530        1531          1530

model used, W; best permutation, AEBCD
model used, Y; best permutation, DCEBA
model used, Y; best permutation, BCEDA
model used, W; best permutation, DCBEA
The cells are always differences between fragments or between fragments and a proton, but not always simple differences. Compound E is an example of a compound in which one of the cells (cell C) is not a simple difference between two fragment ions in its spectrum. The assignments for the best solution for compound E (solution no. 1) are shown in Table 4 (details in Supporting Information). In this case, cell C can be calculated as: ion 1 + ion 5 - ion 6 = (C + D) + (A + B + C + E) - (A + B + C + D + E) = C. Therefore, cell C is a sum/difference of three ions.
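The distinction between simple differences and sums/differences of several ions can be checked numerically. The Python sketch below is illustrative only: the function name is hypothetical, and it uses just the six assigned neutralized fragment masses from Table 4 and the four protonated ions quoted for compound C, not the complete spectra. For each positive integer it records the smallest number of ions needed to reach that value with plus and minus signs.

```python
from itertools import combinations, product


def signed_sums(ions, max_terms):
    """Map each reachable positive value to the fewest ions needed.

    A value is reachable if it equals a +/- combination of up to
    `max_terms` distinct ions from `ions`.
    """
    best = {}
    for k in range(1, max_terms + 1):  # ascending, so the first hit is minimal
        for subset in combinations(ions, k):
            for signs in product((1, -1), repeat=k):
                value = sum(s * m for s, m in zip(signs, subset))
                if value > 0 and value not in best:
                    best[value] = k
    return best


# Compound C: the cells 17, 118, and 82 are simple pairwise differences.
compound_c = signed_sums([353, 336, 218, 136], 2)
print(compound_c[17], compound_c[118], compound_c[82])  # 2 2 2

# Compound E: cell C = 141 + 217 - 304 = 54 needs three of the assigned ions.
compound_e = signed_sums([141, 200, 169, 135, 217, 304], 3)
print(compound_e[54])  # 3
```

The number of signed combinations grows quickly with the number of ions and terms, which is in line with the point made in the text that enumerating sums/differences is not obviously cheaper than partitioning the molecular weight directly.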
Analysis reveals that three-cell modular structures have cells in which the cell masses are always differences between two fragments or between a fragment and a proton. But this is the simple case. For example, at five cells, with the minimum_number_of_ions set at 6, there are cell masses that potentially are sums/differences of as many as four fragment ions and a proton. This greatly increases the number of sums/differences possible. For example, using the 12 neutralized ions in the product ion mass spectrum of xemilofiban (Table 3), it is possible to generate every integer between 1 and 358 (the molecular weight of xemilofiban) using sums/differences of 1, 2, 3, and 4 ions. (This analysis is shown in the Supporting Information.) With the exception of three-cell modular structures, rather than generating all of the possible sums/differences, storing these possible cell values in an array, and then using this array to generate partitions, it is computationally faster to just generate and test all possible partitions of the molecular weight.

CONCLUSIONS

Computer programs were developed to generate modular structures of three, four, and five cells from mass spectral data and tested on molecules with well-known and straightforward fragmentation pathways. These computer programs were often successful in assigning fragments and in finding the exact masses and elemental compositions of the cells. The programs were also able to easily handle a symmetrical compound and a compound having multiple cells of the same mass. Dividing a molecule into smaller cells was also found to effectively limit the elemental composition of the whole compound. This approach was not usually successful in determining a unique configuration of the cells in the modular structure by checking all possible permutations against all possible configurations; multiple configurations were usually found for each solution.
Presently, the most common cause of problems is analyzing six- and seven-cell compounds, having at least two identical cells, with a five-cell program. For compound A, this led to contradictory fragment assignments and deletion of one expected five-cell solution. This could be remedied with a seven-cell program. For two other examples, leucine enkephalin and orbofiban, the same situation led to incorrect configurations that use fewer cells to obtain the same coverage that a six-cell solution could achieve in theory. This could be remedied with the development of a six-cell program; however, these are two examples for which the simpler solutions (five-cell versus six-cell solutions) are not the correct solutions, and one would tend to favor simpler solutions in the case of unknown compounds.

This preliminary study indicates that partitioning may be very useful for identification, especially in those situations in which background information is minimal. However, many improvements are needed: programming for six-, seven-, and eight-cell partitions; an additional stage for the Monte Carlo optimization to more precisely calculate the mass defects of the cells; a module to eliminate elemental compositions of the whole molecule based on the isotope ratios of the molecular ion; a module to check for rotational equivalence; and a module to apply some simple chemical rules (e.g., two saturated cells cannot be attached to one another). Much more work will be needed to determine whether this approach could be applied to more complex spectra (e.g., fragmentation via multiple pathways).

ACKNOWLEDGMENT

I thank Dr. Jeremy Hribar (Senior Research Advisor, Pharmacia/Pfizer) for helpful discussions over many years on using mass spectrometry to identify compounds. I thank Ms. Kerry Brown, Dr. Hans Westenburg, and Dr. Rick Rhinebarger (all of Pharmacia/Pfizer) for their critical reviews of this manuscript. I thank John Hoyes and Iian Lloyd of Waters/Micromass for helpful tips on the tuning parameters needed for obtaining excellent accurate mass MS/MS data on small molecules.
SUPPORTING INFORMATION AVAILABLE 1, Tuning parameters for the Q-TOF-2 mass spectrometer; 2, conditions used to obtain the spectra; 3, configurations; 4, cells as sum/differences of neutralized fragment ions; 5, compound B: inputting formulas/isotope ratios; 6, xemilofiban; 7, compound A; 8, compound B; 9, compound C; 10, compound D; 11, leucine enkephalin; 12, orbofiban; and 13, compound E. This material is available free of charge via the Internet at http://pubs.acs.org. Received for review April 29, 2003. Accepted August 1, 2003. AC034446K