Differentiation of Closely Related Isomers: Application of Data Mining

ARTICLE pubs.acs.org/ac

Differentiation of Closely Related Isomers: Application of Data Mining Techniques in Conjunction with Variable Wavelength Infrared Multiple Photon Dissociation Mass Spectrometry for Identification of Glucose-Containing Disaccharide Ions

Sarah E. Stefan,† Mohammad Ehsan,† Wright L. Pearson,†,|| Alexander Aksenov,†,^ Vladimir Boginski,‡ Brad Bendiak,§ and John R. Eyler†,*

† Department of Chemistry, University of Florida, P.O. Box 117200, Gainesville, Florida 32611-7200, United States
‡ Department of Industrial & Systems Engineering, University of Florida, 1350 North Poquito Road, Shalimar, Florida 32579-1163, United States
§ Department of Cellular and Developmental Biology and Program in Structural Biology and Biophysics, University of Colorado at Denver and Health Sciences Center, Aurora, Colorado 80045, United States

ABSTRACT: Data mining algorithms have been used to analyze the infrared multiple photon dissociation (IRMPD) patterns of gas-phase lithiated disaccharide isomers irradiated with either a line-tunable CO2 laser or a free electron laser (FEL). The IR fragmentation patterns over the wavelength range of 9.2–10.6 μm have been shown in earlier work to correlate uniquely with the asymmetry at the anomeric carbon in each disaccharide. Application of data mining approaches for data analysis allowed unambiguous determination of the anomeric carbon configurations for each disaccharide isomer pair using fragmentation data at a single wavelength. In addition, the linkage positions were easily assigned. This combination of wavelength-selective IRMPD and data mining offers a powerful and convenient tool for differentiation of structurally closely related isomers, including those of gas-phase carbohydrate complexes.

Received: July 2, 2011. Accepted: September 29, 2011. Published: September 29, 2011.
© 2011 American Chemical Society. dx.doi.org/10.1021/ac2017103 | Anal. Chem. 2011, 83, 8468–8476

Isomers remain a problem in mass spectrometry, especially closely related isomers that yield very similar ratios of product ions upon unimolecular dissociation. This situation is frequently encountered among isomers varying in their stereochemistry at one or more positions (e.g., carbohydrates and glycoconjugates1–3), positional isomers (e.g., phospholipids, di- and triacylglycerols4–6), and isomers varying in the nature of double bonds (cis or trans or multiple double bonds, as in various lipids, wax esters, hydrocarbons, and other compounds7–9). The problem has been approached in different ways, some more successful than others, depending on the nature of the precursor ions. Use of differences in fragmentation patterns after collision-induced dissociation (CID) is still the most common and well-studied approach.1–3,5–7,9,10 Other techniques exploit differences in physical properties of isomeric ions: migration rates of ions through a neutral gas (ion mobility spectrometry),11,12 reactivities of ions,13,14 charge exchange MS,15,16 direct photon absorption or emission,17,18 variable wavelength photodissociation in the infrared,19–23 visible24 or ultraviolet25 with one or more26,27 lasers, differences in photodissociation by rapid pulse shaping over an approximate Gaussian wavelength distribution centered at 800 nm,28,29 and relatively high-resolution spectroscopy in cold traps.30,31

In cases relying on dissociation of a precursor ion, the mass spectra may be too similar to assign particular isomers from the product ion ratios with confidence. This can occur for several reasons: (1) product ion ratios frequently differ somewhat among different instruments; (2) product ion ratios can vary somewhat using the same instrument under supposedly identical conditions; (3) methods of processing digitized spectral data vary, and at least some error is introduced in quantitation of spectral peaks depending on the processing method; (4) white noise is present to varying extents; and (5) a proper statistical evaluation of data is often not performed.

Data mining is a set of techniques for evaluation and grouping of complex sets of data depending on multiple selected criteria.32,33 In mass spectrometry, the analyses can be applied to virtually any quantitative values (e.g., product ion intensities at different m/z values, multiple secondary or tertiary product ion intensities in MS2 or MS3 experiments, chromatographic retention times, ion mobility drift times) which exhibit a difference among the target compounds. The confidence in discrimination increases as the number of measurable differences increases. In some cases, mining methods easily discriminate spectral features that are markedly different (such as the presence of product ions having completely different m/z), in which case detailed statistical evaluation of data is not needed. In other cases when spectra are more similar, a number of techniques have been assessed to differentiate structures or to cluster structural features, including simple univariate analysis of variance (ANOVA)34,35 and multivariate analysis of variance (MANOVA).36,37 A key goal in multivariate analyses is to reduce the dimensionality to the most crucial variables by unsupervised techniques (principal component38,39 or independent component40 analyses) or by supervised learning techniques, many of which are still under development. These include separating-hyperplane optimization techniques,32,41 artificial neural networks,42,43 regression analysis,44,45 evolutionary-type algorithms,46 discriminant analysis,47–49 and regression trees.50,51 These techniques have often been tailored to assess specific attributes of spectra within large mass spectral data sets, for example, in proteomics.

However, in a consideration of closely related isomers, the possible presence of complete unknowns in complex mixtures brings into question the validity of some of the above techniques. The statistical method must address several important questions: (1) How many potential chemical isomers may exist at this m/z, including enantiomers, diastereomers, and regioisomers? (2) How frequently might isomers be misidentified? (3) Does the mass spectral technique provide sufficient information to avoid identification errors?
(4) If isomers have very similar physical parameters, is it better to include additional physical attributes or can small collective differences in these parameters serve as valid discriminatory evidence (given that an appropriate statistical basis is used)? Data can be assessed in many ways, but for the purposes of differentiating two quantitative patterns, individual “training data sets” for each compound must be accumulated. In the case of isomers and mass spectrometry, this involves repetitions of mass spectral acquisitions to assess variability of the data on a given instrument as well as evaluation of different algorithms that may determine the most effective quantitative indicators to differentiate the spectra. With the vector-based representation of data and separating-hyperplane techniques used in this paper,32,41 the best multidimensional hyperplane between two isomers is determined so that, given another experimental repeat, the mass spectrum can be classified as being on the correct side of the hyperplane at a certain probability (confidence) level. With a sufficient number of physical measurements, the goal is to make effective use of the collective data to assign a specific structure from a potentially large number of isomers. Herein, we describe the value of multiple photon dissociation spectra at different IR wavelengths in the differentiation of disaccharide isomers. The isomers chosen were a closely related set of eight glucose-containing disaccharides having different linkages and anomeric configurations to assess the utility of variable wavelength dissociation spectral patterns when combined with pattern analysis algorithms. As both sugars were identical, a number of comparisons could be made between structures having identical stereochemistries that varied solely in their linkage positions or between structures having identical linkage positions that varied solely at a single stereochemical center. 
In the past, other sets of disaccharide isomers having different sugars (e.g., mannose, galactose, or fructose) have also been examined using collision-induced dissociation, both in the positive52–54 and negative10,55,56 ion modes. Differences were noted in their mass spectra, particularly when isomers having different linkage positions were compared. In the positive ion mode, the lithium ion is most useful among the alkali metals as the binding energies of alkali metals decrease in the order Li+ > Na+ > K+ > Rb+ > Cs+,57–59 although Na+ has also been effectively utilized. The Li+ ion tends to result in complex dissociation patterns more readily during multiple stages of isolation/dissociation (MSn) and it remains more strongly coordinated with multiple product ions. Anionic adducts of disaccharides (such as Cl−, Br−, F−, HSO4−, and OAc−) have also been examined in the negative ion mode,60,61 in addition to deprotonated disaccharides.10,55,56 Their dissociation patterns would also be worth examining by methods described herein because isomers in those studies often showed differences in dissociation using collision-induced dissociation.

With the use of variable wavelength infrared dissociation, each wavelength provides an independent mass spectrum that can be unique for each isomer. The probability that two compounds will render statistically identical spectra at many wavelengths becomes lower as the number of measurements increases. With the use of data mining techniques, multiple repeats of training data sets, and statistical analyses, it is demonstrated that even isomers that yield very similar spectra visually can be statistically differentiated at high levels of confidence.

EXPERIMENTAL SECTION

Fragmentation Methods. Samples of individual glucose-containing disaccharides with various linkages (1-2, 1-3, 1-4, and 1-6) and anomeric configurations (α and β) were irradiated by a free electron laser (FEL) at the FOM Institute for Plasma Physics Rijnhuizen and a line-tunable CO2 laser at the University of Florida. The fragmentation methods and procedures have been described elsewhere.22,62 Additional experiments were carried out using single wavelengths from a line-tunable CO2 laser for the 1-3-linked disaccharides nigerose and laminaribiose in order to assemble sufficiently large data sets for data mining analysis purposes. For these studies, the disaccharide samples were prepared at a concentration of 0.1 mM with 0.1 mM LiCl in a solution of general-use grade methanol and Milli-Q ultrapure water (80:20). Ionization of disaccharides as lithium cation-coordinated adducts in the gas phase was achieved using a commercial electrospray ionization (ESI) source (Analytica of Branford, Branford, CT) with a user-modified heated metal capillary63–65 having a conical capillary inlet66 maintained at a temperature of 125 °C. Mass spectra were acquired on a Bruker Apex 47e Fourier transform ion cyclotron resonance (FTICR) mass spectrometer (Bruker Daltonics; Billerica, MA) with a 4.7 T superconducting magnet (Magnex Scientific Ltd.; Abington, U.K.) and an Infinity cell.67 The lithium cation-attached disaccharide parent ions (m/z 349) were mass isolated and irradiated for 1 s with a Lasy-20G tunable continuous wave CO2 laser (Access Laser Co.; Marysville, WA) with a power range of 0–20 W and a wavelength range of 9.2–10.8 μm. The irradiation time was controlled by a laboratory-constructed mechanical shutter that allowed light to enter the Infinity cell via a coated ZnSe window on the opposite side of the vacuum chamber from the ion transfer optics.
The laser power was adjusted to keep the intensity ratio of the m/z 169 fragment ion to the precursor ion approximately 2:1 for each anomer. No internal mirrors were used, so ions were exposed to only a single pass of infrared irradiation. A highly reflective gold mirror was mounted on the mechanical shutter, and the laser beam was reflected onto a power meter whenever the beam was not passing into the cell. A total of 15 scans were averaged before Fourier transformation to obtain the mass spectrum at each wavelength. A total of 200 mass spectra, 100 for nigerose and 100 for laminaribiose, containing precursor and infrared multiple photon dissociation (IRMPD) fragment ions, were obtained for each of the three different wavelengths (9.282, 9.588, and 10.611 μm). From the perspective of data mining, the precursor and product ion intensities of a single mass spectrum comprise the Cartesian coordinates of the end of a multidimensional vector that is defined as one “data point” associated with a specific disaccharide.

An additional blind study was carried out with the nigerose and laminaribiose anomers to verify the predictions based on the output of the data mining algorithms (see the Results and Discussion). These two anomers were analyzed using the CO2 laser by one of the authors (M.E.) with no knowledge of their identities. Over the course of several days, 35 IRMPD spectra were collected for each anomer at 10.611 μm using the same acquisition parameters given above.

Classification via Data Mining Techniques. Each data point was mathematically represented as a pair (x, y), where x = (x1, x2, ..., xn) is a vector of attributes (i.e., values of fragment ion intensities at the given wavelength in the dissociation spectra), and y is a class attribute (disaccharide identity), which is known for the elements in the training data set and unknown for those in the test data set. Therefore, for every element in the test data set, a classification algorithm was needed to predict the unknown value of y based on the known values of (x1, x2, ..., xn).
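As an illustration of this representation, the following Python sketch turns one spectrum into a data point (x, y). It is a minimal example under stated assumptions, not the authors' code: the list of monitored m/z values is assembled from ions mentioned in the text, the scaling of intensities to percent of the summed intensity is assumed, and the names `MZ_ORDER`, `DataPoint`, and `to_data_point` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# m/z values mentioned in the text (precursor at 349 plus fragments);
# treating exactly these ions as the attribute set is an assumption.
MZ_ORDER: Tuple[int, ...] = (349, 259, 187, 169, 127, 109, 97)


@dataclass(frozen=True)
class DataPoint:
    x: Tuple[float, ...]  # attribute vector: relative ion intensities (%)
    y: Optional[int]      # class attribute: 0 or 1, None when unknown (test set)


def to_data_point(intensities: Dict[int, float],
                  label: Optional[int] = None) -> DataPoint:
    """Map raw ion intensities {m/z: intensity} to a fixed-order vector.

    Intensities are scaled to percent of their sum (an assumed
    normalization); ions absent from the spectrum contribute 0.
    """
    total = sum(intensities.get(mz, 0.0) for mz in MZ_ORDER)
    if total <= 0:
        raise ValueError("spectrum contains none of the monitored ions")
    x = tuple(100.0 * intensities.get(mz, 0.0) / total for mz in MZ_ORDER)
    return DataPoint(x=x, y=label)


# Hypothetical spectrum with made-up intensities (arbitrary units), class 1.
point = to_data_point({349: 30.0, 169: 60.0, 109: 10.0}, label=1)
```

Fixing the order of the m/z values matters because each coordinate of x must refer to the same ion in every spectrum; a training point carries its label, whereas a test point is built with `label=None`.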
For the 200 data points obtained for nigerose and laminaribiose with the CO2 laser, 160 points (80 spectra for each compound) were randomly assigned to the training data set to adjust the parameters of the classification model, and the remaining 40 points were used to test the performance of the model. The classification was repeated 50 times for various permutations of points in the training and test data sets.

For data from the FEL IRMPD experiments,22 only one data point of precursor and IRMPD fragment ion intensities at a particular wavelength was available for each disaccharide. Therefore, prior to the data mining analysis, these data were partitioned by removing every 10th point of the data set for a particular disaccharide and placing it into the test data set for that disaccharide. The resultant test data sets therefore contained approximately 10% of the data points spanning the entire range of wavelengths. The remaining data (∼90% of the data points) comprised the training data set.

It was found that linkage isomers were readily differentiated by their different IRMPD fragment ions but that the IRMPD spectra of anomers were more closely related, having the same product ions which only varied somewhat in their relative abundances (see the Results and Discussion). For a given linkage, only two anomers must be differentiated. Thus, the classification problem to be solved is dichotomous and the data mining algorithm’s output is either 0 for one anomer or 1 for the other for all data points. The geometric separating surface approach was employed.32 A brief description is given in the Supporting Information. The linear programming (LP) optimization problem that is the basis of this approach was solved using Xpress-MP optimization software.68 The hyperplane found by the software was then used to classify the “unknown” data points in the test data set, i.e., to determine which points fell on which side of the hyperplane. This procedure was followed for all anomeric pairs of disaccharides.

The trained model was further tested to establish robustness of the resulting classification procedure. The anomeric pair of nigerose and laminaribiose was used as an example. Certain fragments at the wavelengths of interest were selected, and the experimental data for those fragments were intentionally impaired by adding white noise, as described in the Supporting Information. The signal-to-noise (S/N) ratios (using the averaged intensity of the fragment ion of interest as S and the random added noise coefficient C as N, see the Supporting Information for more detail) which resulted in 99% and 95% classification accuracy (confidence limits) were determined. In addition, the probability values for the Student’s t test were calculated for the original data and data with added noise using the standard t test function in the Microsoft Office XP Excel spreadsheet software.

For analysis of the CO2 laser IRMPD data, in addition to the LP method discussed above, a decision tree analysis classification was employed. A decision tree can be represented by a set of nodes that consecutively split the data points into subgroups using criteria (logical rules) involving certain attributes of the data. These rules are constructed using data mining algorithms that find the best splits of features to maximize the ability to separate the data, i.e., a node should accurately split data points in one class from data points in another class. For the pairs of disaccharides considered in this analysis, Tiberius Predictive Modeling software69 was used to construct decision tree models containing a relatively small number of branches.
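The exact LP formulation used by the authors is given in their Supporting Information and is not reproduced here. For orientation only, one widely used linear program for finding a separating hyperplane between two point sets (the robust LP formulation of Bennett and Mangasarian) is written out below; whether it matches the authors' model in every detail is an assumption. The symbols are introduced for this illustration: a_i (i = 1, ..., m) and b_j (j = 1, ..., k) are the training vectors of the two anomers, w and γ are the normal vector and offset of the hyperplane, and u_i, v_j are non-negative violation variables.

```latex
\begin{aligned}
\min_{w,\,\gamma,\,u,\,v}\quad
  & \frac{1}{m}\sum_{i=1}^{m} u_i \;+\; \frac{1}{k}\sum_{j=1}^{k} v_j \\
\text{subject to}\quad
  & w^{\top} a_i - \gamma + u_i \ge 1, \qquad i = 1,\dots,m, \\
  & -\left(w^{\top} b_j - \gamma\right) + v_j \ge 1, \qquad j = 1,\dots,k, \\
  & u_i \ge 0, \quad v_j \ge 0 .
\end{aligned}
```

Under this formulation a new spectrum x is assigned to the first anomer when w^T x − γ > 0 and to the second otherwise. When the two training sets are linearly separable, the optimal objective value is zero, and a coordinate of w that is zero at the optimum marks a fragment that does not contribute to the discrimination.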

RESULTS AND DISCUSSION

Fragmentation patterns produced by the CO2 laser and the FEL for the lithiated disaccharides studied in this work are shown in Figures 1 and 2, respectively. The dissociation profiles can be seen to be somewhat different for the two lasers employed. For example, fragmentation of all lithiated disaccharides with the FEL led to significantly higher abundances of lower mass fragments compared to fragmentation using the CO2 laser. The discrepancies result in part from the different nature of the output from the CO2 laser and the FEL. Unlike the continuous wave CO2 laser used at the University of Florida, the FEL generates a train of low frequency macropulses (duration of ∼5 μs, frequency of either 5 or 10 Hz) consisting of ∼1 ps duration micropulses, which are produced at a frequency of 1 GHz.70 Even when corrected for laser power differences, subjecting ions to a train of short but relatively intense and repetitive laser irradiation pulses as opposed to a continuous beam may affect the IRMPD process for an ion and thus alter fragmentation. Spectra are also affected by laser alignment and exposure of product ions for periods of time to continued heating. Hence, spectra are the result of the energy pathways that are accessible to dissociation of both precursor and product ions, and different ion ratios result from the unique setup of specific FTICR-laser instruments. However, as evaluated below, ratios can be highly reproducible on a specific instrument.

Figure 1. Wavelength-dependent fragmentation patterns produced by line-tunable CO2 laser IRMPD for the various lithiated, glucose-containing disaccharides.62 (A) kojibiose (α1-2), (B) sophorose (β1-2), (C) nigerose (α1-3), (D) laminaribiose (β1-3), (E) maltose (α1-4), (F) cellobiose (β1-4), (G) isomaltose (α1-6), and (H) gentiobiose (β1-6).

Figure 2. IRMPD wavelength-dependent fragmentation patterns22 produced with the free electron laser for infrared experiments (FELIX) at the FOM Institute for Plasma Physics Rijnhuizen for lithiated (A) kojibiose, (B) sophorose, (C) nigerose, (D) laminaribiose, (E) maltose, (F) cellobiose, (G) isomaltose, and (H) gentiobiose.

For both CO2 and FEL irradiation, the dissociation patterns were uniquely wavelength-dependent for each disaccharide. Different fragments were obtained in the IRMPD spectra of the linkage isomers (e.g., compare panels A, C, E, and G or B, D, F, and H in Figures 1 and 2), while IRMPD of the α- and β-anomers of a particular linkage isomer resulted in sets of fragments with the same m/z (e.g., compare panels A to B, C to D, E to F, and G to H, Figures 1 and 2). The assignment of the positional isomers (1-2 linkage vs 1-3 linkage, etc.) was therefore straightforward, whereas discrimination between varying ion abundance ratios was necessary to discern the α- and β-anomers of each linkage set. In some cases, the differences in IRMPD fragmentation between anomers were not immediately visually obvious, such as the 1-3- and 1-4-linked disaccharide pairs (Figure 1, compare panels C to D or E to F). Unlike the discrimination between methyl glycosides of monosaccharides,71 IRMPD spectra of these pairs of disaccharides gave the initial impression that differentiation of anomers might not be feasible. However, application of data mining algorithms enabled unambiguous identification of each isomer in every case.

Because the IRMPD spectra of nigerose and laminaribiose visually appeared to be most similar, this anomeric pair was selected for data mining analysis, as shown in Table 1. Sufficiently large training data sets were collected to allow for appropriate training of the algorithm. Consequently, the classification accuracy of the test data sets was 100% for the disaccharides studied at each of the three selected wavelengths (9.282, 9.588, and 10.611 μm; see Table 1) for all of the test/training data set variations. Moreover, the models were able to identify one or two fragments that would correctly discriminate between isomers, as shown in the last column of Table 1. This high classification efficiency of the LP-based method is particularly encouraging, since the technique is computationally efficient even for very large data sets and is rather simple to implement using modern software packages such as the Xpress-MP software package used in this analysis. Importantly, at least for the nigerose and laminaribiose isomers, the 100% classification accuracy was achieved at 10.6 μm, one of the principal CO2 laser wavelength bands most commonly used in laboratories (an example of fragmentation data at this wavelength is given in Table S-3 in the Supporting Information).

Table 1. Data Mining Classification Outcomes for Nigerose (Glcα1-3Glc) and Laminaribiose (Glcβ1-3Glc) Using a Line-Tunable CO2 Laser for Irradiation

| disaccharides classified | LP: no. of points in the test data set correctly classified | LP: no. of times correctly classified | LP: classification accuracy (%) | decision tree: no. of points in the test data set correctly classified | decision tree: no. of times correctly classified | decision tree: classification accuracy (%) | fragments used for classification (m/z) |
|---|---|---|---|---|---|---|---|
| nigerose vs laminaribiose at 9.282 μm | 40/40 | 50/50 | 100 | 40/40 | 50/50 | 100 | 109, 187 |
| nigerose vs laminaribiose at 9.588 μm | 40/40 | 50/50 | 100 | 40/40 | 50/50 | 100 | 169, 259 |
| nigerose vs laminaribiose at 10.611 μm | 40/40 | 50/50 | 100 | 40/40 | 50/50 | 100 | 109 |

The results of LP analysis can be used directly for differentiation of the anomeric pairs studied in these experiments. To illustrate this, a blind study was conducted for the nigerose/laminaribiose pair of anomers, as described in the Experimental Section. According to the results obtained from LP analysis (not shown in Table 1), any points with m/z 109 fragment intensity values below 5.9% belonged to laminaribiose, while points with m/z 109 fragment intensities above that value belonged to nigerose. This criterion allowed for correct identification of the anomer for every one of 35 runs in the blind study (100% prediction accuracy).

The fact that either nigerose or laminaribiose was correctly identified in each of the 35 runs of the blind study does not provide a valid statistical measure of the probability of correct identification in future studies. A more rigorous measure is provided by the Student’s t test analysis, which gives the probability that the two samples used in the blind study were the same compound, and these probabilities were found to be vanishingly small. For example, for the 9.282 μm wavelength for nigerose and laminaribiose, the t test for anomeric discrimination using the m/z 169 fragment returned a value of 1.74 × 10^-26, while for 10.611 μm utilizing the m/z 109 fragment the t test probability value was 1.45 × 10^-174. These extremely low values result from the large sample size (n = 100).
However, for lower n values, the t test still indicates clear differences between nigerose and laminaribiose: e.g., for the m/z 109 fragment with irradiation at 10.611 μm when only n = 5 abundance values were used, the t test probability values ranged from 1.15 × 10^-6 to 2.84 × 10^-5 for four different randomly chosen sets of five IRMPD experiments. Also, even when only n = 3 abundance values were used, the t test probability values ranged from 1.90 × 10^-5 to 4.77 × 10^-3, again for four different randomly chosen sets of IRMPD experiments.

The advantages of the data mining approach for the differentiation of closely related structures were also evaluated by deliberately increasing the levels of white noise relative to the signals for key discriminatory ions, as described in the Supporting Information. For nigerose and laminaribiose with an irradiation wavelength of 10.6 μm, the fragment with m/z 109 was shown by data mining to be the most useful for the differentiation of the anomers. By addition of noise to the experimental data, it was possible to lower classification accuracy below 100%. A classification accuracy of 99% was observed when the signal-to-noise ratio for the m/z 109 ion was reduced to approximately 2 for nigerose and ∼0.7 for laminaribiose, and 95% when the signal-to-noise ratio was reduced to ∼1.5 for nigerose and ∼0.55 for laminaribiose. Consideration of other single product ions that were less predictive in discriminating the isomers quite expectedly resulted in lower classification accuracy upon inclusion of increased noise.

When considering two product ions, the classification accuracy could be increased beyond that obtained with a single ion. This depended on two factors: the discrimination capability of each individual ion pair and the noise. For example, two fragments, m/z 169 and m/z 259, were most predictive individually using the CO2 laser at a wavelength of 9.588 μm in the comparison between laminaribiose and nigerose (Table 1, Figure 1, panels C and D). Noise was added to the overall spectra for nigerose and laminaribiose in the same fashion as in the above example for the 10.6 μm irradiation wavelength (same value of scaling coefficient C, see the Supporting Information for details), and when the m/z 169 ion was considered individually, the resulting classification accuracy was found to be 96%. When only the m/z 259 ion was compared in the same spectra, the classification accuracy achieved was 60%. However, when both ions were used, the classification accuracy increased to 98%. For the m/z 169 ion, the added noise resulted in S/Ns of approximately 4.6 and 5.9 for nigerose and laminaribiose, respectively, while for the m/z 259 ion the S/N was approximately 1.3 for nigerose and 1.6 for laminaribiose.

In addition to establishing the effect of fragment choice, it was also important to evaluate classification accuracies at different wavelengths. Prediction accuracies of 100% were achieved for our data set at wavelengths of 9.282, 9.588, and 10.611 μm (Table 1). It is not possible to extrapolate these conclusions to additional wavelengths without actually performing the experiments.
Although it is suspected that other wavelengths of the CO2 laser may be equally useful in differentiating isomers, the large number of repeats for these specific spectral comparisons limited the number of wavelengths that were examined.

Data mining analysis could be performed over multiple wavelengths using data from FEL IRMPD22 experiments. This capability demonstrates several advantages of the FEL: (1) broad wavelength range (7.2–10.8 μm), covering most of the mid-infrared wavelength range and the entire range of the CO2 laser; (2) closely spaced wavelengths (step size of less than 0.03 μm); (3) acquisition of the entire range of wavelengths in one run, thus minimizing scan-to-scan variability; and (4) reasonably constant power output over the entire wavelength range and capability to correct for systematic variations. Since only one set of fragmentation vs wavelength data was available from the FEL study, the data mining analysis was carried out somewhat differently, as described in the Experimental Section. As with the CO2 laser IRMPD data discussed above, 100% classification accuracy was achieved. In contrast to the CO2 laser, however, the FEL has a much wider tuning range and can reach regions where the fragmentation efficiency is diminished due to low IR absorption efficiency. For data at these extreme ends of the range, the much lower signal-to-noise ratio due to dramatically reduced fragmentation efficiency can lead to misclassifications. The wavelengths (on either the low or high end of the tuning range) where misclassifications occurred are summarized in Table 2. An example of a test data set for the IRMPD of nigerose and laminaribiose (Table S1 in the Supporting Information) showed that the misclassification occurred for the laminaribiose anomer at 7.3 μm, the shortest wavelength of the range. Indeed, for all of the anomeric pairs listed in Table 2, each of the observed misclassifications occurred for data at either the low or high end of the wavelength range. Elimination of the poor quality data from wavelengths at either end of the irradiation range resulted in 100% classification accuracy for the remaining test data set, similar to the CO2 laser IRMPD data discussed above. Therefore, in order to avoid possible misclassifications, only regions of strong absorption should be utilized for differentiation of isomers. The entire range of the CO2 laser (9.2–10.8 μm) was within the optimal region, making the CO2 laser a good choice for IRMPD studies of disaccharides.

Table 2. Wavelengths in the Upper and/or Lower Limit of the Range of FEL Irradiation22 for Which Minimal Fragmentation Led to Disaccharide Misclassification(a)

| glucose-containing disaccharide classified | misclassification wavelength, μm |
|---|---|
| kojibiose (α1-2) | |
| sophorose (β1-2) | |
| nigerose (α1-3) | |
| laminaribiose (β1-3) | 7.30 |
| maltose (α1-4) | 7.82; 10.75 |
| cellobiose (β1-4) | |
| isomaltose (α1-6) | 10.72 |
| gentiobiose (β1-6) | 7.82 |

(a) If no wavelength is listed, no misclassifications occurred over the entire range.

The LP analysis of the FEL IRMPD data for the nigerose and laminaribiose pair also revealed that the coefficients of the n-dimensional separating plane for the fifth, sixth, seventh, and eighth vectors are null (see Table S1 in the Supporting Information). This implies that the fragments with m/z 169, 127, 109, and 97 are produced in similar abundances for both disaccharides and do not permit species differentiation. This effect was observed only for the 1-3 linkage pair.

We also examined the classification accuracies for very small data sets.
For the line-tunable CO2 laser, instead of a large data set (200 data points) as described above, a limited data set of only three to five points for each wavelength was independently collected.62 Table S2 in the Supporting Information gives an example of the data collected for kojibiose (α1-2) and sophorose (β1-2). In addition to the small sample size, the data set was also heterogeneous (in the sense that the data points corresponded to different wavelengths). Because of these factors, this data set was rather challenging to analyze, and data mining techniques, including the aforementioned separating surface approach, were not able to provide robust classification results. A relatively high prediction accuracy of approximately 90% was achieved for all four anomeric pairs using decision tree classification algorithms (this classification procedure is described in the Experimental Section). Although these classification results cannot be considered reliable because of the small sample size of the analyzed data sets, this example does demonstrate that, for difficult cases such as this, an appropriate choice of predictive modeling technique is essential. In general, it is strongly advised that an appropriately sized data set be collected for optimal analysis. In other difficult cases, such as those where poor signal quality (due, for example, to low analyte concentration or poor ionization efficiency) leads to a high occurrence of misclassification, successful analyte identification may still be possible by applying alternative predictive modeling techniques that exploit significant differences between data sets.
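To make the decision tree idea concrete, the following is a minimal sketch of a one-level decision tree (a decision stump) that searches for the single fragment-ion threshold giving the fewest training errors. It is not the Tiberius procedure used in this work. The function names, the class labels, and the relative-abundance values are hypothetical and serve only as illustration.

```python
def fit_stump(rows, labels):
    """Fit a one-level decision tree (decision stump) to two-class data.

    rows:   list of equal-length lists of fragment-ion relative abundances
    labels: list of class names containing exactly two distinct values
    Returns a dict describing the best single-ion threshold rule, or None
    if no ion takes more than one distinct value.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("exactly two classes are required")
    best = None
    for ion in range(len(rows[0])):
        values = sorted({row[ion] for row in rows})
        # Candidate thresholds are midpoints between consecutive observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            # Try both assignments of the two classes to the two sides.
            for above, below in (classes, classes[::-1]):
                errors = sum(
                    (above if row[ion] > threshold else below) != label
                    for row, label in zip(rows, labels)
                )
                if best is None or errors < best["errors"]:
                    best = {"ion": ion, "threshold": threshold,
                            "above": above, "below": below, "errors": errors}
    return best


def predict(stump, row):
    """Apply the fitted threshold rule to one spectrum."""
    return stump["above"] if row[stump["ion"]] > stump["threshold"] else stump["below"]


# Hypothetical relative abundances of two fragment ions (illustration only).
rows = [[0.62, 0.30], [0.58, 0.33], [0.65, 0.31],
        [0.41, 0.32], [0.45, 0.29], [0.39, 0.34]]
labels = ["alpha", "alpha", "alpha", "beta", "beta", "beta"]

stump = fit_stump(rows, labels)
print(stump)
print(predict(stump, [0.60, 0.31]))
```

With only three to five points per class, a rule of this kind can fit the training data perfectly and still generalize poorly, which is consistent with the caution expressed above about small data sets.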

■ CONCLUSIONS

Data mining analysis was applied to data obtained for lithiated disaccharides fragmented with line-tunable CO2 and free electron lasers. The regioisomers were found to produce fragments with different m/z, while stereoisomers produced the same fragments but slightly different fragment ion intensity distributions. Differences in fragment ion ratios were used to identify all species examined with 100% classification accuracy by applying data mining approaches. Even though the ion ratios appeared visually to be very similar, data mining techniques were able to provide a mathematical basis for the decision-making process. Analysis of a data set composed of multiple data points obtained at a single wavelength was found to give better classification accuracy than analysis of a data set composed of only a few data points obtained at multiple wavelengths. Moreover, since it is convenient to use a CO2 laser at one fixed wavelength and to obtain all the data sets at this wavelength, the approach described here is promising and efficient in practical applications. In some cases, classification is so straightforward that the optimal differentiation criteria can be used directly in future analyses. Deliberate addition of white noise helped to explore the usefulness of different fragments and to confirm the preferential selection of certain fragments by the algorithm. Because data obtained at wavelengths outside of the optimal range of the disaccharides' IR absorption bands (around and below ∼7.3 μm and around and above ∼10.8 μm) led to misclassifications, it is advised that wavelengths outside of this range not be used for IRMPD experiments on disaccharides, at least for wavelengths that can be accessed with the CO2 laser.
It is worth noting, however, that carbohydrates show significant differential absorption and photodissociation in the gas phase both in the hydroxyl stretch region23,72 (3200–4000 cm⁻¹) and, for monosaccharides in the negative ion mode, in the carbonyl stretch region73 (1650–1750 cm⁻¹). Therefore, other stable sources that could reproducibly interrogate these regions of the infrared spectrum with high enough photon flux could add to the value of this technique overall, with recognition that the probability that two isomers will absorb equally at all wavelengths, and that their dissociation spectra will be identical at all wavelengths, should be vanishingly small. In addition, the technique is applicable not only to the direct MS/MS spectra of disaccharides: variable wavelength spectra of any product ions (MS3 spectra) or products of product ions (MS4 spectra) may also be included as valid attributes in the discriminatory set characteristic of a specific isomer.
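Because the laser tuning ranges in this work are quoted in micrometers while the stretch regions above are quoted in wavenumbers, a short conversion helps to compare them (wavelength in μm = 10^4 / wavenumber in cm⁻¹). The sketch below is only an illustration; the helper name is ours and is not part of the original work.

```python
def wavenumber_to_micrometers(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in micrometers."""
    if wavenumber_cm1 <= 0:
        raise ValueError("wavenumber must be positive")
    # 1 cm = 1e4 um, so wavelength (um) = 1e4 / wavenumber (cm^-1).
    return 1.0e4 / wavenumber_cm1


# Regions quoted in the text, as (low, high) wavenumbers in cm^-1.
regions = {
    "hydroxyl stretch": (3200.0, 4000.0),
    "carbonyl stretch": (1650.0, 1750.0),
}

for name, (low, high) in regions.items():
    # The higher wavenumber corresponds to the shorter wavelength.
    shortest = wavenumber_to_micrometers(high)
    longest = wavenumber_to_micrometers(low)
    print(f"{name}: about {shortest:.1f} to {longest:.1f} um")
```

Both regions lie at roughly 2.5 to 3.1 μm and 5.7 to 6.1 μm, respectively, that is, at shorter wavelengths than the 9.2–10.8 μm range of the CO2 laser, which is why a different source would be needed to reach them.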

dx.doi.org/10.1021/ac2017103 |Anal. Chem. 2011, 83, 8468–8476

This approach should be applicable to differentiation of any isomers using the following stepwise procedure: (1) from the experimental data, eliminate wavelengths with poor S/N; (2) identify single ions with the best discrimination between isomers by using appropriate classification techniques, e.g., separating hyperplanes, which allow these discriminating fragments (features) to be identified; and (3) use two or more ions with the best discrimination between isomers to increase discriminatory capability. The relatively low cost of the line-tunable CO2 laser makes this IRMPD differentiation method accessible to many laboratories. The 100% classification accuracy on our data set for relatively complex isomeric compounds demonstrated here is very encouraging. It appears that this method has the capability to differentiate significantly more complex isomeric species as well as mixtures of isomers. Where isomers in a mixture yield different product ions, this approach should be feasible, as dissociation of product ions should be capable of discriminating individual precursors. However, where isomers in a mixture yield identical product ions, as do the two 1-3 linked disaccharides examined in this work, either a physical separation, as afforded by ion mobility spectrometry prior to dissociation, would be necessary11,12 or a frequency-specific photodissociation would be required, which may be feasible in the hydroxyl stretch region for some isomers.23 Extension of this method to more complex glycans and/or other species is underway in our laboratories.
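The three-step procedure can be sketched in a few lines of code. This is a simplified illustration under stated assumptions, not the LP separating-hyperplane implementation used in this work: step 2 is replaced by a Fisher-like score (difference of class means divided by the summed standard deviations) and step 3 by a nearest-centroid rule. All function names, the S/N threshold, and the abundance values are hypothetical.

```python
from statistics import mean, pstdev


def filter_wavelengths(spectra, min_snr):
    """Step 1: keep only wavelengths whose signal-to-noise ratio is adequate."""
    return {wl: d for wl, d in spectra.items() if d["snr"] >= min_snr}


def rank_ions(rows_a, rows_b):
    """Step 2: rank ion indices from most to least discriminating."""
    scored = []
    for ion in range(len(rows_a[0])):
        col_a = [row[ion] for row in rows_a]
        col_b = [row[ion] for row in rows_b]
        gap = abs(mean(col_a) - mean(col_b))
        spread = pstdev(col_a) + pstdev(col_b)
        if spread > 0:
            score = gap / spread
        else:
            score = float("inf") if gap > 0 else 0.0
        scored.append((score, ion))
    return [ion for _, ion in sorted(scored, reverse=True)]


def centroid_classifier(rows_a, rows_b, ions):
    """Step 3: classify on the selected ions by the nearest class centroid."""
    cen_a = [mean(row[ion] for row in rows_a) for ion in ions]
    cen_b = [mean(row[ion] for row in rows_b) for ion in ions]

    def classify(row):
        d_a = sum((row[ion] - c) ** 2 for ion, c in zip(ions, cen_a))
        d_b = sum((row[ion] - c) ** 2 for ion, c in zip(ions, cen_b))
        return "a" if d_a <= d_b else "b"

    return classify


# Hypothetical data: wavelength (um) -> S/N and relative abundances of three ions.
spectra = {
    7.3: {"snr": 2.0,
          "a": [[0.50, 0.30, 0.20], [0.40, 0.35, 0.25]],
          "b": [[0.45, 0.30, 0.25], [0.52, 0.28, 0.20]]},
    10.6: {"snr": 40.0,
           "a": [[0.62, 0.30, 0.08], [0.58, 0.33, 0.09], [0.65, 0.28, 0.07]],
           "b": [[0.41, 0.31, 0.28], [0.45, 0.30, 0.25], [0.39, 0.32, 0.29]]},
}

kept = filter_wavelengths(spectra, min_snr=10.0)
data = kept[10.6]
ranking = rank_ions(data["a"], data["b"])
classify = centroid_classifier(data["a"], data["b"], ranking[:2])
print(sorted(kept), ranking, classify([0.60, 0.31, 0.10]))
```

Using the two best-ranked ions rather than one mirrors step 3: a second discriminating ion widens the separation between the class centroids in this toy example.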

■ ASSOCIATED CONTENT

Supporting Information. Brief description of the geometric separating surface approach, white noise addition procedure, table with test data set for the FEL fragmentation of nigerose and laminaribiose, table with CO2 laser fragmentation data for kojibiose and sophorose, and table with CO2 laser fragmentation data at 10.611 μm for nigerose and laminaribiose. This material is available free of charge via the Internet at http://pubs.acs.org.

■ AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]fl.edu.

Present Addresses

|| Department of Chemistry, University of Massachusetts, Amherst, MA 01003.
^ Bioinstrumentation and BioMEMS Laboratory, University of California, Davis, CA 95616.

■ ACKNOWLEDGMENT

The authors thank the University of Florida and the National Science Foundation (under Grants CHE-0718007 and OISE-0730072) for funding, Prof. N. C. Polfer for providing data files from FEL experiments that were analyzed by the data mining algorithms and used to draw Figure 2, and Dr. K. R. Williams for assistance with manuscript preparation.

■ REFERENCES

(1) Zaia, J. Mass Spectrom. Rev. 2004, 23, 161–227.
(2) Morelle, W.; Michalski, J. C. Curr. Anal. Chem. 2005, 1, 29–57.
(3) Park, Y.; Lebrilla, C. B. Mass Spectrom. Rev. 2005, 24, 232–264.
(4) Jakab, A.; Nagy, K.; Heberger, K.; Vekey, K.; Forgacs, E. Rapid Commun. Mass Spectrom. 2002, 16, 2291–2297.
(5) Murphy, R. C.; James, P. F.; McAnoy, A. M.; Krank, J.; Duchoslav, E.; Barkley, R. M. Anal. Biochem. 2007, 366, 59–70.
(6) Blanksby, S. J.; Mitchell, T. W. Ann. Rev. Anal. Chem. 2010, 3, 433–465.
(7) Moeder, M.; Martin, C.; Harynuk, J.; Gorecki, T.; Vinken, R.; Corvini, P. F. X. J. Chromatog., A 2006, 1102, 245–255.
(8) Marshall, A. G.; Rodgers, R. P. Proc. Natl. Acad. Sci. U.S.A. 2008, 105, 18090–18095.
(9) Murphy, R. C.; Fitzgerald, M.; Barkley, R. M. Metabolomics, Metabonomics Metabolite Profiling 2008, 161–194.
(10) Fang, T. T.; Bendiak, B. J. Am. Chem. Soc. 2007, 129, 9721–9736.
(11) Bohrer, B. C.; Merenbloom, S. I.; Koeniger, S. L.; Hilderbrand, A. E.; Clemmer, D. E. Ann. Rev. Anal. Chem. 2008, 1, 293–327.
(12) Kanu, A. B.; Dwivedi, P.; Tam, M.; Matz, L.; Hill, H. H. J. Mass Spectrom. 2008, 43, 1–22.
(13) Gao, H.; Petzold, C. J.; Leavell, M. D.; Leary, J. A. J. Am. Soc. Mass Spectrom. 2003, 14, 916–924.
(14) Hyyrylainen, A. R. M.; Pakarinen, J. M. H.; Forro, E.; Fulop, F.; Vainiotalo, P. J. Am. Soc. Mass Spectrom. 2009, 20, 1235–1241.
(15) Li, Y. H.; Herman, J. A.; Harrison, A. G. Can. J. Chem. 1981, 59, 1753–1759.
(16) Keough, T.; Mihelich, E. D.; Eickhoff, D. J. Anal. Chem. 1984, 56, 1849–1852.
(17) Pfluger, D.; Motylewski, T.; Linnartz, H.; Sinclair, W. E.; Maier, J. P. Chem. Phys. Lett. 2000, 329, 29–35.
(18) Cage, B.; Friedrich, J.; Little, R. B.; Wang, Y.-S.; McFarland, M. A.; Hendrickson, C. L.; Dalal, N.; Marshall, A. G. Chem. Phys. Lett. 2004, 394, 188–193.
(19) Woodin, R. L.; Bomse, D. S.; Beauchamp, J. L. J. Am. Chem. Soc. 1978, 100, 3248–3250.
(20) Bomse, D. S.; Berman, D. W.; Beauchamp, J. L. J. Am. Chem. Soc. 1981, 103, 3967–3971.
(21) Baykut, G.; Watson, C. H.; Weller, R. R.; Eyler, J. R. J. Am. Chem. Soc. 1985, 107, 8036–8042.
(22) Polfer, N. C.; Valle, J. J.; Moore, D. T.; Oomens, J.; Eyler, J. R.; Bendiak, B. Anal. Chem. 2006, 78, 670–679.
(23) Eyler, J. R. Mass Spectrom. Rev. 2009, 28, 448–467.
(24) Dunbar, R. C.; Fu, E. W. J. Am. Chem. Soc. 1973, 95, 2716–2718.
(25) Freiser, B. S.; Beauchamp, J. L. J. Am. Chem. Soc. 1974, 96, 6260–6266.
(26) Wight, C. A.; Beauchamp, J. L. Chem. Phys. Lett. 1981, 77, 30–35.
(27) Watson, C. H.; Zimmerman, J. A.; Bruce, J. E.; Eyler, J. R. J. Phys. Chem. 1991, 95, 6081–6086.
(28) Dela Cruz, J. M.; Lozovoy, V. V.; Dantus, M. J. Mass Spectrom. 2007, 42, 178–186.
(29) Pastirk, I.; Zhu, X.; Lozovoy, V. V.; Dantus, M. Appl. Opt. 2007, 46, 4041–4045.
(30) Rizzo, T. R.; Stearns, J. A.; Boyarkin, O. V. Int. Rev. Phys. Chem. 2009, 28, 481–515.
(31) Wang, X. B.; Woo, H. K.; Wang, L. S. J. Chem. Phys. 2005, DOI: 10.1063/1.1998787.
(32) Bradley, P. S.; Fayyad, U. M.; Mangasarian, O. L. INFORMS J. Comput. 1999, 11, 217–238.
(33) Tan, P.-N.; Steinbach, M.; Kumar, V. Introduction to Data Mining, 1st ed.; Pearson Addison Wesley: Boston, MA, 2006.
(34) Iversen, G. R.; Norpoth, H. Analysis of Variance, 2nd ed.; Sage Publications: Newbury Park, CA, 1987.
(35) Turner, J. R.; Thayer, J. F. Introduction to Analysis of Variance: Design, Analysis, & Interpretation; Sage Publications: Thousand Oaks, CA, 2001.
(36) Stevens, J. Applied Multivariate Statistics for the Social Sciences, 4th ed.; Lawrence Erlbaum Associates: Mahwah, NJ, 2002.
(37) Simoes, J.; Domingues, P.; Reis, A.; Nunes, F. M.; Coimbra, M. A.; Domingues, R. M. Anal. Chem. 2007, 79, 5896–5905.
(38) Davis, J. E.; Shepard, A.; Stanford, N.; Rogers, L. B. Anal. Chem. 1974, 46, 821–825.
(39) Varmuza, K.; Werther, W.; Henneberg, D.; Weimann, B. Rapid Commun. Mass Spectrom. 1990, 4, 159–162.
(40) Mantini, D.; Petrucci, F.; Del Boccio, P.; Pieragostino, D.; Di Nicola, M.; Lugaresi, A.; Federici, G.; Sacchetta, P.; Di Ilio, C.; Urbani, A. Bioinformatics 2008, 24, 63–70.
(41) Lu, B. W.; Ruse, C. I.; Yates, J. R. J. Proteome Res. 2008, 7, 3628–3634.
(42) Wan, C. H.; Harrington, P. D. J. Chem. Inf. Comput. Sci. 1999, 39, 1049–1056.
(43) Madhusudanan, K. P.; Srivastava, M. J. Mass Spectrom. 2008, 43, 126–131.
(44) Turner, P. G.; Taylor, S.; Goulermas, J. Y.; Hampton, K. Rapid Commun. Mass Spectrom. 2007, 21, 305–313.
(45) Mihaleva, V. V.; Verhoeven, H. A.; de Vos, R. C. H.; Hall, R. D.; van Ham, R. C. H. J. Bioinformatics 2009, 25, 787–794.
(46) Zou, W.; Tolstikov, V. V. Algorithms 2009, 2, 638–666.
(47) Alsberg, B. K.; Kell, D. B.; Goodacre, R. Anal. Chem. 1998, 70, 4126–4133.
(48) Du, X. X.; Yang, F.; Manes, N. P.; Stenoien, D. L.; Monroe, M. E.; Adkins, J. N.; States, D. J.; Purvine, S. O.; Camp, D. G.; Smith, R. D. J. Proteome Res. 2008, 7, 2195–2203.
(49) Ding, Y.; Choi, H.; Nesvizhskii, A. I. J. Proteome Res. 2008, 7, 4878–4889.
(50) Markey, M. K.; Tourassi, G. D.; Floyd, C. E. Proteomics 2003, 3, 1678–1679.
(51) Granitto, P. M.; Furlanello, C.; Biasioli, F.; Gasperi, F. Chemom. Intell. Lab. Syst. 2006, 83, 83–90.
(52) Hofmeister, G. E.; Zhou, Z.; Leary, J. A. J. Am. Chem. Soc. 1991, 113, 5964–5970.
(53) Dongre, A. R.; Wysocki, V. H. Org. Mass Spectrom. 1994, 29, 700–702.
(54) Asam, M. R.; Glish, G. L. J. Am. Soc. Mass Spectrom. 1997, 8, 987–995.
(55) Garozzo, D.; Giuffrida, M.; Impallomeni, G.; Ballistreri, A.; Montaudo, G. Anal. Chem. 1990, 62, 279–286.
(56) Dallinga, J. W.; Heerma, W. Biol. Mass Spectrom. 1991, 20, 215–231.
(57) Ngoka, L. C.; Gal, J. F.; Lebrilla, C. B. Anal. Chem. 1994, 66, 692–698.
(58) Cancilla, M. T.; Penn, S. G.; Carroll, J. A.; Lebrilla, C. B. J. Am. Chem. Soc. 1996, 118, 6736–6745.
(59) Xie, Y.; Lebrilla, C. B. Anal. Chem. 2003, 75, 1590–1598.
(60) Cai, Y.; Concha, M. C.; Murray, J. S.; Cole, R. B. J. Am. Soc. Mass Spectrom. 2002, 13, 1360–1369.
(61) Guan, B.; Cole, R. B. J. Am. Soc. Mass Spectrom. 2008, 19, 1119–1131.
(62) Stefan, S. E.; Eyler, J. R. Int. J. Mass Spectrom. 2010, 297, 96–101.
(63) Ikonomou, M. G.; Kebarle, P. J. Am. Soc. Mass Spectrom. 1994, 5, 791–799.
(64) Wigger, M.; Nawrocki, J. P.; Watson, C. H.; Eyler, J. R.; Benner, S. A. Rapid Commun. Mass Spectrom. 1997, 11, 1749–1752.
(65) Winger, B. E.; Hofstadler, S. A.; Bruce, J. E.; Udseth, H. R.; Smith, R. D. J. Am. Soc. Mass Spectrom. 1993, 4, 566–577.
(66) Wu, S.; Zhang, K.; Kaiser, N. K.; Bruce, J. E. J. Am. Soc. Mass Spectrom. 2006, 17, 772–779.
(67) Caravatti, P.; Allemann, M. Org. Mass Spectrom. 1991, 26, 514–518.
(68) XPress-MP Solver; FICO Corp.: Leamington Spa, U.K., 2008.
(69) Tiberius Predictive Modeling Software; Tiberius Data Mining: Melbourne, Australia, 2009.
(70) Oepts, D.; Vandermeer, A. F. G.; Vanamersfoort, P. W. Infrared Phys. Technol. 1995, 36, 297–308.
(71) Stefan, S. E.; Eyler, J. R. Anal. Chem. 2009, 81, 1224–1227.
(72) Simons, J. P.; Jockusch, R. A.; Carcabal, P.; Huenig, I.; Kroemer, R. T.; Macleod, N. A.; Snoek, L. C. Int. Rev. Phys. Chem. 2005, 24, 489–531.
(73) Brown, D. J.; Stefan, S. E.; Berden, G.; Steill, J. D.; Oomens, J.; Eyler, J. R.; Bendiak, B. Carbohydr. Res. 2011, 346, 2469–2481.
