Carbon-13 nuclear magnetic resonance spectrum simulation


Anal. Chem. 1987, 59, 1805-1811


Carbon-13 Nuclear Magnetic Resonance Spectrum Simulation Methodology for the Structure Elucidation of Carbohydrates

Malcolm K. McIntyre and Gary W. Small*

Department of Chemistry, The University of Iowa, Iowa City, Iowa 52242

Computer-based procedures are developed for simulating the 13C NMR spectra of carbohydrates. By use of data from five sources, models are derived that relate observed chemical shifts to numerical parameters encoding aspects of the chemical environments of the corresponding carbons. Molecular mechanics techniques are used to compute parameters encoding the effects of multiple oxygen atoms on the carbon atom environments. A calibration procedure is introduced for adjusting experimental spectra to the computed models, thereby allowing valid comparisons to be made between the spectrum of an unknown and the simulated spectra of possible candidate structures. The derived models are tested by simulating the spectra of 15 compounds not included in the modeling study. Experimental spectra are used to evaluate the simulations.

0003-2700/87/0359-1805$01.50/0 © 1987 American Chemical Society

ANALYTICAL CHEMISTRY, VOL. 59, NO. 14, JULY 15, 1987

Carbohydrates represent a class of organic compounds of important biological significance. As an example, one carbohydrate subclass, the aminoglycosides (e.g., streptomycin, gentamicin), is an important family of antibiotics. The world market for human clinical use of aminoglycoside antibiotics was estimated at 525 million dollars in 1978 (1). Much work is devoted to isolating aminoglycosides, as well as oligo- and polysaccharides, from natural sources, and to modifying known compounds in an effort to alter their biological properties. The determination or verification of an exact chemical structure is an essential part of each type of study.

Carbon-13 nuclear magnetic resonance spectroscopy (13C NMR) is an important tool in the structural investigation of carbohydrates. The determination of an exact structure is often difficult, however, as many carbohydrates differ only by stereochemistry, resulting in highly similar 13C NMR spectra. As an example, consider the structures depicted in Figure 1. β-D-Allose (A) and α-D-mannose (B) are topologically identical, differing only in the geometrical orientations of three hydroxyl groups. Their corresponding broad-band decoupled 13C NMR spectra, depicted in Figure 2, are numerically distinct but visually highly similar. Viewed from a numerical standpoint, the spectra must contain the information needed to discriminate between the two structures. From a visual standpoint, however, the spectra are too similar to allow a reliable assignment of the structures.

Figure 1. Chemical structures for β-D-allose (A) and α-D-mannose (B). The rightmost carbon in each structure is designated atom 1.

Figure 2. Experimentally observed broad-band decoupled 13C NMR spectra for β-D-allose and α-D-mannose. Both spectra were drawn from the work of Reuben (3).

The above example illustrates the potential value of numerically based spectral interpretation aids in carbohydrate structure elucidation studies. This paper describes the development and testing of computer-based 13C NMR spectrum simulation methodology for basic carbohydrate structural units. 13C NMR chemical shifts in a set of 35 monosaccharides were modeled, giving rise to a carbohydrate spectral prediction system. The developed methodology provides essential components of a semiautomatic structural investigation system that shows significant promise for application to actual structure elucidation studies of carbohydrates.

EXPERIMENTAL SECTION

The 13C NMR spectra used in this research were assembled from five literature sources as well as from data collected locally.

Twelve spectra were taken from a collection published by Pfeffer and co-workers (2). In the Pfeffer work, broad-band decoupled spectra were measured at 30 °C on a JEOL FX 60-Q NMR spectrometer operating at 15.04 MHz. Spectra were recorded in H2O (100 mg/mL), and chemical shifts were measured relative to internal p-dioxane. For reporting purposes, the authors adjusted the chemical shifts mathematically to a tetramethylsilane (Me4Si) reference by use of a 67.4 ppm offset.

Nine spectra were taken from Reuben (3). Compounds were prepared as 10% (w/v) solutions in Me2SO-d6. Spectra were recorded at 90.56 MHz and 24 °C on a Nicolet 360WB NMR spectrometer. The solvent was used as an internal reference, with the shifts adjusted to sodium 3-trimethylsilylpropionate-2,2,3,3-d4 (TSP) by use of a 41.105 ppm offset. The Me4Si and TSP reference lines are identical within experimental error.

Two spectra from Dorman and Roberts (4) were used. Compounds were dissolved in H2O, and spectra were recorded on a noncommercial instrument. Chemical shifts were measured relative to p-dioxane. For reporting purposes, the shifts were converted to external CS2 by use of a 126.1 ppm offset. No sample concentrations or temperatures were given.

Six spectra were derived from Perlin (5). Compounds were dissolved in H2O (0.5-0.6 g/mL), and spectra were collected at 25.15 MHz on a Varian HA-100 NMR spectrometer. A sample temperature of 55 °C was reported. Chemical shifts were measured relative to external 13CH3I but reported relative to CS2. A 213.1 ppm offset was used to convert the shifts to CS2.

Six spectra were taken from Bock and Pedersen (6). Data were collected at 30 °C and 22.63 MHz on a Bruker WH-90 NMR spectrometer. Samples were prepared as 20% solutions in D2O. Chemical shifts were measured relative to internal p-dioxane and adjusted to a Me4Si reference by use of a 67.4 ppm offset.

Twenty-three spectra were collected locally, 8 for comparison with literature data and 15 for use as test spectra for the developed methodology. Samples of the corresponding compounds were purchased from Sigma Chemical Co., St. Louis, MO, and used without purification.
For data collection, the compounds were dissolved in D2O (approximately 1.0 F). For structures that mutarotate, spectra of both anomers were observed. Broad-band decoupled spectra were recorded at 25 °C and 90.56 MHz on a Bruker WM-360 superconducting magnet NMR spectrometer operating in The University of Iowa High-Field NMR Facility. The free induction decay size was 32K, and a 5-mm C/H probe was used. Chemical shifts were measured and reported relative to internal sodium 2,2-dimethyl-2-silapentane-5-sulfonate (DSS). The DSS reference line is coincident with those of Me4Si and TSP.

All computer software used in this work was written in FORTRAN 77 and implemented on a PRIME 9955 interactive computer system operating in the Gerard P. Weeg Computing Center at The University of Iowa. The MINITAB statistical software system (Minitab, Inc., State College, PA) was used for simple linear regressions performed in the chemical shift calibration study. Plots were generated by use of the TELLAGRAF interactive graphics system (Integrated Software Systems Corp., San Diego, CA) and with original software. A Hewlett-Packard 7475A digital plotter was used as the output device.

RESULTS AND DISCUSSION

Overview of Spectrum Simulation. The natural progression of a structure elucidation study involves a cycle of postulating possible candidate structures for an unknown, followed by the use of analytical data to confirm or reject the validity of the candidates. The motivation behind spectrum simulation methods is to provide capabilities for predicting the spectra of candidate structures in cases in which no experimental spectra are available. For the case of 13C NMR spectra, simulation procedures allow highly detailed structural questions to be investigated (e.g., the specific orientation of a ring substituent) without the necessity for obtaining pure samples of test compounds.
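As a rough illustration of the confirm-or-reject cycle described above, the sketch below ranks candidate structures by how closely their simulated spectra match an observed spectrum. The RMS criterion, the pairing of peaks by sorted order, the function names, and all shift values are assumptions made for illustration only; none of them is taken from this work.

```python
import math


def rms_difference(observed, simulated):
    """RMS difference (ppm) between two spectra given as lists of shifts.

    Assumption: both spectra contain the same number of resonances, and
    peaks are paired by sorted order rather than by explicit assignment.
    """
    if len(observed) != len(simulated):
        raise ValueError("spectra must contain the same number of shifts")
    pairs = zip(sorted(observed), sorted(simulated))
    return math.sqrt(sum((o - s) ** 2 for o, s in pairs) / len(observed))


def rank_candidates(observed, candidates):
    """Return (name, rms) tuples sorted from best to worst match."""
    scored = [(name, rms_difference(observed, shifts))
              for name, shifts in candidates.items()]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical shifts (ppm), invented for this example.
unknown = [94.0, 72.0, 71.5, 67.0, 74.0, 62.0]
candidates = {
    "candidate_A": [94.2, 72.1, 71.3, 67.4, 74.1, 61.9],
    "candidate_B": [95.0, 71.0, 73.0, 68.0, 73.0, 62.5],
}
ranking = rank_candidates(unknown, candidates)
print(ranking[0][0])  # name of the closest candidate
```

A comparison of this kind is only meaningful when the observed and simulated shifts are on the same chemical shift scale.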
The simulation method employed here is based on the development of linear models that relate structural parameters to 13C NMR chemical shifts. These models have the form

$$S_j = b_0 + \sum_{i=1}^{p} b_i X_i \tag{1}$$

where $S_j$ is the predicted chemical shift for a given carbon atom, $j$; the $X_i$ are numerical parameters that encode aspects of the chemical environment of $j$ that influence its chemical shift; and the $b_i$ are coefficients (with $b_0$ the intercept and $p$ the number of parameters) determined through a multiple linear regression analysis of a set of known and unambiguously assigned chemical shifts. Once a given model has been developed, it becomes available for use in predicting chemical shifts of atoms whose chemical environments are similar to those upon which the model is based. A complete simulated spectrum is assembled through the separate predictions of the chemical shifts of each carbon in a structure giving rise to a distinct resonance. Depending on the structure, the use of several models may be necessary to obtain a complete spectrum.

The key to the development of models accurate enough to allow detailed structural postulates to be investigated is the successful encoding of chemical environments into descriptive numerical parameters. Computer-based structural parameters offer the most flexibility, as they allow sophisticated encodings of steric and electronic effects. The efficient use of these parameters, however, requires computer-based structure handling, as well as a variety of other data processing utilities. These capabilities have been developed as a set of interactive software tools by Small and Jurs (7). This software was used in the development of the spectrum simulation capabilities reported in this paper.

Assembly of 13C NMR Data for Modeling and Testing. As noted above, the development of chemical shift models is tied to the availability of spectral data that are representative of the chemical system being modeled. The desire for strong and reliable models necessitates the inclusion of as many spectra as possible in the modeling process. This often means that data must be drawn from a variety of sources. The use of multisource data is a particular problem with carbohydrates, as their 13C NMR chemical shifts exhibit great sensitivity to changes in experimental conditions. Spectra are often recorded in different solvents, at different temperatures, and against different reference compounds. This problem is compounded by the fact that literature spectral data for carbohydrates span approximately 20 years of instrumental changes and developments. Structural differences (and corresponding differences in chemical shifts) among the compounds are small enough that variations in the experimental data can effectively prevent the construction of chemical shift models that are accurate enough to be useful.

A problem of equal importance is the difficulty in assigning chemical shifts to specific carbon atoms. Chemical shift differences among the ring carbons are quite small, historically leading to numerous discrepancies in the literature regarding shift assignments. Recent experimental advances based on deuterium isotope effects (2, 3) have greatly aided the solution of this problem, however.

A principal goal in this work was to develop procedures that would allow multisource data to be used in modeling carbohydrate chemical shifts. Toward that end, we have employed data from five literature sources in the modeling phase of the study, with data collected locally being used subsequently to test the accuracy of the computed models. The literature data were collected over a span of 14 years, encompassing a range of solvents and reference compounds.

Our approach is based on the selection of a set of standard experimental conditions for a modeling study and subsequent calibration of all data to those experimental conditions. When literature data from multiple sources are used, this procedure is implemented by selecting one data source as the standard and calibrating the other sources to this standard. Successful calibration requires that there be a set of duplicate spectra collected under both sets of conditions. Examination of the monosaccharide data revealed a sufficient number of duplicate spectra for the calibration procedure to be employed. The calibration model chosen was a simple two-parameter linear fit of the form

$$S_{\mathrm{std}} = C_0 + C_1 S_{\mathrm{cal}} \tag{2}$$

where $C_0$ and $C_1$ are coefficients that relate a chemical shift to be calibrated, $S_{\mathrm{cal}}$, to the corresponding chemical shift collected under the standard conditions, $S_{\mathrm{std}}$. Simple linear regression was used to derive $C_0$ and $C_1$ for each set of literature data.

For each pairwise comparison of data sources, all available duplicate spectra were selected and merged. Ten discrepancies in shift assignments were noted. In each case, the discrepancy could be resolved by use of the Pfeffer or Reuben data, in which deuterium isotope effects were used to make assignments. Correlation coefficients were computed between each pairwise combination of data. Inspection of the correlation coefficients revealed that the Pfeffer data had the highest overall correlation with the other data sources. The Pfeffer spectra were thus chosen as the standard, and the calibration procedure was applied to each of the other data sources. The results of this calibration study are presented in Table I. It is evident from an inspection of the table that a linear model is highly appropriate in each case and that each calibration was performed with great accuracy.

Table I. Results of Chemical Shift Calibration to Pfeffer Data

source    n(a)    C0       C1        R2(b)    s(c)      t(d)
Bock      20      0.267    0.997     99.99    0.0752    640
Dorman    58      192      -0.991    99.72    0.603     -142
Perlin    62      193      -0.999    99.96    0.255     -366
Reuben    37      0.210    0.982     99.85    0.450     154
local     44      -1.52    0.999     99.99    0.0425    1633

(a) Number of chemical shifts used in calibration. (b) Percent of variance explained by calibration model. (c) Standard error of estimate in chemical shift units (ppm). (d) t value for significance of C1 (absolute value ≥ 4.0 considered strong).
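The calibration step can be sketched as an ordinary least-squares fit of the two-parameter linear form described above. This is a minimal sketch, not the MINITAB procedure used in the study: it assumes the standard-condition shift is regressed on the shift to be calibrated (S_std = C0 + C1 * S_cal), the function names are hypothetical, and the shift values are invented so that the expected answer (C0 = 193, C1 = -1) is known in advance.

```python
def fit_calibration(s_cal, s_std):
    """Least-squares fit of s_std = c0 + c1 * s_cal; returns (c0, c1)."""
    n = len(s_cal)
    if n != len(s_std) or n < 2:
        raise ValueError("need at least two paired shifts")
    mean_x = sum(s_cal) / n
    mean_y = sum(s_std) / n
    sxx = sum((x - mean_x) ** 2 for x in s_cal)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(s_cal, s_std))
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1


def calibrate(shifts, c0, c1):
    """Map shifts from the source's scale onto the standard scale."""
    return [c0 + c1 * s for s in shifts]


# Hypothetical duplicate shifts (ppm). The "cal" values are constructed
# as 193.0 minus the standard value, mimicking a scale that runs in the
# opposite direction from a reference line 193 ppm away.
std_shifts = [96.8, 75.1, 73.7, 70.4, 61.6]
cal_shifts = [193.0 - s for s in std_shifts]

c0, c1 = fit_calibration(cal_shifts, std_shifts)
recovered = calibrate(cal_shifts, c0, c1)
print(round(c0, 3), round(c1, 3))
```

With real duplicate spectra the fit would not be exact, and the residual scatter would correspond to the standard error of estimate reported for each source.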
The statistics describing each model indicate that the effects of experimental parameters such as solvent or choice of reference have been largely removed by the calibration procedure. The derived calibration coefficients are physically meaningful. For example, $C_0$ is computed as approximately 193 ppm for both the Dorman and Perlin calibrations. These data are reported relative to CS2, while the Pfeffer data are reported relative to Me4Si. The chemical shift of CS2 relative to Me4Si is typically reported as approximately 193 ppm.

Given that each spectrum can be calibrated to the same chemical shift scale, a remaining problem was the decision as to which of the duplicate spectra to use in the modeling study. Of the 35 compounds represented among the five data sources, 24 were available in two or more sources. The Pfeffer data were used when available, followed by the Reuben, Bock, Perlin, and Dorman data in decreasing order of priority. The Reuben spectra were given high priority, as they were collected most recently. The other sources were ranked by their quality of fit to the Pfeffer data. By way of example, the Dorman spectra were used only where necessary, as they exhibited the worst fit in the calibration. Based on these criteria, twelve spectra were derived from Pfeffer, nine from Reuben, six from Bock, six from Perlin, and two from Dorman. Table II lists the compounds used, along with the corresponding data source for each.

Table II. Compounds Used in Modeling and Testing

no.   compound name                     ref(a)

Modeling
1     α-D-allose                        D
2     β-D-allose                        R
3     methyl α-D-altropyranoside        P
4     α-D-arabinose                     Pf
5     β-D-arabinose                     Pf
6     methyl α-D-arabinopyranoside      B
7     methyl β-D-arabinopyranoside      R
8     α-L-arabinose                     R
9     β-L-arabinose                     R
10    α-D-galactose                     Pf
11    β-D-galactose                     Pf
12    methyl α-D-galactopyranoside      Pf
13    methyl β-D-galactopyranoside      Pf
14    α-D-glucose                       Pf
15    β-D-glucose                       Pf
16    methyl α-D-glucopyranoside        Pf
17    methyl β-D-glucopyranoside        Pf
18    methyl α-D-idopyranoside          P
19    α-D-lyxose                        R
20    β-D-lyxose                        P
21    methyl α-D-lyxopyranoside         B
22    α-D-mannose                       R
23    β-D-mannose                       P
24    methyl α-D-mannopyranoside        R
25    methyl β-D-mannopyranoside        P
26    α-L-rhamnose                      R
27    β-L-rhamnose                      D
28    α-D-ribose                        P
29    β-D-ribose                        R
30    methyl α-D-ribopyranoside         B
31    methyl β-D-ribopyranoside         B
32    α-D-xylose                        Pf
33    β-D-xylose                        Pf
34    methyl α-D-xylopyranoside         B
35    methyl β-D-xylopyranoside         B

Testing
36    α-D-altrose                       L
37    β-D-altrose                       L
38    methyl β-L-arabinopyranoside      L
39    α-D-fucose                        L
40    β-D-fucose                        L
41    α-L-galactose                     L
42    β-L-galactose                     L
43    α-L-glucose                       L
44    β-L-glucose                       L
45    α-L-mannose                       L
46    β-L-mannose                       L
47    α-L-ribose                        L
48    β-L-ribose                        L
49    α-L-xylose                        L
50    β-L-xylose                        L

(a) Key: B, Bock; D, Dorman; L, local; P, Perlin; Pf, Pfeffer; R, Reuben.

Thus, models computed based on the calibrated data will be keyed to the experimental conditions and reporting procedure used by Pfeffer. When these models are used for prediction purposes, the experimental conditions employed by a user must also be calibrated to the Pfeffer data if simulated spectra obtained from the application of the models are to be useful for comparison with a spectrum of an unknown. Our local conditions were calibrated by collecting eight spectra (α and β anomers of D-arabinose, D-galactose, D-glucose, and D-xylose) that are duplicated in the Pfeffer data and computing the corresponding $C_0$ and $C_1$. The results of this procedure are also given in Table I, with the calibration again judged to be highly successful.

Assembly of Chemical Structural Information. The 35 structures used in modeling, plus the 15 structures used in testing, were entered into computer disk files via a graphical procedure developed by Brugger and Jurs (8). The fact that structural differences among the monosaccharides are largely geometrical differences (i.e., differences in substituent orientations) necessitates the use of three-dimensional modeling techniques to approximate the geometries of the compounds. The techniques of molecular mechanics represent the most practical means for estimating geometries. These procedures provide approximate three-dimensional coordinates for the atoms in a modeled structure. An interactive molecular mechanics procedure described by Stuper et al. (9) was used to convert the input two-dimensional structures into basic three-dimensional form. Final modeling was performed by use of the MM2 program of Allinger (10).

An important consideration in modeling the monosaccharide structures is the fact that each compound can exist in two possible chair conformations. In most cases, one of these conformations is significantly preferred over the other on energetic grounds. The structure exists in this conformation a majority of the time, dictating that the observed 13C NMR spectrum derives largely from the chemical environmental effects felt within this conformation. Successful modeling of structure vs. spectrum is clearly keyed to determining the most accurate structural conformation. Several workers have addressed this problem from a theoretical perspective (11-13), but tabulated conformational preferences are not available for all compounds that might be encountered. For example, no tabulated preferences were found for the 15 test compounds. We have implemented a semiautomated procedure for this determination that has proven quite workable.
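One way such a determination could be sketched, assuming each chair form has already been assigned an overall strain energy by a molecular mechanics program, is to take the form with the lower energy as the preferred conformation. The function name, the chair labels (4C1 and 1C4, the usual pyranose notation), and the energy values below are hypothetical placeholders, not results from this work.

```python
def preferred_chair(strain_energies):
    """Return the label of the chair form with the lowest strain energy.

    strain_energies maps a conformation label to its strain energy.
    Units only need to be consistent between the two forms.
    """
    if not strain_energies:
        raise ValueError("no conformations supplied")
    return min(strain_energies, key=strain_energies.get)


# Hypothetical strain energies for the two chair forms of one compound.
energies = {"4C1": 12.3, "1C4": 15.8}
best = preferred_chair(energies)
print(best)
```

A fuller treatment might also flag cases in which the two energies are nearly equal, since a single dominant conformation is then a poorer approximation.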
The interactive modeler described above was used to create both chair forms for any compound whose conformational preference was unavailable from tabulated theoretical treatments. This is accomplished easily by switching the substituent orientations. MM2 was then used to model both chair forms. One diagnostic provided by molecular mechanics is the overall strain energy for a structure. These strain energies were compared for each pair of chair forms, with the lower energy judged to correspond to the more energetically favorable conformation. This procedure was tested by randomly selecting 10 monosaccharides whose conformational preferences are well established theoretically. The MM2-based approach clearly selected the proper conformation in each case. These results were deemed quite favorable, and conformational preferences for the 15 test compounds were subsequently determined by implementation of the MM2 procedure.

Calculation of Structural Parameters for Modeling. Chemical shifts in the monosaccharides are influenced largely by the presence of an ether oxygen in the ring system and multiple hydroxyls bonded either axially or equatorially to the ring system. The successful characterization of these oxygen effects was considered essential in modeling the monosaccharide chemical shifts.

In previous studies of this type (14, 15), numerous structural parameters have been devised based on the three-dimensional coordinates returned from molecular mechanics calculations. Typically, the coordinates have been used to compute interatomic distances, thereby encoding the surrounding chemical environment of carbon atoms. In addition to these distance-based parameters, topological counts of atom types, etc., have proven useful, as well as simple quantum mechanical treatments of inductive effects (16). The chemical systems treated in these studies were fundamentally simpler, however, in that no cases of multiple heteroatoms were encountered.
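A distance-based parameter of the kind discussed above can be sketched as a sum of inverse-cube distances computed from three-dimensional coordinates. The sketch assumes the relevant neighbor points (for example, atoms or lone pairs a chosen number of bonds away) have already been selected; that selection step, the function name, and the coordinates are hypothetical and are not taken from the software used in this work.

```python
import math


def inverse_cube_sum(origin, points):
    """Sum of 1/d**3 over the distances from origin to each point.

    origin and each point are (x, y, z) coordinates in the same units.
    """
    total = 0.0
    for point in points:
        d = math.dist(origin, point)
        if d == 0.0:
            raise ValueError("a point coincides with the origin")
        total += 1.0 / d ** 3
    return total


# Hypothetical coordinates: one neighbor at distance 2 and one at
# distance 4 from the atom of interest.
atom = (0.0, 0.0, 0.0)
neighbors = [(2.0, 0.0, 0.0), (0.0, 0.0, 4.0)]
value = inverse_cube_sum(atom, neighbors)
print(value)  # 1/8 + 1/64
```

Because the terms fall off as the cube of the distance, nearby neighbors dominate the sum, which is why changes in substituent orientation can shift the parameter value noticeably.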

It was hypothesized that many of the same parameters would be useful in modeling the monosaccharides but that the existing parameters would have to be augmented with information regarding the effects of multiple oxygen atoms. Numerous attempts were made to encode these oxygen effects. Two schemes proved effective, and both found use in the modeling work. Each is described below.

In the three-dimensional modeling of the monosaccharide structures, MM2 was used to obtain approximate coordinates for the oxygen electron lone pairs. These coordinates were used in computing a family of four structural parameters based on through-space interactions between the lone pairs and both the carbon being modeled and its attached hydrogen(s). The carbon-lone pair and hydrogen-lone pair interactions were encoded in both distance-based parameters and energy-based parameters. The calculation of these parameters is detailed in Table III.

Table III. Computation of Parameters Encoding Effects of Oxygen Lone Pairs

parameter   calculation
CLPD        Sum of 1/distance^3 from the carbon being modeled to lone pairs associated with oxygens i bonds(a) away
HLPD        Sum of 1/distance^3 from the α hydrogen(b) to lone pairs associated with oxygens i bonds away
CLPE        van der Waals energy(c) describing interactions between the carbon being modeled and all lone pairs associated with oxygens in the molecule
HLPE        van der Waals energy describing interactions between the α hydrogen and all lone pairs associated with oxygens in the molecule

(a) i is specified by the user, allowing several parameters to be computed, each detailing interactions at different distances from the carbon being modeled. (b) The α hydrogen is the hydrogen atom attached to the carbon whose chemical shift is being modeled. (c) van der Waals energy computed from the potential function used by Allinger in the MM2 program (10).

In addition, an analogous family of parameters was developed based on the modeled oxygen coordinates. The same computations described in Table III were used, substituting the oxygen coordinates for the lone pair coordinates. In many cases, the corresponding oxygen and lone pair parameters were highly correlated and effectively interchangeable. In other cases, however, one was found markedly superior to the other. Together, they appear to provide a powerful set of parameters for use in encoding the effects of oxygen atoms on carbon environments.

Table IV. Summary of Model Statistics

group   n(a)   p(b)   R(c)     s(d)     F(e)    s_jack(f)
I       35     6      0.995    0.430    485     0.576
II      88     4      0.926    0.850    125     0.903
III     35     6      0.959    0.482    54.0    0.580
IV      32     4      0.947    0.590    59.0    0.778
V       17     2      0.999    0.625    3217    0.778

(a) Number of chemical shifts used to define model. (b) Number of parameters in computed model. (c) Correlation coefficient. (d) Standard error of estimate in chemical shift units (ppm). (e) F value for significance of the model. (f) Standard error from jackknifing procedure.

Calculation of Chemical Shift Models. Definition of Atom Groups. When stereoisomers are modeled, the great similarity of the structures demands that models of extremely high accuracy be formed.
We judge accurate models to be those that achieve standard prediction errors of less than 1.0 ppm. This necessitates a subdivision of the data set into groups of atoms of similar structure. In modeling, the determination of these groupings is often a trial-and-error procedure. One would like to derive models that are as global as possible while, at the same time, maintaining a high prediction accuracy. For the monosaccharide study, five atom groupings were required to achieve the desired prediction accuracy. Separate models were generated for each atom group. These atom subsets were defined as follows: group I, atom 1; group II, atoms 2, 3, and those atom 5's with substituted hydroxyls; group III, atom 4; group IV, unsubstituted atom 5's and -CH2OH carbons; group V, methyl carbons in rhamnose and -OCH3 carbons in all of the methyl glycosides. In Figure 1, the rightmost carbon is designated atom 1, with subsequent numbers assigned in a clockwise manner around the ring.

Multiple Regression Analysis. A standard multiple linear regression analysis was performed with each atom group in an attempt to define optimum models. A combination of stepwise regressions and best subset regressions was used to generate possible models (17). In virtually all such studies, a variety of models can be constructed, many possessing similar statistics. Three additional criteria were used in selecting one model to represent each atom group.

First, models were screened for collinearity problems. If linear relationships exist among the independent variables used to define a model, the chance exists that the regression coefficients ($b_i$ in eq 1) will be imprecisely determined. The diagnostic procedures recommended by Belsley et al. (18) were used in this determination.

Second, the prediction accuracy of each model was estimated by the jackknifing procedure of Allen (19). In this procedure, the chemical shift of each atom is predicted by using the model computed without that atom. An overall standard prediction error is calculated over the entire set of atoms.

Third, to the degree possible, models were selected that featured continuous variables. We have discovered that discrete structural parameters based on atom counts, etc., sometimes respond unfavorably in predictions when the chemical environment of the atom being predicted is altered from the environments used to define the model. Effectively, discrete variables define models that are less flexible to structural changes, as the variables themselves can assume only a limited range of values.

Table IV presents summary statistics for the models selected to define each atom group. The descriptive statistics indicate that, without exception, the models achieve high accuracy and are statistically very sound.
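The jackknifing step described above can be sketched as a leave-one-out loop around a multiple linear regression. This is a minimal sketch, not the authors' FORTRAN implementation: the normal-equations solver, the function names, the synthetic data, and the choice to take the root mean square of the leave-one-out residuals as the overall standard prediction error are all assumptions.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear parameters?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(rows, shifts):
    """Least-squares coefficients [b0, b1, ..., bp] via normal equations."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, shifts)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, row):
    """Predicted shift: b0 plus the sum of b_i * X_i."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))


def jackknife_error(rows, shifts):
    """RMS of residuals, each from a model fitted without that atom."""
    squared = []
    for i in range(len(rows)):
        coeffs = fit(rows[:i] + rows[i + 1:], shifts[:i] + shifts[i + 1:])
        squared.append((shifts[i] - predict(coeffs, rows[i])) ** 2)
    return math.sqrt(sum(squared) / len(squared))


# Synthetic, exactly linear data (hypothetical parameter values).
rows = [[0.10, 1.0], [0.15, 0.0], [0.20, 1.0],
        [0.25, 0.0], [0.30, 1.0], [0.12, 0.0]]
shifts = [90.0 + 40.0 * x1 - 5.0 * x2 for x1, x2 in rows]
coeffs = fit(rows, shifts)
loo_error = jackknife_error(rows, shifts)
print([round(b, 6) for b in coeffs], loo_error < 1e-6)
```

Because the synthetic data are noise-free, the leave-one-out error is essentially zero. With real shifts it would exceed the ordinary standard error of estimate, since each atom is predicted by a model that never saw it.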
The standard error of estimate in each case is well below 1.0 ppm, our established goal for prediction accuracy. In order to provide an illustration of the specific parameters used in defining the models, Table V presents a detailed description of the model for group I. Of particular note are the excellent t values for each regression coefficient. While a six-variable model might be considered somewhat large when


Table V. Model for Group I Carbons to

structural parameter

119.1

13.3

-1.876 -50.15 -29.19

-5.94

Sum of 1/distance^3 from the α hydrogen to heavy atoms (carbon or oxygen) in the molecule three bonds away
Indicator variable: value of 1 if hydroxyl attached, 0 if methoxy attached to the carbon being modeled
Sum of 1/distance^3 from the α hydrogen to lone pairs associated with oxygens four bonds away
Sum of 1/distance^3 from the α hydrogen to hydrogens in the molecule five bonds away
Sum of 1/distance^3 from the carbon being modeled to lone pairs associated with oxygens three bonds away
Sum of 1/distance^3 from the α hydrogen to hydrogens in the molecule four bonds away

coeff

-8.40 -6.81

40.20

4.88

-15.84

-3.95

t value for significance of the coefficient (absolute value t 4.0 considered strong). 0.251

Figure 3. Bar graph displaying values of the first parameter used in the group I model. The eight labeled subgroups correspond to specific structural environments of the group I carbons. Compound numbers refer to Table II. (Horizontal axis: compound number, 0 to 35.)

based upon only 35 observations, each coefficient is clearly significant. The group I carbons can be divided into two broad categories: those with methoxy groups attached (the glycosides) and those with hydroxyls attached. In the model, an indicator variable (value 1 or 0) was used to differentiate these two types of carbons. The remaining parameters largely encode the chemical environment of the hydrogen attached to the carbon of interest (the α hydrogen). Note that two of the lone pair parameters find use in the model.

It is instructive to examine the actual function of these parameters. For example, the first variable in the model is based on interatomic distances between the α hydrogen and carbon or oxygen atoms three bonds away. This variable alone has a correlation coefficient of 0.948 with the dependent variable of observed chemical shifts. Figure 3 is a bar graph that displays the parameter values for each of the 35 carbons used. The numbers along the lower axis correspond to the compound identification numbers in Table II. It is clear that this parameter divides the carbons into eight structural subgroups. An inspection of the structures reveals that these subgroups are all permutations of the cases of axial/equatorial hydroxyl at atom 2, axial/equatorial hydroxyl at atom 1, and axial/equatorial methoxy at atom 1. Thus, a parameter that appears arbitrary at first glance actually represents a numerical means for assigning atoms to specific structural subclasses.

It is useful to note that parameters describing the environment of the α hydrogen were also found useful in modeling the chemical shifts of the other atom groups. In the other four models, 11 of the 16 total parameters are based on α hydrogen interactions with other hydrogens, heavy atoms, or the oxygen lone pairs. While the α hydrogen environment would be expected to have a pronounced effect on the chemical shift of the attached carbon, we find it extremely interesting

that the monosaccharide chemical shifts can be modeled almost solely by parameters based on this environment.

Evaluation of Simulated Spectra. As noted previously, simulated spectra are constructed by merging the simulated chemical shifts for a compound. The utility of the monosaccharide models as a set can be judged by evaluating the complete simulated spectra. An overall standard spectral prediction error was computed for each of the 35 compounds by comparing the actual and predicted chemical shifts. These values range between 0.368 ppm and 1.407 ppm, with a mean prediction error of 0.659 ppm. Thirty-two of the 35 spectral prediction errors are less than 1.0 ppm, and 27 of 35 are less than 0.75 ppm.

A second evaluation was performed by comparing each simulated spectrum to the 35 actual spectra. Comparisons were made by sorting the chemical shifts from smallest to largest in each spectrum and computing the sum of the squared differences between corresponding shifts in the sorted lists. This comparison score was used to determine the most similar actual spectrum to each simulated spectrum. If a given simulation were perfect, the most similar of the actual spectra would be that of the corresponding compound. This result was obtained in 28 of the 35 cases. In four cases, the corresponding actual spectrum was found as the second most similar, while in three cases it was judged third most similar. Given the extreme similarity of the monosaccharide spectra, we judge these results to be excellent.

A final test was performed to judge the degree to which the ordering of the predicted chemical shifts matched the observed shifts. Carbohydrate spectra typically contain a number of chemical shifts in a narrow region. In such cases, absolute prediction errors can be misleading. It is possible to have a prediction error that is small in magnitude, yet have predicted lines out of order (i.e., the chemical shift of atom i may be predicted downfield from atom j when an upfield prediction is correct). This potential problem was evaluated within each spectrum by ordering the predicted and observed shifts by atom, forming two matched lists of chemical shifts. The degree of similarity in ordering between the two lists was evaluated by computing the Spearman rank correlation coefficient, r_rank. The value of this statistic ranges from -1 (opposite ordering) to +1 (identical ordering). The average value of r_rank over the 35 test spectra was 0.968, an excellent result. Further, many of the discrepancies in ordering involved chemical shifts differing by less than 0.1 ppm.

Evaluation of Predictive Ability of Models. The most demanding test of any spectrum simulation system is its ability to predict spectra not included in the model formation study. This represents the actual use of the developed methodology. The steps required to perform a prediction are: (1) entry of the chemical structures whose spectra are to be predicted, (2) molecular modeling of those structures, (3) perception of the carbon atoms within each structure that will give rise to distinct chemical shifts, (4) calculation of the specific structural parameters required by the chemical shift models to be used, (5) use of the models and the computed structural parameters to generate predicted chemical shifts, and (6) assembly of the


Figure 4. Bar graph showing prediction errors for the 15 compounds used in testing the computed models. Compound numbers refer to Table II. (Horizontal axis: compound number.)

Figure 5. Predicted and observed spectra are plotted for α-D-altrose. The overall prediction error is 0.515 ppm. (Panel label: Simulated Spectrum.)

predicted shifts to form complete predicted spectra.

Fifteen spectra collected locally served as a test set for evaluating the actual predictive ability of the computed models. These compounds are listed in Table II. As noted previously, our experimental conditions were calibrated to the Pfeffer data, thereby allowing comparisons to be made. Eleven of the spectra corresponded to L-monosaccharides for which the corresponding D compound was included in the modeling study. The D and L compounds are structural mirror images, producing chemical shifts that are identical to within experimental error. The mirror image relationship produces structures that are different in terms of the orientations of the ring substituents, however. As such, they represent an excellent test of the ability of the computed structural parameters to encode the relationship between a structural environment and its corresponding chemical shift. Individual values of the structural parameters will be different for the D and L compounds, but a valid model would be expected to produce similar chemical shifts. The spectra of α- and β-D-fucose and α- and β-D-altrose correspond to compounds completely removed from the modeling study. These four spectra provide a test of the ability of the models to extend to new compounds.

Figure 4 is a bar graph that presents the standard errors between the simulated and observed spectra of the 15 test compounds. Figure 5 provides a visual presentation of prediction quality. The simulated and experimentally observed spectra for α-D-altrose are plotted. Overall, the results are outstanding, with the mean prediction error computed at 0.944 ppm. Only four of the spectra have standard errors that are higher than desired. Errors for the fucose anomers are in the range of 1.4 ppm. This value is high due entirely to difficulty in simulating the chemical shift of the methyl carbon.
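The three evaluation measures used in this work (a standard error between matched predicted and observed shifts, the sorted-shift comparison score, and the Spearman rank correlation) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the text does not give the exact formula for the standard prediction error, so a root-mean-square difference is assumed; the Spearman formula shown is the shortcut valid only when there are no tied shifts; and all function names and shift values are hypothetical.

```python
import math

def standard_error(predicted, observed):
    """Root-mean-square difference between matched shifts (assumed form)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)

def comparison_score(spectrum_a, spectrum_b):
    """Sum of squared differences between corresponding shifts after
    sorting each spectrum from smallest to largest."""
    pairs = zip(sorted(spectrum_a), sorted(spectrum_b))
    return sum((a - b) ** 2 for a, b in pairs)

def ranks(values):
    """1-based ranks of the values; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(predicted, observed):
    """Spearman rank correlation for untied, atom-matched lists."""
    n = len(observed)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(predicted), ranks(observed)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def most_similar(simulated, library):
    """Name of the library spectrum with the lowest comparison score."""
    return min(library, key=lambda name: comparison_score(simulated, library[name]))

# Hypothetical shifts in ppm, listed by atom (illustrative values only)
observed = [96.0, 72.5, 73.8, 70.6, 76.9, 61.7]
predicted = [96.4, 72.9, 73.5, 70.2, 76.5, 62.0]
print(standard_error(predicted, observed))
print(spearman(predicted, observed))
```

The comparison score ignores atom assignments, since it sorts each spectrum first, whereas the standard error and the rank correlation both depend on the atom-by-atom matching of the two lists.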
Our group V model was based on a very limited number of methyl environments, suggesting that more methyl data are needed. The prediction errors for the L-galactose anomers are greater than 1.5 ppm, due entirely to problems in modeling the atom 4 chemical shift. Upon inspection, it appears that the atom 4 environment in this L-sugar is quite different from the environments included in the modeling study.

CONCLUSIONS

The presented work represents a set of basic procedures for modeling 13C NMR spectra of carbohydrates. In all tests, the computed models are judged highly successful. In addition, the developed procedures for calibrating chemical shifts

allow multiple data sources to be used in the modeling studies. This calibration technique also enables experimental data to be adjusted to the computed models, thereby allowing actual structural questions to be posed and answered via comparisons with simulated spectra. Perhaps the key achievement in the work is the successful encoding of the effects of multiple oxygen atoms. The derived parameters based on the oxygen and lone pair coordinates should prove valuable in extending the modeling capabilities to larger carbohydrates. A study involving a set of disaccharides is currently under way in our laboratory.

While great advances have been made recently in instrument-based aids to structure elucidation (e.g., two-dimensional NMR), we feel that computational approaches such as spectrum simulation can have a strong complementary role in helping to solve structural problems. All of the computations required in the work reported here can be performed on a standard laboratory microcomputer (IBM-PC AT, etc.). In a typical working environment, access to high-field NMR instrumentation is often limited. Spectrum simulation shows promise as a structural investigation tool that can serve to ease the demands on this instrumentation.

LITERATURE CITED

(1) Hooper, I. R. In Aminoglycoside Antibiotics; Umezawa, H., Hooper, I. R., Eds.; Springer-Verlag: New York, 1982; Chapter 1.
(2) Pfeffer, P. E.; Valentine, K. M.; Parrish, F. W. J. Am. Chem. Soc. 1979, 101, 1265-1274.
(3) Reuben, J. J. Am. Chem. Soc. 1984, 106, 6180-6186.
(4) Dorman, D. E.; Roberts, J. D. J. Am. Chem. Soc. 1970, 92, 1355-1361.
(5) Perlin, A. S.; Casu, B.; Koch, H. J. Can. J. Chem. 1970, 48, 2596-2605.
(6) Bock, K.; Pedersen, C. Acta Chem. Scand., Ser. B 1975, 29, 258-263.
(7) Small, G. W.; Jurs, P. C. Anal. Chem. 1983, 55, 1121-1127.
(8) Brugger, W. E.; Jurs, P. C. Anal. Chem. 1975, 47, 781-784.
(9) Stuper, A. J.; Brugger, W. E.; Jurs, P. C. Computer Assisted Studies of Chemical Structure and Biological Function; Wiley-Interscience: New York, 1979; pp 83-90.
(10) Allinger, N. L. J. Am. Chem. Soc. 1977, 99, 8127-8134.
(11) Reeves, R. E. J. Am. Chem. Soc. 1950, 72, 1499-1506.
(12) Angyal, S. J. Angew. Chem., Int. Ed. Engl. 1969, 8, 157-228.
(13) Angyal, S. J.; Pickles, V. A. Aust. J. Chem. 1972, 25, 1695-1710.
(14) Small, G. W.; Jurs, P. C. Anal. Chem. 1983, 55, 1128-1134.
(15) Small, G. W.; Jurs, P. C. Anal. Chem. 1984, 56, 2307-2314.
(16) DelRe, G. J. Chem. Soc. 1958, 4031-4040.
(17) Draper, N. R.; Smith, H. Applied Regression Analysis, 2nd ed.; Wiley-Interscience: New York, 1981.
(18) Belsley, D. A.; Kuh, E.; Welsch, R. E. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity; Wiley-Interscience: New York, 1980.
(19) Allen, D. M. Technical Report 23; Department of Statistics, University of Kentucky, 1971.

RECEIVED for review January 27, 1987. Accepted April 1, 1987. This work was supported by the National Institutes of Health under the Biomedical Research Support Grant program.