2307
Anal. Chem. 1984, 56,2307-2314
widths than for initial point estimation, with -2% error at W = r/2 as shown in Figure 4. In addition, the relative error for the derivative estimate is roughly three times larger than for the initial point estimate at widths, W > T. These results indicate the larger sensitivity of the slope estimate to the lack of higher order terms in the filter function. Noise added to the exponential test data had no effect on the determinate error of initial point or first derivative estimates. In summary, polynomial digital filters can be derived to provide convenient least-squares estimates of the initial value and initial slope of a data set. A loss of accuracy results when a filter of given order is applied to data described by higher order terms. To obtain accurate initial point estimates of such data, a filter of higher order may be derived by using the equations outlined in the Theory section, or the relative width of the filter may be reduced, such that the data may be described by a lower order polynomial. In practice, data should be collected with a high sampling rate and with minimal prior electronic filtering to minimize distortion. The initial point filter may then utilize the statistical benefit of a large number of data points, while minimizing determinate errors from lack of fit. Initial point polynomial filtering has been used successfully in this laboratory to increase the precision in determining initial light intensities in thermal lens absorption measurements (15,16).Other potential applications include initial rate estimation in thermal lens measurements (18, solution and gas-phase reaction kinetics (12),and excited-state decay measurements (18). Leasbsquares polynomial fib to a given data segment might also be used to extrapolate a best fit value and derivatives for points which lie outside the available range of data. Extrapolation to the pth data point before the data segment would be accomplished by deriving the desired filter coeffi-
cients as above, however, with a change in the index of the data segment to span the range in i from p to 2m + p. The best estimate of the polynomial value or its kth derivatives at the origin, p points before the data segment, are then given by bnOand bnk,respectively. This extrapolation may be used to determine initial conditions in kinetic determinations where there is a known, finite time lag between the initiation of the kinetic process and data collection. The extrapolation length, p , should remain small in relation to the filter width in order to minimize determinate errors in the estimated results.
LITERATURE CITED (1) Savitzky, A.; Golay, M. J. E. Anal. Chem. 1984, 36, 1627-1639. (2) Enke, C. G.; Nieman, T. A. Anal. Chem. 1978, 48, 705A-712A. (3) Betty, K. R.; Horlick, G. Anal. Chem. 1977, 49, 351-352. (4) Kaiser, J. F.; Reed, W. A. Rev. Scl. Instrum. 1977, 48, 1447-1457. (5) Madden, H. H. Anal. Chem. 1978, 50, 1383-1386. (6) Bromba, M. U. A.; Ziegler, H. Anal. Chem. 1979, 51, 1760-1762. (7) Zlegler, H. Appl. Spectrosc. 1981, 35, 88-92. (8) Bromba, M. U. A.; Ziegler, H. Anal. Chem. 1983, 55, 648-653. (9) Bromba, M. U. A.; Ziegler, H. Anal. Chem. 1981, 53, 1583-1586. (IO) Bromba, M. U. A.; Ziegler, H. Anal. Chem. 1983, 55, 1299-1302. (11) Gans, P.; Gill, J. 8. Appl. Spectrosc. 1883, 37, 515-520. (12) Mark, H. B., Jr.; Rechnitz, G. A. “Kinetics in Analytical Chemistry”; Elving, P. J., Koithoff, I . M., Eds., Wiiey-Interscience: New York, I968 Vol. 24. (13) Wonnacott, T. H.; Wonnacott, R. J. “Regression: A Second Course in Statistics”; Wiley: New York, 1981. (14) Hail, K. J.; Quickenden, T. I.; Watts, D. W. J . Chem. Educ. 1978, 53, 493-494. (15) Carter, C. A.; Harris, J. M. Anal. Chem. 1983, 55, 1256-1261. (16) Leach, R. A.; Harris, J. M. Anal. Chem. 1984, 56, 1481-1487. (17) Brannon, J. H.; Madge, D. J . f h y s . Chem. 1978, 82, 705-709. (18) Blrks, J. B. “Photophysics of Aromatic Molecules”; Wlley-Intersclence: New York, 1970.
RECEIVED for review May 21,1984. Accepted June 22,1984. This material is based upon work supported by the National Science Foundation under Grant CHE 82-06898.
Data Reduction in the Simulation of Carbon43 Nuclear Magnetic Resonance Spectra of Steroids Gary W. Small and Peter C. Jurs*
Department of Chemistry, The Pennsylvania State University, 152 Davey Laboratory, University Park, Pennsylvania 16802
Automated procedures are descrlbed for selectlng observatlons to delete In the formatlon of llnear models for carbon-13 NMR chemical shlfts. The methodology Is tested by calculatlng models for hydroxy sterdds based on four progresslvely smaller sets of atoms. The calculated models are used to slmulate the spectra of a separate set of test compounds. The slmulatlon results lndlcate that greater than 50% of the observations can be deleted wlthout a slgnlflcant loss In performance of the models. Practlcal guldellnes are establlshed for the use of the data reduction methodology In future studles.
parisons can be used to confirm or reject suppositions regarding the structure of an unknown. In many cases, however, desired comparisons are impossible due to the unavailabiity of certain spectra. An attractive option in such cases is the employment of spectrum simulation methods to obtain approximate spectra for the desired compounds. The simulated spectra can then be compared with the spectrum of the unknown. One approach to 13C NMR spectrum simulation involves the construction and use of linear models relating numerically encoded structural features to observed 13C NMR chemical shifts. These models have the form
s = b(0) + b(lIX(1) + b(2)X(2) + ... + b ( p ) X ( p ) The interpretation of carbon-13nuclear magnetic resonance
P3C NMR) spectra can often be aided through comparisons with standard reference spectra. The results of such com0003-2700/84/0356-2307$01 SO10
(1)
where S is the predicted chemical shift of a given carbon atom, the X ( i ) are numerical descriptors which encode structural features of the chemical environment of the atom, the b(i) are 0 I964 American Chemical Society
2308
ANALYTICAL CHEMISTRY, VOL. 50, NO. 13, NOVEMBER 1984 I
,
I
'0
14
IC
7 ¶
/ 7
B Flgure 1. The structure and numbering system of 5a-androstane (A) and 5a-cholestane (6). ¶
coefficients determined from a multiple linear regression analysis of a set of unambiguously assigned chemical shifts, and p denotes the number of descriptors in the model. The practicality of this simulation approach has been enhanced by the development of an interactive computer system that incorporates each of the required tasks involved in model construction and application ( I ) . With this system, it is now feasible to develop and apply chemical shift models for complex molecules (2). When suitable models are available, the calculation of a simulated spectrum is a rapid and straightforward procedure. At the present time, however, there are few existing models. When a simulated spectrum is required for a compound for which there is no suitable model or set of models, the only recourse is to invest the time to generate the required models. The model formation process is time-consuming (I). Data must be gathered, structures entered, descriptors generated and evaluated, regressions performed,and the resulting models evaluated. Regression analyses must be performed on many combinations of descriptors in order to find the most suitable model. Often, a number of statistical tests are required in order to select one model as the most reliable. In addition, the structural class being modeled often requires that several models be generated in order to simulate complete spectra. The amount of computation time required for each descriptor calculation, regression, or statistical test is directly proportional to the number of carbons (observations)that have been selected to form the basis for the model. Our experience with the generation of chemical shift models by use of a superminicomputer (PRIME 750) has confirmed a significant increase in computation time as the set of carbons being modeled grows large. While a more powerful computer may exhibit no difficulty in performing large calculations, a smaller laboratory computer could become quickly taxed when presented with floating-point operations on large columns of data. This concern over the number of observations becomes particularly important when large molecules are studied, as the number of atoms required in the model formation procedure can mount rapidly. Steroid molecules provide an illustrative example. Figure 1depicts 5a-androstane (A) and 5a-cholestane (B). Ignoring the alkyl side chain in B, each molecule has 19 carbons in distinct structural environments. Eleven of the nineteen are secondary carbons to which substituents may be attached. Axial and equatorial substituents produce distinct spectra due to the conformationally locked trans-ring junctions. Thus, if only a single type of substituent were con-
sidered, a total of 44 different androstanes and cholestanes could be formed, resulting in 836 carbon atoms and a corresponding number of observations in the computation of chemical shift models. It may be extremely tedious to perform calculations of this magnitude with a laboratory computer. The best way to conserve computation time in the model formation procedure is to limit the number of observations. In the above example, it is legitimate to ask whether it is necessary to include each of the 836 observations. Can an intelligent data reduction be performed, thereby reducing the size of the problem without seriously affecting the accuracy of the chemical shift models? If performed a t the beginning of the study, such a data reduction will promote faster computations for each succeeding step in the model formation procedure. An additional concern in the use of large sets of atoms is the statistical import of the presence of duplicate or highly similar observations. Two carbons in highly similar chemical environments will produce correspondingly similar 13CNMR chemical shifts. Thus, in the formation of chemical shift models, highly similar carbons add little new information to the determination of the regression coefficients. Such carbons represent effective duplicate observations in the analysis. If a group of duplicate observations exists, and the group is large enough with respect to the total number of observations, the resulting model will be biased toward the group, possibly negatively affecting the ability of the model to explain the chemical shifts of other carbons. While in most cases, the number of carbons is large enough that no group of duplicates possesses sufficient weight to be harmful, the presence of duplicates should have little positive effect. It should be possible to delete the duplicate observations without seriously affecting the predictive power of the computed models. A principal goal of our work in the simulation of 13CNMR spectra has been to develop methodology such that the entire simulation procedure is compatible with the performance capabilities of the average laboratory computer. This paper attempts to further this goal with the development of an automated procedure for selecting reduced sets of atoms for use in the computation of chemical shift models. A set of substituted steroids is used to test the procedure. Four progressively smaller sets of observations are selected, and the corresponding models are computed. Each model is used in the simulation of the 13C NMR spectra of an additional set of compounds. The test compounds were withheld from the model formation step. The results of the simulations are used to establish practical guidelines for the use of the methodology in future studies.
EXPERIMENTAL SECTION The set of compounds used in the calculation of chemical shift models is composed of 31 hydroxy steroids. Nineteen androstanols and 12 cholestanols comprise the set. In Table I, the names of the compounds are given, along with an identifying sequence number from 1 to 31. The geometrical orientation of each substituent is denoted in the table by and p where a specifies an orientation below the plane of the ring systems and p specifies an orientation above the plane. Figure 1specifies the numbering system within each compound. The 13C NMR spectra of compounds 1-31 were taken from a collection published by Eggert et al. (3). Fourier transform spectra of the compounds were collected at 25.2 MHz with a Varian XL-100-15 instrument or at 22.6 MHz with a Bruker WH-90 spectrometer. All chemical shifts were reported relative to internal tetramethylsilane (Me4Si). The compounds were measured as 0.2-0.6 M solutions in deuteriochloroform. The test set of compounds used in the evaluation of the data reduction procedure is composed of eight trans-10-methyldecalols and four steroids, numbered 32-43 in Table I. The trans-10methyldecalols are two-ring compounds corresponding to atoms 1-10 and 19 in Figure 1. The numbering system for the ring atoms (Y
ANALYTICAL CHEMISTRY, VOL. 58, NO. 13, NOVEMBER 1984
Table I. Chemical Structures Used no. 1 2 3 4 5 6
7 8 9 10 11 12 13 14 15 16
17 18 19 20 21 22 23 24 25 26
27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
name 5a-androstan-la-01 Ba-androstan-l&ol 5a-androstan-2a-01 5a-androstan-28-01 5a-androstan-38-01 5a-androstan-3a-01 5a-androstan-4P-01 5a-androstan-4a-01 5a-androstan-68-01 5a-androstan-llp-01 5a-androstan-lla-01 5a-androstan-12@-01 5a-androstan-12a-01 5a-androstan-158-01 Sa-androstan-15a-01 5a-androstan-168-01 5a-androstan-178-01 5a-androstan-16a-01 5a-androstan-17a-01 5a-cholestan-la-01 5a-cholestan-l~-ol 5a-cholestan-28-01 5a-cholestan-la-01 5a-cholestan-3@-01 5a-cholestan-4a-01 5a-cholestan-48-01 5a-cholestan-5a-01 5a-cholestan-6a-01 5a-cholestan-7a-01 5a-cholestan-78-01 5a-cholestan-15a-o1 trans-10-methyldecal-la-ol
trans-10-methyldecal-18-01 trans-10-methyldecal-2a-01 tram-lO-methyldecal-2@-01 trans- l0-methyldecal-3a-o1 trans-l0-methyldecal-3~-ol trans-10-methyldecal-4a-01 trans-lO-rnethyldecal-4~-01
5a-androstane 5a-androstane-3a,l78-diol 5a-androstane-38,17B-diol
5a-cholestane
in the decalols is identical with that shown in the figure. The 1% N M R spectra of compounds 40 and 43 were taken from the collection discussed above. The same experimental parameters apply. A collection published by Grover and Stothers (4)provided the spectra of the remaining ten compounds. Fourier transform spectra of the compounds were collected with a Varian XL-100-15 spectrometer. Chemical shifts were reported relative to internal Me4Si. All compounds were measured as 20% (w/v) solutions in deuteriochloroform. The 43 chemical structures were entered into computer disk fdes via a graphical procedure developed for the ADAPT software system (5,6). Approximate three-dimensional atomic coordinates for each structure were generated via a two-step molecular mechanics approach. An interactive molecular mechanics program was used to generate initial three-dimensional coordinates (7). These coordinates were refined by use of the MMI program of Allinger et al. (8). All computer programs used in this work were written in FORTRAN and implemented on a PRIME 750 computer operating in the Department of Chemistry at The Pennsylvania State University. Graphical capabilities were implemented by use of Tektronix PLOT-10 software.
RESULTS AND DISCUSSION An automated procedure for use in selecting atoms that are potential candidates for deletion requires a computational method for determining the degree of similarity that exists between two carbon atom environments. Two components
2309
of chemical similarity are considered here: (1)topological similarity and (2) geometrical similarity. Topological Similarity. The determination of topological similarity is an assessment of the similarity of chemical environments in terms of atom type, hybridization, and connectivity. We have introduced a multidmensional vector approach to this determination (9). The chemical environment of a carbon center, C(a), is encoded as
+
where E(a) is an (n 1)-dimensional vector describing the environment of C(a). The elements of E(a), represent the chemical environment at a distance i bonds from C(a). The carbon center itself, C(a), is described by e(0). Typically, n is set to five. In most molecules, effects on chemical shifts are seldom significant through more than five bonds. At a distance of i bonds from C(a), we define
e(i) = S / d 3
(3)
where S is a sum of terms describing the atoms located i bonds from C(a) and d = i (i > 0); d = 1, (i = 0). Parameters derived from observed I3C NMR chemical shifts of small molecules are used to compute the components of S, defining the influence of each atom in the determination of chemical shifts. The weighting factor, d, serves to give greater influence to atoms closer to C(a), as these atoms will have a greater effect in determining the chemical shift of C(a). The topological environment of C(a) can be compared with that of another carbon, C(b), by computing the Euclidean distance between E(a) and E(b), where the smaller the distance, the more similar C(a) is to C(b). If the environments of C(a) and C(b) are identical to a distance of n bonds, the Euclidean distance between E(a) and E(b) will be zero. Geometrical Similarity. Geometrical similarity involves similarity in the spatial orientations of atoms. In the case of a molecule in which there are many internal degrees of freedom of rotation, the resultant I3C NMR spectrum represents a weighted average over the various conformations. No distinction is made as to a specific geometrical orientation of the molecule and questions of geometrical similarity do not apply. Other molecules exist in locked conformations, however, and a defied geometrical orientation of the molecule produces the observed I3C NMR spectrum. In such molecules, interactions between atoms are significantly affected by their geometrical orientations. For the steroid molecules treated here, this involves the presence of axial vs. equatorial substituents. The chemical environments of two carbons, C(a) and C(b), may be topologically identical, but the respective chemical shifts of the atoms may differ by as much as 8 ppm, due entirely to a change in orientation of a nearby substituent. The approximate three-dimensional coordinates obtained from molecular mechanics calculations serve as a basis for measuring geometrical Similarity. Perhaps the most straightforward approach to measuring geometrical similarity is to build a function based on interatomic distances, computed by use of the modeled coordinates. This procedure is hampered by questions of whether slight differences in distances should be attributed to differences between modeling programs or to small geometrical structural differences. An alternative procedure begins with the construction of the principal plane of the molecule. This plane corresponds to the two most significant eigenvectors of the coordinates covariance matrix, formed from the modeled ( x , y, z ) coordinates of the molecule. The cross product of the two eigenvectors results in a principal normal vector of the molecule. The orientation of a given atom is characterized by the vector from the atom to the center of the molecule, where the center
2310
ANALYTICAL CHEMISTRY, VOL. 56, NO. 13, NOVEMBER 1984
is defined as the mean ( x , y, z ) coordinate. The sign of the dot product of this vector with the principal normal vector is an indicator of the presence of the atom above or below the plane of the molecule. This information leads to the formation of a binary representation of the geometrical environment of a carbon atom. For a given carbon atom, C(a), we define
B(a) = (I(O),W , I@),..., I(n)) (4) where the elements, I(i),are 32-bit integer words used to hold a binary representation of the geometrical orientations of atoms i bonds from C(a). Typically, n is set to five, as discussed previously. The 16 high-order bits of I(i) encode atoms above the plane of the molecule (positive dot product with principal normal vector) as bits turned on from left to right. For example, if three atoms are above the principal plane at a distance i bonds from C(a), the first three high-order bits of I(i) are turned on. The 16 low-order bits similarly encode atoms below the principal plane. It is assumed here at no more than 16 atoms will be above or below the plane at a given distance of i bonds. Should this not be the case, the size of each element could be increased. The geometrical environment of C(a) is compared with that of another carbon, C(b), by comparing B(a) and B(b). This is done by use of the binary distance metric XOR/IOR, where XOR defines a count of the number of mismatched bits between B(a) and B(b) (exclusive or), and IOR defines the sum of the number of mismatched plus matched bits in the two representations (inclusiveor). Identical representations result in a value of zero for the metric (no mismatches). Two comparisons are made, switching the left and right halves of each I(i)of one of the representations in order to account for effeds of symmetry. The lowest score derived from the two comparisons is taken as a measure of the geometrical similarity of C(a) and C(b). The binary metric has proven compatible with most‘compounds in which the stereochemistry is based on orientations with respect to ring systems. Compounds of this type are the predominant species among structures whose NMR spectra are produced by specific conformations. Therefore, the metric seems applicable to a large number of commonly encountered structures. Definition of Steroid Study. We choose to restrict this study to the steroid ring systems in compounds 1-31, thereby ignoring the alkyl side chain in each of the 12 cholestanols. Two factors prompt this decision. The ring systems exist in locked conformations, while the side chains are flexible. The flexibility of the side chain hinders the use of numerical descriptors which encode specific geometrical orientations. Such descriptors are necessary for the ring systems. Secondly,linear and branched alkanes, of which the side chains are examples, have been treated previously by Lindeman and Adams (IO). Limiting the study to the ring systems defines an initial set of 589 atoms in compounds 1-31. In most cases, separate models are computed for primary, secondary, tertiary, and quaternary carbons, as chemical shifts naturally cluster in this manner. These groups can be either subdivided or combined as the situation merits. A semiautomatic procedure was used to determine the proper atom grouping for compounds 1-31. The six-dimensional E vectors of eq 2 were computed for each of the 589 carbons. A Karhunen-Loeve transformation (11)was applied to the six-dimensional data, resulting in the best two-dimensional approximation to the data. Figure 2 presents a plot of the transformed data. This plot is termed a principal components projection plot. It should be noted that points representing atoms differing only in terms of geometry will lie on top of each other in the figure. The vectors representing the atoms cluster into six band-
I
%
W
4
z 0 a r 0
“A
u
A
81 a 0
z
*
0 V
W
A
a
3A
$
4A A
u)
I =‘
% FIRST PRINCIPAL COMPONENT
Flgurs 2. Principal components projection plot of E vectors of atoms in compounds 1-31.
shaped regions, suggesting a modeling procedure based on six groups of atoms. In the figure, the regions labeled 1 , 2 , 3 , and 4 correspond to primary, secondary, tertiary, and quaternary carbons, respectively. The region labeled 3A consists of tertiary carbons with attached hydroxyl groups. The region labeled 4A contains one point, atom 5 in compound 27. Significantly, this is the only quaternary carbon with an attached hydroxyl group. As no relevant model can be formed from one atom, it was attempted initially to include the atom with either group 3A or group 4. The presence of the observation degraded the performance of both models, prompting the decision to remove the observation from the analysis. Thus, the remaining 588 atoms were divided into five groups for modeling, with groups 1 , 2 , 3 , 3 A , and 4 containing 62,299,135,30, and 62 carbons, respectively. Selection of Atom Subsets. An automated procedure was used to determine atoms to delete from the model formation step. Beginning with the first carbon in compound 1, and continuing for each carbon in each structure, the following procedure was performed. For the current test carbon, C(a), E(a) in eq 2 and B(a) in eq 4 were computed. E(a) and B(a) were compared to each corresponding E(i) and B(i) of atoms already selected to remain in the analysis. If the Euclidean distance between E(a) and E(i) was greater than a specified topological tolerance, 4t), and the value of the XOR/IOR metric between B(a) and B(i) was greater than a specified geometrical tolerance, e(g),C(a) was allowed to remain in the analysis. E(a) and B(a) were retained for future comparisons. If the above criteria were not met, C(a) was deleted. In this procedure, subsets of atoms are characterized by the values of e(t) and e(g). For the steroid compounds treated here, geometrical differences can be extremely significant. Thus, it was decided to hold e(g) constant at zero. This demanded that absolute geometrical equivalence be found before an atom could be deleted. The topological tolerance, 4th was allowed to vary in eight steps, from 0 to 10.0. For each t(t)used, the standard error in observed chemical shifts was computed for each pair of atoms found equivalent. The standard error serves as an indicator of the amount of information lost through the deletion of atoms. In evaluating the results of this calculation, standard errors of less than 0.2 ppm are deemed indistinguishable from experimental error. In Figure 3, the computed standard errors for each data group are plotted vs. the logarithm of t(t). The symbols square, diamond, triangle (upper curve),triangle (lower curve), and inverted triangle correspond to groups 1, 2, 3, 3A, and 4, respectively. Each curve increases in steps as the tolerance is increased, except the one belonging to group 3A. The
ANALYTICAL CHEMISTRY, VOL. 56, NO. 13, NOVEMBER 1984
0 . a00
t
0.600
0.400
0.200
t 0.000 - . -
-4.00
-2.00
-3.00
- I .OO
I .OO
0.00
LOG (epsilon)
Plot of standard errors in chemlcal shifts (ppm) between atoms found equivalent vs. log (t(t))for the five-atom groups. It is assumed that c(g) = 0 in each case. Figure 3.
Table 11. Sizes of Atom Subsets
subset
1
2
group 3
3A
4
total
A
62 48 31 17
299 224 194 135
135 120 103 80
30 25 23 23
62 53 49 36
588 470 400 291
B C D
standard errors reach a constant value at the point at which the fixed geometrical tolerance precludes any additional deletion of atoms. For the 3A curve, only two atoms are deleted after the initial point in the curve (e(t) = 0). Three subsets were selected as representative of the successive steps in information loss due to atom deletion. The subsets corresponding to topological tolerance of 0,0.01, a n d / 0.1 were labeled B, C, and D, respectively. The set containing all atoms was labeled A. It should be noted that subset B corresponds to deleting only exact topological and geometrical duplicates, while subsets C and D correspond to the additional deletion of atoms that are topologically similar to those already present. In Table 11, the total size of each subset is given, along with the corresponding size of each atom group. Subsets B, C, and D represent 20% , 32% , and 51% decreases in the total number of atoms, respectively. Calculation of Chemical Shift Models. The model formation step is composed of two parts: (1)the selection of which set of numerical structural descriptors to use in each case ( X ( i )in eq 11, and (2) the calculation of the regression coefficients (b(i)in eq 1). The procedure used in each case to determine the best set of descriptors has been described in detail elsewhere (2), and it is summarized here. An initial screening process eliminated those descriptors with too few nonzero values (