
Anal. Chem. 1984, 56, 2314-2319

Automated Selection of Models for the Simulation of Carbon-13 Nuclear Magnetic Resonance Spectra

Gary W. Small, Terry R. Stouch, and Peter C. Jurs*

Department of Chemistry, The Pennsylvania State University, 152 Davey Laboratory, University Park, Pennsylvania 16802

Carbon-13 NMR chemical shift models are evaluated for their suitability for performing specific spectrum simulations. A computational procedure is described that allows an automated determination to be made regarding the suitability of a given model for any given simulation. A statistical basis for the computations allows a quantitative measure of confidence to be assigned to each simulation. The methodology is evaluated by the use of a library of 15 chemical shift models and a set of 25 test carbons whose simulated chemical shifts are desired. Tests are performed to evaluate the ability of the methodology to spot improper simulations, as well as the ability to choose the correct model from a group of highly similar models.

Spectrum simulation methods for carbon-13 nuclear magnetic resonance (13C NMR) data are computational procedures that enable approximate spectra to be obtained in cases in which actual spectra are unavailable and cannot be easily collected. The availability of spectrum simulation methodology can significantly aid chemical analyses in which spectral comparisons are desired but cannot be performed due to a lack of observed spectra.

One approach to 13C NMR spectrum simulation involves the construction and use of linear models relating numerically encoded structural features to observed 13C NMR chemical shifts. These models have the form

$$S = b(0) + b(1)X(1) + b(2)X(2) + \cdots + b(p)X(p) \qquad (1)$$

where S is the predicted chemical shift of a given carbon atom, the X(i) are numerical descriptors which encode structural features of the chemical environment of the atom, the b(i) are coefficients determined from a multiple linear regression analysis of a set of unambiguously assigned chemical shifts, and p denotes the number of descriptors in the model. The early work using this approach focused on linear and branched alkanes (1, 2), while, more recently, attempts have been made to expand the approach to account for heteroatoms and unsaturations (3, 4). The development of computer-based methodology for implementing the modeling process has expanded the capabilities for treating complex chemical systems (5, 6).

A characteristic of both simple and complex chemical shift models is that each is applicable only to atoms similar to those used to determine the models. By way of example, a model based on alkanes will clearly have no ability to predict accurately the chemical shifts of atoms in an aromatic ring system.

Whenever chemical shift models are applied in a spectrum simulation, the suitability of the available models for performing the simulation must be evaluated. Questions regarding chemical similarity are an important aspect of this evaluation. Are atoms in the compound of interest similar to those atoms used in determining the models? If several similar models are available, which is the best to use? Can it be determined when no model is suitable? If the simulation is performed, how accurate is the resulting spectrum? If questions of this type cannot be answered with confidence, the resulting simulated spectra cannot be used with confidence. Moreover, unless these questions can be answered routinely and automatically, severe restrictions will be placed on the ultimate practicality of spectrum simulation as an analytical tool.

In this paper, an automated procedure is described that allows the above questions to be answered. Given the chemical structure whose spectrum is to be simulated and a set of calculated models, the most suitable model (if any) for the simulation can be selected. A statistical basis for the computation produces a quantitative measure of confidence in the subsequent simulation. This procedure is tested for its ability to reject improper simulations as well as for its ability to choose among models that are highly similar.
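As a minimal sketch of how a model of the form of eq 1 is evaluated, the following Python function computes a predicted shift from an intercept, a list of coefficients, and a list of descriptor values. The function name `predict_shift` and all numerical values are hypothetical and chosen only for illustration; they are not taken from any of the published models.

```python
from typing import Sequence


def predict_shift(coefficients: Sequence[float], descriptors: Sequence[float]) -> float:
    """Evaluate eq 1: S = b(0) + b(1)X(1) + ... + b(p)X(p).

    coefficients holds b(0)..b(p); descriptors holds X(1)..X(p).
    """
    if len(coefficients) != len(descriptors) + 1:
        raise ValueError("need one intercept plus one coefficient per descriptor")
    intercept, slopes = coefficients[0], coefficients[1:]
    return intercept + sum(b * x for b, x in zip(slopes, descriptors))


# Hypothetical three-descriptor model (illustrative numbers only).
b = [10.0, 9.0, 9.5, -2.5]   # b(0), b(1), b(2), b(3)
x = [2.0, 1.0, 3.0]          # X(1), X(2), X(3)
shift = predict_shift(b, x)  # 10.0 + 18.0 + 9.5 - 7.5
```

The sketch says nothing about whether the model is appropriate for the atom being simulated, which is the question the remainder of the paper addresses.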

EXPERIMENTAL SECTION

The chemical structures used to test the proposed methodology were entered into computer disk files via a graphical procedure developed for the ADAPT software system (7, 8). Some of the chemical shift models used for testing required that three-dimensional atomic coordinates be available for the structures. Approximate coordinates for each structure were generated via a two-step molecular mechanics approach. An interactive molecular mechanics program was used to generate initial three-dimensional coordinates (9). These coordinates were refined by use of the MM1 program of Allinger et al. (10). All computer programs used in this work were written in FORTRAN and implemented on a PRIME 750 computer operating in the Department of Chemistry at The Pennsylvania State University. Graphical capabilities were implemented by use of Tektronix PLOT-10 software.

ANALYTICAL CHEMISTRY, VOL. 56, NO. 13, NOVEMBER 1984

© 1984 American Chemical Society


RESULTS AND DISCUSSION

Computational Elements. First, we define the question of model suitability in precise terms. In a given analysis, it is desired to have a simulated 13C NMR spectrum for compound X. Compound X contains h structurally unique carbons which will give rise to distinct spectral resonances. A simulated chemical shift must be computed for each of these carbons. A set of chemical shift models is available from the literature or from previous studies. The set of carbon atoms used to determine each model is known.

For each of the h carbons in compound X, the following questions are posed. For each model, is the current carbon, C(a), structurally similar to the carbons used to determine that model? In the best case, is the similarity great enough to warrant confidence in the simulated chemical shift obtained from the application of the corresponding model? If the second question can be answered positively, the accuracy of the simulation should fall within the distribution of the residuals obtained in the determination of the model. In practical terms, the simulated chemical shift of C(a) should be accurate to within 2s ppm, where s is the standard error of estimate obtained in the calculation of the regression coefficients of the model. It is assumed here that the residuals are normally distributed.

Four practical requirements are placed upon any formalized methodology for answering the above questions: (1) a computational procedure must be used; (2) the procedure must be automated; (3) given the fact that hundreds of carbons are used to determine each chemical shift model, the methodology must be economical in its use of computer storage; and (4) the methodology must have a statistical basis so that a measure of confidence can be attached to the results. A procedure based on the calculation of the topological similarity of carbon atom environments can be used to meet the above requirements.

The determination of topological similarity is an assessment of the similarity of chemical environments in terms of atom type, hybridization, and connectivity. We have previously introduced a multidimensional vector approach to this determination (11), and we describe it briefly here. The chemical environment of a carbon center, C(a), is encoded as

$$e(a) = (e(0), e(1), e(2), \ldots, e(p)) \qquad (2)$$

where e(a) is a (p + 1)-dimensional vector describing the environment of C(a). The elements of e(a), e(i), represent the chemical environment at a distance of i bonds from C(a). The carbon center itself, C(a), is described by e(0). Typically, e(a) is a six-dimensional vector. In most molecules, effects on chemical shifts are seldom significant through more than five bonds. The topological environment of C(a) can be compared with that of another carbon, C(b), by computing the Euclidean distance between e(a) and e(b), where the smaller the distance, the more similar C(a) is to C(b). If the environments of C(a) and C(b) are identical to a distance of p bonds, the Euclidean distance between e(a) and e(b) will be zero. At a distance of i bonds from C(a), we define

$$e(i) = S/d^{3} \qquad (3)$$

where S is a sum of terms describing the atoms located i bonds from C(a) and d = i (i > 0); d = 1 (i = 0). Parameters derived from observed 13C NMR chemical shifts of small molecules are used to compute the components of S, defining the influence of each atom in the determination of chemical shifts. The weighting factor, d, serves to give greater influence to atoms closer to C(a), as these atoms will have a greater effect in determining the chemical shift of C(a). The cubic term was derived empirically by studying pairs of atoms whose environments produce small Euclidean distances in the vector comparisons. Use of the cubic term seems to minimize the occurrence of large chemical shift differences between atoms judged similar by the distance computation.

Thus, the capability exists for characterizing the h carbons in compound X as six-dimensional vectors. The atoms associated with each model can be characterized similarly. This allows standard mathematical techniques for handling vectors to be applied to the question of model suitability. The elements of the e vectors constitute continuous variables, thereby making a multivariate normal distribution a reasonable assumption. Statistical procedures based on this distribution should be applicable here.

As noted previously, atoms used to form an accurate chemical shift model must be structurally similar. Therefore, the tips of the vectors representing the atoms should lie near each other in space, forming a cluster of points. If t models have been determined, t clusters of points will be formed. If two models are calculated from similar atoms, the corresponding clusters of points will overlap. For a given carbon whose chemical shift is to be simulated, the question of model suitability can be reduced to a determination of whether the tip of the vector representing the atom lies within any of the t clusters. If the point is close to several clusters, the question demands a quantitative determination of the goodness-of-fit to each. If the tip of the vector does not lie within any of the clusters, it can be concluded that none of the available models are applicable to the chemical shift simulation.

Figure 1. The relationship of a test point, represented as a triangle, to a cluster of points contained in the box.

The problem of determining the distance from a point to a cluster of points is illustrated in Figure 1 for a three-dimensional space. The following discussion is equally applicable to a multidimensional space, however. The test point is represented as a triangle, while the cluster of points is represented by the smallest volume that encloses the cluster. The actual shape of the enclosing volume is ellipsoidal, but it is conveniently represented here as a box. In the present application, there would be t boxes, with the possibility that some of the boxes would overlap.

A simple approach would be to compute the Euclidean distance from the triangle to the center of the box (the cluster mean) and to use the distance as a measure of the goodness-of-fit of the point to the cluster. If each box were symmetric, this approach would be viable. Unfortunately, a given box may be highly asymmetric. For example, if a box were long, narrow, and thin, it would be quite possible for a point to be relatively close to the center of the box, yet be clearly outside the boundaries of the box.
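The distance weighting of eq 3 and the Euclidean comparison of two environment vectors can be sketched as follows. The per-shell sums S(0) through S(5) used here are hypothetical numbers, since the parameters that define S are given in ref 11 and not in this paper; the function names are likewise illustrative.

```python
import math
from typing import List, Sequence


def environment_vector(shell_sums: Sequence[float]) -> List[float]:
    """Apply eq 3 to each shell: e(i) = S(i) / d**3, with d = 1 for i = 0 and d = i otherwise."""
    return [s / (max(i, 1) ** 3) for i, s in enumerate(shell_sums)]


def euclidean_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two environment vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical shell sums S(0)..S(5) for a reference carbon.
s_ref = [1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
# Same carbon with a change of 2.0 one bond away.
s_near_change = [1.0, 4.0, 3.0, 2.0, 1.0, 0.0]
# Same carbon with a change of 2.0 four bonds away.
s_far_change = [1.0, 2.0, 3.0, 2.0, 3.0, 0.0]

e_ref = environment_vector(s_ref)
d_near = euclidean_distance(e_ref, environment_vector(s_near_change))  # 2.0 / 1**3
d_far = euclidean_distance(e_ref, environment_vector(s_far_change))    # 2.0 / 4**3
```

In this illustration the same structural change produces a distance 64 times smaller when it occurs four bonds away, which is the intended effect of the cubic weighting.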


An approach that overcomes this limitation is based on the ratio of the volume of the box to the volume of the smallest box that is able to contain both the original box and the test point. If the test point lies close to the cluster, the two volumes will be nearly equal, and the ratio will be close to one. As the test point diverges from the cluster, the volume of the containing box will grow large in comparison to that of the original box, causing the volume ratio to approach zero.

This simple geometrical determination is related to a standard statistical procedure for the detection of multivariate outliers. It is termed Wilks' Λ (12), and it is computed as

$$\Lambda = \frac{\det[C]}{\det[C^{*}]} \qquad (4)$$

where

$$C = [E - M]'[E - M] \qquad (5)$$

and

$$C^{*} = C + [e^{*} - m]'[e^{*} - m] \qquad (6)$$

where E is an n × 6 matrix containing the six-dimensional vectors representing the n atoms used to determine the current model, m is a 1 × 6 vector of the means of the columns of E, M is an n × 6 matrix of the means of the columns of E (each row is m), C is the 6 × 6 covariance matrix of E, det [C] is the determinant of C, and C* is the 6 × 6 covariance matrix resulting from the augmentation of E with e*, the 1 × 6 vector representing the test atom. In our implementation, det [C] is computed as the product of the eigenvalues of C.

An advantage of this procedure is that a cluster can be characterized by C (36 values), m (six values), and det [C] (one value), regardless of the number of points in the cluster. The original set of vectors, E, is not needed once C has been computed. Furthermore, since C is symmetric, only the upper half plus the diagonal is needed (21 values). This reduces the total storage to 28 values per cluster, fulfilling the requirement for storage economy.

Numerous workers have studied the distribution of Λ in an effort to assign significance to the values. Belsley et al. (13) have shown that

$$F = \frac{(n + 1) - p}{p - 1}\cdot\frac{1 - \Lambda}{\Lambda} \qquad (7)$$

has the F distribution, with [p − 1, (n + 1) − p] degrees of freedom, where n is defined above and p is the dimensionality of the vectors. In our application, the degrees of freedom simplify to (5, n − 5).

The suitability of a given model for simulating the chemical shift of an atom can be tested by computing Λ and the corresponding F value. The calculated F can be compared to the tabulated F for the same number of degrees of freedom. If the calculated F exceeds the tabulated value, the probability is only α that the atom is within the cluster of atoms defining the model, where α is the probability associated with the tabulated F. Typically, F values corresponding to α = 0.05 are used (termed 95% F). Therefore, smaller calculated F values indicate a better fit of the atom to the group of atoms used to determine the model. It should be noted that F tables are designed such that F ≥ 1. If the calculated F is less than one, 1/F is used, and the comparison is made against the tabulated F corresponding to the reversed degrees of freedom (n − 5, 5).

The above methodology can be used in an automated procedure for determining model suitability. This procedure is presented schematically in Figure 2. For each of the t available models, C, m, and det [C] are computed and stored. When a simulated spectrum is needed, the corresponding structure is entered, and the structurally distinct atoms are perceived. For each atom, the vector representation, e*, is formed. The suitability of each model is evaluated by the use of e* in the computation of Λ and F.

Figure 2. Flow chart of the automated procedure for determining model suitability.

Three categories of results may be obtained. First, all of the calculated F values may exceed the corresponding tabulated F values. In this case, none of the available chemical shift models are judged suitable for simulating the chemical shift of the current atom. Second, one of the calculated F values may be less than the tabulated value. In this case, the corresponding model is deemed suitable for the simulation. Third, based on the F-value comparisons, several of the models may be judged suitable. The model corresponding to the largest value of Wilks' Λ would normally be chosen here, as the test atom is closer to the center of the cluster of atoms used to determine this model.

Evaluation of Methodology. A library of 15 chemical shift models was assembled to test the proposed methodology. In Table I, each of the models is described and given an identifying letter. The number of atoms used in the determination of each model is given (n), along with s, the standard error between predicted and observed chemical shifts for the atoms used in the calculation of the model.

Models A through E provide capabilities for simulating the spectra of steroids and hydroxy steroids. Thirty-one androstanols and cholestanols were used in the formation of the models (14).

Thirty-six cycloalkanes formed the basis for models F through H. The data set consisted of methylcyclohexanes, cis- and trans-decalins, perhydroanthracenes, perhydrophenanthrenes, and one androstane (15). The three models used were updated versions of the published models. Several recently developed structural descriptors were used to improve the original models.
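The computation of Wilks' Λ from eq 4 to 6 can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' FORTRAN implementation: the determinant is obtained here by Gaussian elimination instead of the eigenvalue product mentioned in the text, the function names are hypothetical, and the example uses a made-up two-dimensional cluster (the paper works in six dimensions) so that the numbers can be checked by hand.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def column_means(E: Sequence[Sequence[float]]) -> List[float]:
    """m: the mean of each column of E."""
    n = len(E)
    return [sum(col) / n for col in zip(*E)]


def scatter_matrix(E: Sequence[Sequence[float]], m: Sequence[float]) -> Matrix:
    """C = [E - M]'[E - M] (eq 5), accumulated one row of E at a time."""
    p = len(m)
    C = [[0.0] * p for _ in range(p)]
    for row in E:
        d = [x - mu for x, mu in zip(row, m)]
        for i in range(p):
            for j in range(p):
                C[i][j] += d[i] * d[j]
    return C


def determinant(A: Matrix) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in A]
    n = len(a)
    det = 1.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda r: abs(a[r][k]))
        if a[pivot][k] == 0.0:
            return 0.0
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            det = -det
        det *= a[k][k]
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= factor * a[k][c]
    return det


def wilks_lambda(C: Matrix, m: Sequence[float], e_star: Sequence[float]) -> float:
    """Eq 4 and 6: det[C] / det[C + (e* - m)'(e* - m)]."""
    d = [x - mu for x, mu in zip(e_star, m)]
    p = len(m)
    C_star = [[C[i][j] + d[i] * d[j] for j in range(p)] for i in range(p)]
    return determinant(C) / determinant(C_star)


# Hypothetical elongated cluster: long along the first axis, thin along the second.
E = [[-4.0, -0.1], [-2.0, 0.1], [0.0, 0.0], [2.0, -0.1], [4.0, 0.1]]
m = column_means(E)
C = scatter_matrix(E, m)

# Two test points at the same Euclidean distance (2.0) from the cluster mean.
lam_along = wilks_lambda(C, m, [2.0, 0.0])   # inside the long dimension of the cluster
lam_across = wilks_lambda(C, m, [0.0, 2.0])  # far outside the thin dimension
```

Both test points are equally far from the cluster mean, yet Λ is about 0.9 for the first and below 0.01 for the second, which illustrates why the volume ratio is preferred to a simple distance to the center when a cluster is asymmetric.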


Table I. Library of Chemical Shift Models

| model | description | n | s, ppm |
|---|---|---|---|
| A | steroids and hydroxy steroids, primary carbons | 48 | 0.738 |
| B | steroids and hydroxy steroids, secondary carbons | 224 | 1.065 |
| C | steroids and hydroxy steroids, tertiary carbons | 120 | 0.967 |
| D | hydroxy steroids, tertiary carbons with attached hydroxyls | 25 | 1.208 |
| E | steroids and hydroxy steroids, quaternary carbons | 53 | 0.808 |
| F | cycloalkanes, primary carbons | 51 | 0.890 |
| G | cycloalkanes, secondary carbons | 157 | 1.085 |
| H | cycloalkanes, tertiary carbons | 79 | 0.735 |
| I | cycloalkanols, primary carbons | 39 | 0.753 |
| J | cycloalkanols, secondary carbons | 138 | 0.866 |
| K | cycloalkanols, tertiary carbons | 78 | 1.143 |
| L | linear and branched alkanes, primary carbons | 125 | 0.760 |
| M | linear and branched alkanes, secondary carbons | 117 | 0.584 |
| N | linear and branched alkanes, tertiary carbons | 53 | 1.001 |
| O | linear and branched alkanes, quaternary carbons | 24 | 0.751 |
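The storage argument made for the cluster description (21 upper-triangle values of C, six means, and one determinant) can be illustrated with a small packing routine. The sketch below is an assumption about how such a record might be laid out; the function names, the ordering of the 28 values, and the example matrix are hypothetical and are not taken from the authors' software.

```python
from typing import List, Sequence, Tuple

P = 6  # dimensionality of the environment vectors


def pack_cluster(C: Sequence[Sequence[float]], m: Sequence[float], det_c: float) -> List[float]:
    """Flatten a cluster description to 21 + 6 + 1 = 28 numbers."""
    upper = [C[i][j] for i in range(P) for j in range(i, P)]  # diagonal plus upper half
    return upper + list(m) + [det_c]


def unpack_cluster(values: Sequence[float]) -> Tuple[List[List[float]], List[float], float]:
    """Rebuild the full symmetric matrix, the mean vector, and det[C]."""
    n_upper = P * (P + 1) // 2
    C = [[0.0] * P for _ in range(P)]
    k = 0
    for i in range(P):
        for j in range(i, P):
            C[i][j] = C[j][i] = values[k]
            k += 1
    m = list(values[n_upper:n_upper + P])
    det_c = values[n_upper + P]
    return C, m, det_c


# Hypothetical symmetric matrix and means; det_c is a placeholder, not computed here.
C = [[1.0 / (1 + abs(i - j)) for j in range(P)] for i in range(P)]
m = [0.5 * i for i in range(P)]
packed = pack_cluster(C, m, 1.0)

# For the 15 models of Table I this layout needs 15 * 28 stored values in total.
library_size = 15 * len(packed)
```

Under this layout the full library of Table I occupies 420 numbers, independent of the more than 1300 atoms used to determine the models.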

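Table II. Results for Atom Rejection Test

| atom | model | Wilks' Λ | F (calcd) | 95% F | deg of freedom |
|---|---|---|---|---|---|
| 1 | K | 0.0572 | 240.8 | 2.37 | 5, 73 |
| | H | 0.0566 | 246.6 | 2.37 | 5, 74 |
| | M | 0.0564 | 374.6 | 2.29 | 5, 112 |
| 2 | K | 0.00306 | 4761 | 2.37 | 5, 73 |
| | M | 0.00130 | 17220 | 2.29 | 5, 112 |
| | L | 0.000676 | 35500 | 2.29 | 5, 120 |
| 3 | O | 0.0196 | 190.2 | 2.74 | 5, 19 |
| | K | 0.0323 | 437.0 | 2.37 | 5, 73 |
| | M | 0.00977 | 2271 | 2.29 | 5, 112 |
| 4 | K | 0.00568 | 2554 | 2.37 | 5, 73 |
| | M | 0.00157 | 14240 | 2.29 | 5, 112 |
| | L | 0.000668 | 35930 | 2.29 | 5, 120 |
| 5 | C | 0.108 | 189.3 | 2.29 | 5, 115 |
| | N | 0.04029 | 228.7 | 2.45 | 5, 48 |
| | K | 0.0547 | 252.2 | 2.37 | 5, 73 |
| 6 | L | 0.119 | 178.0 | 2.29 | 5, 120 |
| | K | 0.0359 | 392.5 | 2.37 | 5, 73 |
| | M | 0.05044 | 421.7 | 2.29 | 5, 112 |
| 7 | K | 0.0116 | 1246 | 2.37 | 5, 73 |
| | M | 0.00358 | 6242 | 2.29 | 5, 112 |
| | N | 0.000992 | 9665 | 2.45 | 5, 48 |
| 8 | K | 0.478 | 15.93 | 2.37 | 5, 73 |
| | H | 0.300 | 34.56 | 2.37 | 5, 74 |
| | C | 0.347 | 43.27 | 2.29 | 5, 115 |
| 9 | L | 0.580 | 17.38 | 2.29 | 5, 120 |
| | F | 0.0391 | 226.3 | 2.45 | 5, 46 |
| | M | 0.0711 | 292.7 | 2.29 | 5, 112 |
| 10 | O | 0.411 | 5.443 | 2.74 | 5, 19 |
| | K | 0.272 | 39.07 | 2.37 | 5, 73 |
| | M | 0.118 | 166.7 | 2.29 | 5, 112 |
| 11 | L | 0.658 | 12.49 | 2.29 | 5, 120 |
| | F | 0.101 | 81.56 | 2.45 | 5, 46 |
| | A | 0.0574 | 141.1 | 2.45 | 5, 43 |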
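As a worked check, the tabulated F values can be recomputed from Λ and n. The sketch below assumes that the F statistic has the form F = [(n + 1 − p)/(p − 1)](1 − Λ)/Λ, a form that is consistent with the stated degrees of freedom (5, n − 5) and that reproduces the tabulated values to within the rounding of Λ; the function name is hypothetical. The n values are taken from Table I, and the Λ values from Tables II and III.

```python
from typing import Tuple


def f_from_lambda(lam: float, n: int, p: int = 6) -> Tuple[float, Tuple[int, int]]:
    """Convert Wilks' lambda to an F value and its degrees of freedom.

    Assumed form: F = ((n + 1 - p) / (p - 1)) * (1 - lam) / lam with
    (p - 1, n + 1 - p) degrees of freedom. When F < 1 the reciprocal is
    returned with the degrees of freedom reversed, as described in the text.
    """
    df1, df2 = p - 1, n + 1 - p
    f = (df2 / df1) * (1.0 - lam) / lam
    if f < 1.0:
        return 1.0 / f, (df2, df1)
    return f, (df1, df2)


# Atom 1, model K (n = 78): Table II lists F = 240.8 with (5, 73) degrees of freedom.
f_1k, df_1k = f_from_lambda(0.0572, 78)

# Atom 10, model O (n = 24): Table II lists F = 5.443 with (5, 19) degrees of freedom.
f_10o, df_10o = f_from_lambda(0.411, 24)

# Atom 12, model K (n = 78): Table III lists F = 1.366 with (73, 5) degrees of freedom,
# i.e., the raw F is below one and the reciprocal is reported.
f_12k, df_12k = f_from_lambda(0.952, 78)
```

The small differences from the tabulated values (for example about 240.6 versus 240.8) arise because Λ is listed to only three significant figures.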

Models I through K were formed from a set of 31 cycloalkanols (6). Included in this set were methylcyclohexanols and trans-decalols.

The Lindeman and Adams models for linear and branched alkanes (2) comprise models L through O. Fifty-nine compounds, ranging from five to nine carbons each, were used in the formation of the models.

A set of 25 atoms was used to evaluate the ability of the methodology to judge model suitability. None of the test atoms was included in the determination of models A through O. The 25 atoms can be divided into two categories. Figure 3 depicts the structural environments of atoms 1-11. In structural terms, atoms 1-11 are very different from any of the atoms used in the calculation of the models. Therefore, the atoms serve as a test of the ability to detect simulations that cannot be performed reliably. For each atom, the 15 values of Λ and F were computed. Table II summarizes the results of this computation. The three smallest F values are listed for each atom, along with Λ, the letter of the corresponding model, the tabulated F (95%), and the corresponding number of degrees of freedom. In each case, the calculated F exceeds the tabulated value by a wide margin. As expected, none of the models are judged suitable for simulating the chemical shifts of atoms 1-11.

Figure 3. Structural environments of atoms used to test the ability to identify unsuitable models.

In contrast, atoms 12-25 are very similar to the atoms used in the calculation of models A through O. The structural environments of these atoms are depicted in Figure 4. Atoms 12-25 serve as a test of the ability of the methodology to select the most suitable model from a group of similar models. The 15 values of Λ and F were computed for each atom, and the three smallest F values were retained. In Table III, the calculated F values are listed, along with the letter of the corresponding model, the value of Wilks' Λ, the tabulated F (95%), and the number of degrees of freedom. An asterisk is placed beside each calculated F that is smaller than the corresponding tabulated value. This indicates a model that is judged to be suitable for simulating the chemical shift of the given atom.

Figure 4. Structural environments of atoms used to test the ability to select among similar models.

Table III. Results of Model Selection Test

| atom | model | Wilks' Λ | F (calcd) | 95% F | deg of freedom | actual shift, ppm | pred shift, ppm | residual, ppm | ref |
|---|---|---|---|---|---|---|---|---|---|
| 12 | K | 0.952 | 1.366* | 4.43 | 73, 5 | 72.2 | 72.8 | -0.6 | 16 |
| | D | 0.157 | 21.54 | 2.71 | 5, 20 | | 73.5 | -1.3 | |
| | N | 0.171 | 46.48 | 2.45 | 5, 48 | | 39.1 | 33.1 | |
| 13 | J | 0.947 | 1.503* | 2.29 | 5, 133 | 39.7 | 40.7 | -1.0 | 17 |
| | B | 0.865 | 6.833 | 2.29 | 5, 219 | | 44.5 | -4.8 | |
| | M | 0.747 | 7.592 | 2.29 | 5, 112 | | 46.0 | -6.3 | |
| 14 | B | 0.977 | 1.056* | 2.29 | 5, 219 | 37.5 | 36.6 | 0.9 | 16 |
| | J | 0.854 | 4.540 | 2.29 | 5, 133 | | 34.2 | 3.3 | |
| | G | 0.698 | 13.16 | 2.29 | 5, 152 | | 34.7 | 2.8 | |
| 15 | C | 0.963 | 1.144* | 4.40 | 115, 5 | 35.8 | 36.1 | -0.3 | 17 |
| | H | 0.760 | 4.687 | 2.37 | 5, 74 | | 38.4 | -2.6 | |
| | K | 0.618 | 9.013 | 2.37 | 5, 73 | | 41.0 | -5.2 | |
| 16 | E | 0.856 | 1.621* | 2.45 | 5, 48 | 43.5 | 44.0 | -0.5 | 17 |
| | K | 0.353 | 26.82 | 2.37 | 5, 73 | | 63.4 | -19.9 | |
| | N | 0.0449 | 204.1 | 2.45 | 5, 48 | | 48.2 | -4.7 | |
| 17 | B | 0.942 | 2.696 | 2.29 | 5, 219 | 31.9 | 30.7 | 1.2 | 18 |
| | J | 0.768 | 8.019 | 2.29 | 5, 133 | | 30.3 | 1.6 | |
| | G | 0.538 | 26.16 | 2.29 | 5, 152 | | 29.9 | 2.0 | |
| 18 | D | 0.443 | 5.023 | 2.71 | 5, 20 | 67.1 | 82.9 | -15.8 | 18 |
| | K | 0.643 | 8.115 | 2.37 | 5, 73 | | 61.1 | 6.0 | |
| | N | 0.0967 | 89.69 | 2.45 | 5, 48 | | 32.5 | 34.6 | |
| 19 | C | 0.956 | 1.060* | 2.29 | 5, 115 | 35.6 | 35.8 | -0.2 | 18 |
| | H | 0.771 | 4.389 | 2.37 | 5, 74 | | 39.0 | -3.4 | |
| | K | 0.554 | 11.77 | 2.37 | 5, 73 | | 38.7 | -3.1 | |
| 20 | G | 0.971 | 1.106* | 4.40 | 152, 5 | 22.2 | 22.0 | 0.2 | 19 |
| | J | 0.973 | 1.338* | 4.40 | 133, 5 | | 21.7 | 0.5 | |
| | B | 0.969 | 1.395* | 2.29 | 5, 219 | | 21.0 | 1.2 | |
| 21 | H | 0.929 | 1.131* | 2.37 | 5, 74 | 37.2 | 35.8 | 1.4 | 20 |
| | K | 0.954 | 1.421* | 4.43 | 73, 5 | | 34.0 | 3.2 | |
| | N | 0.722 | 3.701 | 2.45 | 5, 48 | | 35.0 | 2.2 | |
| 22 | J | 0.950 | 1.395* | 2.29 | 5, 133 | 27.4 | 26.0 | 1.4 | 20 |
| | G | 0.983 | 1.897* | 4.40 | 152, 5 | | 27.7 | -0.3 | |
| | B | 0.931 | 3.247 | 2.29 | 5, 219 | | 23.7 | 3.7 | |
| 23 | M | 0.959 | 1.043* | 4.40 | 112, 5 | 40.2 | 43.8 | -3.6 | 21 |
| | G | 0.934 | 2.159* | 2.29 | 5, 152 | | 40.8 | -0.6 | |
| | J | 0.820 | 5.845 | 2.29 | 5, 133 | | 42.7 | -2.5 | |
| 24 | L | 0.903 | 2.578 | 2.29 | 5, 120 | 17.9 | 17.1 | 0.8 | 22 |
| | F | 0.626 | 5.495 | 2.45 | 5, 46 | | 20.3 | -2.4 | |
| | I | 0.250 | 20.45 | 2.53 | 5, 34 | | 16.0 | 1.9 | |
| 25 | M | 0.9096 | 2.226* | 2.29 | 5, 112 | 34.5 | 33.1 | 1.4 | 22 |
| | G | 0.879 | 4.197 | 2.29 | 5, 152 | | 35.4 | -0.9 | |
| | J | 0.8071 | 6.357 | 2.29 | 5, 133 | | 39.9 | -5.4 | |

The results for each atom were evaluated by applying the three listed models to the simulation of the chemical shift of the atom. In Table III, the observed chemical shift of each atom is listed, along with a literature reference for the shift. All chemical shift values are reported relative to tetramethylsilane. The predicted chemical shifts corresponding to each model are listed, along with the residual or difference between predicted and observed chemical shifts.

Atoms 12-16 produce straightforward results. In each case, one model is judged to be suitable for the simulation. The selected model produces a highly accurate simulated chemical shift in each case. Except for the application of model D to atom 12, the other models produce residuals that are significantly larger than twice their respective standard errors. Model D is based on steroids and hydroxy steroids. While atoms in these compounds exist in much more complex structural environments than the dimethylcyclohexanol environment of atom 12, model D seems to adapt well to the simpler environment of atom 12. On the basis of the structures involved, however, the application of model D to atom 12 is an extrapolation, and it is detected as such by the calculated F value.

Atoms 14-16 exist in an androstanediol. The selected steroid models, B, C, and E, were formed from androstanols and cholestanols, compounds in which atoms are influenced by only one hydroxyl group. The two hydroxyls in the test compound are widely separated, however, and atoms 14-16 are effectively influenced by only one hydroxyl. This is confirmed by the calculated F values and the accuracy of the simulated chemical shifts.

Atoms 17-19 represent the opposite case. The compound containing the atoms is a 3,5-androstanediol. Atoms 17 and 18 are clearly influenced by two hydroxyls. The calculated F values indicate correctly that neither atom is sufficiently similar to the atoms used in the determination of the models. The residuals for atom 17 are quite acceptable, however. Again, this represents a successful extrapolation of the models. The extrapolation is markedly less successful for atom 18, as indicated by the large residuals.
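The decision rule described in the text (reject when every calculated F exceeds the tabulated value; otherwise prefer the suitable model with the largest Λ) and the 2s accuracy criterion can be sketched together. The function names and data layout below are hypothetical; the numbers are the entries for atom 22 in Table III, atom 1 in Table II, and the standard errors of models J, G, and B in Table I.

```python
from typing import Dict, List, Optional, Tuple

# model -> (Wilks' lambda, calculated F, tabulated 95% F)
Rows = Dict[str, Tuple[float, float, float]]


def suitable_models(rows: Rows) -> List[str]:
    """Models whose calculated F is below the tabulated 95% F."""
    return [model for model, (_, f_calc, f_tab) in rows.items() if f_calc < f_tab]


def select_model(rows: Rows) -> Optional[str]:
    """Among the suitable models, choose the one with the largest lambda; None if no model passes."""
    passing = suitable_models(rows)
    if not passing:
        return None
    return max(passing, key=lambda model: rows[model][0])


def within_two_s(residual: float, s: float) -> bool:
    """Accuracy criterion from the text: |residual| no larger than twice the standard error."""
    return abs(residual) <= 2.0 * s


atom_22: Rows = {"J": (0.950, 1.395, 2.29), "G": (0.983, 1.897, 4.40), "B": (0.931, 3.247, 2.29)}
atom_1: Rows = {"K": (0.0572, 240.8, 2.37), "H": (0.0566, 246.6, 2.37), "M": (0.0564, 374.6, 2.29)}

residual_22 = {"J": 1.4, "G": -0.3, "B": 3.7}      # ppm, Table III
std_error = {"J": 0.866, "G": 1.085, "B": 1.065}   # ppm, Table I

choice_22 = select_model(atom_22)   # models J and G pass; G has the larger lambda
choice_1 = select_model(atom_1)     # no model passes
accurate_22 = {model: within_two_s(residual_22[model], std_error[model]) for model in atom_22}
```

For atom 22 the two models that pass the F test are also the two whose residuals fall within 2s, while the rejected model B does not. The rule is only a screening step, and the borderline and geometry-related cases discussed in the text still call for inspection.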


Atom 19 is six bonds removed from the far hydroxyl, and it is determined to be similar to the atoms in the androstanols and cholestanols. The steroid model is judged suitable, and the simulated chemical shift is highly accurate.

An interesting example is provided by atom 20, as models G, J, and B are all judged to be suitable for the simulation. The selection of model B seems logical, given the great similarity of the ergostane environment of atom 20 and the androstane and cholestane environments associated with the model. The atoms used to determine model G also included androstane carbons. Among the atoms used in the formation of model J, the trans-decalol carbons are most similar to atom 20. Atom 20 is near the end of the leftmost ring, giving it an environment similar to carbons in trans-decalin derivatives. The correctness of the model suitability test is confirmed by the accuracy of each of the simulated chemical shifts.

Two models are judged suitable for simulating the chemical shifts of atoms 21 and 22. For atom 21, models H and K are selected. The compound containing the atom is a cis-decalin. The atoms used to determine model H included a number of cis-decalin atoms. The simulated chemical shift resulting from the application of model H is very accurate. Model K does not produce an accurate chemical shift, however. The atoms associated with the model include trans-decalols, but no corresponding cis compounds. The model is inadequate for the simulation, but the limitation of the model is undetected, due to the geometrical differences involved. Atoms in cis- and trans-decalin are topologically identical, and their vector representations are identical in the current scheme. The geometrical differences that distinguish the compounds are not coded by the model suitability methodology. Models J and G are selected for atom 22. The same factors affect the results for this atom, but both of the resulting simulations are accurate.
The difference in the results for atoms 21 and 22 lies in the increased distance between atom 22 and the cis-ring junction. The effect of the difference in geometry is less pronounced for atom 22, and model J is adequate for the simulation.

Another limitation of the present methodology is illustrated by the results for atom 23. Two models are selected as being suitable for the simulation. Model G is appropriate, and the simulation is accurate. Model M, based on linear and branched alkanes, is judged suitable by the calculation, but it is clearly unsuitable for the simulation. The present methodology makes no distinction between cyclic and acyclic carbons. In terms of the number of carbons in the environment and the connectivity of the carbons, atoms in methylcyclohexanes are similar to atoms in branched alkanes. The differences in chemical shifts are due to the cyclic nature of the structure. The derivation of cyclic parameters for use in the computation of e* would alleviate this problem.

Atoms 24 and 25 illustrate an important final point in assessing the results of the model suitability computations. For atom 24, the calculated F for model L is slightly greater than the tabulated value. The model is appropriate, however, and the chemical shift is simulated accurately. For atom 25, model M is judged suitable, producing a calculated F that is just under the cutoff established by the tabulated value. The simulated chemical shift, while still acceptable, is significantly greater than twice the standard error of the model. The results for these atoms indicate that the F-value comparisons should be evaluated judiciously. Models corresponding to F values either slightly greater than or slightly less than the tabulated value should be considered carefully before they are used. Their suitability for the current simulation must be evaluated by the experimenter on a case-by-case basis. While this case prevents complete automation of the model suitability test, it is judged inevitable that certain carbons will assume a borderline position with regard to their suitability for use with a given model.
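One way an implementation might surface such borderline cases for manual review is to flag any calculated F that lies within a chosen fraction of the tabulated value. The sketch below is an assumption, not part of the published procedure: the function name and the 15% margin are hypothetical, and the F values are taken from Tables II and III.

```python
def classify_fit(f_calc: float, f_tab: float, margin: float = 0.15) -> str:
    """Label an F comparison as 'suitable', 'unsuitable', or 'borderline'.

    margin is a hypothetical tolerance: comparisons whose ratio f_calc / f_tab
    lies within +/- margin of 1.0 are flagged for inspection by the experimenter.
    """
    ratio = f_calc / f_tab
    if abs(ratio - 1.0) <= margin:
        return "borderline"
    return "suitable" if ratio < 1.0 else "unsuitable"


label_24_L = classify_fit(2.578, 2.29)  # atom 24, model L: slightly above the cutoff
label_25_M = classify_fit(2.226, 2.29)  # atom 25, model M: just under the cutoff
label_12_K = classify_fit(1.366, 4.43)  # atom 12, model K: well under the cutoff
label_18_D = classify_fit(5.023, 2.71)  # atom 18, model D: well above the cutoff
```

With this margin, the two cases singled out in the text (atom 24 with model L and atom 25 with model M) are both flagged, while clear acceptances and rejections pass through unchanged. The appropriate width of the margin would have to be established empirically.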

CONCLUSIONS

The methodology described in this paper enables a highly automated spectrum simulation system to be constructed. While certain limitations remain, the majority of questions regarding model suitability can now be answered without user intervention. While the complete generality of the approach cannot be established until a larger library of models has been developed, our results suggest that the practicality of applying spectrum simulation to routine analyses has been significantly enhanced by the methodology presented here.

The ease of implementing the methodology is predicated largely on the availability of structure-handling software. If a potential user has access to capabilities for structure entry and storage, the computation of e* and Λ should be straightforward. A detailed procedure for the computation of e* has been given previously (11). Software for the computation of covariance matrices and determinants is commonly available (23).

LITERATURE CITED

(1) Grant, D. M.; Paul, E. G. J. Am. Chem. Soc. 1964, 86, 2984-2989.
(2) Lindeman, L. P.; Adams, J. Q. Anal. Chem. 1971, 43, 1245-1252.
(3) Ejchart, A. Org. Magn. Reson. 1980, 13, 368-371.
(4) Ejchart, A. Org. Magn. Reson. 1981, 15, 22-24.
(5) Small, G. W.; Jurs, P. C. Anal. Chem. 1983, 55, 1121-1127.
(6) Small, G. W.; Jurs, P. C. Anal. Chem. 1983, 55, 1128-1134.
(7) Brugger, W. E.; Jurs, P. C. Anal. Chem. 1975, 47, 781-784.
(8) Stuper, A. J.; Jurs, P. C. J. Chem. Inf. Comput. Sci. 1976, 16, 99-105.
(9) Stuper, A. J.; Brugger, W. E.; Jurs, P. C. "Computer Assisted Studies of Chemical Structure and Biological Function"; Wiley-Interscience: New York, 1979; pp 83-90.
(10) Allinger, N. L.; Tribble, M. T.; Miller, M. A.; Wertz, D. H. J. Am. Chem. Soc. 1971, 93, 1637-1648.
(11) Small, G. W.; Jurs, P. C. Anal. Chem. 1984, 56, 1314-1323.
(12) Wilks, S. S. Sankhya, Ser. A 1963, 25, 407-428.
(13) Belsley, D. A.; Kuh, E.; Welsch, R. E. "Regression Diagnostics: Identifying Influential Data and Sources of Collinearity"; Wiley-Interscience: New York, 1980.
(14) Small, G. W.; Jurs, P. C. Anal. Chem. 1984, 56, 2307-2314.
(15) Smith, D. H.; Jurs, P. C. J. Am. Chem. Soc. 1978, 100, 3316-3321.
(16) Pehk, T.; Kooskora, H.; Lippmaa, E. Org. Magn. Reson. 1976, 8, 5-10.
(17) Grover, S. H.; Stothers, J. B. Can. J. Chem. 1974, 52, 870-878.
(18) Konno, C.; Hikino, H. Tetrahedron 1976, 32, 325-331.
(19) Balogh, B.; Wilson, D. M.; Burlingame, A. L. Nature (London) 1971, 233, 261-263.
(20) Dalling, D. K.; Grant, D. M.; Paul, E. G. J. Am. Chem. Soc. 1973, 95, 3718-3724.
(21) Dalling, D. K.; Grant, D. M. J. Am. Chem. Soc. 1967, 89, 6612-6622.
(22) Schwarz, R. M.; Rabjohn, N. Org. Magn. Reson. 1980, 13, 9-13.
(23) "International Mathematical and Statistical Library", 8th ed.; IMSL: Houston, TX, 1980.

RECEIVED for review May 3, 1984. Accepted June 25, 1984. This work was supported by the National Science Foundation, under Grant CHE-8202620. The PRIME 750 computer used in this research was purchased with partial financial support of the National Science Foundation. Portions of this paper were presented at the 35th Pittsburgh Conference and Exposition on Analytical Chemistry and Applied Spectroscopy, Atlantic City, NJ, March 5, 1984.