Determination of topological similarity of carbon atoms in the


Anal. Chem. 1984, 56, 1314-1323

Determination of Topological Similarity of Carbon Atoms in the Simulation of Carbon-13 Nuclear Magnetic Resonance Spectra

Gary W. Small and Peter C. Jurs*

Department of Chemistry, The Pennsylvania State University, 152 Davey Laboratory, University Park, Pennsylvania 16802

Carbon-13 NMR chemical shifts of small molecules are used to derive parameters which enable the chemical environments of carbon atoms to be encoded. A multidimensional vector approach is used to compare one carbon environment to another, resulting in a quantitative measure of structural similarity. A variety of chemical structures are used to test the proposed methodology. Principal components projection plots and the results of hierarchical clustering analysis are used in the evaluation. The usefulness of this methodology as a diagnostic tool in the simulation of carbon-13 NMR spectra is described.

Carbon-13 nuclear magnetic resonance spectroscopy (13C NMR) is applied routinely in the solution of organic structure-elucidation problems because of the unique information it provides about the carbon skeleton of a compound. The widespread use of the technique has prompted increased interest in the development of methodology for improving the efficiency of 13C NMR data interpretation. Given a spectrum to interpret, it is often useful to have reference spectra available for comparison. In many cases, however, desired comparisons are impossible due to the unavailability of certain spectra. Spectrum simulation techniques are computational procedures which enable the chemist to generate approximate spectra in such cases. One approach to 13C NMR spectrum simulation involves the construction of linear models relating structural features to observed 13C NMR chemical shifts. These models have the form

\[ S = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_p X_p \qquad (1) \]

where S is the predicted chemical shift of a given carbon atom, the X_i terms are numerical descriptors which encode structural features of the chemical environment of the atom, the b_i terms are coefficients determined from a multiple linear regression analysis of a set of unambiguously assigned chemical shifts, and p denotes the number of descriptors in the model. The early work using this approach focused on linear and branched alkanes and produced excellent results (1, 2). Several trends characterize the majority of attempts to extend this work to more complex chemical systems: (1) individual models are forced to be very specific to a class of carbon atoms, requiring the use of numerous models to simulate complete spectra; (2) simple, hand-calculable structural descriptors are largely inadequate for modeling complex systems where geometrical and electronic information are needed; and (3) the overall simulation approach is very cumbersome to implement routinely for complex molecules, due to the large number of models required and the need for complex descriptors. To minimize these problems, we have developed an interactive computer system that implements each of the tasks required to perform 13C NMR spectrum simulation with linear models (3, 4). With a computer-based approach, it is straightforward to compute complex structural descriptors and to develop and apply as many models as needed to simulate the spectra of complex molecules.

As capabilities are developed for simulating the spectra of complex molecules, one is forced to deal with larger and larger sets of carbon atoms. In the model formation stage of the analysis, the standard practice has been to include in the computations each structurally distinct carbon atom in each compound. If large molecules are being studied, this strategy can easily result in an overwhelming number of atoms to be processed. At best, the analysis will be computationally expensive. Often, however, the storage capacity of the user's computer will be insufficient.

The limitations of performing the analysis in this manner can be seen clearly in Figure 1. The structures labeled A and B are 5α-androstan-1β-ol and 5α-cholestan-1β-ol, respectively. They differ only in the presence of an alkyl side chain attached to atom 17 in the cholestanol. 13C NMR chemical shifts are given for the same six atoms in both structures. The chemical shifts were taken from a collection published by Eggert and co-workers (5). These chemical shifts, and all others reported in this work, are referenced to tetramethylsilane (Me4Si). Both spectra were recorded under the same experimental conditions. Within the accepted experimental error of 0.1 ppm, atoms 3, 5, and 11 have identical chemical shifts in both structures. The chemical environments of these atoms in both structures are effectively equivalent. The side chain in the cholestanol is distant enough to have no discernible effect on their chemical shifts. Atoms 13, 14, and 16 in the cholestanol are significantly influenced by the presence of the side chain, however, as shown by the differences between the cholestanol and androstanol chemical shifts. Each of the 12 atoms described here would be included in the model formation stage of a simulation study, if standard procedures were used.
Atoms 3, 5, and 11 in the cholestanol represent effective duplicates, however, if the corresponding atoms in the androstanol are included. The presence of duplicates adds unneeded observations to the analysis, and it serves to give undesired weight to the duplicate observation in the determination of the regression coefficients. The elimination of such duplicates can significantly reduce the number of atoms to be processed, thereby greatly increasing both the efficiency and the feasibility of the analysis. When many atoms are involved, there is no practical visual method for determining which atoms to keep and which to remove from consideration. Therefore, a computational procedure must be employed.

The problem can be defined in terms of perceiving the degree of topological structural similarity that exists between the environments of two carbon atoms. Most of the previous work in this area has involved the detection of uniqueness, rather than similarity, with the goal of estimating the number of 13C NMR resonances produced by a given molecule (6-8). An exact definition of structural similarity is required before a computation can be implemented. For example, if two structures differ only by the substitution of a chlorine for a fluorine, how similar (in a quantitative sense) are the environments of various carbon atoms in the two structures? The answer to questions of this type is provided by the intended application of the method. A carbon atom with a fluorine directly attached will have a far different chemical shift than one in which the fluorine is replaced by chlorine. A quantitative measure of structural similarity can be based on the magnitudes of the effects of atoms in determining 13C NMR chemical shifts. In this paper, a procedure is presented for encoding the chemical environment of an atom such that it can be compared with the similarly coded environment of another atom. The numerical result of this comparison is a quantitative measure of the structural similarity (in 13C NMR terms) of the two atoms.

Figure 1. Chemical shifts for selected atoms in 5α-androstan-1β-ol (A) and 5α-cholestan-1β-ol (B).

0003-2700/84/0358-1314$01.50/0 © 1984 American Chemical Society

ANALYTICAL CHEMISTRY, VOL. 56, NO. 8, JULY 1984

EXPERIMENTAL SECTION

Chemical structures used in evaluating the similarity measure were input and stored via a graphical approach developed for the ADAPT software system (9, 10). All programs were written in FORTRAN, and graphics capabilities were implemented by using Tektronix PLOT-10 software. The MINITAB statistical software system (11) was used to perform the regression analyses. All computations were performed with a PRIME 750 computer operating in the Department of Chemistry at The Pennsylvania State University.

RESULTS AND DISCUSSION

Computation of Structural Similarity. A quantitative measure of structural similarity requires that the chemical environments of atoms be encoded such that numerical comparisons are possible. For applications to 13C NMR problems, there are two additional requirements: (1) structural features closer to the carbon center of interest should be given greater weight than those farther away, as those features closer to the carbon center will have greater influence in determining its chemical shift; and (2) each atomic species should be weighted according to the magnitude of its influence on chemical shifts. In the previous example, fluorine should be given greater weight than chlorine because its presence induces a greater change in observed 13C NMR chemical shifts. The term atomic species is used to specify an element in a particular bonding configuration. A given element can exist as one of several different species. The above requirements are met by encoding the chemical environment of a carbon center, C_a, as

\[ E_a = (e_0, e_1, e_2, \ldots, e_n) \qquad (2) \]

where E_a is an (n + 1)-dimensional vector describing the environment of C_a. The elements of E_a, e_i, represent the chemical environment at a distance i bonds from C_a. The carbon center itself, C_a, is described by e_0. Typically, n is set to five. In most molecules, effects on chemical shifts are seldom significant through more than five bonds. The environment of C_a can be compared with that of another carbon center, C_b, by computing the Euclidean distance between E_a and E_b, where the smaller the distance, the more similar C_a is to C_b. If the environments of C_a and C_b are identical within a distance of n bonds, the Euclidean distance between E_a and E_b will be zero. At a distance of i bonds from C_a, we define

\[ e_i = \frac{\sqrt{\sum_{j=1}^{p_i} Z_j^{2}}}{d^{3}} \qquad (3) \]

where Z_j is an atom code describing the effect on chemical shifts of the jth of p_i total atoms located i bonds from C_a, and d = i (i > 0); d = 1 (i = 0). Larger values of Z_j are given greater weight by the root-sum-of-squares (RSS) calculation. For the case of large molecules in which many Z_j values contribute to the sum, roundoff error can cause slight structural differences to be obscured. This problem can be minimized in such cases by substituting a linear sum for the RSS calculation. The function of d in eq 3 is to weight the effects of atoms closer to C_a more than those of atoms farther away. The cubic term was derived empirically by studying pairs of atoms whose environments produce small Euclidean distances in the vector comparison. Use of the cubic term seems to minimize the occurrence of large chemical shift differences between atoms judged similar by the distance computation.

The Z_j values comprise the most important elements in successfully measuring structural similarity. A simple approach would be to assign arbitrary codes to the various atomic species (e.g., C = 1, N = 2, O = 3, etc.). Shelley and Munk have used this procedure to determine structural uniqueness (12), but the approach is not useful in measuring similarity. In addition, it provides no means for weighting atoms according to their effects on 13C NMR chemical shifts. In the work presented here, observed 13C NMR chemical shifts are used to derive three sets of parameters, P_α, M_β, and C_β, which are used in the assignment of Z_j values. Each P_α parameter serves as a starting value for Z_j of a given atomic species. The chemical environment of an atomic species has a profound effect on the manner in which the species influences chemical shifts. The parameter sets, M_β and C_β, are used to modify the initial Z_j values to account for effects of the immediate chemical environment.

For a given atomic species, let us define P_α as the change in observed chemical shift when the atomic species replaces hydrogen in a given reference compound. Thus

\[ P_\alpha = \delta_\alpha - \delta_{\mathrm{ref}} \qquad (4) \]

where δ_α is the chemical shift of the carbon attached to the species of interest and δ_ref is the chemical shift of the reference carbon where hydrogen replaces the species. For example, P_α for one oxygen species is simply the observed chemical shift of methanol (13) vs. that of methane (14) (49.9 - (-2.3) = 52.2 ppm).

Besides atom type, bonding (hybridization) and connectivity must be characterized in order to distinguish each different atomic species. Primary carbons that are sp3 hybridized are distinct from primary, sp2 hybridized carbons. The P_α value for the sp3 species is the chemical shift of ethane vs. methane (8.0 ppm (14)). For the primary, sp2 carbon, the P_α value is the chemical shift of ethene vs. methane (124.4 ppm (15)).
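The arithmetic of eq 4 can be illustrated with a minimal Python sketch. It is not part of the FORTRAN software described in the paper; the function name `p_alpha` and the dictionary layout are illustrative assumptions. The shifts are the values quoted in the text and listed in Table I.

```python
# Sketch of eq 4: P_alpha = (shift of the carbon bearing the species) - (reference shift).
# Names are illustrative assumptions; shifts (ppm vs. Me4Si) come from the text and Table I.

METHANE_SHIFT = -2.3  # reference compound 1 in Table I


def p_alpha(alpha_shift, ref_shift):
    """Return the P_alpha parameter (ppm) for one atomic species."""
    return alpha_shift - ref_shift


# species label -> observed shift of the carbon attached to that species
ALPHA_SHIFTS = {
    "O, single bond, primary (methanol)": 49.9,
    "C, single bond, primary (ethane)": 5.7,
    "C, double bond, primary (ethene)": 122.1,
}

P_ALPHA = {label: round(p_alpha(shift, METHANE_SHIFT), 1)
           for label, shift in ALPHA_SHIFTS.items()}

for label, value in P_ALPHA.items():
    print(f"{label}: P_alpha = {value} ppm")
```

Under these inputs the three results agree with the 52.2, 8.0, and 124.4 ppm values given above.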


Table I. Derivation of P_α Parameters

atom  bond(a)  conn(b)  α compd             lit.  solv(c)  α shift, ppm  ref(d)  P_α, ppm
C     1        1        CH3-CH3             14    1        5.7           1       8.0
C     1        2        CH3-CH2CH3          14    1        17.2          1       19.5
C     1        3        CH3-CH(CH3)2        14    1        24.6          1       26.9
C     1        4        CH3-C(CH3)3         14    1        31.4          1       33.7
C     2        1        CH2=CH2             15    1        122.1         1       124.4
C     2        2        CH2=CH(CH3)         15    1        114.7         1       117.0
C     2        3        CH2=C(CH3)(C2H5)    17    2        108.7         1       111.0
C     2*       2        benzene             13    1        128.5         2       101.0
C     2*       3        toluene(e)          18    3        129.1         2       101.6
C     3        1        CH≡CH               13    1        71.9          1       74.2
C     3        2        CH≡CCH3             19    1        66.9          1       69.2
O     1        1        CH3-OH              13    1        49.9          1       52.2
O     1        2        CH3-OCH3            14    1        59.2          1       61.5
O     2        1        CH3HC=O             18    2        199.7         3       194.0
N     1        1        CH3-NH2             20    4        28.1          1       30.4
N     1        2        CH3-NHCH3           20    4        38.2          1       40.5
N     1        3        CH3-N(CH3)2         20    4        47.1          1       49.4
N     2*       2        pyridine(f)         18    2        149.8         2       122.3
N     3        1        CH3C≡N              18    3        117.7         3       112.0
F     1        1        F-CH2CH3            14    1        78.0          3       72.3
Cl    1        1        Cl-CH2CH3           16    1        40.5          3       34.8
Br    1        1        Br-CH2CH3           16    1        27.4          3       21.7
I     1        1        I-CH2CH3            16    1        -0.8          3       -6.5

(a) Bonding codes: 1, single bond; 2, double bond; 2*, aromatic bond; 3, triple bond. (b) Connectivity codes: 1, primary; 2, secondary; 3, tertiary; 4, quaternary. (c) Solvent codes: 1, neat; 2, CDCl3; 3, dioxane; 4, D2O. (d) Reference compounds (shift, solvent, literature reference): 1, methane (-2.3 ppm, cyclohexane, (14)); 2, cyclohexane (27.5 ppm, neat, (13)); 3, ethane (5.7 ppm, neat, (14)). (e) The chemical shift of the ortho carbon is used. (f) The chemical shift of the carbon attached to the nitrogen is used.

In order to distinguish connectivity, separate P_α codes must be derived for primary, secondary, tertiary, and quaternary atomic species, as needed. When a secondary atom replaces hydrogen in the reference compound, there will, by definition, be an atomic species two bonds from the carbon whose chemical shift determines the P_α value. The error introduced by this complication is minimized by defining primary, sp3-hybridized carbon (methyl) as the only species that can be attached to the species whose P_α value is being determined. The effects of any methyl species are therefore included in the derivation of the other parameters. For example, the P_α value for a secondary, sp3-hybridized carbon is defined as the methyl chemical shift (vs. methane) in n-propane (19.5 ppm (14)), while the parameter for a secondary, sp2-hybridized carbon is the chemical shift (vs. methane) of atom 1 in 1-propene (117.0 ppm (15)).

The guidelines outlined above were used to derive P_α codes for many common atomic species. These results are presented in Table I. Several atomic species have been omitted due to a lack of available 13C NMR data. For example, no parameters for sulfur or phosphorus species are included. Each species in Table I is identified by atom type, bonding, and connectivity designations. In the table, α compd refers to the compound containing the species whose effect is being characterized. A literature reference for the 13C NMR spectrum of the compound is given, and the solvent used in the data collection is indicated. The solvent data are given for informational purposes only. No attempts were made to correct for solvent differences. The errors associated with these differences are assumed to be small. The chemical shift used to measure the effect of the current species is listed (α shift), and the atom associated with the shift is italicized. For each species, a reference compound is given. The α and reference chemical shifts were used in eq 4 to compute the P_α values.

Methane was used as the reference compound in the derivation of the majority of the P_α values. Several exceptions are worth noting, however. Separate parameters were desired for aromatic carbon and nitrogen species. The α compounds used were six-membered ring compounds. It was decided to use the cyclohexane chemical shift as the reference, as it is the corresponding saturated six-membered ring compound. Ethane was used as the reference for one oxygen and one nitrogen species, as the α compounds available had methyl groups attached to the carbon of interest. Ethane was also used as the reference for the four halogen parameters. The α chemical shifts were taken from haloethanes, rather than the corresponding halomethanes. The chemical shifts of the methane derivatives produce P_α values that are judged to be too heavily weighted. For example, the heavy-halogen effect in iodomethane produces a chemical shift of -22.5 ppm (14). This represents a stronger shielding effect than is often seen in other compounds, although the possible subjectivity of this choice is acknowledged.

The P_α values allow each atomic species to be distinguished. In many cases, this is insufficient to recognize distinct carbon environments. The two structures depicted in Figure 2 provide an example. The structures labeled A and B are 2,3,3-trimethylhexane and 2,2,5-trimethylhexane, respectively. Observed 13C NMR chemical shifts are given for one secondary carbon in each structure. The spectra of both structures were recorded under the same experimental conditions and were reported by Lindeman and Adams (2). The two atoms are in distinct, yet similar, structural environments, as evidenced by their chemical shift difference of 1.1 ppm. In the figure, the surrounding atoms in each structure are labeled to indicate their bond distance from the center of interest. Note that the P_α codes fail to distinguish the two indicated carbons. Both centers are secondary carbons, and each is attached to secondary and quaternary carbons. Located two bonds away are three methyl groups and a tertiary carbon. In each case, two methyl groups are found three bonds away. If the appropriate P_α codes are assigned as the Z_j values in eq 3, the resulting environment vectors, E_a and E_b, are identical.

Figure 2. Chemical shifts for atom 4 in 2,3,3-trimethylhexane (A) and atom 3 in 2,2,5-trimethylhexane (B). Atoms are numbered by their distance in the molecule from the circled atom.

This example illustrates the inadequacy of a single code (P_α) for Z_j. A given atomic species will exert a certain effect on chemical shifts, but the magnitude of the effect will be perturbed by the chemical environment of the species. The effect of the quaternary carbon in structure A is different from that of the corresponding atom in B because there are different atoms surrounding the two centers. The atom in A is attached to two methyl groups, a secondary carbon, and a tertiary carbon. In B, the quaternary carbon is attached to three methyl groups and a secondary carbon. Environmental effects can be implemented in the computations by modifying the P_α code for an atomic species based on the surrounding atoms. For simplicity, we choose to limit the definition of surrounding atoms to those directly attached. A new atom code, P_β, can then be defined as

\[ P_\beta = P_\alpha' \qquad (5) \]

where P_α' represents the modification of the basic P_α code for the atom. The best approach to computing P_β is to derive parameters that encode how each species alters the effects of every other species. This can be described as the β effect of the species. A complete treatment would require a separate parameter for each interaction. Two limitations prevent the implementation of this approach: (1) there are insufficient data available to describe each interaction; and (2) if the data were available, the resulting parameter set would be large and unwieldy to apply. To overcome these limitations, the available data can be used to form a model describing the β effect of each species. The resulting model can be used in place of the individual parameters. In addition, it can be used to predict the effects of interactions for which data are not available.

As an example, Table II presents the available data describing the β effects of chlorine. Six compounds are listed, along with solvent and literature reference information for their 13C NMR spectra. The chemical shift is given for the indicated carbon in each compound. In each case, this carbon is bonded to an atomic species whose effect is being altered by an attached chlorine. The reported chemical shift can be used with the shift of a reference compound to derive the actual P_β for the effect of chlorine on the atomic species. The last column in the table lists the corresponding P_α values. A comparison of the P_β and P_α values reveals almost exact linear correlation (R > 0.999). This suggests a model of the form

\[ P_\beta = M_\beta P_\alpha + C_\beta \qquad (6) \]

where the multiplicative and additive constants, M_β and C_β, are derived from a simple regression analysis of the available P_α values. This approach has the advantage that only two additional parameters, M_β and C_β, are needed to encode the β effects of a species.

Table III summarizes the calculation of β parameters for each appropriate species. Each species is identified by atom type, bonding, and connectivity codes, as in Table I. The derived M_β and C_β parameters are presented, along with statistics describing the regression analyses and literature references for the 13C NMR data used. The statistics given are: (1) n, the number of chemical shifts used in the regression; (2) R, the resulting linear correlation coefficient describing the regression; and (3) s, the standard error (in ppm) between the predicted and observed P_β values. The variability in n is due to a lack of data for the description of certain interactions. For several species, no statistics are given, as only one or two data points were available. The derived parameters have no statistical significance in such cases, but they can be used if appropriate caution is employed.

For three atomic species, only null parameters are given (M_β = 1.0, C_β = 0.0). These species are (1) primary, sp3-hybridized carbon; (2) primary, sp-hybridized carbon; and (3) primary, triply bonded nitrogen. As mentioned previously, the effects of primary, sp3-hybridized carbons have been included in the derivation of the other parameters. Therefore, no separate β parameters are applicable. The P_α values of sp-hybridized species were determined as the chemical shift of the bonded carbon (also sp hybridized). The determination of β parameters for these species requires two consecutive triple bonds. As this construct does not exist, no β parameters were computed. The β effect has been treated as a single interaction in the derivation of M_β and C_β parameters.

Table II. β Effects of Chlorine

compound            lit.  solv(a)  shift, ppm  ref(b)  P_β, ppm  P_α, ppm
CH3-CH2-Cl          16    1        19.2        1       21.5      19.5
CH3-CH(Cl)(CH3)     21    2        27.3        1       29.6      26.9
CH3-C(Cl)(CH3)2     18    2        34.4        1       36.7      33.7
CH2=CH-Cl           22    1        116.0       1       118.3     117.0
CH2=C(Cl)(CH3)      23    1        112.0       1       114.3     111.0
chlorobenzene(c)    18    3        128.6       2       101.1     101.6

(a) Solvent codes: 1, neat; 2, CDCl3; 3, dioxane. (b) Reference compounds (shift, solvent, literature reference): 1, methane (-2.3 ppm, cyclohexane, (14)); 2, cyclohexane (27.5 ppm, neat, (13)). (c) The chemical shift of the ortho carbon is used.
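The regression behind eq 6 can be sketched in a few lines of Python using the six chlorine entries of Table II. The paper used MINITAB for this step; the helper `fit_line` and the variable names below are illustrative assumptions, and the fit is an ordinary least-squares line.

```python
# Sketch: least-squares fit of eq 6, P_beta = M_beta * P_alpha + C_beta,
# to the chlorine data of Table II. Function and variable names are illustrative.
import math

P_ALPHA_CL = [19.5, 26.9, 33.7, 117.0, 111.0, 101.6]   # P_alpha column of Table II
P_BETA_CL = [21.5, 29.6, 36.7, 118.3, 114.3, 101.1]    # P_beta column of Table II


def fit_line(x, y):
    """Return (slope, intercept, correlation coefficient) of a simple linear regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / math.sqrt(sxx * syy)


m_beta, c_beta, r = fit_line(P_ALPHA_CL, P_BETA_CL)
print(f"M_beta = {m_beta:.3f}, C_beta = {c_beta:.2f}, R = {r:.4f}")
```

With these six points the slope comes out close to 0.99 and the correlation coefficient exceeds 0.999, in line with the correlation quoted in the text.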
Clearly, secondary, tertiary, and quaternary atoms are affected by more than one interacting species. Each contributing effect can be computed by use of the appropriate α and β parameters. The overall atom code, Z_j in eq 3, is computed as the root-mean-square (RMS) average of the individual P_β contributions

\[ Z_j = \sqrt{\frac{1}{a_j} \sum_{k=1}^{a_j} P_{\beta,k}^{2}} \qquad (7) \]


Table III. Derivation of β Parameters

atom C

C C C C C C

C C C

C 0 0 0 N N N N N F

c1 Br I

bond a conn 1 1 1 1

2 2 2 2* 2* 3 3

1.00

2 3 4

1.021 1.039 1.049 0.649 1.048 1.028

1

2 3 2 3

2" 3

9 10

9

S

0.999 0.999 0.998

1.091 1.808 2.425 6.771 1.450

1

-3.12 -5.82

4 3 3

0.990 0.999

1.ooo

1.000

0.00

1.033

-3.20 0.00 -4.36 5.49 4.013

11

0.992

5.031

4 4 6

0.999 0.999 0.992

0.949 1.114 4.823

0.987 0.994 0.998

4.300 3.031 1.854

0.999 0.999 0.999

1.457 1.996 2.978

1.068 0.813 0.733 0.0410 0.798 0.816 0.865 -1.000

0.00

1

9.13 3.81 -1.23 1.97

6 6 6

1.000

0.00

0.853 0,988 1.013 1.069

-1.026 2.76 3.11 4.25

1 1 1 1 1

R

0.00

2

2 3 2

-

lit. refs

0.00

-3.41 -6.076 -9.78

1.000

1 1

1 1 1

n

CP

1

2

2

1 1 1 1

1

1

1 1

MP

0.000

2

2 6 5 5

2, 15, 18, 20, 24, 25, 26 2, 17, 20, 26, 27, 28

2, 17, 20, 25, 26, 28 29 18, 30, 31, 32 31, 32, 33 13, 18 18, 28, 32, 33, 35, 36 16, 27, 29, 33 18, 34 18, 26, 31, 37 38 20, 39, 40 18, 20, 40 20, 36, 40 18 14, 39 16, 18, 21, 22, 23 18, 41, 42, 43 18, 43, 44

(a) Bonding codes: 1, single bond; 2, double bond; 2*, aromatic bond; 3, triple bond. (b) Connectivity codes: 1, primary; 2, secondary; 3, tertiary; 4, quaternary.

Table IV. Computation of e_1 for Structures A and B

Structure A

atom code              P_α    attached species  M_β     C_β      P_β
Z_1 (quaternary C)     33.7   primary C         1.000   0.00     33.7
                              primary C         1.000   0.00     33.7
                              secondary C       1.021   -3.41    31.0
                              tertiary C        1.039   -6.076   28.9
                              Z_1 (RMS)                          31.9
Z_2 (secondary C)      19.5   primary C         1.000   0.00     19.5
                              secondary C       1.021   -3.41    16.5
                              Z_2 (RMS)                          18.1

RSS = 36.66; d = 1; e_1 = 36.66

Structure B

atom code              P_α    attached species  M_β     C_β      P_β
Z_1 (quaternary C)     33.7   primary C         1.000   0.00     33.7
                              primary C         1.000   0.00     33.7
                              primary C         1.000   0.00     33.7
                              secondary C       1.021   -3.41    31.0
                              Z_1 (RMS)                          33.1
Z_2 (secondary C)      19.5   secondary C       1.021   -3.41    16.5
                              tertiary C        1.039   -6.076   14.2
                              Z_2 (RMS)                          15.4

RSS = 36.45; d = 1; e_1 = 36.45
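The e_1 computation summarized in Table IV, and the vector comparison reported in Table V, can be reproduced with a short Python sketch. This is an illustration rather than the authors' FORTRAN implementation: the function names, dictionary layout, and the way each neighbouring atom is described by a list of its attached species are assumptions made here. The sketch applies eq 6 to each attachment, takes the RMS average to obtain Z_j, and then applies eq 3.

```python
# Sketch of eq 3 and eq 6 plus the RMS atom code for the two carbons of Figure 2.
# Parameter values are read from Tables I and IV; all names are illustrative.
import math

# P_alpha (ppm) for sp3 carbon species, keyed by connectivity (Table I)
P_ALPHA = {"primary": 8.0, "secondary": 19.5, "tertiary": 26.9, "quaternary": 33.7}
# (M_beta, C_beta) of the attached sp3 carbon species (Table IV)
BETA = {"primary": (1.000, 0.00), "secondary": (1.021, -3.41), "tertiary": (1.039, -6.076)}


def atom_code(species, attached):
    """Z_j: RMS average of the P_beta values (eq 6) over the atoms attached to atom j."""
    p_a = P_ALPHA[species]
    p_betas = [BETA[s][0] * p_a + BETA[s][1] for s in attached]
    return math.sqrt(sum(p * p for p in p_betas) / len(p_betas))


def shell_term(atom_codes, bonds):
    """e_i (eq 3): root-sum-of-squares of the Z_j values divided by d**3."""
    d = bonds if bonds > 0 else 1   # d = 1 is used for the carbon center itself
    return math.sqrt(sum(z * z for z in atom_codes)) / d ** 3


def euclidean(u, v):
    """Euclidean distance between two environment vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Atoms one bond from the circled carbon, each listed with its own attached species.
shell_a = [atom_code("quaternary", ["primary", "primary", "secondary", "tertiary"]),
           atom_code("secondary", ["primary", "secondary"])]
shell_b = [atom_code("quaternary", ["primary", "primary", "primary", "secondary"]),
           atom_code("secondary", ["secondary", "tertiary"])]

e1_a = shell_term(shell_a, bonds=1)
e1_b = shell_term(shell_b, bonds=1)

# Full six-component vectors as listed in Table V.
E_A = [13.90, 36.66, 3.10, 0.12, 0.00, 0.00]
E_B = [13.90, 36.45, 3.23, 0.12, 0.00, 0.00]
distance = euclidean(E_A, E_B)

print(round(e1_a, 2), round(e1_b, 2), round(distance, 2))   # 36.66 36.45 0.25
```

The computed e_1 values match the tabulated 36.66 and 36.45, and the distance between the Table V vectors rounds to the 0.25 quoted in the text.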

where a_j is the number of atoms attached to atom j and P_β,k is the P_β value for the current attachment, computed using eq 6. The RMS average is used because it gives slightly greater weight to effects of larger magnitude.

Tables IV and V illustrate the computation of E_a and E_b for the indicated atoms in structures A and B of Figure 2. In Table IV, a detailed computation is given for e_1 in both structures. Two atoms (p_i in eq 3) contribute to the e_1 term in each case. As discussed previously, the P_α values are the same in both structures. The fact that the β terms are different results in different Z_j values, and therefore in different e_1 values. Thus, the secondary and quaternary carbons are found to be distinct, yet similar. Each component of E_a and E_b is listed in Table V. The last two components are zero in each case, as no atoms occur four or five bonds from the indicated carbon centers. A comparison of E_a and E_b produces a relatively small Euclidean distance (0.25), further confirming both the uniqueness and similarity of the two centers.

Table V. Comparison of E_a and E_b

e_i   A       B
0     13.90   13.90
1     36.66   36.45
2     3.10    3.23
3     0.12    0.12
4     0.00    0.00
5     0.00    0.00

Evaluation of Methodology. Figure 3 depicts 35 structures containing carbon centers in a variety of chemical environments. Within each structure, triangles indicate the carbon centers which give rise to distinct 13C NMR chemical shifts. Those carbon atoms not indicated represent duplicate centers within the molecule. Each indicated atom is labeled with an identifying sequence number from 1 to 112. Table VI lists the name of each structure, the literature reference and solvent information for its 13C NMR spectrum, and the corresponding sequence numbers in Figure 3. In Table VII, the chemical shifts are given for each of the 112 indicated atoms. The six-dimensional E vectors were computed for each of the carbon centers. The structures are varied enough that the computation required the use of each α and β parameter in Tables I and III. Thus, the 112 centers comprise a representative sampling of common chemical environments.

Table VI. Chemical Structures Used for Evaluation

seq no.   structure name                                 solv(a)  lit.
1-4       3-methyl-3-ethylpentane                        3        2
5-10      2,3,3-trimethylpentane                         3        2
11-18     2,3-dimethylheptane                            3        2
19-23     4,4-dimethylheptane                            3        2
24-24     cyclohexane                                    1        46
25-27     trans-decalin                                  1        47
28-30     trans-1,4-dimethylcyclohexane                  1        46
31-34     1,4-dibromo-2-chloro-1,1,2-trifluorobutane     5        45
35-36     2-iodo-2-methylpropane                         4        44
37-37     1,2-diiodoethane                               1        16
38-41     2-bromobutane                                  2        18
42-43     1,2,3-trichloropropane                         3        18
44-45     1,1,2-trichloroethane                          2        18
46-47     1,1,2-tribromoethane                           2        18
48-51     2-butanone                                     3        18
52-55     ethyl trifluoroacetate                         2        18
56-57     acetic acid                                    2        18
58-59     acetaldehyde                                   2        18
60-61     trifluoroacetic acid                           2        18
62-64     propene oxide                                  3        18
65-66     tetrahydrofuran                                2        18
67-69     1-propanol                                     3        18
70-72     1,2-propanediol                                2        18
73-74     diethylamine                                   2        18
75-78     1-amino-3-methoxypropane                       2        18
79-81     2-amino-2-methyl-1-propanol                    2        18
82-85     1-hydroxy-2-butyne                             3        18
86-88     3-chloro-1-propene                             3        18
89-91     2-propenenitrile                               3        18
92-93     acetonitrile                                   3        18
94-99     1-hexyne                                       2        18
100-102   pyridine                                       2        18
103-106   phenol                                         2        18
107-107   benzene                                        1        13
108-112   toluene                                        3        18

(a) Solvent codes: 1, neat; 2, CDCl3; 3, dioxane; 4, cyclohexane; 5, no report.

The evaluation of the proposed methodology requires insight regarding the relationships among the E vectors. Effectively, this is a study of the clustering that exists in the six-dimensional space defined by the E vectors. Those vectors representing atoms in similar chemical environments should lie near each other in space, while those vectors representing atoms in disparate environments should lie far apart. The degree to which this is true is a measure of the usefulness of the methodology.

The evaluation is best begun by concentrating on atoms in very similar structural environments. Atoms 1-30 in Figure 3 meet this criterion. Each atom in each structure is an sp3-hybridized carbon. Therefore, no heteroatom or hybridization effects are present. Several approaches can be taken to the study of the E vectors representing atoms 1-30. The Euclidean distances between each pairwise combination of vectors could be computed. This approach requires the comprehension of 435 distances, however. A visual approach is preferable, but the high dimensionality of the space prevents a straightforward plot. To overcome this difficulty, the techniques of principal components analysis can be employed. Given points in an n-dimensional space, one can compute the "best" m-dimensional approximation for the points, where m < n. The term "best" refers to the m-dimensional space in which the relationships among the points most closely resemble those in the n space. This computation is termed the Karhunen-Loeve transformation (48). If the data are such that an effective two-dimensional approximation can be computed, a simple plot can be constructed which provides

visual information regarding the clustering that exists among the points. This plot is called a principal components projection plot. Two factors in the present application aid the success of a two-dimensional approximation of the six-dimensional data: (1) greater numerical weight is given to the lower dimensions in the scheme; and (2) the molecular structures comprising the present example are relatively small. Few atoms are found at distances of four and five bonds. Thus, the effective dimensionality of the space is less than six.

The computation described above was applied to the 112 E vectors. The success of the approximation is indicated by the fact that greater than 95% of the variance in the six-dimensional data is explained by the two-dimensional approximation. Figure 4 is a principal components projection plot of the vectors of atoms 1-30. Each point is labeled with the corresponding sequence number of the atom. Four broad regions are apparent, with smaller clusters observed within each region. The broad regions correspond to primary, secondary, tertiary, and quaternary carbons, moving from the lower left to the upper right of the figure. An inspection of the relevant chemical shifts in Table VII reveals a general ordering of the points in each region. Those atoms with larger chemical shifts seem to be found to the right of those with smaller shifts. Points in individual clusters seem clearly grouped by structural similarity. We conclude from this study of very similar atoms that the E vectors effectively encode chemical structural information such that the vectors of structurally similar atoms lie close together in space. In order to measure the effects of increasing

Table VII. Chemical Shifts of Indicated Atoms

seq no.  shift, ppm    seq no.  shift, ppm    seq no.  shift, ppm    seq no.  shift, ppm
1        7.5           29       35.9          57       20.6          85       3.2
2        30.6          30       23.0          58       30.7          86       118.1
3        23.2          31       119.6         59       199.7         87       134.6
4        34.8          32       22.2          60       163.0         88       45.1
5        32.6          33       40.4          61       115.0         89       117.5
6        23.3          34       110.3         62       47.6          90       108.1
7        17.1          35       39.5          63       18.1          91       137.5
8        35.1          36       41.1          64       47.3          92       1.3
9        7.9           37       29.2          65       67.9          93       117.7
10       34.9          38       26.0          66       25.8          94       68.1
11       32.2          39       53.1          67       10.5          95       84.5
12       13.8          40       34.2          68       26.3          96       18.1
13       15.2          41       12.1          69       64.0          97       30.7
14       19.0          42       45.3          70       67.7          98       21.9
15       23.1          43       59.0          71       68.2          99       13.5
16       30.0          44       70.4          72       18.7          100      135.7
17       34.0          45       50.1          73       44.1          101      123.6
18       38.8          46       40.3          74       15.4          102      149.8
19       44.8          47       38.7          75       39.6          103      115.4
20       14.9          48       207.6         76       33.6          104      154.9
21       32.8          49       29.0          77       70.9          105      129.7
22       17.3          50       36.5          78       58.4          106      121.0
23       27.0          51       8.0           79       50.6          107      128.5
24       27.4          52       115.3         80       26.7          108      129.1
25       44.2          53       158.1         81       71.1          109      137.7
26       34.7          54       64.7          82       50.5          110      128.3
27       27.2          55       13.8          83       80.0          111      125.5
28       32.9          56       178.1         84       78.9          112      21.2
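The two-dimensional principal components projection described in the text can be sketched as follows. This is only an illustration under stated assumptions: the paper does not describe its numerical procedure, so a plain power iteration with deflation on the covariance matrix is used here in place of whatever eigen-solver the authors employed. The first two vectors are taken from Table V; the remaining three are hypothetical values added solely so that the example has enough points to project.

```python
# Sketch: two-dimensional principal components projection of six-dimensional E vectors.
# Power iteration with deflation stands in for a library eigen-solver; names are illustrative.
import math


def center(rows):
    """Subtract the column means from every row."""
    dim = len(rows[0])
    means = [sum(r[k] for r in rows) / len(rows) for k in range(dim)]
    return [[r[k] - means[k] for k in range(dim)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of mean-centered rows."""
    dim = len(centered[0])
    n = len(centered)
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
            for a in range(dim)]


def leading_eigenpair(matrix, iterations=1000):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    dim = len(matrix)
    vec = [1.0 + 0.1 * k for k in range(dim)]   # arbitrary non-zero start vector
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:                        # no variance left in any direction
            return 0.0, vec
        vec = [x / norm for x in nxt]
    value = sum(vec[a] * sum(matrix[a][b] * vec[b] for b in range(dim))
                for a in range(dim))
    return value, vec


def project_2d(rows):
    """Return (2-D coordinates, fraction of the total variance explained)."""
    centered = center(rows)
    cov = covariance(centered)
    dim = len(cov)
    total = sum(cov[k][k] for k in range(dim))
    axes, values = [], []
    for _ in range(2):
        value, vec = leading_eigenpair(cov)
        axes.append(vec)
        values.append(value)
        # deflation: remove the component just found before searching for the next one
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(dim)]
               for a in range(dim)]
    coords = [[sum(r[k] * axis[k] for k in range(dim)) for axis in axes]
              for r in centered]
    return coords, sum(values) / total


E_VECTORS = [
    [13.90, 36.66, 3.10, 0.12, 0.00, 0.00],   # structure A, Table V
    [13.90, 36.45, 3.23, 0.12, 0.00, 0.00],   # structure B, Table V
    [8.00, 19.50, 2.10, 0.30, 0.00, 0.00],    # hypothetical vector
    [26.90, 24.40, 4.00, 0.50, 0.10, 0.00],   # hypothetical vector
    [33.70, 16.00, 5.20, 0.60, 0.10, 0.00],   # hypothetical vector
]

COORDS, EXPLAINED = project_2d(E_VECTORS)
for point in COORDS:
    print([round(c, 2) for c in point])
print(f"variance explained: {EXPLAINED:.3f}")
```

In the projected plane the two Table V vectors remain close neighbours while the hypothetical vectors fall elsewhere, which is the qualitative behaviour a projection plot is meant to reveal. The explained-variance figure printed here applies only to this toy data set, not to the 112 vectors of the paper.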
