Computer-assisted interpretation of carbon-13 nuclear magnetic

Nov 1, 1977
Computer-Assisted Interpretation of Carbon-13 Nuclear Magnetic Resonance Spectra Applied to Structure Elucidation of Natural Products

Hugh B. Woodruff,¹ Charles R. Snelling, Jr., Craig A. Shelley, and Morton E. Munk*

Department of Chemistry, Arizona State University, Tempe, Arizona 85281

The reduction of chemical and spectroscopic data to their structural implications is a major component of the computer model of the structure elucidation process developed at Arizona State University. This paper compares five classifiers for binary 13C-NMR spectral data. While selection of the best technique for spectral interpretation is largely dependent upon the particular situation, several trends are evident from this investigation. If a thorough enough data set exists and an ample supply of computer time is available, a search of the data set seeking the most similar member should prove valuable. When dealing with binary data, the Tanimoto similarity measure correctly classifies unknown compounds more often than the conventional distance measure (Nearest Neighbor). Of the three nonsearching procedures investigated, maximum likelihood and distance from the mean have comparable predictive abilities, and both procedures correctly classify spectra more frequently than the dot product classifier. Examples are presented which demonstrate that the manner in which questions are posed greatly affects predictive ability.

A major problem confronting the natural products chemist is the need to deduce the molecular structure of an unknown molecule rapidly and reliably. While it is true that no two natural products chemists practice the science and art of structure elucidation in exactly the same way, certain common characteristics are apparent. An early step in the process is the reduction of chemical and physical data to their structural implications. These structural implications constitute the familiar partial structure, an expression of known structural fragments and unaccounted-for atoms that summarizes the status of the structure problem at any given stage. The chemist is guided by the partial structure or by some or all of the molecular structures consistent with it in designing new experiments. The final solution of the problem may be described as the cyclic process through these steps that leads to the reduction of structural fragments and atoms in the partial structure to the one correct molecular structure.

The chemist's intuition is frequently a valuable asset in determining the correct pathway for combining the structural fragments and residual atoms of the partial structure to form molecular structures. However, care must be taken not to overlook a valid combination pathway, especially in dealing with the relatively complex molecules of nature. To assist the chemist in the structure elucidation process and to relieve the chemist of the tedious task of manually assembling molecular structures, several computerized molecule assemblers have been developed that ensure that all chemically feasible molecules are considered (1-6).

The current status of CASE (Computer-Assisted Structure Elucidation), a highly interactive and continually evolving network of computer programs developed at Arizona State University and designed to accelerate and make more reliable the entire process of structure elucidation, is summarized in Figure 1. In developing a computer model of the structure elucidation process, our attention was focused on two of its major components. The first component is the expansion of a partial structure to all molecular structures consistent with it and any other information available to the chemist. This expansion is achieved by a unique molecule assembler (6, 7). The second major component is the reduction of chemical and spectroscopic data to their structural implications. This task is presently shared by the chemist and the computer. An IR interpreter designed specifically for application to multifunctionalized molecules is at an advanced stage of development and fully operational (8, 9). The development of programs for the automated interpretation of other spectroscopic information is at an earlier stage. This paper describes the results of an investigation on computer-assisted interpretation of 13C-NMR spectral data.

¹ Present address: Merck Sharp & Dohme Research Laboratories, Rahway, N.J. 07065.

ANALYTICAL CHEMISTRY, VOL. 49, NO. 13, NOVEMBER 1977

Figure 1. Diagram of CASE (Computer-Assisted Structure Elucidation) network

GOALS OF THE STUDY

The enormous sensitivity of 13C chemical shifts to structural changes is making 13C-NMR spectroscopy a valuable tool in structure elucidation work (10, 11). Wilkins and co-workers (12-15) have reported encouraging results using linear learning machines and a committee threshold logic unit on 13C-NMR spectra. The present investigation has a twofold purpose. First, pattern recognition techniques that have not previously been applied to 13C-NMR data are studied. By comparing the trends found in this study to results from previous investigations on IR (16) and MS (17) data, observations can be made on the efficacy of the various pattern recognition techniques. Second, the general philosophy of the CASE project is to design programs that will prove to be valuable to the chemist at the bench in solving actual structure elucidation problems. Observations will be made on the effect this philosophy has on the types of questions posed to the pattern recognition programs.

DATA SET

Most of the data used in this study were supplied in a computer readable form by C. L. Wilkins. These 2229 spectra were augmented by spectra obtained from the literature (10, 18-23), resulting in a file that eventually totaled 2471 spectra. The spectra were stored in a binary (peak/no peak) format. Each spectrum was divided into 200 one-ppm intervals. If a peak appeared in a given interval, a 1 was recorded. If no peak appeared in that interval, a 0 was entered. Chemical shifts were measured relative to tetramethylsilane (TMS). Peaks appearing downfield of 200 ppm or upfield of TMS (i.e., negative ppm values) were placed in the 199-200 and 0-1 ppm intervals, respectively.

Initially, three classifications believed to be of value to the natural products chemist were attempted: nucleoside (or nucleotide)/non-nucleoside (53 file members were nucleosides or nucleotides), carbohydrate/non-carbohydrate (171 carbohydrates, of which 53 were the nucleosides), and steroid/non-steroid (110 steroids). Subsequent questions were posed and they will be described later in this paper. All programs were coded in FORTRAN IV and run on the Arizona State University Univac 1110 computer.
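The binary storage format described in this section can be sketched in a few lines. This is a minimal Python illustration, not the authors' FORTRAN IV code. The function name `encode_spectrum`, the zero-based interval indexing, and the rule that a peak lying exactly on an interval boundary goes to the higher interval are assumptions made for the example.

```python
import math

N_INTERVALS = 200  # one-ppm intervals covering 0-200 ppm relative to TMS


def encode_spectrum(shifts_ppm):
    """Return a 200-element peak/no-peak list for a list of chemical shifts.

    Index j stands for the interval from j to j + 1 ppm. Shifts beyond
    200 ppm are folded into the last interval (199-200 ppm) and negative
    shifts into the first (0-1 ppm), as described in the text.
    """
    bits = [0] * N_INTERVALS
    for shift in shifts_ppm:
        j = math.floor(shift)                # assumption: boundary peaks go up
        j = min(max(j, 0), N_INTERVALS - 1)  # clamp out-of-range peaks
        bits[j] = 1
    return bits


# Hypothetical shifts, chosen only to exercise the clamping rules.
spectrum = encode_spectrum([-1.2, 35.4, 35.9, 210.0])
print(sum(spectrum), spectrum[0], spectrum[35], spectrum[199])
```

Two peaks that fall in the same one-ppm interval (35.4 and 35.9 ppm here) collapse to a single 1, so the encoded spectrum records peak positions but not peak counts or intensities.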

THEORY

Five different pattern classifiers are tested in this investigation. Two of the classifiers are based upon a search and compare scheme. Classification is achieved simply by measuring the similarity of an unknown spectrum to each member of a reference file of known spectra (i.e., spectra for which the correct classification is known). The unknown is predicted to belong to the same class as the most similar file member.

The other three techniques require a training step. By using a training set of known spectra, the class conditional probabilities for each of the 200 features (1-ppm intervals) can be approximated (24). A class conditional probability, $p_{ji}$, is the probability that a peak appears in interval j, given that the spectrum belongs to class i. If a set of spectra belonging to class i exists, the average spectrum for that class approximates the 200 probabilities:

$$p_{ji} = \frac{1}{m_i} \sum_{n=1}^{m_i} x_{nj}$$

where $x_{nj}$ = 0 or 1 (the value of the jth interval of the nth spectrum in class i) and $m_i$ = number of spectra in class i. The average spectrum (weight vector) for class i is denoted by the vector $W_i$, whose jth component is $p_{ji}$.
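The training step defined above amounts to averaging the binary spectra of a class interval by interval. The following is a small sketch of that calculation. The function name `average_spectrum` and the four-interval training set are hypothetical and serve only to keep the example readable.

```python
def average_spectrum(spectra):
    """Estimate p_ji for one class as the mean of its binary spectra."""
    if not spectra:
        raise ValueError("at least one training spectrum is required")
    m = len(spectra)               # m_i, the number of spectra in the class
    n_features = len(spectra[0])   # 200 in the paper, 4 in this toy example
    return [sum(s[j] for s in spectra) / m for j in range(n_features)]


# Hypothetical four-interval training set for a single class.
training = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]
weight_vector = average_spectrum(training)
print(weight_vector)
```

Each component of the result is the fraction of training spectra with a peak in that interval, which is the estimate of the class conditional probability used by the three trained classifiers.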

Now that the preceding definitions have been presented, brief summaries of the five techniques are given below with references to the source of more detailed descriptions.

Nearest Neighbor (Hamming Distance) (25, 26). The most frequently employed measure of the similarity between two spectra is the Euclidean distance. The unknown is predicted to belong to the same class as its nearest neighbor in the reference file. When dealing with binary data, the same ordering of near neighbors is obtained whether the Euclidean distance or the Hamming distance is measured. Since the latter measure is a more easily implemented computer operation, it is used as the similarity measure. The Hamming distance is simply equal to the number of mismatching intervals when two spectra are compared. The exclusive OR (ORE) operation yields the mismatching intervals (Table I), hence the Hamming distance between spectra X and Y is

$$D_{XY} = \sum_{j=1}^{200} (x_j\ \mathrm{ORE}\ y_j) \qquad (4)$$

(Again, $x_j$ and $y_j$ are either 0 or 1.) If X is the unknown spectrum, then it is predicted to belong to the same class as reference spectrum Y for which $D_{XY}$ is the minimum.

Table I. Boolean Logic Operators
x   y   x ORE y (exclusive OR)   x ORI y (inclusive OR)
0   0   0                        0
0   1   1                        1
1   0   1                        1
1   1   0                        1

Tanimoto Similarity Measure (26, 27). An alternative measure when dealing with binary data is the Tanimoto similarity measure ($S_{XY}$). To obtain the Tanimoto measure from the Hamming distance, the latter is normalized by the number of intervals containing peaks in either or both spectra (a value obtained by the ORI operation). The resulting value is a measure of the dissimilarity between two spectra, i.e., how many mismatches occurred divided by the number of possible mismatches. The Tanimoto measure is the complement of this dissimilarity measure.

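$$S_{XY} = 1 - \frac{D_{XY}}{k} \qquad (5)$$

where k = number of intervals containing a peak in either or both spectra. In Boolean logic terminology

$$S_{XY} = 1 - \frac{\sum_{j=1}^{200} (x_j\ \mathrm{ORE}\ y_j)}{\sum_{j=1}^{200} (x_j\ \mathrm{ORI}\ y_j)}$$

which, because every interval counted by ORI is counted by exactly one of ORE and AND, can be shown to become

$$S_{XY} = \frac{\sum_{j=1}^{200} (x_j\ \mathrm{AND}\ y_j)}{\sum_{j=1}^{200} (x_j\ \mathrm{ORI}\ y_j)}$$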

The unknown spectrum X is predicted to belong to the class of Y for which $S_{XY}$ is the maximum.

Dot Product (28). For each question posed, two weight vectors, one being the average class spectrum ($W_c$) and the second being the average non-class spectrum ($W_n$), are obtained by the method described above. For unknown spectrum X, discriminant functions $K_c(X)$ and $K_n(X)$ are obtained by dot products.

$$K_c(X) = W_c \cdot X, \qquad K_n(X) = W_n \cdot X \qquad (7)$$

If $K_c(X) > K_n(X)$, X is predicted to be a member of the class. If $K_n(X) > K_c(X)$, X is predicted to be a non-class member.

Distance from the Mean (28). $W_c$ and $W_n$ are obtained as above, but instead of calculating dot products, discrimination is achieved by means of a distance measure.

$$D_i(X) = \left[ (X - W_i) \cdot (X - W_i) \right]^{1/2} \qquad (8)$$

where i = c or n in the present terminology. X is predicted to belong to the class for which $D_i(X)$ is the smallest.

Maximum Likelihood (29, 30). The maximum likelihood approach requires an assumption that the peak positions be statistically independent. While this assumption is not true (for example, the presence of a carbonyl functionality affects the chemical shifts of neighbor carbon atoms as well as the carbonyl carbon atom), the vast number of higher order terms that must otherwise be included requires that it be made. When used on binary infrared data where the assumption of statistical independence among the peak positions is also false, the results achieved by the maximum likelihood discriminator were equal to or superior to results obtained by any other technique investigated (16), thus its inclusion as a discriminator in this investigation seems justified. The maximum likelihood estimate, $G_i(X)$, is calculated by combining the joint probabilities for peak presence ($R_i$) and peak absence ($Q_i$)

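$$T_i = R_i \cdot Q_i = \prod_{j=1}^{200} p_{ji}^{x_j} (1 - p_{ji})^{1 - x_j} \qquad (9)$$

and then taking the log of $T_i$.

$$
\begin{aligned}
G_i(X) = \log T_i &= \log \prod_{j=1}^{200} p_{ji}^{x_j} (1 - p_{ji})^{1 - x_j} \\
&= \sum_{j=1}^{200} \left[ x_j \log p_{ji} + \log (1 - p_{ji}) - x_j \log (1 - p_{ji}) \right] \\
&= \sum_{j=1}^{200} x_j \log \left( \frac{p_{ji}}{1 - p_{ji}} \right) + \sum_{j=1}^{200} \log (1 - p_{ji})
\end{aligned}
$$

The resulting maximum likelihood is linear in $x_j$, and the discriminant function is simply the dot product of the spectrum and the appropriate weight vector.

$$G_i(X) = \sum_{j=1}^{200} x_j w_{ji} + \sum_{j=1}^{200} \log (1 - p_{ji})$$

where $x_j$ = 0 or 1 and

$$w_{ji} = \log \frac{p_{ji}}{1 - p_{ji}}$$

The unknown is placed into the class for which $G_i(X)$ is maximum.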
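The three trained discriminants of Eq. 7 to 9 can be compared side by side on a toy problem. The sketch below is illustrative only. The function names, the three-interval weight vectors, and the clipping constant `eps` are assumptions made for the example, and ties are arbitrarily assigned to the non-class.

```python
import math


def dot_product(w, x):
    """K_i(X) of Eq. 7: only intervals where x_j = 1 contribute."""
    return sum(wj * xj for wj, xj in zip(w, x))


def distance_from_mean(w, x):
    """D_i(X) of Eq. 8: Euclidean distance to the class average spectrum."""
    return math.sqrt(sum((xj - wj) ** 2 for wj, xj in zip(w, x)))


def log_likelihood(w, x, eps=1e-6):
    """G_i(X), the log of Eq. 9.

    Clipping p_ji to [eps, 1 - eps] is an assumption made here so that the
    logarithm stays finite; the text does not say how 0 or 1 were handled.
    """
    total = 0.0
    for pj, xj in zip(w, x):
        p = min(max(pj, eps), 1.0 - eps)
        total += math.log(p) if xj else math.log(1.0 - p)
    return total


def predict(w_class, w_nonclass, x, method):
    """Return "class" or "non-class" for binary spectrum x."""
    if method == "dot":
        in_class = dot_product(w_class, x) > dot_product(w_nonclass, x)
    elif method == "distance":
        in_class = distance_from_mean(w_class, x) < distance_from_mean(w_nonclass, x)
    elif method == "likelihood":
        in_class = log_likelihood(w_class, x) > log_likelihood(w_nonclass, x)
    else:
        raise ValueError(f"unknown method: {method}")
    return "class" if in_class else "non-class"


# Hypothetical weight vectors: class members usually have peaks in all three
# intervals, non-class members mostly in the first one only.
w_c = [0.9, 0.9, 0.9]
w_n = [0.8, 0.1, 0.1]
x = [1, 0, 0]  # unknown with a single peak in the first interval
for method in ("dot", "distance", "likelihood"):
    print(method, predict(w_c, w_n, x, method))
```

With these made-up numbers the dot product assigns the unknown to the class because it sees only the first interval, while the distance and likelihood discriminants also penalize the two missing peaks and assign it to the non-class.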

RESULTS AND DISCUSSION

Tanimoto vs. Nearest Neighbor. Previous work on infrared data indicated a slight superiority in classification ability by the Tanimoto measure over nearest neighbor (16, 26). However, the results from those studies were too comparable to support the conclusion that distance as a measure of similarity between two binary spectra has no value. Yet, classification by a similarity measure is relatively time-consuming, since for each unknown tested, n measurements must be made (n = number of spectra in reference file). Thus, a preliminary study was performed. The initial set of 2229 spectra was used as the reference file. From Ref. 23, 45 nucleosides were selected for the test. Each of the 45 spectra was compared to each member of the reference file by both the distance and Tanimoto measures. By Tanimoto, 41 of the 45 spectra had nucleosides or carbohydrates as the most similar spectrum. The results by the conventional nearest neighbor technique, a distance measurement, were substantially poorer. Only 17 of the 45 nucleosides were correctly classified.

An example illustrates the problem. The spectrum of 2-thiocytidine contained 9 peaks. Four reference spectra, fluoromethane, tetrabromoethene, 1H-tetrazole, and bicyclo[2.2.1]hepta-2,5-diene, tied as the nearest neighbor of 2-thiocytidine; all were 8 units away. The first three spectra each contained only 1 peak, which matched one of the unknown's peaks. The fourth spectrum had 3 unique peak positions, 2 of which matched with the unknown. Needless to say, the chemist would be dissatisfied with these results. With 2-thiocytidine, three reference spectra tied as most similar by the Tanimoto measure. All three compounds were carbohydrates. Their similarity measures were 0.25, whereas the similarity measures for the nearest neighbors to 2-thiocytidine were 0.2 for the bicyclic compound and 0.11 for the other three.
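The numbers in the 2-thiocytidine example follow directly from the two definitions, as the short sketch below shows. Only the peak counts and match counts are taken from the text. The peak positions, the helper names, and the six-peak reference (a stand-in for a spectrum with a similarity of 0.25) are hypothetical.

```python
def hamming(x, y):
    """Number of mismatching intervals (sum of x_j XOR y_j), Eq. 4."""
    return sum(a ^ b for a, b in zip(x, y))


def tanimoto(x, y):
    """Shared peaks divided by intervals with a peak in either spectrum."""
    shared = sum(a & b for a, b in zip(x, y))
    either = sum(a | b for a, b in zip(x, y))
    return shared / either if either else 1.0  # assumption: two empty spectra match


def spectrum(peak_intervals, n=200):
    """Build a binary spectrum with 1s at the given interval indices."""
    bits = [0] * n
    for j in peak_intervals:
        bits[j] = 1
    return bits


unknown = spectrum(range(10, 100, 10))             # 9 peaks, like 2-thiocytidine
one_peak = spectrum([10])                          # 1 peak, matching the unknown
three_peaks = spectrum([10, 20, 150])              # 3 peaks, 2 matching
six_peaks = spectrum([10, 20, 30, 150, 160, 170])  # 6 peaks, 3 matching

references = [("one_peak", one_peak), ("three_peaks", three_peaks), ("six_peaks", six_peaks)]
for name, ref in references:
    print(name, hamming(unknown, ref), round(tanimoto(unknown, ref), 2))
```

The one-peak and three-peak references are both 8 units from the unknown, yet their Tanimoto similarities are only 0.11 and 0.2. The six-peak reference is farther away by Hamming distance (9 units) but ranks first by Tanimoto similarity (0.25) because it shares more peaks with the unknown.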
Using distance as a measure of similarity biases the results very heavily in favor of spectra with few peaks. Two spectra each containing two peaks and with none of the four peaks matching would be 4 units apart. Two other spectra each containing 50 peaks, 47 of which matched, would be 6 units apart. Yet, few chemists would say the totally non-matching spectra were more similar than the two spectra with 47 matching peaks. It is for just this reason that the Tanimoto similarity measure is normalized by the number of intervals containing peaks. Based on the findings from this preliminary study, the Tanimoto measure was selected as the similarity measure of choice during subsequent testing. In addition, in order to avoid using excessive amounts of computer time, only 750 non-class spectra were employed for Tanimoto searching.

Results. The results obtained by the four classification techniques for the nucleoside, carbohydrate, and steroid questions are presented in Table II. The format of the table is similar to a recent publication by Soltzberg et al. (31). Their publication consolidates arguments by other workers (32) as well as their own arguments on the relative merits of various performance evaluators. One point of agreement that seems to be gaining acceptance is that the weighted percent correct figure should be abandoned as a measure of performance (30-33). As Lowry et al. (30) have stated, a classifier that always says no (usually the non-class is more populous) is seldom useful even if frequently correct. Certainly if one desires to use a single percentage value to evaluate success, the average percent correct figure seems more indicative of classifier ability.

Detailed definitions of the various probability measures are presented elsewhere (31, 32), so only brief descriptions will be included here. The earlier notation has been retained; thus "1" denotes the class (nucleoside, carbohydrate, or steroid), "2" the non-class, "j" (ja) means the classifier predicted the pattern is a class member, and "n" (nein) that the pattern is predicted to belong to the non-class. The class conditional probabilities for each prediction are p(j|1) and p(j|2); that is, given the pattern belongs to class 1, the probability of a correct prediction is p(j|1). The a posteriori probabilities for patterns belonging to class 1 and class 2 are p(1|j) and p(2|n), respectively. Given that the pattern is predicted to belong to class 1, p(1|j) is the probability that the prediction is correct. I(A, B) is the information gain in units of bits as proposed by Rotter and Varmuza (32). Soltzberg et al. (31) demonstrate that I(A, B) can suffer from the defect of being test set dependent in certain situations and propose a figure of merit, M, which is the information gain relative to the maximum possible information gain imposed by the composition of the test set.

Table II. Classification Results
Column headings: Classifier; Category; Av. % correct/100; p(j|1); p(n|2); p(1|j); p(2|n); I(A,B), bits; M
Classifiers: Tanimoto similarity; Dot product; Distance from mean; Maximum likelihood
Categories: Nucleoside; Carbohydrate; Steroid

Table III. Example Comparing Dot Product and Distance Classifiers
Labels: X; W1; W2; Dot product; Distance

Values of Tables II and III as extracted, in source order: 0.03 0.10; 0.22 0.26 0.04; 0.64 0.91 0.96; 0.9996 0.997 0.996; 0.12; 0.48 0.82 0.94; 1.00; 0.998; 0.11 0.30; 0.98 0.98 0.98; 0.96 0.97; 0.96 0.97; 0.997 0.99; 1.00; 0.997 0.99 0.99; 0.96; 1.00; 0.53 0.59 0.53; 1.00 1.00 1.00; 0.66 0.63 0.14; 0.06 0.17 0.05; 0.82 0.95 0.98; 0.98 0.96 0.92; 0.99 0.99 0.998; 0.74 0.91 0.97; 1.00; 0.98; 0.97 0.98 0.997; 1.00; 0.90; 0.90; 0,J1; 0 0.00; 0; 0.10; 0.10; 0.9; 1.00 1.00 1.00; 0.11 1 0.10; 0.89 0.87 0.94; 0.31 0.60 0.52; 1.00; 0.01; 0.31; 0.22; 0.25; 0.76 0.84 0.83 0.70 0.81 0.94

Dot Product vs. Distance. Comparison among the values in the figure of merit (M) column in Table II revealed several trends. The most obvious feature was the relatively poor performance of the dot product as a classifier of binary patterns. These findings were not surprising since it has been argued previously (28) that the dot product classifier considers only intervals containing peaks in reaching a decision. Yet frequently the absence of a peak in a certain region of the spectrum is of more value to the chemist than peak presence. An example is shown in Table III. X is the unknown spectrum while W1 and W2 are two weight vectors. The values of the dot product and distance discriminators are indicated. By dot product, X is predicted to belong to class 1, whereas the distance from the mean approach predicts class 2 membership for the unknown. Most chemists would agree that class 2 appears the more realistic selection. Class 1 members have peaks in the second and third intervals 90% of the time, but the unknown has no peak in either interval. The probabilities for those same intervals are considerably lower in class 2 spectra. However, the dot product ignores contributions by spectrum intervals containing zeros.

Interestingly, for all three questions in Table II, the dot product gave p(2|n) = 1, i.e., any time a spectrum was predicted to be a non-class member, that prediction was correct. By necessity, for each case p(j|1) = 1 also (all class members were correctly identified). While it is true that all 110 steroids were correctly identified, 2050 of the 2361 non-steroids were also predicted to be steroids. Since the dot product ignores peak absence information, peak presence is accordingly favored. The average steroid spectrum contained 21.3 peaks; the average non-steroid spectrum contained 5.7 peaks. Thus, the fact that so many spectra were predicted to be steroids was less surprising. One method of at least partially compensating for the obvious biasing toward the steroid average spectrum is to normalize each average spectrum so the sum of its 200 probabilities equals the same value as the sum from all other average spectra. Following such a normalization, by dot product all steroids were still correctly identified and now 1156 non-steroids were erroneously predicted to be steroids (M = 0.17). The improvement in the figure of merit was considerable, but the technique still cannot compete with the other classifiers.

Nucleoside Question. A second trend was that performance on the nucleoside question was quite consistently poorer than performance on the other questions by the non-searching techniques. A closer examination of the results indicated that the blame must be placed on the manner in which the question was posed. Using the distance classifier, 52 of the 53 nucleosides were correctly identified, but 29 non-nucleosides were predicted to belong to the nucleoside class. Of these 29 problem compounds, 26 were carbohydrates. Since nucleosides also contain carbohydrate functionality, the reason for the relatively poor performance was more apparent. When the nucleoside question was posed in a somewhat better fashion, performance improved remarkably.
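The performance measures described in this section can be computed from the four counts of a two-class test. The sketch below is not the authors' program. It assumes that I(A, B) is the mutual information between true class and prediction and that M is I(A, B) divided by the entropy of the class composition, which is one reading of the verbal definition given in the text. The first set of counts is the distance classifier result quoted above for the nucleoside question (with 2471 - 53 = 2418 non-nucleosides); the second is the two-step result reported further on in the text.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def performance(tp, fn, fp, tn):
    """Measures for a class/non-class test; both classes must be present.

    tp, fn: class members predicted "j" and "n"
    fp, tn: non-class members predicted "j" and "n"
    """
    n_class, n_non = tp + fn, fp + tn
    total = n_class + n_non
    said_j, said_n = tp + fp, fn + tn

    h_prior = entropy([n_class / total, n_non / total])
    h_post = 0.0
    for n_pred, class_in_pred in ((said_j, tp), (said_n, fn)):
        if n_pred:
            q = class_in_pred / n_pred
            h_post += (n_pred / total) * entropy([q, 1 - q])
    info = h_prior - h_post  # assumed form of I(A, B)

    return {
        "avg_correct": (tp / n_class + tn / n_non) / 2,
        "p(j|1)": tp / n_class,
        "p(n|2)": tn / n_non,
        "p(1|j)": tp / said_j if said_j else None,
        "p(2|n)": tn / said_n if said_n else None,
        "I(A,B)": info,
        "M": info / h_prior,  # assumed form of the figure of merit
    }


one_step = performance(tp=52, fn=1, fp=29, tn=2389)
two_step = performance(tp=52, fn=1, fp=1, tn=2417)
for label, result in (("one-step", one_step), ("two-step", two_step)):
    print(label, {key: round(value, 4) for key, value in result.items()})
```

Under these assumptions the one-step counts give M of about 0.76 and the two-step counts about 0.95, which agrees with the figures of merit quoted in the text for the distance classifier.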
Rather than attempting to discriminate nucleosides from all non-nucleosides, a class which contains other carbohydrates, one would have been better advised to do a two-step process. First carbohydrates should be separated from non-carbohydrates, followed by discrimination among those spectra predicted to be carbohydrates (nucleosides vs. other carbohydrates). Using the distance classifier, the 53 nucleosides and 118 non-nucleosides among the carbohydrate class were completely separated (M = 1.0). The results for the nucleoside question using the two-step process are presented below. Seven carbohydrates, one of which was a nucleoside, were predicted to be non-carbohydrates by the distance classifier. A total of 180 spectra were predicted to be carbohydrates (164 actually were carbohydrates and 16 were non-carbohydrates). Next, predictions were made on these 180 spectra concerning whether or not they were nucleosides. Fifty-three were predicted to be nucleosides (52 correct and 1 error). As a result, using the two-step process, 52 of 53 nucleosides were correctly identified, the same findings as reported earlier using the one-step nucleoside/non-nucleoside question. However, unlike before when 29 of the non-nucleoside spectra were in error, by the two-step process only 1 error occurred among the non-class spectra. The resulting figure of merit was 0.95, substantially improved over the value of 0.76 from Table II.

Similar two-step questions were asked of steroids. Once an unknown was predicted to be a steroid, an attempt was made to determine whether or not it was an aromatic steroid (M = 0.92) and/or whether or not it contained a seven carbon tail (M = 0.54). Certainly it would be unwise to ask these questions using the entire data set. While most aromatic steroids might be identified correctly, many non-aromatic steroids, but compounds which still were steroids, would most likely have been wrongly predicted to be class members.

Table IV. Figures of Merit as Obtained by the Distance Classifier for Varying Numbers of Features
No. of features   Carbohydrate   Steroid
1                 0.22           0.29
2                 0.39           0.36
4                 0.53           0.42
9                 0.61           0.60
10                0.70           0.59
32                0.79           0.78
64                0.81           0.83
128               0.81           0.83
200               0.84           0.83

Feature Selection. Lowry and Isenhour (34) proposed a simple feature selection technique for binary data based upon the variance among the average class spectra for each interval. Intervals with a large amount of variance among the classes would be better used for classification purposes than intervals with relatively little variance. Unlike their work where they suggested the one best ordering of one-tenth micrometer intervals for categorizing 13 different chemical functionalities by infrared data, we decided to obtain the best ppm intervals for each question posed. Thus, we selected as best the interval which had the largest difference between the class conditional probabilities for a peak appearing in the class and non-class spectra. For example, 103 steroids contained a peak between 35 and 36 ppm (probability = 103/110 = 0.94). Between 35 and 36 ppm, only 116 non-steroids contained a peak (p = 116/2361 = 0.05). Likewise, 36-37 ppm was the second best interval for steroid/non-steroid separation.
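The ranking rule just described, ordering intervals by the difference between the class and non-class peak probabilities, can be sketched as follows. The function names and the four-interval weight vectors are hypothetical. Only the pair 103/110 and 116/2361 comes from the text, where it belongs to the 35-36 ppm interval of the steroid question.

```python
def rank_intervals(w_class, w_nonclass):
    """Return interval indices ordered from most to least discriminating.

    An interval's score is the absolute difference between its peak
    probabilities in the class and non-class average spectra.
    """
    scores = [abs(pc - pn) for pc, pn in zip(w_class, w_nonclass)]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def reduce_features(x, selected):
    """Keep only the selected intervals of a spectrum or weight vector."""
    return [x[j] for j in selected]


# Toy vectors; index 1 carries the 35-36 ppm probabilities quoted in the text.
w_steroid = [0.10, 103 / 110, 0.80, 0.50]
w_nonsteroid = [0.08, 116 / 2361, 0.10, 0.45]

ranking = rank_intervals(w_steroid, w_nonsteroid)
best_two = ranking[:2]
print(ranking, reduce_features([1, 1, 0, 1], best_two))
```

Classification with a reduced feature set then applies the same discriminant functions to the shortened vectors, so the cost per unknown falls in proportion to the number of intervals kept.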
Similar rankings were determined for the carbohydrate question. It is desirable to reduce the number of features used in order to increase computation speed. Since the weight vectors being employed are only estimates of the true class conditional probabilities, it is generally true that the more spectra included in the training set from which the estimates are made, the more accurate the estimates will be. However, increasing the size of training sets requires additional investments of time to develop the estimates. If the number of intervals tested could be reduced from 200 to 50, the size of the training set could be increased by a factor of 4 (if sufficient spectra were available), thereby increasing the reliability of the estimates at a cost of no additional computation time. Similarly, time required to classify unknowns would be decreased fourfold. The major drawback to reducing the number of features is the possibility of a considerable decrease in discriminating ability. Table IV indicates that the number of features needed to effect class/non-class discrimination with little decrease in performance for the steroid and carbohydrate questions could be reduced by at least two thirds from the initial value of 200.

CONCLUSIONS

This investigation has reaffirmed earlier findings with binary infrared data, namely, the conventional nearest neighbor method (using Euclidean or Hamming distance) and the dot product are poor selections as classifiers of binary data (26, 28). For the three questions described in Table II, Tanimoto similarity, maximum likelihood, and distance from the mean all performed quite well. It should be noted that the Tanimoto measure correctly identified 85.5% of the carbohydrates when the unaugmented file of 2229 spectra was used. The percentage increased to 97.7% with the addition of 138 carbohydrates to the reference file. Thus an extensive reference file is necessary for the similarity measure approach.

In order to make a prediction, n·d operations are required with a similarity measure, where n = number of spectra in the file (2471 for this example) and d = number of features (200). That product, 494 200, is considerably larger than the c·d operations (c = number of classes (2); thus c·d = 400) necessary during the classification step for distance and maximum likelihood. It must be mentioned that similarity measures require no training step while the other two methods do, but except for instances where very few unknowns will be treated, the time invested during the training step is more than compensated for by the difference between c·d and n·d.

While selection of the best technique is largely dependent upon the particular situation, our investigation has led us to select a suggested approach for chemists confronted with the problem of solving actual structure elucidation problems. If computer time is not excessively limited and a suitable reference file exists, a search employing the Tanimoto or some comparable measure might prove valuable. The possibility always exists that one is dealing with an impurity or a byproduct and not an unknown natural product. A search through a suitable file would help allay that possibility and conceivably save the chemist much time and effort. If computer time is at a premium, satisfactory results should be obtained using a reduced feature set and the distance or maximum likelihood classifier. As demonstrated by the nucleoside/non-nucleoside question in this study, care must be taken not to ask too specific a question too early in the classification procedure. Far superior results were obtained for the nucleoside problem in a two-step process.

The encouraging results obtained from this study, the previous work by Wilkins and co-workers, and the work of Sjöström and Edlund (35) give strong indication that computer-assisted interpretation of 13C-NMR spectra can be a valuable tool in the solution of actual structure elucidation problems.

ACKNOWLEDGMENT

The authors are indebted to J. Devens Gust for helpful discussions. We also wish to show our appreciation to Charles L. Wilkins for supplying a copy of the data file produced under Contract No. 68-01-334 with the Environmental Protection Agency.

LITERATURE CITED

(1) D. B. Nelson, M. E. Munk, K. B. Gash, and D. L. Herald, Jr., J. Org. Chem., 34, 3800 (1969).
(2) H. Abe and S. Sasaki, Sci. Rep. Tohoku Imp. Univ., Ser. 1, 55, 63 (1972).
(3) B. D. Cox, Ph.D. Thesis, Department of Chemistry, Arizona State University, 1973.
(4) L. A. Gribov, V. A. Dementyev, M. E. Elyashberg, and E. Z. Yakupov, J. Mol. Struct., 22, 161 (1974).
(5) R. E. Carhart, D. H. Smith, H. Brown, and C. Djerassi, J. Am. Chem. Soc., 97, 5755 (1975).
(6) C. A. Shelley, H. B. Woodruff, C. R. Snelling, Jr., and M. E. Munk, Abstracts, Division of Chemical Information, 173rd National Meeting, American Chemical Society, New Orleans, La., March 23, 1977; ACS Symposium Series "Computer-Assisted Structure Elucidation", in press.
(7) C. A. Shelley and M. E. Munk, "Computer Elaboration of Molecular Structures", Abstracts, Division of Computers in Chemistry, 172nd National Meeting, American Chemical Society, San Francisco, Calif., August 1976.
(8) H. B. Woodruff and M. E. Munk, J. Org. Chem., 42, 1761 (1977).
(9) H. B. Woodruff and M. E. Munk, Res./Dev., 28 (8), 34 (1977).
(10) H. J. Reich, M. Jautelat, M. T. Messe, F. J. Weigert, and J. D. Roberts, J. Am. Chem. Soc., 91, 7445 (1969).
(11) G. C. Levy and G. L. Nelson, "Carbon-13 Nuclear Magnetic Resonance for Organic Chemists", Wiley-Interscience, New York, N.Y., 1972.
(12) C. L. Wilkins, R. C. Williams, T. R. Brunner, and P. J. McCombie, J. Am. Chem. Soc., 96, 4182 (1974).
(13) T. R. Brunner, R. C. Williams, C. L. Wilkins, and P. J. McCombie, Anal. Chem., 46, 1798 (1974).
(14) T. R. Brunner, C. L. Wilkins, R. C. Williams, and P. J. McCombie, Anal. Chem., 47, 662 (1975).
(15) C. L. Wilkins and T. L. Isenhour, Anal. Chem., 47, 1849 (1975).
(16) H. B. Woodruff, G. L. Ritter, S. R. Lowry, and T. L. Isenhour, Appl. Spectrosc., 30, 213 (1976).
(17) J. B. Justice and T. L. Isenhour, Anal. Chem., 46, 223 (1974).
(18) T. Wittstruck and K. I. H. Williams, J. Org. Chem., 38, 1542 (1973).
(19) H. Eggert and C. Djerassi, J. Org. Chem., 38, 3788 (1973).
(20) J. W. ApSimon, H. Beierbeck, and J. K. Saunders, Can. J. Chem., 51, 3874 (1973).
(21) D. Leibfritz and J. D. Roberts, J. Am. Chem. Soc., 95, 4996 (1973).
(22) N. S. Bhacca, D. D. Giannini, W. S. Jankowski, and M. E. Wolff, J. Am. Chem. Soc., 95, 8421 (1973).
(23) E. Breitmaier and W. Voelter, "13C NMR Spectroscopy: Methods and Applications", Verlag Chemie, Weinheim/Bergstr., Germany, 1974.
(24) R. O. Duda and P. E. Hart, "Pattern Classification and Scene Analysis", Wiley-Interscience, New York, N.Y., 1973.
(25) T. M. Cover and P. E. Hart, IEEE Trans. Info. Theory, IT-13, 21 (1967).
(26) H. B. Woodruff, S. R. Lowry, G. L. Ritter, and T. L. Isenhour, Anal. Chem., 47, 2027 (1975).
(27) D. J. Rogers and T. T. Tanimoto, Science, 132, 1115 (1960).
(28) H. B. Woodruff, S. R. Lowry, and T. L. Isenhour, Appl. Spectrosc., 28, 226 (1975).
(29) J. Franzen, Chromatographia, 7, 518 (1974).
(30) S. R. Lowry, H. B. Woodruff, G. L. Ritter, and T. L. Isenhour, Anal. Chem., 47, 1126 (1975).
(31) L. J. Soltzberg, C. L. Wilkins, S. L. Kaberline, T. F. Lam, and T. R. Brunner, J. Am. Chem. Soc., 98, 7139 (1976).
(32) H. Rotter and K. Varmuza, Org. Mass Spectrom., 10, 874 (1975).
(33) N. A. B. Gray, Anal. Chem., 48, 2265 (1976).
(34) S. R. Lowry and T. L. Isenhour, J. Chem. Inf. Comp. Sci., 15, 212 (1975).
(35) M. Sjöström and U. Edlund, J. Magn. Reson., 25, 285 (1977).

RECEIVED for review August 3, 1977. Accepted August 3, 1977. Financial support by the National Institute of General Medical Sciences (GM21703) is gratefully acknowledged.

Iodine-Amine Charge-Transfer Complexes as Spectrophotometric Detectants in High Pressure Liquid Chromatography
C. Randall Clark,* Charles M. Darling, Jen-Lee Chan, and Alfred C. Nichols
School of Pharmacy, Auburn University, Auburn, Alabama 36830

Iodine-amine charge-transfer complexes are demonstrated to enhance the UV detectability of N,N-dimethylbenzylamine. The complexation reaction is rapid, reaching equilibrium in less than 7 s. The molecular ratio of reactants in the complex is 1:1 I2:amine. Maximum charge-transfer band intensity, which reflects the total amount of complex present, was observed at a 10:1 iodine:amine ratio. An HPLC analysis is described for N,N-dimethylbenzylamine which includes the direct chromatography of the free amine, the “in line” formation of the complex, and the detection of the amine in the charge-transfer complex form. The use of the complex allows for a 20-fold increase in peak area over the same concentration of uncomplexed N,N-dimethylbenzylamine. This procedure appears to be of general applicability to all types of amines.
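
As an illustration only, the following sketch shows why a complex with 1:1 stoichiometry can still require a large excess of iodine before most of the amine is present in the complexed form. It assumes a simple equilibrium, amine + I2 <=> complex, with association constant K. The function names, the value of K, and the amine concentration are hypothetical and are not taken from this work. This simple model only saturates as iodine is added; it does not by itself reproduce a maximum in band intensity.

```python
import math


def complex_concentration(amine_total, iodine_total, k_assoc):
    """Equilibrium concentration (mol/L) of a 1:1 amine-iodine complex.

    Solves K = c / ((a - c) * (i - c)) for c, which rearranges to
    c**2 - (a + i + 1/K) * c + a * i = 0. The smaller root is the
    physically meaningful one, since c cannot exceed a or i.
    """
    s = amine_total + iodine_total + 1.0 / k_assoc
    return (s - math.sqrt(s * s - 4.0 * amine_total * iodine_total)) / 2.0


def fraction_complexed(ratio, amine_total=1.0e-4, k_assoc=1.0e3):
    """Fraction of the amine complexed at a given iodine:amine mole ratio.

    The default concentration and association constant are assumed
    values chosen for illustration, not measured ones.
    """
    iodine_total = ratio * amine_total
    return complex_concentration(amine_total, iodine_total, k_assoc) / amine_total


if __name__ == "__main__":
    for ratio in (1, 2, 5, 10, 20):
        print(f"I2:amine = {ratio:>2}:1  fraction complexed = {fraction_complexed(ratio):.3f}")
```

Under these assumed values the complexed fraction rises steadily with the iodine:amine ratio, which is consistent in spirit with the use of excess iodine, although the real system may behave differently.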

A charge-transfer complex can be described as a molecular complex formed by the weak interaction of an electron donor and an electron acceptor. Charge-transfer complexes usually involve simple integral ratios of the components, and the enthalpy of formation is usually only a few kcal/mol. The rates of formation and decomposition into the components are so high that the reactions appear to be instantaneous by normal techniques. In most cases, the complex has absorption peaks in its electronic absorption spectrum which are not common to either component. Charge-transfer complexes of organic materials with iodine have long been used as a method of visualization in thin-layer and paper chromatography. Iodine has been described as a sacrificial σ-acceptor and amines as increvalent n-donors. Yada et al. (1) have shown that iodine-aliphatic amine complexes show two characteristic absorption bands in the 430-410 and 280-230 nm regions.

ANALYTICAL CHEMISTRY, VOL. 49, NO. 13, NOVEMBER 1977

Taha and co-workers (2) reported a considerable increase in the UV band intensity of many alkaloids via charge-transfer complexation with σ-acceptors such as iodine. The increase in absorptivity (up to 100 times the value of the uncomplexed alkaloid) was much greater for weak UV absorbers such as the tropine alkaloids, ephedrine, codeine, and sparteine. The UV analysis of other pharmaceuticals utilizing iodine charge-transfer complexation has been reported (3).

The highly efficient separation powers of high pressure liquid chromatography have been demonstrated in many areas of analytical chemistry. The most common method of detection in HPLC is spectrophotometric; therefore, the detection limits of a compound are directly related to its absorptivity. Molecules having high natural absorptivity can be detected in low concentrations by HPLC. However, compounds with low absorptivity values must be derivatized (4) with a strong chromophoric group in order to achieve good detectability. The use of derivatizing reagents to produce molecules of high absorptivity has been applied to many classes of compounds (5-7). These derivatization procedures generally link chromophore to substrate through a covalent bond. The formation of covalently linked amine derivatives depends greatly on the degree of nitrogen substitution. Primary and secondary amines serve as substrates for such reactions, but tertiary amines are inert to these procedures. Therefore, the use of covalent chromophores cannot be applied to all types of aliphatic amines. Regardless of degree of substitution, all aliphatic amines possess the pair of n electrons on nitrogen and should serve equally well as donors in charge-transfer complexes.

This paper reports the results of our spectrophotometric studies on amine-iodine charge-transfer complexes and our initial efforts at the use of tertiary amine-iodine complexes as spectrophotometric detectants in