Probability discriminant functions for classifying binary infrared

Table 1. Specific Conductance of Liquid N2226B2226

T, °C      10^4 κ, ohm⁻¹ cm⁻¹
25.00      5.120
35.00      8.665
45.06      13.69
55.00      20.43
65.00      29.13
75.06      39.81

cene in aprotic solvents) (8) appears to be completely reversible because, after correction for background current, the cathodic and anodic peak currents are equal within experimental error. Observed peak potentials, however, at scan rates of 50-800 mV/sec varied from -1.99 to -2.03 V, and cathodic and anodic peak separations were about 0.10 V, because of uncompensated resistance in the solution. Extrapolated to zero scan rate, the anthracene E1/2 = -1.95 V. By the cathodic and anodic peak current criterion, the first reduction step of benzophenone is only partly reversible, perhaps because a trace of water in the N2226B2226 reacts with the benzophenone radical anion. Extrapolated to zero scan rate, the benzophenone E1/2 = -1.71 V. The irreversible second reduction step appears at about -2.1 V. Cyclic voltammograms of anthracene and benzophenone in N2226B2226 at 51° appear qualitatively the same as at 22° except that peak currents are 1.7 to 2.0 times larger. The useful working range of N2226B2226 is about -0.5 to -2.5 V. At more anodic potentials, oxidation of the B2226 ion proceeds readily (9) and, at more cathodic potentials, the background current is unacceptably high. Its residual current appears qualitatively to be smaller than that of tetra-n-hexylammonium benzoate hemihydrate (5), probably because N2226B2226 contains 10.1 wt % water. Two features of N2226B2226 should make it attractive for further electrochemical investigation. It is a single component conducting fluid at room temperature, and its high viscosity makes diffusion processes much slower than in common

Figure 1. Cyclic voltammograms in N2226B2226 at 22°. (a) 2.7 mM anthracene, 100 mV/sec. (b) 2.6 mM benzophenone, 200 mV/sec. E is referenced to aqueous Ag/AgCl, saturated KCl. Horizontal axis: -E (V).

aprotic solvents. Moreover, its viscosity and conductance can be varied markedly by relatively small temperature changes.

ACKNOWLEDGMENT

I thank L. Faulkner and S. G. Smith for helpful discussions and use of their instruments.

LITERATURE CITED

(1) W. T. Ford, R. J. Hauri, and D. J. Hart, J. Org. Chem., 38, 3916 (1973).
(2) W. T. Ford and R. J. Hauri, J. Am. Chem. Soc., 95, 7381 (1973).
(3) W. T. Ford, R. J. Hauri, and S. G. Smith, J. Am. Chem. Soc., 96, 4316 (1974).
(4) W. T. Ford and D. J. Hart, J. Am. Chem. Soc., 98, 3261 (1974).
(5) C. G. Swain, A. Ohno, D. K. Roe, R. Brown, and T. Maugh II, J. Am. Chem. Soc., 89, 2648 (1967).
(6) J. F. Coetzee and G. P. Cunningham, J. Am. Chem. Soc., 88, 3403 (1964).
(7) W. T. Ford, unpublished results, 1974.
(8) C. K. Mann and K. K. Barnes, "Electrochemical Reactions in Nonaqueous Solvents", Marcel Dekker, Inc., New York, NY, 1970, p 59.
(9) K. Ziegler and O.-W. Steudel, Justus Liebigs Ann. Chem., 652, 1 (1962).

RECEIVED for review January 7, 1975. Accepted March 3, 1975. This research was supported by National Science Foundation grant GP 38493.

Probability Discriminant Functions for Classifying Binary Infrared Spectral Data

S. R. Lowry, H. B. Woodruff, G. L. Ritter, and T. L. Isenhour

Department of Chemistry, University of North Carolina, Chapel Hill, NC 27514

ANALYTICAL CHEMISTRY, VOL. 47, NO. 7, JUNE 1975

In computer classification of chemical data, each compound is represented by a feature vector whose components correspond to physical measurements made on the compound (melting point, % light transmittance, mass spectrum, etc.). Although the actual measuring process may produce a continuum of values (analog output), the data must be digitized into a discrete form to be computer compatible. Depending on the resolving capabilities of the measuring device and the resolution required by the classification algorithm, the digitized data may either approximate the continuous state or, at the opposite extreme, be limited to the binary state. A number of pattern recognition investigations have dealt with "continuous" data (1). Less work, however, has involved chemical pattern recognition of binary data (2, 3). This work reports the use of probability methods for classifying compounds into similar structural groups on the basis of their binary infrared spectra (4).

Two main reasons for investigating infrared spectral data are: 1) access to the ASTM file of over 90,000 infrared spectra stored in binary form (made accessible at the Triangle Universities Computation Center by the R. J. Reynolds Tobacco Company); and 2) the widespread use of infrared spectroscopy throughout industry.

To represent the infrared spectrum of a compound by a binary vector, each element of the vector must correspond to some interval of the spectral region of interest. For the present work, all intervals are 0.1 μm. If a peak maximum is present in a 0.1-μm interval, the corresponding element in the binary array is set to one; otherwise, it is set to zero. For the infrared region 2.0 to 15.9 μm, this produces a vector with 139 elements.

By using a training set of binary spectra whose correct classification is known, the class conditional probabilities for each of the 139 dimensions (features) can be approximated (5). A class conditional probability, $P(x_i = 1 \mid c_j)$, is the probability that a peak maximum appears in interval $i$ ($x_i = 1$), given that the compound belongs to class $j$ ($c_j$). If one has a set of spectra that are known to belong to class $j$, the average spectrum for the class approximates the 139 probabilities:

$$P(x_i = 1 \mid c_j) = \sum_{n=1}^{m_j} x_i / m_j \tag{1}$$

where $x_i = 1, 0$ and $m_j$ = number of spectra in class $j$ (the sum runs over the $m_j$ training spectra of the class).

For the two-class problem (presence or absence of a functional group), each dimension has two conditional probabilities, $p_{i1}$ and $p_{i2}$, where $p_{i1}$ and $p_{i2}$ are shorthand for $P(x_i = 1 \mid c_1)$ and $P(x_i = 1 \mid c_2)$. For the case of classification based on a single dimension, if the unknown spectrum contains a one in dimension $k$, it will be assigned to the class for which $p_{kj}$ is larger (assuming that the two classes are equally likely to occur). A total spectral probability can be determined by calculating the joint probability that all the peaks in an unknown spectrum occur, given that the compound belongs to a certain class (assuming statistical independence among the peaks) (6). This joint probability can be written for each class $j$ as

$$R_j = p_{ij} \cdot p_{kj} \cdot p_{lj} \cdots p_{mj} \tag{2}$$

where $x_i, x_k, x_l, \ldots, x_m$ are the elements of the vector equal to one. Another way of writing $R_j$ is

$$R_j = \prod_{i=1}^{139} p_{ij}^{x_i} \tag{3}$$

where $x_i = 1, 0$. An unknown spectrum would then be assigned to the class that gives the largest $R_j$.

Often the absence rather than the presence of a peak gives more information concerning structure elucidation (7). To study this possibility, a second set of class conditional probabilities is calculated: $P(x_i = 0 \mid c_j)$. These are the probabilities that a peak maximum does not occur in a particular interval, given that the spectrum belongs to class $j$. In the binary case, this probability is simply

$$P(x_i = 0 \mid c_j) = 1 - p_{ij} \tag{4}$$

The equivalent joint probability for an unknown spectrum is then

$$Q_j = \prod_{i=1}^{139} (1 - p_{ij})^{1 - x_i} \tag{5}$$

where $x_i = 1, 0$. By using both the probability of a peak appearing and the probability of a peak not appearing, a better classification should result. For the single dimension this will be $p_{ij}^{x_i} (1 - p_{ij})^{1 - x_i}$. The corresponding spectral probability will be $T_j = R_j Q_j$, or

$$T_j = \prod_{i=1}^{139} p_{ij}^{x_i} (1 - p_{ij})^{1 - x_i} \tag{6}$$

By taking the log of $T_j$, a new discriminant function is obtained which is linear. This transformation greatly decreases computational times.

$$\begin{aligned}
G_j = \log T_j &= \log \prod_{i=1}^{139} p_{ij}^{x_i} (1 - p_{ij})^{1 - x_i} \\
&= \sum_{i=1}^{139} \left[ x_i \log p_{ij} + (1 - x_i) \log (1 - p_{ij}) \right] \\
&= \sum_{i=1}^{139} x_i \log \frac{p_{ij}}{1 - p_{ij}} + \sum_{i=1}^{139} \log (1 - p_{ij})
\end{aligned} \tag{7}$$

Thus, an unknown compound will be assigned to the class corresponding to the largest value of $G$. This calculation is simply the dot product of a binary vector with a series of weight vectors:

$$G_j = \sum_{i=1}^{139} x_i w_{ij} + w_{0j} \tag{8}$$

where $x_i = 1, 0$;

$$w_{ij} = \log \frac{p_{ij}}{1 - p_{ij}} \quad \text{and} \quad w_{0j} = \sum_{i=1}^{139} \log (1 - p_{ij}) \tag{9}$$

Table I. Percentage of Compounds Correctly Identified

                              R_j, peak presence    Q_j, peak absence     G_j, combination
    Functional group           (+)   (-)    Av       (+)   (-)    Av       (+)   (-)    Av
 1  Carboxylic acid             86    91    88        88    82    85        87    92    90
 2  Ester                       88    91    90        98    45    72        92    89    90
 3  Ketone                      82    79    80        76    65    70        84    80    82
 4  Alcohol                     90    90    90        84    67    76        89    89    89
 5  Aldehyde                    76    91    84        97    25    61        86    86    86
 6  1° Amine                    90    85    88        94    70    82        92    85    88
 7  2° Amine                    97    46    72         0   100    50        86    79    82
 8  3° Amine                    95    58    76         3    99    51        76    86    81
 9  Amide                       96    54    75         1   100    51        81    80    80
10  Urea and derivatives        66    95    80       100     0    50        86    85    86
11  Ether and acetal            97    57    77         0   100    50        85    83    84
12  Nitro and nitroso           87    89    88        70    89    80        86    91    88
13  Nitrile and isonitrile      88    98    93       100     0    50        93    94    94
    Average                     87    79    83        62    65    64        86    86    86
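The discriminant functions of Equations 1 to 9 can be illustrated with a minimal Python sketch. It is not the authors' program: the function names are hypothetical, natural logarithms are assumed (the text does not state a base), each interval is assumed to be closed on its lower edge, and the estimated probabilities are clamped away from 0 and 1 by a small `eps` (an assumption added here so that the logarithms stay finite).

```python
import math

N_INTERVALS = 139   # 0.1-um intervals spanning 2.0 to 15.9 um
LAMBDA_MIN = 2.0    # lower edge of the spectral region, um
WIDTH = 0.1         # interval width, um


def encode_spectrum(peak_maxima_um):
    """Map peak-maximum wavelengths (um) to a 139-element binary vector.

    Assumption: interval i covers [2.0 + 0.1*i, 2.0 + 0.1*(i + 1)).
    """
    x = [0] * N_INTERVALS
    for lam in peak_maxima_um:
        # round() guards against float error, e.g. (2.3 - 2.0) / 0.1 = 2.999...
        i = math.floor(round((lam - LAMBDA_MIN) / WIDTH, 6))
        if 0 <= i < N_INTERVALS:
            x[i] = 1
    return x


def estimate_probabilities(spectra, eps=1e-3):
    """Eq 1: average the binary training spectra of one class.

    The clamp to [eps, 1 - eps] is an assumption, not part of the paper.
    """
    m = len(spectra)
    probs = []
    for i in range(len(spectra[0])):
        p = sum(s[i] for s in spectra) / m
        probs.append(min(max(p, eps), 1.0 - eps))
    return probs


def log_r(x, probs):
    """Log of Eq 3: only intervals that contain a peak contribute."""
    return sum(math.log(p) for xi, p in zip(x, probs) if xi)


def log_q(x, probs):
    """Log of Eq 5: only intervals without a peak contribute."""
    return sum(math.log(1.0 - p) for xi, p in zip(x, probs) if not xi)


def weights(probs):
    """Eq 9: w_i = log(p_i / (1 - p_i)) and w_0 = sum of log(1 - p_i)."""
    w = [math.log(p / (1.0 - p)) for p in probs]
    w0 = sum(math.log(1.0 - p) for p in probs)
    return w, w0


def g(x, w, w0):
    """Eq 8: dot product of the binary vector with the weights, plus w_0."""
    return sum(wi for xi, wi in zip(x, w) if xi) + w0


def classify(x, models):
    """Assign x to the class label whose (w, w0) pair gives the largest G."""
    return max(models, key=lambda label: g(x, *models[label]))
```

Because the weights depend only on the training set, they can be computed once per class; classifying an unknown then costs one pass over its binary vector, which is the computational saving the log transformation provides.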

In order to study the predictive abilities of each of the three spectral probabilities, sets of 200 spectra are randomly selected from the ASTM file. Each of the training sets corresponds to one of thirteen functional groups (see Table I) and the sets are mutually exclusive (i.e., a chosen compound will not be both a ketone and an amine). The presence or absence of each of the thirteen functional groups is determined for all of the 2600 compounds using the three different probability discriminant functions: peak appearance ($R_j$), peak absence ($Q_j$), and combination ($G_j$). Table I gives the percentage of compounds correctly identified by each method. The columns labeled (+) correspond to percentages of the 200 compounds that do contain the functional group and which are correctly classified (e.g., 86% of the compounds containing acids are correctly identified on the basis of the peak appearance joint probability). The columns labeled (-) correspond to the percentages of the 2400 compounds that do not contain the group under investigation (82% of the non-acid compounds are correctly identified on the basis of the peak absence joint probability). The columns labeled (Av) are simply the average of the (+) and (-) columns. This value is indicative of total predictive ability. The bottom row of the table gives the average percentage for each column.

It is pleasing to see that even though the peak absence joint probability ($Q_j$) does very poorly by itself, when combined with peak presence information ($R_j$), it improves the discriminating ability of the joint probability function. It is interesting to note that those functional groups which had very low prediction for the positive class using $Q_j$ (7, 8, 9, 11) correspond to those spectra containing the most peaks, while the functional groups with very good positive class prediction (10, 13) correspond to spectra containing the smallest average number of peaks.

Two very fundamental assumptions are made in developing the joint probability discriminant functions. The first concerns the class sizes. By assuming equal class sizes, one ignores the a priori class probabilities and simply calculates the maximum likelihood estimate. This assumption compensates for the unequal class sizes (200 and 2400 compounds). One could correctly predict over 92% of the compounds by always saying the functional group is not present. Although the results in column $G_j$ are lower than 92%, they are more satisfying in the sense that they more equally predict both absence and presence of a functional group. A discriminant function that always says no is seldom useful even if it is frequently correct. The second assumption involves the statistical independence of the different dimensions of the vector. While this assumption is obviously not true (for example, in acids the C-O stretch and the O-H stretch are strongly correlated), the vast number of higher order terms that must otherwise be included require that it be made. Because of these two assumptions, the present results are not entirely satisfactory and investigations are continuing to develop techniques that do not require them. However, these results demonstrate that probability theory is a powerful aid in studying chemical classification of binary data.
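The class-size argument can be checked with a short Python sketch. The helper name `score` is hypothetical; it computes the (+), (-), and Av percentages in the way Table I defines them and applies them to a rule that always answers "not present" for 200 positive and 2400 negative compounds.

```python
def score(predictions, truths):
    """Return the (+), (-), and Av percentages as defined for Table I.

    predictions, truths: sequences of booleans (True = group present).
    """
    pos = [p for p, t in zip(predictions, truths) if t]
    neg = [p for p, t in zip(predictions, truths) if not t]
    pct_pos = 100.0 * sum(1 for p in pos if p) / len(pos)
    pct_neg = 100.0 * sum(1 for p in neg if not p) / len(neg)
    return pct_pos, pct_neg, (pct_pos + pct_neg) / 2.0


# 200 compounds contain the group, 2400 do not (class sizes from the text).
truths = [True] * 200 + [False] * 2400
always_no = [False] * len(truths)

pct_pos, pct_neg, av = score(always_no, truths)
overall = 100.0 * sum(p == t for p, t in zip(always_no, truths)) / len(truths)
print(pct_pos, pct_neg, av, round(overall, 1))
```

The overall fraction correct is 2400/2600, just above 92%, yet the (+) column is 0 and Av is only 50, which is why the averaged columns are the more informative measure of predictive ability here.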

LITERATURE CITED

(1) T. L. Isenhour, B. R. Kowalski, and P. C. Jurs, Crit. Rev. Anal. Chem., 4, 1 (1974).
(2) D. S. Erley, Anal. Chem., 40, 894 (1968).
(3) S. L. Grotch, Anal. Chem., 46, 526 (1974).
(4) H. B. Woodruff, S. R. Lowry, and T. L. Isenhour, Anal. Chem., 46, 2150 (1974).
(5) R. O. Duda and P. E. Hart, "Pattern Classification and Scene Analysis", Wiley-Interscience, New York, NY (1973).
(6) J. Franzen, Chromatographia, 7, 518 (1974).
(7) D. N. Kendall, "Applied Infrared Spectroscopy", Reinhold Publishing Corporation, New York, NY, 1966, p 172.

RECEIVED for review November 25, 1974. Accepted January 31, 1975. The financial support of the National Science Foundation is gratefully acknowledged.

Circular Polymethylene Wedge for Determining the Methyl Group Content and Density of Polyethylene by Infrared Spectrometry

A. Glenn Nerheim

Research and Development Department, Standard Oil Company (Indiana), Naperville, IL 60540

According to ASTM (1), methyl groups in polyethylene are determined by measuring compensated absorbance at 1378 cm⁻¹. Compensation is achieved with a wedge of known methyl content in the reference beam and a polyethylene film in the sample beam (2). The wedge is adjusted to minimize methylene interference so that an accurate measure of methyl absorption at 1378 cm⁻¹ is achieved. The density of the sample, needed to calculate the methyl concentration, is determined separately. We have found, however, that a separate density determination may be eliminated by modifying the procedure to make the density determination an integral part of the compensation procedure. Since the interference is assigned to amorphous methylene, at complete compensation the amorphous contents in the sample and reference beams must be equal. Then, because amorphous and crystalline fractions have different densities, the density of the entire sample is derivable from the amorphous content. Therefore, we modified the methylene compensation procedure to accurately measure the wedge thickness, which is directly related to the amorphous content of the sample. To simplify the determinations, we use a circular calibrated wedge that is mounted over the aperture of the spectrometer. While satisfying the requirements of ASTM (2), our procedure includes the density determination.

Wedge Construction and Calibration. Figure 1 shows the wedge assembly, including an aperture which aligns with that of the spectrometer. The wedge is attached as a