Computerized learning machines applied to chemical problems

meteoritic, and lunar materials by neutron activation because of the large Compton interferences by other radionuclides and the complexity of the low energy portion of the gamma-ray spectrum. Because of the high counting rate requirement, this technique is generally most applicable to small, relatively intense sources. However, this usually is not too restrictive in the neutron activation and fission product applications. As the size and efficiency of the Ge(Li) detectors increase, greater improvements in sensitivities will be obtained and the technique will be applicable to larger samples and higher energy gamma rays.

ACKNOWLEDGMENT

The author wishes to acknowledge the counting and electronic assistance of R. M. Campbell, R. T. Brodazynski, and J. R. Kosorok, sample preparation of J. H. Reeves and D. R. Edwards, and the many helpful comments of L. A. Rancitelli, W. A. Haller, C. E. Jenkins, R. W. Perkins, and C. W. Thomas.

RECEIVED for review December 30, 1970. Accepted February 15, 1971. This paper is based on work performed under United States Atomic Energy Commission Contract AT(45-1)1830.

Computerized Learning Machines Applied to Chemical Problems
Optimization of a Linear Pattern Classifier by the Addition of a “Width” Parameter

L. E. Wangen, N. M. Frew, and T. L. Isenhour
Department of Chemistry, University of North Carolina, Chapel Hill, N. C. 27514

The binary linear pattern classifier has been improved by the addition of a “width” parameter which defines a null region in the pattern space. In the case of separable pattern sets, the width is maximized in an attempt to produce the maximum prediction for unknown patterns. For inseparable cases with slightly overlapping distributions, the null region contains the subset of patterns which prevent linear separation, thereby making the remaining patterns linearly separable. The method is illustrated with computer generated 2-dimensional Gaussian data and its utility for classifying spectrometric patterns is demonstrated with examples taken from mass spectrometry.

PREVIOUS WORK has shown the usefulness of Computerized Learning Machines for the interpretation of spectrometric data (1-6). In such an approach, the spectra are represented as points in a d + 1-dimensional pattern space which are to be separated by d-dimensional planes into binary subsets according to various chemical and/or structural criteria. The linear pattern classifier is developed from a training set of patterns by an error-forcing iterative process, employing negative feedback, which is designed to maximize learning rate. The sole criterion for complete convergence is correct classification of all patterns in the training set. Upon cessation of training, the resulting linear pattern classifier (denoted by a weight vector W) can be tested according to its ability to correctly classify “unknown” spectra (predictive ability) not contained in the training set. Although the binary linear pattern classifier is easily implemented and has yielded useful results with chemical data, it suffers from certain limitations.

(1) P. C. Jurs, B. R. Kowalski, and T. L. Isenhour, Anal. Chem., 41, 21 (1969).
(2) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, ibid., p 690.
(3) Ibid., p 1949.
(4) Ibid., 42, 1387 (1970).
(5) B. R. Kowalski, P. C. Jurs, T. L. Isenhour, and C. N. Reilley, ibid., p 1945.
(6) L. E. Wangen and T. L. Isenhour, ibid., 42, 737 (1970).

(1) Linearly inseparable sets may not be completely classified. In the case of nonconvergence after a maximum number of feedbacks for a pair of categories, it is assumed that the categories are for practical purposes not linearly separable. In these instances the category distributions may be overlapping and, because the convergence criterion requires correct classification of all training patterns, W will oscillate indefinitely until training is arbitrarily discontinued. As a result of this oscillation, the utility and reproducibility of W for prediction and recognition (as measured by ability to classify training patterns) may be greatly diminished. Hence it would be desirable as a first approach to assume that the distributions are only slightly overlapping. Consequently omission of the samples in the overlapping region would allow complete linear separation of the remaining samples.

(2) The successfully trained W is not guaranteed to be an optimum classifier. Due to the relatively simple yes/no convergence criterion, there is no guarantee that the decision surface will optimally separate the two pattern distributions. Consequently, there may be considerable variation in predictive ability for different classifiers developed with the same training set.

(3) The linear pattern classifier gives no dependable method of assigning a confidence value to each classification. A confidence figure is especially desirable where binary decisions are to be used in a “branching tree” fashion. In such a system the propagation of a single incorrect decision may have deleterious effects unless the user has some indication that the result is in doubt.

Watanabe et al. used “dead zone maximization” to maximize separation of an artificially generated character set (7). The purpose of this work is to demonstrate how the simple addition of such a “width” to the decision surface can minimize or eliminate the above limitations while still retaining many of the computational advantages of the trainable linear pattern classifier used in earlier work. The use of such a technique for improving the performance of computerized learning machines for the interpretation of chemical data will be illustrated.

(7) S. Watanabe, P. F. Lambert, C. A. Kulikowski, J. A. Buxton, and R. Walker, “Computer and Information Sciences-II,” J. T. Tou, Ed., Academic Press, New York, N. Y., 1967, p 107.

ANALYTICAL CHEMISTRY, VOL. 43, NO. 7, JUNE 1971

Figure 1. Use of the normal vector of a decision surface as a binary pattern classifier

MATHEMATICAL FORMULATION

Any pattern of d elements can be represented as a d + 1-component vector (Y) in a d + 1-dimensional space. (The d + 1st component is the same for all patterns and is added to establish a common coordinate origin for the patterns and the decision surface, i.e., the decision surface passes through the origin.) The decision surface is a d-dimensional plane represented by a d + 1-component weight vector (W) normal to it. A 2-dimensional illustration separable by a line through the origin is shown in Figure 1. It is apparent that the interior angle between W and any Y in the positive category is less than π/2 radians whereas the corresponding angle for any pattern in the negative category is greater than π/2 radians. Hence the dot product

$$
W \cdot Y =
\begin{cases}
|W|\,|Y| \cos\theta_{+} > 0 & \text{patterns in positive category} \\
|W|\,|Y| \cos\theta_{-} < 0 & \text{patterns in negative category}
\end{cases}
\tag{1}
$$

where |X| designates length of X. The threshold logic unit uses the correct classification of all patterns in the training set, as determined by Equation 1, as sufficient condition for complete convergence and cessation of training. Further reference to Figure 1 shows that cos θ = ±a/|Y| where a is the perpendicular distance between Y and the decision surface, taken as positive for the positive category and negative for the negative category. Hence with Equation 1,

$$
W \cdot Y = \pm |W|\, a
\tag{2}
$$

and if W has unit length W·Y = ±a. Thus the dot product can be a direct measure of the distance between the pattern points and the decision plane. The W of the threshold logic unit is developed during training by a negative feedback process whereby a new weight vector W' is calculated after an incorrect classification of a Y. A method of feedback found successful in much previous work defines W' such that W'·Y has the opposite, hence correct, sign of W·Y, i.e.,

$$
W' \cdot Y = (W + CY) \cdot Y = -\,W \cdot Y
\tag{3}
$$

where C is to be calculated. Equation 3 gives

$$
C = \frac{-2\,(W \cdot Y)}{Y \cdot Y}
\tag{4}
$$

If the constraint of unit length is added to W' and W, then

$$
W' \cdot Y = \frac{(W + CY) \cdot Y}{|W + CY|} = -\,W \cdot Y
$$

where solving for C again gives Equation 4. However now the correction process puts the decision surface the same distance on the correct side of point Y as it previously lay on the incorrect side of Y (Equation 2). Now if some constant delta (Δ) which may be positive or negative is added to the decision criterion so that

$$
\begin{aligned}
W \cdot Y &> \Delta && \text{for patterns in positive category} \\
W \cdot Y &< -\Delta && \text{for patterns in negative category}
\end{aligned}
$$

the effect is to specify a region of width 2|Δ| within which, during training, patterns are treated as incorrectly classified if Δ is positive or as correctly classified if Δ is negative. That is, this convergence criterion requires that all patterns lie at least a distance Δ on the correct side of the decision surface for positive Δ, whereas if Δ is negative, patterns that actually lie on the wrong side within a distance |Δ| of the decision surface are not counted as errors. Delta is made as large as is consistent with efficient training. The addition of Δ necessitates the solution of different equations for C. First recall that the dot product is algebraically defined by

$$
W \cdot Y = w_1 y_1 + w_2 y_2 + \cdots + w_{d+1}\, y_{d+1}
$$

where the subscripts denote vector components. It is observed that multiplication of all components of Y by minus one gives

$$
W \cdot (-Y) = -\,w_1 y_1 - w_2 y_2 - \cdots - w_{d+1}\, y_{d+1} = -\,W \cdot Y
$$

Thus for mathematical convenience all training patterns in the negative category are multiplied by minus one so the criterion for correct classification becomes the same for all training patterns, i.e., W·Y > Δ. If during training W·Y < Δ for some Y, W' is defined in analogy to the threshold logic unit, such that W'·Y = 2Δ - W·Y. Hence, after adjustment of the weight vector, the dot product is as much greater than Δ as it previously was less than Δ. The calculation for C now involves solution of the following

$$
W' \cdot Y = \frac{(W + CY) \cdot Y}{|W + CY|} = 2\Delta - W \cdot Y
\tag{7}
$$

which gives a quadratic in C of the form

$$
\alpha C^2 + \beta C + \gamma = 0
\tag{8}
$$

where

$$
\begin{aligned}
\alpha &= Y \cdot Y \left[\, 4\Delta^2 - 4\Delta\, W \cdot Y + (W \cdot Y)^2 - Y \cdot Y \,\right] \\
\beta &= 2\, W \cdot Y \left[\, (W \cdot Y)^2 - 4\Delta\, W \cdot Y + 4\Delta^2 - Y \cdot Y \,\right] \\
\gamma &= 4\Delta^2 - 4\Delta\, W \cdot Y
\end{aligned}
\tag{9}
$$

Figure 2. Decision surfaces resulting from deltas of 0.0 (A), 0.10 (B), and 0.16 (C)

This results from the necessity of squaring Equation 7 to remove the |W + CY|. As squaring generates a second solution for C which does not satisfy Equation 7, it is necessary to use the solution of Equation 8 resulting in a positive value of C. It is noted that if Δ = 0, the equations reduce to those of the threshold logic unit which therefore can be treated as a special case of this more general treatment.
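The correction step can be sketched in Python as follows. This is a minimal illustration and not the authors' program: the names `dot` and `feedback` are hypothetical, W is assumed to have unit length, and Y is assumed to be an augmented pattern already multiplied by minus one if it belongs to the negative category. The sketch forms the quadratic coefficients for C, computes both roots, and keeps the root whose normalized W' actually gives W'·Y = 2Δ - W·Y. Checking the unsquared condition directly is a design choice of this sketch; for Δ = 0 the selected root is the positive one, equal to -2(W·Y)/(Y·Y).

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def feedback(w, y, delta=0.0):
    """Return (C, W') with W' = (W + C*Y)/|W + C*Y| and W'.Y = 2*delta - W.Y.

    Assumes |W| = 1 and that Y is currently misclassified (W.Y < delta).
    """
    p = dot(w, y)                      # W.Y
    q = dot(y, y)                      # Y.Y
    target = 2.0 * delta - p           # required value of W'.Y
    k = target * target - q
    if k >= 0.0:
        # A unit-length W' can never give |W'.Y| >= |Y|.
        raise ValueError("target distance must be smaller than |Y|")
    alpha = q * k
    beta = 2.0 * p * k
    gamma = 4.0 * delta * delta - 4.0 * delta * p
    root = math.sqrt(max(beta * beta - 4.0 * alpha * gamma, 0.0))
    best = None
    for c in ((-beta + root) / (2.0 * alpha), (-beta - root) / (2.0 * alpha)):
        v = [wi + c * yi for wi, yi in zip(w, y)]
        norm = math.sqrt(dot(v, v))
        if norm == 0.0:
            continue
        w_new = [vi / norm for vi in v]
        # Squaring introduced a spurious root; keep the one that
        # satisfies the original (unsquared) condition.
        err = abs(dot(w_new, y) - target)
        if best is None or err < best[0]:
            best = (err, c, w_new)
    if best is None:
        raise ValueError("no usable correction for this pattern")
    return best[1], best[2]
```

For example, `feedback([1.0, 0.0], [-0.5, 1.0], 0.0)` moves the decision line so that the pattern lies 0.5 unit on the correct side, the same distance it previously lay on the wrong side.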

DATA AND COMPUTATIONS

A flow diagram of the basic learning routine, SUBROUTINE LEARN, is given in a previous paper (6). The only necessary modification is normalization of W after application of the feedback correction. The 2-dimensional Gaussian data used in illustrating the decision region width were obtained with a Gaussian random number generator. So that linear separability could be ensured, no points greater than two standard deviations from the mean were allowed. The mass spectral patterns used are those 630 API spectra described previously (1). Computer programs were run on the IBM 360/75 at the Triangle Universities Computation Center and on the University of North Carolina Chemistry Department's Raytheon 706 computer.
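A data set of this kind might be generated as in the following Python sketch. Several details are assumptions made for illustration only: the names `sample_category` and `to_training_pattern`, the fixed seed, the reading of "two standard deviations from the mean" as a Euclidean distance limit enforced by rejection sampling, and the value 1.0 for the added component. The category means (1,1) and (2,2), the 100 points per category, and the standard deviation of 0.20 are the values used for the Gaussian examples in this paper.

```python
import math
import random


def sample_category(mean, sigma, n, rng, max_dev=2.0):
    """Draw n 2-dimensional Gaussian points, rejecting any point
    farther than max_dev standard deviations from the mean."""
    points = []
    while len(points) < n:
        x = rng.gauss(mean[0], sigma)
        y = rng.gauss(mean[1], sigma)
        if math.hypot(x - mean[0], y - mean[1]) <= max_dev * sigma:
            points.append((x, y))
    return points


def to_training_pattern(point, positive):
    """Append the constant component and negate negative-category
    patterns so one criterion, W.Y > delta, serves both categories."""
    y = [point[0], point[1], 1.0]
    return y if positive else [-c for c in y]


rng = random.Random(0)  # fixed seed (assumption) for repeatability
negative = sample_category((1.0, 1.0), 0.20, 100, rng)
positive = sample_category((2.0, 2.0), 0.20, 100, rng)
```

Under the distance interpretation assumed here, each category with a standard deviation of 0.20 is confined to a circle of radius 0.40 about its mean, so the two categories cannot overlap.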

DECISION SURFACE WIDTH

Geometric illustration of the concept of a width for the decision surface should make the advantages clear. To this end, a set of 200 2-dimensional points with a Gaussian distribution is used. These consist of 100 points with a mean of (1,1) denoted as the negative category and 100 points with a mean of (2,2) making up the positive category. These categories can be made to range from linearly separable to inseparable depending on the standard deviation of the points. One hundred of these points, consisting of 47 and 53 patterns from the negative and positive categories, respectively, were randomly chosen as a training set. Figure 2 shows a linearly separable case generated with a standard deviation of 0.20. The three lines labeled A, B, and C represent one trio of an infinite number of possible decision planes and were developed by increasing Δ during training. (Any of the three lines correctly separates the categories.) Decision line A resulted from a delta of zero and hence represents the threshold logic unit's criterion. Line B resulted from a requirement that all points must lie at least 0.10 unit (Δ = 0.10) from the decision surface, and line C represents the maximum delta obtained (Δ = 0.16). It is apparent that line C most evenly separates the categories and, hence, should provide the best classification of unknowns assuming the training set is a good representation of the universal set.

Figure 3. Null region resulting from maximizing delta for a separable case

Figure 3 shows the same training set with a standard deviation of 0.30 unit. The line drawn separating the categories resulted from maximizing delta to a value of 0.038. The area enclosed by the dashed lines indicates the decision surface width. Note that the sets remain linearly separable. In Figure 4, σ = 0.40, resulting in overlapping distributions. This is a case where the threshold logic unit could not converge, thereby often giving unsatisfactory prediction. However, the patterns in the overlapping region can be excluded from classification if Δ is allowed to be negative. Again the dashed lines represent a decision surface width; however, in this case the width constitutes a no-decision region, that is, patterns located in this region are not classified as wrong patterns during training, hence allowing complete convergence on all patterns outside this no-decision region. In practice, the no-decision region is minimized, or equivalently delta is maximized, by an iterative process.
In the example of Figure 4, the maximum delta obtained was -0.082 unit and prediction on those patterns located outside the decision region (95 out of the 100 patterns in the prediction set were so located) was 97.9% whereas prediction was 68.0% for the same problem with the threshold logic unit upon arbitrarily ceasing training after 500 feedbacks. It should be noted that a prediction of 68% is due to the oscillatory nature of the feedback process for inseparable categories. It is just as likely that by chance the last feedback could have placed the decision surface in a much better position, thereby giving a relatively good predictive value.

Figure 4. No-decision region resulting from maximizing delta for an inseparable case

DISCUSSION OF RESULTS WITH MASS SPECTRA

As indicated, the addition of positive delta should increase the predictive ability of W for linearly separable cases by forcing a more optimum decision surface. The results of training with a positive delta on some questions known to be linearly separable are shown in Table I. Prediction after maximizing Δ is compared with predictive ability obtained with a threshold logic unit. In each case a training set of 300 patterns was randomly chosen with the remaining 330 spectra as a prediction set. Patterns in the prediction set were classified solely according to the sign of the dot product so that any differences in prediction are attributable to different orientations of the decision surface. The positive categories are listed in the table, while the negative categories consisted of all patterns not satisfying the indicated criterion. In all cases delta was forced close to a maximum; however, it was found that if any improvement occurred in prediction, it was almost wholly realized by 95% of Δmax so that continued iteration was unnecessary. That is, trying to force Δ beyond 95% of the maximum possible value is extremely difficult, and gives little or no additional improvement in predictive ability. Table II shows the change in prediction with Δ for the cases which gave significant improvement. It is noted in Table I that overall prediction improves in all cases tried, although not as much in some cases as in others.

Table I. Comparison of Prediction Results for a Delta of Zero to Those for a Positive Delta

                                                Prediction (% correct)
Positive category                  Delta max.   Negative category   Positive category   Total
Oxygen and/or nitrogen presence    0.0          77.2                91.8                86.4
                                   6.0          83.0                93.8                89.7
Oxygen presence                    0.0          93.3                80.7                89.7
                                   4.7          95.6                87.2                93.9
C/H ratio of                       0.0          82.1                77.9                80.6
                                   0.27         82.1                79.7                81.2
>6 Carbon atoms                    0.0          95.7                90.0                92.4
                                   1.3          95.1                91.5                93.0

Table II. Prediction as a Function of Delta

Oxygen and/or nitrogen presence    Oxygen presence
Delta      Prediction              Delta      Prediction
0.0        86.4                    0.0        89.7
4.0        89.1                    2.0        91.5
5.0        90.0                    3.0        93.6
6.0        89.7                    4.0        93.6
6.125      89.1                    4.5        93.9
6.25       89.1                    4.7        93.9

A further advantage of the delta lies in the significance of the magnitude of the dot product, i.e., it is the distance of the unknown pattern from the decision plane. This provides a measure of confidence in the classification of a pattern according to this distance. Table III records illustrative results of number correct as a function of the magnitude of Δmax. Although there is considerable noise due to the small numbers, the trend is apparent. Confidence is enhanced greatly for patterns that lie further than delta from the decision surface and gets correspondingly worse as the magnitude of the dot product decreases. Note the difference in prediction inside the decision width with that outside; also the predictive values for the latter are to be compared to that obtained with the threshold logic unit (see Table I).

Table III. Confidence; Number Correct during Prediction as a Function of Distance from the Decision Surface

                    Oxygen presence                 Oxygen and/or nitrogen presence
|W·Y|               No. correct/total   Correct, %  No. correct/total   Correct, %
> delta             278/285             98          265/276             96
(0.8 to 1.0)Δ       14/17               82          14/17               82
(0.6 to 0.8)Δ       4/6                 67          6/10                60
(0.4 to 0.6)Δ       5/6                 83          8/11                73
(0.2 to 0.4)Δ       3/7                 43          6/13                55
(0.0 to 0.2)Δ       6/9                 67          2/3                 67
Total < delta       32/45               71          36/54               67
Overall             310/330             94          301/330             90.0

Table IV. Threshold Logic Unit's Prediction Compared to That of the Negative Delta on Linearly Inseparable Categories

                                    Prediction (% correct)
                                    CnH2n+2 Question            CnH2n Question
                                    Negative Δ   TLU (Δ = 0)    Negative Δ   TLU (Δ = 0)
Negative category                   91.2         83.2           93.4         72.7
Positive category                   72.4         55.3           92.2         84.5
Total                               88.0         76.8           92.8         77.1
Number of spectra not classified    55           0              121          0

The third and perhaps most important benefit of the decision surface width results upon application of a negative delta to linearly inseparable categories. The way in which this is accomplished has already been explained (Figure 4).
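The two uses of the width during prediction can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not the authors' code: the names `classify` and `distance_band` are hypothetical, W is assumed to have unit length so that W·Y is a distance, and a dot product of exactly zero is arbitrarily assigned to the negative category. With a negative delta, patterns closer than |Δ| to the decision surface are left unclassified; with a positive delta, classification uses the sign alone and |W·Y| is reported in bands like those of Table III.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def classify(w, y, delta):
    """Return +1 or -1 from the sign of W.Y, or 0 (no decision) when
    delta is negative and the pattern lies inside the no-decision region."""
    s = dot(w, y)
    if delta < 0 and abs(s) < abs(delta):
        return 0
    return 1 if s > 0 else -1


def distance_band(w, y, delta):
    """Label |W.Y| relative to a positive delta, in bands of 0.2*delta."""
    if delta <= 0:
        raise ValueError("bands are defined here only for positive delta")
    frac = abs(dot(w, y)) / delta
    if frac > 1.0:
        return "> delta"
    low = min(int(frac * 5), 4) / 5
    return f"({low:.1f} to {low + 0.2:.1f}) delta"
```

A branching-tree application could, for instance, accept classifications in the "> delta" band and flag the rest for review; that policy is a suggestion of this sketch, not something the paper prescribes.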

To illustrate the utility with the mass spectra, categories that were known to be linearly separable when all mass peaks greater than 0.5% were used were made inseparable by retention of only the 6 largest peaks in each spectrum. That is, the pattern set consisted of 630 spectra, each containing only the six most intense peaks in its mass spectrum. Surprisingly, most of the questions tried using only the 6 largest peaks were linearly separable and the two examples listed in Table IV were the only ones for which the use of a negative delta was required. However these adequately prove the superiority of the delta approach when compared with the threshold logic unit. In each case prediction is considerably better with the use of a no-decision region. The results for a C/H ratio of 1/2 are particularly striking, especially the difference between negative categories (93.4% with delta as opposed to 72.7% without delta).

RECEIVED for review October 5, 1970. Accepted February 22, 1971. Research supported by the National Science Foundation.

Systematic Studies on the Breakdown of p,p'-DDT in Tobacco Smokes
II. Isolation and Identification of Degradation Products from the Pyrolysis of p,p'-DDT in a Nitrogen Atmosphere

N. M. Chopra¹ and Neil B. Osborne
Department of Chemistry, North Carolina Agricultural and Technical State University, Greensboro, N. C. 27411

p,p'-DDT was pyrolyzed at 900 °C in a nitrogen atmosphere and the pyrolysis products were collected in pentane at -80 °C and isolated by fractional distillation and chromatography on alumina and Florisil columns. The products isolated were: p,p'-DDT, p,p'-DDE, p,p'-TDE, bis-(p-chlorophenyl)chloromethane, bis-(p-chlorophenyl)methane, p,p'-dichlorobiphenyl, α,p-dichlorotoluene, hexachloroethane, chlorobenzene, tetrachloroethylene, trichloroethylene, carbon tetrachloride, chloroform, and dichloromethane. The solid (first 8) pyrolysis products were identified by gas chromatography and IR spectrometry, and the liquid (the last 6) pyrolysis products were identified by gas chromatography and colorimetric tests. Besides these pyrolysis products p,p'-DDM, and cis- and trans-p,p'-dichlorostilbenes were also detected in the pyrolyzate. As they were present in small quantities they were not isolated but were identified by gas chromatography.

IN THE FALL of 1967 we embarked upon a systematic investigation into the breakdown of p,p'-DDT in cigarette main-stream and side-stream smokes. This is the first study of its kind and is divided into three phases: the study of the breakdown of p,p'-DDT in a nitrogen atmosphere, in p,p'-DDT treated tobacco smokes, and in the cigarette main-stream and side-stream smokes. So far we have written two papers on the results and mechanisms of the breakdown of p,p'-DDT in a nitrogen atmosphere, and in p,p'-DDT treated tobacco smokes (1, 2). In the first phase our objective was to study, from the breakdown of p,p'-DDT in an inert atmosphere such as nitrogen, the mechanisms of the breakdown of p,p'-DDT at 900 °C, the temperature of the on-puff burning zone of cigarette (3). In this study the temperature was the only parameter which corresponded to the smoking condition of a cigarette. Our choice of an inert atmosphere turned out to be fortunate since we later on found (2) that the atmosphere where the pyrolysis of pesticide takes place is inert. The results of our work on the first phase are described in our first paper (1) in Section 1. In the present paper we are reporting the methods employed in the pyrolysis of p,p'-DDT, and the isolation and identification of its pyrolysis products. The products reported are: p,p'-DDT [2,2-di-(p-chlorophenyl)-1,1,1-trichloroethane] and compounds with smaller molecular weight such as: p,p'-DDE [2,2-di-(p-chlorophenyl)-1,1-dichloroethylene], p,p'-TDE [2,2-di-(p-chlorophenyl)-1,1-dichloroethane], p,p'-DDM or p,p'-TDEE [2,2-di-(p-chlorophenyl)-1-chloroethylene], cis- and trans-4,4'-dichlorostilbenes, bis-(p-chlorophenyl)methane, bis-(p-chlorophenyl)chloromethane, α,p-dichlorotoluene (p-chlorobenzyl chloride), hexachloroethane, tetrachloroethylene, trichloroethylene, carbon tetrachloride, chloroform, and dichloromethane.

¹ Author to whom correspondence should be addressed.

EXPERIMENTAL

Materials. All solvents used were of “pure” grade, and were distilled before use. (“Pure” and “Puriss” grades refer to the quality of the reagent as mentioned on the reagent bottle. “Puriss” grade was 99.9%+ pure.) p,p'-DDT used for pyrolysis was of “Puriss” grade, and was purchased from Aldrich Chemical Co.

ALUMINA AND FLORISIL. Alcoa chromatographic alumina F-20 was purchased from Aluminum Company of America, and Florisil (mesh, 100-200) was purchased from Fisher. These were activated as described in the text.

REFERENCE COMPOUNDS. p,p'-DDT (99.9% pure) and p,p'-DDE (99.8% pure) were obtained from Geigy

(1) N. M. Chopra, J. J. Domanski, and N. B. Osborne, Beitr. Tabakforsch., 5, 167 (1970).
(2) N. M. Chopra, Proc. Second Intern. Congr. Pesticide Chem., Tel Aviv, Israel, 1971, in press.
(3) G. P. Touey and R. C. Mumpower, Tobacco Sci., 1, 33 (1957).