Received for review December 6, 1974. Accepted April 1, 1975. Certain commercial equipment, instruments, or materials are identified in this paper in order to adequately specify the experimental procedure. In no case does such identification imply recommendation or endorsement by the National Bureau of Standards, nor does it imply that the material or equipment identified is necessarily the best available for the purpose.

Generation of Mass Spectra Using Pattern Recognition Techniques

G. S. Zander¹ and P. C. Jurs

Department of Chemistry, The Pennsylvania State University, University Park, Pa. 16801

The pattern recognition technique utilizing arrays of adaptive binary pattern classifiers has been applied to the generation of mass spectra of organic compounds. A set of compounds was coded in connection table form. Then a series of automated descriptor development programs generated numerical descriptors of these types: (a) fragment descriptors, (b) substructure descriptors, (c) topological descriptors, and (d) geometrical descriptors. A total of 110 descriptors were generated for each of 600 compounds in the range C3-10H2-22O0-4N0-2. The data were autoscaled, and a series of adaptive binary pattern classifiers was trained to predict mass spectral peak presence or absence in a number of m/e positions. Predicted and actual partial mass spectra are compared, and correlations between the required descriptors and the fragments responsible for a particular mass spectral peak are discussed.

A number of different approaches have been reported for the prediction of mass spectra by calculation. Quasi-equilibrium theory (QET) can be used to provide information on the relative abundance of ions with respect to energy. Calculation of entire mass spectra is possible for simple molecules using QET. A different approach, implemented as a subprogram of DENDRAL named PREDICTOR, calculates mass spectra of possible compounds using an empirically developed theory of fragmentation pathways (1). Small compounds of the form CnH2n+uX could be handled by the original program, where X represents O, N, or S and u is the valence of X. More recent applications of DENDRAL (2) have avoided the use of PREDICTOR by using special classes of organic compounds. Another heuristic program, ION GENERATOR (IG) (3), generates plausible primary ions from an input acyclic structure representation using electron bookkeeping methods called primitive operations. This method has also been extended to cyclic structures (4).

Pattern recognition techniques have been applied to a wide variety of chemical problems (5-8), including generation of mass spectra of organic molecules from their molecular structures (9-11). This paper presents results of further studies of mass spectrum generation in which better descriptors of molecular structure were generated and a new method for ranking the descriptors by usefulness was developed.

¹ Present address: Research Laboratories, Rohm and Haas Co., Spring House, Pa. 19477.

ANALYTICAL CHEMISTRY, VOL. 47, NO. 9, AUGUST 1975

ADAPTIVE BINARY PATTERN CLASSIFIERS

Excellent discussions of the theory and methodology of pattern recognition can be found in the literature (e.g., 12-14). Applications of pattern recognition to chemistry have been reviewed (5-8). The purpose of pattern recognition as applied to the problem of mass spectrum generation is to place objects (here organic molecules) into classes defined by the appearance of peaks of sufficient intensity in specified m/e positions in their mass spectra. In order to perform such classifications, the structures of the molecules must be coded in computer compatible form. The first element in any pattern recognition system is the transducer, which performs the coding operation. In order for the coded descriptions of the molecular structures to be compatible with the remainder of the pattern recognition system, each molecule must be represented as a vector:

Xi = (x1, x2, . . . , xn)

Each entry, xj, is a number expressing a property of the structure being coded. For example, x1 could be the molecular weight in a.m.u., x2 could be the number of carbons in the molecule, x3 could be one or zero depending on whether or not the molecule contains a carbonyl group, and so on. A set of m molecules can be represented by a set of such vectors, Xi, i = 1, 2, . . . , m. Each vector representing a molecule can be viewed as a vector pointing from the origin to a point in an n-dimensional Euclidean space. A common origin for all m vectors is ensured by giving the nth component of each vector, xn, an identical value. Once a common origin is ensured, the molecules can be considered to be coded as n-dimensional points. Pattern classification applied to the present problem consists of developing and using decision surfaces in the n-dimensional space that contains the points representing the molecular structures. In the present work, only the simplest, planar decision surfaces are used. An orientation of an n-dimensional decision surface can be represented by an n-dimensional weight vector, W = (w1, w2, . . . , wn), that is perpendicular to the decision surface and whose components, wi, stand in one-to-one correspondence with the components of the patterns to be classified. A particular pattern point, Xi, is determined to be on one side of the decision surface or the other by forming the dot product between the pattern vector and the weight vector representation of the decision surface:

s = W · Xi = w1x1 + w2x2 + . . . + wnxn

If s > 0, then Xi is on the positive side of the decision surface, and if s < 0, then Xi is on the negative side. A pattern classification problem is attacked by labeling all the points in the data set being used as to class (a "yes" class and a "no" class) and then attempting to find a planar decision surface that separates the two classes. Useful decision surfaces are found by a procedure known as error correction training. A set of labeled points, termed the training set, is presented to a binary pattern classifier (BPC) that has been arbitrarily initialized. Throughout the work reported here, the weight vectors were initialized as follows:

wi = 0, i = 1, 2, . . . , n - 1, and wn = 1

The BPC classifies the points one at a time and corrects itself by moving its decision surface whenever an error in classification is committed. The BPC cycles through the training set repeatedly until it can classify correctly all the members of the training set or until training is artificially terminated. If the BPC locates its decision surface so that the points comprising the two classes are separated, then the classes are said to be linearly separable. Once this has occurred, the BPC can be tested with unknowns not used during training to derive a measure of performance termed predictive ability. After successful training of a set of BPCs, they can be used for a posteriori feature selection: investigation to find which of the descriptors are useful and which are not useful for separating the classes. This type of a posteriori feature selection is necessary because it is not possible to determine before training of the BPC which descriptors will be the most useful for the separation being attempted. The variance feature selection method used in this work has been discussed in detail elsewhere (15) and is only summarized here. A large number of weight vectors are generated while incrementing the value of xn before the training of each new weight vector. The relative variations (similar to relative standard deviations) are calculated for each descriptor and are placed in a list ranked from highest to lowest. Then, the components having the highest relative variation are removed successively until linear separability no longer exists. It is sometimes necessary to repeat this process because of the presence of more than one linear combination of descriptors which are sufficient in themselves to separate the classes, or because of the lack of a sufficient quantity of weight vectors with which to measure the relative variations accurately.
Thus, this method of feature selection can be used to find different linear combinations in addition to finding the most important of the necessary components; this can be especially useful for correlation studies. All programs used in this study were written in FORTRAN IV and were executed on The Pennsylvania State University Department of Chemistry MODCOMP II/MAX III computer system and The Pennsylvania State University Computation Center IBM 370/168 computer system.
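The error-correction training cycle described above can be sketched compactly. The paper's programs were FORTRAN IV; this Python illustration uses invented two-descriptor patterns, not descriptors from the study, and is only a toy sketch of the procedure, not the authors' code. The last component of every pattern is held constant to fix a common origin, and the weight vector is initialized as in the text: wi = 0 for i < n, wn = 1.

```python
# Toy sketch of error-correction training of a binary pattern classifier (BPC).
def train_bpc(patterns, labels, max_passes=100):
    """Error-correction training; labels are +1 ("yes") or -1 ("no")."""
    n = len(patterns[0])
    w = [0.0] * (n - 1) + [1.0]      # initialization used in the paper
    for _ in range(max_passes):
        errors = 0
        for x, y in zip(patterns, labels):
            s = sum(wi * xi for wi, xi in zip(w, x))   # s = W . X
            if s * y <= 0:           # point on the wrong side of the surface
                w = [wi + y * xi for wi, xi in zip(w, x)]  # move the surface
                errors += 1
        if errors == 0:              # whole training set classified correctly:
            return w                 # the classes are linearly separable
    return None                      # separability not demonstrated

def classify(w, x):
    """Return +1 for the positive side of the decision surface, -1 otherwise."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

# invented patterns: two descriptors plus the constant origin component
patterns = [(2.0, 1.0, 1.0), (1.5, 2.0, 1.0), (-1.0, -2.0, 1.0), (-2.0, -0.5, 1.0)]
labels = [1, 1, -1, -1]
w = train_bpc(patterns, labels)
```

Because the toy classes are linearly separable, training terminates with a weight vector that classifies all four points correctly; unseen points can then be classified to estimate predictive ability.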

DATA SET

The small organic compounds with associated mass spectra used in this study were obtained from a tape containing 2261 spectra from the American Petroleum Institute Research Project 44. A subset of 600 molecules representing a wide variety of compound classes was selected in the range C3-10H2-22O0-4N0-2. Each original spectrum is in the form of digitized intensities in the range of 0.01 to 100.0%, normalized so that the most intense peak in each spectrum has an intensity of 100%. For this work, intensities have been expressed as percentage of the total ion current in the entire spectrum. In the remainder of this paper, an intensity

of 1.0% for a peak means one percent of the total ion current in the spectrum. This method of coding spectral intensities was shown in experiments, not reported here, to be more effective than the original coding.
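The intensity recoding just described can be sketched in a few lines: API spectra arrive normalized so the base peak is 100%, and each peak is re-expressed as a percentage of the total ion current. The spectrum below is invented for illustration, not taken from the API tape.

```python
# Sketch of converting base-peak-normalized intensities to percent of
# total ion current (%TIC), as described in the text.
def to_percent_tic(spectrum):
    """spectrum: dict mapping m/e -> base-peak-normalized intensity (%)."""
    total = sum(spectrum.values())           # total ion current
    return {mz: 100.0 * inten / total for mz, inten in spectrum.items()}

spec = {43: 100.0, 29: 60.0, 58: 40.0}       # hypothetical, base peak at m/e 43
tic = to_percent_tic(spec)                   # {43: 50.0, 29: 30.0, 58: 20.0}
```

After conversion the intensities of any spectrum sum to 100%, so a 0.5% cutoff means the same fraction of the ion current for every compound.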

AUTOMATED DESCRIPTOR DEVELOPMENT

In previous attempts to use pattern recognition for mass spectra generation, the coding of the molecular structures was done largely by hand (9-11). Very few automated methods were available during those studies. Unfortunately, coding a large number of molecules manually can introduce errors. This poses a very real problem to the pattern recognition system because it can be very error sensitive. New unknown data of the prediction set are classified on the basis of previous known data in the training set, and therefore errors in the knowns will affect the classification performance. In an extreme case, errors may even be sufficient to destroy linear separability of the classes. It was therefore hoped that automated methods of descriptor development would diminish the number of errors in addition to removing a large burden from the researcher.

The first problem encountered in attempting to generate descriptors by computer is that of representing chemical structures in a computer compatible form. The most common methods include linear notations, such as Hayward, Wiswesser, and IUPAC, and topological or coordinate representations, such as connection tables and augmented connection tables (16). Although linear notations are extremely compact and meaningful representations of molecules, connection tables are easier to generate and manipulate in a computer environment. A connection table is an explicit symmetric matrix representation of the atoms and bonds contained in a molecule. Although unique numbering systems are available, the atoms are usually numbered arbitrarily since no necessary information is lost. An additional table is provided when necessary to specify ring size information. Presence of an atom in more than one different-size ring is indicated, but presence in more than one identical-size ring is not, since this is redundant information for the purposes for which connection tables are used in this study.
Although the information necessary to generate a connection table was coded manually for this application, computer programs have been developed which allow this information to be computer generated from a two-dimensional model of the molecule by use of interactive graphics (17). A special convention was adopted for representing non-specified atoms of a substructure (i.e., X-C-X where X is not specified), where an atom type of zero is used. Stereochemical information is not included, as this would require additional augmentation of the table or additional tables.

Substructure Descriptors. One method for coding molecular structures is to define explicitly a number of substructures and then note whether or not they are embedded in the molecule being coded. This information can be inserted into the pattern vectors as descriptors, xj, having values of one for presence or zero for absence of the substructure. For example, in each pattern vector representing a molecule, the forty-third entry, x43, could be one or zero according to whether or not the substructure >C=O is present. These types of descriptors are called substructure descriptors. The first automated descriptor development routine implemented was a substructure searching routine to automate the generation of substructure descriptors. The method used in this study is a variation of the graph theoretic set reduction technique developed by Sussenguth (18). This algorithm is slightly more complex than atom-by-atom techniques but is extremely rapid since factorial calculations are either avoided or vastly reduced. A modified form of the original algorithm was developed to allow greater specificity of the substructure being searched, a wider variety of substructural types, and numeric instead of simple binary searches.

The basic premise used in the set reduction technique is as follows: if a molecule is represented as a graph where atoms are designated as nodes and bonds as branches connecting pairs of nodes, then the subset of nodes of the structural graph which exhibit some property must correspond to the subset of nodes of the substructural graph which exhibit the same property if the substructure is present in the structure. This can be represented by the following equation:

[X : X has property p] = [X* : X* has property p]

where X is a structural node and X* is a substructural node. Although the elements of the subset of structural nodes will correspond to the elements of the subset of substructural nodes, the numbers of elements in the two subsets need not be equal; the structural subset may be larger than the substructural subset. The common properties utilized by the algorithm to produce the corresponding pairs of subsets can be easily obtained from the connection table. These are called set generators and include NODE (type of atom), BRANCH (type of bond), RINGOR (ring order or size), BOND (number of bonds), DEGREE (number of atoms attached), and SECDEG (number of first and second most adjacent atoms attached). The absence of the substructure in the structure is quickly recognized during set generation by the production of an empty subset for the structure corresponding to a nonempty subset for the substructure. If the above condition is not met, then the algorithm attempts to produce a one-to-one correspondence between the nodes of substructure and structure by a variety of methods. The first step, called partitioning, involves taking the intersection of all subsets having a common atom. It is hoped that only this atom and the corresponding structural atom will have the set generating properties which they represent; the intersect subset will then contain only this atom, and a one-to-one correspondence will be produced. If this fails, then the second step, called connectivity, is employed, which generates new subsets containing the atoms connected to the atoms in each original subset. This process is repeated upon the subsets most recently produced by connectivity until no new unique subsets can be produced. Partitioning is then used to see if the resulting set intersections are single membered. If this is not the case, then the connectivity and partitioning process is repeated until no new unique set intersections can be produced. The lack of a one-to-one correspondence between substructural and structural subsets at this point in the algorithm is due to one of two possibilities: first, the set gener-

Table I. Substructure Search Results in 1-Cyclopropyl-3-cyclohexyl-propane Using the Modified Algorithm

[The body of Table I is not recoverable from the extraction; its columns are search type, substructure, and search result.]

Table II. Search Times Using the Modified Algorithm for Substructure Searching

[The body of Table II is not recoverable from the extraction; its columns are search type, substructure, structure (e.g., cyclohexane, methylcyclohexane), and relative search time in arbitrary units.]
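The set-generation screen of the set-reduction search described above can be sketched as follows. Only two set generators are modeled here (NODE, the atom type, and DEGREE, the number of attached atoms): every property value exhibited by a substructure node must yield a non-empty subset of structure atoms, or the substructure is provably absent. The partitioning and connectivity refinements that establish a one-to-one node correspondence are omitted, so a True result is only a screen pass, not a proof of embedding. The molecules are invented and coded as hydrogen-suppressed adjacency lists standing in for connection tables.

```python
# Hedged sketch of the quick-rejection (set generation) step of a
# set-reduction substructure search.
def node_properties(adjacency, atoms):
    """adjacency: {atom: [neighbors]}; atoms: {atom: element symbol}."""
    return [(atoms[i], len(adjacency[i])) for i in adjacency]  # (NODE, DEGREE)

def may_contain(structure, substructure):
    struct_props = node_properties(*structure)
    for node, degree in node_properties(*substructure):
        # DEGREE in the substructure is a lower bound: an embedding may
        # attach additional neighbors to the matching structure atom.
        if not any(n == node and d >= degree for n, d in struct_props):
            return False   # empty structure subset: substructure absent
    return True            # screen passed (not yet a proof of presence)

propane = ({0: [1], 1: [0, 2], 2: [1]}, {0: "C", 1: "C", 2: "C"})
cc_chain = ({0: [1], 1: [0]}, {0: "C", 1: "C"})
carbonyl = ({0: [1], 1: [0]}, {0: "C", 1: "O"})

passes = may_contain(propane, cc_chain)   # True: C-C screen passes
absent = may_contain(propane, carbonyl)   # False: propane has no oxygen subset
```

The empty-subset test is what makes set reduction fast: most absent substructures are rejected here, before any node-by-node matching is attempted.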


The total strain energy minimized by the model building routine is the sum

Etotal = Eangle + Ebond + Ehybrid + Etorsional + Enon-bond

where Eangle is the strain due to bond angles, Ebond is the strain due to bond lengths, Ehybrid is the strain due to geometry about carbon atoms, Etorsional is the torsional strain about single and double bonds, and Enon-bond is the strain due to non-bonded interactions. The minimization routine is different from the original algorithm and is based on an adaptive pattern search technique (21). Basically, a trial-and-error approach is used to change the coordinates of each atom in an attempt to minimize the total strain energy. The molecular model building program requires approximately 20K 16-bit words of storage on the MODCOMP II computer and is fast with respect to other more exact model building programs, but slow when compared to other descriptor development routines. Typical times required for generation of good to excellent models of the molecules used in this data set ranged between 1.5 and 2.5 minutes per molecule on the MODCOMP II (800-nanosecond cycle time). The program can use either a two-dimensional model as input through an interactive graphics routine, or a connection table input. In the case of connection table input, a small routine is used to generate starting coordinates from the table. Output from the program is in the form of a coordinate matrix representing the three-dimensional structure of the molecule in angstroms.

A radius of gyration routine was developed to convert the coordinate matrix representation of the molecule into useful descriptors; in this case, the principal axes or moments of the molecule. The distances of each atom from the center of mass are calculated and used to construct the symmetric tensor of gyration. This matrix is then diagonalized using the Jacobi method of successive plane rotations. The eigenvalues obtained are tested for consistency since poorly conditioned matrices can produce truncation errors in the eigenvalues. If such errors exist, then a second method is employed, which involves generating the characteristic polynomial and solving for its roots by the Graeffe root-squaring technique. The eigenvalues or roots obtained by either of these methods are real numbers which are multiplied by a scaling factor and truncated to integer values. The multiplication factor is chosen so that the information lost by truncation is minimized. Table III shows the geometrical descriptors used (numbers 88-93). X, Y, and Z represent the longest, second longest, and shortest of the principal axes, respectively. The ratios Y/X, Z/X, and Z/Y were also used.

Topological Descriptors. Substructural and fragment descriptors contain a wealth of structural information and
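The radius-of-gyration computation described above can be sketched as follows: build the symmetric gyration tensor from mass-weighted coordinates about the center of mass and read the principal moments off its eigenvalues. This is a hedged illustration, not the paper's routine: numpy's symmetric eigensolver stands in for the Jacobi plane-rotation method, the coordinates and masses are invented rather than model-builder output, and the molecule is assumed non-linear so that the ratios are defined.

```python
# Sketch of geometrical descriptors from a gyration tensor.
import numpy as np

def gyration_descriptors(coords, masses):
    coords = np.asarray(coords, dtype=float)
    masses = np.asarray(masses, dtype=float)
    com = masses @ coords / masses.sum()      # center of mass
    r = coords - com                          # coordinates about the c.o.m.
    tensor = (masses[:, None] * r).T @ r      # symmetric 3x3 gyration tensor
    x, y, z = sorted(np.linalg.eigvalsh(tensor), reverse=True)
    return x, y, z, y / x, z / x, z / y       # X, Y, Z and the three ratios

coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.5, 0.0)]  # planar toy chain
masses = [12.0, 12.0, 12.0]
x, y, z, yx, zx, zy = gyration_descriptors(coords, masses)
```

For the planar toy molecule the shortest moment Z is zero, which mirrors how the descriptors encode overall shape (elongated, flat, or globular).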


Table III. Descriptor List

[The body of Table III is not recoverable from the extraction. The recoverable fragments indicate that descriptors 1-78 are substructure and fragment descriptors (carbon, oxygen, and nitrogen fragments, each with an associated descriptor type code); descriptors 79-87 are whole-molecule counts (numbers of single, double, triple, and benzenoid bonds; numbers of carbon, oxygen, nitrogen, and hydrogen atoms; molecular weight); descriptors 88-93 are the geometrical descriptors X, Y, Z, Y/X, Z/X, and Z/Y; and descriptors 94-110 are augmented environmental descriptors for atom-centered substructures.]

Note: the symbol - designates a benzenoid bond.

in some cases are sufficient in themselves to generate the total structure which they represent. In most cases, however, information is lacking in regard to the manner in which these descriptors are connected to form the complete molecule. It is expected that additional descriptors which define these connections would be of use in providing information to the pattern classifier. A recent study (22) realized this expectation by the use of special topological descriptors, denoted environmental descriptors, in combination with binary substructural and numeric fragment descriptors. Environmental descriptors are a numerical representation of the first and second most adjacent atoms to a given atom, using information available from a connection table. Three types of environmental descriptors exist: 1, bonded; 2, weighted; and 3, augmented. Augmented descriptors were chosen for this case and include information on bond type and atom type around a specified atom-centered substructure. This can be represented as follows:

descriptor = Σ (j = 1 to m) Σ (i = 1 to n) (bond type)i × (associated atom type)i

where n represents the total number of first and second most adjacent bonds to atom aj, and m represents all atoms present in the structure matching the atom-centered substructure of interest. The numbers representing the bond type and atom type are the same as those used for connection tables.
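An augmented environmental descriptor of the kind just described can be sketched as follows. This is a hedged illustration built on the reconstructed sum above: for every structure atom matching an atom-centered substructure (here simply "a carbon atom"), it accumulates bond type times attached-atom type over the first and second most adjacent bonds. The numeric atom codes, the bond codes (1 = single, 2 = double), and the acetaldehyde-like example are invented stand-ins for the codes carried in a real connection table.

```python
# Hedged sketch of an augmented environmental descriptor.
ATOM_CODE = {"C": 1, "O": 2, "N": 3}   # invented numeric atom-type codes

def environ_descriptor(bonds, atoms, center="C"):
    """bonds: {(i, j): bond_type}; atoms: {i: element symbol}."""
    nbrs = {}
    for (i, j), t in bonds.items():
        nbrs.setdefault(i, []).append((j, t))
        nbrs.setdefault(j, []).append((i, t))
    total = 0
    for a, elem in atoms.items():
        if elem != center:
            continue                    # atom does not match the substructure
        seen = {a}
        for b, t1 in nbrs.get(a, []):   # first most adjacent bonds
            total += t1 * ATOM_CODE[atoms[b]]
            seen.add(b)
            for c, t2 in nbrs.get(b, []):   # second most adjacent bonds
                if c not in seen:
                    total += t2 * ATOM_CODE[atoms[c]]
    return total

# heavy atoms of acetaldehyde: C(0)-C(1)=O(2)
value = environ_descriptor({(0, 1): 1, (1, 2): 2}, {0: "C", 1: "C", 2: "O"})
```

Each matching carbon contributes its bonded and next-nearest environment, so the descriptor grows with both the number of matches and the richness of their surroundings.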

Table IV. Complete Descriptor Listing for Isobutyl Propanoate

Descriptor number / value pairs (reconstructed from the extraction):

1/3   2/2   3/1   5/2   6/1   9/1   10/1  11/1  16/1
17/1  18/1  23/1  24/1  28/1  34/1  35/1  38/1  39/1
42/1  43/1  58/1  62/1  63/1  65/1  66/1  79/7  80/1
83/7  84/2  86/14 87/130 88/37 89/7  90/4  91/19 92/10
93/53 94/23 95/26 96/7  99/17 103/10 106/11

The augmented descriptors used are shown in Table III (numbered 94-110). These descriptors are computer generated from connection tables representing the structure and the atom-centered substructure. The same program used in the previous study was used with minor modifications. The compound containing the largest number of nonzero descriptor values, isobutyl propanoate, was selected as an example. The descriptor numbers and values are shown in Table IV. It should be noted that the numerical values given for the geometrical descriptors are not in conventional units (such as angstroms). Each of the 600 molecules of the data set is represented by a list of the values of the 110 descriptors, which altogether form a 110-dimensional point for each molecule. (Of course, a great many of the descriptors have the value of zero.)

Preprocessing of the Data Set. Prior to passing the data to the remainder of the pattern recognition system, an autoscaling operation was performed. Autoscaling consists of modifying the values of the descriptors so that the average value for each descriptor is zero and the standard deviation is unity. The procedure is also known as standardization of the variables. The equation used is

x'ij = (xij - x̄j) / σj

where xij is the value of descriptor j for compound i, x̄j is the mean of descriptor j over all compounds, and σj is the standard deviation of descriptor j over all compounds. The values obtained are then multiplied by 20, causing the standard deviation for each descriptor to become 20. The net effect of this procedure is to weight each descriptor equally and to increase the convergence speed of the training algorithm.

RESULTS AND DISCUSSION

Binary Pattern Classifier Generation. Each BPC is trained to detect the presence of a mass spectral peak of sufficient intensity in a particular m/e position. The data set is partitioned according to whether each compound's


Table V. Feature Selection and Prediction Using 150-Membered Training Sets

[The row-by-row body of Table V is not recoverable from the extraction. Its columns are: peak (m/e), percent cutoff, number of components retained, percent predictive ability, percent over random, reference, and the numbers of compounds in the positive class, the negative class, and in total. Rows cover m/e positions 29 through 136, most at the 0.5% cutoff, with additional rows at the 0.1% and 1.0% cutoffs for eleven positions.]

mass spectrum has a peak of sufficient intensity in the m/e position of interest ("yes" class) or not ("no" class). An intensity cutoff of 0.5% of the total ion current was used for most of the BPCs trained. BPCs were trained using this 0.5% cutoff for all the m/e positions listed in Table V. In addition, BPCs corresponding to intensity cutoffs of 0.1% and 1.0% were generated for 11 m/e positions. These positions were selected to ensure that there was an adequate number of positive class compounds with peaks equal to or greater than the 1.0% cutoff value. A random number generator was used to select a training set of 150 compounds; the remaining 450 compounds composed the prediction set. Two criteria were used to determine whether any compound of either set should be eliminated from consideration. The first test is due to a lack of information in the API mass spectra. Peaks 29 through 37 are missing for certain compounds, probably because such information for relatively large compounds is considered unimportant. Of the total 600 compounds, 26 qualify for removal on the basis of this criterion for m/e positions 29, 30, and 31. This number is lowered to two compounds for peak 37 and to zero for the remaining peaks. The second criterion involves the removal of compounds whose molecular weight plus one is less than the m/e value for which the BPC is being trained. It is obvious that such compounds could not have a peak there, and unnecessary weight might therefore be placed on the molecular weight descriptor. Peaks 29 through 41 are unaffected by this procedure, since the lowest molecular weight in the data set is 40. A steady decrease in the total number of compounds results as the m/e position increases, until at peak 136 only 68 compounds remain in the training and prediction sets combined. Table V lists the number of compounds for each class after elimination using the two criteria.
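The two elimination criteria described above can be sketched as a simple filter applied before training a BPC for a given m/e position: (a) drop compounds whose API spectra are missing peak data in the m/e 29-37 range when the target position falls there, and (b) drop compounds whose molecular weight plus one is below the target m/e, since they cannot show the peak. The compound records below are invented, not entries from the paper's data set.

```python
# Sketch of the two compound-elimination criteria from the text.
def eligible(compound, target_mz):
    if 29 <= target_mz <= 37 and compound["low_mz_missing"]:
        return False   # criterion (a): low-m/e peaks absent from the tape
    if compound["mol_weight"] + 1 < target_mz:
        return False   # criterion (b): peak impossible at this m/e
    return True

compounds = [
    {"name": "a", "mol_weight": 40, "low_mz_missing": False},
    {"name": "b", "mol_weight": 128, "low_mz_missing": True},
    {"name": "c", "mol_weight": 60, "low_mz_missing": False},
]
kept_for_29 = [c["name"] for c in compounds if eligible(c, 29)]   # ['a', 'c']
kept_for_70 = [c["name"] for c in compounds if eligible(c, 70)]   # ['b']
```

Applying the filter per m/e position is what produces the steady decline in class sizes toward the high-mass end of Table V.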
Once the membership of the training and prediction sets was established for a particular m/e position and intensity threshold, a BPC was trained using the error correction procedure described above. After the BPC was fully trained, and linear separability was thereby established, predictive ability was determined. This was done by classi-

[Table V, continued; the remaining rows are not recoverable from the extraction.]

fying the members of the prediction set, none of which had been seen by the classifier, and recording the percentage of correct classifications. Feature selection was then performed to identify the minimum number of descriptors that afforded linear separability and yielded high predictive ability for each BPC. The method used was the variance feature selection described above. This involved developing a list of the descriptors ranked by utility for the separation being done, followed by discarding of the descriptors thereby identified as least useful. After discarding a fraction of the descriptors, each BPC was trained with its same training set but with each member containing fewer descriptors. If training was successful, then feature selection was done again, and so on, until a minimum number of descriptors remained. Column three of Table V reports the number of descriptors retained out of 110 for each BPC. Column four reports the predictive abilities of the BPCs developed with the training sets whose members contained the reduced number of descriptors. Column five reports the difference between the actual predictive ability attained and the prediction percentage that would be attained if the more populous class were always guessed. For example, the BPC trained to detect peaks in m/e position 43 with an intensity cutoff of 0.5% completed the cycle of training and feature selection with only twenty descriptors surviving out of the original 110. The BPC trained with these twenty descriptors displayed a predictive ability of 89.1%. If prediction were done by classifying all members of the prediction set in the more populous positive class, a predictive ability of 74.9% would have been attained. Thus, the predictive ability is said to be 14.2% above random guessing. All BPCs trained were able to demonstrate linear separability.
It should be pointed out that the number of compounds in the training sets is always greater than or equal to three times the number of descriptors per molecule after feature selection. The range of predictive abilities reported in Table V is wide, from 70.9 to 96.9%. The average predictive ability for all 82 BPCs was 87.6%, and the average above random was 10.1%.

ANALYTICAL CHEMISTRY, VOL. 47, NO. 9, AUGUST 1975

Table VI. Feature Selection and Prediction Using 590-Membered Training Sets

Peak    Percent   Compounds    Descriptors   Percent
(m/e)   cutoff    eliminated   remaining     predictive ability
29      0.1       0            33            100
29      0.5       0            46            80
29      1.0       0            52            100
39      0.1       0            25            100
39      0.5       0            29            100
39      1.0       0            33            100
41      0.1       0            34            100
41      0.5       4            57            100
41      1.0       4            47            100
42      0.1       2            38            90
42      0.5       0            49            90
42      1.0       64           59            60
43      0.1       6            55            100
43      0.5       2            56            80
43      1.0       6            59            90
51      0.1       28           37            90
51      0.5       16           47            100
51      1.0       6            72            70
53      0.1       0            37            100
53      0.5       22           44            100
53      1.0       8            63            90
55      0.1       28           41            100
55      0.5       28           66            80
55      1.0       28           55            80
67      0.1       22           48            80
67      0.5       16           47            80
67      1.0       22           57            80
77      0.1       0            42            100
77      0.5       4            48            90
77      1.0       2            47            100
79      0.1       8            47            70
79      0.5       16           50            90
79      1.0       2            52            80
Average           10           48            90

Computer Generation of Partial Mass Spectra. Predictive abilities depend strongly on how representative the training set is of the entire data set. In an actual laboratory situation, one would use all available data in order to ensure that the data set was as representative of the universal set as possible. Unfortunately, one is usually limited by storage and time considerations in the computer system used for BPC generation. An attempt was made to simulate this situation by dividing the data set randomly into a training set of 590 and a prediction set of 10. Then 33 BPCs were trained: one BPC for an intensity cutoff of 0.1%, one for 0.5%, and one for 1.0% for each of eleven m/e positions. The results are shown in Table VI. Before the training of each BPC, the training set members that (a) did not have peaks at m/e positions 29 through 37, or (b) had a smaller molecular weight than the m/e position for which the BPC was to be trained, were discarded. (These are the same criteria invoked with the training sets of 150 used above.) All 110 descriptors were used to code each molecular structure, and the 33 BPCs were trained. In the cases where the training algorithm could not find a separating decision surface in a reasonable amount of time, the molecules most responsible for the difficulty in training were removed from the training set. Column three reports the number of compounds that were eliminated to attain linear separability. Once the BPCs could be trained, variance feature selection was

[Figure 1. Real and artificial 11-peak mass spectrum of 2,2,5-trimethylhexane]

used to reduce the number of descriptors to the minimum shown in column four for each BPC. The predictive ability attained for each BPC is shown in column five; the average is 90%. The predictive abilities shown in Table VI are based on the use of the ten unknowns originally removed from the entire data set. These ten compounds were randomly selected and represent a wide variety of the many types of compounds present. It is interesting to note that m/e position 42 with an intensity cutoff of 1.0% displayed the lowest predictive ability and correspondingly had the highest number of compounds removed before linear separability could be obtained. The inability of the descriptors used to represent a large number of compounds in the training set would also be expected to apply to the compounds present in any prediction set. The partial mass spectra shown in Figures 1-5 represent the real and artificial spectra for five of the ten compounds utilized as unknowns. The artificial spectra were calculated in the following manner. The 0.1% intensity level BPC is used to calculate the dot product for all m/e positions. Those peaks yielding a positive value are then tested with the 0.5% intensity level BPC for each respective peak. If the dot product is negative or zero, then the intensity is set to 0.1%; if the dot product is positive, then the 1.0% BPC is utilized. This results in artificial spectra with peaks having intensity levels of 0.0, 0.1, 0.5, and 1.0%, corresponding to predicted peak intensities in the ranges 0.0% ≤ I < 0.1%, 0.1% ≤ I < 0.5%, 0.5% ≤ I < 1.0%, and I ≥ 1.0%, respectively, where I is the predicted intensity level. The real mass spectra, taken from the API Research Project 44 tables, are plotted in the same way so that direct visual comparison can be made. In general, the figures show a high degree of similarity. The results obtained indicate a high potential applicability of this method as an analytical tool.
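The intensity cascade just described reduces to a short decision routine. The sketch below assumes each trained BPC is represented by a weight vector whose dot product with the (autoscaled, bias-augmented) descriptor vector gives its decision; the function and variable names are hypothetical, not from the original programs.

```python
import numpy as np

def predict_peak_level(x, w01, w05, w10):
    """Cascade of BPC dot products for one m/e position, as described:
    the 0.1% BPC gates peak presence, then the 0.5% and 1.0% BPCs
    refine the predicted intensity level."""
    if np.dot(w01, x) <= 0:
        return 0.0        # no peak predicted (I < 0.1%)
    if np.dot(w05, x) <= 0:
        return 0.1        # 0.1% <= I < 0.5%
    if np.dot(w10, x) <= 0:
        return 0.5        # 0.5% <= I < 1.0%
    return 1.0            # I >= 1.0%

def artificial_spectrum(x, bpcs):
    """bpcs: mapping {m/e: (w01, w05, w10)} of trained weight vectors.
    Returns the plotted intensity level for each m/e position."""
    return {mz: predict_peak_level(x, *ws) for mz, ws in bpcs.items()}
```

Running the cascade over every m/e position for which classifiers exist yields the four-level artificial spectrum that is plotted against the real one.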
Given a set of possible compounds for an unknown mass spectrum, artificial mass spectra could be calculated and compared to the unknown. The compound associated with the artificial spectrum which compared most closely would then be suspected to be the compound producing the unknown spectrum. If BPCs were generated for more m/e positions, the usefulness of this approach could be considerably extended.

[Figure 2. Real and artificial 11-peak mass spectrum of 1,1,3-trimethylcyclopentane (C8H16)]

[Figure 3. Real and artificial 11-peak mass spectrum of tert-butylbenzene]

[Figure 4. Real and artificial 11-peak mass spectrum of di(2-butoxy)methane (C9H20O2)]

[Figure 5. Real and artificial 11-peak mass spectrum of 2,2-dimethylpyrrole]

Descriptor Correlation. Since the learning machine is attempting to describe the results of processes occurring inside a mass spectrometer, correlations should exist between the required descriptors and the fragments responsible for a given m/e position. One must be careful when investigating such correlations, since descriptors adequately describing the data set in use may not necessarily be sufficient to describe the universal set. The variance feature selection method offers an excellent means to determine descriptor correlations. Since descriptors having the lowest relative variation are expected to be those most useful to the learning machine for separating the classes, the following discussion will deal with the descriptors ranked highest in importance by the variance feature selector. Mass spectral peaks are most commonly caused by the appearance of an odd- or even-electron positively charged

Table VII. Highest Ranked Descriptors for Selected Peaks

[The descriptor structure diagrams of Table VII cannot be reproduced here. For each peak/cutoff pair — 30 (0.5%), 40 (0.5%), 45 (0.5%), 55 (0.1 and 0.5%), and 77 (0.5 and 1.0%) — the table lists the highest-ranked descriptors and their types: fragment, acyclic substructure, acyclic and cyclic substructure, cyclic (6-membered ring) substructure, and environmental descriptors.]

fragment having a mass corresponding to the m/e position value, a positively charged ion corresponding to loss of a neutral species from the molecular ion, or a positively charged fragment containing 13C or 15N isotopes. The intensity cutoffs employed should be sufficiently above peaks due to isotopic effects that these can be neglected. Correlations will be attempted with the former, positively charged fragments, since the majority of peaks are due to these species and they are the easiest to study. It must be remembered that the learning machine is attempting to describe members of both the positive and negative classes, resulting in both positive and negative correlations. Table VII lists the most important descriptors for selected m/e positions and intensity levels. The descriptor, descriptor type, and a descriptor number are listed, with benzenoid bonds indicated by parallel continuous and dotted lines. Descriptors numbered 1 are most important; that is, they have the lowest relative variation. The first three peaks are chosen from data obtained for the BPCs generated using 150-membered training sets. The most common fragments responsible for peak 30 are CH4N and NO (23). Descriptor 2 provides information concerning the presence of a nitrogen attached to the methylene substructure. Information concerning tri- and tetra-bonded carbon should also be available from this descriptor, indicating the likelihood of cleavage. Descriptor 4 should be of considerable use concerning peaks due to NO, since a nitrogen atom adjacent to the oxygen-centered substructure would ordinarily produce a much higher numerical value for this descriptor than would a carbon. The remaining two descriptors do not correlate as well and are probably useful


in describing the negative class. Peak 40 is due primarily to the presence of the C2H2N ion. The first three descriptors correlate extremely well, and the fourth, molecular weight, is probably more data-set dependent. Descriptors 1 and 3 of peak 45 correlate well with the primary ion responsible, C2H5O. Descriptors 2 and 4 may be useful in providing cleavage probabilities associated with the formation of this ion. The descriptors listed for peaks 55 and 77 are obtained from the results of the study using training sets of 590 compounds. One of the primary ions responsible for m/e position 55 is C4H7, which is formed by simple cleavage from alkenes or cycloalkanes and by the following rearrangement:

RCOOC4H9 → RC(OH)2 + C4H7+

Another ion of importance is C2H3CO, formed from alkenyl and cycloalkyl carbonyls, cyclic alcohols, and ethers. All descriptors for both intensity levels correlate well, with the exception of molecular weight. This again may be an artifact of the data set. Two fragments responsible for m/e position 77 are C6H5 from phenylalkyls and C5H3N from pyridyl-type compounds. It is interesting to note the number of benzenoid descriptors considered necessary for both intensity levels. Descriptors 3 and 4 of intensity cutoffs 0.5 and 1.0%, respectively, are probably related to the type of alkane substitution on the benzene or pyridine ring. Descriptor 4 of the 0.5% cutoff may be useful in describing the negative class; that is, the lack of a singly bonded nitrogen environment is important for a compound displaying a peak at m/e position 77.

Much information can be obtained from the learning machine, as evidenced by the above correlations. However, the method often does not utilize a simple combination of descriptors, and in such cases the correlations are extremely difficult to elucidate. The results suggest that the variance method of feature selection can be extremely useful in deciding which of the large number of descriptors necessary for linear separability contribute most to class distinction.

CONCLUSIONS

The use of descriptors which can be computer generated from a connection table representation of a molecule appears to be a viable approach to the artificial generation of mass spectra. Although predictive abilities using small training sets are in the same ranges as those obtained in a previous study, the descriptors employed appear to be more representative of the data classes, as evidenced by the linear separability observed for all BPCs. The partial mass spectra obtained by the use of large training sets, with the resultant overall predictive ability, indicate the potential usefulness of this technique for analytical applications. Generation of binary pattern classifiers for more m/e positions and a larger number of intensity cutoffs should extend the range of this method. The high degree of correlation between the descriptors ranked most important by the variance method of feature selection and the phenomena which the learning machine is attempting to duplicate provides an excellent reason for future applications of this technique to other chemical problems which are perhaps less well understood.

ACKNOWLEDGMENT

We express our appreciation to Andrew J. Stuper for the use of the Environmental Descriptor routine, and William E. Brugger for the use of the Molecular Model Building routine.

LITERATURE CITED

(1) J. Lederberg, G. L. Sutherland, B. G. Buchanan, E. A. Feigenbaum, A. V. Robertson, A. M. Duffield, and C. Djerassi, J. Am. Chem. Soc., 91, 2973 (1969).
(2) A. Buchs, A. B. Delfino, A. M. Duffield, C. Djerassi, B. G. Buchanan, E. A. Feigenbaum, and J. Lederberg, Helv. Chim. Acta, 53, 1394 (1970).
(3) A. B. Delfino and A. Buchs, Helv. Chim. Acta, 55, 2017 (1972).
(4) A. B. Delfino and A. Buchs, Org. Mass Spectrom., 9, 459 (1974).
(5) P. C. Jurs and T. L. Isenhour, "Chemical Applications of Pattern Recognition", Wiley-Interscience, New York, 1975.
(6) T. L. Isenhour, B. R. Kowalski, and P. C. Jurs, CRC Crit. Rev. Anal. Chem., 4, 1 (1974).
(7) B. R. Kowalski, in "Computers in Chemical and Biochemical Research", Vol. 2, C. E. Klopfenstein and C. L. Wilkins, Ed., Academic Press, New York, 1974.
(8) T. L. Isenhour and P. C. Jurs, Anal. Chem., 43 (10), 20A (1971).
(9) J. Schechter and P. C. Jurs, Appl. Spectrosc., 27, 30 (1973).
(10) J. Schechter and P. C. Jurs, Appl. Spectrosc., 27, 225 (1973).
(11) P. C. Jurs, in "Computer Representation and Manipulation of Chemical Information", W. T. Wipke et al., Ed., Wiley-Interscience, New York, 1974.
(12) N. J. Nilsson, "Learning Machines", McGraw-Hill, New York, 1965.
(13) H. C. Andrews, "Introduction to Mathematical Techniques in Pattern Recognition", Wiley-Interscience, New York, 1972.
(14) W. S. Meisel, "Computer-Oriented Approaches to Pattern Recognition", Academic Press, New York, 1972.
(15) G. S. Zander, Masters Thesis, Department of Chemistry, The Pennsylvania State University, 1974.
(16) "Computer Handling of Chemical Structure Information", M. F. Lynch et al., Ed., American Elsevier, New York, 1971.
(17) R. J. Feldman and D. Koniver, J. Chem. Doc., 11, 154 (1971).
(18) E. H. Sussenguth, Jr., J. Chem. Doc., 5, 36 (1965).
(19) E. M. Engler, J. D. Andose, and P. von R. Schleyer, J. Am. Chem. Soc., 95, 8005 (1973).
(20) W. T. Wipke, in "Computer Representation and Manipulation of Chemical Information", W. T. Wipke et al., Ed., Wiley-Interscience, New York, 1974.
(21) E. S. Buffa and W. H. Taubert, "Production-Inventory Systems: Planning and Control", rev. ed., R. D. Irwin, Inc., Homewood, Ill., 1972.
(22) A. J. Stuper and P. C. Jurs, J. Am. Chem. Soc., 97, 182 (1975).
(23) F. W. McLafferty, "Interpretation of Mass Spectra", W. A. Benjamin, Inc., New York, 1967.

RECEIVED for review December 11, 1974. Accepted May 1, 1975. Financial support of the National Science Foundation is gratefully acknowledged.

Source Identification of Oil Spills by Pattern Recognition Analysis of Natural Elemental Composition

D. L. Duewer and B. R. Kowalski
Laboratory for Chemometrics, Department of Chemistry, University of Washington, Seattle, Wash. 98195

T. F. Schatzki
United States Department of Agriculture, Agricultural Research Service, Western Regional Research Laboratory, Berkeley, Calif. 94710

Pattern recognition techniques are applied to the problem of determining the source of an oil spill, using its elemental composition as analyzed by neutron activation analysis, when the field sample has undergone weathering. The utility of various elements for making this determination is assessed. Classification procedures using comparisons of the field sample to single ("point") known source samples and to multiple ("fuzzy") artificially weathered source samples of a given type are presented and discussed. A technique of the latter form is recommended as a superior tool for the solution of this problem, and others of similar nature.

In terms of direct and indirect cleanup costs, legal fees, and fines, the economic liability associated with waterway oil spills is such that techniques for establishing responsibility for an unknown-source oil spill are of considerable interest.

The problem encountered by the chemist may be stated as follows: Given a spill sample as received from the field and a group of possible source samples, can the true source be identified (or recognized as missing) from analytical data, and with what confidence can the identification be reported? Two approaches to the solution of this problem have been argued in the literature: "active tagging," where a known chemical and/or physical label is introduced into the oil, and "passive tagging," which is a misnomer for using the natural composition of the oil to identify the source (1, 2). These approaches have been extensively discussed, with the essential conclusion that while "active tagging" is much superior for making absolute identifications, it is unlikely to be implemented soon (3). It is thus important to determine how characteristic the natural composition of oils actually is, bearing in mind that the analysis of the spill sample as received from the field will differ from the analysis of the true source sample for at least three reasons: weathering, contamination, and analytical error.
