Multicategory prediction using arrays of binary pattern classifiers


Wayne L. Felty¹ and Peter C. Jurs

Department of Chemistry, The Pennsylvania State University, University Park, Pa. 16802

Arrays of binary pattern classifiers have been used previously in a branching tree and parallel manner for predictive purposes, and recently the use of Hamming-type binary codes has been suggested. In the present work, prediction of carbon number for a set of C3-C10 organic compounds and a subset of C4-C10 hydrocarbons, using mass spectrometric data, has been shown to be feasible by means of three types of binary codes, two with error-correction capability. Carbon number prediction was also carried out by the parallel and branching tree methods, and the predictive abilities and ease of implementation of the various classification schemes were compared. The highest prediction, 95.4%, was obtained for the hydrocarbon data set using both the branching tree and the Hamming (7,4) binary code.

Previous studies have demonstrated the usefulness of learning machines utilizing binary pattern classifiers for the interpretation of complex chemical data, particularly low-resolution mass spectra (1). In the present work, several methods of combining binary decision makers to yield quantitative prediction are investigated. Binary pattern classifiers, or threshold logic units, have been described in detail by Nilsson (2), and some chemical applications have been discussed by Isenhour and Jurs (1). A multidimensional piece of data, e.g., a mass spectrum, is represented by a pattern vector, $X = (x_1, x_2, \ldots, x_d)$, where the $x_i$'s are, for example, the intensities of selected m/e positions. The decision surface is given by the weight vector, $W = (w_1, w_2, \ldots, w_{d+1})$. The classification process involves taking the dot (inner) product of these two vectors to give a scalar, $X \cdot W = s$. The algebraic sign of the scalar indicates into which of two categories the pattern vector is classified. The weight vector for a particular decision is obtained by means of a training algorithm employing error-correction feedback, iteratively modifying some initial weight vector so as to correctly classify the members of a known set of pattern vectors.

¹ Present address, Department of Chemistry, Mansfield State College, Mansfield, Pa. 16933.
(1) T. L. Isenhour and P. C. Jurs, Anal. Chem., 43(9), 20A (1971) and references cited therein.
(2) N. J. Nilsson, "Learning Machines," McGraw-Hill, New York, N.Y., 1965.
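The classification and training steps described above can be sketched as follows. This is a minimal illustration and not the authors' program. The function names `classify` and `train`, the constant component appended to each pattern (so that it matches the d+1 weights), the zero initial weight vector, and the fixed-increment correction are all assumptions; the text states only that error-correction feedback is used.

```python
def classify(weights, pattern):
    """Return +1 or -1 from the sign of the dot product of W and the pattern.

    Assumption: the d-dimensional pattern is augmented with a constant 1.0
    so that it has the same length (d + 1) as the weight vector.
    """
    augmented = list(pattern) + [1.0]
    s = sum(w * x for w, x in zip(weights, augmented))
    return 1 if s > 0 else -1


def train(patterns, labels, max_passes=100, step=1.0):
    """Error-correction training of one binary pattern classifier.

    Assumption: a fixed-increment correction is applied whenever a training
    pattern is misclassified. Returns the weights and the feedback count.
    """
    d = len(patterns[0])
    weights = [0.0] * (d + 1)   # hypothetical initial weight vector
    feedbacks = 0
    for _ in range(max_passes):
        errors = 0
        for x, y in zip(patterns, labels):
            if classify(weights, x) != y:
                augmented = list(x) + [1.0]
                # Move the decision surface toward the correct side.
                weights = [w + step * y * xi
                           for w, xi in zip(weights, augmented)]
                feedbacks += 1
                errors += 1
        if errors == 0:         # every training pattern is classified correctly
            break
    return weights, feedbacks


# Toy, linearly separable data (illustrative values, not from the paper).
toy_patterns = [[2.0, 1.0], [3.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
toy_labels = [1, 1, -1, -1]
toy_weights, toy_feedbacks = train(toy_patterns, toy_labels)
```

The feedback count returned here corresponds in spirit to a "number of feedbacks" tally, that is, the number of weight corrections made before the training set is classified without error.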

While a number of chemically significant questions can be answered by a binary decision, such as the presence or absence of a certain functional group or element, multicategory decisions, such as the number of certain groups or atoms per molecule, must also be made. One means of achieving multicategory classification is through the use of an array of binary classifiers. Perhaps the most obvious manner of combining binary classifiers is the branching tree method, which has been applied to the determination of empirical formulas (3). More often, a parallel arrangement of classifiers, each with a different cutoff, has been used (4). A third method, utilizing Hamming-type binary codes and offering the interesting possibility of self-detection and -correction of errors, has been recently suggested by Lytle (5) but was not implemented. While the use of binary codes has several inherent advantages, the feasibility of the method depends upon whether the weight vectors for the pertinent binary pattern classifiers can be successfully trained. The present investigation answers this crucial question, using the test problem of carbon number prediction. The three methods of pattern classification, branching tree, parallel, and binary code, are then compared with respect to carbon number prediction as well as general features.
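As a rough illustration of how the outputs of an array of binary classifiers might be combined into a carbon number, the sketch below decodes a parallel array and a Hamming (7,4) coded array. It is written under stated assumptions and is not the authors' implementation: the names `parallel_decode`, `hamming74_encode`, `CODEBOOK`, and `code_decode` are hypothetical, the convention that a positive output means "carbon number greater than the cutoff" is an assumed reading, and the assignment of carbon numbers 3 to 10 to particular code words is invented for the example.

```python
def parallel_decode(outputs, cutoffs):
    """Decode a parallel array: one classifier per cutoff.

    Assumption: output +1 means "carbon number > cutoff", -1 means
    "carbon number <= cutoff". The lowest cutoff with a negative output
    marks the break and is taken as the predicted carbon number.
    """
    for cutoff, out in sorted(zip(cutoffs, outputs)):
        if out < 0:
            return cutoff
    return max(cutoffs) + 1     # no break: above every cutoff


def hamming74_encode(nibble):
    """Standard Hamming (7,4) code word (p1, p2, d1, p3, d2, d3, d4)."""
    d1, d2, d3, d4 = [(nibble >> shift) & 1 for shift in (3, 2, 1, 0)]
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return (p1, p2, d1, p3, d2, d3, d4)


# Hypothetical assignment: carbon number c uses the code word for c - 3.
CODEBOOK = {c: hamming74_encode(c - 3) for c in range(3, 11)}


def hamming_distance(a, b):
    """Number of positions at which two bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def code_decode(bits):
    """Return the carbon number whose code word is nearest to `bits`.

    Each of the seven bits would come from one binary classifier; because
    the code words differ in at least three positions, a single wrong bit
    is still decoded to the intended category.
    """
    return min(CODEBOOK, key=lambda c: hamming_distance(CODEBOOK[c], bits))


def flip(bits, i):
    """Return a copy of `bits` with position i inverted (simulated error)."""
    return tuple(b ^ 1 if k == i else b for k, b in enumerate(bits))
```

For example, `code_decode(flip(CODEBOOK[7], 2))` still returns 7, whereas a single wrong output in the parallel array shifts or obscures the break. This is the self-correction property that makes the binary code approach attractive, provided each bit's classifier can actually be trained.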

(3) P. C. Jurs, B. R. Kowalski, and T. L. Isenhour, Anal. Chem., 41, 21 (1969).
(4) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, ibid., 42, 1387 (1970).
(5) F. E. Lytle, ibid., 44, 1867 (1972).

DATA AND COMPUTATIONS

Data Set. Low resolution mass spectra were obtained from a collection purchased on magnetic tape from the Mass Spectrometry Data Centre, Atomic Weapons Research Establishment, United Kingdom Atomic Energy Commission. Six hundred spectra, pertaining to compounds of molecular formula $\mathrm{C_{3-10}H_{2-22}O_{0-4}N_{0-2}}$, were taken from that portion of the tape comprised of American Petroleum Institute Research Project 44 spectra. The spectra were randomly divided into a training set of 200 and a prediction set of 400. A second data set, consisting of the hydrocarbon subset of the 600 spectra, was also used. Five C3 hydrocarbons were deleted because of the low population of this category. The remaining 372 C4-C10 hydrocarbons were randomly divided into a training set of 200 and a prediction set of 172. The same training and

ANALYTICAL CHEMISTRY, VOL. 45, NO. 6, MAY 1973

885

Table I. Binary Pattern Classifiers for Parallel Array

                 Training set                     Prediction set
         Negative  Positive  Number of    Negative  Positive  Per cent
Cutoff   category  category  feedbacks    category  category  prediction

Hydrocarbons
9        161       39        93           141       31        97.7
8        123       77        61           102       70        98.3
7        81        119       83           67        105       98.3
6        47        153       51           41        131       99.4
5        18        182       24           24        148       100.0
4        7         193       3            6         166       99.4

Entire data set
9        179       21        91           335       65        92.8
8        151       49        180          272       128       97.0
7        117       83        115          217       183       97.3
6        85        115       124          171       229       95.5
5        59        141       149          101       299       95.5
4        29        171       101          52        348       95.0
3        8         192       71           21        379       98.3

Figure 1. Parallel arrangement of binary pattern classifiers
[Figure labels: input adjusted spectrum; break determines classification]

prediction sets were used throughout the investigation so that the results of the various classification schemes could be compared. One hundred thirty-two m/e positions were of significance, considering the entire set of spectra, and therefore each spectrum was represented as a 132-dimensional pattern. The original peak intensities, normalized with respect to the highest peak in each spectrum, fell within the range 0.01 to 99.99. The intensities were renormalized based on the total ion current, or sum of intensities for each spectrum, in order to place all spectra on the same intensity scale. Subsequent logarithmic transformation gave intensities of 10 to 59. In the hydrocarbon set 3.5% of the spectra exhibited no parent peak (intensity