Comparison of Various K-Nearest Neighbor Voting Schemes with the Self-Training Interpretive and Retrieval System for Identifying Molecular Substructures from Mass Spectral Data

S. R. Lowry¹ and T. L. Isenhour* Department of Chemistry, University of North Carolina, Chapel Hill, North Carolina 27514

J. B. Justice, Jr. Department of Chemistry, Emory University, Atlanta, Georgia 30322

F. W. McLafferty, H. E. Dayringer,² and Rengachari Venkataraghavan Department of Chemistry, Cornell University, Ithaca, New York 14853

Computer pattern recognition using the K-Nearest Neighbor technique has been applied to 500 unknown test spectra for the prediction of the presence of 20 substructures. The results from this system were compared to those found previously for the Self-Training Interpretive and Retrieval System (STIRS) using the same substructure assignments and data base. The generally superior performance of STIRS appears to be due largely to prior data selection for STIRS based on mass spectral knowledge.

¹ Present address, T. R. Evans Research Center, Diamond Shamrock Corporation, Painesville, Ohio 44077.
² Present address, Monsanto Agricultural Research, Mill Zone V18, 800 North Lindbergh, St. Louis, Mo. 63166.

ANALYTICAL CHEMISTRY, VOL. 49, NO. 12, OCTOBER 1977

The popularity of combined gas chromatography-mass spectrometry has greatly increased the need for computer-aided interpretation of mass spectral data (1-14). A variety of techniques, including pattern recognition and artificial intelligence methods, have been proposed, but little has been done to assess their relative merits on a quantitative basis. One of these methods, the “Self-Training Interpretive and Retrieval System” (STIRS) (10), has been extensively tested, both with model compounds and real unknowns. STIRS has been available to outside users since January 1974 (11-14), and its predictive capabilities for 179 common substructures of organic compounds have been tabulated (11). As a recent study (5) showed the K-Nearest Neighbor (KNN) algorithm to be superior to five other common pattern recognition methods in its ability to extract information from mass spectra, it appeared of interest to make a quantitative comparison of the capabilities of KNN and STIRS for substructure identification, as in both methods identification is based on the substructures present in compounds whose spectra are found to be the best matches in the reference file.

For STIRS, a number of classes of mass spectral data have been selected for their high structural significance; these include characteristic ions, series of ions, and masses of neutrals lost. The computer then matches the data of the unknown spectrum in each class against the corresponding data of all reference spectra, retaining the fifteen best matches in each data class. These compounds are then examined by the computer for the presence of each of the 179 substructures, evaluating the probability of substructure presence using a random-drawing model (11). In general the best results are found using an “overall match factor” (MF11) which is a weighted combination of matchings found for individual data classes (10-12).

In the K-Nearest Neighbor classification scheme (4, 5, 15), each spectrum is considered to be a point in a multidimensional space. Each axis of the space corresponds to a possible mass position in the spectrum, and the abundance of each mass position in a particular spectrum defines the distance along a specific axis. The distance between an unknown spectrum, U, and a spectrum in the reference file, X, is calculated by Equation 1,

$$D = \left[ \sum_{i=1}^{n} (u_i - x_i)^2 \right]^{1/2} \tag{1}$$

where $u_i$ is the abundance of the ith m/e position in the spectrum U, and $x_i$ is the abundance of the ith m/e position in the spectrum X. The primary assumption of the K-Nearest Neighbor classification is that members of the reference file with the smallest values of D for a particular unknown spectrum correspond to compounds whose structures are most similar to that of the unknown. Thus any molecular substructures contained in the compounds closest to the unknown should also frequently appear in the unknown compound, and the more of the nearest neighbors that contain the substructure the greater the confidence in assigning it to the unknown. Note the basic similarity between this and the mode of STIRS selection in each data class (11).
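The distance calculation and neighbor retrieval described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: spectra are represented as equal-length lists of abundances (one entry per mass position), the function names `distance` and `nearest_neighbors` are hypothetical, and the four-point toy spectra are invented for the example rather than taken from the study.

```python
import math

def distance(u, x):
    """Equation 1: Euclidean distance between two spectra.

    u and x are equal-length sequences of abundances, one per mass position.
    """
    return math.sqrt(sum((ui - xi) ** 2 for ui, xi in zip(u, x)))

def nearest_neighbors(unknown, reference, k):
    """Return the indices of the k reference spectra closest to `unknown`,
    nearest first."""
    order = sorted(range(len(reference)),
                   key=lambda j: distance(unknown, reference[j]))
    return order[:k]

# Invented toy data (hypothetical, four mass positions per spectrum).
reference = [
    [100, 0, 20, 5],
    [90, 10, 25, 0],
    [0, 100, 0, 60],
    [5, 95, 10, 55],
]
unknown = [98, 2, 20, 4]

neighbors = nearest_neighbors(unknown, reference, k=3)
print(neighbors)
```

The substructures of the compounds at the returned indices would then be inspected to decide which substructures to assign to the unknown.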

EXPERIMENTAL The capabilities of several K-Nearest Neighbor classification schemes were evaluated for the assignment of 20 substructures using a set of 500 “unknown” spectra selected randomly from a file representing 18806 different compounds (16); the substructure assignments and data base were identical to those used in the STIRS study (11). Because the selected “unknowns” were eliminated from the reference file used in examining them, and because an entirely different set of randomly selected unknowns gave STIRS results that were the same within experimental error (11), the results should be applicable in general to unknowns not in the data base. A variety of voting schemes were tested for the identification of substructures in the unknown according to the substructures present in the K-Nearest Neighbors. If certain statistical distributions are assumed for the data, an optimum value of K can be calculated (4); however, for mass spectral data these distributions do not hold, and a simple majority vote of the three to nine nearest neighbors is normally used for decisions (5). The first classification scheme used in this study identifies a substructure as being present in the unknown if it is found in the nearest neighbor. The second classifier requires instead that the substructure appear in compounds corresponding to the two nearest spectra; the other classifiers used require that the substructure appear in the three nearest, two of the three nearest, three of the four nearest, three of the five nearest, and four of the fifteen nearest spectra.

Statistical Evaluation. The results of the study (Table I) were measured by three terms: recall (RC), false positives (FP), and the percent correct (%C), whose applicability and utility are justified separately (17). These are defined in Equations 2-5,

$$RC = I_c/P_c \tag{2}$$

$$FP = I_f/P_f \tag{3}$$

$$\%C = 100\, I_c/(I_c + I_f) \tag{4}$$

$$\%C = \frac{100\, P_c \cdot RC}{P_c \cdot RC + P_f \cdot FP} \tag{5}$$

where $I_c$ equals the number of correct identifications, $I_f$ equals the number of false identifications, $P_c$ equals the possible correct identifications, and $P_f$ equals the possible false identifications ($P_c + P_f = 500$). [The performance “reliability” reported previously (11-13), actually designed as a measure of the system's utility, is equal to RC/(RC + FP), and is equivalent to %C when $P_c = P_f$.]

An example using the two out of three (2/3) nearest neighbor vote to identify the sulfur substructures may help to clarify the use of these terms. For each of the 500 spectra in the test file, the three nearest spectra in the reference file are determined; if two of these three compounds contain sulfur, the unknown is identified as a sulfur-containing compound. By these criteria, 36 of the 500 compounds were identified as sulfur compounds; only 22 of these compounds actually contain sulfur, while 53 of the 500 spectra in the test file actually are of sulfur compounds. Using these three values, RC = 22/53 = 41%, FP = (36-22)/(500-53) = 3.1%, and %C = 22/36 = 61%. In addition to the %C value employed in previous pattern recognition studies (1, 5), the RC and FP values have also been reported here to measure, respectively, the ability to identify the substructure when it is present, and the ability not to indicate it when it is not present. From Equation 5, the %C result will be dependent on the proportion of the test set compounds containing the particular substructure. For example, if the system can achieve RC = 50% and FP = 1% for the sulfur substructure, a test in which 200 compounds contained the substructure and 300 did not would give %C = 97%; if only 10 of the tested compounds contained sulfur and 490 did not, the same RC and FP performance would give %C = 50%.
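The arithmetic of the sulfur example, and the dependence of %C on the composition of the test set, can be checked with a short Python sketch. The function names are hypothetical and chosen only for this illustration; the counts are the ones quoted in the text, and RC and FP are handled as fractions rather than percentages.

```python
def recall(i_c, p_c):
    """Equation 2: correct identifications / possible correct identifications."""
    return i_c / p_c

def false_positive_rate(i_f, p_f):
    """Equation 3: false identifications / possible false identifications."""
    return i_f / p_f

def percent_correct(i_c, i_f):
    """Equation 4: percentage of all identifications that are correct."""
    return 100 * i_c / (i_c + i_f)

def percent_correct_from_rates(rc, fp, p_c, p_f):
    """Equation 5: %C from RC and FP (as fractions) and the test-set composition."""
    return 100 * p_c * rc / (p_c * rc + p_f * fp)

# Sulfur example with the 2/3 vote, using the counts quoted in the text.
total = 500          # spectra in the test file
identified = 36      # spectra identified as sulfur compounds
i_c = 22             # identified spectra that really contain sulfur
p_c = 53             # test spectra that really contain sulfur
i_f = identified - i_c
p_f = total - p_c

rc = recall(i_c, p_c)
fp = false_positive_rate(i_f, p_f)
pc = percent_correct(i_c, i_f)
print(f"RC = {100 * rc:.1f}%  FP = {100 * fp:.1f}%  %C = {pc:.1f}%")

# Same RC and FP, two different test-set compositions (Equation 5).
print(percent_correct_from_rates(0.50, 0.01, 200, 300))
print(percent_correct_from_rates(0.50, 0.01, 10, 490))
```

Equations 4 and 5 give the same %C for the sulfur counts, since $I_c = P_c \cdot RC$ and $I_f = P_f \cdot FP$; the last two calls reproduce the roughly 97% and 50% figures of the prevalence example.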
In the present study the same data base and substructure definitions were used for both the KNN and STIRS evaluations, so that the values of RC, FP, and %C should provide direct measures of the performance of these systems relative to each other. A further caution should be repeated (10-13) with reference to the absolute values for RC, and thus also for %C; the ability to interpret a particular mass spectrum in order to identify a specific substructure is dependent on the influence of that substructure on the mass spectral fragmentation relative to the influence of the molecule’s other functionalities; although STIRS gives a wide range of recall values for 200 substructures, the values correlate generally with experience from human interpretation of mass spectra (10-12). A variety of substructures were included in the present study to ascertain whether the performance of KNN vs. STIRS is substructure dependent.


RESULTS AND DISCUSSION As expected, stricter requirements for substructure identification, such as a greater proportion of K-Nearest Neighbors containing the substructure, generally reduce RC while improving %C and FP, so that both values of an RC/FP or RC/%C pair must be compared for a relative evaluation of the systems (Table I). Although the K-Nearest Neighbor studies cover a much wider range of values, where these values overlap STIRS generally gives superior results; for KNN voting schemes giving a %C value for a particular substructure comparable to the STIRS value, in every case the corresponding RC value for KNN is lower. Also, the average results for the 20 substructures can be compared for 3/3 KNN vs. 1 2 % STIRS; with comparable FP (2.1% vs. 1.9%) and %C (80 vs. 84) values, the 27% recall found for KNN is substantially below the 42% value found for STIRS. This overall trend indicates that the STIRS selection of parameters based on mass spectral behavior (10) provides improved reliability for substructure identification, similar to the advantages found for feature selection in pattern recognition methods (1).

In looking at the results of different voting schemes, the 2/2 vote generally gives the best trade-off in recall and percent correct. Although the 3/3 vote results in a more reliable identification, the recall is often so low that an insufficient percentage of compounds actually containing the substructures are found. The reverse problem occurs when the 4/15 voting scheme is employed. Here the recall is excellent but the percent correct is too low to allow confident identification of substructures. This indicates that the use of a K value greater than 5 does not appear to be helpful. In fact even when only the 5 nearest spectra are used, the KNN performance decreases noticeably.

It is apparent that application of mass spectral knowledge in the selection of parameters on which the similarity measures are based, as is done in STIRS, is valuable for improving the classifier performance. A judicious selection of the parameters employed in the K-Nearest Neighbor classifier should substantially improve the outcome of substructure identification and should give results closer to those obtained from STIRS.
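The seven voting schemes compared in this study can each be written as a pair (m, K): the substructure is assigned when m of the K nearest neighbors contain it. The Python sketch below illustrates how the strictness of the vote changes the decision for a single unknown. It assumes that "m of K" means at least m, which the text does not state explicitly; the function name `vote` and the neighbor flags are hypothetical and invented for the example.

```python
# (m, K) pairs for the schemes named in the text: 1/1, 2/2, 3/3, 2/3, 3/4, 3/5, 4/15.
SCHEMES = [(1, 1), (2, 2), (3, 3), (2, 3), (3, 4), (3, 5), (4, 15)]

def vote(neighbor_flags, m, k):
    """Assign the substructure if at least m of the k nearest neighbors contain it.

    neighbor_flags holds one boolean per reference compound, ordered nearest
    first. The "at least m" reading is an assumption of this sketch.
    """
    if len(neighbor_flags) < k:
        raise ValueError("need at least k neighbors")
    return sum(neighbor_flags[:k]) >= m

# Invented example: which of the 15 nearest reference compounds contain the
# substructure (nearest neighbor first).
flags = [True, False, True, True, False, False, True] + [False] * 8

decisions = {f"{m}/{k}": vote(flags, m, k) for m, k in SCHEMES}
for scheme, assigned in decisions.items():
    print(scheme, assigned)
```

With these invented flags the strict 2/2 and 3/3 votes reject the substructure while the looser schemes, including 4/15, accept it, which mirrors the recall versus percent-correct trade-off discussed above.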

LITERATURE CITED
(1) T. L. Isenhour, B. R. Kowalski, and P. C. Jurs, Crit. Rev. Anal. Chem., 4, 1 (1974).
(2) G. M. Pesyna and F. W. McLafferty, in “Determination of Organic Structures by Physical Methods”, F. C. Nachod, J. J. Zuckerman, and E. W. Randall, Ed., Academic Press, New York, N.Y., Vol. 6, 1976, pp 91-155.
(3) L. R. Crawford and J. D. Morrison, Anal. Chem., 40, 1469 (1968).
(4) B. R. Kowalski and C. F. Bender, Anal. Chem., 44, 1405 (1972).
(5) J. B. Justice and T. L. Isenhour, Anal. Chem., 46, 223 (1974).
(6) P. Kent and T. Gäumann, Helv. Chim. Acta, 58, 787 (1975).
(7) D. D. Tunnicliff and P. A. Wadsworth, Anal. Chem., 45, 12 (1973).
(8) J. Franzen and H. Hillig, Adv. Mass Spectrom., 6, 991 (1974).
(9) D. H. Smith, B. G. Buchanan, R. S. Engelmore, H. Adlercreutz, and C. Djerassi, J. Am. Chem. Soc., 95, 6078 (1973).
(10) K.-S. Kwok, R. Venkataraghavan, and F. W. McLafferty, J. Am. Chem. Soc., 95, 4185 (1973).
(11) H. E. Dayringer, G. M. Pesyna, R. Venkataraghavan, and F. W. McLafferty, Org. Mass Spectrom., 11, 529 (1976).
(12) H. E. Dayringer and F. W. McLafferty, Org. Mass Spectrom., 11, 543 (1976).
(13) H. E. Dayringer, F. W. McLafferty, and R. Venkataraghavan, Org. Mass Spectrom., 11, 895 (1976).
(14) F. W. McLafferty, H. E. Dayringer, and R. Venkataraghavan, Ind. Res., 18, 78 (1976).
(15) T. M. Cover and P. E. Hart, IEEE Trans. Info. Theory, IT-13, 21 (1967).
(16) E. Stenhagen, S. Abrahamsson, and F. W. McLafferty, “Registry of Mass Spectral Data”, Wiley-Interscience, New York, N.Y., 1974.
(17) F. W. McLafferty, Anal. Chem., 49, 1441 (1977).

RECEIVED for review September 2, 1976. Resubmitted June 8, 1977. Accepted July 15, 1977. The financial support of the National Science Foundation is gratefully acknowledged by the North Carolina group, as is the support of the National Science Foundation and the National Institutes of Health by the Cornell group.