CORRESPONDENCE

Performance Prediction and Evaluation of Systems for Computer Identification of Spectra

Sir: A wide variety of computer systems for the retrieval and interpretation of unknown spectra, especially mass spectra, have been proposed in recent years (1-14). Reliable ways to evaluate these systems are necessary both to develop further improvements and to allow the user to choose the best system for his needs. For tests of such systems it has been common practice to report the proportion of correct answers, with designations such as "reliability" or "% correct". Although the knowledge of this probability is obviously of importance to the user of the system, for the performance evaluation of document retrieval systems it has long been recognized that the "recall", the proportion of possible relevant items retrieved, must also be measured (15); "recall" has been utilized in evaluating particular systems for both the retrieval (5, 13) and the interpretation (8, 9, 16) of mass spectra. Further, it has been pointed out that the observed reliability value is dependent on the proportion of the matching structure in both the tested spectra and the reference file; a modified performance reliability term was proposed which combined the recall value and the proportion of false positive identifications, as neither of these is dependent on the structural composition of the test or reference set (9). Recently Wilkins and co-workers (11), amplifying theoretical conclusions of Rotter and Varmuza (10), have presented a comprehensive approach to the evaluation of pattern recognition systems. It is the purpose of this correspondence to propose a modified approach applicable to both retrieval and interpretation systems, and to recommend criteria for (1) comparing and improving such systems, and (2) evaluating the reliability of a system's predictions for a particular unknown spectrum.
Table I. Possible Results for a Tested Unknown

    Actual              Predicted   Result
    Structure present   Present     Correct positive
    Structure absent    Present     False positive
    Structure present   Absent(a)   False negative
    Structure absent    Absent(a)   Correct negative

    (a) These predictions are not made for most retrieval systems or for mass spectral interpretive systems such as STIRS.

The discussion will be limited to the evaluation of systems for binary classification: the answer of a retrieval system to the question "Is this particular compound (that giving the reference spectrum) present in the unknown?", and of an interpretive system to "Is this substructure present in the unknown?". As shown in Table I, four possible results can be found for each "unknown" tested; all four are tabulated and combined into a "figure of merit" in the evaluation scheme of Wilkins and co-workers (11, 12). However, for retrieval systems that match unknown spectra against a large data base, the user seldom wishes to know which reference spectra do not match that of the unknown; a comprehensive list would be so long that it would be meaningless. Similarly, only predictions of "present" are usually justified for mass spectral interpretive systems, in our opinion (17); this was a significant premise in the design of our "Self-Training Interpretive and Retrieval System (STIRS)" (16). Functional group information varies widely in its "visibility" in the mass spectrum, in sharp contrast to the behavior found in many other types of spectra. For example, in their infrared spectra both acetone and dimethylaminoacetone show a strong carbonyl stretching frequency characteristic of the keto group. In the mass spectrum of acetone, the acetyl group produces the base (100%) peak at m/e 43, but in the mass spectrum of dimethylaminoacetone this peak is reduced to 5% by the strong fragmentation-directing influence of the amino group. Thus a main reason for selecting a mass spectral interpretive system should be its ability to determine "correct positives", which thus must be determined separately from its ability for "correct negatives". (Prediction of the absence of structures could also be helpful in specific cases, such as functionalities which strongly direct mass spectral fragmentations, and such predictions should also be investigated.) Considering only cases for which the system has predicted the presence of a compound or substructure (first two lines, Table I), its capability will depend both on the probability that it will predict "present" when this is correct (recall, RC), and that it will not predict "present" when the structure is absent (false positive, FP). These are related as shown in Equations 1 and 2,

RC = I_c/P_c    (1)

FP = I_f/P_f    (2)

where I_c and I_f are the number of correct and false identifications, respectively, and P_c and P_f are the total possible correct and false identifications, respectively (P_c + P_f = N, the total number of unknowns tested). (Note that RC and FP are, in effect, determined by testing with separate data bases which do, and do not, contain the structures sought.) The values of RC and FP obtained will depend on the strictness of the matching criteria used, for example, the minimum acceptable value of the "Similarity Index" (3), "Confidence Value" (5), or proportion of matching "K-Nearest Neighbors" (18); stricter criteria will improve (lower) the false positives, but degrade the recall. Evaluation should cover a range of RC/FP pairs to allow the user to choose the optimum matching criteria for his particular problem (5, 13, 15, 18). [For system performance measurement it would obviously be useful to combine RC and FP into a single value which is relatively independent of the matching criteria chosen; we proposed (9) the relationship RC/(RC + FP), which in practice has caused confusion, and Dromey (19) has recently suggested an alternative formulation.] System performance values of RC and FP should be determined on a statistically large sample of "unknowns" randomly selected from the universe of structures of interest; evaluation should not be carried out with monofunctional compounds if applicability to general types of organic molecules is sought, nor carried out with pure compounds if applicability to mixtures is desired (5). Comparison between systems is the most reliable if the same unknowns and reference spectra are used in each evaluation (18).

ANALYTICAL CHEMISTRY, VOL. 49, NO. 9, AUGUST 1977
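The definitions in Equations 1 and 2 amount to simple counting over a labeled test set. As a minimal illustration (the trial data below are invented for the example, not taken from any system discussed here), RC and FP can be tallied from (predicted, actual) pairs:

```python
def rc_fp(trials):
    """Compute recall (RC) and false-positive rate (FP), Equations 1 and 2.

    trials: iterable of (predicted_present, actually_present) boolean pairs,
    one per tested unknown.
    """
    pc = sum(1 for _, actual in trials if actual)               # P_c: possible correct
    pf = sum(1 for _, actual in trials if not actual)           # P_f: possible false
    ic = sum(1 for pred, actual in trials if pred and actual)   # I_c: correct "present"
    i_f = sum(1 for pred, actual in trials if pred and not actual)  # I_f: false "present"
    return ic / pc, i_f / pf

# Hypothetical test set: 10 unknowns contain the structure (8 found -> RC = 0.8)
# and 10 do not (1 wrongly flagged -> FP = 0.1).
trials = ([(True, True)] * 8 + [(False, True)] * 2
          + [(True, False)] * 1 + [(False, False)] * 9)
rc, fp = rc_fp(trials)
print(rc, fp)  # 0.8 0.1
```

Note that RC uses only the structure-present trials and FP only the structure-absent trials, matching the parenthetical remark above about testing with separate data bases.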

To reiterate, the user's choice of retrieval and interpretive systems should be based on the RC and FP values the system achieves with unknown spectra representing the universe of unknowns with which the user will deal (4). However, when the chosen system is applied to a particular unknown spectrum, the user wants to know, instead, the probability that a prediction of "present" is correct; we will call this the reliability (RL) of the prediction (Equation 3).

RL = I_c/(I_c + I_f)    (3)

RL = P_c·RC/(P_c·RC + P_f·FP)    (4)
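Equation 4 can be transcribed directly as a function; this is only a restatement of the formula, with argument names following the symbols in the text:

```python
def reliability(pc, pf, rc, fp):
    """RL = P_c*RC / (P_c*RC + P_f*FP), Equation 4."""
    return pc * rc / (pc * rc + pf * fp)

# Limiting case discussed below: no correct answer in the file (P_c = 0) -> RL = 0.
print(reliability(0, 100, 0.55, 0.01))  # 0.0

# Enriching the file with correct spectra (larger P_c) raises RL, other factors equal.
print(reliability(1, 99, 0.5, 0.01) < reliability(10, 90, 0.5, 0.01))  # True
```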

Substitution of the above definitions of RC and FP shows (Equation 4) that the reliability of a prediction depends, in addition, on the occurrence probability of the unknown structure in the reference file, as P_f = N − P_c. Obviously, if there is no chance that the unknown structure is among those represented in the reference file (P_c = 0), RL must be zero. On the other hand, the reliability is increased if the proportion of spectra of the unknown compound in the reference file is increased; this has been done by including in the file multiple spectra of the more commonly occurring compounds (5), or by limiting the reference file to compounds similar to the unknown (4). Consider an application of the "Probability Based Matching (PBM)" system (5), assuming that confidence ("K") values of 50 and 100 correspond, respectively, to RC values of 55% and 15% and FP values of 1/45,000 and 1/1,000,000. An unknown mass spectrum is matched by PBM against a reference file of 25,000 spectra, all of different compounds, but containing a spectrum of the unknown compound (or spectra of the compounds if the unknown is a mixture), and two spectra are retrieved with K values of 50 and 100. For the compound of K = 50, RL = 1·55%/(1·55% + 25,000/45,000) = 50%, which can be rationalized as follows. When a spectrum in the file corresponding to a correct answer is matched against the unknown, the recall performance indicates a 55% probability that this correct spectrum will be retrieved with K = 50; because a false positive at the K = 50 level should result once in every 45,000 attempted matches, there is also a 55% probability that the answer found by matching 25,000 spectra is wrong. Thus the 50% reliability results from the fact that the probabilities of right and wrong answers are equal.
For the compound of K = 100, RC = 15% signifies that in only 15% of attempted matches with a spectrum of the same compound will the two spectra match this well; it does not mean that there is only a 15% chance of this being the correct answer. On the contrary, since in only 2.5% (25,000/10^6) of matches giving K = 100 will this be a false answer, the probability that this answer is correct is 86% [RL = 1·15%/(1·15% + 0.025)]. Note that the RL value of 86% does not correspond to (1 − FP); from Equation 4, RL approaches 1 − FP when half of the reference file are spectra of correct answers (P_c = P_f) and RC and RL approach 100% [because 1/(1 + FP) ≈ (1 − FP)]. If the system is used to report only one answer for each unknown, both P_c and P_f will equal the number of unknowns tested, which is also equal to I_c + I_f, and thus RC = RL = (1 − FP). Plots of RC vs. RL (5, 15), as well as of RC vs. FP (vide supra), are useful for system evaluations if the composition of test data is sufficiently similar to that of the unknowns. Although this definition of reliability (Equations 3 and 4) is the same as that used for PBM (5), it is not identical with that we used for the STIRS evaluations, RC/(RC + FP) (9). The latter was meant, rather, as a measure of system performance, as distinguished from the use here as a measure of the probability that an answer is correct. For future studies we propose to use reliability only as defined by Equations 3 and 4 for interpretive, as well as for retrieval, systems, as described below.
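The two PBM reliabilities worked through above follow from Equation 4; the sketch below simply reproduces that arithmetic (the RC and FP figures are the assumed values stated in the text):

```python
def reliability(pc, pf, rc, fp):
    """RL = P_c*RC / (P_c*RC + P_f*FP), Equation 4."""
    return pc * rc / (pc * rc + pf * fp)

# One correct spectrum (P_c = 1) in a 25,000-spectrum file (P_f ~ 25,000).
rl_50 = reliability(1, 25_000, 0.55, 1 / 45_000)      # K = 50
rl_100 = reliability(1, 25_000, 0.15, 1 / 1_000_000)  # K = 100
print(round(rl_50, 2), round(rl_100, 2))  # 0.5 0.86
```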

For the evaluation of an interpretive system the basic Equations 1-4 are also applicable, but it should be recognized that a substantially poorer "false positive" performance can be tolerated to achieve comparable levels of reliability. Each substructure sought by the interpretive system can be considered as a reference spectrum against which a match will be sought; pattern recognition studies have employed 60 (2), 11 (12), and 20 (18) substructures, while the STIRS evaluation (9) utilized 179 substructures. Thus for the latter an FP value only as low as 1/358 is necessary to give a 50% probability that one of these substructures will be selected as a false answer, compared to the FP value of 1/50,000 necessary in the PBM example above for a reference file of 25,000 spectra; for the pattern recognition study of 11 substructures (12), only FP = 1/22 is required. Further, for interpretive systems RC performance values are usually determined for the individual substructures, in effect matching the unknown against a file of only one reference spectrum; thus for a statistically meaningful evaluation (to make P_f > 0) the test set must contain unknowns which do not contain the particular substructure (tests of retrieval systems use unknowns which are present in the file). The proportion of the unknown spectra which contain the substructure under test will affect the observed reliability, in the same way the composition of the reference file affects retrieval reliability (vide supra). As an example consider a substructure such as chloro for which STIRS (9, 16) gives RC = 50% at FP = 1/100; if, of 100 unknown spectra tested, 66 contain Cl, then (Equation 4) RL = 66·50%/(66·50% + 34·1%) = 99%.
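The tolerable FP figures quoted above all come from the same expected-count argument: with n independent chances at a false match, an FP rate of 1/(2n) gives 0.5 expected false positives, the 50% criterion used in the text. A sketch:

```python
def fp_for_half_expected_false(n_chances):
    """FP rate at which n_chances independent match attempts yield an
    expected 0.5 false positives (the ~50% criterion in the text)."""
    return 1 / (2 * n_chances)

fp_stirs = fp_for_half_expected_false(179)     # 179 STIRS substructures -> 1/358
fp_pbm = fp_for_half_expected_false(25_000)    # 25,000-spectrum file -> 1/50,000
fp_11 = fp_for_half_expected_false(11)         # 11-substructure study -> 1/22
print(fp_stirs, fp_pbm, fp_11)
```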
If only 6 of the 100 spectra contain Cl, RL = 6·50%/(6·50% + 94·1%) = 76%; this drastic reduction in the reliability is not due to a change in the basic capabilities of STIRS, but is only due to the fact that a reduction in the probability that a substructure is present must increase the proportion of identifications which are false positives. If a system such as STIRS is used as an aid to the interpreter, however, the poor reliability inherent in the prediction of "rare" substructures is not necessarily a serious problem. For example, although only 6 of 18,806 reference compounds in the STIRS evaluation (9) were of azulenes, a 50% recall was achieved at the 1% FP level. Thus if STIRS indicates an azulene substructure for an unknown mass spectrum, the interpreter will know that there is a very low probability that this is correct if the occurrence probability for azulene in the universe of structures from which the unknown was drawn is 6/18,806, as then RL = 6·50%/(6·50% + 18,806·1%) = 1.6%. The presence of azulene will only be worth considering if the interpreter can rationalize a much higher occurrence probability for the particular unknown. Note that this P_c/(P_c + P_f) value of 6/18,806 is still larger than values such as 1/25,000 occurring in tests of retrieval systems, again emphasizing the difference in FP performance (at a particular RC level) required (and achievable) for retrieval in comparison to interpretive systems. The average unknown compound in the STIRS tests (9) contained more than five of the 179 substructures; a comparable test of a retrieval system should thus use spectra of mixtures containing five unknown compounds. Similarly, artificially limiting an interpretive system test to the spectra of compounds containing only one functional group (11, 12) will produce a much higher RC/FP performance (8).

Ranking of Predictions.
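The chloro and azulene reliabilities above follow directly from Equation 4; this sketch just repeats that arithmetic:

```python
def reliability(pc, pf, rc, fp):
    """RL = P_c*RC / (P_c*RC + P_f*FP), Equation 4."""
    return pc * rc / (pc * rc + pf * fp)

# Chloro: STIRS gives RC = 50% at FP = 1/100 for this substructure.
rl_66 = reliability(66, 34, 0.50, 0.01)  # 66 of 100 unknowns contain Cl
rl_6 = reliability(6, 94, 0.50, 0.01)    # only 6 of 100 contain Cl
# Azulene: 6 of 18,806 reference compounds; P_f taken as 18,806,
# following the text's approximation.
rl_azulene = reliability(6, 18_806, 0.50, 0.01)
print(round(rl_66, 2), round(rl_6, 2), round(rl_azulene, 3))  # 0.99 0.76 0.016
```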
Of course it is more important that a system perform well on real unknowns (high RL values) than that it score well on tests (optimum RC and FP values). For most retrieval systems, the ratings of predicted compounds are based on the degree to which the unknown spectrum matches the reference spectrum; to the extent that the data of the nonmatching spectra are randomly distributed over the possible mass and abundance values [PBM (5) attempts to do this by weighting these values], this rating should be a measure of the probability that the selected spectrum did not match by the coincidence of its data. Such a rating should thus reflect the "false positives" probability, not the reliability. The predictions could be ranked instead by RL values derived using Equation 4 from FP, RC, and estimates of the occurrence probability of the unknown. Alternatively, as mentioned above, the reference file composition could be chosen to reflect the occurrence probability of unknowns by including several different spectra of the same compound for common compounds. Obviously, a retrieval system which must be used in a variety of problems can do little to correct for variations in occurrence probabilities, and the user must weight the system evaluations himself in this regard. The ranking of substructure predictions in interpretive systems has been based on both FP and RL. Ratings for STIRS substructure identification should reflect FP, as the number of compounds containing the particular substructure in the 15 best matching spectra is weighted for the occurrence probability in the reference file using a random-drawing model (9) (if 15 compounds are drawn from the file at random, there would be a substantial probability that two contain chlorine, but very little chance that two would contain azulene). For a K-Nearest Neighbor identification (6, 18) requiring two of the three "nearest" compounds to contain the substructure, the chance of a "false positive" for chlorine would obviously be much greater than for azulene, but the chance that an identification of chlorine is correct would also be higher (Equation 4, assuming equivalent RC values). In actual use, the rankings of such a system will reflect reliability values only if the occurrence probabilities of the unknown's substructures are similar to those in the file.
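The random-drawing model invoked here can be made concrete with a binomial calculation. In the sketch below the chlorine occurrence probability (0.15) is a hypothetical value chosen only for illustration, not a figure from the STIRS reference file; the azulene value (6/18,806) is the one quoted earlier:

```python
def p_at_least_two(p, n=15):
    """Probability that at least 2 of n compounds drawn at random contain a
    substructure whose occurrence probability in the file is p (binomial)."""
    p_none = (1 - p) ** n                    # no drawn compound contains it
    p_one = n * p * (1 - p) ** (n - 1)       # exactly one does
    return 1 - p_none - p_one

p_cl = p_at_least_two(0.15)        # common substructure (hypothetical p)
p_az = p_at_least_two(6 / 18_806)  # rare substructure (azulene)
print(p_cl > 0.5, p_az < 1e-4)  # True True
```

With these numbers, two or more chlorine-containing compounds among 15 random draws is more likely than not, while two azulenes is essentially impossible, which is the contrast the random-drawing weighting exploits.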
On the other hand, use of a system such as STIRS (9) with rankings based on FP demands that the interpreter rank the predicted substructures further according to his concept of their occurrence probabilities. For a system which provides only a binary ranking of "structure present" or "structure absent" (Table I), the interpreter has little opportunity to modify the results in light of his conception of the occurrence probabilities.

CONCLUSIONS
As a final plea, the potential user of a retrieval or interpretive system deserves to know its basic capabilities, which should be given as a range of RC/FP pairs determined on a statistically valid sample of unknowns similar to those commonly encountered. The user should also be made to understand to what extent the reported results (especially those ranked by FP values) must be further modified to recognize the probability that the prediction is correct. Finally, systems proposed for unknown mass spectra have become so numerous (7) that potential users deserve to have direct comparisons based on closely related reference and unknown data (18), which for interpretive systems should include a separate evaluation of the capability for indicating the presence of substructures (16, 17).

Department of Chemistry
Cornell University
Ithaca, N.Y. 14853

ACKNOWLEDGMENT
The author is grateful to R. Venkataraghavan, R. G. Dromey, T. L. Isenhour, and C. L. Wilkins for helpful discussions.

LITERATURE CITED
(1) L. R. Crawford and J. D. Morrison, Anal. Chem., 40, 1464 (1968).
(2) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 42, 1387 (1970).
(3) H. S. Hertz, R. A. Hites, and K. Biemann, Anal. Chem., 43, 681 (1971).
(4) T. O. Gronneberg, N. A. B. Gray, and G. Eglinton, Anal. Chem., 47, 415 (1975).
(5) G. M. Pesyna, R. Venkataraghavan, H. E. Dayringer, and F. W. McLafferty, Anal. Chem., 48, 1362 (1976).
(6) P. C. Jurs and T. L. Isenhour, "Chemical Applications of Pattern Recognition", Wiley-Interscience, New York, 1975.
(7) G. M. Pesyna and F. W. McLafferty in "Determination of Organic Structures by Physical Methods", Vol. 6, F. C. Nachod, J. J. Zuckerman, and E. W. Randall, Ed., Academic Press, New York, 1976.
(8) P. Kent and T. Gaumann, Helv. Chim. Acta, 58, 787 (1975).
(9) H. E. Dayringer, G. M. Pesyna, R. Venkataraghavan, and F. W. McLafferty, Org. Mass Spectrom., 11, 529 (1976).
(10) H. Rotter and K. Varmuza, Org. Mass Spectrom., 10, 874 (1975).
(11) L. J. Soltzberg, C. L. Wilkins, S. L. Kaberline, T. F. Lam, and T. R. Brunner, J. Am. Chem. Soc., 98, 7139 (1976).
(12) T. F. Lam, C. L. Wilkins, T. R. Brunner, L. J. Soltzberg, and S. L. Kaberline, Anal. Chem., 48, 1768 (1976).
(13) N. A. B. Gray, Anal. Chem., 48, 1420 (1976).
(14) R. C. Fox, Anal. Chem., 48, 717 (1976).
(15) G. Salton, "Automatic Information Organization and Retrieval", McGraw-Hill, New York, 1968.
(16) K.-S. Kwok, R. Venkataraghavan, and F. W. McLafferty, J. Am. Chem. Soc., 95, 4185 (1973).
(17) F. W. McLafferty, "Interpretation of Mass Spectra", Second Edition, W. A. Benjamin, Reading, Mass., 1973, p 98.
(18) T. L. Isenhour, S. R. Lowry, J. B. Justice, Jr., F. W. McLafferty, H. E. Dayringer, and R. Venkataraghavan, Anal. Chem., submitted.
(19) R. G. Dromey, Research School of Chemistry, The Australian National University, Canberra, A.C.T. 2600, private communication, February 1977.

F. W. McLafferty

RECEIVED for review February 4, 1977. Accepted May 2, 1977. The author thanks the Environmental Protection Agency (grant R804509) for generous financial support.
