
Automatic Identification of Chemical Spectra. A Goodness of Fit Measure Derived from Hypothesis Testing

S. L. Grotch

Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91103

The heart of a file searching procedure is the algorithm used to measure "goodness of fit." In previous work, the weights applied to each channel have generally been obtained in an ad hoc manner. A maximum likelihood approach is used here to pre-select these weights so as to maximize the separation between spectra which are the same and those which are different. With this method, the effect of errors on the weights is expressed analytically. Using binary encoded mass spectra, this new weighting procedure invariably produces results superior to other weighting procedures.

Because of the increasingly widespread use of computers in analytical chemistry, the automated identification of spectra continues to be a very active research field: in the past five years alone, over 100 papers have appeared in this area, reviewed most recently in Refs. 1-3. This study addresses the problem of file searching from a somewhat novel viewpoint, first suggested to this author by Herman Chernoff of MIT. This viewpoint derives the "goodness of fit" measure at the heart of any file searching procedure from hypothesis testing. It will be shown how the weighting factors in a goodness of fit metric can be calculated, a priori, using a hypothesis testing concept. Furthermore, as a direct consequence of these procedures, it will be seen how errors in measurement or encoding affect the choice of these weighting factors. Finally, it will be seen, using binary mass spectral patterns, that the weighting factors predicted from this theory yield results which are, in virtually all cases, superior to the best weighting procedure previously used. The extension of this technique to the more general non-binary situation will also be described.

BINARY ENCODED SPECTRA

Because of their simplicity, speed of searching, and the extreme degree of data compression which can be achieved, binary encoded spectra have been investigated by a number of workers for the file searching of chemical spectra (2-15). Most of this work has focused on mass spectra, but infrared, NMR, and emission spectra have all been considered. It should be realized that most techniques in this field are generally applicable to different types of spectra, since the same mathematical principles underlie the identification of binary codes. Thus, although this paper will focus on mass spectra, the same techniques also apply to a wide variety of other types of chemical spectra.

In binary coding, a spectrum is encoded as a string of "0"s and "1"s, where the "1"s typically represent the presence of a line or a band and the "0"s the absence of such a feature. Generally, a large variety of encoding procedures is possible with any type of spectrum. For example, with mass spectra, the mass positions of the M most significant peaks could be denoted by 1's in a binary string in which each bit represents a nominal mass. The mass position of the single most intense (or two most intense) peaks per 7 amu (16) or 14 amu (9) could similarly be encoded. The spectrum could also be encoded by thresholding; e.g., only the masses of peaks with intensities above a transition level could be encoded as 1's (12). Clearly, many similar procedures could be employed with other forms of spectra.

In the actual file searching procedure, an unknown code is identified by comparison against a larger known file which has been previously encoded using the same coding algorithm as the unknown. When an unknown code is compared against a specific library member, a "goodness of fit" is calculated for each comparison. At the conclusion of the search, those K spectra in the file "closest" to the unknown code are displayed, and the nearest neighbor(s) is (are) assumed to be the correct answer. Typically, this goodness of fit, or distance, can be considered as the summation, over the N channels of interest, of values taken from a "truth table" (see Table I). The values of the unknown code and the library code are compared channel-by-channel. Depending upon the simultaneous appearance of a "00", "01", "10", or "11" in channel n, either α_n(00), α_n(01), α_n(10), or α_n(11) is added, to yield a distance, or goodness of fit, d:

d = Σ (n = 1 to N) α_n(ij)     (1)

where α_n(ij) = value in the truth table for the nth channel

Table I. Truth Table for Binary Codes

                      library value j = 0    library value j = 1
unknown value i = 0        α_n(00)                α_n(01)
unknown value i = 1        α_n(10)                α_n(11)
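As an illustration only (the helper names and the toy spectrum are mine, not the paper's), two of the binary encoding schemes described above, keeping the M most intense peaks and thresholding on intensity, can be sketched as:

```python
# Hypothetical sketch of binary encoding; spectra are dicts of
# {nominal mass: relative intensity}, codes are lists of 0/1 per channel.

def encode_top_m(spectrum, n_channels, m):
    """Set a 1 at the nominal masses of the M most intense peaks."""
    top = sorted(spectrum, key=spectrum.get, reverse=True)[:m]
    return [1 if mass in top else 0 for mass in range(n_channels)]

def encode_threshold(spectrum, n_channels, level):
    """Set a 1 wherever the peak intensity exceeds a transition level."""
    return [1 if spectrum.get(mass, 0) > level else 0
            for mass in range(n_channels)]

spec = {2: 10, 4: 80, 5: 100, 7: 35}          # toy spectrum
print(encode_top_m(spec, 8, 2))                # [0, 0, 0, 0, 1, 1, 0, 0]
print(encode_threshold(spec, 8, 30))           # [0, 0, 0, 0, 1, 1, 0, 1]
```

Both encoders produce fixed-length bit strings, so any library built with one scheme can be searched channel-by-channel against an unknown encoded the same way.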

ANALYTICAL CHEMISTRY, VOL. 47, NO. 8, JULY 1975, p. 1285

when the unknown value is i and the library value is j (i, j = 0, 1 for binary vectors). Note that, in general, the α_n(ij) are a function of the channel number n. In the more general multilevel case, where i and j can assume up to L levels, L² entries in the truth table are possible for each channel. (Because of storage limitations, if L is large, the α's would probably be determined by an analytical expression.) The remainder of this paper is concerned with the choice of the α's in the binary case so as to achieve better identifications in the presence of error.

In previous work (5, 11), two choices of the weighting factors α were usually employed: the logical XOR, in which α(00) = α(11) = 0 and α(01) = α(10) = 1; and the logical AND, in which α(00) = α(01) = α(10) = 0 and α(11) = 1. In all cases, the α(ij) were assumed to be independent of channel number. With the logical XOR, one is measuring the total number of channels disagreeing in any match, and with the logical AND, the number of channels in which both codes have a "1". (For consistency, if the distance measure d is to be minimized for best fits, the negative of AND should be used, e.g., α(11) = -1 rather than +1.) Previous work had demonstrated (11) that using a linear combination of XOR and AND, d = XOR - p(AND), much better identifications could be achieved with binary encoding than using either XOR or AND separately. In this case, the truth table is: α(00) = 0, α(01) = α(10) = 1, and α(11) = -p. It was observed empirically, for a number of test mass spectral unknowns, that p ≈ 2 appeared to give the most reliable results. It was also observed (6) that normalizing d by dividing by the maximum possible d for the given match also significantly improved the search results.
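A minimal sketch (not code from the paper; the codes and variable names are invented) of Equation 1 as a sum of truth-table entries over channels, using channel-independent tables for XOR and for the combined d = XOR - p(AND) weighting discussed above. Lower d indicates a better fit.

```python
# Channel-independent truth-table distance, per Equation 1.

def distance(u, v, alpha):
    """d = sum over channels of alpha[(i, j)] for the observed pair (i, j)."""
    return sum(alpha[(i, j)] for i, j in zip(u, v))

# XOR weighting: counts disagreeing channels.
xor_w = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def combined_w(p):
    """Truth table for the combined measure d = XOR - p*(AND)."""
    return {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): -p}

u    = [1, 0, 1, 1, 0]   # toy unknown code
same = [1, 0, 1, 1, 1]   # same spectrum, one spurious extra peak
diff = [0, 1, 0, 1, 1]   # a different spectrum sharing one peak

print(distance(u, same, xor_w), distance(u, diff, xor_w))                  # 1 4
print(distance(u, same, combined_w(2)), distance(u, diff, combined_w(2)))  # -5 2
```

The AND term rewards shared "1"s, so a genuinely matching pair is pulled well below an unrelated one even when a spurious peak is present.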
With respect to the truth table, it should be noted that a constant may be added to or subtracted from all entries in the table for any channel without changing the relative positions of the matching library spectra. In effect, adding a constant applies a different offset to the distance d, but the relative positions of all library codes remain unchanged. For convenience, this constant is chosen here as α(00), so that the contribution of a "00" match to d is zero for all channels.

Several questions arose from the earlier work using binary coding: 1) Why did a weighting value of p = 2 often markedly improve the search results? 2) How might the weighting factors α be chosen even more judiciously, so as to further improve the search results? 3) How do errors in measurement influence the weighting factors to be used? Answers to all of these questions result from the hypothesis testing approach described below.

HYPOTHESIS TESTING

The subject of hypothesis testing is one which has occupied statisticians for many years, and a number of texts in this area are available (17, 18). In simplest terms, hypothesis testing as used here is merely a statistical method by which one of several alternative hypotheses is shown to be most likely. Consider the file searching problem from the following viewpoint. A library spectrum measured in laboratory X is encoded and inserted in the library. When worker Y measures the same spectrum, it will generally differ from X's result because of instrumental differences, impurities, etc. The resultant spectral code will usually reflect these differences. When this unknown spectral code is compared against any library member, we are asking which of two alternative hypotheses is correct:

I. The two spectral codes being compared are, in fact, observations of the same true spectrum, but they may differ because of errors in either or both.


II. The two spectral codes being compared are, in fact, observations of two different, independent spectra, which differ both because the underlying true spectra are different and because errors occur in either or both.

Assume that in matching, the distance d measuring "goodness of fit" could be calculated so that hypothesis I is, in some manner, "best" differentiated from hypothesis II. We would say, intuitively, that such a distance measure would more clearly distinguish spectra which are the same from those which are different. This basic reasoning underlies the methodology now to be discussed. Because of their simplicity, binary codes will first be considered. Later, it will be shown how these results can be generalized to the more complex multilevel situation. In a recent paper, Franzen (19) has also used a maximum likelihood approach, but only in the context of a two-class pattern separation problem.

HYPOTHESIS TESTING WITH BINARY CODES

First, consider the comparison of two binary codes in any channel. Assume that for the library of codes used, p_n is the probability that a "1" occurs in channel n. Assume further, for the codes being considered, that ε0 is the probability that a true "0" is observed as a "1" and ε1 is the probability that a true "1" is observed as a "0". When the values in a single channel of two binary codes are compared, one of four possible outcomes can result: 00, 01, 10, 11. Assuming that hypothesis I is correct, it is possible to calculate the probabilities of occurrence of each of these possibilities:

P_I(00) = (1 - p)(1 - ε0)² + p ε1²               (2)
P_I(01) = (1 - p) ε0 (1 - ε0) + p ε1 (1 - ε1)    (3)
P_I(10) = (1 - p) ε0 (1 - ε0) + p ε1 (1 - ε1)    (4)
P_I(11) = (1 - p) ε0² + p (1 - ε1)²              (5)

In the notation adopted here, P_I(ij) is the probability of occurrence of a value i in one code and a value j in the other in channel n, assuming hypothesis I is correct. (For simplicity, the channel subscript n is omitted, but it must be understood that the P's are, in general, also a function of channel number.) These equations arise in the following manner. Assume that when two codes are compared, the observed pair of values in channel n is 00. If there is the possibility of error, the true pair of values in this channel could have been either 11 or 00. (Note that since under hypothesis I the true spectra are the same, the true pairs of values 01 or 10 cannot occur.) The probability that a true 00 is observed as a 00 is (1 - p)(1 - ε0)², and the probability that a true 11 is reported as a 00 is p ε1². These two terms sum to yield Equation 2 for P_I(00). The remaining equations follow directly from analogous arguments.

These equations can be examined in the limiting case for which there is no error (ε0 = ε1 = 0). In this case, P_I(00) = 1 - p; P_I(01) = P_I(10) = 0; P_I(11) = p. This is, of course, what one would expect, since for no error under hypothesis I the situation ("01") or ("10") cannot occur, and the (11) or (00) would simply arise in proportion to the probability of a "1".

Consider now the case in which hypothesis II is correct; namely, that the two underlying codes being compared are, in fact, independent. Under this hypothesis, a different set of P's is calculated:

P_II(00) = [(1 - p)(1 - ε0) + p ε1]²
P_II(01) = P_II(10) = [(1 - p)(1 - ε0) + p ε1][p(1 - ε1) + (1 - p) ε0]
P_II(11) = [p(1 - ε1) + (1 - p) ε0]²
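A self-contained sketch (my own illustration, with symbols p, ε0, ε1 written as p, e0, e1) of the channel probabilities under the two hypotheses, as described in the text above. Under either hypothesis the four entries must sum to 1, which the example checks.

```python
def hyp1(p, e0, e1):
    """P_I(ij): same true spectrum, differing only through error."""
    return {
        (0, 0): (1 - p) * (1 - e0) ** 2 + p * e1 ** 2,
        (0, 1): (1 - p) * e0 * (1 - e0) + p * e1 * (1 - e1),
        (1, 0): (1 - p) * e0 * (1 - e0) + p * e1 * (1 - e1),
        (1, 1): (1 - p) * e0 ** 2 + p * (1 - e1) ** 2,
    }

def hyp2(p, e0, e1):
    """P_II(ij): independent spectra; per-code marginals multiply."""
    q1 = p * (1 - e1) + (1 - p) * e0        # probability a "1" is observed
    q0 = (1 - p) * (1 - e0) + p * e1        # probability a "0" is observed
    return {(0, 0): q0 * q0, (0, 1): q0 * q1,
            (1, 0): q1 * q0, (1, 1): q1 * q1}

P1 = hyp1(0.1, 0.02, 0.05)
P2 = hyp2(0.1, 0.02, 0.05)
print(abs(sum(P1.values()) - 1.0) < 1e-12)   # True
print(abs(sum(P2.values()) - 1.0) < 1e-12)   # True
```

Note that in the error-free limit hyp1 gives zero probability to a 01 outcome, while hyp2 does not, exactly as argued in the text.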


In each equation, the term within the brackets represents the probability that either a "0" or a "1" will be observed in one spectral code. Since under hypothesis II independence is assumed, the appropriate bracketed terms are multiplied to yield the final joint probabilities. Note that in this case, even when ε0 = ε1 = 0, the (01) or (10) condition can arise. Due to symmetry, under either hypothesis, P(01) = P(10).

In hypothesis testing, the likelihood ratio Λ assumes great importance. For each ij combination in each channel n, Λ_n is defined as:

Λ_n(ij) = P_II(ij) / P_I(ij)

If Λ_n(ij) were large (Λ_n >> 1), one would say that hypothesis II is more likely to be correct than hypothesis I, e.g., the spectra are, in fact, likely to be different. Conversely, if Λ_n were small (Λ_n << 1), hypothesis I would be the more likely. If the assumption is made of channel independence, the overall likelihood ratio is simply the product of the individual channel likelihood ratios (overall Λ = Λ1·Λ2 . . . ΛN). Since the logarithm of Λ varies monotonically with Λ, and since log(Λ1·Λ2 . . . ΛN) = log Λ1 + log Λ2 + . . . + log ΛN, it is simpler to use the logarithm of the likelihood ratio rather than Λ directly. If we assume that in any channel the weighting factors α(ij) are proportional to log(Λ(ij)), we have established a measure of distance which will reflect the likelihood ratio. Here we use directly the values of log Λ(ij); e.g., in any channel, α(ij) = log(Λ(ij)). In effect, we will weight the resulting occurrences in a match by their effect on the overall likelihood ratio. Those occurrences for which hypothesis II is most likely to be correct will be weighted most strongly, and the corresponding matches will consequently have large distances, d. For the binary case, the weights α are, therefore, a known function of ε0, ε1, and p for each channel. For example:

α(00) = log(Λ(00)) = log { [(1 - p)(1 - ε0) + p ε1]² / [(1 - p)(1 - ε0)² + p ε1²] }
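A sketch (names mine) of pre-computing the per-channel weights α(ij) = log Λ(ij) from p, ε0, and ε1, with the hypothesis-I and hypothesis-II probabilities re-derived inline so the snippet stands alone:

```python
import math

def alpha_weights(p, e0, e1):
    """Per-channel truth-table weights alpha(ij) = log(P_II(ij) / P_I(ij))."""
    q1 = p * (1 - e1) + (1 - p) * e0        # probability a "1" is observed
    q0 = 1.0 - q1                           # probability a "0" is observed
    PI = {
        (0, 0): (1 - p) * (1 - e0) ** 2 + p * e1 ** 2,
        (0, 1): (1 - p) * e0 * (1 - e0) + p * e1 * (1 - e1),
        (1, 1): (1 - p) * e0 ** 2 + p * (1 - e1) ** 2,
    }
    PI[(1, 0)] = PI[(0, 1)]                 # symmetry: P(01) = P(10)
    PII = {(i, j): (q0 if i == 0 else q1) * (q0 if j == 0 else q1)
           for i in (0, 1) for j in (0, 1)}
    return {ij: math.log(PII[ij] / PI[ij]) for ij in PII}

a = alpha_weights(p=0.05, e0=0.01, e1=0.05)
# A disagreement (01/10) favors "different" (positive weight); agreement
# on a relatively rare "1" strongly favors "same" (negative weight).
print(a[(0, 1)] > 0, a[(1, 1)] < 0)   # True True
```

As the text notes, this table need only be computed once per channel for a given library and encoding procedure; the search itself then reduces to summing table entries.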


While the analytical expressions relating these parameters to the weights are complex, once values for ε0, ε1, and p are known, the calculation of the α's need only be performed once for any library and any encoding procedure, and it is trivial to execute using a computer. With these expressions, it becomes possible to assess the effect of errors on the weighting procedure. For example, it can be shown that for very small p(