Matching of Mass Spectra When Peak Height Is Encoded to One Bit S . L. Grotch Jet Propulsion Laboratory, California institute of Technology, Pasadena, Calif.
The problem of the identification of low resolution mass spectra has been approached from the viewpoint of statistics and information theory. The mass spectral peak heights of 3000 different organic compounds were quantized to only two levels; a “0” denoting no peak, a “1” a peak above a specified transition. All spectral patterns of three groups of 1000 spectra were compared pairwise and the distribution of mismatches was calculated. A theoretical analysis permits the accurate prediction of the mean and variance of this matching distribution from the statistics of each spectral group. These results indicate that mass spectral patterns are highly specific signatures even when encoded to only one bit. The relationship between the information content of spectra and the matching results is also explored.
RECENTADVANCES in analytical instrumentation have inundated the chemist with a deluge of data. For example, the coupled gas chromatograph/mass spectrometer (GC/MS) can produce a mass spectrum every few seconds for periods of many minutes in a single sample analysis. A commercially available infrared Fourier transform spectrometer can be scanned in a few seconds to produce a complete infrared spectrum. To rephrase a familiar quotation of Sir Winston Churchill, “Never have so many, owed so much, to so few instruments.” To even log this flood of data the chemist has called on the assistance of computers. In the last few years a number of computer systems for the real-time acquisition and processing of mass spectral data have been reviewed in the literature (I, 2). Since a detailed interpretation of these data is generally time-consuming, it is also natural to explore methods for automating this interpretation. In the past, it was felt that spectral interpretation was essentially different for each class of spectral information. With the increasing use of automated equipment and computers, it is becoming more apparent that spectral interpretation has many common features. In the work with mass spectra reported here, for example, many common features can be found which are related to the interpretation of infrared spectra. Simply stated, the primary question raised in this study is, “How specific a signature is a mass spectrum when peak height is quantized to only two levels?” The answer to this question will be sought using statistics and information theory. As a simple introduction, assume that low resolution mass spectra are available over a mass range of 200 amu. Assume that at each mass the only information known is the absence or presence of a peak. Peak height is thus quantized to two levels, or one bit; a “zero” denoting no peak, a “one” denoting a peak. Since 200 mass units are covered, the maximum information in a spectrum is 200 bits, If all possible combinations of these 200 bits occurred, a total of 2200,or approximately 2 X loBodistinct combinations (1) C. W. Childs, P. S. Hallman, and D. D. Perrin, Tuluntu, 16,
629 (1969). (2) E. Kendrick, Ed., “Advances in Mass Spectrometry,” Vol. 4, The Institute of Petroleum, London, 1968, pp 3-122. 1214
would result. Since less than IO7 compounds are now referenced by Chemical Abstracts (3), a mass spectrum of only 200 bits would, in theory, contain more than enough information to provide for the unambiguous identification of all known compounds. Obviously, these arguments are oversimplified. All 2100 possible combinations of patterns do not, in fact, arise. Many bit combinations occur more frequently than others, while still others do not occur at all. Nevertheless, the viewpoint posed above is useful because it provides a basis for quantitatively calculating the “information” in spectra and for relating this information to the ambiguity or uniqueness of spectra in general. The approach of the present study was a statistical examination of the matching of large groups of mass spectra. The spectrum of each member of a group was compared with that of every other member in the group and histograms were obtained for the number of mismatches observed. These histograms are a quantitative measure of the degree to which the spectra of the group differ from one another. A theoretical attempt has been made to predict this histogram from the statistics of the spectral groups. MOTIVATION FOR STUDY
The present study is an extension of work carried out in support of a pyrolysis gas chromatograph-mass spectrometer (GC/MS) experiment for the Mars lander of Project Viking (4). In this experiment a sample of Martian soil is to be pyrolyzed in an oven and the effluent passed first through a gas chromatograph for separation, and then into a mass spectrometer for compound identification. If the pyrolysis products can be identified from their mass spectra, it should be possible to infer in some detail the compounds introduced into the pyrolysis oven. At planetary distances, the data produced by the mass spectrometer must be digitally encoded before transmission to earth. The encoding of these data is important particularly with respect to compound identification when spectral peak height is encoded to only a few bits (54). Another question of importance is, “For a specified number of levels, where should the transitions between levels be set ?” It is believed that by approaching the problem of identification from information theory, it may be possible to develop matching criteria which have a firmer theoretical basis and are not merely the results of intuitive reasoning. This approach may also provide a better understanding of other spectral identification schemes. Additionally, it becomes possible to answer in a more definitive fashion, at least for the binary case, the question raised by Crawford and Morrison (5), “How specific is a mass spectrum?” (3) D. P. Leiter, H. L. Morgan, and R. E. Stobaugh, J. Chern. Doc., 5, 238 (1965). (4) P. G. Simmonds, G. P. Shulman, and C . H. Stembridge, J. Chromutogr. Sci., 7, 36 (1969). ( 5 ) L. R. Crawford and J. D. Morrison, ANAL. CHEM., 40, 1464 (1968).
ANALYTICAL CHEMISTRY, VOL. 42, NO. 11, SEPTEMBER 1970
Table I. Statistics of the Mass Spectral Library Group of 1000 spectra Total 1 2 3 Molecular weight 16.0 32.0 16.0 18.0 Minimum 159.1 123.8 143.5 210.0 Average 536.0 536.0 536.0 164.0 Maximum No. of Peaks > O . O l % 1 2 1 3, Minimum 78.4 65.6 76.2, 93.6 Average 177 165 177 158 Maximum No. of Atoms 1 2 1 6 Minimum 22.8 29.2 20.5 18.7 Average 110 110 34 105 Maximum No. of Compounds Containing only 233 45 134 54 C, H 1481 481 625 375 C, H, 0 186 43 131 12 C, H,N 852 0 516 336 Contain Halogen INTERPRETATION OF MASS SPECTRA The fundamental problem of mass spectrometry is the deduction of chemical structure from a mass spectrum. Because this problem lies at the heart of mass spectrometry, it has received very extensive attention with a resultant vast literature devoted to the problem(6-13). As is often the case in problems of this sort, no single approach is “best” in aU situations since each technique has cer-
( 9 K. Biemann, P. Bommer, and D. M. Desiderio, Tetrahedron Lett., 26, 1725 (1964). (7) A. L. Burlingame and D. H. Smith, Tetrahedron, 24, 5749 (1968). (8) J. Lederberg and E. Feigenbaum, “Formal Representation of Human Judgement,” B. Kleinmuntz, Ed., Wiley & Son, New York, 1968, Chap. 7. (9) B. Petterson and R. Ryhage, Ark. Kemi, 26, 293 (1967). (10) V. V. Raznikov and V. L. Talroze, Dokl. Akad. Nauk SSSR, 170, 379 (1966). (11) B. Petterson and R. Ryhage, ANAL.CHEM., 39,790 (1967). (12) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, &id., 41, 690 (1969). (13) L. R. Crawford and J. D. Morrison, ibid., 40, 1469 (1968).
tain features which can make it well suited to a particular application, while poorly suited to yet another. The approach investigated here is the library search technique; the comparison of an unknown spectrum against a library of previously measured known compounds. This technique readily lends itself to digital computation with a minimum of programming effort. It is of great generality, since in principle, it requires no additional information other than a library of low resolution mass spectra. The criteria used for “best fit” are generally chosen arbitrarily. This is due in large part to the observation of workers (5)that any number of different criteria will correctly identify an unknown if it is present in a library. Much work remains in the selection of matching criteria which have more theoretical justification. More attention must be directed to the problems of identification in the presence of spectral differences due to both instrumental factors and operating conditions. Also worthy of greater attention is the proper use of a library search even when the unknown spectrum is not present in the library. Low resolution mass spectral libraries now available in computer-compatible format contain about 5000 nonredundant spectra. The existence of organizations such as the Mass Spectrometry Data Centre in Great Britain should facilitate the continued accumulation and dissemination of these collections. MASS SPECTRAL LIBRARY
For this study, a large library of experimentally-measured low resolution (unit mass) mass spectra was used. These spectra were normalized in the conventional manner (Le., the maximum intensity peak in each spectrum was 100). Two sources of spectral data were utilized: The Dow Chemical Company uncertified collection (1967 spectra) (14) and a collection provided by Prof. K. Biemann of MIT, consisting of the ASTM collection and other spectra measured in this (14) “Uncertified Mass Spectral Data,” R. S. Gohlke, Ed., Eastern Research Laboratory, Dow Chemical Co., Framingham, Mass., Oct. 1963.
60
50
40
2 I-
i
v)
8
30
Y e
2
2 20
IO
0 MOLECULAR WEIGHT
Figure 1. Molecular weight distribution of 3000 spectra ANALYTICAL CHEMISTRY, VOL. 42, NO. 11, SEPTEMBER 1970
1215
Figure 2. Molecular weight distribution of first group of 1000 spectra
MOLECULAR WEIGHT
laboratory (15) (2004 spectra). Approximately 3000 additional spectra (primarily the API and MCA collections) were available but were only partially used in the work reported here. The Dow and MIT collections were screened to remove multiple spectra and a library of 3000 nonredundant spectra (1903 Dow and 1097 MIT) was finally obtained. Characteristics of this library are presented in Table I, and the molecular weight distribution is plotted in Figure 1. To assess the sensitivity of results to the choice of library, the larger group of 3000 spectra was arbitrarily divided (by accession number) into three subgroups of 1000 spectra each. The characteristics of these groupings (which remained the same in all calculations) are also given in Table I and Figures 2-4. It should be noted that a few of the spectra used in this work may be incomplete or even incorrect. It is felt, however, that these few spectra will only influence the statistical results of a group of 1000 very slightly. MATCHING OF LARGE GROUPS OF MASS SPECTRA
To quantitatively determine how different a group of N spectra are from one another, each member of the set should (15) K. Biemann, Massachusetts Institute of Technology, Cam-
bridge, Mass., personal communication, June 1968.
be compared against every other member using some criterion of disagreement. The statistical distribution of this criterion is a quantitative measure of how different these spectra are, on the average. This is the approach adopted here. Previous workers (5, 9) have done similar studies but have presented matching results for only a few specific compounds, rather than for a group in the statistical sense. Unlike previous work, however, spectral peak height is quantized here to only one bit at each mass position. In effect, for each spectrum one only knows whether or not a peak is present above a specified transition at a particular mass over a specified mass range. One-bit encoding has been adopted for several reasons : In order to more clearly understand the principles underlying spectral matching, the simplest case should be understood first. In space applications, the number of bits available for storage or transmission is often severely limited. It is, therefore, important to investigate the uniqueness of spectra when encoded to only a few bits. One-bit encoding is the simplest case and provides a very useful datum point. If very large groups of spectra are to be compared in detail, computer costs become prohibitive if traditional comparison techniques are employed, For example, Crawford and Morrison (5) cite search times of approximately 100 comparisons per second (on a CDC 3200 computer) for comparing only the 6 major peaks of a spectrum. At this rate, approxi-
50
40
i
y
30
m 0
is w
H
2
20
10
0
0
40
80
I20
160
200
240
280
320
MOLECULAR WEIGHT
Figure 3. Molecular weight distribution of second group of 1000 spectra 1216
. I
ANALYTICAL CHEMISTRY, VOL. 42, NO. 11, SEPTEMBER 1970
360
400
40 ’
MOLECULAR WEIGHT DISTRIBUTION
30.
22
THIRD GROUP OF IO00
” l
if
20-
I
_ I/I
mately one hour of computer time is required for a complete pair-wise comparison of only a 1000-spectrum library. Since computation time increases as the square of the size of library, such investigations soon become prohibitively expensive. As will be seen, a one-bit quantization results in an order of magnitude increase in comparison rates. One-bit encoding results in a very efficient utilization of computer core storage since 32-36 masses can be stored in each computer word instead of a single channel as in conventional techniques. It becomes possible to increase significantly the number of spectra which can be simultaneously stored in core or alternatively, for a given number of spectra, to reduce core requirements. CALCULATION PROCEDURE
Prior to the matching calculations, the original spectral data were first converted into a new set in which all peak heights above a specified threshold were represented by a “1,” all others by an “0.” These bit strings were then packed into computer words. In these calculations three different computers were utilized: an IBM-7094, a UNIVAC 1108, and an IBM 360144. The preparatory calculations required a total of about five minutes of IBM-7094 time to pack 3000 spectra for each transition. Three factors were fixed for each set of matching calculations: the group of spectra used; the mass range examined (maximum 12-200); and the transition in peak height between a level of 0 and a level of 1. (Assumed to be a constant, independent of mass.) The matching calculations were performed in the following manner : Every member of a selected set of N spectra was compared with every other member of the group. For N spectra, N(N - 1)/2 comparisons were required. (The number of combinations of N things taken 2 at a time.) In the comparison of two spectra A and B, the appropriate computer words were combined using a logical “exclusive or” instruction. A bit in the resultant word C = XOR(A,B) contains a “0” where the two spectra agree (both 0 or both 1) and a “1” where they disagree (one 0 and the other 1). (See Table I1 for an example.) The number of “1” bits in word C is the number of channels which disagree between spectrum A and spectrum B. These “1”s are counted and a histogram of disagreements is maintained for each spectral comparison. For a search over the mass range 12-140, approximately 1000 spectral comparisons per second could be performed using the IBM-7094. With the Univac 1108, the rate was about 3000 comparisons per second. It should be noted that all these programs were written in Fortran IV and it should be possible
I
I
I I
WORD
1
2
3
4
5
6
7
A
0
1
1
0
1
1
0
B
1
1
0
0
1
0
C
1
0
1
0
0
1
M M
= =
M * * e *
1
1
.*.*
1
1
*e*.
0
Number of bits in a computer word 36 for Univac-1108 or IBM-7094; 32 for IBM-360
to increase even further these processing speeds through the use of assembly language. MATCHING STATISTICS FOR ONEBIT ENCODED SPECTRA
Using the procedure outlined above, calculations were performed on a Univac-1108 computer on three different groups of 1000 mass spectra covering the mass range 12-200. In this section, these numerical results will be presented and in the following section, a theoretical analysis will be given. The observed matching statistics for the first group of 1000 spectra is summarized in Figure 5 for level transitions of 0.01, 1, and 10% of the base peak. For 1000 spectra, 1000 X 999/2 = 499500 matches were performed. This histogram represents the distribution of the number of peak mismatches when all binary spectra in the group were matched against one another, mass by mass, for the amu range 12-200. Although this curve appears continuous, it is in fact discretized since only integral numbers of disagreements can arise. The probability that a given number of disagreements M occurs, is given by the ratio of the ordinate corresponding to the abscissa M , divided by the total number of matches attempted. This histogram statistically represents how different the mass spectra arein the grouping when quantized to only one bit. Although quantitatively different, the matching results for the other two groups of 1000 spectra are typified by Figure 5. Several conclusions are apparent from these results: As the transition between level “0” and level “1” increases, the maximum of the frequency distribution shifts to fewer disagreements, Le., the spectral patterns become more similar. One would intuitively expect this behavior, since as the transi-
ANALYTICAL CHEMISTRY, VOL. 42, NO. 11, SEPTEMBER 1970
1217
4f
I 12 < 7 AMU V
I
0'
o 00 9q
f
OO00
0
0
0 I
0 .
I
.
0
C NUMBER OF CHANNELS DISAGREEING
Figure 5. Matching histogram for first group of 1000 spectra
tion increases, fewer peaks occur, and, hence, fewer disagreements wiil arise, on the average. As the transition increases, the spread in the histograms becomes smaller. Since the total number of matches is constant, the frequency of the maximum increases as the spread is reduced. The result will be discussed in the following section. From the viewpoint of spectral uniqueness, it is more informative to consider the matching distribution as a cumulative plot; e.g., the percentage of matches with less than a given number of disagreements us. the number of disagreements. The results of Figure 5 are presented in this manner in Figure 6 using a nonlinear scale for the abscissa. (On this scale, a true normal distribution would be a straight line.) For a level transition of 1.O Z of the base peak, typically less than one match in two thousand yields fewer than five disagreements. Furthermore, of the 189 channels examined, approximately 40 disagree on the average for this transition. These results show that the mass spectra of pure compounds are indeed quite unique even when encoded to only one bit. A similar conclusion was reached by Jurs et ai. (12) who found in their pattern classification work with mass spectra that one1218
Table 111. Perfect Matches Perfect matches observed in the matching of 3246 compounds, encoded to one bit with a level transition at 1.0% base peak for the mass range 13 to 140 1. 1-Butene 2-Butene 2. rn-Xylene p-Xylene 3. 2,3-Dimethylpyridine 3,4DimethyIpyridine 4. 1,4-Dicyanobenzene 1,2-Dicyanobenzene 5. 2,4-Dimethylbenzyl alcohol 3,5-Dimethylbenzyl alcohol 6. o-Chloro-toluene pchloro-toluene 7. t-Dichloroethylene c-Dichloroethylene Additionally, eight other perfect matches were found for high molecular weight compounds (generally, natural products) which had peaks in excess of 1.0% of the base peak at all masses above 50-70 to 137.
bit encoding still provided adequate information for good pattern separations. Furthermore, such results provide assurance that one-bit encoding can be an important aid in the identification of mass spectra [A spectral matching computer program based on these principles has been developed (16) 1. It is interesting to examine those cases in which perfect matches resulted when the spectra of two different compounds were compared. In these cases, ambiguity of identification would result with one-bit encoding. A library of 3246 mass spectra (including the 3000 previously discussed) was compared using an IBM-360/44. All spectra were encoded to one bit, with the level transition set at 1.0% of the base peak. Each encoded spectrum of the group was compared with every other spectrum over the mass range 13-140 for a total of 5.2 X lo6matches. Perfect matches between spectra of different compounds were found in only 15 cases (approximately 3 perfect matches per IO6 attempted). Seven of these compound pairs are listed in Table 111. The remaining eight pairs were all high molecular weight natural products with a peak at every mass in the mass range of 50-60 to 140. In all cases in Table 111, both compounds in the pair are close isomers. This result again indicates that a binary mass spectrum is a very specific chemical signature. In the following section a theoretical analysis of these results will be presented with a view to the a priori prediction of these matching statistics. PREDICTION OF MATCHING RESULTS FROM STATISTICS OF SPECTRA
Although the frequency distributions for the matching of large groups of spectra can be obtained by actually performing the matching calculations, it is important to have alternative methods available. If the matching histograms could be predicted, a deeper understanding of the principles underlying the matching process might become apparent. This understanding would serve as a foundation for further studies of the more commonly encountered, but more complex, multibit situation. (16) S. L. Grotch, Eighteenth Annual Conference on Mass Spec-
trometry and Allied Topics, San Francisco, Calif., June 1970.
ANALYTICAL CHEMISTRY, VOL. 42, NO' 11, SEPTEMBER 1970
Even though the calculations in the matching process can be made rapidly, the time required increases as the square of the number of spectra considered, and for very large groups of spectra (say, N > 5000) it becomes impractical to perform such calculations. If a faster means of prediction could be developed, the matching calculations might prove unnecessary. In this section the matching process is discussed from the viewpoint of statistics for the one-bit situation. The generalization to the multibit case is simple and is discussed after the one-bit case is explored. It is obvious that the mathematical results derived here are applicable to spectral classes other than mass spectra (for example, infrared spectra), but primary attention will be focused on the mass spectral example. Consider a group of N spectra in which peak height has been quantized to one bit over a total of L channels (atomic mass units in the case of mass spectra, wave number or wavelength increments in the case of infrared spectra). Note that the transition between level “0” and level “1” need not be a constant, but could be a function of channel number. Consider the matching of two spectra, X and Y , each represented by a vector of L elements (“0’s” or “1’s”). The individual elements xi. and y t denote the presence (“1”) or absence (“0”) of a peak above the specified level transition in channel i. Define functions F (xi.,y r )and DL in the following manner:
100
I
I
I
I
I
I
I
I
I
I /
I
12 SAMU 5 200
PERCENT
OF M T C H E S W I T H S N DISAGREEMENTS
Figure 6. Cumulative matching distribution for first group of 1000 spectra
F(xi, yi) = 0
if xi = yi
(1)
Table IV. Average Number of Disagreements (DL)for Matching 3 Groups of 1000 Mass Spectra over the AMU Range 12-137 Level transition
F(xi, yi) = 1
if xi # yi
(2)
1st 1000
(3)
Observed Predicted
0.1
1.o
10.0
47.56 47.51
37.41 37.38
14.62 14.61
53.21 53.16
40.60 40.56
13.29 13.10
54.35 54.30
49.20 49.16
18.76 18.75
2nd 1000
In the context of spectral matching, if two spectra (Xand Y ) are compared, channel by channel, a peak mismatch [F ( x t , yr) = 11results only when one spectrum has a peak in a given channel i and the other does not (xi. # y t ) . The number of disagreements over a range of L channels is the sum of the F ( x t , y d , or DL. The frequency histograms presented are the observed distributions of the variable DL. The problem, therefore, is the prediction of the distribution of DL knowing the statistics of X and Y. To this end, the mean and variance of DL will be calculated. MEAN OF D L
Define D L as the expected value or mean of the variable DL and let the operator E[ ] denote expected value. Assume that X and Y come from the same probability distribution and let: pi = probability of a “1” in channel i @e., Prob. (xi = 1)) qi = probability of an “0” in channel i ( = l - p i ) (Prob. (xi = 0 ) )
From the definition of expected value it follows that: (4)
Since the expected value of a sum is the sum of the expected values :
Observed Predicted 3rd 1000 Observed Predicted
number of spectra in the group, rather than as N2,for the acutal matching. It can easily be shown that DL is maximized when for all i, p i = That is, the average number of disagreements is maximized when it is equally likely that in each channel a “0” or “1” will occur. In this case, the average number of disagreements is one-half the number of channels considered. For identification purposes, it is desirable to maximize the differences between spectra. To maximize dL,the transitions between “0” and “1” should be chosen such that all p c approach0.5. A comparison of the predicted average number of disagreements (from Equation 5 ) and that actually found in the matching of three different groups of 1000 spectra is presented in Table IV for level transitions : 0.1, 1, and 10 X of the base peak, over the mass range 12-137. The observed values agree very closely with those calculated from Equation 5 , as they should, since both calculations should yield identical results. VARIANCE OF D L
Using Equation 5 , it is a simple matter to predict the average number of disagreements for the matching process from the statistics of the spectral group. This calculation is extremely rapid and the time required increases only linearly with N,the
The calculation of the variance of DL is more complex since in general cross-correlations between channels must be considered. Define v 2as the variance of DL. By definition:
ANALYTICAL CHEMISTRY, VOL. 42, NO. 11, SEPTEMBER 1970
1219
Table V. Variance (u*)of Distribution of Disagreements for the Matching of Three Groups of lo00 Mass Spectra over the AMU Range 12-137 Level transition 0.1 1.0 10.0 1st 1000 Observed 206.6 153.7 36.92 Predicted Equation 15 28.0 23.8 11.23 Equation 14 208.7 154.9 37.09 2nd 1000 Observed 197.6 149.4 42.49 Predicted Eauation 15 29.9 25.7 10.80 E4uation 14 200.3 151.0 42.83 3rd 1000 Observed 261.4 241.2 141 .O Predicted Equation 15 30.0 28.3 14.7 Equation 14 265.0 243.4 141.3
Substituting Equation 3 into Equation 6 and recalling that E[DL] = DLyields:
E
1
z
1
F2(Xi, r i )
+ 2 cc F(xi, yi)F(xj, yj)J i j>i
(7)
Since F(xi, y i ) = 0 or 1 only: E[F2(xi,yi)] = E[F(xi, yi)l
(8)
Equation 7 becomes : g2
= DL(~ -
BL)+ 2E CC F(xi, yi)F(xj, y j ) ] i j>i
[
(9)
The expected value of the second term may be computed from the joint probability distribution of the peaks in a spectrum. For any pair of channels i and j in a spectrum define:
rzizi = Probability that channel i is xi and channel j is xj (where xi, xj = 0, 1) For example, 7rlioi, is the probability that channel i contains a “1” and channel j contains an “0”. For each channel pair i,j the contribution to the sum in Equation 9 is: E[F(xi, ~
i F(xj, )
yj)l =
+ rlioir~iljl
Xr~ioirlili
(10)
The r may be readily calculated from a set of spectra by tabulating the number of times a particular peak configuration (00, 11, 01,lO) occurs for each channel pair i, j > i. For each channel pair in a set of N spectra define the mean values : 1 k=N pij = xixj (11)
N
k=l
and
The summations are taken over all N spectral members in a given set. Again, p i denotes the probability of finding a “1” in a single channel i, whereas p f j is the probability of simultaneously finding a “1” in both channels i and j . Equation 10 becomes : 1220
ETF(xi, Yi)F(xi, ~ i ) = l
wfj- p i b f l - + olfj- P ~ P ~(13) ~ I
The final result for 13 into Equation 9. a2 = d,(l
4
u 2 is
obtained by substituting Equation
- DL) +
cc [261fj i
j>i
Ptx.Pfl
- P A + &if
- PiPhl
(14)
Equation 14 is simplified considerably if the assumption is made that all channels are independent. In this case, Equation 14 reduces to:
In Table V are presented the observed and predicted variances as calculated from Equations 14 and 15 for three different groups of 1000 spectra. If the channels are assumed independent, the predicted variances are about an order of magnitude lower than those found in the actual matching calculations. When the correlation between channels is included, Equation 14 predicts the variance very well. These results show that in mass spectra, channels are generally not independent, but are in fact, highly correlated. Other calculations have shown, for example, that on the average, if a peak is observed at a given mass, it is much more likely that a second peak will also be present at an adjacent mass rather than no peak occurring. (This is due to isotopic effects and the common loss of a hydrogen atom.) Calculations indicate that this correlation extends well beyond adjacent masses. It can be seen in Table V that as the transition between “0” and “1” increases, the discrepancy between the observed and the predicted variance assuming channel independence (Equation 15) tends to become smaller. This is due to a reduced correlation between channels (e.g., the covariance contribution) when only the larger peaks are considered. Although Equation 14 appears formidable, it is not difficult to program and the calculations can be performed very rapidly (typically, less than one minute is required on the Univac 1108 for 1000 spectra for the mass range 12-200). By using Equation 5 to predict the mean, and Equation 14 to calculate the variance, it is possible to accurately predict, a priori, the first two moments of the matching distribution. With these equations, the matching distribution may be calculated about an order of magnitude more rapidly than by actually performing the matching calculations. FORM OF MATCHING DISTRIBUTION
The process giving rise to the distribution of DL is analogous to that described by Feller (Ref. 17, p 205) as “Bernoulli trials with variable probabilities.” In the present context, each trial consists of the matching of a given channel in two spectra with a probability of success (i.e,, a match) that varies as a function of channel number. Feller calculates the mean and variance for the distribution of the number of successes under the assumption that each trial represents an independent event, i.e., there is no correlation between the probabilities of success in different trials. The mean and variance calculated by Feller are equivalent to Equations 5 and 15. A more de(17) w.Feller, “An Introduction to Probability Theory and its Applications” Vol. I, John Wiley & Sons, New York, N. Y., 1950.
ANALYTICAL CHEMISTRY, VOL. 42, NO. 11, SEPTEMBER 1970
0 1st
90
.
80
-
ACTUAL MATCHING
I
I
I
I
IC00 SPECTRA
12 5 AMU 5 137 TOTAL MATCHES = 499,500
NON-INDEPENDENT CHANNEL MODEL
14
--
2 Y
12
dn
D, = 48.7, u = 1 5 . 4
-
0 1st 1000
PREDICTED FROM INDEPENDENT EHANNEL MODEL
A 2nd 1000 0 3rd 1000
-
-
to -
-
50-
-
70
n 10
z ?
8
$
6
?
-1
‘0. II-n
8 6
P $
5
2
4
w
2
2
8 3
n 0 NUMBER OF DISAGREEMENTS, DL
Figure 7. Predictions of matching model with observed data
tailed discussion of these matching distributions may be found in Ref. 18 where a Poisson limit was obtained. Feller (Ref. 17, p 217) points out a seemingly paradoxical feature of these distributions; namely, for a given mean, as the probability of success at each trial becomes more uniform (ix.,all p i + constant) the magnitude of the variance of the distribution of successes, u Z , increases, i.e., the distribution of successes becomes less uniform. In the case of all p r = I/*, the variance is maximized and equal to D,. If separability is , than DL, to be measured by a parameter such as ~ L / urather optimum separability may not be achieved by choosing all p t = 0.5. The matching histograms of Figure 5 do not appear to be statistically normal, although they may be approximated by normal distributions with the variance calculated from Equation 14 (see Figure 7).
ov 0
I 40
I I 80 120 TOTAL ENTROPY, BITS
I 160
i 200
Figure 8. Average number of disagreements us. total entropy for three groups of 1000 spectra
GENERALIZATION OF THE ONE-BIT RESULTS TO MULTIPLE LEVELS
To generalize the results derived for the one-bit case define: p i k = probability that a peak in channel i will fall
F(xi, y i ) = 0 when xi = yi = C(xi, vi) for
X,
in level k (16)
# yi
(17)
The function C(xi, y j ) depends upon the criterion which is applied to a disagreement between levels. If, for example, the measure of disagreement is the absolute value of the difference between the levels in channel i C(xi, yi) = Ikl
- kzj for xi
=
kl and y i = kz
or xi
=
k z and yi = kl
INFORMATION CONTENT OF MASS SPECTRA (1 8)
The expected value of F i n channel i for Mlevels is: M
E[F(xi, ~
i ) l=
2
M
C ~ i kk C i PikaC(k1, kz) ki=l r>h
The variance of DL,assuming channel independence, is obtained by summing the individual uiZ over L channels. For non-independent channels, the result is more complex and the form depends on the “best fit” criterion, C.
(19)
If D L is defined by Equation 3, then the mean of D for L channels is :
In space applications, it is frequently important to determine the information content of a data source for encoding purposes. This content can be expressed quantitatively in terms of the information “entropy” measured in “bits” (not to be confused with the thermodynamic quantity of the same name). Since entropy is a quantitative measure of the inherent information in a source, for spectral identification it is important to encode spectra so as to maximize this entropy, For more detailed descriptions of these concepts see Ref. 19. or any of the many texts on information theory. For M levels, the single channel entropy hi of a spectrum is
The variance of each individual channel is : (18) D. E. Barton, J. Roy. Statist. SOC.,20, 73 (1958).
(19) S. Goldman, “Information Theory,” Prentice Hall, Englewood Cliffs, N. J., 1953. ANALYTICAL CHEMISTRY, VOL. 42, NO. 11, SEPTEMBER 1970
1221
TOTAL ENTROPY FOR RANGE AMU = 1'2-200
transition is given in Figure 9. For each group the total entropy is substantially constant for transitions over the range 0.01 to approximately 0.3 % of the base peak, and the entropy falls rapidly above about 1.0% of the base peak. In effect, the peaks below about 0.3 contribute relatively little to the information content. This is also reflected in DL which increases by less than 13 in going from a transition of 0.3 down to 0.01 %. In a practical situation, the setting of the optimum transition is also strongly influenced by the noise level of the system which is not taken into account in these arguments. The calculations above indicate that for the spectral groups examined, the entropy increases continuously as the transition level drops, at least down to 0.01 of the base peak. In many systems, however, noise may preclude going to such low levels and a compromise has to be made. Figure 9 indicates quantitatively how much information is lost in achieving this compromise. The generalization of the above arguments to the multilevel case is simple. It can be shown that HL is maximized and equal to L log&f bits only when the probability that a peak will fall in any level is the same for all levels in all channels (allpit = l / W . In the absence of noise, the optimal transitions between levels are dictated by the peak height distribution of mass spectra. This distribution has been found to be approximately lognormal (20). Thus, nearly optimal levels can be achieved by choosing equal increments on a logarithmic height scale. Ideas derived from information theory should prove useful in providing quantitative insights into problems of spectral matching. A recent paper (21) explores some of these concepts in general terms for spectroscopic problems. Hopefully, these notions will find application in future studies in this area.
x
0 0.01
1 .o
0.10
10.0
100.0
LEVEL TRANSITION, % BASE PEAK
Figure 9. Total information entropy as a function of level transition defined as :
where pit is the probability of a given peak in channel z falling in level k . The total entropy for L channels is: L
L
HL =
i=l
hi
=
- i-1
M pik
k-1
log2 pik
(23)
For simplicity, consider the one-bit case, M = 2. Using the definitions given above, Equation 23 becomes: HL =
-
L
i= 1
L.i log2 pi
+ (1 - pi) log, (1 - pi)]
(24)
It is easily shown that HLis maximized and equal to L bits if, and only if, for all i, p f = '/z. Stated in another way, the total information of a class of spectra is maximized if at each mass the transition level is set so that it is equally likely that a peak will or will not occur. In terms of the binary matching studies given above, a comparison of Equation 5 for the average number of disagreements, DL,with the entropy function of Equation 24 shows that as the entropy of the source increases, dL also increases and that both are maximized when for all channels, pc = '/z. Thus, to achieve maximum average separability (as measured by BL),the level transitions should be set so as to obtain maximum entropy. (See Figure 8.) The total entropy as calculated from Equation 24 for the three groups of 1000 spectra as a function of a constant level
1222
ACKNOWLEDGMENT
The author acknowledges the statistical insights provided by his colleagues H. Lass and C. Solloway. RECEIVED for review March 4, 1970. Accepted July 2, 1970. Material presented at the Joint Conference of the Chemical Institute of Canada with the American Chemical Society, Toronto, Canada, May 24-29, 1970. This paper presents the results of one phase of research carried out at the Jet Propulsion Laboratory, California Institute of Technology, under Contract No. NAS 7-100, sponsored by the National Aeronautics and Space Administration. ~~
(20) S. Grotch, ASTM E-14, Conference on Mass Spectrometry, Dallas, Texas, May 1969. (21) H. Kaiser, ANAL.CHEM., 42, No. 2, 24A (1970).
ANALYTICAL CHEMISTRY, VOL. 42, NO. 11, SEPTEMBER 1970