Classification of mass spectra using adaptive digital learning networks


T. J. Stonham and I. Aleksander¹
The Electronics Laboratories, The University, Canterbury, Kent, England

M. Camp, W. T. Pike,² and M. A. Shaw
Unilever Research, Port Sunlight, Wirral, Cheshire, England

Digital learning networks of adaptive logic elements have been applied to the problem of automatic routine identification of mass spectral data according to the functional groups present. The technique, which is an embodiment of the n-tuple method of pattern recognition, is not limited solely to the classification of linearly separable data, and offers a saving in computer time and storage requirements over the discriminant analysis approach. The structure and mode of operation of the learning nets are discussed, and results are given for three classification experiments. Finally, the separabilities of the 28 groups employed in the multicategory classification are considered, thereby enabling a comparison to be made between the digital learning net approach and the spectroscopist's interpretation.

¹ Present address, Department of Electrical Engineering, Brunel University, Kingston Lane, Uxbridge, Middlesex, England.
² Present address, Proprietary Perfumes Ltd., Kennington Road, Ashford, Kent, England.

The interpretation of scientific data and, in particular, chemical data, has traditionally been based on theoretical analysis, and involves the detection of explicit relationships developed from previous experimentation and logically constructed models based on one's knowledge of chemistry. The advent of the computer made possible the development of large libraries of classified data such that interpretation of “unknown” data is reduced to a library search, and successful recognition requires that the data are unique and have previously been classified. In recent years, however, much interest has been shown in empirical methods of data interpretation where it is assumed that a relationship exists between the data and its defined classification, and pattern recognition techniques have been applied to the interpretation of these data.

Much work has been carried out on applying learning machine methods as described by Nilsson (1) to spectral identification, particularly to the identification of mass spectra (2, 3). A limitation to the learning machine approach is that the data must be linearly separable, and it has been shown (4) that spectral data exhibit linear inseparability to a considerable extent. A pattern classifier using piecewise-linear discriminant functions, which gives improved performance with mass spectra, has been described (4). However, this is achieved only at the expense of computing time, as the complexity of the training and testing procedures is increased.

This paper reports on the application of digital learning networks (5) (DLN) to the interpretation of mass spectral data. This approach is less dependent on linear separability of data than other methods of classification, since the functions a DLN can perform are dependent on the network topology, which can easily be modified or changed. Although a DLN in a particular configuration cannot perform all possible linearly inseparable functions, interleaved areas of recognition in pattern space are permitted. The DLN approach can be implemented in hardware, using commercially available memory elements, or simulated on a digital computer. Despite having to carry out a parallel processing operation in a serial manner in the computer simulation, the latter still requires less computer time and storage than other learning methods based on discriminant analysis.

Figure 1. Correspondence of principal terminals of SLAM and RAM elements (the terminals of the RAM are shown in parentheses)

DIGITAL LEARNING NETWORKS

Based on the Bledsoe and Browning (6) n-tuple method of pattern recognition, the digital learning network approach involves the computation of joint probabilities of occurrence of randomly chosen subsets of binary pattern elements derived from the mass spectra (7). The inherent generalization ability in the technique enables both recognition and prediction to be carried out.

DLNs were first introduced in 1968 (8) and comprise an interconnection of adaptive logic units. An adaptive logic unit is essentially a memory element and, at the outset of this work, it was taken for granted that memory elements would become available in integrated circuit form. Initially, no such device was available, and a unit called the SLAM (Stored Logic Adaptive Microcircuit) was developed (9). However, this device has now been superseded by the commercially available Random Access Memory (RAM), which can now be regarded as the basic element of a DLN.

Figure 2. A single-layer digital learning network

The principal terminals of a memory element (Figure 1) are the inputs (addresses in a RAM) and the output (data-out terminal), and the latter performs a function of the former. In the case of a 4-input memory element, the device performs the following Boolean function

$$f = \bar{x}_1\bar{x}_2\bar{x}_3\bar{x}_4\,\phi_0 + \bar{x}_1\bar{x}_2\bar{x}_3 x_4\,\phi_1 + \cdots + x_1 x_2 x_3 x_4\,\phi_{15}$$

where $x_1$, $x_2$, $x_3$, and $x_4$ are the binary inputs to the element, $\phi = (\phi_0, \phi_1, \ldots, \phi_{15})$ is a binary vector representing the store contents of the memory element, $+$ is the logic operation OR, and $\bar{x}$ (NOT x) is the INVERT operation requiring x = 0.

The inputs to the memory elements must be binary; therefore, in any pattern recognition application, the data (in this case the mass spectra) must undergo some preprocessing to produce binary patterns of fixed format. In practice, the memory elements are used in networks. Figure 2 shows a single-layer network of 128 elements. The system is “trained” by the application of a pattern sample to the inputs of each memory element and the transmission of the desired output to the teach (data in) terminal. A signal at the teach clock (read/write) terminal enables the teach information to be stored in the memory element. When the response of an element is sought, the teach clock is not operated and an input address accesses the appropriate store location, giving a data output.

The value of n (determining the n-tuple sampling of the patterns) used in these investigations was 4, i.e., the pattern samples are 4-tuples. In computer simulations of DLNs, n can be varied, an advantage over a hardware classifier where n must be fixed; however, RAMs with up to 12-bit addresses are currently available. Apart from the physical problem of increasing storage with n (the storage capacity of each memory element is $2^n$ bits), the generalization ability of the net and the training requirements are also related to the pattern sample size. For very small values of n, there is a tendency to overgeneralize, although the network performs well with small training data sets; for large values of n, less generalization occurs but the training sets must be representative of their class to a greater extent.
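The behaviour of a single 4-input memory element can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the class name `MemoryElement`, its methods `teach` and `respond`, and the helper `boolean_form` are hypothetical, and the convention that x1 is the most significant address bit is an assumption, chosen so that store location 1 corresponds to the term in which only x4 is set.

```python
from itertools import product


class MemoryElement:
    """Hypothetical model of an n-input adaptive logic unit (a 2**n-bit RAM)."""

    def __init__(self, n=4):
        self.n = n
        self.store = [0] * (2 ** n)  # store contents, all zero at the outset

    def address(self, inputs):
        # Assumed convention: the first input is the most significant address bit.
        addr = 0
        for bit in inputs:
            addr = (addr << 1) | bit
        return addr

    def teach(self, inputs, desired):
        # Teach clock operated: write the desired output at the addressed location.
        self.store[self.address(inputs)] = desired

    def respond(self, inputs):
        # Teach clock not operated: read the addressed location.
        return self.store[self.address(inputs)]


def boolean_form(store, inputs):
    """Evaluate the same function as an OR over all 2**n product terms."""
    result = 0
    for k, pattern in enumerate(product((0, 1), repeat=len(inputs))):
        # A product term is 1 only when every input matches its pattern bit.
        term = all(x == p for x, p in zip(inputs, pattern))
        result |= int(term) & store[k]
    return result


element = MemoryElement(n=4)
element.teach((0, 0, 0, 1), 1)  # sets store location 1
element.teach((1, 1, 1, 1), 1)  # sets store location 15
```

Under these assumptions the table look-up in `respond` and the term-by-term evaluation in `boolean_form` agree for all 16 input combinations, which is the sense in which a RAM "performs" the Boolean function above.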
Investigations into the effect of the sample size on the recognition performance have shown that n has an optimum value (10). While it is not claimed that n = 4 is optimal for mass spectral recognition, it can be seen that this value gives satisfactory performance. If n is taken to its limits, the n-tuple method becomes analogous to other pattern recognition techniques. In the case of n = 1, one is performing template matching where the template is a superimposition of all the training patterns, and for n equal to the total number of bits in the patterns, the method becomes a form of library search.

In the experiments to be described, the input patterns derived from mass spectra are 256 bits in size and the connection data comprise a 2 to 1 mapping involving 512 connections to the input space. Initially, the mapping is random because, in the first instance, no preferential mapping is assumed. There are in excess of $10^{1000}$ different mappings and, therefore, exhaustive comparisons are impossible.
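One way to generate such a random 2 to 1 connection mapping is sketched below. It is an illustration under stated assumptions, not the procedure actually used: the function name `random_mapping` and its parameters are hypothetical, and "2 to 1" is read as every input bit being connected exactly twice, which gives 512 connections and hence 128 four-input elements.

```python
import random


def random_mapping(pattern_bits=256, n=4, fan_out=2, seed=0):
    """Return one n-tuple of input-bit positions per memory element.

    Every input bit appears exactly `fan_out` times, so the number of
    connections is pattern_bits * fan_out (assumed to be divisible by n).
    """
    rng = random.Random(seed)  # fixed seed so the mapping is reproducible
    connections = list(range(pattern_bits)) * fan_out
    rng.shuffle(connections)
    return [tuple(connections[i:i + n]) for i in range(0, len(connections), n)]


mapping = random_mapping()
```

In this simple sketch nothing prevents one element from sampling the same bit twice; whether the original mapping excluded that case is not stated.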

PREPROCESSING OF MASS SPECTRA

Since the data input to the learning networks must be binary, some preprocessing of the mass spectral data is necessary. A simple direct coding of mass spectra involves applying a threshold to the intensity at each integer mass value, with the pattern origin (defined as the top left-hand corner of a pattern) corresponding to the zero mass point. The pattern is then made up to a standard size (256 bits in this case) by inserting zeros after the intensity for the highest mass value if this is less than 256. The spectrum must be truncated if mass values greater than 256 are encountered. With this method of coding, emphasis will be placed on the characteristic fragment ions which are independent of the molecular weight of the compound (e.g., m/e 74 in the mass spectra of fatty acid methyl esters), since they remain at a fixed point on the net.

An alternative coding is to apply the intensity threshold in the same way but make the pattern origin the molecular weight of the compound. Thus, emphasis will be placed on the characteristic neutral losses (e.g., M - 31 in fatty acid methyl esters), since the appropriate fragment ions will now occur at the same point in the input pattern to the nets.

One drawback of a direct coding with intensity threshold is that the intensity information is considerably reduced. In order to give greater emphasis to the intensity data, the two following methods were employed. The binary pattern was divided into sixteen 16-bit words, each word consisting of two parts giving information as to the amplitude and position of a peak within the spectrum.
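The two threshold codings can be sketched as follows. The function names `direct_code` and `neutral_loss_code` are hypothetical, and several details are assumptions: a peak is coded as 1 when its intensity is strictly above the threshold, pattern positions run from 0 to 255, and in the second coding a fragment is placed at position M minus its mass. The toy spectrum is invented purely for illustration.

```python
def direct_code(spectrum, threshold, size=256):
    """First coding: pattern position = mass, origin at the zero mass point."""
    pattern = [0] * size  # zeros fill the pattern up to the standard size
    for mass, intensity in spectrum.items():
        # Masses beyond the pattern are truncated.
        if 0 <= mass < size and intensity > threshold:
            pattern[mass] = 1
    return pattern


def neutral_loss_code(spectrum, mol_weight, threshold, size=256):
    """Second coding: pattern position = neutral loss (mol_weight - mass)."""
    pattern = [0] * size
    for mass, intensity in spectrum.items():
        loss = mol_weight - mass
        if 0 <= loss < size and intensity > threshold:
            pattern[loss] = 1
    return pattern


# Invented toy spectrum {mass: relative intensity} for a hypothetical
# compound of molecular weight 270.
toy = {43: 5.0, 74: 100.0, 87: 60.0, 239: 12.0, 270: 15.0}
direct = direct_code(toy, threshold=10.0)
by_loss = neutral_loss_code(toy, mol_weight=270, threshold=10.0)
```

With these toy values the m/e 74 peak sets bit 74 of `direct`, whatever the molecular weight, while the peak at 239 sets bit 31 of `by_loss`, the M - 31 position.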
In this third method, the most significant peaks can be selected from the spectrum (up to 16 peaks) and more intensity information can be incorporated into the patterns. The numerical information is binary Gray-coded, and this has an advantage over the simple binary codes insofar as the Hamming distance between consecutive numbers is always 1; as the binary structures of the numbers are regarded by the networks as patterns, peaks of similar dimensions will give rise to similar patterns. (The Hamming distance is numerically equal to the number of differing bits in two binary patterns.)

In the fourth method of coding, the reduced mass spectrum (ion series) was used. This is a standard form of representation for spectral data when employing file or library search routines (11). In these calculations, m/e values less than 26 and m/e equal to 28, 32, 40, and 44 are excluded because, especially in GC-MS work, intensities at these mass numbers are very much affected by instrumental background. The list starts with the m/e = 29 series and ends with the m/e = 42 series. The reduced spectra are calculated by performing the ion series summations using the equation

$$S_m = 100 \, \frac{\sum_{n} I_{m+14n}}{\sum_{j} I_j}$$

where m = 1, ..., 14; n = 0, 1, 2, ..., covering the whole spectrum; $I_j$ is the relative intensity at mass j; and $S_m$ is the % contribution of ion series m to the total ion intensity. This coding is particularly appropriate to classes of compounds where CH2 is the basic repeating unit. For example, series of compounds of the form R1-(CH2)n-R2 give rise to similar reduced spectra. Gray-coding of the numerical data was again employed.
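The Gray-code property and the ion series summation can both be checked with a short sketch. The function names are hypothetical. For the reduced spectrum it is assumed that series m collects the masses m, m + 14, m + 28, ... (so that m = 1 is the m/e = 29 series and m = 14 the m/e = 42 series) and that the percentages are normalised to the intensity retained after the exclusions; the toy spectrum is invented for illustration.

```python
def gray(k):
    """Binary-reflected Gray code of a non-negative integer."""
    return k ^ (k >> 1)


def hamming(a, b):
    """Number of differing bits between two non-negative integers."""
    return bin(a ^ b).count("1")


def reduced_spectrum(spectrum, period=14, min_mass=26, excluded=(28, 32, 40, 44)):
    """Percentage contribution of each of the 14 ion series.

    Index m - 1 of the result holds the series containing masses m, m + 14, ...
    """
    totals = [0.0] * period
    for mass, intensity in spectrum.items():
        if mass < min_mass or mass in excluded:
            continue  # background-affected masses are left out
        totals[(mass - 1) % period] += intensity
    total = sum(totals)
    if total == 0:
        return totals
    return [100.0 * t / total for t in totals]


# Invented toy spectrum {mass: relative intensity}.
toy = {18: 100.0, 28: 500.0, 29: 10.0, 41: 40.0, 43: 30.0, 57: 20.0}
series = reduced_spectrum(toy)
```

In plain binary, 7 and 8 differ in four bits, whereas their Gray codes differ in one; this is why peaks of similar size give similar bit patterns under Gray-coding. In the toy spectrum, masses 29, 43, and 57 fall in the first series and mass 41 in the thirteenth.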

NATURE OF THE CLASSIFICATION

In the DLN approach to data interpretation, it has been the aim to classify mass spectra according to the functional groups present, since this is often the first objective of the spectroscopist when faced with the plethora of data resulting from a GC-MS analysis.

Figure 3. Piecewise-linear partition of pattern space

Pattern classification with DLNs can be divided into two main operations: (i) the training phase and (ii) the response phase. In the training phase, reference is made to a known data set, in order to set up the logic which will enable the classification to be effected in the response phase. It has been stated that each memory element of a DLN can be taught to associate information (0 or 1) with the n-tuple of pattern it sees at its inputs. The teaching involves the storing of this information within the memory element at a location addressed by the n-tuple. Taking the learning net as a whole, the teach information input during training and, subsequently, output during the response phase can be regarded either as a pattern vector or interpreted numerically, according to the mode of training. In the former case, feedback can be incorporated into the system by use of a suitable mapping between the output and the input of the network. It was felt desirable, however, to maintain the learning system in as simple a form as possible while investigating the applicability of DLNs to the spectral problem. Thus, the following mode of operation was adopted.

A learning net is trained on a specific class of mass spectra by teaching each element of the network to output a 1 for all pattern n-tuples encountered in the training. (Initially, all stores are set to zero.) The connection mapping remains fixed throughout the classification; therefore, the locations on the binary patterns (derived from the mass spectra) sampled by each element do not change. In the case of n equal to 4, there are 16 possible 4-tuples which can be seen by each element (0000-1111). If a data set is being used in which the existence of some characteristic features is postulated, one would expect a limited set of n-tuples to occur at the input of each element during training (12).
If there is no common characteristic within the data, there is equal probability of each possible n-tuple occurring, and all the stores will eventually be set to 1 if a sufficient number of training patterns are available. In the response phase, the pattern to be classified is input to the learning net in an identical manner and the stored data in the memory elements are now accessed. Therefore, if the n-tuples input to each element address locations which have previously been addressed at any time during the training phase, a 1 is output; otherwise the element output is 0. A measure of the response of the whole net to a pattern is obtained by an arithmetical summation of the outputs of all the elements in the net. The aim is to obtain a strong (i.e., numerically large) response from a trained network for patterns to be classified with the training set, to the exclusion of all other patterns. The training of a digital learning net as outlined above is a straightforward process compared with the determination of an optimum hyperplane in the discriminant analysis approach (2), since the response of the net to previous training patterns is not affected as training progresses. It is sim-
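The training and response phases described in this section can be sketched as a small simulation. The class `DigitalLearningNet` and its method names are hypothetical, the random 2 to 1 mapping is generated in the simplest possible way, and the test patterns are random bit strings rather than coded mass spectra; the sketch is meant only to show the mode of operation, not to reproduce the experiments.

```python
import random


class DigitalLearningNet:
    """Hypothetical single-layer net of n-input memory elements."""

    def __init__(self, pattern_bits=256, n=4, fan_out=2, seed=0):
        rng = random.Random(seed)
        connections = list(range(pattern_bits)) * fan_out
        rng.shuffle(connections)
        # Fixed connection mapping: one n-tuple of bit positions per element.
        self.mapping = [connections[i:i + n] for i in range(0, len(connections), n)]
        # Initially, all stores are set to zero.
        self.stores = [[0] * (2 ** n) for _ in self.mapping]

    def _addresses(self, pattern):
        # Address seen by each element for the given binary pattern.
        for positions in self.mapping:
            addr = 0
            for p in positions:
                addr = (addr << 1) | pattern[p]
            yield addr

    def train(self, pattern):
        # Teach every element to output 1 for the n-tuple it sees.
        for store, addr in zip(self.stores, self._addresses(pattern)):
            store[addr] = 1

    def response(self, pattern):
        # Arithmetical summation of the outputs of all elements.
        return sum(store[addr]
                   for store, addr in zip(self.stores, self._addresses(pattern)))


net = DigitalLearningNet()
rng = random.Random(1)
a = [rng.randint(0, 1) for _ in range(256)]
b = [rng.randint(0, 1) for _ in range(256)]
not_a = [1 - bit for bit in a]

untrained = net.response(a)        # no store has been set yet
net.train(a)
after_a = net.response(a)          # every element recognises its n-tuple
complement = net.response(not_a)   # every address differs from the trained one
net.train(b)
still_a = net.response(a)          # further training only adds 1s to the stores
```

Because training only ever sets store locations to 1, the response to a previously trained pattern stays at its maximum (128 for this net) however much further training is carried out.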





found that by reducing the amount of training of group 9, the two spectra in question could be correctly classified, giving zero error over the whole data set.

SEPARABILITY OF CHEMICAL CLASSES BY DIGITAL LEARNING NET CLASSIFICATION OF MASS SPECTRA

It is obvious that the DLN approach to the identification of mass spectra is fundamentally different from the spectroscopist’s interpretive methods. The former method has been employed to achieve a basic classification while the latter is a highly developed technique enabling precise identifications to be made (15). A measure of the reliability of a classification with DLNs can be inferred from the difference between the mean response of the spectra to be classified with the training group and those of other groups. As each mean response has a distribution of individual responses associated with it, the probability of a misclassification decreases with increasing separation between different groups.

Data have been compiled (16) on the separabilities of the 28 groups in the classification experiment summarized in Table V, where the range of responses encountered in the “no” classification (i.e., those patterns to be classified as not belonging to the training group) can vary from 20 to 128. The responses obtained from a trained net depend on the Hamming distance of the test patterns from the training set (17) and, for each net, the response groups can be placed in an order of similarity with respect to the training group. For purposes of comparison, the following mean response levels were chosen to assess group similarities: (i) a response greater than 120 indicates a high degree of similarity with the training group and (ii) responses of 110-120 show a lesser degree of similarity. Groups having response means in the above ranges are listed in Table VI. Groups having a response mean of less than 110 are easily distinguished from the training group and are not considered here. The mean response is shown in parentheses. In the case where the training group is specified as being well resolved, the response means of all the groups, excluding the training group, were less than 110. The training group response mean was always 128.

Table VI. Chemical Group Similarities Obtained by Classification of Mass Spectra by a System of DLNs
Training groups | Mean responses
Tertiary alcohols (114), Secondary alcohols (112), Higher ketones (111), Aliphatic amines (111) Alkanes (120), Higher ketones (118), Secondary alcohols (116), Amines (116) Methyl ketones Higher esters (118), Methyl esters (115) Carboxylic acids Alkanes (120), Methyl esters (114), Higher esters (112) Ethyl esters Ethyl esters (119), Secondary alcohols (118), Amines (118), Higher ketones (116), Higher esters Alkanes (115), Methyl esters (112), Methyl ketones (112) Secondary alcohols (123), Tertiary alcohols (118), Aldehydes (118), Alkenes (117), Primary alcohols Methyl ketones (115), Amines (113), Alkanes (113), Higher ketones (113) Primary alcohols (112), Secondary alcohols (111) Aldehydes Amines (119), Alkanes (119), Secondary alcohols (117), Tertiary alcohols (113), Higher ketones Nitriles (113), Methyl ketones (113) Methyl ketones (111), Higher ketones (112), Tertiary alcohols (121), Amines (118), Secondary alcohols Alkanes (114) Secondary alcohols (122), Amines (117), Alkanes (111) Tertiary alcohols Methyl esters (112), Methyl ketones (114), Aldehydes ( l l j ), Higher ketones (116), Diesters Secondary alcohols (114), Tertiary alcohols (117) Well resolved Substituted keto-acids 1-Phenyl alkanes Well resolved Well resolved Terpenes ??-Phenyl alkanes ( 1 2 I 1) 1-Phenyl alkanes (115) Higher ketones (112), Secondary alcohols (119), Tertiary alcohols (117), Alkanes Aliphatic amines (112), Nitriles (111) Mercaptans Sulfides (112) Sulfides Mercaptans (120) Alkanes (113), Nitriles (112) Alkenes Alkanes Methyl ketones (113), Higher ketones (113), Secondary alcohols (111), Amines (112), Alkenes (117), Nitriles (113) Alkenes (111), Alkynes (112), Pyrazines (111), Furans (113) Nitriles Alkynes Alkenes (117), Nitriles (121), Substituted pyrazines (120) Pyrazines Well resolved Substituted phenols Terpenes (117), Pyrroles (112) Nitriles (120), Alkynes (117), Substituted pyrazines (119), Pyrroles (117) Furans Pyrroles Alkynes (112), Substituted pyrazines (118), Furans (111) Well resolved Thiophenes Substituted phenols (113) Aromatic esters Methyl esters

One of the salient features of the results summarized in Table VI is that a similarity is revealed between classes with a common functional group. This can be seen in particular for the esters, alcohols, and ketones. Thus, the possibility of redefining the classification groups arises, thereby developing a multi-level classifying scheme. The preliminary classification can group together classes where common peak positions prevail, e.g., Sulfides and Mercaptans. These preliminary groups can then be separated in a secondary classification where optimization and varying input thresholds can be employed. Limiting the preliminary classification overcomes a practical constraint, for single-layer multicategory classifiers cannot be expanded indefinitely, despite the fact that no apparent fall-off in performance occurred with the 28 categories. A preliminary classification would allow further investigations to be carried out to determine structural information without encountering the difficulties which can arise due to the existence of isomers, when tests pertinent to specific chemical groups are used.
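The response levels used to compile Table VI amount to a simple thresholding of the mean responses obtained from one trained net. The sketch below illustrates this; the function names and the group labels and values in the usage lines are invented, and a mean of exactly 120 is assumed to fall in the 110-120 band.

```python
def similarity_level(mean_response):
    """Classify a mean response against the two chosen levels."""
    if mean_response > 120:
        return "high"
    if mean_response >= 110:
        return "lesser"
    return "distinguished"


def similar_groups(mean_responses):
    """Format one table entry from {group name: mean response}.

    The training group itself (mean response 128) is assumed to be left out.
    """
    listed = {g: r for g, r in mean_responses.items()
              if similarity_level(r) != "distinguished"}
    if not listed:
        return "Well resolved"
    ranked = sorted(listed.items(), key=lambda item: -item[1])
    return ", ".join(f"{g} ({r})" for g, r in ranked)


# Invented example values, not taken from Table VI.
entry = similar_groups({"Group A": 123, "Group B": 112, "Group C": 85})
resolved = similar_groups({"Group A": 95, "Group C": 60})
```

Here `entry` lists Group A and Group B with their means in parentheses, and `resolved` is "Well resolved" because no other group reaches a mean of 110.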

CONCLUSIONS

It has been clearly demonstrated that DLNs can be used to classify compounds with a high rate of success from their mass spectra, even when a compound is a complete unknown. However, the best predictive ability is obtained by using all known spectra as the training group. Work in progress concerns the extension of the technique to larger sets of mass spectral data and to the classification of infrared spectra. Indeed, recent results (18) with a data set comprising 42 chemical groups have given recognition figures in excess of 99%.

ACKNOWLEDGMENT

The authors wish to thank the Directors of Unilever Ltd. for permission to publish this work.

LITERATURE CITED

(1) N. J. Nilsson, "Learning Machines," McGraw-Hill, New York, 1965.
(2) T. L. Isenhour and P. C. Jurs, Anal. Chem., 41, 21 (1969) (and references cited therein).
(3) A. G. Baker, M. Camp, E. Huntington, W. T. Pike, and M. A. Shaw, "Recent Analytical Developments in the Petroleum Industry," Institute of Petroleum, 1974.
(4) N. M. Frew, Ph.D. Thesis, University of Washington, Seattle, Wash., 1971.
(5) I. Aleksander, "Microcircuit Learning Computers," Mills & Boon, London, 1971.
(6) W. V. Bledsoe and I. Browning, "Pattern Recognition and Reading by Machine," Proc. Eastern Joint Computer Conf., 225 (1959).
(7) T. J. Stonham, I. Aleksander, M. Camp, M. A. Shaw, and W. T. Pike, Electron. Lett., 9, 391 (1973).
(8) I. Aleksander and R. C. Albrow, Comput. J., 11, 65 (1968).
(9) R. C. Albrow, Electron. Commun., 3, 6 (1967).
(10) J. R. Ullmann, IEEE Trans. Comput., 18, 1135 (1969).
(11) L. R. Crawford and J. D. Morrison, Anal. Chem., 40, 1469 (1968).
(12) T. J. Stonham and M. A. Shaw, "Pattern Recognition," in press.
(13) S. L. Grotch, Anal. Chem., 42, 1214 (1970).
(14) T. J. Stonham, I. Aleksander, and M. A. Shaw, Electron. Lett., 10, 301 (1974).
(15) F. W. McLafferty, "Interpretation of Mass Spectra," W. A. Benjamin, Reading, Mass., 1973.
(16) T. J. Stonham, Internal Research Report, University of Kent, March 1974.
(17) I. Aleksander, Electron. Lett., 6, 134 (1970).
(18) T. J. Stonham, Internal Research Report, University of Kent, October 1974.


RECEIVED for review March 28, 1975. Accepted May 19, 1975. The financial contribution to the support of this work made by the United Kingdom Science Research Council is acknowledged.

