
Interpretation of Mass Spectrometry Data Using Cluster Analysis. Alkyl Thiolesters

Stephen R. Heller1 and Chin L. Chang
Heuristics Laboratory, National Institutes of Health, Bethesda, Md. 20014

Kenneth C. Chu
Computer Science Laboratory, Division of Computer Research and Technology, National Institutes of Health, Bethesda, Md. 20014

The interpretation of mass spectral data has followed three main courses: library searching (1), pattern recognition (2), and artificial intelligence (3). These techniques have as their goal the determination of the structure of the molecule from a given mass spectrum. We now wish to introduce the use of another technique, cluster analysis, as an aid for the interpretation of mass spectral data. Structure determination can be divided into several subgoals, and in this presentation using cluster analysis one subgoal is considered, namely, functional group classification.

The various library searching techniques that determine structures are “non-intelligent” in the sense that they simply compare an unknown, by a variety of methods, to known compounds in a library file and indicate possible solutions. The value of such systems depends on the size of the library. Thus, as the library grows, so does the cost to store and process the data, as well as the real and elapsed time to obtain possible answers.

In an attempt to interpret mass spectra without resorting to a large library, two main techniques have been employed. The first, pattern recognition, involves using a data base or training set to devise a way to interpret spectra by “teaching” the computer which peaks and losses are “good” and “bad.” The decision rules obtained from the training set are then used to “predict” the classifications of unknown spectra. The second approach, used in the DENDRAL project, is to program empirically derived mass spectral fragmentation rules. A list of possible structures is constructed and the spectra of these possibilities are generated; the generated spectrum most similar to the unknown then identifies it. This technique has been applied successfully to monofunctional acyclic amines and ethers (3). Both these methods are based on the assumption that the classes to be studied are known.
In cluster analysis, the usual approach is to allow the method to classify data into categories or clusters of its own making. Thus, it is sometimes referred to as “learning without a teacher” or unsupervised learning. In addition, there are cluster analysis procedures which are “supervised,” but these have not been used here. Cluster analysis has the advantage of possibly finding new methods for understanding old or puzzling data.

1 Present address, Management Information and Data Systems Division, Environmental Protection Agency, Washington, D.C. 20460.

(1) S. R. Heller, Chapter 8, “Computer Representation and Manipulation of Chemical Information,” W. T. Wipke, S. R. Heller, R. J. Feldmann, and E. Hyde, Ed., John Wiley, New York, N.Y., 1974.
(2) P. C. Jurs, ibid., Chapter 11.
(3) D. S. Smith, ibid., Chapter 12.

The particular cluster analysis procedure used in this presentation is a graph-theoretical method called the shortest spanning path (SSP) (4). This procedure creates an ordered list of the sample points which reflects the minimum path through these sample points. Each sample point has 227 components (i.e., it is a vector in a 227-dimensional space). Starting with an ordered list of the sample points, the algorithm iteratively reorders the list so that the resulting ordered list has the minimum sum of distances between adjacent sample points. This collection of minimum distances represents a short path through all the sample points, hence the name SSP. Thus, applying this procedure to the mass spectral data gives a linear ordering of the spectra which tends to cluster them.
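The iterative reordering described above can be illustrated with a minimal sketch. The text does not state the distance measure or the reordering moves used by the SSP program, so the Euclidean distance and the segment-reversal move below are assumptions made for illustration only, and the names `distance`, `path_length`, and `short_spanning_path` are hypothetical. A local search of this kind finds a short path, not necessarily the shortest one.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def path_length(points, order):
    """Sum of distances between adjacent points in the ordered list."""
    return sum(distance(points[order[k]], points[order[k + 1]])
               for k in range(len(order) - 1))

def short_spanning_path(points):
    """Reverse segments of the list until no reversal shortens the path."""
    order = list(range(len(points)))  # start from the given ordering
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            for j in range(i + 1, len(order)):
                # Candidate ordering with positions i..j reversed.
                candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
                if path_length(points, candidate) < path_length(points, order) - 1e-12:
                    order = candidate
                    improved = True
    return order

# Hypothetical one-dimensional points forming two groups (near 0 and near 10).
points = [(0,), (10,), (1,), (11,), (2,)]
order = short_spanning_path(points)
values = [points[i][0] for i in order]
print(values)
```

On this toy input the reordered list places the points near 0 next to one another and the points near 10 next to one another, which is the sense in which a linear ordering tends to cluster the data.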

EXPERIMENTAL

In the particular study undertaken, the data file consisted of 323 mass spectra of compounds containing only one sulfur atom and any other atoms in any amounts, taken from a master file of 8782 spectra. This subset file was generated using the imbedded molecular formula search routine of the DCRT/CIS Mass Spec Search System (5). The initial file consisted of 625 compounds. However, the file was reduced to 323 compounds by removing those spectra which did not have peaks beginning at least at m/e = 26. In addition, duplicates were not removed. The experimental spectra consisted of the peaks from m/e = 14 to 140 and all losses from the parent ion to M - 99, which gave a total of 227 feature points (127 peaks and 100 losses). The choice of the features selected was arbitrary, and it may be necessary to modify the features used when attempts are made to cluster other compounds.

By experimentation, it was found that replacing the actual intensities with (single) weighted intensities appeared to give better clusters. The choice of weights was quite arbitrary and might very well be expected to vary as other classes of compounds are studied. In this work on the sulfur compounds, peaks and losses in a spectrum with an intensity of 0.01-49% were given an intensity value of 1. Peaks with an intensity of 50-100% were given an intensity value of 2, and losses with an intensity of 50-100% were given an intensity value of 4.

The programs for this work were written in FORTRAN and SAIL (an ALGOL type language), and all were run on a DEC PDP-10 computer. The clustering program for the 227 features of the 323 sulfur compounds required about 86K words of core and about 30 minutes of cpu time to run.
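The feature encoding described above can be sketched as follows. This is an illustration under stated assumptions, not the original FORTRAN/SAIL code: a spectrum is represented as a dictionary mapping m/e to relative intensity in percent, a loss M - k is read as the peak at m/e = M - k, and intensities between 49% and 50% (which the stated ranges do not cover) are treated as weak. The function names and the example spectrum are hypothetical.

```python
PEAK_MASSES = range(14, 141)   # m/e 14..140 inclusive: 127 peak features
LOSSES = range(0, 100)         # M-0 (parent ion) .. M-99: 100 loss features

def weighted(intensity, strong_value):
    """Map a relative intensity (percent) to its weighted value."""
    if intensity < 0.01:
        return 0               # feature absent
    if intensity >= 50.0:
        return strong_value    # 2 for peaks, 4 for losses
    return 1                   # weak feature

def encode_spectrum(spectrum, mol_weight):
    """Return the 227-component feature vector: peaks first, then losses."""
    peaks = [weighted(spectrum.get(m, 0.0), 2) for m in PEAK_MASSES]
    # Assumption: the loss M-k is taken as the peak observed at m/e = M - k.
    losses = [weighted(spectrum.get(mol_weight - k, 0.0), 4) for k in LOSSES]
    return peaks + losses

# Invented example spectrum (not measured data) for a compound of mass 90.
vec = encode_spectrum({43: 100.0, 90: 60.0, 15: 0.005}, 90)
print(len(vec))
```

Keeping peaks and losses in one flat vector matches the description of each spectrum as a single point with 227 components.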

RESULTS

After the data, consisting of the 323 sulfur compounds, were formulated into a linear path by the SSP procedure, the path was divided into a number of linear segments by using the intuitive judgment of the chemists. (In later work it is hoped to automate this manual step.) Each of these linear segments, manually defined by a chemist, constitutes a cluster (class). From these segments, part of which is shown in Figure 1, one class has been tentatively defined as the alkyl thiol esters, and this class, which has been investigated in depth, will be discussed here. The master file of 8782 spectra contained 45 (actually 46, but one spectrum was found to be incorrect) monofunctional straight-chain alkyl thiol esters of the general formula:

(4) J. R. Slagle, C. L. Chang, and S. R. Heller, Anal. Chem., 46, in press.
(5) S. R. Heller, Anal. Chem., 44, 1951 (1972).

ANALYTICAL CHEMISTRY, VOL. 46, NO. 7, JUNE 1974, p 951

MW    MF              NAME

277   C9H12NO5PS      SUMITHION
301   C10H12N3O4PS    OXYGEN ANALOG OF GUTHION
286   C14H10N2O3S     4,6-DIPHENYL-1,2,3,5-OXATHIADIAZINE-2,2-DIOXIDE
273   C11H12NO3CLS    CHLORMEZANONE (TRANCOPAL)
333   C13H16N3O4NAS   METHAMPYRONE (DIPYRONE POWDER ULMER)
295   C14H18CLN3S     CHLOROTHEN (TAGATHEN)
146   C8H18S          2,2,4,4-TETRAMETHYL-3-THIAPENTANE
146   C8H18S          TERT-BUTYL SULFIDE
146   C7H14OS         ISO-BUTYL THIOL NOR-PROPANOATE
146   C7H14OS         NOR-BUTYL THIOL NOR-PROPIONATE
146   C7H14OS         ETHYL THIOL NOR-PENTANOATE
146   C7H14OS         ETHYL THIOL ISO-PENTANOATE
118   C5H10OS         ETHYL THIOL ACETATE
118   C5H10OS         METHYL THIOL NOR-BUTYRATE
118   C5H10OS         NOR-PROPYL THIOL ACETATE
118   C5H10OS         ISO-PROPYL THIOL ACETATE
90    C3H6OS          METHYL THIOL ACETATE
60    COS             CARBON OXYSULFIDE
64    O2S             SULFUR DIOXIDE
34    H2S             HYDROGEN SULFIDE
48    CH4S            METHANETHIOL (METHYL MERCAPTAN)
94    C2H6O2S         DIMETHYLSULPHONE
90    C2H6N2S         BIS(METHYLIMINO)SULPHUR
78    C2H6OS          2-MERCAPTO ETHANOL
78    C2H6OS          SULFOXIDE, DIMETHYL

Figure 1. Part of reordered list of sulfur compounds after the SSP procedure had been applied

All these spectra were run on a Bendix TOF mass spectrometer (6). There were no thiol esters with aromatic or saturated rings or other functional groups. Thus, the classification rules are to be considered applicable only to this limited class of compounds. The matrix array of the spectral features, arranged in the linear order found by the SSP program, was manually inspected, and 29 features were picked out and found to characterize alkyl thiol esters. The features consist of peaks and losses that were found to be always present or always absent in the class. These 29 features were then processed against the entire 8782 spectra from the master file. Only 45 spectra, all thiol esters, out of 8782 were found to meet these 29 criteria. No additional compounds were found to meet these criteria. Thus, a rule based on these 29 features was able to separate alkyl thiol esters from every other class of compounds in the file.

In further experimentation with these criteria, it was found that 13 of the 29 features could be eliminated without finding additional spectra that met the criteria. The features eliminated were:

Peaks present:   45
Peaks absent:    51, 52, 65, 66, 80, 93, 106, 107, 108
Losses absent:   30, 31
Losses present:  89

Last, a spectrum of nor-heptyl thiol nor-hexanoate, thought to be a bad spectrum because it did not meet the criteria, was re-run and found to meet the identification criteria derived from the cluster analysis. Thus, the 16 features given below appear to be able to characterize straight-chain alkyl thiol esters. The criteria are:

          Present                           Absent
Losses:   M-0                               M-1
Peaks:    27, 29, 41, 42, 43, 47, 57, 61    31, 32, 36, 37, 38, 50

ACKNOWLEDGMENT

We are indebted to D. Black for the TOF mass spectra. We wish to thank H. M. Fales, J. R. Slagle, M. Shapiro, and R. C. T. Lee for thoughtful discussions, and also wish to thank R. J. Feldmann for the SAIL program used in the SSP procedure.

Received for review July 23, 1973. Accepted December 26, 1973.

(6) W. H. McFadden, R. M. Seifert, and J. Wasserman, Anal. Chem., 37, 560 (1965).
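As an illustration, the 16-feature rule reported in the Results (loss M-0 and peaks 27, 29, 41, 42, 43, 47, 57, and 61 present; loss M-1 and peaks 31, 32, 36, 37, 38, and 50 absent) can be written as a simple membership test. This is a sketch under assumptions rather than the authors' program: a feature counts as present when its intensity is above zero, a loss M - k is read as the peak at m/e = M - k, and the function names and toy spectrum are hypothetical.

```python
PEAKS_PRESENT = (27, 29, 41, 42, 43, 47, 57, 61)
PEAKS_ABSENT = (31, 32, 36, 37, 38, 50)
LOSSES_PRESENT = (0,)   # M-0, the parent ion
LOSSES_ABSENT = (1,)    # M-1

def has_peak(spectrum, mass):
    """A peak counts as present if its recorded intensity is above zero (assumption)."""
    return spectrum.get(mass, 0.0) > 0.0

def meets_thiol_ester_rule(spectrum, mol_weight):
    """Check the 16 present/absent criteria against one spectrum."""
    return (all(has_peak(spectrum, m) for m in PEAKS_PRESENT)
            and not any(has_peak(spectrum, m) for m in PEAKS_ABSENT)
            and all(has_peak(spectrum, mol_weight - k) for k in LOSSES_PRESENT)
            and not any(has_peak(spectrum, mol_weight - k) for k in LOSSES_ABSENT))

# Invented toy spectrum (not measured data) for a compound of mass 146.
toy = {27: 20, 29: 35, 41: 15, 42: 5, 43: 100, 47: 12, 57: 30, 61: 8, 146: 10}
matches = meets_thiol_ester_rule(toy, 146)

# The same spectrum with an M-1 peak added no longer satisfies the rule.
with_m_minus_1 = dict(toy)
with_m_minus_1[145] = 3
rejected = not meets_thiol_ester_rule(with_m_minus_1, 146)
print(matches, rejected)
```

As the text notes, such a rule should be regarded as applicable only to the limited class of straight-chain alkyl thiol esters from which it was derived.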

Modified Approach for Submicrogram Determination of Selenium in United States Geological Survey Standard Rocks

V. Lavrakas, T. J. Golembeski,1 G. Pappas,2 and J. E. Gregory3
Department of Chemistry, Lowell Technological Institute, Lowell, Mass. 07854

H. L. Wedlick
Department of Radiological Sciences, Lowell Technological Institute, Lowell, Mass. 07854

The determination of trace elements in the environment is of increasing importance; many trace elements are related to the health of plants and animals. Selenium causes livestock poisoning (1, 2) and poisoning of humans (3), and anemia and hypertension (4) occur where certain limits are exceeded. Some evidence exists that selenium decreases the toxicity of methylmercury added to diets containing tuna (5). Atmospheric selenium has also been extensively investigated as an indicator of atmospheric sulfur pollutants (6).

1 Present address, New England Nuclear Corp., Billerica, Mass.
2 Present address, New England Nuclear Corp., Boston, Mass.
3 Present address, Uniroyal Corp., Naugatuck, Conn.

(1) I. Rosenfeld and O. A. Beath, "Selenium Geobotany, Biochemistry and Nutrition," Academic Press, New York, N.Y., 1964.
(2) W. H. Allaway and E. E. Cary, Anal. Chem., 36, 1359 (1964).
(3) R. H. Tomlinson and R. C. Dickson, Proc. Inter. Conf., April 19, 66 (1965).
(4) K. P. McConnell, Proc. Inter. Conf., Dec. 1, 139 (1961).
(5) H. E. Ganther et al., Science, 175, 1122 (1972).
(6) K. K. Sivasankara Pillay et al., Environ. Sci. Technol., 5, 76 (1971).

Various chemical methods have been used to determine the presence of trace quantities of selenium in materials; however, most suffer from the deficiency that interfering ions must first be removed and that the presence of selenium at trace levels in reagents, unless corrected, will contribute to erroneous results. Many of the analytical techniques lack the required sensitivity and accuracy to determine submicrogram levels of selenium. For example, other workers, using a coprecipitation and photometric