
AIDS FOR


Alternate Representation of Mass Spectra for the Spectral Identification Problem (Pattern Recognition)

C. F. Bender, H. D. Shepherd, and B. R. Kowalski¹

Lawrence Livermore Laboratory, Livermore, Calif. 94550

THE FIRST APPLICATION of a linear pattern classifier to a chemical problem dealt with the interpretation of low resolution mass spectra (1). In that application, the spectral intensities constituted the variables. One of the problems in applying linear classifiers is the number of samples required relative to the number of variables. This difficulty has been pointed out by the authors previously (2), where R, the ratio of the number of samples to the number of variables, was introduced as a measure of the reliability of data sets. Sammon et al. (3) have shown that for R less than 2, the results can be meaningless. Only for data sets with R > 3 should linear pattern classifiers be applied. For low resolution mass spectra with three hundred mass readings, a minimum of nine hundred compounds for each class is therefore required. Even with the most sophisticated computational methods and facilities, this poses a formidable data processing problem. Another solution, that taken by this paper, is to reduce the number of variables. Average values, moments, and histograms were used as the variables in our representation. Although such an analysis has been applied to chromatographic peak separation (4, 5), no application to mass spectra involving pattern recognition has appeared in the literature. For discrete data (e.g., spectra) the sample mean is defined as

$$\bar{X} = \sum_i W_i X_i \qquad (1)$$

where $W_i$ is the weight assigned to the ith measurement. For the spectral case $W_i$ is often related to the intensity of the signal. $X_i$ is the value of the measurement. Moments are usually defined relative to the sample means, i.e.,

$$\bar{X}^{n} = \sum_i W_i (X_i - \bar{X})^{n} \qquad (2)$$

These higher moments (n = 2, 3, ...) are related to the form of the weighted measurement distribution. The n = 2 moment is related to the variance and the n = 3 moment is a measure of the skewness. The fourth moment, kurtosis, is a measure of the skewness of the variance. Normally, averages and moments are defined for the full range, but partial values, $\bar{X}_{ab}$, can be defined for a specific range, i.e.,

$$\bar{X}_{ab} = \sum_{i=a}^{b} W_i X_i \qquad (3)$$

This leads to the definition of partial averages, $\bar{X}_{ab}$, and partial moments, $\bar{X}^{n}_{ab}$. Histograms are simply bar charts of various populations. In the application given in this paper, $H_{x-y}$ denotes the number of mass to charge ratios with intensities between x and y.

EXPERIMENTAL

Low resolution mass spectra for 298 hydrocarbons with carbon numbers 6, 7, and 8 were used in this study. The three class problem had class populations of 77 for C6, 86 for C7, and 116 for C8. Non-zero intensities were digitized at each m/e from 1 to 128. Relative intensities, $I_i$, were calculated with the maximum intensity assigned a value of 1. In all cases the square root of the intensity was used, a technique which has been shown to be more effective for pattern recognition applications (1). The unweighted K-nearest neighbor rule (6) was used to evaluate the effectiveness of the new representation. Three values of K (3, 5, and 7) were used. K = 1 was not used because of compound duplication in the data base. The purpose of the classification rule was to determine the carbon number correctly. Rather than using two data sets, one for training and one for evaluation, the "leave-one-out" method (6) was used. Here the classification of each of the n samples is based on the known classification of the remaining n - 1 samples. The K-nearest neighbor classifications using the full spectra (128 variables) are given in Table I.

Table I. K-Nearest Neighbor Classification Efficiency (per cent correct)

Name of data set    Number of variables    R(a)    K(b) = 3    K = 5    K = 7
Original            128                    2.4     93.3        93.6     92.3
C1                  10                     30      89.3        85.2     82.6
F                   15                     20      94.3        91.9     90.6

(a) R = number of patterns/number of variables.
(b) K is the number of neighbors used for classification.

¹ Present address, Department of Chemistry, Colorado State University, Fort Collins, Colo. 80521.

(1) P. C. Jurs, B. R. Kowalski, and T. L. Isenhour, ANAL. CHEM., 41, 21 (1969).
(2) B. R. Kowalski and C. F. Bender, J. Amer. Chem. Soc., 94, 5632 (1972).
(3) Proc. IEEE Symp. Adaptive Processes, U. of Texas at Austin, 1970, p 1x21.
(4) S. N. Chesler and S. P. Cram, ANAL. CHEM., 43, 1923 (1971).
(5) O. Grubner, ibid., p 1934.
(6) B. R. Kowalski and C. F. Bender, ibid., 44, 1405 (1972).

ANALYTICAL CHEMISTRY, VOL. 45, NO. 3, MARCH 1973

RESULTS AND DISCUSSION

The first contraction (C1) included the mean, four moments (n = 2, 3, 4, 5), and five histograms, $H_{0-0.2}$, $H_{0.2-0.4}$, $H_{0.4-0.6}$, $H_{0.6-0.8}$, and $H_{0.8-1.0}$. Results using the contraction to 10 variables are given in Table I. The 10 variables were autoscaled (2). The results indicate between an 8 and 10% loss in classification efficiency. Further studies with


increased numbers of moments and/or histograms yielded similar percentages. Increasing the number of variables to 20 improved the efficiencies by only a few tenths of a per cent. One interesting outcome was found in calculating the correlation coefficients for the variables. For moments with the same parity ($\bar{X}^{2n}$ or $\bar{X}^{2n-1}$) the correlation coefficients were large (greater than 0.95), indicating that little new information can be added by including higher moments. Although the above-mentioned 10 variable contraction has been successfully used in transformation studies (7), a further improvement was desired. Examination of statistical parameters indicated there was insufficient peak position information in the contraction. To obtain more position information, two sets of partial averages and moments were included. The ranges were determined by the sample means. The "low" set had the range:

$$1 \le X \le \bar{X} \qquad (4)$$

and the upper set covered the remaining range above the sample mean.
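The full, low, and upper sets of means and moments can be illustrated with a minimal sketch for a single spectrum. It assumes that the m/e values play the role of $X_i$, that the square-root relative intensities serve as the weights $W_i$, that the weights are normalized to sum to one within each range, and that the upper range runs from the sample mean to the largest m/e. The normalization, the upper bound, and all function names are assumptions made for illustration and are not taken from the text.

```python
import math


def weighted_mean(xs, ws):
    """Weighted mean of xs; weights are normalized to sum to one (assumption)."""
    total = sum(ws)
    return sum(w * x for x, w in zip(xs, ws)) / total


def central_moment(xs, ws, n):
    """n-th weighted moment taken about the weighted mean."""
    total = sum(ws)
    mean = weighted_mean(xs, ws)
    return sum(w * (x - mean) ** n for x, w in zip(xs, ws)) / total


def range_features(xs, ws, lo, hi, orders=(2, 3, 4)):
    """Mean and moments restricted to peaks with lo <= x <= hi."""
    pairs = [(x, w) for x, w in zip(xs, ws) if lo <= x <= hi and w > 0]
    if not pairs:
        # No peaks in the range: return zeros (hypothetical convention).
        return [0.0] * (1 + len(orders))
    px, pw = zip(*pairs)
    return [weighted_mean(px, pw)] + [central_moment(px, pw, n) for n in orders]


def twelve_variables(intensities):
    """intensities[k] is the raw intensity at m/e = k + 1 (must not be all zero)."""
    top = max(intensities)
    xs = list(range(1, len(intensities) + 1))
    # Square root of the relative intensity, as described under EXPERIMENTAL.
    ws = [math.sqrt(v / top) for v in intensities]
    full_mean = weighted_mean(xs, ws)
    full = range_features(xs, ws, 1, len(xs))
    low = range_features(xs, ws, 1, full_mean)
    upper = range_features(xs, ws, full_mean, len(xs))  # assumed upper range
    return full + low + upper


features = twelve_variables([0, 4, 0, 4])
```

Each range contributes one mean and three moments (n = 2, 3, 4), which gives the twelve position-sensitive variables; a peak lying exactly at the sample mean would fall in both partial ranges under the inclusive bounds assumed here.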

(7) C. F. Bender and B. R. Kowalski, ibid., 45, 590 (1973).

Moments with n = 2, 3, and 4 were used for the three ranges, leading to twelve variables including the means. Three more variables were included in the final 15 variable contraction (F): the total number of peaks, one histogram variable $H_{x-y}$, and the m/e value of the largest peak. As in all cases, the data were autoscaled (2). The classification efficiency of this case is also given in Table I. Clearly little information has been lost in the 9-fold contraction.

CONCLUSIONS

Moments have been used to contract the representation of low resolution mass spectra to fifteen variables with only a slight degradation (average less than 1%) in classification efficiency when compared to a calculation using all the data and R < 3. Such a contraction gives rise to a considerable saving in computational effort. This new representation also allows further detailed pattern recognition studies with small, manageable data sets.

RECEIVED for review July 12, 1972. Accepted October 24, 1972. This work was performed under the auspices of the U. S. Atomic Energy Commission. H. D. Shepherd is a Military Research Associate (USN).
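The evaluation procedure used throughout the paper (autoscaled variables, the unweighted K-nearest neighbor rule, and the leave-one-out method) can be sketched as follows. This is an illustrative outline only: it assumes that autoscaling means zero mean and unit variance for each variable, that the scaling is computed once over the whole data set, that distances are Euclidean, and that voting ties go to the class encountered first among the nearest neighbors. None of these details are specified in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def autoscale(rows):
    """Scale every column to zero mean and unit variance (n - 1 denominator assumed)."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = []
    for c, m in zip(cols, means):
        var = sum((v - m) ** 2 for v in c) / (n - 1)
        # A constant column is left unscaled to avoid dividing by zero.
        stds.append(math.sqrt(var) if var > 0 else 1.0)
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def knn_predict(train_x, train_y, query, k):
    """Unweighted vote among the k training patterns closest to query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(rows, labels, k):
    """Per cent of patterns classified correctly from the remaining n - 1 patterns."""
    scaled = autoscale(rows)
    correct = 0
    for i in range(len(scaled)):
        rest_x = scaled[:i] + scaled[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        if knn_predict(rest_x, rest_y, scaled[i], k) == labels[i]:
            correct += 1
    return 100.0 * correct / len(scaled)


# Toy data: two well-separated groups standing in for two carbon-number classes.
toy_rows = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
toy_labels = [6, 6, 6, 6, 8, 8, 8, 8]
toy_score = leave_one_out_accuracy(toy_rows, toy_labels, 3)
```

With the contracted variables as the rows and the carbon numbers as the labels, calling `leave_one_out_accuracy` for K = 3, 5, and 7 would produce figures of the kind reported in Table I.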

Rapid Methylation of Micro Amounts of Nonvolatile Acids

Monte J. Levitt

Departments of Pathology and Obstetrics and Gynecology, University of Pittsburgh School of Medicine, Pittsburgh, Pa. 15213

A DIAZOMETHANE GENERATING SYSTEM is described which permits microgram or smaller amounts of complex organic acids to be quantitatively esterified without detectable side-product formation. Multiple samples can be processed in sequence at two-minute intervals. The methyl esters produced are suitable for further analysis by electron-capture gas chromatography. The apparatus is constructed from components present in most laboratories, and may be dismantled easily for storage.

EXPERIMENTAL

Description of the Esterification Apparatus. An esterification technique (1) utilized with milligram amounts of acids has been modified to permit nanogram amounts of acids to be treated with minimal handling in a manner compatible with subsequent analysis by electron-capture gas chromatography. The system consists basically of a generator and a split stream of inert gas (Figure 1). The generator is a 20- × 150-millimeter side-arm test tube with a 2-hole rubber stopper. Through one hole of the stopper is inserted a dropping-tube consisting of a piece of glass laboratory tubing with a stopcock. An inert gas is delivered through an inlet tube constructed from a length of laboratory glassware which is bent at 90° and inserted through the other hole in the stopper to extend to the bottom of the generator. Diazomethane formed in the generator is delivered to the samples through a glass capillary pipet, bent at 90°, and connected to the side-arm of the generator by means of a 2-centimeter length of Tygon tubing. [For permanent installation, shrinkable polyethylene or Teflon (Du Pont) tubing could be used.]

(1) H. Schlenk and J. L. Gellerman, ANAL. CHEM., 32, 1412 (1960).


The two pieces of glass are butted against each other to avoid leaching any contaminants from the plastic tubing. Nitrogen or other inert gas is passed through a molecular sieve filter and then through copper tubing which has been cleaned of manufacturing oils by heating or washing with solvents. The gas stream is split by a T-connector, part going to the generator, and part to a glass capillary pipet for evaporating the samples after esterification. A micro-valve in the gas line aids in regulating the flow of gas through the generator. Removable metal-to-metal and metal-to-glass connections are conveniently made with Swagelok or Cajon fittings (Crawford Fitting Company, Cleveland, Ohio). All glass components are fire-polished to avoid initiating a diazomethane explosive reaction (2).

Operation of the Esterification Apparatus. The generator with attached delivery tube is clamped in a fume hood behind an explosion shield, then 2 milliliters each of ethyl ether and 2-(2-ethoxyethoxy)ethanol (Carbitol, Aldrich Chemical Company, Milwaukee, Wis.) are added. Approximately 25 milligrams of N-methyl-N-nitroso-p-toluenesulfonamide (Diazald, Aldrich Chemical Company) are added and rinsed to the bottom of the generator with 3 milliliters of ethyl ether. The rubber stopper is then inserted in the generator and the inlet tube is connected to the gas line. Inert gas is passed through the system for approximately 30 seconds, during which time 1 milliliter of 60% aqueous potassium hydroxide is added to the dropping tube. Then the gas flow is momentarily interrupted and the stopcock is opened to allow the KOH to drain into the generator. After the stopcock is closed, gas flow is resumed at the rate of 1-3 bubbles per second. Ethereal diazomethane is swept out of the gen-

(2) Th. J. de Boer and H. J. Backer, "Organic Syntheses," N. Rabjohn, Ed., Coll. Vol. 4, John Wiley and Sons, New York, N.Y., 1963, p 250.