Computerized pattern recognition applied to gas chromatography

Computerized pattern recognition applied to gas chromatography/mass spectrometry identification of pentafluoropropionyl dipeptide methyl esters. James...

ANALYTICAL CHEMISTRY, VOL. 51, NO. 11, SEPTEMBER 1979

apparent in this cell when the Cu2+:M2+ ratio was large. In such instances, substantial diffusion from the narrow channel leading to the reference electrode necessitates the use of relatively long deposition times to prevent the deposition of Cu on the GCMFE. Future modifications of the cell design will be directed toward minimizing the effects of Cu2+ diffusion by further isolating the GCMFE from the channel.

CONCLUSION

Selective deposition of interfering metals onto different electrodes is an effective means of avoiding intermetallic interactions in thin-layer ASV as demonstrated here for Cu-Zn and Cu-Cd. Results reported here were obtained in the parts-per-million concentration range with linear sweep voltammetry as the stripping technique. The development of instrumentation for twin-electrode differential pulse ASV (25) should enable analysis in the parts-per-billion range as has been demonstrated for single-electrode thin-layer ASV (7, 8). The magnitude of these intermetallic compound interferences in this lower concentration range is being investigated.

LITERATURE CITED

(1) Hubbard, A. T.; Anson, F. C. In "Electroanalytical Chemistry"; Bard, A. J., Ed.; Marcel Dekker: New York, 1970; Vol. IV, Chapter 2.
(2) Hubbard, A. T. CRC Crit. Rev. Anal. Chem. 1973, 3, 201.
(3) Reilley, C. N. Rev. Pure Appl. Chem. 1968, 18, 137.
(4) Kuwana, T.; Heineman, W. R. Acc. Chem. Res. 1976, 9, 241.
(5) Heineman, W. R. Anal. Chem. 1978, 50, 390A.
(6) Oglesby, D. M.; Anderson, L. B.; McDuffie, B.; Reilley, C. N. Anal. Chem. 1965, 37, 1317.
(7) DeAngelis, T. P.; Heineman, W. R. Anal. Chem. 1976, 48, 2262.
(8) DeAngelis, T. P.; Bond, R. E.; Brooks, E. E.; Heineman, W. R. Anal. Chem. 1977, 49, 1792.
(9) Vydra, F.; Stulik, K.; Julakova, E. "Electrochemical Stripping Analysis"; Halsted Press: New York, 1976; pp 58-66.
(10) Barendrecht, E. In "Electroanalytical Chemistry"; Bard, A. J., Ed.; Marcel Dekker: New York, 1967; Vol. II.
(11) Copeland, T. R.; Osteryoung, R. A.; Skogerboe, R. K. Anal. Chem. 1974, 46, 2093.
(12) Robbins, D. G.; Enke, C. G. J. Electroanal. Chem. 1969, 23, 343.
(13) Shuman, M. S.; Woodward, G. P., Jr. Anal. Chem. 1976, 48, 1979.
(14) Crosmun, S. T.; Dean, J. A.; Stokely, J. R. Anal. Chim. Acta 1975, 75, 421.
(15) Kemula, W.; Kublik, Z. Nature 1958, 182, 1228.
(16) Stromberg, A. G.; Gorodovykh, V. E. Zh. Neorg. Khim. 1963, 8, 2355.
(17) Stromberg, A. G.; Zakharov, M. S.; Mesyats, N. A. Elektrokhimiya 1967, 3, 1440; 1968, 4, 987.
(18) Ostapczuk, P.; Kublik, Z. J. Electroanal. Chem. 1977, 83, 1.
(19) Anderson, L. B.; Reilley, C. N. J. Electroanal. Chem. 1965, 10, 295, 538.
(20) Napp, D. T.; Johnson, D. C.; Bruckenstein, S. Anal. Chem. 1967, 39, 481.
(21) Stulikova, M. J. Electroanal. Chem. 1976, 48, 33.
(22) Stojek, Z.; Stepnik, B.; Kublik, Z. J. Electroanal. Chem. 1976, 74, 277.
(23) Stojek, Z.; Kublik, Z. J. Electroanal. Chem. 1977, 77, 205.
(24) McDuffie, B.; Anderson, L. B.; Reilley, C. N. Anal. Chem. 1966, 38, 883.
(25) Paul, D. W.; Ridgway, T. H. "Abstracts of Papers", 176th National Meeting of the American Chemical Society, Miami Beach, Fla., Sept. 1978; American Chemical Society: Washington, D.C.; Abstr. COMP 2.

RECEIVED for review March 12, 1979. Accepted June 25, 1979. The authors gratefully acknowledge support provided by the National Science Foundation.

Computerized Pattern Recognition Applied to Gas Chromatography/Mass Spectrometry Identification of Pentafluoropropionyl Dipeptide Methyl Esters

James N. Ziemer and S. P. Perone*
Purdue University, Department of Chemistry, West Lafayette, Indiana 47907

R. M. Caprioli* and W. E. Seifert
University of Texas Medical School, Department of Analytical Chemistry, Houston, Texas 77025

A promising new technique for the identification of amino acid sequences in polypeptides involves the enzymatic hydrolysis of intact polypeptides to dipeptides followed by analysis of the products with gas chromatography/mass spectrometry. The feasibility of this approach for fast on-line analysis was demonstrated here by the use of computerized pattern recognition in the identification process. The fundamental basis for classification was the separate identification of the N- and C-terminal amino acids in the dipeptides using two multicategory k-nearest neighbor (kNN) analyses. Two sets of ions characteristic of amino acids derived from either the N- or C-terminus were chosen as features for the tests on the basis of similarities in intensities among the members of each class. Features were normalized to the sum of only those ions used in a particular test in order to ensure that the relative ion intensities used were not influenced by the charge retaining characteristics of the amino acid on the other terminus. Classification accuracy for 86 PFP dipeptide methyl esters was 100% in the N-terminal amino acid test, and 97% in the C-terminal test.

In recent years there has been considerable interest in the development of mass spectrometric methods for the sequencing of polypeptides. One such technique which has shown promise involves the enzymatic hydrolysis of intact polypeptides to dipeptides by dipeptidylaminopeptidases (DAP) I and IV, followed by product identification with gas chromatography/mass spectrometry (GC/MS) (1, 2). The key to implementing this technique on a routine basis lies in the development of a method for identification of the dipeptide mass spectra which is fast, accurate, and largely unaffected by the presence of impurities. This paper reports on the potential use of computerized pattern recognition for the interpretation of these low resolution dipeptide mass spectra.

The enzymatic hydrolysis-GC/MS method provides amino acid sequence information from two related mixtures of dipeptides. The two dipeptide mixtures are obtained from two separate enzymatic hydrolyses, one involving the original polypeptide and the other involving the polypeptide whose N-terminal amino acid has been removed via the Edman method. DAP I and IV operate by sequentially cleaving the polypeptide into dipeptide fragments from the N-terminus. Each dipeptide mixture is subjected to acylation of the N-terminal amino acid with pentafluoropropionyl (PFP) anhydride and esterification of the carboxyl groups with methanol prior to separation and identification by GC/MS. Because of the generally nonredundant nature of most polypeptide amino acid sequences, stemming from different combinations of the 20 basic amino acids, a unique sequence of approximately 30 residues can usually be found from matching up the two sets of overlapping dipeptides. This technique has been successfully applied to the sequence determination of the 98 amino acid residue spinach plastocyanin (3) and the 181 residue S-carboxymethylated soybean trypsin inhibitor (2). In order to use this technique for the routine sequencing of longer polypeptides, an automated procedure is needed to facilitate analysis of the GC/MS data.

Pattern Recognition Considerations. Computerized pattern recognition has been applied to chemical data analysis in several areas of research. Recent examples include atmospheric (4) and lake sediment (5) pollution analysis, nucleotide sequencing (6), gasoline typing (7), voltammetric classification (8-10), identification of structure/activity relationships (11), and functional group presence/absence from mass spectra (12), nuclear magnetic resonance data (13), and infrared spectra (14). In addition, several excellent reviews by Kowalski (15) and Bender (16), as well as a book by Jurs and Isenhour (17), have been published, and the reader is referred to these for a more complete discussion of the field.

For the purpose of this paper the term "pattern recognition" is meant to imply an automated procedure for identifying prominent structural characteristics of an unknown compound on the basis of similarities between its mass spectrum and the spectra of compounds with known characteristics. This is accomplished by representing the mass spectrum of a compound in digital form as an n-dimensional vector, $X_n = (x_1, x_2, x_3, \ldots, x_n)$, called a pattern, where $x_i$ is the relative intensity of a fragment ion at m/e = i. A mass spectral pattern for each compound can then be described as a point in n-dimensional hyperspace with coordinate values corresponding to the relative intensity at each m/e position.
Similarity between patterns is indicated by proximity in this hyperspace. The structural characteristics of an unknown compound can subsequently be inferred by noting the particular pattern types which lie nearest it. Implicit, therefore, in the application of pattern recognition to this work is the assumption that characteristic fragmentations occur for dipeptides with similar structural properties and that these properties are dominant enough to yield fragmentation patterns distinct from all other dipeptides. There is good reason to believe this is the case. Derivatized dipeptides are known to cleave primarily in either of two locations: (a) between the α carbon of the N-terminal amino acid and the carbonyl carbon of the peptide bond, or (b) between the carbonyl carbon of the peptide bond and the nitrogen of the C-terminal amino acid (18).

    CF3CF2CO-NH-CH(R1)-[a]-CO-[b]-NH-CH(R2)-COOCH3    (cleavage sites a and b shown in brackets)

Because of the frequency with which these processes occur (in many spectra the molecular ion peak is absent), the mass spectra of these compounds are largely dictated by the independent fragmentation processes occurring at each end of the molecule. Furthermore, the highly independent nature of these processes indicates that those compounds with common N- or C-terminal amino acids should have similar fragmentation patterns for selected ions.

Applying this fact to dipeptide identification, 40 structural classes were defined representing the 20 commonly occurring amino acids in both the N- and C-terminal positions. Two separate sets of m/e's (henceforth referred to as features) were then considered, one consisting of ions characteristic of amino acids derived from the N-terminus of a dipeptide, and the other of ions characteristic of amino acids derived from the C-terminus. Classification


of an unknown dipeptide then proceeded by considering separately the location of the unknown in the hyperspace defined by the N- and C-terminal features respectively. If, for example, an unknown pattern were found to lie in a region of space occupied principally by serine reference patterns using C-terminal features, and nearest to patterns containing tyrosine using N-terminal features, then the unknown would be classified as tyrosylserine regardless of whether or not that particular dipeptide was present among the reference set (or training set) compounds used for classification. In terms of speed and computer storage requirements the advantage of this approach over file searching methods is obvious. Some 400 different dipeptides could be identified (even though not all are present in the mass spectral file) from the selected features of several representative compounds from the forty N- and C-terminal classes.

Although several pattern recognition algorithms exist for classifying unknown compounds, the method which was used exclusively in this work was the nearest neighbor analysis (19). This algorithm works best for multiclass problems with training sets of limited size. Class decisions are made on the basis of the majority class among an unknown's nearest neighbors. Here, distance is defined as a simple Euclidean metric given in Equation 1,

$$D_{ij} = \left[\sum_{k=1}^{n} (x_{ik} - x_{jk})^2\right]^{1/2} \qquad (1)$$

where $D_{ij}$ is the distance between unknown pattern vector i and training set pattern vector j, both with n features.
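The distance of Equation 1 and the majority decision among nearest neighbors can be illustrated with a minimal Python sketch. The function names, the fixed window `k`, the tie-breaking rule, and the two-feature toy patterns are assumptions made for illustration only; they are not taken from the original Fortran programs or from the dipeptide data.

```python
import math
from collections import Counter

def euclidean_distance(x_i, x_j):
    """Equation 1: distance between two patterns with the same n features."""
    if len(x_i) != len(x_j):
        raise ValueError("patterns must have the same number of features")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_i, x_j)))

def knn_classify(unknown, training_set, k=3):
    """Return the majority class among the k training patterns nearest to `unknown`.

    training_set is a list of (pattern, class_label) pairs.
    Assumed tie-break: among equally frequent classes, the one whose
    member lies closest to the unknown wins.
    """
    ranked = sorted(training_set,
                    key=lambda item: euclidean_distance(unknown, item[0]))
    neighbours = [label for _, label in ranked[:k]]
    counts = Counter(neighbours)
    best = max(counts.values())
    for label in neighbours:  # neighbours are already in distance order
        if counts[label] == best:
            return label

# Hypothetical two-feature patterns, each normalized to a sum of 1.
training = [
    ([0.10, 0.90], "Ala"),
    ([0.15, 0.85], "Ala"),
    ([0.80, 0.20], "Ser"),
    ([0.75, 0.25], "Ser"),
]
print(knn_classify([0.20, 0.80], training, k=3))
```

In this toy case two of the three nearest neighbors belong to the "Ala" class, so the unknown is assigned to it.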

EXPERIMENTAL

Mass spectra of 86 dipeptides were obtained according to the procedure of Caprioli et al. (2). Samples were separated and analyzed using a Finnigan 3200/6000 GC/MS/data system. Separation was effected using a 0.2 × 45 cm column packed with 3% Dexsil 300 on 100/120 mesh Chromosorb G (Applied Science Labs). On-column injection was used with the injector at 200 °C and a linear temperature program from 100 to 250 °C at 10 °C/min. Helium was used as the carrier gas at a flow rate of approximately 30 mL/min. Mass spectra were obtained using an electron-impact source operated at 70 eV and a temperature of 100 °C. Each spectrum ranged in mass value from m/e 75 to 452. Only ion intensities which exceeded 1% of the base peak intensity were recorded. Three spectra were obtained per sample from three separate runs and averaged to produce one representative compound spectrum.

The pattern recognition processor was a Hewlett-Packard 2100S computer with 32K words of core memory. Peripherals included a 5 Mbyte moving head disk drive (HP-7900), paper tape reader and punch, a Tektronix 603 storage display monitor, a Centronics 306 serial printer, a Calcomp 565 digital plotter, and a Teletype. All preprocessing, feature extraction, and pattern recognition programs were written in Fortran IV and operated under a Hewlett-Packard DOS-M executive.

From the 86 members of the data set only 17 of the possible 40 amino acid classes were examined. This was done in order to ensure that each class was represented by at least five different compounds. The class types and their respective members are listed in Table I. Note that in several instances, dipeptides could be used only in one terminal analysis owing to the lack of sufficient representatives of the other terminus to define a class for that amino acid.

Feature Selection. Ultimately, the successful application of any pattern recognition method depends on the degree to which the user can select discriminant features pertinent to the defined classes. In this project it was necessary to identify sets of ions for each class known to originate exclusively from one end of the dipeptide molecule. Possible complications to finding such ions due to the generation of common mass fragments from both ends of the molecule were fortunately minimized by the presence of the pentafluoropropionyl derivative on the N-terminus. The additional 147 mass units which this group added to the principal N-terminal fragments were generally sufficient to remove those


Table I. Amino Acid Classes and Members

N-terminal classes
  Glycine (Gly or G): Gly-Asp, Gly-Gln, Gly-Ser*, Gly-Met, Gly-Lys, Gly-Tyr, Gly-Thr, Gly-Phe, Gly-Ala, Gly-Ile*, Gly-Glu
  Alanine (Ala or A): Ala-Asp, Ala-His, Ala-Trp, Ala-Thr, Ala-Lys, Ala-Asn, Ala-Ser, Ala-Ala, Ala-Gly, Ala-Leu*, Ala-Tyr, Ala-Val*, Ala-Glu, Ala-Phe, Ala-Pro, Ala-Met, Ala-Ile
  Aspartate (Asp or D): Asp-Phe, Asp-Ser*, Asp-Gly, Asp-Asp*, Asp-Pro, Asp-Ala
  Serine (Ser or S): Ser-Met, Ser-Ser*, Ser-Tyr*, Ser-Gly, Ser-Ala
  Valine (Val or V): Val-Ser, Val-Asp, Val-Glu, Val-Phe, Val-Ala*, Val-Pro, Val-Val*
  Leucine (Leu or L): Leu-Glu*, Leu-Phe*, Leu-Ile, Leu-Gly, Leu-Met, Leu-Ser, Leu-Ala
  Glutamate (Glu or E): Glu-Asp, Glu-Pro, Glu-Val, Glu-Lys, Glu-Ser, Glu-Leu*, Glu-Ala*, Glu-Tyr
  Tyrosine (Tyr or Y): Tyr-Leu, Tyr-Glu, Tyr-Phe*, Tyr-Val, Tyr-Tyr*

C-terminal classes
  Alanine (Ala or A): Ile-Ala, Ser-Ala, Ala-Ala, Glu-Ala*, Val-Ala*, Gly-Ala, Asp-Ala, Met-Ala, Leu-Ala, Pro-Ala, Phe-Ala
  Aspartate (Asp or D): Val-Asp, Gly-Asp, Met-Asp, Glu-Asp, Asp-Asp**, Ala-Asp, Lys-Asp
  Serine (Ser or S): Val-Ser, Asp-Ser, Trp-Ser, Ser-Ser*, Ala-Ser, Gly-Ser*, Glu-Ser, Leu-Ser
  Valine (Val or V): Ile-Val, Glu-Val, Ala-Val*, Tyr-Val, Val-Val*
  Leucine (Leu or L): Thr-Leu, Tyr-Leu, Ala-Leu*, Glu-Leu*, Phe-Leu
  Isoleucine (Ile or I): Phe-Ile, Met-Ile, Ala-Ile*, Leu-Ile, Gly-Ile*
  Glutamate (Glu or E): Leu-Glu*, Met-Glu, Ala-Glu, Phe-Glu, Val-Glu*, Tyr-Glu, Lys-Glu, Gly-Glu
  Phenylalanine (Phe or F): Asp-Phe, Lys-Phe, Thr-Phe, Ala-Phe, Leu-Phe*, Val-Phe, Gly-Phe, Tyr-Phe*
  Tyrosine (Tyr or Y): Pro-Tyr, Ser-Tyr*, Ala-Tyr, Glu-Tyr, Gly-Tyr, Ile-Tyr, Tyr-Tyr*

* Designates patterns used in final set for classification.

fragments from the mass range expected for C-terminal ions. Therefore, with a few exceptions, one could expect to find characteristic ions originating from the C-terminal amino acids in the general mass range 75-175 and from the N-terminus in the range 175-250.

One aspect which did complicate feature selection, however, was the problem of comparing ion intensities from different mass spectra. The intensities of known characteristic ions were found to vary considerably between class members because of the influence of the particular amino acid at the other terminus. Hence, even though ions formed after the (a) or (b) split described earlier are peculiar to the individual classes, their measured intensities are dependent on the charge retention characteristics of both amino acids. Examination of the intensities for ions 88 and 102 for several dipeptides of the class C-terminal alanine in Table II illustrates this problem. The two ions, representative of the C-terminal alanine class, are reported both as a percentage of the base peak in each spectrum and as a percentage of the total ion current (in accordance with the two most common scaling procedures). Here, the scaled intensities vary dramatically among the class members even though the ratio between the two ions in every case remains fairly constant. This is a particularly undesirable property for pattern recognition because it jeopardizes the formation of distinct class clusters in feature space. In the final analysis this difference

Table II. m/e's 88 and 102 Scaled Intensities for C-Ala Dipeptides

             m/e 88, %                  m/e 102, %
dipeptide    base peak   total ion      base peak   total ion      m/e 88 /
                         current                    current        m/e 102
Gly-Ala      10          1.9            100         19.4           0.10
Ala-Ala       7          1.7             46         11.4           0.15
Ile-Ala       7          0.9             48          6.3           0.15
Phe-Ala       4          0.6             24          3.4           0.17
Glu-Ala       2          0.3             16          2.8           0.13
Pro-Ala       1          0.3              6          1.8           0.17

can be eliminated by scaling ions to the sum of only those ion intensities derived from the same end of the molecule. But with respect to feature selection there is no way to know a priori which ions should be used in the scaling process.

This unprecedented situation immediately renders most established methods of feature selection inapplicable. One-dimensional pattern distributions cannot be studied (20, 21) because no single unbiased scaling factor exists for initially normalizing the relative ion intensities. Furthermore, statistical selection methods based on the evaluation of multidimensional density functions fail because the distribution along any feature is a function of all other features in the set (22). Because one cannot initially say which ions are derived entirely from one terminus (without invoking a scaling process based on present theories of fragment ion formation), one is forced to treat all ions during the selection process as if they originated from the same terminus and scale accordingly. This causes problems for two reasons. First, any set of ions considered initially includes m/e's from both termini, which introduce artifacts in the relative feature magnitudes used in class density evaluations. Second, because the elimination of any one feature from the original set requires that every pattern be rescaled with respect to the new feature set, class distributions may result which completely contradict trends suggested from the previous evaluation. Hence, the assignment of feature intensity within any given pattern remains a relative one throughout the selection process.

In order to treat this problem, a purely empirical approach was sought which would select a set of ions yielding tight pattern clusters for each class. The method which appeared most applicable was the iterative feature removal algorithm used by Thomas and Perone (23). This selection method sequentially removes one feature from the original set, rescales the pattern vectors with respect to the remaining ions, and then tests each pattern for nearest neighbors. In order to best reflect how each feature affected pattern distribution, features were scaled to the sum of the ion intensities considered rather than to the most intense ion (base peak) in the subset. If removal of any feature results in a greater percentage of similar class patterns becoming nearest neighbors, then that feature is permanently eliminated from the feature set.
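The remove, rescale, and retest cycle described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the clustering score is taken to be the fraction of patterns whose single nearest neighbor shares their class, the loop stops once two features remain (a single feature normalized to its own sum carries no information), and the four toy spectra and the function names are hypothetical rather than taken from the paper.

```python
import math

def normalize(pattern, features):
    """Scale the selected ion intensities to their own sum.

    pattern maps m/e -> raw intensity; features lists the m/e values kept.
    """
    total = sum(pattern.get(m, 0.0) for m in features)
    if total == 0:
        return [0.0] * len(features)
    return [pattern.get(m, 0.0) / total for m in features]

def same_class_nn_fraction(patterns, labels, features):
    """Fraction of patterns whose nearest neighbour (itself excluded) shares its class."""
    vectors = [normalize(p, features) for p in patterns]
    hits = 0
    for i, v in enumerate(vectors):
        nearest = min((j for j in range(len(vectors)) if j != i),
                      key=lambda j: math.dist(v, vectors[j]))
        hits += labels[nearest] == labels[i]
    return hits / len(vectors)

def iterative_removal(patterns, labels, features):
    """Drop one feature at a time for as long as doing so improves the score."""
    kept = list(features)
    score = same_class_nn_fraction(patterns, labels, kept)
    while len(kept) > 2:
        # Temporarily remove each feature in turn; every trial rescales all patterns.
        trials = [(same_class_nn_fraction(patterns, labels,
                                          [f for f in kept if f != m]), m)
                  for m in kept]
        best_score, candidate = max(trials, key=lambda t: t[0])
        if best_score <= score:
            break
        kept.remove(candidate)
        score = best_score
    return kept, score

# Hypothetical spectra: m/e 88 and 102 carry the class ratio, while m/e 200
# stands in for an ion contributed by the other terminus.
patterns = [
    {88: 10, 102: 90, 200: 0},
    {88: 10, 102: 90, 200: 300},
    {88: 50, 102: 50, 200: 0},
    {88: 50, 102: 50, 200: 300},
]
labels = ["A", "A", "B", "B"]
print(*iterative_removal(patterns, labels, [88, 102, 200]))  # -> [88, 102] 1.0
```

With all three ions included, the large and variable m/e 200 intensity pulls same-class patterns apart; once it is removed and the patterns are renormalized, each pattern's nearest neighbor is of its own class.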
There is substantial reason to believe this approach eventually leads to removal of all extraneous ions. Nearest neighbor analyses, in addition to providing accurate unknown pattern classifications, are valid nonparametric estimators of class density functions (22, 24). When used to test the effect of feature removal on overall nearest neighbor analysis, they also reflect changes in the class distributions for better or worse. The advantage of this approach over those techniques discussed earlier is that feature significance is assessed after rather than prior to removal of features. In this way, successive feature elimination occurs simultaneously with optimum pattern spatial distribution.

The major drawback to this method arises from the amount of time required for the procedure. Before the poorest feature in a set of m ions can be eliminated, m different tests must be run corresponding to the temporary removal of each feature. Obviously this becomes impractical when a large initial feature set is considered for reduction. Therefore, in order to make the procedure applicable to the present study, several simple preselection measures were taken.

First and Second Preselection Steps. In the first preselection step, each class of compounds was considered separately. Potentially useful features were selected for each class based on the presence of those ions in all spectra of that class. The second step involved correlation studies of the ions in these class subsets to determine which ions varied uniformly for all members. This was done in order to identify from each class one distinct fragmentation pattern. The measure of correlation used here was the cosine of the angle between each class m/e vector and a unity vector. Mathematically this correlation between two vectors $X_i$ and $X_j$ is given by

$$r_{ij} = \cos\theta_{ij} = \frac{(X_i, X_j)}{|X_i|\,|X_j|} = \frac{\sum_{p=1}^{k} x_{pi}\,x_{pj}}{\left(\sum_{p=1}^{k} x_{pi}^2\right)^{1/2}\left(\sum_{p=1}^{k} x_{pj}^2\right)^{1/2}} \qquad (2)$$

where $r_{ij}$ is the inner product of their individual elements $x_{pi}$ and $x_{pj}$ divided by their vector magnitudes (25). Equation 2 reduces to

$$r_{ij} = \frac{\sum_{p=1}^{k} x_{pi}}{\left(\sum_{p=1}^{k} x_{pi}^2\right)^{1/2}(k)^{1/2}} \qquad (3)$$

when $X_j$ is a unity vector. Here an m/e class vector (not to be confused with a pattern) consists of the intensities of m/e = j for each of the k class members. A simple example illustrates the second preselection step. After the first step, four dipeptides of the class C-terminal alanine have


Table III. Principal Class Features

C-terminal ions
  alanine:        88, 97*, 102*
  aspartate:      86, 102*, 113*, 128*, 156, 160
  serine:         100, 101*, 102*
  valine:         83, 88, 89, 98, 115, 116*, 130
  leucine:        86, 87*, 88, 144
  isoleucine:     86, 88, 128*
  glutamate:      82, 84, 98, 100, 119, 114, 116*, 142, 144
  phenylalanine:  88, 91, 120*, 131, 162
  tyrosine:       88, 107*, 253, 282, 293, 324*, 325

N-terminal ions
  glycine:        78, 119*, 147, 176*, 177
  alanine:        92, 119*, 190*, 191
  aspartate:      119*, 188*, 206*, 216*, 248
  serine:         119*, 147, 160, 188*, 216*
  valine:         83, 100, 119*, 164, 176*, 218*
  leucine:        119*, 176*, 190*, 232*
  glutamate:      84, 119*, 189, 202*, 203, 230, 231, 262
  tyrosine:       77, 107, 117, 119*, 253, 293, 428

* Designates those features used in final subsets for first analysis.

Figure 1. Fragmentation scheme for C-terminal phenylalanine. (Mass fragments 88, 91, 103, 120, 131, 162)

the following intensities for m/e's 88, 102, and 104 (after rescaling to the sum of all the ions considered at that time):

  Ala-Ala   6, 41, 2
  Ile-Ala   4, 24, 46
  Glu-Ala   4, 30, 9
  Phe-Ala   3, 20, 26

The m/e vector 88 for those patterns then contains the elements [6, 4, 4, 3]. When this vector is correlated with the unity vector [1, 1, 1, 1] the coefficient of correlation is 0.97. Likewise, the correlation coefficient for m/e vector 102 is 0.96 and for m/e 104 is 0.77. m/e 104 is not known to be a characteristic fragment ion for C-terminal alanine whereas 88 and 102 are. The correlation values obtained for these ions accurately detect this fact. To allow for sufficient variation between class members, only features which had correlation coefficients less than 0.85 were eliminated.

The list of ions selected by this procedure is shown on a class basis in Table III. It is significant that even though this list was obtained empirically it agrees quite well with what is expected from theoretical considerations. In the case of C-terminal phenylalanine dipeptides, for example, the ions selected represent those important fragmentations shown in Figure 1.

Third Preselection Step. Once features were chosen on an intraclass level it was necessary to study their correlation as a complete set on an interclass basis with respect to common termini. This third preselection step easily identifies nonspecific class discriminators. Problems of this sort arise, for example, in the case of m/e 91 expected for dipeptides containing either N-


or C-terminal phenylalanine. Since the tropylium ion is a major fragment ion for phenylalanine in general, it becomes a poor feature for unequivocally identifying that amino acid's position within the dipeptide. Such nonspecific class features also occur in situations where a common mass ion may originate from two different amino acids in different termini, as in the case of m/e 130 with C-Val and N-Trp. These features are easily identified and removed from further consideration by applying the same class correlation algorithm described earlier to the complete set of N- or C-terminal ions reported in Table III. That is, a correlation analysis was applied to each N-terminal class using all the ions selected previously from all N-terminal classes; the analogous procedure was carried out for the C-terminal classes. Features were eliminated whose correlations for any class were less than 0.85 and whose corresponding vector magnitudes were greater than 15% of the largest correlated intraclass vector. This limitation was specifically designed to discriminate against poorly correlated features with intensities large enough to cause considerable class spatial dispersion. Altogether the three preselection procedures accounted for a reduction in the N- and C-terminal feature set size from a possible 380 to 32 and 23 m/e's respectively.

Iterative Feature Elimination. Those ions remaining in the two feature sets were finally subjected to the iterative elimination algorithm discussed earlier. Because the procedure differs somewhat from the original work by Thomas (23), several comments concerning its operation need to be made. First, in order to remove features in an unbiased manner, each of the ions must be temporarily removed from the set and evaluated for its effect on class distribution before a decision can be made concerning the permanent removal of any one ion. Final decisions are made on the basis of which single feature results in the highest gain in common class neighbors when temporarily set aside. This process is then repeated for the remaining set of ions and is terminated when further removal of any one feature results in poorer clustering. It should be mentioned that each change in the feature set is accompanied by a complete renormalization of the relative pattern intensities prior to analysis.

The major change in this technique, however, concerns the action taken when the temporary removal of each feature causes no detectable improvement in cluster formation. This condition results if two or more specific class features exist in the feature set when only one is necessary. When this occurs a special algorithm is used which identifies those ions whose removal causes a decrease in class nearest neighbors. This subset of ions, considered essential for optimum clustering, is set apart and the remaining ions are singly added to it and evaluated according to the previous criteria. The difference here, however, is that features are kept when their addition improves class distributions. This process is ended when the results exceed or match those prior to execution.
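The correlation measure that underlies the second and third preselection steps (Equations 2 and 3) is easy to check numerically. The short Python sketch below, with function names chosen here purely for illustration, evaluates the unity-vector correlation for the three m/e class vectors of the C-terminal alanine example and applies the 0.85 cutoff quoted in the text.

```python
import math

def cosine_correlation(x_i, x_j):
    """Equation 2: cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(x_i, x_j))
    norm_i = math.sqrt(sum(a * a for a in x_i))
    norm_j = math.sqrt(sum(b * b for b in x_j))
    return dot / (norm_i * norm_j)

def unity_correlation(x):
    """Equation 3: correlation of an m/e class vector with the unity vector."""
    return sum(x) / (math.sqrt(sum(a * a for a in x)) * math.sqrt(len(x)))

# Rescaled intensities for Ala-Ala, Ile-Ala, Glu-Ala, Phe-Ala (worked example in the text)
class_vectors = {
    88: [6, 4, 4, 3],
    102: [41, 24, 30, 20],
    104: [2, 46, 9, 26],
}

CUTOFF = 0.85  # features correlating below this value are eliminated
for mz, vec in class_vectors.items():
    r = unity_correlation(vec)
    verdict = "keep" if r >= CUTOFF else "eliminate"
    print(f"m/e {mz}: r = {r:.2f} ({verdict})")
```

The computed coefficients round to 0.97, 0.96, and 0.77, matching the values quoted for m/e 88, 102, and 104, so only m/e 104 falls below the cutoff.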

Figure 2. Nonlinear map of C-terminal patterns. See Table I for symbols

RESULTS A N D D I S C U S S I O N T h e performance of the classifier is shown in Table IV-A during various stages of feature selection. In all analyses each pattern was selectively removed from the data set and treated as an unknown. Classification then proceeded in the normal manner based on those compounds of known class closest to it in the feature space. (Note that each pattern was prevented from being its own nearest neighbor when treated as a n unknown.) Because of the multiclass nature of the space, a n unknown pattern‘s n nearest neighbors could represent each of n different designated amino acid classes for that terminus. Therefore, to assist in the decision making process no preset window was employed as in the familiar k-Nearest Neighbor classifier (19,20). Rather, class assignments were based on the two nearest patterns of similar class t o the unknown. In no case, however, did this involve a window size greater than four nearest neighbors. T h e results in Table IV-A indicate the effectiveness of the pre- and iterative feature selection process. In the first analysis, performed after step one in the selection process, classification results were generally poor with only 23% and 6470 correct predictability for the respective N- and C-ter-

minal tests. This can largely be attributed t o the fact that many of the features used originated from the terminus other than t h a t being tested and hence contributed t o random scatter in patterns of the same class. Without adequate feature selection, the spectra show little inherent tendency t o form class clusters. Once the next two simple correlation steps were carried out, however, the number of features used dropped significantly and the classification accuracy correspondingly jumped to 86 90 and 95% in the N-terminal test and 77% and 92% for the C-terminus. Finally, completion of the iterative removal algorithm brought the number of ions used in the N- and C-terminal test to 9 and 10 respectively with a n overall accuracy of 100% and 97%. Those ions used in the final analysis are identified by a “*” in Table 111. Nonlinear mappings of the two test spaces, shown in Figures 2 and 3, verify that distinct class clusters have been formed (22,26). Perhaps the most interesting result of the analysis was the complete separation and identification of the two isomer classes Cterminal leucine and isoleucine. Because of their great similarity, they have traditionally been difficult amino acids to distinguish from low resolution mass spectra alone. This

Table IV A.

Classification Results for First Analysis C-terminal test

N-terminal test

no. classificn no. features accuracy, features used 7% used

classificn accuracy, %

following 80 64 97 23 pre-selection step 1 pre-selection 36 77 44 86 step 2 pre-selection 23 92 32 95 step 3 final feature 10 97 9 100 subset B. Classification Results for Second Analysis following pre-selection step 1 preselection step 2 pre-selection step 3 final feature subset test set results

95

69

110

68

40

80

38

93

28

87

32

93

7

100

8

100

7

91

8

95


Figure 3. Nonlinear map of N-terminal patterns. See Table I for symbols

is particularly evident when one examines the spectra for the two compounds alanylleucine and alanylisoleucine in Figure 4.

One factor still in question, however, is the degree to which the final feature set was biased by the particular pattern set used for selection. The pattern subset used in this study by no means represents the range of dipeptide fragmentation possibilities expected for the complete set. Indeed, because of the distinctive nature of the various amino acids, construction of such an ideal subset would be difficult without involving nearly all 400 possibilities. The present data set merely reflects those mass spectra which were readily available, as evidenced by the fact that some classes were minimally represented (e.g., C-Val and N-Ser with only 5 class patterns) while others were noticeably over-represented (N-Ala with 17 of 20 possible patterns). It is difficult, therefore, to conclude exactly how the present classifier would perform with the complete set. To gain some insight into this question, the entire selection process was performed again using a pattern training set with only 5 representative dipeptides per class and then tested with the remaining dipeptides as unknowns (test set). Training set compounds were selected from the previous analysis on the basis of maximum intraclass similarity. This was done to simulate the worst possible situation for selecting features. In other words, by training features from class patterns with the greatest resemblance to one another, the chances of overlooking features needed to separate spatially overlapped classes would be enhanced. In this way features were selected from the best possible class distributions and tested under the worst possible circumstances. The results from the second analysis are given in Table IV-B. The high classification accuracy obtained after only

Figure 4. Mass spectra of (A) Ala-Leu and (B) Ala-Ile

Table V. Features Used in Second Analysis

C-terminal ions: 83, 101, 102, 116, 128, 162, 324

N-terminal ions: 119, 176, 188, 190, 204, 206, 216, 230

the first step in the preselection clearly illustrates how closely the intraclass training set patterns resemble one another. This degree of similarity further manifests itself through the rest of the selection procedure, where 100% classification occurs for both termini with only 7 ions in the C-terminal and 8 ions in the N-terminal tests. More significant, however, is the fact that the features chosen with the reduced training set (see Table V) are nearly identical to those selected in the first analysis. This lends credence to the view that the ions selected are not artifacts of the data set but rather accurately reflect the true nature of the fragmentation process. When the classifier is applied to the remaining test set patterns, the accuracy of the N- and C-terminal analyses falls to 95% and 91%, respectively. Examination of the misclassified dipeptides from each test reveals that in both tests this decrease in overall predictability was due to the same effect, namely the absence of additional distinguishing features for classes producing similar fragments. In the C-terminal test, for example, the only difference between class assignments from the two analyses is the additional misclassification of four alanine patterns as aspartates. Since both C-Ala and C-Asp dipeptides produce very intense ions at m/e 102, the absence of a further distinguishing ion (such as m/e 113 for Asp) makes separation between the two classes difficult. The same situation occurs for the N-terminal classes alanine and leucine, which both produce intense ions at m/e 190. When this fact is taken into account, the results from the second analysis indicate that excellent classification with pattern recognition can be expected when applied to a more complete data set. Furthermore, those remaining amino acid classes not tested in this study should pose little difficulty in future work because of the generally distinctive nature of their side chains.
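The training/test split used in the second analysis can be illustrated with a short sketch. The text states only that training compounds were chosen "on the basis of maximum intraclass similarity"; the similarity measure is not given, so the criterion below (keep the patterns lying closest to their class centroid, by Euclidean distance) is an assumption, and the function names and toy intensities are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Component-wise mean of a list of feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def select_training_set(patterns, labels, per_class=5):
    """Split pattern indices into (train, test).

    ASSUMPTION: "maximum intraclass similarity" is approximated by
    keeping, for each class, the per_class patterns nearest to the
    class centroid. Everything not kept becomes the test set.
    """
    members_by_class = {}
    for index, label in enumerate(labels):
        members_by_class.setdefault(label, []).append(index)

    train = []
    for members in members_by_class.values():
        center = centroid([patterns[i] for i in members])
        ranked = sorted(members, key=lambda i: euclidean(patterns[i], center))
        train.extend(ranked[:per_class])

    train_set = set(train)
    test = [i for i in range(len(patterns)) if i not in train_set]
    return sorted(train), test

# Hypothetical one-feature intensities: each class has one outlying pattern.
toy_patterns = [[0.0], [0.1], [0.2], [5.0], [1.0], [1.1], [1.2], [9.0]]
toy_labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
train_idx, test_idx = select_training_set(toy_patterns, toy_labels, per_class=3)
```

With the toy data, the outlying pattern in each class is pushed into the test set, which mirrors the intent described in the text: features are trained on the tightest class distributions and then tested on the least typical members.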
On-line compound identification, however, places severe limitations on the time available for classification and on the size of the pattern training set. With regard to analysis time, the two-step, low-dimensional nature of the classification algorithm should pose little problem. But unless limitations are placed upon the number of representative patterns required for adequate spatial definition, available computer memory may quickly be exceeded. The classifier described here, by its nature, does not require every dipeptide to be present in the pattern set in order to identify all 400 possible dipeptides. Practical considerations for using such a system in routine analysis indicate that at most 3-5 patterns per class should be sufficient. To demonstrate this capability, only 2 patterns per class were used for unknown classification using the feature sets from the first analysis. This amounted to storing a total of 9 N-terminal and 10 C-terminal ions for 19 different dipeptides (indicated by an asterisk in Table I). The results from this analysis were identical to those of the first analysis, i.e., 100% N-terminal and 97% C-terminal amino acid recognition. Hence, even though a large data set is needed for optimum feature selection (in a realistic situation all 400 dipeptides should be used for selection), only a minimum pattern set appears necessary for compound identification.

While more amino acid classes certainly need to be tested under this scheme, it appears that pattern recognition is an effective method for identifying dipeptides from mass spectral data. The ability of the approach to correctly identify compounds without depending on the presence of the molecular ion peak further makes the technique applicable down to low analyte concentration levels. An additional aspect of the classification which was not considered here was the use of retention time data from the GC/MS experiment. Retention times were not used in this study because the original goal was to assess the capability of pattern recognition for compound identification from MS data, but that information could be incorporated into future pattern classifiers (27). Indeed, if such information had been included as an additional feature, the two misclassified dipeptides from the C-terminal analysis would have been properly identified.
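The class-assignment rule and the leave-one-out testing described under Results and Discussion can be sketched as follows. This is one reading of the text, not the authors' program: it assumes Euclidean distance in the selected-ion feature space and interprets "the two nearest patterns of similar class" as the first class to collect two members when neighbors are examined in order of increasing distance. The function names and the two-feature toy intensities are hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(unknown, patterns, labels, votes_needed=2):
    """Return (predicted_label, window_size).

    Neighbors are visited from nearest to farthest; the first class to
    reach votes_needed members wins. window_size is the number of
    neighbors that had to be examined (no preset k is used).
    """
    order = sorted(range(len(patterns)),
                   key=lambda i: euclidean(unknown, patterns[i]))
    counts = Counter()
    for window, i in enumerate(order, start=1):
        counts[labels[i]] += 1
        if counts[labels[i]] == votes_needed:
            return labels[i], window
    return None, len(order)  # no class has votes_needed reference patterns

def leave_one_out_accuracy(patterns, labels):
    """Treat each pattern in turn as an unknown, excluding it from the
    reference set so that it cannot be its own nearest neighbor."""
    correct = 0
    for i in range(len(patterns)):
        rest = patterns[:i] + patterns[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        predicted, _ = classify(patterns[i], rest, rest_labels)
        correct += predicted == labels[i]
    return correct / len(patterns)

# Hypothetical normalized intensities for two selected ions.
toy_patterns = [[1.0, 0.0], [0.9, 0.1], [1.1, 0.05],
                [0.0, 1.0], [0.1, 0.9], [0.05, 1.1]]
toy_labels = ["Ala", "Ala", "Ala", "Leu", "Leu", "Leu"]

prediction = classify([0.95, 0.05], toy_patterns, toy_labels)
accuracy = leave_one_out_accuracy(toy_patterns, toy_labels)
```

Returning the window size alongside the label makes it easy to check a statement such as the one in the text that no assignment required more than four nearest neighbors. Because the rule needs only two reference patterns per class, the same function applies unchanged to a reduced pattern set of the kind discussed above.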

ACKNOWLEDGMENT The authors are grateful to David Burgard and William Farrell, Jr., for preliminary studies and for helpful discussions.

LITERATURE CITED

(1) Caprioli, R. M.; Seifert, W. E.; Sutherland, D. E. Biochem. Biophys. Res. Commun. 1973, 55, 67.
(2) Caprioli, R. M.; Seifert, W. E. Biochem. Biophys. Res. Commun. 1975, 64, 295.
(3) Seifert, W. E.; Caprioli, R. M. Biochemistry 1978, 17, 436.
(4) Garenstroom, P. D.; Perone, S. P.; Moyers, J. L. Environ. Sci. Technol. 1977, 11, 795-800.
(5) Hopke, P. K. J. Environ. Sci. Health 1976, A11(6), 367.
(6) Burgard, D. R.; Perone, S. P.; Wiebers, J. L. Biochemistry 1977, 16, 1051-1057.
(7) Tunnicliff, D. D.; Wadsworth, P. A. Anal. Chem. 1973, 45, 12.
(8) Thomas, Q. V.; DePalma, R. D.; Perone, S. P. Anal. Chem. 1977, 49, 1376-1380.
(9) Burgard, D. R.; Perone, S. P. Anal. Chem. 1978, 50, 1366-1371.
(10) DePalma, R. D.; Perone, S. P. Anal. Chem., in press.
(11) Kowalski, B. R.; Bender, C. F. J. Am. Chem. Soc. 1974, 96, 916.
(12) Varmuza, K.; Rotter, H.; Krenmayr, P. Chromatographia 1974, 7, 522.
(13) Wilkins, C. L.; Williams, R. C.; Brunner, T. R.; McCombie, P. J. J. Am. Chem. Soc. 1974, 96, 4182.
(14) Liddell, R. W.; Jurs, P. C. Anal. Chem. 1974, 46, 2126.
(15) Kowalski, B. R. Anal. Chem. 1975, 47, 1152A.
(16) Kowalski, B. R.; Bender, C. F. J. Am. Chem. Soc. 1972, 94, 5632.
(17) Jurs, P. C.; Isenhour, T. L. "Chemical Applications of Pattern Recognition"; Wiley-Interscience: New York, 1975.
(18) Biemann, K. "Biochemical Applications of Mass Spectrometry"; Wiley-Interscience: New York, 1972; p 405.
(19) Cover, T. M. IEEE Trans. Inf. Theory 1968, IT-14, 50.
(20) Pichler, M. A.; Perone, S. P. Anal. Chem. 1974, 46, 1790-1798.
(21) Chu, K. C. Anal. Chem. 1974, 46, 1181.
(22) Fukunaga, K. "Introduction to Statistical Pattern Recognition"; Academic Press: New York, 1972.
(23) Thomas, Q. V.; Perone, S. P. Anal. Chem. 1977, 49, 1369-1375.
(24) Duda, R. O.; Hart, P. E. "Pattern Classification and Scene Analysis"; Wiley-Interscience: New York, 1973.
(25) Rummel, R. J. "Applied Factor Analysis"; Northwestern University Press: Evanston, Ill., 1970.
(26) Sammon, J. W. IEEE Trans. Comput. 1969, C-18, 401.
(27) Seifert, W. E.; McKee, R. E.; Beckner, C. F.; Caprioli, R. M. Anal. Biochem. 1978, 88, 149.

RECEIVED for review April 14, 1978. Accepted May 25, 1979. Work supported by the Office of Naval Research.