Simplex Pattern Recognition

G. L. Ritter, S. R. Lowry, C. L. Wilkins, and T. L. Isenhour

Department of Chemistry, University of North Carolina, Chapel Hill, N.C. 27514

A new method for the development of near-optimum linear discriminant functions for nonseparable chemistry data analysis is presented. This approach, based on a modified simplex optimization technique, is shown to be superior or equal to previous chemistry pattern recognition analysis procedures for the mass spectral analysis problems examined.

For the past several years, there has been a great deal of research on applying computerized pattern recognition methods to chemistry data analysis. Much of this work has employed linear binary pattern classifiers because of the computational ease of their development and application. Primary applications of this approach have been in chemical structural or qualitative analysis problems utilizing various types of spectrochemical data (1-6). One of the most vexing problems preventing confident widespread application of this method has been the difficulty of dealing adequately with the linearly nonseparable data presented by most actual chemical analysis problems. The absence of any certain method for proving nonseparability, or for achieving optimum or near-optimum linear discriminant functions for nonseparable data, has understandably prompted analysts to take a cautious attitude toward the routine use of such functions. For these reasons, the development of a systematic approach which will produce near-optimum linear discriminant functions is the object of the research reported in this paper.

WEIGHT VECTORS AND SEPARABILITY OF DATA

The data that are used in pattern recognition are represented as pattern vectors of the form x = (x_1, x_2, ..., x_d). The dimensionality, d, of the vector tells the number of observations or features that are used to characterize the pattern. The quantities x_i are the numerical values of each observation i. For binary classifiers, each pattern is classified into one of two categories. In this case, the pattern recognition problem is to determine the relationship between the data and the categories of the data. This relationship can be evaluated by calculating a discriminant function. Geometrically, the linear discriminant represents a hyperplane in the pattern space which divides the data into the two desired categories. Points on one side of the hyperplane will always belong to category 1 and points on the other side to category 2. If such a hyperplane exists, the data are linearly separable by a linear discriminant function of the form

s = \sum_{i=1}^{d+1} w_i x_i    (1)

where x_i is the ith component of X and w_i is the weight assigned to that component. A (d+1)st component is added (x_{d+1} = 1) so that the category is determined by the sign of s. For example,

s > 0 implies category 1
s < 0 implies category 2    (2)
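As a minimal sketch of this decision rule in Python (the function name and the NumPy dependency are illustrative assumptions, not part of the original):

```python
import numpy as np

def classify(w, x):
    """Classify a pattern with the linear discriminant of Equations 1-2.

    w : weight vector of length d+1
    x : pattern vector of length d
    The pattern is augmented with x_{d+1} = 1 so the threshold is
    absorbed into the weights; s = 0 is left undecided here.
    """
    s = np.dot(w, np.append(x, 1.0))   # s = sum_{i=1}^{d+1} w_i x_i
    return 1 if s > 0 else 2 if s < 0 else None
```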

The vector formed by the set of weights (w_1, w_2, ..., w_{d+1}) is the weight vector. The algorithm used to calculate the weight vector parallels that first described by Rosenblatt in 1960 (7). The method employs an error correction procedure (i.e., negative feedback for an incorrect response) and has appeared frequently in the chemical literature (8). The algorithm must converge to a solution if the data are linearly separable, although the rate of convergence cannot be predicted. If the data are linearly inseparable, no weight vector solution exists and the weight vector fluctuates greatly in the vector space formed by the weights. Thus, for inseparable data, the normal procedure will never converge and the weight vector has much less utility as a classifier of unknowns.

Since the assumption of linear separability is tenuous and linear inseparability is more likely, several alternatives may be employed to calculate a weight vector for inseparable data. The error correction procedure may be terminated at some arbitrary point and the resulting weight vector taken as the classifier. Unfortunately, this vector will be highly dependent on the last patterns misclassified before the termination time. Another approach is to take a subset of the training set which has been found to be linearly separable. The weight vector which results may be used to classify unknowns. This latter method may be modified by calculating a set of weight vectors that will separate random subsets of the training data. The average of these weight vectors may then be used as the pattern classifier. However, none of these methods directly considers the problem of finding the hyperplane which misclassifies the fewest patterns.

Therefore, a technique is needed that will search weight vector space to find the optimum (maximum) in the recognition response curve. Restated in these terms, the problem may be approached by various optimization procedures. In this work, the sequential simplex method first proposed by Spendley, Hext, and Himsworth (9) and modified by Nelder and Mead (10) was used to optimize the recognition in linear discriminant analysis. The simplex approach to the optimization problem is geometrically appealing and has been demonstrated to be widely applicable (11). Recently, the simplex algorithm has been successfully applied to several problems in analytical chemistry (12-15). The only assumptions that are made about the response surface are that it is continuous and has a unique extremum (in this case a maximum) in the region of the search (10). However, by its specific nature, the response surface in the recognition problem is discontinuous; that is, only integral numbers of patterns may be classified. Therefore, it may be possible for the simplex to become stranded on a plateau (i.e., an equi-recognition region). To force the response surface to be continuous, a second optimization criterion is added. For each vertex of the simplex, not only is the recognition tabulated but also the perceptron criterion function (vide infra). Then the response has the form of a continuous variable. The best vertex (weight vector) will have the minimum perceptron function value with maximum recognition.
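For concreteness, a sketch of the error correction procedure just described. The specific feedback factor (reflecting the offending dot product to its negative) is a common choice from the learning machine literature and an assumption here, not a detail taken from this paper:

```python
import numpy as np

def learning_machine(patterns, labels, max_passes=100):
    """Error-correction (negative feedback) training, a sketch.

    patterns : (n, d+1) array of augmented patterns (last component = 1)
    labels   : +1 for category 1, -1 for category 2
    Converges only for linearly separable data; for inseparable data
    the weight vector keeps fluctuating until max_passes is reached.
    """
    w = np.zeros(patterns.shape[1])
    for _ in range(max_passes):
        mistakes = 0
        for x, y in zip(patterns, labels):
            s = np.dot(w, x)
            if y * s <= 0:                 # wrong side of (or on) the hyperplane
                if s == 0:
                    w = w + y * x          # nudge off the hyperplane
                else:
                    w = w - (2.0 * s / np.dot(x, x)) * x   # flip s to -s
                mistakes += 1
        if mistakes == 0:
            return w                       # separable: a solution was found
    return w                               # inseparable: last (fluctuating) w
```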


WEIGHT VECTOR SPACE REPRESENTATION

The optimization procedure developed requires a search through weight vector space to find a maximum in the recognition response. To understand the meaning of this search, the relationship between weight space and pattern space must be examined. In pattern space, a hyperplane is sought to divide the patterns into the two desired categories. In general, for a d-dimensional pattern space (d observations for each sample point), the equation of a d-dimensional hyperplane is

w_1 x_1 + w_2 x_2 + ... + w_d x_d = t    (3)

where x_i is the ith component of the pattern and w_i is the weight given to that component. Transposing t and defining w_{d+1} = -t and x_{d+1} = 1 gives

\sum_{i=1}^{d+1} w_i x_i = 0    (3')

Any point lying on the hyperplane will satisfy this equality. Points (representing patterns) on one side of the hyperplane satisfy

\sum_{i=1}^{d+1} w_i x_i > 0    (4a)

Points on the other side of the hyperplane satisfy

\sum_{i=1}^{d+1} w_i x_i < 0    (4b)

The set of weights may be written as a vector, W = (w_1, w_2, ..., w_d, w_{d+1}). Then

\sum_{i=1}^{d+1} w_i x_i = W \cdot X    (5)

and the vector W defines the hyperplane. As an example, consider the one-dimensional problem shown in Figure 1, top. The line drawn through the X's and O's separates the two classes. Points to the left of the line must obey

w_1 x_1 + w_2 x_2 < 0,  x_2 = 1    (6a)

and points to the right of the line satisfy

w_1 x_1 + w_2 x_2 > 0,  x_2 = 1    (6b)

The weight vector W = (w_1, w_2) then defines the line (a one-dimensional hyperplane) through the data. This vector is included in Figure 1, bottom, and may be seen to be perpendicular to the decision line. This result extends to d dimensions so that, in general, the weight vector will be perpendicular to the decision hyperplane. The vector W, however, is not the unique representation for the weight vector. Multiplying each component of W by the same positive constant, a, also gives a vector description of the hyperplane. Therefore aW is an equally good weight vector. This degeneracy may be removed by constraining w_{d+1} to be equal to minus one. The decision line then becomes

w_1 x_1 - 1 = 0    (7)

and the weight vector is (w_1, -1). This observation and solution also extend to d dimensions so that

W = (w_1, w_2, ..., w_d, -1)    (8)

describes a unique weight vector for a given decision hyperplane. Then the search through weight vector space need only be done through the d-dimensional space formed by the first d terms of the weight vector.

Figure 1. (Top) A decision line for a one-dimensional linearly separable data set; (bottom) a weight vector corresponding to a decision line for a one-dimensional linearly separable data set
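A one-line illustration of removing this degeneracy (a hypothetical helper, assuming the raw vector's last component is negative so the scaling constant is positive):

```python
import numpy as np

def normalize_weight_vector(w):
    """Scale a weight vector so its (d+1)st component is -1 (Equation 8).

    Any positive multiple of W describes the same hyperplane; dividing
    by -w[-1] removes this degeneracy.  Assumes w[-1] < 0, so the
    decision surface is unchanged.
    """
    w = np.asarray(w, dtype=float)
    return w / -w[-1]

# e.g. (2.0, 4.0, -2.0) and (1.0, 2.0, -1.0) describe the same decision line
```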

SEQUENTIAL SIMPLEX DEVELOPMENT

A simplex is a geometric figure which is used in the optimization procedure. If the optimization is to be done over d variables, the simplex will contain d + 1 vertices in a d-dimensional variable space. For example, a two-dimensional simplex is a triangle and a three-dimensional simplex is a tetrahedron. A response function is evaluated for each of the vertices. Then the simplex is moved along the response surface in weight vector space to find an optimum. This optimum is approached by movement away from the least desirable response. In its original form, the simplex moved only by a direct reflection away from the worst response across the other d vertices (9). Modifications by Nelder and Mead allow the simplex to follow more closely the contours of the response surface (10). The application of simplex optimization in pattern recognition, and the steps in setting up and implementing the optimization, are described in the next five sub-sections.

Defining the Variables in the Simplex Problem. The problem has been defined as finding a weight vector which gives the maximum recognition for a training set of patterns. The search variables are the components of the weight vector, and the response is the number of members of the training set correctly recognized by the weight vector. Whenever a new weight vector is defined, the dot product of the weight vector, W, with each pattern vector must be calculated. The recognition is the number of the patterns correctly classified by W. Since the (d + 1)st term of the weight vector is defined to be -1, one half of all possible solutions of the problem have been excluded. That is, if a complete search is to be made through weight space, possibilities (a) and (b) both must be considered.

(a) W_C \cdot X > 0 implies category 1
    W_C \cdot X < 0 implies category 2    (9a)

(b) W_D \cdot X < 0 implies category 1
    W_D \cdot X > 0 implies category 2    (9b)

The transformation W_D = -W_C is required to change from (a) to (b). However, if in W_C the (d + 1)st component is -1, then W_D will never be searched. Two solutions are evident. First, a search may be run through weight space with w_{d+1} ≡ -1 and another search with w_{d+1} ≡ +1. The relative optima can then be compared and the larger chosen as the true optimum. Or, in the method used in this work, the recognition is not allowed to fall below 50%. Whenever the recognition falls below 50%, the decision hyperplane is inverted (W_C transformed to W_D) and the recognition taken as the complement of the value below 50%. In this procedure, not only must the recognition be recorded, but also whether the decision surface has been inverted.

The searching process depends upon a gradient existing in the response surface. In the recognition problem, the response surface is a series of plateaus with heights proportional to the number of patterns correctly classified. It is possible for the simplex to become isolated on a wide plateau and to be unable to reach higher plateaus. A parameter is needed to smooth the response surface. One possible solution is to approximate the recognition with a continuous function. This is essentially the approach of Pietrantonio and Jurs (16). Their algorithm, however, was used solely on linearly separable data. The true value of the recognition may be retained if a second, subordinate optimization criterion is added. A reasonable choice seems to be the sum of the absolute values of the dot products of all misclassified patterns with the weight vector. This is the perceptron criterion

J(W) = \sum_{X \in S_W} |W \cdot X|    (10)

where S_W is the set of all patterns misclassified by W (17). In general, the perceptron criterion will be small if few patterns are misclassified and those misclassified lie close to the decision hyperplane. This criterion is used only as a smoothing function, so that the ultimate solution will be the minimum perceptron value for the weight vector giving maximum recognition.

Starting the Simplex. The initial vertices should locate a region in the weight vector space where the optimum is likely to occur. The pattern vectors are the only guides to choosing this location. The first consideration is whether any pre-processing of the pattern features must be done. Since the features may represent different types of observations, the variables should be scaled. In this work, the scaling of feature i is done by dividing by its standard deviation

x_i^s = x_i / \sigma_i    (11)
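Taken together, Equations 10 and 11 and the inversion rule above define the response evaluated at each simplex vertex. A sketch, assuming patterns are stored row-wise with the augmenting 1 appended and categories encoded as +1/-1 (both encodings are assumptions of this sketch):

```python
import numpy as np

def scale_features(X):
    """Equation 11: divide each feature by its training-set standard deviation."""
    return X / X.std(axis=0)

def vertex_response(w, patterns, labels):
    """Response at one simplex vertex: (recognition, perceptron J).

    patterns : (n, d+1) array of scaled, augmented patterns
    labels   : +1 for category 1, -1 for category 2
    If recognition falls below 50%, the decision surface is inverted
    and the complement taken; a full implementation would also record
    that the inversion occurred.
    """
    s = patterns @ w
    correct = (labels * s) > 0
    recognition = 100.0 * np.mean(correct)
    if recognition < 50.0:
        s = -s                               # invert the decision surface
        correct = (labels * s) > 0
        recognition = 100.0 * np.mean(correct)
    J = np.sum(np.abs(s[~correct]))          # Equation 10, misclassified patterns
    return recognition, J
```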

The scaled data are then used to calculate an initial set of weight vectors. A reasonable approach to calculating the initial vectors is to find some point which is typical of category 1 and some other point typical of category 2. An easily calculated typical or representative point for a given category is the center of mass or mean pattern of the category. The mean pattern for category i is

\bar{X}_i = (\bar{x}_{i1}^s, \bar{x}_{i2}^s, ..., \bar{x}_{id}^s, 1)    (12)

where \bar{x}_{ij}^s is the average value of the scaled feature j for members of class i. When the two class means have been calculated, the decision boundary may be approximated as the hyperplane bisecting these two means. The weight vector for the hyperplane bisecting \bar{X}_1 and \bar{X}_2 is given by

w_j' = 2(\bar{x}_{1j}^s - \bar{x}_{2j}^s)    (13a)

w_{d+1}' = \sum_{j=1}^{d} [(\bar{x}_{2j}^s)^2 - (\bar{x}_{1j}^s)^2]    (13b)

One of the initial weight vectors then should be

W_1 = (w_1, w_2, ..., w_d, -1)    (14a)

where

w_i = -w_i' / w_{d+1}'    (14b)

The other weight vectors must be chosen so that weight space is spanned. This is most easily done by choosing subsequent initial weight vectors according to Table I, where a is some constant (18) (in this work a is arbitrarily selected to be 0.1); a construction of this initial simplex is sketched below, following Table I.

Table I. Coordinates for Initial Weight Vectors

Weight      Coordinates for each weight vector
vector      1        2        3        ...    d        d+1
1           w_1      w_2      w_3      ...    w_d      -1
2           w_1+a    w_2      w_3      ...    w_d      -1
3           w_1      w_2+a    w_3      ...    w_d      -1
4           w_1      w_2      w_3+a    ...    w_d      -1
...
d+1         w_1      w_2      w_3      ...    w_d+a    -1

Figure 2. Possible vertices for a two-dimensional simplex development
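The initial simplex of Equations 12-14 and Table I might be constructed as follows (a sketch; Equation 13b appears here as reconstructed above, and all names are illustrative):

```python
import numpy as np

def initial_simplex(X1s, X2s, a=0.1):
    """Initial weight vectors per Equations 12-14 and Table I (sketch).

    X1s, X2s : scaled patterns of class 1 and class 2, shape (n_i, d)
    a        : spanning constant (0.1 in this work)
    Vertex 1 bisects the two class means; vertices 2..d+1 each add a
    to one of the first d coordinates.  Every vertex ends in -1.
    """
    m1 = X1s.mean(axis=0)                               # Equation 12, class means
    m2 = X2s.mean(axis=0)
    w_prime = np.empty(m1.size + 1)
    w_prime[:-1] = 2.0 * (m1 - m2)                      # Equation 13a
    w_prime[-1] = np.sum(m2**2) - np.sum(m1**2)         # Equation 13b
    w = -w_prime[:-1] / w_prime[-1]                     # Equation 14b
    vertices = [np.append(w, -1.0)]                     # Equation 14a
    for i in range(w.size):                             # Table I rows 2..d+1
        v = np.append(w, -1.0)
        v[i] += a
        vertices.append(v)
    return vertices
```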

Moving the Simplex. Before the d-dimensional simplex is moved, the (d + 1) weight vectors have been defined, and for each of these there is an associated recognition and perceptron function. The worst vertex (or weight vector) is the vertex of lowest recognition. If two (or more) vertices share the same lowest recognition, the vertex of highest perceptron function is inferior. Geometrically, this vertex is reflected through the centroid of the other d vertices. Or, mathematically, if W is the worst vertex,

W_R = \bar{W} + (\bar{W} - W_W)    (15a)

where

\bar{W} = \frac{1}{d} (W_1 + W_2 + ... + W_{W-1} + W_{W+1} + ... + W_{d+1})    (15b)

(In this notation, if A is a vertex, then W_A is the weight vector that vertex represents.) Defining B as the best vertex and M as the second worst vertex permits the simplex to be adaptively changed. With R(A) and J(A) indicating the recognition and perceptron function at vertex A, the following changes may be made (19). For a d = 2 situation, consider the simplex process depicted in Figure 2. The reflected vertex is expanded,

W_E = \bar{W} + 2(\bar{W} - W_W)    (16a)

if a) R(R) > R(B), or b) R(R) = R(B) and J(R) < J(B). The simplex is contracted,

W_C = \bar{W} + (1/2)(\bar{W} - W_W)

if a) R(W) < R(R) < R(M), or b) R(W) = R(R) < R(M) and J(R) > J(W), or c) R(W) < R(R) = R(M) and J(R) > J(M), or d) R(W) = R(R) = R(M) and J(W) = J(R) > J(M). Therefore, the vertex replacing W will be either at R, E, T, or C.
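A sketch of one move of the modified simplex, ranking vertices by recognition with the perceptron criterion as tie-breaker. Only the reflection, expansion, and contraction cases shown above are implemented, so this is a simplification of the full published procedure:

```python
import numpy as np

def simplex_move(vertices, response):
    """One move of the modified simplex (Equations 15-16), a sketch.

    vertices : list of d+1 weight vectors (NumPy arrays)
    response : function mapping a weight vector to (recognition, J)
    Returns the new vertex list with the worst vertex replaced.
    """
    def rank(w):
        r, j = response(w)
        return (r, -j)                      # high recognition, then low J, is best

    vs = sorted(vertices, key=rank)         # vs[0] = worst W, vs[-1] = best B
    worst, best = vs[0], vs[-1]
    centroid = sum(vs[1:]) / len(vs[1:])    # centroid of the other d vertices

    reflected = centroid + (centroid - worst)          # Equation 15a
    if rank(reflected) > rank(best):
        expanded = centroid + 2 * (centroid - worst)   # Equation 16a
        new = expanded if rank(expanded) > rank(reflected) else reflected
    elif rank(reflected) > rank(vs[1]):                # better than second worst M
        new = reflected
    else:
        new = centroid + 0.5 * (centroid - worst)      # contraction
    return vs[1:] + [new]
```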

Halting the Simplex. The simplex procedure is stopped when the optimum has been sufficiently closely defined. If the data are found to be separable, the search should terminate at 100% recognition and a zero value for J; the weight vector then is defined. When the data appear inseparable, the procedure is terminated by a three-step procedure.

1. Let the simplex move until it changes so little that the recognition and the perceptron criteria reach a steady state. For practical purposes, this means ten consecutive moves resulting in vertices with equal recognition and J changing less than 0.1%.
2. Re-expand the simplex around the best point found thus far. Instead of using the hyperplane separating the centers of mass, the best vector that has been found so far is used as the starting weight vector. The other vectors in the starting set are formed by adding the constant, a, one dimension at a time.
3. Allow the simplex to approach a new optimum. If, at steady state, the recognition of the best weight vector has improved, return to step 2. Otherwise accept the best weight vector as the optimum.

Feature Selection. From a practical standpoint, the simplex optimization will take a greater amount of time to approach the optimum when more observations are included. When many features are used to describe a pattern, a pre-processing feature selection step is found to greatly decrease the time it takes to reach a steady state in weight space. In this work, the Fisher ratio (20) is used to select features for the optimization. The Fisher ratio for feature j is given by

F_j = (\bar{X}_{1j} - \bar{X}_{2j})^2 / (V_{1j} + V_{2j})    (17)

where \bar{X}_{ij} = the average value of feature j for members of class i and V_{ij} = the variance in the values of feature j for members of class i. Features which give large values of F will have greatly different means and/or small variances. These features of large Fisher ratio are chosen as the best features for the simplex optimization.
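A sketch of this feature selection step, using the Fisher ratio in the standard form shown as Equation 17 (the printed form of the equation did not survive extraction, so the definition here follows the surrounding text):

```python
import numpy as np

def fisher_ratios(X1, X2):
    """Fisher ratio for each feature (Equation 17).

    X1, X2 : (n1, d) and (n2, d) arrays of patterns in classes 1 and 2.
    Features with large F have well separated means and/or small
    variances; the highest-ranking features are kept for the simplex.
    """
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0)
    return num / den

# keep the 25 best features, as in the comparative study below:
# best = np.argsort(fisher_ratios(X1, X2))[::-1][:25]
```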

Table II. Questions Used in Comparative Study

Question                          % in yes class    % in no class
Presence of oxygen                      28                72
More than 12 hydrogens                  50                50
Presence of an ethyl group              46                54
Presence of a ring structure            38                62

COMPARISON WITH PREVIOUS TECHNIQUES

The data used to test the simplex algorithm are taken from the A.P.I. Project 44 tables. The data set consists of 629 low resolution mass spectra in the range C1-10H1-24O0-4N0-2 and has been reduced to include only the 119 most significant mass positions (8). This collection of data has been used for several pattern recognition studies and should serve as a good test of the efficacy of the simplex optimization. A training set of 400 spectra was taken from the 629 total spectra by a random draw. Recognition values are taken as the per cent of the training set correctly classified. Prediction values are the per cent of the remaining 229 patterns correctly classified using the first 400 patterns for training and the designated technique. Four questions or categorizations have been tested for comparison purposes using several techniques. Table II lists the questions and the per cent of the total in each of the two classes. Table III presents the results of the recognition and prediction for each of six different techniques. Each technique will be discussed and compared to the proposed simplex technique. For all of the questions, only the best 25 dimensions as selected by the Fisher ratio test were supplied to the classifier.

Nearest Neighbor (21). The nearest neighbor technique has been described as an effective pattern classifier. The recognition results are deleted since, by definition, they are 100%. Certain problems do exist with a nearest neighbor classification. Scaling difficulties may arise if the magnitudes of the features are not similar. More important, although there is no training step, each unknown must be compared with every member of the data set before it can be classified. Although the linear discriminant approach may involve a lengthy training procedure, its use for classification is very rapid. A comparison of prediction results for the nearest neighbor and simplex methods indicates that the two techniques are close in predictive ability for these questions.

Distance from Means (22). In this method, each class is described by only one point, the centroid or mean of the class. For classification purposes, an unknown is classified in the same category as the nearest mean vector. This method attempts to combine all nearest neighbor information into one point. It is hoped that this will be at small cost in classification ability. The advantage of this technique is a simple calculation procedure in addition to the rapid classification of unknowns.
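A minimal sketch of the distance-from-means rule (names are illustrative):

```python
import numpy as np

def nearest_mean_classify(x, mean1, mean2):
    """Assign x to the category whose class mean (centroid) lies nearer."""
    d1 = np.linalg.norm(x - mean1)
    d2 = np.linalg.norm(x - mean2)
    return 1 if d1 < d2 else 2
```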


Table III. Results of Recognition and Prediction for Six Techniques

[Per cent recognition and prediction for each of the four questions of Table II (presence of oxygen, more than 12 hydrogens, presence of an ethyl group, presence of a ring structure), tabulated for: Nearest Neighbor; Distance from Means; Learning Machine (maximum value for inseparable data); Average Subset Learning Machine (8 averages); Perceptron Simplex; and Simplex (recognition and perceptron). Nearest Neighbor recognition values are omitted since, by definition, they are 100%.]

Table III, however, shows that both recognition and prediction results obtained in this way are clearly inferior to the simplex technique, with prediction results between 7 and 25% lower.

Learning Machine. The learning machine algorithm may be applied to inseparable data. However, there is no way to know how well the weight vector will do when the algorithm is terminated. In the present work, the computation was terminated at 20 different times and the maximum values for recognition and prediction were reported. The results, as would be expected, are much worse than with simplex optimization. It is interesting to note that, since the simplex reaches or approaches the optimum value of recognition, the learning machine discriminant may be as far as 25% away from an optimum value.

Average Subset Learning Machine (23). In this method, the training set was divided into 8 subsets with 50 patterns in each subset. All subsets were linearly separable (not at all a surprising result considering the dimensionality (24)) and a weight vector was calculated for each by the learning machine algorithm. Although each of these might be an acceptable classifier, the average of the set of weight vectors showed improved recognition and prediction over the individual weight vectors. The results for this method are usually an improvement over the terminated learning machine. However, they still fall between 3 and 11% below the simplex technique.

Perceptron Simplex. The perceptron function itself may be used in a simplex optimization. The resultant weight vector will not maximize the recognition, but instead will minimize the distance of misclassified points from the hyperplane. The results are similar to those obtained with the average subset learning machine method, but are still inferior to the simplex method using recognition as well as the perceptron function.

Simplex. Two important questions need to be considered in the simplex results. The first question is whether the improvement in predictive and recognition ability is sufficient to warrant the extra processing time required for training. This question may in part arise since there has been no attempt to improve the speed of the computer algorithm. Requiring feature selection is not objectionable if the features contain most of the non-artificial discriminating information. Finally, prediction of unknowns is very rapid once the weight vector has been developed. The second question is whether the true optimum in the response curve was found. Special care was taken to start the simplex in a region where a best weight vector might be thought to lie. However, there was no way to guarantee that the optimum had been reached. In inseparable data problems, it is likely that false extrema do exist, and there is no way to ensure that the perceptron criterion may not force the simplex into a false extremum. However, if false maxima were reached, they resulted in weight vectors which did as well as or better than the other techniques.

SEPARABLE DATA - A PROPOSAL

Although the existence of linearly separable data may be unlikely, the sequential simplex method is still applicable. In this case, the recognition will equal 100% and the perceptron function will be zero. The weight vector gives a hyperplane which separates the data. However, since there is assumed to be no information about the underlying distributions of the two classes, it may be argued that the best weight vector is the one which maximizes the distance between the two classes.

In the learning machine problem, this may be done by defining a width parameter (25, 26), 2c, so that

W \cdot X > c for category 1
W \cdot X < -c for category 2    (18)

The vector may be optimized in this respect by increasing c slowly until the data are no longer separable. The maximum value of 2c may also be calculated using a simplex optimization. This requires setting the width as the response variable. For each weight vector, a width may be calculated from the maximum and minimum values of W \cdot X for the members of each class. If MIN(i) and MAX(i) are the minimum and maximum dot products among members of class i, then the following possibilities may occur.

1. MIN(i) ≤ MAX(i) ≤ MIN(j) ≤ MAX(j)    (19a)

The width is positive and equal to MIN(j) - MAX(i); this is the separable case.

2. MIN(i) ≤ MIN(j) < MAX(i) ≤ MAX(j)    (19b)

The width is negative and equal to MIN(j) - MAX(i).

3. MIN(i) < MIN(j) < MAX(j) < MAX(i)    (19c)

The width is negative and equal to the greater of MIN(j) - MAX(i) and MIN(i) - MAX(j).

In each case, the width parameter is equal to the greater of MIN(2) - MAX(1) and MIN(1) - MAX(2) and, therefore, the greater of these two numbers will be the response variable. With separable data, the response variable will become positive and reach a steady state. The resultant weight vector will allow maximum distance between the two classes. This weight vector will be chosen so that

W \cdot X > k + c for category 1
W \cdot X < k - c for category 2    (20)

If MIN(i) - MAX(j) ≥ MIN(j) - MAX(i), then k is given by

k = (1/2)[MIN(i) + MAX(j)]    (21)

SUMMARY

The sequential simplex technique provides a method for calculating a near-optimal weight vector. The algorithm searches weight vector space to find the hyperplane which first maximizes the recognition ability and then minimizes the perceptron criterion. The simplex method works equally well for both separable and inseparable data. The simplex method proceeds consistently toward an optimum, thus avoiding large fluctuations in weight vector space when working with inseparable data. The results are only near-optimal since there is no way to be sure that a false extremum has not been found. This problem presents little difficulty since the prediction results are as good as or better than those obtained by other techniques. The simplex procedure is slower than the learning machine procedure since there must be a complete recognition step for each newly selected weight vector. However, this processing step is necessary only during training, and the prediction of unknowns is equally rapid.

LITERATURE CITED

(1) P. C. Jurs, B. R. Kowalski, and T. L. Isenhour, Anal. Chem., 41, 21 (1969).
(2) P. C. Jurs, Anal. Chem., 42, 1633 (1970).
(3) L. B. Sybrandt and S. P. Perone, Anal. Chem., 43, 383 (1971).
(4) B. R. Kowalski and C. F. Bender, J. Am. Chem. Soc., 96, 916 (1974).
(5) C. L. Wilkins, R. C. Williams, T. B. Brunner, and P. J. McCombie, J. Am. Chem. Soc., 96, 4182 (1974).
(6) K. Varmuza, H. Rotter, and P. Krenmayr, Chromatographia, 7, 522 (1974).
(7) N. J. Nilsson, "Learning Machines", McGraw-Hill, New York, N.Y., 1965, p 79.
(8) T. L. Isenhour and P. C. Jurs, Anal. Chem., 43 (10), 20A (1971).
(9) W. Spendley, G. R. Hext, and F. R. Himsworth, Technometrics, 4, 441 (1962).
(10) J. A. Nelder and R. Mead, Comput. J., 7, 308 (1965).
(11) D. M. Olsson and L. S. Nelson, Technometrics, 17, 45 (1975).
(12) R. R. Ernst, Rev. Sci. Instrum., 39, 998 (1968).
(13) D. E. Long, Anal. Chim. Acta, 46, 193 (1969).
(14) S. N. Deming and S. L. Morgan, Anal. Chem., 46, 1170 (1974).
(15) R. Smits, C. Vanroelen, and D. L. Massart, Fresenius' Z. Anal. Chem., 273, 1 (1975).
(16) L. Pietrantonio and P. C. Jurs, Pattern Recognition, 4, 391 (1972).
(17) R. O. Duda and P. E. Hart, "Pattern Classification and Scene Analysis", Wiley-Interscience, New York, N.Y., 1973, p 141.
(18) G. S. G. Beveridge and R. S. Schechter, "Optimization: Theory and Practice", McGraw-Hill, New York, N.Y., 1970, p 374.
(19) S. N. Deming and S. L. Morgan, Anal. Chem., 45, 278A (1973).
(20) R. O. Duda and P. E. Hart, Ref. 17, p 116.
(21) T. M. Cover and P. E. Hart, IEEE Trans. Inf. Theory, IT-13, 21 (1967).
(22) J. B. Justice and T. L. Isenhour, Anal. Chem., 46, 223 (1974).
(23) J. B. Justice, to be published.
(24) R. O. Duda and P. E. Hart, Ref. 17, p 69.
(25) L. E. Wangen, N. M. Frew, and T. L. Isenhour, Anal. Chem., 43, 845 (1971).
(26) P. C. Jurs, Anal. Chem., 43, 22 (1971).

RECEIVED for review April 14, 1975. Accepted June 20, 1975. C.L.W. is Visiting Associate Professor, 1974-75 Academic Year, from the University of Nebraska-Lincoln. T.L.I. is an Alfred P. Sloan Fellow, 1971-75. Support of this research through Grant GP-41515X (CLW) and Grant GP-43720 (TLI) by the National Science Foundation is gratefully acknowledged.

Identification of Positive Reactant Ions Observed for Nitrogen Carrier Gas in Plasma Chromatograph Mobility Studies

D. I. Carroll, I. Dzidic, R. N. Stillwell, and E. C. Horning

Institute for Lipid Research, Baylor College of Medicine, Houston, Texas 77025

A Plasma Chromatograph-mass spectrometer (PC-MS) combined instrument was used to define the ion species responsible for the three major peaks observed in positive ion mobility studies of nitrogen carrier gas at 160 °C and 7 ppm water concentration. These species are: I, NH4+ with about 7% of NH4+(H2O); II, NO+ with about 10% of NO+(H2O); III, H+(H2O)2 and H+(H2O)3 in about 7:3 ratio, for ion peaks showing reduced mobilities of 3.00, 2.62, and 2.32 cm2 V-1 sec-1, respectively. Literature identifications based upon presumed equivalence of mobility and mass measurements are in error.

The early mass spectrometric studies of Shahin (1), employing a corona discharge in air at atmospheric pressure, indicate that H+(H2O)n ions should be the dominant species in positive ion mobility spectra observed with nitrogen carrier gas in the Plasma Chromatograph. This was confirmed using a Plasma Chromatograph-mass spectrometer (PC-MS) combined instrument at Franklin GNO Corporation (2). The ions NO+ and NO+(H2O) were also found to be present (3).

A recent review (4) of Plasma Chromatograph applications contained a positive ion mobility spectrum for nitrogen which showed five peaks. These ions were identified, apparently through mass-mobility calculations, as H+(H2O), NO+(H2O), H+(H2O)3, H+(H2O)4, and H+(H2O)5 ions. This mobility spectrum differed significantly from those previously published, which had only three peaks, identified as H+(H2O)2, NO+(H2O), and H+(H2O)3 (5-7). No explanation was given in the review for the absence of H+(H2O)2 ions in the mobility spectrum, or for the discrepancy with respect to earlier publications in reporting five reagent peaks rather than three.

Serious errors can be made when ion identifications are based on mobility data alone (8). This appears to be true in the case of the identification of the positive ions observed in PC mobility spectra of nitrogen carrier gas. Based upon mass spectrometric data, the identities of the ions responsible for the three major peaks in the mobility spectrum of nitrogen are: first peak, NH4+ together with some NH4+(H2O); second peak, NO+ together with some NO+(H2O); third peak, H+(H2O)n, where n = 2 and 3 for the conditions employed in our study. The degree of ion hydration in each instance is dependent upon the temperature and the water concentration.

EXPERIMENTAL

Apparatus. A Plasma Chromatograph-mass spectrometer combination which has been described in detail (8) was used in this work. Primary ions are produced in a carrier gas stream by a 63Ni source; these primary ions initiate a sequence of ion molecule reactions which yield reagent ions. When the carrier gas is nitrogen containing a trace of water, the nitrogen ions which are formed initially enter a sequence of reactions terminating in the formation of H+(H2O)n ions. Ions formed in the source region are introduced into the drift region; they move through nitrogen at atmospheric pressure under the influence of an electric field. The transit times of ions through the drift region are recorded; these are of the order of milliseconds. The signal output of the instrument is ion current as a function of time. The principles of operation are essentially those of electrophoresis in the gas phase.

Either one of two shutter grids is used to pulse ions into the drift region (8). The ions drift from either of the shutter grids to the sampling aperture, and are entrained in the gas flow into a mass analyzer. An ion lens focuses the ions from the aperture into the quadrupole rod structure of the mass analyzer. A Channeltron electron multiplier is used as a detector; pulse counting techniques are employed. The data are collected and displayed with the aid of a PDP 8/E minicomputer.

Procedures. A mass spectrum was obtained with both shutter grids open, and a drift time chart was recorded using conventional gating techniques. The mass analyzer was then adjusted to respond to a single ion mass only. The arrival time of that ion was then measured from each grid to the detector (see Figure 3). The difference in arrival times was the drift time of the mass-identified ion between the two shutter grids. Using the known grid distances and the drift tube temperature and pressure, the mobility of the mass-identified ion was calculated. An electron impact source was available to obtain spectra for calibration purposes. The experimental conditions are summarized in Table I. The effluent gas from a container of liquid nitrogen was used as a source of both carrier and drift gases; the gas was cleaned and dried prior to use by passage through two cylinders containing Type 13X molecular sieve. Consistent background ion spectra were obtained after baking out the PC-MS under vacuum at 200 °C overnight.
