Nonparametric Feature Selection in Pattern Recognition Applied to Chemical Problems

G. S. Zander, A. J. Stuper, and P. C. Jurs

Department of Chemistry, The Pennsylvania State University, University Park, PA 16801
The application of pattern recognition techniques to chemical problems requires that the objects to be classified be coded using computer-compatible descriptors. If classification is successful, then the descriptors can be investigated for correlations with the classification being made, a process known as feature selection. A new feature selection method to be used with linear threshold logic units is presented. Given a linearly separable binary classification problem, the method allows identification of a minimum set of descriptors for which linear separability is retained. The basis of the method is presented, and results of applying it to well-characterized artificial data sets and to two chemical data sets are shown. The method is applicable to a wide variety of data types, and it provides a ranking of the relative importance of the descriptors. It can be used to estimate the intrinsic dimensionality of data sets.
Pattern recognition techniques have been used in a number of scientific application areas. Generally, these techniques attempt to predict properties of classes of objects or events which are not directly measurable because of expense in terms of time or lack of feasible theoretical approaches. If more information exists than is necessary to classify the data properly, methods are needed to determine which of the given properties are those influencing the classification. Such methods are grouped under the general heading of feature selection, and they vary depending upon the type of relationship involved. In this paper, we describe a new method of feature selection applicable to data sets separable by linear discriminant functions and offer comparisons between this method and others currently available. Although feature selection has been one of the most intensely investigated areas of pattern recognition, few of the methods reported are data independent; most have been narrowly focused on an immediate problem.

Feature selection for a two-class problem is generally defined as any method which reduces the dimensionality of a data space to form a feature space such that descriptors unnecessary for discrimination between the two classes are discarded and those necessary for discrimination are retained. If the classes are linearly separable (can be separated by a linear decision surface), the feature space must remain linearly separable after feature selection. In the case of a data set that is not linearly separable, the classification error function of the feature space must not be greater than that of the original data space. This latter criterion may be difficult to evaluate since, in general, the necessary or intrinsic components of non-separable data sets are ill defined.

The most well defined methods of feature selection are statistical and assume a priori knowledge of the probability density functions of the data classes. Diagonal, rotational, and other linear transformations are based on second-order statistics and covariance matrices, while divergence and Bhattacharyya transforms require the data to be Gaussian in nature (1, 2). Classification methods which use
parameters derived from known or estimated probability density functions are generally called parametric classifiers. In applications to chemical problems, very often no statistical assumptions can be made about the data, which necessitates the use of nonparametric classifiers. The most widely employed classifier in chemical applications has been the linear threshold logic unit (TLU), which is developed with an error-correction feedback training procedure. Several nonparametric feature selection methods have been reported, but they are based on definitions which are intuitive rather than mathematical (3-5). The most promising area of feature selection used with nonparametric classifiers is based on a systems approach, whereby the results of classification are used after the fact to discard those descriptors of the original set which had the least influence on the classification process. Methods utilizing this approach have met with considerable success in chemical applications (e.g., 3, 6).

In a typical chemical problem, the data to be classified are represented as n-dimensional pattern vectors, X = (x_1, x_2, ..., x_n), whose components are descriptors of the object or event to be classified. For example, a mass spectrum can be coded by setting component x_j equal to the intensity of the mass spectral peak in m/e position j. For data sets in which molecular structures are to be coded in this format, each individual component of the pattern vector can be descriptive of a portion of the molecule being coded; e.g., x_1 could be the molecular weight; x_2, the number of oxygen atoms in the molecule; etc.

Given a set of data expressed in the format described above, it would be a trivial problem to separate classes from one another if one knew in advance those properties which were sufficient to obtain the proper classification. Since this is not the case, the tendency is to convert an excess number of descriptors of the object or event being coded into computer-compatible form, limited only by the storage capacity of the computer being used. If linear separability of the two classes is obtained, then a feature selector should be used to discard all unnecessary components in order to find the minimum number of dimensions required to separate the classes. Those descriptors remaining could be investigated to determine whether they correlate with actual processes or models of processes. If linear separability is not demonstrated, then new descriptors can be included in hopes of finding a sufficient set. The success in finding linear relationships will depend on the degree to which the data set employed is representative of the universal set from which it is drawn; the universal set is defined as all possible members of the given classes.

Removal of descriptors not contributing to linear separability can involve a large amount of computation time. Therefore, it would be desirable for the feature selection routine to give a ranking of the importance of each descriptor necessary for separation. This ranking could then be used to speed the selection of those properties intrinsic to the separation of the sets. The ideal linear feature selection process would then be one which 1) excludes all unnecessary components without excluding necessary components, such that the tendency is toward optimal separation of the two classes; 2) indicates a rating or ranking of the importance of each necessary component; 3) requires a minimum amount of computer time with few iterations; and 4) can be applied to any type of data set.
Figure 1. Plot of w_n vs. x_n; the region of smallest slope is the "permitted range" of x_n.
Present indications are that the method of feature selection presented here is superior by all of these criteria to the methods previously employed.

Linear learning machines have been extensively discussed in the literature (1, 2, 7, 8), and therefore only a limited description of the training algorithm follows. Given m data vectors of n dimensions, then
X_i = (x_{i,1}, x_{i,2}, ..., x_{i,n}),   i = 1, 2, ..., m     (1)
The nth component, x_{i,n}, is set equal in value for all pattern vectors in the data set to ensure a common origin in the Euclidean data space. The pattern vectors can alternatively be viewed as points in the n-dimensional space. If the pattern points cluster in limited regions of the n-space, then a decision surface can be used to separate the clusters from one another. The simplest decision surface is the linear hyperplane. Like all planes, an n-dimensional hyperplane has a normal vector associated with it, usually called the weight vector, W. An individual pattern point or pattern vector, X_i, can be classified with respect to a hyperplane decision surface by taking the dot product, s_i, between the weight vector and the pattern vector, as follows:
s_i = W · X_i     (2)
    = |W| |X_i| cos θ     (3)
where θ denotes the angle between the vectors. The dot product is used in the following classification rules:
If s_i > d, then X_i is in class one
If s_i < -d, then X_i is in class two
If -d ≤ s_i ≤ d, then X_i is not classified     (4)
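In code, the classification rule of Equations 2-4 amounts to a single dot product and a threshold test. The following is a minimal sketch, not the authors' program; the function name, the array shapes, and the particular deadzone value are illustrative assumptions.

```python
import numpy as np

def classify(w, x, d=0.0):
    """Classify pattern vector x against weight vector w (Equations 2-4).

    Returns 1 for class one (s > d), 2 for class two (s < -d),
    and 0 when the pattern falls inside the deadzone.
    """
    s = np.dot(w, x)              # s_i = W . X_i
    if s > d:
        return 1
    if s < -d:
        return 2
    return 0                      # -d <= s_i <= d: not classified

# Illustrative three-component pattern; the last component is the common x_n term.
w = np.array([0.4, -0.9, 0.2])
x = np.array([12.0, 3.0, 20.0])
print(classify(w, x, d=0.1))
```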
The use of a deadzone, d, tends to better align or center the decision surface between the two classes.

A useful weight vector is usually generated by an iterative error-correction feedback procedure called training. A subset of the total data set is chosen and called the training set. The weight vector being trained is initialized arbitrarily and is normalized to unit length. The dot products are calculated sequentially for each member of the training set, and the classifications obtained are compared to the true classes of the patterns. A misclassification invokes a correction procedure which moves the decision surface an equal distance in a perpendicular direction to the opposite side of the misclassified point:

W' = W + cX_i     (5)

in which

c = -2s_i / (X_i · X_i)     (6)
The new weight vector, W', will correctly classify the pattern after the feedback. This procedure is repeated until all the members of the training set are correctly classified. The points not included in the training set can then be used to test the ability of the weight vector to classify unknowns.

Once a workable classification algorithm has been developed, it may be of use in developing a posteriori feature selection methods. The exact orientation of a hyperplane decision surface depends upon the order in which the pattern vectors are presented to the classifier, the initialization used for the weight vector, and the value of the nth component of the pattern vectors. A series of weight vectors generated in a manner designed to exploit these dependencies can be used to rank those features contributing to the separation of the two clusters of pattern points.

We define those dimensions corresponding to the descriptors which are the minimum necessary to effect linear separability as intrinsic dimensions. Those descriptors which do not contribute to linear separability are called nonintrinsic dimensions. The intrinsic dimensions may alternatively be thought of as the dimensions defining the minimum volume (hypervolume) through which a decision surface must pass in order to effect separation. Removal of any intrinsic component will result in collapse of this volume and a loss of linear separability. Intrinsic dimensions can be distinguished from nonintrinsic dimensions by a feature selection procedure which focuses upon the relative variations of each component of a series of unit-length weight vectors trained using the same training set.
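A compact version of the error-correction training loop of Equations 5 and 6 is sketched below. It is a schematic reconstruction under stated assumptions rather than the original program: patterns are rows of X with the constant nth component already appended, the desired classes are coded +1/-1, the deadzone is taken as zero during training, and the feedback cap is arbitrary.

```python
import numpy as np

def train_tlu(X, y, max_feedbacks=4000, seed=0):
    """Error-correction feedback training of a linear TLU.

    X : (m, n) array of pattern vectors (last column is the constant x_n).
    y : (m,) array of desired classes coded +1 or -1.
    Returns (w, feedbacks), with w of unit length, or (None, feedbacks) if
    100% recognition is not reached within roughly max_feedbacks corrections.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)                    # arbitrary unit-length initialization
    feedbacks = 0
    while feedbacks < max_feedbacks:
        errors = 0
        for xi, yi in zip(X, y):
            s = w @ xi
            if s * yi <= 0.0:                 # wrong side of the surface (or exactly on it)
                c = -2.0 * s / (xi @ xi)      # Equation 6
                if s == 0.0:                  # degenerate case not covered by Eq 6: step toward the correct side
                    c = yi / (xi @ xi)
                w = w + c * xi                # Equation 5: reflect the surface past the point
                feedbacks += 1
                errors += 1
        if errors == 0:                       # every training pattern classified correctly
            return w / np.linalg.norm(w), feedbacks
    return None, feedbacks
```

With a zero deadzone the loop terminates only when every training pattern lies on the correct side of the surface, mirroring the 100% recognition criterion used throughout the paper.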
VARIANCE FEATURE SELECTION

Description. Once a set of data is shown to be linearly separable, it is desirable to remove those descriptors from the data set that do not contribute to its separation. The variance method of feature selection does this by developing an ordered listing of the descriptors based on the relative variation of the corresponding weight vector components among a series of trained weight vectors. The descriptors which correspond to larger relative variations can be identified and discarded in the order of their appearance in this list. The procedure employed to train the series of weight vectors used to develop the list takes advantage of the properties of the error-correction feedback training routine. A detailed discussion of the development of the procedure is given in the Appendix; only an outline of the method and its implementation is presented here.

The first requirement in the development of the variance method is to optimize the value of the nth component of the patterns, x_n. During training, this value is usually set to a convenient number that allows both good predictive ability and a fast training rate. The variance method requires a specific range of x_n values to be used, depending on the data set. To pick a value of x_n, the data set is trained using a value of x_n set equal to or less than the average absolute value of all the components in the data set. Successive weight vectors are then trained while incrementing the value of x_n and using the result of the previous weight vector as the initialization for the next weight vector. Increments are of the order of one to one hundred. The value of the nth component of each weight vector is then plotted with respect to the value of x_n for the training trials. A typical plot is shown in Figure 1.
The portion of the curve having the smallest slope is called the "permitted range" of x_n. The x_n component must fall in this range for successful implementation of the variance method.

The second requirement is to train a sufficient number of different weight vectors with the data while the x_n value is within the permitted range. The number of weight vectors to be trained depends on the data set. If large numbers of weight vectors are trained, fewer iterations of the overall algorithm will be required. Generally, performance will increase with a corresponding increase in the number of weight vectors used in the analysis. After generation of the set of weight vectors, the relative variation of each component is calculated using the following equations:
R_j = V_j / |m_j|     (7)

where

V_j = [ Σ_{k=1}^{n_k} (w_{jk} - m_j)² / n_k ]^{1/2}     (8)

and where j is the component index, k is the index for the weight vectors, m_j is the average value of the jth weight vector component over the n_k weight vectors, w_{jk} is the jth component of the kth weight vector, and n_k is the number of weight vectors trained.

The relative variations are then ordered from largest to smallest. Those components having the smallest relative variations are the most necessary for separability; those components with the largest R_j values are the least necessary. Components are discarded in the order of decreasing R_j. The remaining components are those which give the classifier sufficient information to separate the classes in the data set. It may be necessary to develop and condense the ordered list of variations several times. This is usually an indication that an insufficient number of weight vectors was developed originally and therefore a good measure of R_j was not provided.

Implementation. The variance feature selection method has been implemented in two slightly different ways. In the first method, a series of weight vectors is trained using a different weight vector initialization for each training and an x_n value determined from the w_n vs. x_n plot (Figure 1). These weight vectors are then used to calculate the relative variations, which are then ordered from largest to smallest. The cutoff point is determined from the ordered list as follows: the half of the list corresponding to the smallest relative variations is tested for linear separability. If separability is found, the list is again divided and tested. If separability is not found, then half of the components previously discarded are added back and the resulting list is tested. This process is repeated until subtraction of only one more component results in non-separability. Note that even a poor measure of R_j will provide a means to eliminate a few components, since the ordered list and the above method of elimination make it impossible to discard any member of an intrinsic set. (There may be more than one set of intrinsic components present. The method of elimination must keep all the members of at least one set of intrinsic components in order to retain separability.)

The second implementation generates weight vectors based on the method used to generate the w_n vs. x_n plot. The first weight vector is arbitrarily initialized and trained at an x_n value reasonably close to the permitted range. Succeeding weight vectors are initialized using the result of the previous weight vector while incrementing the value of x_n. The advantage of this algorithm is a significant reduction in computation time, since the number of feedbacks required for convergence of each weight vector is reduced.
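The relative-variation ranking of Equations 7 and 8 and the search over the ordered list might be organized as in the sketch below. It assumes the weight vectors have already been trained (for example, with a routine like the training sketch given earlier) and that the caller supplies an is_separable test that simply retrains on the selected columns; the function names and the use of a binary search in place of the repeated halving described above are illustrative choices, not the authors' implementation.

```python
import numpy as np

def relative_variation(weights):
    """Equations 7 and 8: R_j = V_j / |m_j| for each weight vector component.

    weights : (n_k, n) array, one trained weight vector per row.
    Returns R, an (n,) array of relative variations.
    """
    W = np.asarray(weights, dtype=float)
    m = W.mean(axis=0)                         # m_j: mean of component j over the n_k vectors
    V = np.sqrt(((W - m) ** 2).mean(axis=0))   # V_j: spread of component j about its mean
    return V / np.abs(m)

def select_features(R, is_separable):
    """Search the list ordered by decreasing R_j for the minimum separable set.

    is_separable(keep) must retrain on the column indices in `keep` and report
    whether 100% recognition is still obtained.  Returns the surviving columns,
    i.e., those with the smallest relative variations.
    """
    order = np.argsort(R)                 # ascending R: most necessary components first
    lo, hi = 1, len(R)                    # the full set is known to be separable
    best = order
    while lo < hi:
        k = (lo + hi) // 2
        keep = order[:k]
        if is_separable(keep):
            best, hi = keep, k            # separability retained: try discarding more
        else:
            lo = k + 1                    # too few components: put some back
    return np.sort(best)
```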
Data Sets. Since previous nonparametric feature selection methods were developed using data sets within which the relations between components were not well understood, and since methods developed prior to this time were generally applicable only to the problem at hand, a need for a well-defined data set was seen to exist. Such a set could then be used to measure the effectiveness of the methods being developed and to compare those methods against the criteria defining an optimal linear feature selection algorithm. The requirements of the data generation procedure are that the data sets must contain two linearly separable classes and that the number of intrinsic and nonintrinsic components must be known. One method for achieving this is as follows. To generate points with m intrinsic components which fall into linearly separable classes, the points can be calculated so that they are imbedded in a series of parallel m-dimensional hyperplanes. This was done using the general equation for an m-dimensional plane:

a_1 x_1 + a_2 x_2 + ... + a_m x_m = a_0     (9)

where a_0 is a constant.
Points imbedded in a given plane are generated by fixing the a values, then randomly choosing x_1, x_2, ..., x_{m-1}, and finally calculating x_m from Equation 9. Points imbedded in parallel planes can be generated by varying a_0 while fixing all the remaining values. For each generated point, the remaining components are generated randomly and do not contribute to the separation of the planes; they are therefore nonintrinsic components. The resulting data sets are called DGEN data.

Each set of generated data consisted of two classes containing points imbedded in five planes each; four of the planes contained twelve points and one plane (the one closest to the opposite class) contained fifty-two points, for a total of 200 points in a fifty-dimensional space. The value of the constant a_0 was initially zero and was incremented by five for intraclass separation between planes and by one for interclass separation. Five data sets were generated containing 5, 15, 25, 35, and 45 intrinsic components, respectively. After generation of the five data sets by this procedure, each was subjected to autoscaling and then multiplied by a scaling factor of twenty. This makes the average value zero and the standard deviation twenty for each of the fifty components. It does not affect the basic nature of the generated set with respect to linear separability or the number of intrinsic components. Autoscaling and multiplication by a scaling factor were done for consistency with the other data sets employed. Thus, each data set is known to be linearly separable, and the numbers of intrinsic components, m, and nonintrinsic components, 50 - m, are known in advance. The deletion of any of the intrinsic components during feature selection resulted in a loss of the ability of the classifier to separate the data within 4000 feedbacks, whereas elimination of any or all of the nonintrinsic components did not affect separability. (When employing a linear TLU, 100% recognition is necessary and sufficient proof of linear separability. However, an infinite number of iterations would be necessary to prove linear inseparability. In this case, separability was shown in approximately 1000 feedbacks; therefore, 4000 feedbacks was felt to indicate reasonable doubt concerning the existence of linear separability.)

The variance and weight-sign feature selection methods were compared using the generated data (DGEN data), mass spectral generation data (mass spectral data) (9), and data obtained from a study of psychotropic agents (drug data) (10). Further comparisons to results obtained from the multivariate discriminant analysis program BMD07M (11) were made using the DGEN and drug data.
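A DGEN-style data set can be produced directly from Equation 9. The sketch below is an illustrative reconstruction: the plane spacings and point counts follow the text, but the coefficient values, the sign convention separating the two classes, and all function and variable names are our own assumptions.

```python
import numpy as np

def dgen_class(a, a0_values, points_per_plane, n_total=50, seed=0):
    """Points imbedded in parallel m-dimensional hyperplanes (Equation 9).

    a                : plane coefficients a_1 ... a_m (a_m must be nonzero).
    a0_values        : one a_0 per parallel plane within the class.
    points_per_plane : number of points placed in each of those planes.
    Components m+1 ... n_total are random and therefore nonintrinsic.
    """
    rng = np.random.default_rng(seed)
    m = len(a)
    rows = []
    for a0, npts in zip(a0_values, points_per_plane):
        for _ in range(npts):
            x = np.empty(n_total)
            x[: m - 1] = rng.uniform(-10.0, 10.0, m - 1)      # choose x_1 ... x_{m-1} at random
            x[m - 1] = (a0 - a[:-1] @ x[: m - 1]) / a[-1]     # solve Equation 9 for x_m
            x[m:] = rng.uniform(-10.0, 10.0, n_total - m)     # nonintrinsic filler components
            rows.append(x)
    return np.array(rows)

# Five planes per class; a_0 is spaced by 5 within a class and the two classes are
# separated by 1, with 52 points in the plane nearest the opposite class.
a = np.ones(5)
class1 = dgen_class(a, [0, 5, 10, 15, 20], [52, 12, 12, 12, 12], seed=0)
class2 = dgen_class(a, [-1, -6, -11, -16, -21], [52, 12, 12, 12, 12], seed=1)

# Autoscale and rescale by 20, as described in the text.
X = np.vstack([class1, class2])
X = 20.0 * (X - X.mean(axis=0)) / X.std(axis=0)
```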
Table I. Artificial Data Sets with Known Number of Intrinsic Components

Statistical analysis, BMD07M
Data set    Number of steps    Descriptors included    Percent recognition
1           46                 46                      90.0
2           49                 45                      89.5
3           53                 47                      89.5
4           61                 47                      89.5
5           70                 50                      89.5

Weight-sign method 2
Data set    Initializations    Descriptors included
1           1, 3, 6            35, 38, 38
2           1, 3, 6            39, 44, 41
3           1, 3, 6            31, 44, 40
4           1, 3, 6            45, 46, 46
5           1, 3, 6            49, 50, 50

Variance method
Data set    Descriptors included    Number of feedbacks (initial)    Number of feedbacks (final)
1           5                       1024                             76
2           15                      1722                             356
3           25                      2332                             804
4           35                      1990                             1099
5           45                      2513                             2138
The drug data consisted of 219 sedatives and tranquilizers representing a wide variety of compound types. These compounds were coded using 69 descriptors of three types: (a) numeric and binary fragment descriptors; (b) binary substructure descriptors; and (c) topological descriptors. Only an ordinary two-dimensional structural diagram of the molecule is required for input to the system. Preprocessing of the data consisted of normalizing, autoscaling, and variance weighting. Normalizing consisted of multiplying each component of the data set by a value such that the average value of all nonzero components was equal to twenty. After normalization, the data were autoscaled, multiplied by a scaling factor of twenty, and truncated to integer values. The autoscaled data were then variance weighted. The resulting data set was linearly separable into the two classes of sedatives and tranquilizers.
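The preprocessing chain applied to the drug data might be written roughly as follows. This is a hedged sketch: the normalization target of 20, the autoscaling, the rescaling by 20, and the integer truncation follow the text, but the handling of columns with negative entries and the form of the variance weighting (taken here as a simple ratio of between-class to pooled within-class variance) are assumptions, since the paper does not spell them out.

```python
import numpy as np

def preprocess_drug_data(X, y):
    """Normalize, autoscale, and variance weight a descriptor matrix.

    X : (m, n) raw descriptors; y : (m,) class labels for the two classes.
    """
    X = X.astype(float).copy()
    # Normalization: scale each column so its nonzero entries average (in magnitude) 20.
    for j in range(X.shape[1]):
        nz = X[:, j] != 0
        if nz.any():
            X[:, j] *= 20.0 / np.abs(X[nz, j]).mean()
    # Autoscaling to zero mean and unit standard deviation, then x20 and truncation.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    X = np.trunc(20.0 * (X - X.mean(axis=0)) / std)
    # Variance weighting (assumed form): between-class over pooled within-class variance.
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    within = np.array([X[y == c].var(axis=0) for c in classes]).mean(axis=0)
    return X * means.var(axis=0) / (within + 1e-12)
```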
The mass spectral data were selected from a study which involved the generation of mass spectra for small organic molecules using structural information alone. Six hundred molecules were selected from the American Petroleum Institute Research Project 44 listing. Each molecule was coded with 61 structural fragment descriptors of both binary and numeric type. Preprocessing consisted simply of normalization such that the average of each component was in the same approximate range. Since a 600 by 61 array is too large to be handled conveniently, 150 compounds were selected as a random training set, with subsequent feature selection performed only on this subset. The class distinction was based on the presence or absence of a particular intensity level at a particular mass-to-charge ratio. The intensity levels used represented 0.1, 0.5, and 1.0% of the total ion current in the mass spectrum. The original study performed weight-sign feature selection on 60 m/e positions, 11 of which had three different intensity cutoffs, for a total of 82 binary pattern classifiers (BPCs). Only 15 BPCs were used here for comparison, corresponding to five m/e positions at three intensity cutoffs each. All data sets were linearly separable and required fewer than 2500 feedbacks for convergence.

The effects of preprocessing of the data on feature selection should be mentioned. It was noted previously that normalization was used for all the data sets, autoscaling for two data sets, and variance weighting for the drug data set. Normalization and autoscaling are used to "square up the space" (12) since, in most cases, the components are expressed in units which do not compare. The major effect of preprocessing is found in the speed of convergence in training; an order of magnitude decrease in the number of feedbacks over that required to train with raw data is not uncommon when the above preprocessing methods are used. However, in experiments not presented here, we have found the results obtained with variance feature selection to be, in general, independent of the degree of preprocessing applied to any data set. The same cannot be said for the other feature selection methods tested.

APPLICATION OF FEATURE SELECTION TO DATA SETS

In order to test the variance method of feature selection, it was applied to each of the three data sets. For comparison, the weight-sign feature selection method and the statistically based multivariate discriminant analysis program BMD07M (11) were used.

Artificial Data Sets (DGEN Data). Preliminary tests of variance feature selection were performed with the artificial data sets described above. The top section of Table I shows the results obtained when BMD07M was used with the DGEN data. The F-levels to include new variables and delete old variables were set at the default values (F to enter equals 0.01, F to delete equals 0.005). Percent recognition figures are given because the data sets were known to be linearly separable but BMD07M could not find separating discriminants in any of the five cases. In only one of the five cases (data set one, with five intrinsic and 45 nonintrinsic components) were the intrinsic components ranked by BMD07M as the most important components.

The weight-sign feature selection method was then applied to the DGEN data sets. Weight vectors obtained by the training algorithm are a function of the order in which the data are presented to the classifier, the weight vector initialization, and the value of x_n. By varying one or more of these parameters, a number of different weight vectors can be obtained and subjected to weight-sign comparisons. Weight-sign feature selection was developed empirically from the observation that, in many cases, the sign of a nonintrinsic weight vector component changed within a series of arbitrarily generated weight vectors. Many methods of generating weight vectors in order to exploit this observation exist.
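The sign-comparison idea can be captured in a few lines; in this illustrative sketch (the function name and the pairwise form are our own), any component whose sign differs between two weight vectors trained on the same data set is treated as nonintrinsic and dropped, and the rest are retained.

```python
import numpy as np

def weight_sign_survivors(w_a, w_b):
    """Indices of components whose signs agree in two trained weight vectors."""
    w_a, w_b = np.asarray(w_a), np.asarray(w_b)
    return np.flatnonzero(np.sign(w_a) == np.sign(w_b))

# Toy illustration: the second component flips sign between the two vectors and is dropped.
print(weight_sign_survivors([0.8, -0.1, 0.3], [0.7, 0.2, 0.4]))   # -> [0 2]
```

In practice the surviving columns would be retrained and the comparison repeated until no further components could be eliminated, as described for the two weight-sign methods below.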
The two methods presented here involve changing either the weight vector initialization (weight-sign method 1) or the order of data presentation to the weight vector being trained (weight-sign method 2), holding the other two parameters constant. In each case, the nth component, x_n, was given a value of 20. Six different initializations of weight vectors have been used in the weight-sign feature selection trials in this work:
(1) w_j = 1/√n for all j; (2) w_j = -1/√n for all j; (3) w_j = 1/√n for j = 1, 2, ..., n-1, and w_n = -1/√n; (4) w_j = -1/√n for j = 1, 2, ..., n-1, and w_n = 1/√n; (5) w_j = 0 for j = 1, 2, ..., n-1, and w_n = 1; (6) w_j = 0 for j = 1, 2, ..., n-1, and w_n = -1.

Weight-sign method 1 was implemented by selecting one of the three pairs of weight vector initializations [(1,2), (3,4), (5,6)], training on each of the DGEN data sets, and eliminating components by comparison of the resulting weight vectors. This process was repeated upon the reduced data set until no further eliminations were possible. Only a very small fraction of the nonintrinsic components was eliminated by this procedure.

The original order of the DGEN data sets was such that the first 100 data points were in class one and the second 100 in class two. Weight-sign method 2 was implemented by generating a weight vector using one of three initializations (1, 3, 6) and the original order of the data set. A second weight vector was obtained by using the same initialization and a random scrambling of the order of the data set. The two weight vectors were subjected to weight-sign comparison, and the appropriate components were eliminated. The reduced data sets were repeatedly subjected to this scrambling and training procedure until no further eliminations were possible. The results are presented in Table I. Again, only a fraction of the nonintrinsic components was eliminated by this procedure.

The variance feature selection method was then applied to the same five DGEN data sets, with the results shown in Table I. All the intrinsic components were identified and retained, and all the nonintrinsic components were discarded, in each of the five cases. The last two columns in the lowest section of Table I refer to the number of times error-correction feedback was used to derive a weight vector separating the classes; a substantial reduction is seen after the nonintrinsic components are discarded. In all cases, linear separability was shown by complete training to 100% recognition.

An example of the ordered list obtained from the variance method is shown in Table II. This list was obtained using only three weight vectors trained on DGEN data set one with initializations 1, 3, and 5. This data set has five intrinsic components, numbered one through five, and 45 nonintrinsic components, numbered six through fifty. As can be seen in Table II, the five dimensions responsible for separation appear lowest in the list, while those which are not necessary for separation are all ranked higher.

Since the total number of feedbacks required in a feature selection method is a good measure of the computational time, the total number of feedbacks required for weight-sign and variance feature selection with the DGEN data sets was computed. The variance method was found to require one-fourth to one-eighth as many feedbacks as the weight-sign method.

Mass Spectral Data Set. The mass spectral data have been subjected previously to weight-sign feature selection procedure 1 (9), and the results for selected m/e positions are reproduced in Table III. Table III also shows the results of applying variance feature selection to the mass spectral data set. The two columns of numbers of descriptors retained by iteration refer to the fact that the feature selection procedure was performed twice. On the first iteration, the ordered list was made and then reduced until removal of one more component caused linear separability to be lost.
For iteration two, this component was replaced and a new ordered list was developed and again reduced to a minimum. The number of components surviving this procedure should be compared with the number retained by weight-sign feature selection. Variance feature selection always
Table II. List of Calculated Variations by Component Using DGEN Data Set 1

Descriptor no.   Calculated variation      Descriptor no.   Calculated variation
35               9.6258                    7                0.1666
37               2.9912                    25               0.1526
36               1.0876                    41               0.1471
34               0.8762                    38               0.1456
16               0.7122                    20               0.1377
46               0.6876                    31               0.1318
11               0.5851                    43               0.1209
6                0.5024                    40               0.1073
21               0.4493                    45               0.1047
33               0.4034                    28               0.1003
47               0.3927                    27               0.0919
10               0.3924                    51               0.0860
42               0.3752                    18               0.0748
39               0.3673                    22               0.0674
15               0.3555                    29               0.0642
24               0.3037                    26               0.0612
32               0.2748                    8                0.0554
17               0.2689                    30               0.0542
13               0.2226                    9                0.0322
12               0.2209                    49               0.0210
44               0.2190                    4                0.0058
14               0.2150                    2                0.0051
23               0.2043                    3                0.0044
50               0.1904                    1                0.0025
48               0.1824                    5                0.0017
19               0.1822
Table III. Mass Spectra Generation Data

                                            Number of descriptors retained
                                                                  Variance method (a)
Mass spectral peak   Intensity cutoff, %   Weight-sign method   Iteration 1   Iteration 2
29                   0.1                   15                   9             5
29                   0.5                   18                   26            14
29                   1.0                   19                   22            15
39                   0.1                   14                   8             8
39                   0.5                   14                   15            6
39                   1.0                   13                   14            9
41                   0.1                   11                   7             4
41                   0.5                   15                   14            14
41                   1.0                   25                   24            16
43                   0.1                   19                   21            16
43                   0.5                   61                   21            17
43                   1.0                   61                   30            21
53                   0.1                   18                   9             9
53                   0.5                   26                   27            20
53                   1.0                   61                   20            20

(a) Initial x_n = 1, increment = 1.
yields a substantially smaller final number of components.

Drug Data. The stepwise discriminant analysis program BMD07M was applied to the drug data, with the results shown in Table IV. Two trials were run with this data set, which is known to be linearly separable. The trials differed only in the F-levels used to include or delete components. With 33 components included, the discriminant function could classify only 95.0% of the data set, and with 50 components only 96.8%. The data set is divided into two classes with populations of 145 and 74 drugs; neither of the two
Table IV. Drug Data

Method                         Initial no. of   Final no. of   Percent        Notes
                               descriptors      descriptors    recognition
Statistical analysis, BMD07M   69               33             95.0           F_include = 0.50, F_delete = 0.45, 43 steps
                               69               50             96.8           F_include = 0.25, F_delete = 0.20, 70 steps
Weight-sign method 2           69               38             100            Initialization 1, 8 iterations
                               69               44             100            Initialization 3, 5 iterations
                               69               34             100            Initialization 4, 9 iterations
                               69               40             100            Initialization 5, 7 iterations
                               69               40             100            Initialization 6, 9 iterations
Variance method                69               31             100            Initial x_n = 5, increment = 300
                               31               24             100            Initial x_n = 305, increment = 300
discriminant functions could classify all the members of either class correctly.

The drug data have been subjected to weight-sign method 2 previously (10), and the results for selected initializations are reproduced in Table IV for comparison. In every case, 100% recognition was retained.

The lowest section of Table IV shows the results obtained by applying variance feature selection to the drug data. The first iteration reduced the dimensionality from 69 to 31, and the second iteration further reduced the number to 24. A recognition rate of 100% was retained.
DISCUSSION

As shown in Tables I and IV, the statistical methods used could not be employed as feature selectors for these linearly separable data. This is because statistical methods require the data to be distributed according to various statistical forms. Methods are available that allow approximations to these criteria for a given data set. However, these methods, besides being time consuming, depend on the data set; addition of new components or data points requires retransformation of the entire data set. No such methods were applied to these data, and the results indicate that such transformations are necessary for the statistical methods to be successful.

Using variance feature selection, if one ranks components in order of decreasing relative variation as shown in Table II, then the following is observed. For DGEN data, there were no cases in which nonintrinsic components were ranked lower than intrinsic components. As noted previously, this required only one iteration, while applications to real data required two such iterations. The behavior of the real data can be explained by the presence of more than one set of intrinsic components, where each set of intrinsic components is sufficient to provide linear separability. In order to test this assumption, one such data set was generated. Table V shows the ordered list of components obtained using an artificially generated data set containing two sets of intrinsic components (components 1 through 10 and 11 through 20). Note that all intrinsic components in both sets were ranked lowest in the total list. Using the method previously described, elimination of components is possible down to and including component number 4. Elimination of component number 11 would destroy the second set of intrinsic components, which is the only set remaining. To further reduce the dimensionality, a new ordered list must be developed
Table V. List of Calculated Variations by Component Using Two Sets of Linear Components

Descriptor no.   Calculated variation      Descriptor no.   Calculated variation
51               2.5619                    41               0.1560
21               1.6857                    45               0.1485
33               1.3133                    28               0.1386
42               1.2755                    25               0.1182
34               1.0832                    29               0.1177
32               1.0204                    5                0.0333
39               0.7861                    4                0.0304
37               0.4585                    11               0.0275
35               0.4142                    13               0.0272
36               0.3787                    7                0.0204
47               0.3361                    16               0.0200
50               0.3336                    6                0.0179
31               0.3006                    3                0.0163
26               0.2919                    18               0.0162
23               0.2633                    15               0.0152
38               0.2599                    8                0.0151
43               0.2123                    2                0.0124
44               0.2083                    17               0.0122
22               0.2068                    14               0.0112
24               0.1982                    1                0.0106
27               0.1978                    19               0.0095
40               0.1944                    12               0.0073
48               0.1770                    9                0.0066
46               0.1704                    10               0.0025
49               0.1647                    20               0.0017
30               0.1605
for the reduced set, which results in the lowest ranking of components 11 through 20, one of the two intrinsic sets contained in the data. Note that the use of an ordered list enables one to obtain any such set of intrinsic components by suitable means. Therefore, the variance method of feature selection can be used to indicate not only all intrinsic components, but all possible sets of these components present in the data.

Although the variance method identifies those components which contribute to linear separability of two groups, this separability is not necessarily the prime relation between those groups. In cases where more than one set of components supports linear separability, it may be useful to investigate all such combinations. The variance method, by providing a means to identify such combinations, increases the probability of successfully elucidating the true dimensionality.

Ultimately, the success of elucidating the true dimensionality of any real data set in a given chemical problem depends on the data set used for the investigation. If the data do not contain all possible members of the set, then they constitute a subset of the larger, or universal, set. In general, the relationships within a given subset will not serve to elucidate the relationships within the universal set unless a significant fraction of the members of the subset lie at the extrema of the universal set. The ability to classify true unknowns will be limited by the extent to which the subset mimics the universal set.

Finally, as noted in the literature (13), a statistically sound measure of the reliability of a data set requires the ratio of the number of members in the data set, l, to the number of intrinsic dimensions, m, to be greater than three, i.e., l/m > 3. Note that a rotation does not change the value of m; such a rotation requires the same amount of information as was present in the nonrotated space. Therefore, no reduction in the number of features
required to classify the problem is effected. The variance feature selection method can be used to estimate the number of intrinsic components for a given data set and can therefore be used to ensure that the l/m > 3 criterion is satisfied.
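As a worked example using the figures reported above, the drug data set contains l = 219 compounds, and variance feature selection reduced it to 24 retained components; taking m = 24 gives l/m = 219/24, roughly 9, which comfortably satisfies the l/m > 3 requirement.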
SUMMARY

The variance method is a feature selection process and not a simple reduction of data space dimensionality. Since the method is capable of finding the minimum number of components necessary to separate the classes, it can easily be used to provide a true measure of m such that the l/m criterion is met. It has been shown, for real data of several types, that the intrinsic dimensionality is usually only a small fraction of the original number of descriptors used; therefore, no limit is usually placed on the transducer other than the storage requirements of the system. The ordered listing obtained gives the variance method an additional advantage in that all unique linear combinations of components that exist in the data set can be selected. This is extremely useful in differentiating artifactual from true components or sets of components, since one is normally dealing with subsets of universal sets. Understanding of the principles involved enables a variety of algorithms to be implemented, only two of which have been presented here. These algorithms were designed with computational time as a major criterion, and considerable savings were shown with respect to other feature selection methods used in the past. Finally, the variance method appears to be unaffected by the type of data used, requiring only a linearly separable case. The variance method of feature selection should prove to be of wide applicability in future studies of pattern recognition in chemistry.
APPENDIX

Development of the variance method was in large part due to the ability to create a low-dimensional model of the processes occurring in the high-dimensional spaces dealt with when applying pattern recognition to chemical problems. In this section, the models which led to the development of the method are explained.

As an illustrative example, Figure 2 shows a data set with one intrinsic dimension, y, and one nonintrinsic dimension, x. The third dimension, z, is necessary to ensure a common origin. The two classes are linearly separable, and the origin is at point O. Let Δy be the minimum distance in the y dimension between the two data sets. Given no x component, Δy represents the maximum range of the intersection of the separating surface, plane OBC, with the y axis. Note that the y dimension alone is sufficient to provide 100% recognition of the two groups. Note also that inclusion of an x component expands the range of y-axis intersection to Δy'. The separating surface now has an expanded range through which it may pass and still separate the two classes. After addition of the x component, the separating surface is constrained to (1) intersect the y axis only in the region labeled Δy', (2) pass between the two classes without "touching" them, (3) pass through the origin at all times, and (4) during any net movement retain the same "side" toward the respective classes. These constraints can be related to those imposed on a unit-length vector (the weight vector) which is perpendicular to the separating surface at the origin. It is this vector that is shifted during the training process. The plane can be considered to impose the above constraints upon the weight vector, since the weight vector must follow the movements of the surface in order to maintain its perpendicularity, as well as to satisfy the constraints.
Figure 2. Data set of two linearly separable classes with one intrinsic component and one nonintrinsic component. A separating plane is shown.
Figure 3. View of Figure 2 down the x-axis.

The weight vector has three components, W_x, W_y, and W_z. It is to be demonstrated that, when a number of decision surfaces, each separating classes one and two, and each with an associated weight vector, are investigated, the relative variation in the magnitude of W_x (the nonintrinsic component) will be greater than the relative variation in the magnitude of W_y (the intrinsic component). To attack this question, one must first understand how the weight vector components change as the data set is moved further from the origin. Figure 3 represents the projection, W_p, of the weight vector W onto the y-z coordinate plane. Here ψ is the angle between W_p and the z axis, θ is the angle between the projected separating surface, B, and the z axis, and φ is the angle between B and W_p. In order to classify the data correctly, the separating surface may pass only within the region Δy'. The absolute position of Δy' is fixed by the data set, and W_x is constant for a given skew of the separating surface in the x dimension (the nonintrinsic dimension). An increase in the distance z_1 causes a decrease in θ, an increase in ψ, and therefore an increase in W_y and a decrease in W_z. Define Δθ as the change in θ as the projection of the weight vector moves across the allowed range Δy'; define ΔW_y and ΔW_z analogously. Since Δy' is constant, an increase in z_1 causes a decrease in Δθ, a decrease in Δψ, and therefore a decrease in ΔW_y and ΔW_z. The further one moves out on the z axis, the smaller the variation in the W_y component for an allowable change in θ. At large z_1 values, the magnitude of W_y becomes essentially independent of changes in θ. The only other contribution to the variation of W_y must then be due to changes involving movement of the weight vector in the x-y coordinate plane. If the relative
Figure 4. View of Figure 2 down the z-axis.

Figure 5. Plot of R_y/R_x vs. α.

Figure 6. A representation of the (n - 1) space as viewed down the z axis.
variation in W_x is greater than the corresponding relative variation in W_y, then a means to distinguish intrinsic from nonintrinsic components is provided.

The variations of components W_x and W_y can be defined as V_x and V_y, and the average magnitudes of these components as W̄_x and W̄_y. Then the relative variation of each component can be defined as follows:

R_x = V_x / |W̄_x|,     R_y = V_y / |W̄_y|

with V_x² = Σ_k (W_{x,k} - W̄_x)² / n_k (and V_y² defined analogously), where n_k is the number of weight vectors generated to approximate a spanning set. These definitions approach the definition of variance for a statistically large number of weight vectors. It will be shown that, for a sufficient distribution of weight vectors in any range allowable by the geometry of the sets, R_x > R_y, i.e., the nonintrinsic component varies more than the intrinsic component.

In Figure 4, W_p is the weight vector projected from the n-dimensional pattern space (in this case three) into the (n - 1) space, the x-y coordinate plane. Since the z component is fixed, this projection results in a vector of constant length whose x and y components are, respectively, W_x and W_y. Only changes in the projected weight vector are of interest. Let |W_p| = B. The vector's components can now be expressed as

W_x = B sin α
W_y = B cos α

where α is the angle between the projected vector W_p and the y axis (the intrinsic dimension). We will consider α to range over 0 ≤ α ≤ π. If α is continually varied over this range, then the average value of any component will be given by its expectation value over that range. The expectation of a function g(x) may be written

⟨g(x)⟩ = ∫ g(x) P(x) dx

where P(x) is the probability of observing a specific value of x somewhere in the range of g(x). For this problem

P(α) = 1 / ∫₀^π dα = 1/π

The expectation values are then

⟨W_x⟩ = ∫₀^π (B sin α)(1/π) dα = (B/π)[1 - cos π]
⟨W_y⟩ = ∫₀^π (B cos α)(1/π) dα = (B/π) sin π

The variance of the x and y components over the same range is given by the second central product moment, which is

σ² = ∫ g(x)² P(x) dx - ⟨g(x)⟩²

so the x and y moments can be written as

σ_x² = ∫₀^π [B² sin² α][1/π] dα - ⟨W_x⟩² = (B²/π)[π/2 - (1/4) sin 2π - (1/π)(1 - cos π)²]
σ_y² = ∫₀^π [B² cos² α][1/π] dα - ⟨W_y⟩² = (B²/π)[π/2 + (1/4) sin 2π - (1/π) sin² π]

The measurement of these moments defines the variance in the limit of an infinite number of uniquely generated vectors. If we let

W̄_x = ⟨W_x⟩,  V_x² = σ_x²
W̄_y = ⟨W_y⟩,  V_y² = σ_y²

then the relative variation as measured in the text, R = V/|W̄|,
can be seen to approach the relative standard deviation in the limit of an infinite number of uniquely generated vectors. A graph of the ratio R_y/R_x is presented in Figure 5. Note that within the ranges 0° ≤ α < 90° and 270° < α ≤ 360° this ratio is less than one. A value outside of these ranges violates one or more of the constraints imposed upon the plane. If a sufficient number of vectors are generated within
those ranges, such that a valid measurement of V_x and V_y can be made, then R_x > R_y. Moreover, for any number of generated vectors in the range -45° ≤ α ≤ 45°, this will always be true, since within this range |∂W_x/∂α| is always greater than the corresponding change in W_y. Within this range, development of as few as three weight vectors may be sufficient to measure the variations. On the range -π/2 < α < π/2, the expected value of W_x is 0 while that of W_y is 2B/π. If the allowable skewness due to the nonintrinsic component is symmetric about y, then the relative variation in the x component approaches infinity while the variation in the y component remains relatively small.

Additional nonintrinsic components show no net effect upon these relations. This can be demonstrated using a four-dimensional data set whose components are labeled x_1, x_2, y, and z, corresponding to two nonintrinsic components, one intrinsic component, and a component to ensure a common origin. Given that the data's z component has been optimized as described in the text, Figure 6 shows the resultant projection along the z axis from the n = 4 space into the (n - 1) = 3 space. Since z is at a large value, its weight vector component is approximately fixed for any set of planes generated. The projected weight vector, r, is therefore of constant length. Using polar coordinates, the relations of interest for 0 ≤ θ ≤ π and any single value of φ are
W_x1 = r cos φ sin θ
W_y = r cos φ cos θ

P(θ) = 1 / ∫₀^π dθ = 1/π

⟨W_x1⟩ = ∫₀^π [r cos φ sin θ][1/π] dθ = (r cos φ / π)[1 - cos π]

This set of equations differs from the two-space example only by a constant involving φ. In the case where φ = 0, the equations become those of the two-dimensional case. Note that in the expression for the relative variation the φ dependence is lost, and the equation takes on a form identical with that of the two-dimensional case. It would then follow that an n-dimensional coordinate frame could be defined such that the relations expressed in the previous two examples could be extended to a higher space; this would simply involve the addition of terms to the constant. No such development is made here, since the two-dimensional model serves to indicate which parameters of a BPC should be optimized in order to select those features which are "intrinsic" to the separating process.

The model is now complete. Basically stated, it says that 1) increasing the distance of the data set from the origin constrains the change in the component of the weight vector corresponding to the intrinsic dimension to be much less than the changes in the position of the separating surface between the two classes, and 2) the variation in the intrinsic component of a weight vector is due to changes in the nonintrinsic component; the nonintrinsic component, however, is able to change more than the intrinsic one. Therefore, measurement of the relative variation of the weight vector components at large values of z_1 (x_n) should indicate those dimensions which offer no useful information to the linear learning machine.

ACKNOWLEDGMENT

The authors express their gratitude to Phillip Kleinschmidt for his many helpful discussions and comments.
LITERATURE CITED
(1) H. C. Andrews, "Mathematical Techniques in Pattern Recognition", Wiley-Interscience, New York, NY, 1972.
(2) W. S. Meisel, "Computer-Oriented Approaches to Pattern Recognition", Academic Press, New York, NY, 1972.
(3) P. C. Jurs, Anal. Chem., 42, 1633 (1970).
(4) R. W. Liddell III and P. C. Jurs, Appl. Spectrosc., 27, 371 (1973).
(5) D. R. Preuss and P. C. Jurs, Anal. Chem., 46, 520 (1974).
(6) L. B. Sybrandt and S. P. Perone, Anal. Chem., 43, 382 (1971).
(7) N. J. Nilsson, "Learning Machines", McGraw-Hill, New York, NY, 1965.
(8) T. L. Isenhour and P. C. Jurs, Anal. Chem., 43 (10), 20A (1971).
(9) J. Schechter and P. C. Jurs, Appl. Spectrosc., 27, 30 (1973).
(10) A. J. Stuper and P. C. Jurs, J. Am. Chem. Soc., in press.
(11) W. J. Dixon, Ed., "BMD-Biomedical Computer Programs", 3rd ed., University of California Press, Berkeley, CA, 1973.
(12) Reference 2, p 10.
(13) Reference 2, p 15.
Received for review September 10, 1974. Accepted February 24, 1975. The financial support of the National Science Foundation is gratefully acknowledged.