Constraints on "learning machine" classification methods

degrees above the normal melting point of tetradecanol) which is related to the ... aspects to the learning machine methods which appear to have. -rec...
2 downloads 0 Views 548KB Size
a

Undoubtedly there must be a retention effect at the liquid-gas interface for light alcohols on hydrocarbon stationary phases, as follows from Pecsok and Gump's (2) static experiments. But contrary to the case of an apolar solute on a polar solvent, the necessary use of a nonsilanized support will impose the very difficult task of separating the adsorption at the liquid-gas interface from the prevailing adsorption on the solid and from the strong partitioning effect. The solution of this problem will doubtless demand very precise measurements.


Figure 1. (1) Solid support, (2) gas-liquid interface, (3) bulk capillary liquid; (a) with a silanized siliceous support, (b) with an untreated siliceous support

... order of magnitude of the area of the above-quoted meniscus when applied to a coated silanized Chromosorb. In fact, in agreement with Krejci (15) and with Berezkin and co-workers (16), we have shown that the surface area of a coated porous solid depends only on the volume of the adsorbate and not on its nature. This state of affairs is particularly rigorous when there is no unilayer on the surface. Therefore, by coating a Chromosorb P DMCS first with 20% of glycerin and then with 0.3% of tetradecanol, we brought out a very small increase in the retention of octane at 47 °C (ten degrees above the normal melting point of tetradecanol), which is related to the bidimensional melting of a small film of fatty alcohol at the glycerin-air interface. The amplitude of the observed transitions for different experiments corresponds to a surface area A_I of 0.005 to 0.02 m²/g of support. This result is 30 to 100 times lower than the value taken by Conder (1) in his tentative calculation using literature data. Our conclusion is that the silanized supports generally used in research on adsorption at the liquid-gas interface in chromatography are in fact unsuitable for this sort of study.

LITERATURE CITED
(1) J. R. Conder, Anal. Chem., 48, 917 (1976).
(2) R. L. Pecsok and B. H. Gump, J. Phys. Chem., 71, 2202 (1967).
(3) J. F. Parcher and C. L. Hussey, Anal. Chem., 45, 188 (1973).
(4) R. L. Martin, Anal. Chem., 35, 116 (1963).
(5) J. R. Conder, D. C. Locke, and J. H. Purnell, J. Phys. Chem., 73, 700 (1969).
(6) W. A. Zisman, J. Paint Technol., 44, 41 (1972).
(7) H. L. Liao and D. E. Martire, Anal. Chem., 44, 498 (1972).
(8) P. Urone and J. F. Parcher, Anal. Chem., 38, 270 (1966).
(9) J. C. Giddings, Anal. Chem., 34, 458 (1962).
(10) J. Serpinet, Chromatographia, 8, 18 (1975).
(11) J. Serpinet, unpublished results.
(12) S. J. Gregg and K. S. W. Sing, "Adsorption, Surface Area and Porosity", Academic Press, New York and London, 1967, pp 93-108.
(13) J. Serpinet, G. Untz, C. Gachet, L. de Mourgues, and M. Perrin, J. Chim. Phys. Phys.-Chim. Biol., 71, 949 (1974).
(14) J. Serpinet, J. Chromatogr., 119, 483 (1976).
(15) M. Krejci, Collect. Czech. Chem. Commun., 32, 1152 (1967).
(16) V. G. Berezkin, D. Kourilova, M. Krejci, and V. M. Fateeva, J. Chromatogr., 78, 261 (1973).

J. Serpinet
Laboratoire de Chimie Analytique III
Université de Lyon I
43 Bd du 11 Novembre 1918
69621 Villeurbanne, France

Received for review July 9, 1976. Accepted September 7, 1976. Work performed for the Centre National de la Recherche Scientifique, E.R.A. No. 474.

Constraints on “Learning Machine” Classification Methods

Sir: Several different “pattern recognition” methods are currently under development; the different types of method have been presented in the tutorial article by Kowalski and Bender (1). The majority of the earlier work in pattern recognition was concentrated on pattern classification by linear discrimination functions derived by linear feedback methods. Many examples of such approaches are reviewed in a recent textbook (2). Linear feedback classifiers (“learning machines”) have been adversely compared with other pattern recognition methods such as k-nearest-neighbor matching schemes (3). Criticisms have concentrated on such features of the learning machines as non-uniqueness of solutions, poor performance if the training set is not linearly separable, and an unbounded classification risk. There are additional aspects of the learning machine methods which appear to have received inadequate discussion. These aspects concern dimensionality constraints and other factors relating both to the convergence of training procedures and to the evaluation of classification functions in terms of overall prediction rates. Using numerical data derived from known mathematical functions, attempts will be made to illustrate the following points. The successful training of a linear classifier to 100% recognition does not necessarily provide substantive evidence for a particular spectrum/structure relationship. Binary classification schemes incorporating large numbers of feature measurements (e.g., intensities at m/e positions) are inherently spurious unless adequate numbers of training compounds are available in each class.

Some conventional techniques for aiding convergence of training procedures appear not to have been exploited in those applications of learning machines reported in the chemical literature. Finally, the technique of using overall prediction rates as a measure of a classifier’s value is unsatisfactory.

Dimensionality Problems. Two aspects of learning machines can be conveniently considered under the general heading of dimensionality problems. First, it is necessary to consider those constraints relating the dimensionality (d) of the feature space to the number of examples available for the development of a classifier (n). These constraints are important, for they determine whether or not successful training of a binary linear classifier can be taken to establish the existence of a significant spectrum/structure relationship. The second aspect is somewhat simpler and requires only the consideration of the minimum number of examples that should be used to characterize a single class. Apart from an empirical study on the effect of varying the n to d ratio (4), the first consideration of constraints relating d to n appeared in a note by Bender et al. (5). Bender proposed that the sample size per class should be at least three times the feature size, this constraint being suggested by the results of studies by Foley and others (6, 7). Many of the earlier applications of learning machines have n ≤ 2d, and papers still


Figure 1. Plots of the function f(n,d) for a number of dimensionality values typical of those used in spectrum/structure classification schemes

appear that use excessively high dimensionalities. If n ≤ 2d, then the establishment of a separating linear hyperplane for some set of training examples provides very little evidence for a new spectrum/structure relationship. Formula 1 defines the fraction f(n,d) of the 2^n dichotomies of n points in d dimensions that would be linearly separable (8).

\[
f(n,d) =
\begin{cases}
\dfrac{1}{2^{\,n-1}} \displaystyle\sum_{i=0}^{d} \binom{n-1}{i}, & n > d+1 \\[2ex]
1, & n \le d+1
\end{cases}
\tag{1}
\]

These linear dichotomies are those labelings of points in general position such that all class 1 points can be separated from those in class 2 by a hyperplane, such as that generated by a learning machine procedure. Figure 1 shows this function for a number of dimensionality values similar to those that have been used in typical learning machine applications. Thus, dimensionality 155 was used in a study on the detection of oxygen directly from mass spectra (9); a related study on the prediction of about 60 different structural features does not specify the number of m/e positions used, but it appears that it was again around 155 (10). In this second extensive investigation, training sets of 200 were used to develop discriminant functions for the identification of such complex properties as maximum ring size or number of branch-point carbons directly from a compound’s mass spectrum. As f(200,155) = 0.9999, the fact that a linear dichotomy can be found does not in itself provide really substantial evidence for the thesis that mass spectral properties are directly related to such molecular attributes as ring size. A number of other studies have appeared in which n is actually less than d, thus allowing any desired dichotomy to be established. Bender’s rule can be seen to allow only a very small fraction of dichotomies in any high-dimensional space; with a large sample size to feature size ratio there is a very low probability of a chance occurrence of a linear dichotomy, and so the successful training of a learning machine classifier can be construed as evidence for a significant spectrum/structure relation.
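As an aside for readers who wish to check such figures, Formula 1 is easily evaluated numerically. The short sketch below is purely illustrative (it is written in modern Python and the function name is arbitrary); it reproduces, to the rounding used here, values such as f(50,30) ≈ 0.96 and f(50,18) ≈ 0.04.

```python
from math import comb  # Python 3.8+

def linear_dichotomy_fraction(n: int, d: int) -> float:
    """Formula 1: fraction of the 2**n dichotomies of n points in general
    position in d dimensions that are linearly separable."""
    if n <= d + 1:
        return 1.0
    # (1 / 2**(n-1)) * sum_{i=0}^{d} C(n-1, i)
    return sum(comb(n - 1, i) for i in range(d + 1)) / 2 ** (n - 1)

if __name__ == "__main__":
    for n, d in [(50, 30), (50, 18), (50, 16), (200, 155)]:
        print(f"f({n},{d}) = {linear_dichotomy_fraction(n, d):.4f}")
```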

The results given in Table I further illustrate the (n,d) dimensionality problem. The data used for these examples consisted of 200 items, each characterized by a 30-dimensional vector (equivalent to a system of 200 compounds characterized by 30 m/e positions). The values assigned to the 200 × 30 elements were taken from a random number generator based upon a standard normal distribution. Subsequent to data generation, the 200 vectors were partitioned into two unequal sets, characterized as the +ve set and the -ve set; the partition was derived using a uniform random number generator. The two sets so created have no genuine basis for discrimination. From the total set of 200 items, a training set of 50 and a prediction set of 150 members were arbitrarily selected. The size of the training set and the dimensionality were chosen so that an arbitrary assignment of classes to items would still have a reasonably high probability of being a linear dichotomy (f(50,30) = 0.96). The learning machine program utilized SUBROUTINE TRAIN of Jurs and Isenhour (2). Table I shows the results obtained for a series of completely separate “random” data sets (for one additional data set, the learning process did not converge; it can be assumed that this was one of the 4% that would not be linear dichotomies). Even in examples such as these, with no valid basis for class discrimination, the learning machine procedure generally converges to 100% recognition and apparently can provide fairly high rates of prediction. As is normal practice, the linear classification functions generated by the training procedure were evaluated by predicting the (arbitrary!) +ve/-ve class assignments of the remaining 150 random vectors in the prediction sets.

The apparent high predictive abilities provide an interesting illustration of the second problem: the requirement for a minimum number of examples to characterize a compound class. An analysis of the form of the minimum error rate discriminant function for such data will reveal why, although there may be no genuine differences between classes, prediction rates greater than the 50% expected for random guessing will be obtained if the numbers of examples from the two classes differ in the training set. Simple distributions, such as the multivariate normal distributions in the example data, are amenable to an analytic treatment that can define the form of the minimum error rate discriminant function. In the case of the example data, the features are generated independently from normal distributions with the same variance σ². This results in certain simplifications to the analytic treatment, and it can be shown (8) that the minimum error rate discriminant function has the linear form given in Formula 2:

\[
g(\mathbf{x}) = \frac{1}{\sigma^{2}}\,(\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{2})^{\mathsf{T}}\mathbf{x} + \log\frac{p(c_{1})}{p(c_{2})}
\tag{2}
\]

where x is the feature vector being classified, μ₁ is the mean feature vector for class 1, μ₂ is the mean feature vector for class 2, and p(c₁) and p(c₂) are the a priori probabilities of the two classes occurring. If g(x) is positive, then x is assigned to class 1. Formula 2 defines a hyperplane orthogonal to the line joining the means of the two classes; unequal probabilities cause the plane to move away from the center of the more probable class. Unlike the learning machine discriminant, this function is not constrained to be 100% correct on the training set. Table I shows the achieved recognition and prediction success rates for the optimal linear classifiers derived for the same sets of 50 training items as were used in the learning machine tests. (The numbers in Table I were obtained using the frequencies of examples in each class in the training set to provide the a priori probabilities; taking the probabilities as equal still gives an average 88% “recognition” and 68% “predictive” ability.) Again, although working from pure random data, it appears that linear classification schemes can achieve high rates of recognition and prediction.

The origin of the apparent recognition and predictive abilities lies in large part in the term (μ₁ − μ₂)ᵀx. The mean vectors μ₁ and μ₂ for the two classes are of course determined from the examples in the training set. As the two classes are in fact drawn from the same distribution, the estimated mean vectors μ₁ and μ₂ would be identical if obtained on the basis of a sufficient number of examples. Unfortunately, a mean value estimated on the basis of a small number of samples can be quite substantially in error.
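The Formula 2 classifier whose results appear under “Formula” in Table I is simple to construct from a training set. The sketch below is an illustration only; it is not the program actually used here, the helper names are hypothetical, and the common variance σ² is supplied by the caller (1.0 for the standardized random data described above). It estimates the two class mean vectors and the a priori probabilities from labelled training vectors and then evaluates g(x).

```python
import math
from typing import List, Sequence, Tuple

def formula_2_discriminant(pos: List[Sequence[float]],
                           neg: List[Sequence[float]],
                           sigma2: float = 1.0) -> Tuple[List[float], float]:
    """Build g(x) = w.x + w0 of Formula 2 from a labelled training set.

    pos, neg -- training feature vectors of the two classes
    sigma2   -- common per-feature variance assumed for all features
    """
    d = len(pos[0])
    mu1 = [sum(x[j] for x in pos) / len(pos) for j in range(d)]  # class 1 mean
    mu2 = [sum(x[j] for x in neg) / len(neg) for j in range(d)]  # class 2 mean
    w = [(a - b) / sigma2 for a, b in zip(mu1, mu2)]
    # log ratio of a priori probabilities estimated from class frequencies
    w0 = math.log(len(pos) / len(neg))
    return w, w0

def classify(w: List[float], w0: float, x: Sequence[float]) -> int:
    """Return +1 (class 1) if g(x) > 0, otherwise -1 (class 2)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + w0 > 0 else -1
```

Because the n's cancel, the log ratio of training-set class counts equals the log ratio of frequency-estimated priors; setting w0 to zero corresponds to taking the priors as equal.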


Table I. Results Illustrating How Apparent Linear Classification Schemes Can Be Derived Even in Cases Where There Is No Genuine Difference between Classes

            Class composition              "Learning machine"              "Formula"
Data-set    Training set +/-  Prediction set +/-   Feedbacks  % Prediction   % Recognition  % Prediction
   A            47/3              134/16               72          88             100            89
   B            43/7              129/21              164          76              96            83
   C            45/5              126/24              591          74              92            69
   D            42/8              125/25              247          71              96            77
   E            38/12             115/35             2183          58              84            71
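Readers wishing to repeat an experiment of the Table I kind can do so with a very small program. The sketch below is an illustrative reconstruction, not the SUBROUTINE TRAIN code of ref. 2: it uses the classic fixed-increment perceptron correction rather than the reflection rule, so the feedback counts will not match those in Table I, and the data-generation parameters (30 features, a roughly 90/10 class split, an arbitrary random seed) merely mimic the description given in the text.

```python
import random

def make_random_data(n_items=200, d=30, pos_fraction=0.9, seed=1):
    """Random 'spectra': n_items vectors of d independent standard-normal
    features with an arbitrary, deliberately unequal +1/-1 labelling
    (so there is no genuine basis for discrimination)."""
    rng = random.Random(seed)
    x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_items)]
    y = [1 if rng.random() < pos_fraction else -1 for _ in range(n_items)]
    return x, y

def train_linear_classifier(train_x, train_y, w0=None, max_feedbacks=50000):
    """Error-correction training of a linear threshold function.

    Each pattern is augmented with a constant 1 so the threshold is trained
    as an ordinary weight.  A fixed-increment perceptron correction is used;
    it converges whenever the labelling is a linear dichotomy.  Returns
    (weights, feedbacks), with weights = None if the feedback cap is hit."""
    w = list(w0) if w0 is not None else [0.1] * (len(train_x[0]) + 1)
    feedbacks = 0
    while feedbacks < max_feedbacks:
        errors = 0
        for x, y in zip(train_x, train_y):
            xa = list(x) + [1.0]
            s = sum(wi * xi for wi, xi in zip(w, xa))
            if s * y <= 0:                               # wrong side: apply feedback
                w = [wi + y * xi for wi, xi in zip(w, xa)]
                feedbacks += 1
                errors += 1
        if errors == 0:                                  # 100% recognition reached
            return w, feedbacks
    return None, feedbacks

if __name__ == "__main__":
    x, y = make_random_data()
    w, fb = train_linear_classifier(x[:50], y[:50])      # 50 training items
    if w is not None:
        ok = sum((sum(wi * xi for wi, xi in zip(w, xv + [1.0])) * yv) > 0
                 for xv, yv in zip(x[50:], y[50:]))      # 150 prediction items
        print(f"{fb} feedbacks; apparent prediction {100 * ok / 150:.0f}%")
```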

Table II. Results Illustrating the Advantage in Improved Training Rate That Can Generally Be Realized if the Learning Procedure Is Initialized Using the Class Means Difference Vector Rather than a Random Initialization

            Arbitrary initialization                              Formula initialization           Expected overall
Data-set    Feedbacks  Overall prediction, %  Prediction by class +/-, %   Feedbacks  Overall prediction, %   discriminant power, %
   A           280            73                     85/23                    182           77                       81
   B           203            83                     93/49                    192           85                       87
   C           329            86                     98/9                     116           85                       91
   D           430            79                     78/83                      0           88                       89
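The class-means initialization recommended here is equally simple to set up. The sketch below is again only an illustration with hypothetical names: it builds an augmented initial weight vector from Equation 2, with the last element (which multiplies the appended constant term) carrying the log ratio of the estimated a priori probabilities; setting use_priors to False gives Equation 2 without the a priori term. Such a vector could be passed as the starting weights (the w0 argument) of an error-correction trainer like the one sketched after Table I.

```python
import math

def class_means_initial_vector(train_x, train_y, sigma2=1.0, use_priors=True):
    """Initial (augmented) weight vector built from Equation 2.

    The first d elements are (mu1 - mu2)/sigma2; the final element is
    log(p(c1)/p(c2)) estimated from class frequencies, or 0 if use_priors
    is False.  train_y holds +1/-1 class labels."""
    pos = [x for x, y in zip(train_x, train_y) if y > 0]
    neg = [x for x, y in zip(train_x, train_y) if y < 0]
    d = len(train_x[0])
    mu1 = [sum(x[j] for x in pos) / len(pos) for j in range(d)]
    mu2 = [sum(x[j] for x in neg) / len(neg) for j in range(d)]
    w = [(a - b) / sigma2 for a, b in zip(mu1, mu2)]
    w.append(math.log(len(pos) / len(neg)) if use_priors else 0.0)
    return w
```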

If, for example, only five spectra are available to characterize a compound class, then for any given feature dimension there is a small chance (of the order of 2.5%) that the estimated mean value will be in error by a full standard deviation. It follows that, in a problem with 32 independent dimensions, the chance that no feature dimension has such an error is less than 50%. The actual mean vectors calculated for the random data sets of Table I show the expected statistical fluctuations, with the mean vector for the minority class being more perturbed from the true mean in every case. These differences in the mean vectors enter the discriminant function either explicitly, as in the formula, or implicitly, through the “learning machine” training procedure. The feature vectors of the members of the prediction sets will be randomly distributed about the true mean and consequently will tend to be closer to the mean estimated for the majority class in the training set than to that found for the minority class. Therefore, the prediction success rate will be higher for members of the majority class (assuming of course that the class distributions in training and prediction sets are similar). This effect is observed for the examples of random data; thus, for data-set A the “learning machine” success rate for the +ve class members was 97% while that for -ve class members was only 6%. (It has been a fairly common observation that a “learning machine” classifier is more successful for the majority than for the minority class in actual applications of these procedures; see, for example, ref. 2, p 96.) The details of this discussion apply only to the multivariate normal case, but it is reasonable to suggest that other types of multidimensional data would also require minimal numbers of examples to define a class. Such considerations suggest that a small proportion of the published discriminant functions, derived for example in ref. 10, are unlikely to be of real significance; examples might be the oxygen > 2 discriminant with only five examples defining the positive class, or the nitrogen > 1 discriminant with only four examples.

Problems of Convergence and Evaluation. Conventionally, chemical applications of learning machine approaches have used arbitrary initial weight vectors with all


elements set equal to values such as 0.1 or 1/d. This normally has two unsatisfactory effects: it slows convergence and it reduces the likelihood of the learning machine procedure converging on the best discriminant (best in the sense of highest prediction rate). It is generally better practice to use a formula such as Equation 2 to define an idealized initial vector. Some accentuated examples are provided by the data in Table II. The data are essentially the same as those used in Table I, except that in these new data there is a genuine difference between the classes in the mean of one feature variable. As the data are taken from known parametric functions, it is possible to compute the recognition rate that can be achieved on the basis of the one discriminating variable. Table II details the number of feedbacks to convergence and the achieved prediction rates both for an arbitrary initial vector (all elements 0.1) and for training from the vector defined through Equation 2. In this highly artificial example the differences between the methods are probably exaggerated, but they do exemplify the expected traits of poorer convergence and lower predictive power consequent upon arbitrary initialization. Another example can be obtained from the illustrative data of ref. 2. This example involves the training of a five-variable function on a set of 80 items from a linearly separable system. The example requires 254 feedbacks to converge from an initial arbitrary vector, whereas only three feedbacks suffice if the initialization uses the difference of the class mean vectors, i.e., Equation 2 without the a priori probabilities. Another trend, just barely evident in the data given in Table II, is a decrease in prediction success rate due to training to 100% recognition; such a decrease might be expected, and it occurs in a sufficient number of other examples to make it appear probable that the difference is significant.

Small differences in individual prediction rates, such as those in Table II, are not statistically significant. It does not generally appear to have been recognized that prediction rates obtained through single tests on small samples are subject to quite large statistical fluctuations. Highleyman has provided plots giving the 95% confidence interval for different values


of obtained prediction success rate on a number of test samples (11). For small tests this range is large. For example, if a 90% prediction rate is obtained for 100 test examples, then the corresponding true prediction rate can lie anywhere from about 82 to about 95% (even with 250 test examples the range is approximately 85 to 93%). Apart from the fact that the prediction rate is thus a fairly arbitrary value, its use as a measure of a function’s discriminating power has another disadvantage. An overall percentage success rate invites comparison with “50% for random guessing”. As has been noted in some earlier work (12), this is scarcely a fair basis for comparison if either the actual class sizes or the success rates for the two classes differ substantially (a good example is data-set C from Table II, with an overall success rate of 86% but only 9% of the minority class correctly identified). If a single number must be used to summarize the discriminating power of a function, then it is probably more reasonable to quote the average of the success rates for the two classes, as in recent work by Lowry et al. (13).

Given some degree of difference between classes on one feature variable, it is always possible to achieve linear separability in the training set: “noise” feature dimensions can be added until “learning machine” convergence is attained. The predictive ability of such an augmented linear function will be determined by the extent to which it reflects the true discriminating power of the genuine features. The required number of “noise” dimensions will depend on the actual distribution of points in the training set and on how the “noise” is generated. A simple example of such an effect is provided by data-set D of Table II. The recognition and prediction rates derived solely from Formula 2 varied randomly in the range 88-93% as the number of noise dimensions was increased from 0 to 29. All the tests on the learning machine approach used an arbitrary initial weight vector (all elements 0.1). As the number of noise dimensions was decreased from the original 29, a generally increasing number of feedbacks was required for convergence (550 feedbacks with 25 noise variables, 3590 feedbacks with 17 noise variables). Convergence could not be obtained in fewer than 5000 feedbacks when the number of dimensions had been reduced to 16 (15 noise) or less. The expected fractions of linear dichotomies for these dimensionalities are f(50,18) = 0.04 and f(50,16) = 0.01. For those examples that converged, the prediction rate was in the range 78-81%; as has been noted elsewhere (3), the linear functions obtained upon arbitrary termination of the training procedure generally have very little predictive ability.

Concluding Comments. In a large proportion of published work, constraints either on overall dimensionality or on minimal sample sizes have been violated. While such violations do not necessarily invalidate the results so obtained, they do imply greatly reduced significance. Dimensionality and size constraints should be an important factor in the design of any proposed linear classifier. Bender’s empirical rule relating n and


d does provide a simple test of the acceptability of a training scheme for finding separating planes. An alternative criterion for acceptability would be to define, in advance of training, a limit on the probability of a linear dichotomy being found by chance; reasonable limit values might be 0.05 or 0.01. Once a limit has been selected, Equation 1 may then be used to determine the maximum number of dimensions allowable for a training set of given size (a short calculation of this kind is sketched below). In order to satisfy dimensionality constraints it will normally be necessary to eliminate all but a few significant features. The problem of feature selection has recently been one of the major areas of development of learning machine and other pattern recognition methods (2, 14-17). An effective feature selection scheme is essential for the development of significant discriminant functions. Of course, the reduction of dimensionality may lead to the loss of linear separability, in which case learning machine methods become inappropriate. Some alternative pattern recognition methods, such as k-nearest-neighbor matching (1, 3), are less limited by separability requirements and so may prove more generally useful.
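To make the suggested criterion concrete, the sketch below (illustrative only; it repeats the Formula 1 helper so that it stands alone, and the function names are arbitrary) finds the largest dimensionality for which the chance probability of a linear dichotomy stays at or below a chosen limit for a training set of n items.

```python
from math import comb  # Python 3.8+

def linear_dichotomy_fraction(n: int, d: int) -> float:
    """f(n,d) of Equation 1: fraction of the 2**n dichotomies of n points
    in general position in d dimensions that are linearly separable."""
    if n <= d + 1:
        return 1.0
    return sum(comb(n - 1, i) for i in range(d + 1)) / 2 ** (n - 1)

def max_dimensions(n: int, limit: float = 0.05) -> int:
    """Largest d such that f(n, d) <= limit; returns 0 if even d = 1
    already exceeds the limit."""
    d = 0
    while d + 1 < n - 1 and linear_dichotomy_fraction(n, d + 1) <= limit:
        d += 1
    return d

if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, max_dimensions(n, 0.05), max_dimensions(n, 0.01))
```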

ACKNOWLEDGMENT

I thank C. J. van Rijsbergen and D. J. Wheeler of the Computer Laboratory, University of Cambridge, for their useful comments on and suggestions for this work.

LITERATURE CITED
(1) B. R. Kowalski and C. F. Bender, J. Am. Chem. Soc., 94, 5632 (1972).
(2) P. C. Jurs and T. L. Isenhour, "Chemical Applications of Pattern Recognition", Wiley-Interscience, New York, 1975.
(3) B. R. Kowalski and C. F. Bender, Anal. Chem., 44, 1405 (1972).
(4) D. N. Anderson and T. L. Isenhour, Pattern Recognition, 5, 249 (1973).
(5) C. F. Bender, H. D. Shepherd, and B. R. Kowalski, Anal. Chem., 45, 617 (1973).
(6) D. H. Foley, IEEE Trans. Inf. Theory, IT-18, 618 (1972).
(7) J. W. Sammon, D. H. Foley, and A. Proctor, Proc. IEEE Symp. Adaptive Processes, University of Texas at Austin, 1970, p IX.2.1.
(8) R. O. Duda and P. E. Hart, "Pattern Classification and Scene Analysis", Wiley-Interscience, New York, 1973.
(9) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 41, 690 (1969).

(10) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 42, 1387 (1970).
(11) W. H. Highleyman, Bell Syst. Tech. J., 41, 723 (1962).
(12) B. R. Kowalski, P. C. Jurs, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 41, 1945 (1969).
(13) S. R. Lowry, H. B. Woodruff, G. L. Ritter, and T. L. Isenhour, Anal. Chem., 47, 1126 (1975).
(14) P. C. Jurs, Anal. Chem., 42, 1633 (1970).
(15) D. R. Preuss and P. C. Jurs, Anal. Chem., 46, 520 (1974).
(16) G. S. Zander, A. J. Stuper, and P. C. Jurs, Anal. Chem., 47, 1085 (1975).
(17) B. R. Kowalski and C. F. Bender, Pattern Recognition, 8, 1 (1976).

N. A. B. Gray
King's College Research Centre
King's College
Cambridge CB2 1ST, England

Received for review May 25, 1976. Accepted September 13, 1976.
