Deceptive "correct" separation by the linear learning machine

Chemometrics and liquid chromatography in the study of acute lymphocytic leukemia. Hubert A Scoble , James L Fasching , Phyllis R Brown. Analytica Chi...
1 downloads 0 Views 386KB Size
desolvation, produce particles of finite diameter and mass. This expectation is not consistent with the model employed herein, in which complete evaporation is predicted. One would also expect that as desolvation occurs, the density of the droplet would change because of the increasing analyte concentration. Finally, vapor-phase water and its decomposition products, in the region of the droplet, might change the viscosity of the flame gases. However, these changes are minor refinements and would not significantly alter the result predicted here, that before desolvation is complete, the aerosol will have attained a velocity which is experimentally indistinguishable from that of the flame gases.


C. B. Boss¹
G. M. Hieftje*

Department of Chemistry
Indiana University
Bloomington, Indiana 47401

¹ Present address: Department of Chemistry, North Carolina State University, Raleigh, N.C. 27607.

RECEIVED for review May 9, 1977. Accepted August 17, 1977. This work was supported in part by the National Science Foundation through grant NSF MPS 75-21695 and by the National Institutes of Health through grant PHS GM 17904-05.

Deceptive “Correct” Separation by the Linear Learning Machine

Sir: The linear learning machine (LLM) holds promise as an extremely useful and simple tool for distinguishing among different groupings of items, including chemical compounds, arranging them into a meaningful order or into defined categories based on measurable quantities (1-3). The equations upon which the linear separations are made have then been used to predict the category of unclassified items: the same physical measurements determined for the categorized groups are made on the unknown, and the group with which the unclassified item associates is observed. Kanal (4) reported on problems concerning dimensionality and sample size in pattern recognition techniques. This warning has not been explicitly stated in the chemical literature except for a recent theoretical discussion and demonstration on artificial data by Gray (5).

We attempted to use the LLM to categorize chemical compounds into appropriate electrical fire-hazard classes for bulk water transport. Our use of the LLM procedures consisted of first trying to separate chemicals into the categories developed experimentally by the National Academy of Sciences (NAS), which ranked the electrical fire hazard of the compounds (6). Our separators were based on variables that were a composite of physical measurements and structural information. The compounds with their NAS classifications are listed in Table I, and the variables are listed in Table II. Both the binary- and the multi-group linear learning machine procedures found in the computer package ARTHUR (7) were applied to this data set. With either routine, we were able to completely separate the listed compounds into their NAS-assigned categories using either the complete variable set listed in Table II or a number of smaller subsets of the variables.

It was essential in our work with the LLM that we obtain correct separation of the experimentally classified compounds at greater than 95% confidence. Applications of the LLM that carry substantial human or material risk, such as this fire-hazard problem, cannot tolerate errors in prediction; since perfect prediction of the training set does not guarantee that test data will be predicted with comparable accuracy, other tests must be performed. Applications where only a trend is needed to indicate the direction of future work can benefit from the less accurate predictions of the LLM. A high percentage of correct separation of the training set into assigned categories has been used by others to indicate that unknowns could be categorized with a high degree of accuracy (8).


Since we achieved complete separation of our training set and were able to measure the variables in Table II easily, we had hoped to use this premise to categorize a large number of compounds. To test this prediction capability, we used a leave-one-out iterative procedure (JACKKNIFE), which involves treating one compound in the training set as an unknown and then classifying it on the basis of the rest of the data. Comparing the predicted classification of each chemical with its correct classification provides a better estimate of the prediction capability of the LLM than does the percentage separation of the training set. The LLM had separated the training set completely (100%), whereas the JACKKNIFE procedure correctly classified only 68% and 66% of the compounds for the multi-group and binary linear learning machines, respectively. These percentages were obtained with the first 13 variables listed in Table II, and they were the best results we obtained for any subset of variables tested, including the complete data set. The variables in the subsets tested were chosen on the basis of a series of stepwise discriminant analyses. This routine orders the variables by their approximate importance according to the percentage of the variance they contribute to the data set while retaining the compounds in their respective categories.

We found that using either fewer or more than 13 variables decreased the reliability of our predicting the NAS-assigned classification. Decreasing the number of variables below 13 most likely leaves insufficient information in the data set to make a correct decision concerning a compound's classification. This idea is supported by the fact that we often did not obtain 100% separability of the training set with fewer than 13 variables. An explanation for why adding more variables decreased the percentage of correctly predicted compounds is more difficult to find, especially since the 100% separation of the training set was retained. One possible explanation is similar to what Gray (5) proposed. He performed a series of calculations suggesting that a “noise” feature can be found in a data set and exploited by linear learning machines to shift the equations describing the hyperplane used to make the category decision for a compound. This allows the training set to be separated more rapidly, but the optimal hyperplane, or even a correct one for separation, is not always found. The inclusion of such a random component would lead to incorrect classification of unknowns, especially those near the true boundary between categories, because the hyperplane is shifted slightly in its orientation.
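As a concrete illustration of the JACKKNIFE procedure described above, here is a minimal sketch in Python (our own construction, not the ARTHUR routines; a perceptron-style error-correction trainer stands in for the linear learning machine, and all names are chosen for illustration):

```python
import numpy as np

def train_llm(X, y, epochs=1000):
    """Perceptron-style linear learning machine: find a weight vector w
    such that sign(w . x) reproduces the labels y (+1/-1), provided the
    training set is linearly separable (within the epoch cap)."""
    X = np.hstack([X, np.ones((len(X), 1))])   # augment with a bias term
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:             # misclassified (or on the plane)
                w += yi * xi                   # error-correction feedback
                errors += 1
        if errors == 0:                        # training set fully separated
            break
    return w

def jackknife_accuracy(X, y):
    """Leave-one-out: hold out each sample, retrain on the rest,
    and predict the held-out sample."""
    correct = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        w = train_llm(X[mask], y[mask])
        xi = np.append(X[i], 1.0)              # same bias augmentation
        if np.sign(w @ xi) == y[i]:
            correct += 1
    return correct / len(X)
```

On a linearly separable training set, train_llm reports 100% separation, while jackknife_accuracy, which never lets a compound influence its own classification, can be substantially lower; that is exactly the gap reported here.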

Table I. Training Set
(Original columns: Compound; NAS classification; JACKKNIFE prediction by the multi-group machine; JACKKNIFE prediction by the binary (plane) machine. Entries were B, C, D, or Unclassified.)

Compounds: 2-Propanol; Propanal; 2-Butanol; 1-Butanol; 2-Butanal; Amine, diethyl; Pentene, 2,4,4-trimethyl; Propane, 2-chloro-1,3-epoxy; Propene, ethyl ester; Ethane, 1,2-diamino; Azirane; Hydrogen sulfide; Ether, diisopropyl; 3-Penten-2-one, 4-methyl; Morpholine; Propane, 2-nitro; Pyridine; Furan, tetrahydro; Methane; Formaldehyde, dimethyl acetal; Ether, dimethyl; Ether, dipropyl; Ethane, amino; Amine, triethyl; Cyclopropane; Propyne; Propane; Acetaldehyde; Acetic acid, nitrile; Ammonia; 1,3-Butadiene; Carbon disulfide; Ethane, dichloride; Ethane, 1,2-epoxy; 1,3-Butadiene, 2-methyl; Propene; Propane, 1,2-epoxy; Styrene; Hydrazine, 1,1-dimethyl; Acetic acid, ethenyl ester; Ethene, chloro; Benzene, 1,4-dimethyl; Hydrogen; Ether, diethyl; Ethene; Butane; Ethyne

Two other possible explanations exist for why we were able to achieve complete separation of the training set but were not as successful with the JACKKNIFE classifications. First, some compounds may contain unique structural information or a physical property that is important in determining the correct category; this information is utilized when the compound is included in the training set, but no other compound can provide it when that compound is treated as an unknown. Second, the data set may have been too small and/or the variables chosen incomplete for the attempted classification.

Another problem evident from these data was also discussed by Gray (5); it relates to the varying number of samples in each class and to the ratio of the number of samples in each class to the number of variables. Bender et al. (9) proposed that the total number of samples per class in the data set be at least three times the number of variables. Gray (5) proposed that the prediction success rate will be better for the classes containing the greatest number of samples. Our data support both points and demonstrate the importance of having approximately equivalent numbers of samples in each class and an appropriate sample-to-variable ratio.
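Stated compactly (the symbols are ours, introduced for illustration), the Bender et al. rule of thumb reads:

```latex
n_c \ge 3d \qquad \text{for every class } c,
```

where n_c is the number of training samples in class c and d is the number of variables. Even with only the 13 variables retained here, the criterion would call for roughly 39 compounds per class, which the 47-compound training set, divided among classes B, C, and D, cannot meet for every class simultaneously.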

Table II. Variable Ordering
1. Autoignition temperature
2. Total number of hydrogens
3. Epoxy groups
4. NO2 groups
5. Molecular weight
6. Solubility in ether
7. CH3 groups
8. NH2 groups
9. Total number of carbons
10. Carbon-carbon single bonds
11. Ester linkages
12. NH groups
13. Solubility in alcohol
14. Carbon-carbon triple bonds
15. Total number of sulfurs
16. Carbon-chlorine bonds
17. NH2 groups
18. Carbons without hydrogens
19. Carbon-nitrogen triple bonds
20. CH groups
21. N-C=N groups
22. Ethyl groups
23. COH groups
24. Total number of nitrogens
25. Hydrogens alpha to C=O
26. Ester linkages
27. Nitrogens without hydrogens
28. Solubility in water
29. Hydrogens alpha to C=C
30. Flash point
31. Total number of oxygens
32. Boiling point
33. HC=O groups
34. Melting point
35. CH2 groups
36. COOH groups
37. C=O groups
38. Carbon-carbon double bonds
39. Total number of chlorines

Class D and class C are of approximately equal size, and the optimal percentages correctly predicted for the two groups using the JACKKNIFE procedure are 74% and 67%, respectively, whereas class B, which contained far fewer compounds, was predicted with only 43% accuracy. Thus, the larger groups were predicted more accurately. Although the number of experimentally classified compounds and the descriptors we chose as variables did not allow us to reach the suggested ratio of samples per class to variables, we were still able to see a trend suggesting the importance of using a ratio greater than ours. First, the simple relationship mentioned above was observed: the groups with a sample-to-variable ratio closer to three were predicted with better accuracy. A second interesting relationship was observed between our ability to separate the training set and the JACKKNIFE predictions as the number of variables was varied. The percentage of correct JACKKNIFE predictions improved, while the training set remained completely separated, as the sample-to-variable ratio increased, until only 13 variables remained; below that point we believe insufficient information is retained. This indicates that, as the number of variables decreases relative to the number of samples, the calculated hyperplanes approach the correct boundary planes between the categories. A small sample-to-variable ratio, however, allowed the “noise” vector mentioned previously to be utilized in the calculations; the hyperplane was then positioned to accommodate the data points and achieve separation of the training set without actually separating the categories. It is obvious from these data that one should not be satisfied with 100% separation of a training set as an indication that unknown compounds can be predicted with any accuracy.
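A small numerical experiment in the spirit of Gray's argument (our own hypothetical construction, reusing train_llm and jackknife_accuracy from the earlier sketch; the class means, sample counts, and the 30 noise features are arbitrary choices) illustrates the effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two overlapping classes along a single informative feature:
# no threshold on that feature alone classifies every sample.
n = 20
x_info = np.concatenate([rng.normal(0.0, 1.0, n), rng.normal(1.5, 1.0, n)])
y = np.array([-1] * n + [1] * n)

# Appending many pure-noise features raises the dimensionality until
# some hyperplane separates the training set exactly, but that plane
# is fitted to the noise rather than to the underlying property.
noise = rng.normal(size=(2 * n, 30))
X = np.column_stack([x_info, noise])

w = train_llm(X, y)                        # defined in the earlier sketch
Xa = np.hstack([X, np.ones((len(X), 1))])  # same bias augmentation
print(f"training-set separation: {np.mean(np.sign(Xa @ w) == y):.0%}")
print(f"jackknife accuracy:      {jackknife_accuracy(X, y):.0%}")
# Typically prints 100% separation of the training set but a much
# lower jackknife figure, mirroring the behavior reported above.
```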


It is also important to note that including too much extraneous information in the linear learning machines can lead to incorrect predictions: some of that information can be used in developing the equations of the separating hyperplanes while part of the variance that is actually important for categorizing the training set, and ultimately for predicting the unknown, is ignored. Real variables that have no relationship to the property or classification being tested will behave as a random component in the data set.

ACKNOWLEDGMENT
We thank the University of Rhode Island Computer Laboratory personnel for their assistance in the analysis of these data and John M. Cece for his many helpful suggestions.

LITERATURE CITED
(1) P. C. Jurs, B. R. Kowalski, and T. L. Isenhour, Anal. Chem., 41, 21 (1969).
(2) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 41, 1949 (1969).
(3) T. L. Isenhour and P. C. Jurs, Anal. Chem., 43 (10), 20A (1971).
(4) L. Kanal and B. Chandrasekaran, Pattern Recognition, 3, 225-234 (1971).

(5) N. A. B. Gray, Anal. Chem., 48, 2265 (1976).
(6) "Matrix of Electrical and Fire Hazard Properties and Classification of Chemicals", National Academy of Sciences, Washington, D.C., NTIS A027181 (1975).
(7) J. C. MacDonald, Am. Lab., 9, 31 (1977).
(8) D. R. Preuss and P. C. Jurs, Anal. Chem., 46, 520 (1974).
(9) L. F. Bender, H. D. Shepard, and B. R. Kowalski, Anal. Chem., 45, 617 (1973).

Clifford P. Weisel
James L. Fasching*
Department of Chemistry
University of Rhode Island
Kingston, Rhode Island 02881

RECEIVED for review May 12, 1977. Accepted July 28, 1977. This research was supported by a U.S. Coast Guard Contract (DOT-CG-44160-A) and NSF Grant OCE76-16883. The opinions or assertions contained herein are the private ones of the writers and are not to be construed as official or reflecting the views of the Commandant or the Coast Guard at large.

Dimensionality and the Number of Features in “Learning Machine” Classification Methods

Sir: Of the several chemical pattern recognition techniques that have been reported, the one that has received the most attention is the “linear learning machine” (1). Other names that describe similar concepts include linear discriminant function, threshold logic unit, binary pattern classifier, and linear feedback classifier. For each, the problem is described in the same way. An investigator has collected information about a number of different species or patterns. In the chemical problems, the species have most frequently been compounds and the information has been physical (or spectral) measurements. To establish the convention for this note, suppose that the spectra of n compounds have been measured and that each compound is represented by d physical measurements or features. Further suppose that the compounds have been divided into two categories on the basis of some other physical property. In the usual chemical example, the two categories are (1) the presence of some chemical substructure in the compound and (2) the absence of the same substructure.

If the problem is considered geometrically, the learning machine algorithm attempts to find a (d + 1)-dimensional hyperplane that will physically partition the two categories (2). Algebraically, this discrimination amounts to choosing a linear combination of the d measurements such that if the resulting inner product for compound i, g_i, is greater than some threshold, compound i will always be a member of category 1. Similarly, if g_i is less than the threshold, compound i will always be in category 2.

There are additional factors that must be considered in evaluating the linear discriminant function.
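In symbols (the weight vector w and threshold t are our notation; the (d + 1)-th weight arises from the usual augmentation of each pattern with a constant 1), the decision rule just described is:

```latex
g_i = \mathbf{w} \cdot \mathbf{x}_i = \sum_{j=1}^{d} w_j x_{ij} + w_{d+1},
\qquad
\text{compound } i \text{ is assigned to }
\begin{cases}
\text{category 1}, & g_i > t, \\
\text{category 2}, & g_i < t.
\end{cases}
```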


The specific factor that motivates this discussion is the requirement on the ratio of the number of compounds (or patterns) to the dimensionality of the data. The theoretical treatment for the ratio of patterns to dimensions is usually stated (3). The most striking characteristic of the theoretical result (vide infra) is that if the number of patterns (or spectra) is less than the dimensionality of the data, then a separating hyperplane always exists. Thus, no physical significance whatever may be attached to such linear separability. One difficulty with the theoretical formulation is that it may easily be misinterpreted. If the dimensionality of the data is mistakenly taken to be the number of physical measurements made (that is, d), then it is possible to generate n < d spectra that are not linearly separable with the d features. The intent here is to demonstrate that the critical factor for linear separability is not the number of features measured but the number of orthogonal dimensions spanned by the data. In the subsequent discussion, this number of orthogonal dimensions will be referred to as the dimensionality of the data.

Theory and Example. The theoretical result mentioned above is described in the pattern classification book by Duda and Hart (4). The result depends upon two values, n and d′. The value n follows the convention stated earlier, and d′ is the dimensionality of the data. The function f(n, d′) is the fraction of all possible dichotomies of n points in d′ dimensions that are linearly separable. The fraction is determined by first asking how many ways n spectra may be labeled, or divided into two categories; there are 2^n ways of dichotomizing n points. Next, the total number of these dichotomies that are linearly separable in d′ dimensions must be counted. The resulting ratio is f(n, d′) and
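For reference, the counting function being introduced here is the standard result reproduced in Duda and Hart's treatment; the form below is quoted from that general literature rather than from this correspondence:

```latex
f(n, d') =
\begin{cases}
1, & n \le d' + 1, \\[4pt]
\dfrac{2}{2^{n}} \displaystyle\sum_{i=0}^{d'} \binom{n-1}{i}, & n > d' + 1.
\end{cases}
```

In particular, f(n, d′) = 1 whenever n ≤ d′ + 1, which is precisely the statement above that a separating hyperplane always exists when the number of patterns does not exceed the dimensionality of the data.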