Dimensionality and the number of features in "learning machine" classification methods

Gary L. Ritter and Hugh B. Woodruff. Anal. Chem., 1977, 49 (...
It is also important to note that including too much extraneous information when using the linear learning machines can lead to incorrect predictions. Some of the extraneous information can be used in developing the equations of the hyperplanes of separation, while some of the actual variance that is important in categorizing the training set, and ultimately in predicting the unknown, is ignored. Real variables that have no relationship to the property or classification being tested will behave as a random component in the data set.

ACKNOWLEDGMENT

We thank the University of Rhode Island Computer Laboratory personnel for their assistance in the analysis of these data and John M. Cece for his many helpful suggestions.

LITERATURE CITED

(1) P. C. Jurs, B. R. Kowalski, and T. L. Isenhour, Anal. Chem., 41, 21 (1969).
(2) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 41, 1949 (1969).
(3) T. L. Isenhour and P. C. Jurs, Anal. Chem., 43 (10), 20A (1971).
(4) L. Kanal and B. Chandrasekaran, Pattern Recognition, 3, 225-234 (1971).
(5) N. A. B. Gray, Anal. Chem., 48, 2265 (1976).
(6) “Matrix of Electrical and Fire Hazard Properties and Classification of Chemicals”, National Academy of Sciences, Washington, D.C., NTIS A027181 (1975).
(7) J. C. MacDonald, Am. Lab., 9, 31 (1977).
(8) D. R. Preuss and P. C. Jurs, Anal. Chem., 46, 520 (1974).
(9) L. F. Bender, H. D. Shepard, and B. R. Kowalski, Anal. Chem., 45, 617 (1973).

Clifford P. Weisel James L. Fasching* Department of Chemistry University of Rhode Island Kingston, Rhode Island 02881

RECEIVED for review May 12, 1977. Accepted July 28, 1977. This research was supported by a U.S. Coast Guard Contract (DOT-CG-44160-A) and NSF Grant OC.376-16883. The opinions or assertions contained herein are the private ones of the writers and are not to be construed as official or reflecting the views of the Commandant or the Coast Guard at large.

Dimensionality and the Number of Features in “Learning Machine” Classification Methods

Sir: Of the several chemical pattern recognition techniques that have been reported, the one that has received the most attention is the “linear learning machine” (1). Other names that describe similar concepts include linear discriminant function, threshold logic unit, binary pattern classifier, and linear feedback classifier. For each, the problem is described in the same way. An investigator has collected information about a number of different species or patterns. In chemical problems, the species have most frequently been compounds and the information has been physical (or spectral) measurements. To establish the convention for this note, suppose that the spectra of n compounds have been measured and that each compound is represented by d physical measurements or features. Further suppose that the compounds have been divided into two categories on the basis of some other physical property. In the usual chemical example, the two categories are (1) the presence of some chemical substructure in the compound and (2) the absence of the same substructure.

If the problem is considered geometrically, the learning machine algorithm attempts to find a d + 1 dimensional hyperplane that will physically partition the two categories (2). Algebraically this discrimination amounts to choosing a linear combination of the d measurements so that if the resulting inner product for compound i, g_i, is greater than some threshold, compound i will always be a member of category 1. Similarly, if g_i is less than the threshold, compound i will always be in category 2.

There are additional factors that must be considered in evaluating the linear discriminant function. The specific factor that motivates this discussion is the requirement on the ratio

ANALYTICAL CHEMISTRY, VOL. 49, NO. 13, NOVEMBER 1977

of the number of compounds (or patterns) to the dimensionality of the data. Usually the theoretical treatment for the ratio of patterns to dimensions is stated (3). The most striking characteristic of the theoretical result (vide infra) is that if the number of patterns (or spectra) is less than the dimensionality of the data, then a separating hyperplane always exists. Thus, no possible physical significance may be attached to this linear separability.

One difficulty with the theoretical formulation is that it may be easily misinterpreted. If the dimensionality of the data is mistakenly taken to be the number of physical measurements made (that is, d), then it is possible to generate n < d spectra that are not linearly separable with the d features. The intent here is to demonstrate that the critical factor for linear separability is not the number of features measured, but instead the number of orthogonal dimensions spanned by the data. In the subsequent discussion, this number of orthogonal dimensions will be referred to as the dimensionality of the data.

Theory and Example. The theoretical result mentioned above is described in the pattern classification book by Duda and Hart (4). The result depends upon two values, n and d′. The value n follows the convention stated earlier, and d′ is the dimensionality of the data. The function f(n, d′) is the fraction of all possible dichotomies of n points in d′ dimensions that are linearly separable. The fraction is determined by asking first how many ways n spectra may be labeled or divided into two categories. The result is that there are 2^n ways of dichotomizing n points. Next, the total number of these dichotomies that are linearly separable in d′ dimensions must be counted. The resulting ratio is f(n, d′) and

Table I. Two Categories That Are Inseparable Despite the Presence of More Measurements than Patterns

Spectrum   Category                    Measurement No.
   No.        No.          1       2       3       4       5       6
    1          1         7.40    2.60    8.70    5.00    6.85    3.80
    2          1         2.80    9.80    7.70    6.30    7.00    8.05
    3          2         5.00    1.20    5.60    3.10    4.35    2.15
    4          1         1.40    6.60    4.70    4.00    4.35    5.30
    5          2         2.20    8.20    6.30    5.20    5.75    6.70
Weight on measurement 1  1.0     0.0     1.0     0.5     0.75    0.25
Weight on measurement 2  0.0     1.0     0.5     0.5     0.5     0.75

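is given by Equation 1.

$$
f(n, d') =
\begin{cases}
1, & n \le d' + 1 \\[4pt]
\dfrac{2}{2^{n}} \displaystyle\sum_{i=0}^{d'} \binom{n-1}{i}, & n > d' + 1
\end{cases}
\qquad (1)
$$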
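As an illustrative aid (not part of the original correspondence), the fraction f(n, d′) can be evaluated with a few lines of Python. This sketch assumes Equation 1 has the standard form given by Duda and Hart for points in general position: f = 1 when n ≤ d′ + 1, and otherwise 2/2^n times the sum of the binomial coefficients C(n − 1, i) for i = 0 to d′. The function name `separable_fraction` is hypothetical.

```python
from fractions import Fraction
from math import comb


def separable_fraction(n: int, d: int) -> Fraction:
    """Fraction of the 2**n dichotomies of n points (in general position)
    in d dimensions that are linearly separable, per the assumed Equation 1."""
    if n <= d + 1:
        return Fraction(1)
    count = 2 * sum(comb(n - 1, i) for i in range(d + 1))
    return Fraction(count, 2 ** n)


# The two cases discussed in the text: d = 6 features versus d' = 2 dimensions.
print(float(separable_fraction(5, 6)))  # 1.0
print(float(separable_fraction(5, 2)))  # 0.6875
```

Exact rational arithmetic is used here only so that the two values quoted in the text can be compared without rounding concerns.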

If d′ = d, that is, if the dimensionality equals the number of features measured, then Equation 1 states that a linear dichotomy exists whenever n ≤ d + 1. To investigate this statement, consider the example shown in Table I. In this artificial example, there are 5 patterns (n = 5), each described by 6 measurements (d = 6). If the linear learning machine algorithm is used on these data, there is no convergence. Since the learning machine algorithm is known to converge if the data are linearly separable (2), the data must be linearly inseparable. Yet from Equation 1, f(5, 6) = 1, indicating that the data must be linearly separable. This contradiction implies that either Equation 1 is incorrect or that d ≠ d′. Since Equation 1 is known to be true, the dimensionality of these data must have been confused with the number of features measured. As a matter of fact, features 3 through 6 were generated as linear combinations of features 1 and 2. (The weights given to features 1 and 2 appear at the bottom of Table I.) Thus the six features span only two directions in the six-dimensional vector space. It is very important, then, to distinguish between the number of features or measurements, d, and the dimensionality of the data, d′. When the dimensionality of the data, d′, is considered, there is no contradiction in Equation 1, since f(5, 2) = 0.6875 (≠ 1.0).
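The inseparability of the Table I data can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' program: it uses a simple error-correction (perceptron-style) update with an arbitrary cap of 1000 passes, and the names `perceptron_converges`, `hull`, and `lam` are hypothetical. Because a capped run cannot by itself prove inseparability, the sketch also computes a direct certificate: spectrum 5 (category 2) is a weighted average, with positive weights, of spectra 1, 2, and 4 (all category 1), so no hyperplane can place it on the opposite side from all three.

```python
import numpy as np

# Table I: rows are spectra 1-5, columns are measurements 1-6.
X = np.array([
    [7.40, 2.60, 8.70, 5.00, 6.85, 3.80],
    [2.80, 9.80, 7.70, 6.30, 7.00, 8.05],
    [5.00, 1.20, 5.60, 3.10, 4.35, 2.15],
    [1.40, 6.60, 4.70, 4.00, 4.35, 5.30],
    [2.20, 8.20, 6.30, 5.20, 5.75, 6.70],
])
category = np.array([1, 1, 2, 1, 2])


def perceptron_converges(X, category, max_epochs=1000):
    """Error-correction training; True if a full pass makes no errors."""
    y = np.where(category == 1, 1.0, -1.0)
    Xa = np.hstack([X, np.ones((len(X), 1))])  # append a constant threshold term
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xa, y):
            if yi * (w @ xi) <= 0:  # misclassified, or lying on the plane
                w += yi * xi
                errors += 1
        if errors == 0:
            return True
    return False


converged = perceptron_converges(X, category)

# Certificate: solve for weights (summing to one) that express spectrum 5
# in terms of spectra 1, 2 and 4, using measurements 1 and 2 only.
hull = X[[0, 1, 3]]
A = np.vstack([hull[:, :2].T, np.ones(3)])  # two coordinates plus sum-to-one row
lam = np.linalg.solve(A, np.append(X[4, :2], 1.0))

print(converged)                      # False
print(bool(np.all(lam > 0)))          # True: spectrum 5 lies inside the triangle
print(np.allclose(lam @ hull, X[4]))  # True: the same weights hold for all six measurements
```

The last check works for all six measurements because measurements 3 through 6 are linear combinations of the first two, which is exactly why the extra features add no separating power.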

A separate and more trivial illustration of the difference between d and d′ is shown in Figure 1. Each “x” is described by two measurements; however, it is clear that all information lies in one dimension, which may be represented as the line passing through all of the “x”s.

Computing the Number of Orthogonal Dimensions. This section will briefly describe a method for estimating the dimensionality of the chemical data set. Because of the finite sample size, this dimensionality will serve as a lower limit for the true dimensionality of the universe of data. To restate the problem, an investigator is given the spectra of n compounds. Each compound is represented by d measurements. The goal is to determine the minimum number of orthogonal features (or linear combinations of the measurements) that will (nearly) represent the data. This formulation is identical with that given in recent papers which apply principal component analysis to chemical problems (5, 6). Thus the data are written into an n × d matrix, X, where each row represents one compound. The number of orthogonal dimensions, or the dimensionality, is given by the number of significantly non-zero eigenvalues of X^T X. The corresponding eigenvectors describe the transformations which produce the minimum number of orthogonal dimensions.

[Figure 1 plot not reproduced; recovered axis label: Measurement 2]

Figure 1. Five points that are described by two measurements, but that span one dimension

If the eigenanalysis is performed on the example in Table I (where n = 5 and d = 6), then the eigenvalues are found to be 902.692, 75.883, 0, 0, 0, and 0. (The characteristic equation, the roots of which are the eigenvalues, is λ^6 − 978.575 λ^5 + 68498.77311875 λ^4 = 0.) Since the numbers in this example were generated without noise, there are exactly two non-zero eigenvalues. This result confirms the contention that the dimensionality of the data set is two (d′ = 2).

In real examples, noise will be present in the data set. Also, the use of computers for eigenvalue calculation introduces machine round-off problems. These factors combine to produce non-zero eigenvalues, even if there is little or no information in a particular transformation (direction). In recent papers (7, 8), Malinowski has reported and investigated a strategy for determining the number of true dimensions, or the dimensionality, in a data set in the presence of noise. If an investigator is interested in producing a set of orthogonal features for a data set, an alternative to the principal component method has been reported by Kowalski and Bender (9).
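The eigenanalysis of the Table I example can be reproduced with NumPy. This is a minimal sketch rather than the authors' procedure: the data matrix is rebuilt from measurements 1 and 2 and the weights listed in Table I, and the relative cutoff `1e-8` used to decide which eigenvalues count as "significantly non-zero" is an arbitrary assumption suited to noise-free data (it is not Malinowski's criterion). Since X^T X is positive semidefinite, the computed eigenvalues are non-negative up to round-off.

```python
import numpy as np

# Measurements 1 and 2 for the five spectra of Table I.
A = np.array([
    [7.40, 2.60],
    [2.80, 9.80],
    [5.00, 1.20],
    [1.40, 6.60],
    [2.20, 8.20],
])

# Weights that generate measurements 1-6 from measurements 1 and 2.
W = np.array([
    [1.0, 0.0, 1.0, 0.5, 0.75, 0.25],
    [0.0, 1.0, 0.5, 0.5, 0.50, 0.75],
])

X = A @ W                                    # n x d data matrix (5 x 6)
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]  # eigenvalues, largest first

# Count eigenvalues that are significantly non-zero (assumed relative cutoff).
tol = 1e-8 * eigvals[0]
d_prime = int(np.sum(eigvals > tol))

print(np.round(eigvals[:2], 3))  # the two non-zero eigenvalues
print(d_prime)                   # 2
```

With noisy data the trailing eigenvalues would no longer sit at round-off level, and the fixed cutoff above would have to be replaced by a criterion that accounts for the measurement error.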

CONCLUSION

It is quite possible that the number of features that are measured does equal the true dimensionality of the universe of data. However, the finite sample size and the presence of correlation in spectroscopic measurements may contribute to an effective decrease in the dimensionality of the sampled data. The investigator who uses a linear classification technique needs to be aware that the limits he states for the ratio n/d are a conservative estimate of the theoretically more valuable ratio n/d′. Therefore, separations that may at first appear meaningless may in fact have physical significance.


ACKNOWLEDGMENT

The authors thank David S. Bright for his helpful comments on this manuscript.

LITERATURE CITED

(1) P. C. Jurs, B. R. Kowalski, and T. L. Isenhour, Anal. Chem., 41, 21 (1969).
(2) N. J. Nilsson, “Learning Machines”, McGraw-Hill, New York, N.Y., 1965.
(3) N. A. B. Gray, Anal. Chem., 48, 2265 (1976).
(4) R. O. Duda and P. E. Hart, “Pattern Classification and Scene Analysis”, Wiley-Interscience, New York, N.Y., 1973.
(5) Z. Z. Hugus, Jr., and A. A. El-Awady, J. Phys. Chem., 75, 2954 (1971).
(6) G. L. Ritter, S. R. Lowry, T. L. Isenhour, and C. L. Wilkins, Anal. Chem., 48, 591 (1976).
(7) E. R. Malinowski, Anal. Chem., 49, 606 (1977).
(8) E. R. Malinowski, Anal. Chem., 49, 612 (1977).
(9) B. R. Kowalski and C. F. Bender, Pattern Recognition, 8, 1 (1976).

Garry L. Ritter* Special Analytical Instrumentation Section Analytical Chemistry Division National Bureau of Standards Washington, D.C. 20234

Hugh B. Woodruff Merck Sharp & Dohme Research Laboratories Rahway, New Jersey 07065 RECEIVED for review March 7, 1977. Accepted July 20, 1977.

AIDS FOR ANALYTICAL CHEMISTS

Nonmembrane Amperometric Sensor for Dissolved Oxygen in Flow-Through Systems

Ch-Michel Wolff¹ and Horacio A. Mottola*

Department of Chemistry, Oklahoma State University, Stillwater, Oklahoma 74074

The determination of the concentration of dissolved oxygen in aqueous media is an analytical problem of widespread interest (1). The electrochemical characteristics of oxygen have inspired the development of basically two types of sensors, galvanic and electrolytic, in a variety of configurations and for different applications. Industrial (2) and particularly biochemical (3) situations require the use of these probes [e.g., it is recognized that the measurement of the partial pressure of oxygen in gaseous, liquid, and semiliquid media is routine work in most hospitals and physiological laboratories (4)]. Most of these sensors physically separate the metallic electrode surface (generally platinum) from the sensed solution by means of a semipermeable membrane. The electrode response in these cases is kinetically controlled by the rate of diffusion of the dissolved oxygen through the membrane to reach the electrode surface. The most popular sensor of this kind is the so-called Clark electrode, in which both the platinum cathode and the reference anode are covered by a single semipermeable membrane of Teflon, polyethylene, polypropylene, or Mylar (5). The membrane minimizes undesirable effects of lipids, proteins, cells, and variations in stirring in biochemical applications.

This paper reports on the design and performance of an electrolytic (amperometric), nonmembrane, three-electrode system with fast response to changes in dissolved oxygen concentration and especially suitable for use with solutions flowing in narrow tubing at constant rate. The device is useful in situations where a fast response is needed, such as in sample injection techniques in continuous flow analysis (6, 7). The development of polarographic and voltammetric sensors for continuous flow analyses has received sustained attention (8) since Müller’s pioneer work on a by-pass electrode system (9).
This work was done with a platinum wire inserted at a right angle into the thickened wall of the glass tubing through which the solution for analysis flowed. The exposed electrode surface was a disk of approximately 0.3 mm². The work was intended to produce an understanding of electrode reactions in which equilibrium conditions are not established, and Müller made the interesting projection, at that time, that the “by-pass electrode could, if inserted into the blood stream, provide an immediate indication of the availability of material in the blood to a given tissue”.

¹ Permanent address: Laboratoire de Chimie Physique et d’Electroanalyse, Ecole Nationale Superieure de Chimie de Strasbourg, 1, rue Blaise Pascal, 67008 Strasbourg Cedex, France.

Design of Electrode System. The need for fast response and the continuous “washing” of the electrode surface by the imposed flow led us to the all-glass cell and the three-electrode design illustrated in Figure 1. The reference electrode is a calomel element of a Coleman 3-511 electrode removed from its sleeve and inserted into a 3 unit (see Figure 1) with a platinum wire contact at its bottom. The electrolyte inside the reference electrode is a saturated KCl solution. The working electrode is a platinum wire; a platinum wire also serves as counterelectrode. Earlier exploratory work with a platinum disk (from a Beckman 39273 metallic electrode) indicated that better defined peak profiles (Figure 3) are obtained with a filament-type electrode (Pt wire, 0.5-mm o.d. and 9-mm length), which disturbs the laminar flow considerably less than the platinum disk. The larger diameter of the platinum disk and the closer fit to the cell walls generate turbulence; moreover, the platinum wire equally well integrates the current during the parading of the “sample plug” in and out of the detection zone.

A three-electrode system is preferable for measuring changes in oxygen concentration from the saturated level. Solutions saturated with oxygen exhibit a large value of current as “baseline” and, to obtain a stable applied potential, a three-electrode configuration becomes desirable. The three-electrode cell described in this paper can be used with practically any common voltammetric circuit and with current or potential read-outs operating at the microampere-millivolt levels.
Our evaluation data were obtained with the help of a potentiostat assembled with modular units from an MP-System 1000 (McKee-Pedersen Instruments, Danville, Calif.). A Sargent SRG strip chart recorder or a Tektronix 564 storage oscilloscope was used as the read-out.