Representation-space transformation for the display of multivariate chemical information. Chi-Hsiung Lin and Hwa-Fu Chen. Anal. Chem., 1977, 49 (9...

Representation-Space Transformation for the Display of Multivariate Chemical Information

Chi-Hsiung Lin* and Hwa-Fu Chen

Department of Chemistry, National Cheng Kung University, Tainan, Taiwan, Republic of China

A new algorithm for displaying multivariate chemical information on a three-dimensional coordinate system is presented. The transformation is basically the extraction of three characteristics from the information and assures the preservation of data structure to a satisfactory degree if the information is normalized. Since the transformation is independent of other existing information, the display coordinates are unique and representative of the given information. This promises utility in treating a huge amount of chemical information with pattern recognition and other modern techniques. The display coordinates also serve as the index codes in the computer retrieval of such information. Other applications of this method, such as signal shape analysis in spectroscopy, are suggested.

The application of pattern recognition techniques in chemistry (1-4) has aroused wide interest in recent years. This technique is the central part of the “learning machine” (5) utilized for the analysis of mass spectra (6), infrared spectra (7), NMR spectra (8), polarograms (9), etc. Besides this conventional classification method, some chemical problems are also solved by unsupervised learning, which uses the technique of cluster analysis (10, 11). Cluster analysis was developed over the past several years and is widely used in taxonomy, psychology, the social sciences, and medical diagnosis. It is a powerful tool for detecting obscure properties of a collection of objects through the recognition of a natural classification among them (1). In chemistry, its application is still in the beginning stages, but since chemistry already abounds in scientific information, its future development can be easily anticipated.

To find an intrinsic property hidden in a huge set of information, it is necessary to perceive the mutual relationships among this set of information. At the present stage of development, the most reliable detecting device is still human perception, whose working principles are the basis for the development of theories of cluster analysis (12). But to be perceptible, the information must be presented in some strictly limited forms. Some chemical information, such as spectra of compounds, is already in such forms, and chemists are taught to differentiate and identify them properly. As the amount and complexity of information for such study increase, however, human capability quickly becomes inadequate unless some new presentation scheme, simple for perception and efficient for numerical treatment, can be devised. The present study offers such a scheme, which transforms multivariate chemical information into corresponding points in a three-dimensional display space for visual perception and numerical analysis.

The Display Method. Multivariate chemical information can be thought of as a set of N measurements. A set of chemical information can then be represented as a set of points in an N-dimensional measurement space where each axis corresponds to one of those N measurements. The distribution of these points, or the data structure, is the object of study in cluster analysis. To study this by human visual perception, however, it is necessary to transform it into a display space of dimensionality no higher than three while keeping the data structure practically intact. In most cases, the N-dimensional measurement space is thought of as an orthogonal coordinate system, but this is not necessarily so. A ¹H NMR spectrum with a singlet at δ 1.0, for instance, resembles one with a singlet at δ 1.5 rather than one with a singlet at δ 7.5.
To demonstrate this on the measurement space, the relation between the axis for δ 1.0 and that for δ 1.5 has to be different from the relation between the former and the axis for δ 7.5. Even so, it is the usual practice to adopt an orthogonal coordinate system for the measurement space for mathematical convenience. A display space is also an orthogonal coordinate system with a low dimensionality. A display method is, therefore, a proper scheme of dimensionality reduction which keeps the data structure as intact as possible (13).

The Karhunen-Loeve transformation (14) is one such dimensionality reduction treatment, the general difficulty of which is thoroughly discussed by Olsen (15). Another is the method of nonlinear mapping (NLM) developed by Sammon. In this method, the data structure is preserved by iteratively adjusting the inter-point distances so as to minimize the sum of deviations. The benefits and limitations of this method are also discussed in detail (16). Besides the difficulties in their practical application, a serious and seldom noticed shortcoming of these methods is that they treat all information as a set; consequently, a revised computation is necessary whenever new information is added to the set. What is more, the revised computation gives different coordinate values for each piece of information.

In the study of quickly accumulating chemical information, we hope to have a method which assigns to a given piece of information its unique display coordinates. Fukunaga and Olsen (17) present a two-dimensional display which partly satisfies this goal. The uniqueness of the display coordinates will make it possible to utilize them as index codes of that information for later retrieval and comparison. Furthermore, we hope that the display coordinates are themselves related to some conceivable characteristics of the information in the ordinary chemical sense. A spectrum, for instance, should give a higher y-value if it is more intense, and a larger x-value if its bands are bathochromically shifted.

ANALYTICAL CHEMISTRY, VOL. 49, NO. 9, AUGUST 1977
Chemical information as such is often a continuous function which is sampled over (N − 1) intervals to give the N measurements. If the display coordinates are truly representative of the information, then the number of samplings should not be dominant. We hope, in this connection, that the display will not shift too far when the number of samplings is reduced, as long as the very nature of the information is not lost.

Obviously, a display which satisfies the above desiderata is more than an ordinary display. The display space, in which the data structure is preserved through dimensionality reduction from a high-dimensional measurement space, is here also a space on which the data points are representative of the characteristics of the original information. We call this display space the “representation space” of the chemical information, and the method of dimensionality reduction the “representation-space transformation”.

THE REPRESENTATION-SPACE TRANSFORMATION

Multivariate chemical information comprises several measured values, either from a single measuring instrument or from several independent sources. These values may or may not possess a natural sequence of arrangement. When they do not, an arrangement is assigned in the preprocessing stage, either in accord with some chemical consideration or arbitrarily. This stage also includes any transformation taken for a specific purpose and a normalization process that gives a new set of values, each in the range 0.0 to 1.0. The ith information is now

$$P_i = \{p_{i1}, p_{i2}, \ldots, p_{iN}\} \tag{1}$$

and occupies a point in the N-dimensional information space inside a hypercube, the edges of which are all of the same unit length. The representation-space transformation, $T_R$, gives three values, $x_i$, $y_i$, and $z_i$, as the coordinates of its “representation point” on the display space, i.e.,

$$T_R P_i = ((x_i, y_i, z_i)) = ((P_i)) \tag{2}$$
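The normalization step of the preprocessing stage can be sketched as follows. This is only an illustration: the text requires values in the range 0.0 to 1.0 but does not state the scaling rule, so min-max scaling against fixed bounds is an assumption, and the name `normalize` and the parameters `lo` and `hi` are hypothetical. Fixed bounds are chosen here so that a record's normalized values do not depend on the other records in the collection.

```python
from typing import List, Sequence


def normalize(values: Sequence[float], lo: float, hi: float) -> List[float]:
    """Scale raw measurements into [0.0, 1.0] using fixed bounds lo < hi.

    Assumption: the bounds come from the instrument or the measurement
    scale, not from the data set, so each record is scaled independently.
    Values outside the bounds are clipped onto the unit hypercube.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    span = hi - lo
    return [min(1.0, max(0.0, (v - lo) / span)) for v in values]


# Example: readings on a 0..100 scale; the last one is out of range and clipped.
p_i = normalize([0.0, 25.0, 50.0, 100.0, 120.0], lo=0.0, hi=100.0)
```

Clipping is one possible way to keep every point inside the hypercube; rejecting out-of-range records would be an equally reasonable choice.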

Let us consider three pairs of reference information, $X_-$ and $X_+$, $Y_-$ and $Y_+$, and $Z_-$ and $Z_+$, of the same dimensionality. These reference points are assigned so as to represent three special characteristics of the chemical information of interest. Their display coordinates are given the following values regardless of the dimensionality of the information space.

$$((X_-)) = ((-l_z, 0, 0)), \qquad ((X_+)) = ((l_z, 0, 0))$$

Here, $l_z$ is the limiting value for the representation space, to which 10000 is commonly assigned for the convenience of computer handling of the coordinates as one-word integers. The representation-space transformation transfers information points to the display space while keeping their proximity relationships to each pair of reference information intact. Thus, for instance, the ratio of the distances from $P_i$ to $X_-$ and to $X_+$ in the information space is used to assign the x-value of the representation point, so as to give the same ratio of distances from $((P_i))$ to $((X_-))$ and to $((X_+))$ on the display space. Thus, in general,

$$\frac{\mathrm{dist}[((P_i)), ((R_-))]}{\mathrm{dist}[((P_i)), ((R_+))]} = \frac{\mathrm{dist}[P_i, R_-]}{\mathrm{dist}[P_i, R_+]} \tag{4}$$

where R stands for X, Y, or Z. There is no limitation on which distance measure should be used. Generally, the Minkowski expression

$$\mathrm{dist}[P_i, R] = \left( \sum_{j=1}^{N} |p_{ij} - R_j|^{p} \right)^{1/p} \tag{5}$$

with three independently assigned exponents, $p_x$, $p_y$, and $p_z$, one for each pair of distances, is quite proper. The representation coordinates $((P_i))$ become

The three pairs of reference information can be assigned rather arbitrarily. Information on two special compounds, for instance, could be assigned as a pair of reference information in some cases. Typical factors (18) are especially suitable as reference information. Another recommendation is to assign the two extremes of a characteristic to each pair. For information such as chemical spectra, the pairs could be chosen to illustrate the left-right inclination, the low-high intensity, and the diverse-centralized distribution. In this sense, the following set is most serviceable through its simplicity in form and generality in nature:

where $p_m$ represents $(m - 1)/(N - 1)$ for the mth component of the N-dimensional vector point. An illustration of such reference information with eleven components is shown in Figure 1. The simplest case is when the information space is three-dimensional. This space is itself perceivable. The six references in this space and their representation points on the display space are compared in Figure 2. This information space has three axes, [1], [2], and [3].
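The transformation described in this section can be sketched in code under several stated assumptions. The distance is the expression of Equation 5. The coordinate formula is an assumption: it places each reference pair at −limit and +limit on its own axis and solves the distance-ratio condition along that axis, which may differ from the authors' final expression. The reference vectors in `reference_set` are hypothetical, built from the ramp $p_m = (m - 1)/(N - 1)$ only to mimic the three characteristics named in the text (left-right inclination, low-high intensity, diverse-centralized distribution). All function names are illustrative.

```python
from typing import Dict, List, Sequence, Tuple

LIMIT = 10000  # limiting value of the representation space quoted in the text


def minkowski(p: Sequence[float], r: Sequence[float], exponent: float = 2.0) -> float:
    """Distance of Equation 5 between a record p and a reference r."""
    return sum(abs(a - b) ** exponent for a, b in zip(p, r)) ** (1.0 / exponent)


def axis_coordinate(p, r_minus, r_plus, exponent: float = 2.0, limit: int = LIMIT) -> int:
    """Coordinate c on one axis, ASSUMING the pair sits at -limit and +limit.

    Solving (c + limit) / (limit - c) = d_minus / d_plus for c gives
    c = limit * (d_minus - d_plus) / (d_minus + d_plus).
    """
    d_minus = minkowski(p, r_minus, exponent)
    d_plus = minkowski(p, r_plus, exponent)
    if d_minus + d_plus == 0.0:
        raise ValueError("the two references of a pair must differ")
    return round(limit * (d_minus - d_plus) / (d_minus + d_plus))


def reference_set(n: int) -> Dict[str, Tuple[List[float], List[float]]]:
    """HYPOTHETICAL (minus, plus) reference pairs built from the ramp p_m."""
    if n < 2:
        raise ValueError("need at least two components")
    ramp = [(m - 1) / (n - 1) for m in range(1, n + 1)]
    peak = [1.0 - abs(2.0 * p - 1.0) for p in ramp]  # largest at the centre
    return {
        "x": ([1.0 - p for p in ramp], ramp),   # left-weighted vs right-weighted
        "y": ([0.0] * n, [1.0] * n),            # low vs high intensity
        "z": ([1.0 - v for v in peak], peak),   # diverse vs centralized
    }


def transform(p: Sequence[float], exponents=(2.0, 2.0, 2.0)) -> Tuple[int, int, int]:
    """Map one normalized record to integer display coordinates (x, y, z)."""
    refs = reference_set(len(p))
    x, y, z = (
        axis_coordinate(p, *refs[axis], exponent=e)
        for axis, e in zip("xyz", exponents)
    )
    return (x, y, z)


refs = reference_set(11)
centre = transform([0.5] * 11)  # equidistant from both members of every pair
```

Because `transform` uses only the record itself and the fixed references, the same record always yields the same coordinates, which is the uniqueness property the text relies on for index codes.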

COMPUTATION

All plotting and most calculations required for this study were carried out on a Hewlett-Packard 9810A Desk Calculator equipped with a 9861A Output Typewriter, a 9862A Calculator Plotter, and a 9865A Cassette Memory. Other calculations which require a high-capacity core memory were carried out on the UNIVAC 90/30 in the Engineering Research Center of the National Cheng Kung University.

RESULTS AND DISCUSSION

The ability of the representation space to serve as the classification medium for high-dimensional data was thoroughly studied with several artificial data sets. Comparison with the results of NLM treatment showed the present algorithm to be superior. The present transformation of a set of sixty 12-dimensional data took less than 10 ms of CPU time, while the iterative NLM took more than 10 min on a UNIVAC 90/30. The preservation of the data structure is also found to be better with our algorithm. This is a natural consequence of confining the pattern space to the hypercube of unit edge lengths; the NLM treatment does not impose any limitation of a similar nature. A pair of exactly identical patterns, for instance, appears as a single point on the representation space, but it always gives two points on the two-dimensional map obtained by the NLM method.

Another decisive factor of practical importance in computation is the memory size required. Even the largest computer now available cannot handle a collection of 1000 infrared spectra with 400 features each in its core memory with the nonlinear dimensionality reduction of NLM, which requires a core memory larger than 5400 kB. With our algorithm, on the other hand, even a desk calculator with a cassette memory can efficiently handle such a job, since the representation transformation is made on each piece of information independently and the resultant representation points are displayed one by one.
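The record-by-record character of the computation can be illustrated with a short sketch. The function `stream_points` holds only one record at a time, which is the property that lets a small machine process a large collection. The per-record function `toy_transform` is a hypothetical stand-in that computes only an intensity-like y-value from Euclidean distances to an all-zeros and an all-ones reference; it is not the authors' full transformation.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Point = Tuple[int, int, int]


def stream_points(
    records: Iterable[Sequence[float]],
    transform: Callable[[Sequence[float]], Point],
) -> Iterator[Point]:
    """Yield one display point per record; no other record is kept in memory."""
    for record in records:
        yield transform(record)


def toy_transform(record: Sequence[float], limit: int = 10000) -> Point:
    """HYPOTHETICAL per-record transform: only the y-axis is computed."""
    d_low = sum(v * v for v in record) ** 0.5            # distance to all zeros
    d_high = sum((1.0 - v) ** 2 for v in record) ** 0.5  # distance to all ones
    y = round(limit * (d_low - d_high) / (d_low + d_high))
    return (0, y, 0)


batch = [[0.2, 0.4, 0.6], [0.9, 0.9, 0.9]]
first = list(stream_points(batch, toy_transform))

# Appending a new record (here a duplicate of the first) leaves the earlier
# points unchanged, and the duplicate lands on the same display point.
second = list(stream_points(batch + [[0.2, 0.4, 0.6]], toy_transform))
```

By contrast, a method that fits all inter-point distances at once has to revisit every record when one is added, which is the behavior the text attributes to NLM.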
In this transformation, the sequence in which the features are arranged in the pattern matrix is a decisive factor for the computed coordinate values. Thus, when the information consists of measurements from various sources, the arrangement can be used as a means of finding different modes of clustering. This is caused by the fact that the hypercube on the representation space is not exactly symmetrical, as illustrated in Figure 3. When a certain property is to be extracted from a set of information, there must exist an optimum arrangement which gives the most definite classification in regard to that property. In this arrangement, it is generally observed that components which are intrinsically similar in their contribution to the given property are arranged in proximity in the series of information. In supervised learning, it is possible to compare the N!/2 arrangements and pick the one which gives the best separation of classes. From this, in turn, we can deduce the relationships of intrinsic contribution among those components to the given property under classification. It is suggested by Kowalski that the most important valence (V), melting point (MP), covalent radius (Rc), ionic radius (RI), electronegativity (El), and heat of fusion (H) of elements

Figure 1. 11-dimensional reference information

Figure 2. Reference information on the three-dimensional information space (a) and their representation points on the display space (b)

If they are mutually orthogonal, then the line $X_-X_+$ becomes shorter than either $Y_-Y_+$ or $Z_-Z_+$. Here we see the distortion introduced by the representation-space transformation. On the other hand, we can also say that the transformation does not distort the data structure at all if we admit a nonorthogonal information space in which the angle between axis [1] and axis [3] is a little larger and, consequently, those three lines become mutually orthogonal and of the same length. Extending this consideration to high-dimensional information space vindicates the capability of the present algorithm to preserve the data structure.

The normalization process taken in the preprocessing of the measurement values assures that all information points are in a hypercube of unit edge lengths. Owing to the nature of the present algorithm, the data structure outside the hypercube is not preserved by the transformation. Such hypercubes with dimensionality larger than three are very difficult to conceive. They can, however, be transformed into the display space for visual examination. Examples are shown in Figure 3. All information will give representation points in the region enclosed.
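The remark about unequal line lengths can be checked numerically for the three-dimensional case. The reference vectors below are hypothetical: they are built from the ramp $p_m = (m - 1)/(N - 1)$ with N = 3 to mimic the left-right, low-high, and diverse-centralized pairs named earlier, and they are not taken from the authors' reference set. With orthogonal axes and Euclidean distance, this choice reproduces the stated behavior: the X line is shorter than the Y and Z lines.

```python
import math
from typing import Sequence


def euclid(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance in an orthogonal information space."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


# HYPOTHETICAL references for N = 3, where p_m takes the values 0, 0.5, 1.
x_minus, x_plus = (1.0, 0.5, 0.0), (0.0, 0.5, 1.0)  # left vs right inclination
y_minus, y_plus = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)  # low vs high intensity
z_minus, z_plus = (1.0, 0.0, 1.0), (0.0, 1.0, 0.0)  # diverse vs centralized

len_x = euclid(x_minus, x_plus)  # sqrt(2): only axes [1] and [3] contribute
len_y = euclid(y_minus, y_plus)  # sqrt(3)
len_z = euclid(z_minus, z_plus)  # sqrt(3)
```

Under this assumed set the X line differs from the others only through components [1] and [3], which fits the observation that a slightly larger angle between axes [1] and [3] would lengthen it to match the other two.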
