Anal. Chem. 1981, 53, 2354-2356



R. A. Guilmette* A. S. Bay Inhalation Toxicology Research Institute Lovelace Biomedical and Environmental Research Institute P.O. Box 5890 Albuquerque, New Mexico 87185

RECEIVED for review July 20, 1981. Accepted September 23, 1981. Research conducted under Contract No. DE-AC04-76EV01013 between the U.S. Department of Energy and Lovelace Biomedical and Environmental Research Institute.

Modified Heuristic for Generating Tree-Structured Spectral Libraries

Sir: The time needed to search sequentially through a collection of objects, such as a library of infrared spectra, increases linearly with the number of objects. Libraries of reference spectra have become quite large, often containing tens or hundreds of thousands of spectra. As computerized instrumentation is used more routinely, these collections will continue their rapid growth. Techniques which are more efficient than sequential access are needed for handling these large collections of chemical information.

One way to make the library searching process more efficient is to place the library members into some form of unique order, analogous to a dictionary. For example, to look for the 10 infrared spectra which are most similar to a given unknown spectrum, one would determine in advance the portion of the dictionary in which they should appear and then look only at that small portion of the library.

A digitized spectrum can be envisioned as a list of numbers, with each number being the intensity at a given wavelength. One could put the spectra into a unique order based on the intensity at the first wavelength and then resolve any ties by using the intensity at the second wavelength, etc. This constitutes lexicographical or “alphabetical” order. There are two major problems with this approach. First, only the initial few wavelengths in the list contribute significantly to the position in the listing, so any slight experimental deviation at those wavelengths will alter the position. Second, spectra which are obviously similar would probably not be “shelved” near each other.

To overcome these problems, the library needs to be arranged so that the spectra which are the most similar are “near” each other. This can be accomplished by structuring the data in a “tree” format. A tree is a graph theoretical construct which has a root node to which all other nodes are connected.
The nodes may also be called vertices, and the connections are called edges or branches. The terminal nodes are called leaves. The graph consisting of an arbitrary node and all branches and nodes out to the leaves is also a tree, and is called a subtree. In a tree there are no cycles; that is, one cannot start at a node and traverse the tree back to the starting node without retracing some steps.

In this study the original library spectra constitute the leaves, and each branch point is formed from the numerical average of all the spectra contained in that subtree. Starting from the root of the tree, each branch point divides the remaining spectra into two groups whose members are more similar to each other than to the members of the other group. The branch points of the tree are decision points at which one chooses the path leading to the spectra which are most similar to the unknown spectrum being searched. These trees are usually called binary trees, or bitrees, since there are only two paths leading from each branch point.

0003-2700/81/0353-2354$01.25/0 © 1981 American Chemical Society

Binary tree structured information has been used in analytical chemistry in the field of pattern recognition. In the technique known as cluster analysis (1), one tries to discover groups of similar objects which might be classified into the same category. This same approach can be used for library searching of spectral data with considerable savings in interpretation time over the sequential approach. In general, searching a spectrum through an N-spectra library takes N comparisons between the unknown spectrum and the library spectra, but only $\log_2 N$ comparisons if the library is structured as a balanced bitree. This gain in efficiency becomes quite significant as N increases. For example, for a library containing 262 000 spectra, it would take only 18 comparisons to find a match using an optimally tree structured library.

The most probable reason that tree structured libraries are not being used routinely in spectral interpretation is the large number of computations needed to construct the tree. To build the tree, one needs to compute the similarity between each pair of spectra in the library. For a 262 000 spectra library, this would constitute approximately 34 billion multidimensional distance calculations: a formidable task.

Recently, Zupan (2) presented a heuristic for efficiently generating binary similarity trees. For a 262 000 spectra library, the tree could be constructed with roughly 10 000 times less computational effort. This heuristic gains its efficiency by using the tree to grow the tree as new spectra are added. One drawback to this heuristic is that the actual structure of the resulting tree is somewhat dependent on the ordering of the original spectra.
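The comparison counts quoted above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: it assumes the "262 000 spectra" figure stands for $2^{18}$ = 262 144 spectra, and the function name `search_costs` is hypothetical.

```python
import math

def search_costs(n_spectra):
    """Return (sequential comparisons, balanced-bitree comparisons, spectrum pairs)."""
    sequential = n_spectra                       # one comparison per library spectrum
    tree = math.ceil(math.log2(n_spectra))       # one decision per level of a balanced bitree
    pairs = n_spectra * (n_spectra - 1) // 2     # distances needed before formal clustering starts
    return sequential, tree, pairs

# Assumption: 262 000 in the text is shorthand for 2**18 = 262 144 spectra.
sequential, tree_cmp, pairs = search_costs(2 ** 18)
print(sequential, tree_cmp, pairs)
```

Under that assumption the tree search needs 18 comparisons and the pairwise count is roughly 3.4 x 10^10, in line with the figures in the text.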
In this paper we present an improved heuristic which produces a tree which is identical with that obtained in the formal, though cumbersome, clustering approach.

THEORY

For the purpose of computerized processing and interpretation of chemical information, an infrared spectrum is represented as a vector

$$X_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d}) \tag{1}$$

where each element $x_{i,j}$ is the spectral intensity for spectrum i at the jth wavelength from a total of d wavelengths. One can then consider each spectrum to be a point in a d-dimensional Cartesian coordinate system where each axis corresponds to one of the d wavelength resolution elements. For a given spectrum $X_i$, the position along axis j is the intensity $x_{i,j}$ for that wavelength.


In this study, the similarity between two spectra $X_i$ and $X_k$ is measured by using the Cartesian distance $d_{X_i,X_k}$ between the spectra

$$d_{X_i,X_k} = \left[ (x_{i,1} - x_{k,1})^2 + (x_{i,2} - x_{k,2})^2 + \cdots + (x_{i,d} - x_{k,d})^2 \right]^{1/2} \tag{2}$$

If two spectra are identical the distance between them will be zero, and the distance increases as the spectra differ.

Formal Clustering. In the formal clustering technique (3), one begins by computing the distances between the $N(N-1)/2$ pairs of vectors. One then locates the closest pair of spectra and “ties” these together to form a new vertex. The location of the new vertex is taken to be the arithmetic average of the original vectors in the cluster. In the tree being formed, this new vertex constitutes the branch point leading to this pair of closest vertices. The two vertices are removed from further consideration. The distances from this new vertex to the remaining vertices are then calculated. The next closest pair of vertices is located and a new node is formed. This process is continued until all vectors have been tied together, resulting in a similarity tree. For N spectra, this approach uses $(N-1)^2$ distance calculations to construct the full tree.

Zupan’s Heuristic (2). In this technique one constructs the tree by sequentially adding each spectrum to the currently existing and updated version of the tree. To add a new spectrum to the tree, one starts at the root node. One calculates three distances: the distances from the new spectrum to each of the two average spectra to which the root node is connected, $d_1$ and $d_2$, and the distance between these two average spectra, $d_3$. If either $d_1$ or $d_2$ is the smallest of the three, one proceeds down the corresponding branch. If $d_3$ is the smallest, then the two average spectra are more similar to each other than to the new spectrum. In this case a new branch is sprouted containing only the new spectrum. This process is repeated until a leaf of the tree is reached. As the new spectrum traverses the tree, the values of the average spectra at each node which include the new spectrum are updated to include its contribution.
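For comparison with the heuristic just described, the formal clustering procedure and the distance of eq 2 can be sketched in Python. This is a minimal illustration, not the FORTRAN routine used in this study: the function names, the nested-tuple tree representation, the tie-breaking rule (first closest pair found), and the five sample points are all assumptions made for the example.

```python
import math
from itertools import combinations

def distance(a, b):
    """Cartesian distance of eq 2 between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def formal_clustering(spectra):
    """Return (tree, number of distance calculations).

    Leaves are integer indices into `spectra`; branch points are 2-tuples.
    """
    # Active vertices: id -> (average vector, number of spectra, subtree)
    active = {i: (list(v), 1, i) for i, v in enumerate(spectra)}
    calls = 0
    dist = {}
    for a, b in combinations(active, 2):       # the N(N - 1)/2 initial pairs
        dist[(a, b)] = distance(active[a][0], active[b][0])
        calls += 1
    next_id = len(spectra)
    while len(active) > 1:
        a, b = min(dist, key=dist.get)         # closest pair of vertices
        (va, na, ta), (vb, nb, tb) = active.pop(a), active.pop(b)
        dist = {k: d for k, d in dist.items() if a not in k and b not in k}
        # New vertex: average of all original spectra in the merged cluster
        mean = [(na * p + nb * q) / (na + nb) for p, q in zip(va, vb)]
        for other in active:                   # distances from the new vertex only
            dist[(other, next_id)] = distance(active[other][0], mean)
            calls += 1
        active[next_id] = (mean, na + nb, (ta, tb))
        next_id += 1
    (_, _, tree), = active.values()
    return tree, calls

# Hypothetical two-dimensional points, not the test set of ref 2.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
tree, calls = formal_clustering(points)
print(tree, calls)
```

Counting the calls to `distance` in this sketch reproduces the $(N-1)^2$ total stated above: the initial $N(N-1)/2$ pairs plus one distance to each remaining vertex after every merge.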
The efficiency of this heuristic arises from the fact that only the distances which are actually needed are calculated. It is not possible to calculate the efficiency of this approach for an arbitrary tree, since this depends on the actual branching in that tree. However, for a best case in which the tree remains exactly balanced, it can be shown that the total number of distance calculations needed to construct a tree for N spectra is

$$3 \sum_{i=1}^{n-1} i \cdot 2^{i} = 6\left[(n-2) \cdot 2^{n-1} + 1\right] \tag{3}$$

where the summation is from i = 1 to i = n − 1 and $n = \log_2 N$. This number of distances is much smaller than that for formal clustering, especially as N becomes large. For example, for N = 64, formal clustering requires 3969 distance calculations while this heuristic uses only 678.

Although Zupan’s heuristic is exceedingly efficient, the actual tree achieved is seen to depend on the initial ordering of the spectra. This is because as a spectrum is added to the tree, the position that it finally adopts depends on the spectra which have already been incorporated into the tree. This inconsistency must be overcome for spectral library searching, since one needs to be able to find the spectra which are most similar to the unknown spectrum. In the next section we present an improved heuristic which transforms the Zupan derived tree into the formal clustering tree.

ANALYTICAL CHEMISTRY, VOL. 53, NO. 14, DECEMBER 1981

The Improved Heuristic. The fundamental idea for this improvement is quite simple: If an arbitrary spectrum is deleted from the tree and then searched back through the tree, one of two possibilities will be realized. Either the spectrum will go back to its original place in the tree or it will adopt a new, more appropriate position. In this rechecking process, the entire library of spectra is utilized to position the spectrum being considered, rather than only those spectra which preceded this spectrum in the original list of spectra. Since all of the available information is used to find the final position of each spectrum in the tree, each spectrum is more likely to be connected in the tree as it would be with the formal algorithm than when using Zupan’s heuristic alone. This improvement should only approximately double the number of distance calculations needed to construct the tree and will still be much more efficient than the formal algorithm. In this paper we will demonstrate the effectiveness of this improvement using the test set of data that Zupan used in his original publication (2).
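The delete-and-reinsert idea can be illustrated with a short Python sketch. It follows one reading of the description in this section and is not the program used in this study: the `Node` class, the choice to store running sums rather than averages, the handling of ties (a new branch is sprouted only when $d_3$ is strictly the smallest), and the sample points are all assumptions.

```python
import math

class Node:
    """Tree node; `total` is the sum of the leaf spectra below it, `count` their number."""
    def __init__(self, total, count=1, label=None):
        self.total, self.count, self.label = list(total), count, label
        self.left = self.right = self.parent = None

    @property
    def mean(self):
        return [t / self.count for t in self.total]

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def _join(old, leaf):
    """Put a new branch point in place of `old`, with children `old` and `leaf`."""
    branch = Node([p + q for p, q in zip(old.total, leaf.total)], old.count + leaf.count)
    branch.parent = old.parent
    if old.parent is not None:
        if old.parent.left is old:
            old.parent.left = branch
        else:
            old.parent.right = branch
    branch.left, branch.right = old, leaf
    old.parent = leaf.parent = branch
    return branch

def insert(root, leaf):
    """Zupan-style insertion as read from the text; returns the (possibly new) root."""
    if root is None:
        return leaf
    node = root
    while node.left is not None:
        d1 = distance(leaf.mean, node.left.mean)
        d2 = distance(leaf.mean, node.right.mean)
        d3 = distance(node.left.mean, node.right.mean)
        if d3 < d1 and d3 < d2:
            break                       # sprout a new branch at this node
        node.total = [t + x for t, x in zip(node.total, leaf.total)]
        node.count += 1                 # node now includes the new spectrum
        node = node.left if d1 <= d2 else node.right
    branch = _join(node, leaf)
    return branch if node is root else root

def remove(root, leaf):
    """Delete `leaf`, splice in its sibling, and correct the ancestor sums."""
    parent = leaf.parent
    if parent is None:
        return None                     # the leaf was the whole tree
    sibling = parent.right if parent.left is leaf else parent.left
    grand = parent.parent
    sibling.parent = grand
    if grand is None:
        root = sibling
    elif grand.left is parent:
        grand.left = sibling
    else:
        grand.right = sibling
    node = grand
    while node is not None:             # walk from the old position up to the root
        node.total = [t - x for t, x in zip(node.total, leaf.total)]
        node.count -= 1
        node = node.parent
    leaf.parent = None
    return root

def build(spectra):
    leaves = [Node(v, label=i) for i, v in enumerate(spectra)]
    root = None
    for leaf in leaves:
        root = insert(root, leaf)
    return root, leaves

def recheck(root, leaves):
    """Improved heuristic: delete each spectrum once and search it back through the tree."""
    for leaf in leaves:
        root = insert(remove(root, leaf), leaf)
    return root

# Hypothetical two-dimensional points, not the test set of ref 2.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
root, leaves = build(points)
root = recheck(root, leaves)
```

Storing sums and counts at each node makes both the deletion and the reinsertion a simple add or subtract along one root-to-leaf path, which is why the recheck costs about as many distance calculations as the original insertion.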

EXPERIMENTAL SECTION

All programming was performed in FORTRAN on the Boston University IBM-370 Virtual Processing System. A formal clustering routine was written to facilitate the verification of proper tree generation for the different versions of the heuristic programs. Another program was written to perform the Zupan heuristic. The approach used was strictly that presented in the original publication (2), which should be consulted for details. This program was modified to be used as a subroutine in the improved heuristic program, since it was needed in both the original tree building and the checking step.

The data storage for the tree was altered from that used by Zupan to facilitate moving up and down the tree. For each node of the tree, there must be two entries to store the numbers of the two nodes to which the current node is connected and one entry to hold the number of the node through which the current node is connected to the root node. The spectra and vertex average spectra in the tree are stored in a type REAL array VEC(IP,NMAX), where IP is the number of wavelength channels. For N spectra, the total number of nodes and spectra will be NMAX = 2N - 1. A type INTEGER array IADD(5,NMAX) holds the information describing the connection of each node in the tree, the number of spectra contained in this branch, and the spectrum identification number if it is a leaf of the tree.

The improved heuristic program has four sections. In the first section the arrays are initialized, the number of spectra, number of wavelength channels, etc. are defined, and the spectra are read from disk storage. In the second section the tree is built by using the Zupan tree heuristic. In the third section the position of each spectrum is checked. This is done in two steps. In the first step, the spectrum is deleted from the tree, moving from the leaf to the root node, and the average spectra at each node affected are corrected. In the second step this spectrum is repositioned on the tree, moving from the root node to the new leaf where this spectrum ends up. The average spectra are updated to include the new spectrum as the tree is traversed. In the fourth section, the structure of the final tree is displayed.

As the position of each spectrum is checked, it is possible to detect any changes in the tree. This can be done by saving the average spectrum at the node to which the spectrum being checked was connected. If there is no change in the tree, the repositioned spectrum will still be connected to a node with the same average spectrum as it had before it was repositioned.

A listing of the programs used in this study can be provided by the author upon request.
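The correction of the average spectra in the checking step follows directly from the definition of an arithmetic average. As a worked sketch, let $\bar{X}$ be the average spectrum stored at a node that currently contains $m$ spectra, and let $X_i$ be the spectrum being deleted or added (the symbols $\bar{X}$ and $m$ are introduced here for illustration and are not taken from the original program):

```latex
% Deleting X_i from a node that contains m spectra (m > 1)
\bar{X}_{\mathrm{deleted}} = \frac{m\,\bar{X} - X_i}{m - 1}

% Adding X_i to a node that contains m spectra
\bar{X}_{\mathrm{added}} = \frac{m\,\bar{X} + X_i}{m + 1}
```

Both updates need only the stored average and the number of spectra in the branch, which is presumably why that count is kept with each node.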

RESULTS AND DISCUSSION

The test data set used by Zupan (2) consists of 10 two-dimensional points (Figure 1). The tree produced by the formal clustering algorithm using any order of vectors is shown in Figure 2. The tree produced by the Zupan heuristic using the ordering of points 6, 10, 8, 9, 7, 1, 3, 2, 4, 5 is shown in Figure 3. There are significant differences between this tree and the formal tree. Other orderings of initial vectors give somewhat different trees.

By use of the improved heuristic, each vector is checked to ascertain whether it should be connected in a more appropriate way. When point 6 is repositioned it assumes its correct place connected to the pair 4, 5 (Figure 4A). Then point 5 is checked with no change, followed by point 10, which is changed to be correctly paired with point 9 (Figure 4B). Point 8 is then repositioned to be paired with point 7 (Figure 4C). Then points 9, 7, and 4 are checked with no change. Finally, points 1, 3, and 2 are checked with only a trivial change since points 1 and 3 are equidistant from point 2. The tree is now identical with the formal clustering tree (Figure 2).

In the present version of the program, the order in which spectra are selected for checking is not specified. Each spectrum is checked once, in the order in which the spectra happen to be sequentially stored in the tree array. With more involved data sets, changes in tree connections for one spectrum might affect the most appropriate connection for other spectra. In this situation, a slightly more elaborate checking scheme might be used in which a detected change in the tree would trigger the checking of other related spectra in that subtree.

The efficiency of these approaches can be compared using the number of distance calculations as an efficiency metric. For this ten-point test data set, constructing the tree using the formal clustering algorithm required 165 distance calculations. The Zupan heuristic used only 39 distances to form the tree in Figure 3. The improved heuristic used an additional 159 distances to modify the Zupan tree into the formal tree. The efficiency of the approach improves as N increases. For a 20 data point test set, the formal algorithm required 1110 distances while the Zupan heuristic used only 114 distances and the improved heuristic used an additional 495 distances.

We are presently employing this approach to construct trees of vapor-phase infrared spectra (4) for library searching and related studies.

Figure 1. The two-dimensional, ten-point test data set, from ref 2.

Figure 2. Tree resulting from the formal clustering algorithm and the improved heuristic.

Figure 3. Tree resulting from the Zupan heuristic.

Figure 4. Partial trees resulting from several stages of the improved heuristic: (A) partial tree after repositioning point 6, (B) partial tree after positioning point 10, (C) partial tree after positioning point 8.

LITERATURE CITED

(1) Penca, M.; Zupan, J.; Hadzi, D. Anal. Chim. Acta 1977, 95, 3.
(2) Zupan, J. Anal. Chim. Acta 1980, 122, 337.
(3) Warren, F. V.; Delaney, M. F., submitted for publication in Appl. Spectrosc.
(4) Delaney, M. F.; Warren, F. V. Anal. Chem. 1981, 53, 1460.

Michael F. Delaney
Department of Chemistry
Boston University
Boston, Massachusetts 02215

RECEIVED for review July 10, 1981. Accepted August 28, 1981. Acknowledgment is made to the donors of the Petroleum Research Fund, administered by the American Chemical Society, for the support of this research.

Anal. Chem. 1981, 53, 2356-2358

Liquid Chromatography of Coal Oil Fractions

Sir: In recent years, interest in coal liquefaction products has rapidly increased. These products contain many more heterocyclic compounds such as phenols and pyridines than petroleum fractions in the same boiling range. Therefore it was necessary to develop an analytical procedure for characterizing these compounds as well as the aromatics and naphthenes. We assume that high-performance liquid chromatography (HPLC) is especially suitable for high boiling fractions. Although thousands of HPLC applications have been published, only a few papers deal with liquid coal products (1-10). The literature cited results partly from a computer search. It may not be complete, but it reflects the present level of research in this field.

The primary aim of our work was not to separate and determine compounds but to get well-resolved chromatograms as fingerprints of each fraction. For this reason it was necessary to test many stationary and mobile phases in order to determine the optimum conditions for each fraction.

EXPERIMENTAL SECTION

Apparatus. The liquid chromatograph used in the experiments was a modular system manufactured by Laboratory Data Control, Division of Milton Roy (LDC). The columns were of

0003-2700/81/0353-2356$01.25/0 © 1981 American Chemical Society