Anal. Chem. 1986, 58, 3219-3225


Feedback Search of Hierarchical Trees

Jure Zupan*

Institute of Chemistry “Boris Kidric”, Ljubljana, Yugoslavia

Morton E. Munk*

Department of Chemistry, Arizona State University, Tempe, Arizona 85287

The idea of the feedback search in a hierarchically organized data base is introduced and discussed. This procedure was implemented on a hierarchical tree consisting of 219 full-curve infrared spectra of a variety of organic compounds by selecting vertices rejected in the first pass through the tree as entry points for new searches. The improvement in performance of the feedback search in inferring the structural features of a compound of unknown structure and in retrieving the most similar spectrum in the data base is described.

As computer-based methods for the storage and retrieval of infrared spectral data became a reality, the attention of spectroscopists turned naturally toward automated systems for the identification of organic compounds using these spectral data. It was recognized that, to be of value to the practicing chemist, such systems had to do more than identify a compound whose infrared spectrum is contained in the reference library; they had to provide structural insight into compounds not part of the reference library. Three major approaches, the boundaries between which are not always distinct, have received wide attention in achieving the latter goal: the interpretive library search (1, 2), pattern recognition (3), and artificial intelligence (4-6).

Recently, we reported the results of a study on the application of a hierarchically organized infrared data base to the solution of structural problems in organic chemistry (7). It was demonstrated that a hierarchical tree generated from infrared spectra as objects (more precisely, from the truncated sets of their Fourier coefficients obtained by the fast Fourier transformation of the full-curve spectra) serves three roles: (a) 100% retrievability of spectra in the data base, (b) the reliable prediction of the presence or absence of particular structural features of compounds whose spectra are not contained in the data base, and (c) identification of the spectrum (and therefore compound) in the data base most similar to one not contained in the data base. The last application is intended to provide insight into the structural class to which a compound of unknown structure belongs, e.g., steroid, nucleoside.

Our purpose in undertaking the present study was to extend the interpretive library searching capabilities of the method in two ways. First, we wished to enlarge the breadth of structural features inferred from the infrared spectrum of a compound not in the reference library. A set of structural inferences characterized by both high reliability and breadth could play a central interpretive role in an automated structure elucidation system such as CASE (8). Second, we wished to enhance the value of such hierarchically organized infrared data bases (9-13) in providing the chemist additional insight into the overall structural nature of compounds that are not part of the library. In this paper we describe an algorithm for a “feedback” search of the hierarchical tree, discuss the added capabilities it offers, and evaluate its performance with a test set of spectra.

FEEDBACK SEARCH

Background. In a search system in which the reference infrared spectra are ordered in a binary hierarchical tree, a given query (the spectrum of a compound of unknown structure) begins the search procedure at the root node and traverses exactly the same path through the tree each and every time it is entered. The actual path through the tree is determined by calculating three distances (D) at each vertex through which the query (X) passes

$$d_1 = D(X, V_1) \tag{1}$$

$$d_2 = D(X, V_2) \tag{2}$$

$$d_3 = D(V_1, V_2) \tag{3}$$

where V1 and V2 are the left and right descendants, respectively, of the vertex V at which the calculation is made. The query moves toward the descendant vertex of minimal distance unless d3 is the smallest of the three distances, in which case passage through the tree ceases at vertex V.

Each vertex V in the tree is the “root node” of a cluster of “objects” (spectra). If the compounds corresponding to the spectra in one of the clusters have a particular structural feature in common (e.g., an ester function), the assignment of that structural feature to an unknown compound can be made as it passes the root node of that cluster, provided that prior evaluation using test spectra has validated such an assignment (7). Since the query follows a single path through the tree, structural inferences are limited to assigned nodes along that path only. In this application of the hierarchical tree, the scope of the search could possibly be broadened by selectively searching other pathways.

The root node of a cluster of spectra can be regarded as the “average spectrum” of the cluster. Thus, the distances d1, d2, and d3 calculated at each vertex other than a terminal node provide a comparative similarity measure between a query and each of the two average spectra. Experience has shown that, in most cases, movement of a query not part of the library is in the direction of the cluster containing the reference spectrum most similar to the query. (Operationally, the “most similar” reference spectrum (Vr) is that which is “nearest” to the query (X) in the search space, i.e., min(dX,Vr).) However, because average spectra are involved in the calculations, at times one or more spectra exist in the tree which are even more similar to the query than the spectrum to which it is linked in the search process. Expanding the search to selectively include other pathways could raise the probability of retrieving the most similar spectrum (and therefore compound) in the reference library.
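The decision rule of eqs 1-3 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' program: the Euclidean metric, the names `distance` and `decide`, and the handling of ties are assumptions made here, since the text specifies only that the query moves toward the nearer descendant unless d3 is the minimum.

```python
import math
from typing import Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Distance D between two coefficient vectors (Euclidean here, by assumption)."""
    return math.dist(a, b)


def decide(query: Sequence[float], v1: Sequence[float], v2: Sequence[float]) -> str:
    """Return 'left', 'right', or 'stop' for a single vertex of the tree."""
    d1 = distance(query, v1)  # eq 1: query to left descendant
    d2 = distance(query, v2)  # eq 2: query to right descendant
    d3 = distance(v1, v2)     # eq 3: left descendant to right descendant
    if d3 < min(d1, d2):      # d3 is the minimum: the search stops at this vertex
        return "stop"
    return "left" if d1 <= d2 else "right"  # ties go left (an assumption)
```

For example, with descendants at (0, 0) and (1, 0), a query at (0.9, 0) moves right, whereas a query at (0.5, 3) lies farther from both descendants than they lie from each other, so the search stops at that vertex.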
Further, retrieval of a group of similar spectra (compounds), rather than a single spectrum, may broaden the scope of information provided to the chemist by the search. If several compounds of the same structural class are retrieved, the reliability of assigning that class to the unknown is enhanced.

© 1986 American Chemical Society

ANALYTICAL CHEMISTRY, VOL. 58, NO. 14, DECEMBER 1986

Figure 1. Computer output for a complete feedback search for query spectrum ID number 7511118 (X). Structural inferences are reported whenever a structurally assigned vertex (V) is encountered with a calculated distance between X and V that is less than the user-set threshold (Dt). The structure of the reference entry retrieved in the first pass and in each subpath searched is shown to the right of the output for convenience. The topmost structure is that of the query. The output provides the ID number of the retrieved reference and the calculated distance between it and the query.

Implementation. The feedback search allows the user to initiate a search at any or all of the vertices rejected in the first pass of the query through the tree. The hierarchical search has been reprogrammed to provide as output the information required to make an informed selection of rejected vertices at which to initiate additional searches. This information follows the output of the results of the first-pass search (Figure 1). It includes the vertices rejected at each level in the first-pass search and, at each vertex, the absolute difference in distance, |d1 - d2|, between the query and each of the two descendant nodes. A large difference suggests an “easy” decision between two descendant nodes; a small difference indicates that the rejected vertex might be worth considering as a “root node” for the feedback search.

In automatic mode, following the first pass, the program executes all feedback searches, beginning with the first rejected vertex (vertex 2 in the example of Figure 1) and continuing until the last one along the original pathway (vertex 40). Note that vertices 295, 252, and 275 are spectra, not the root nodes of clusters (Figure 2). If the average path in the binary hierarchical tree of Ntot spectra has a length L (meaning that L vertices will be visited, on the average, in the first-pass search) and the binary tree is well balanced, then the complete feedback procedure will require

$$\frac{(L + 1)L}{2} = \frac{L^2 + L}{2} \tag{4}$$

comparisons. Although the number of comparisons grows quadratically with the average path length in the tree, the feedback search is still far more economical than any sequential search because L is logarithmically proportional to the number of spectra in the tree (Ntot)

$$L \propto \log_2 N_{tot} \tag{5}$$

Prior to the execution of the feedback search (manual or automatic), three parameters are set by the user: the width of the safety wall, the distance measure threshold Dt above which a structural inference is not made at a structurally assigned vertex, and the distance difference threshold Δt, expressed as a percentage of the smaller of the distances d1 and d2, above which a feedback search along a rejected vertex will not be executed in the automatic mode. The feedback search described in Figure 1 has Δt set to 100, i.e., all rejected vertices are considered. For most applications, Δt is set to 40%. In the search shown in Figure 1, this would result in bypassing searches at vertices 295, 33, and 60.

The width of the safety wall d3' around cluster representations can be mathematically expressed as a function f of a constant A and the distance d3, the latter being the distance between the two descendants V1 and V2 of a given cluster representation V. The requirements imposed on the function f are as follows:

$$\lim_{A \to 0} d_3' = d_3, \qquad \lim_{A \to \infty} d_3' = \infty \tag{8}$$

Additionally, f has to take into account the sizes of clusters V1 and V2 and be simple enough for rapid execution, as it is used in the innermost loop of the clustering (or search) algorithm (11, 13). In our implementation, d3' is calculated with eq 9, in which N1, N2, and Ntot are the numbers of spectra in clusters V1 and V2 and in the entire reference library, respectively. It is evident that eq 9 meets both conditions in eq 8. If the user chooses a very large value for A, d3' will never be the smallest of the three distances d1, d2, and d3', and hence the search will always continue to a terminal node. The three-distance clustering method (3-DCM) becomes, in effect, the two-distance method (11). As the value of A is decreased, the probability of terminating a search at a higher level in the tree is increased. As a consequence, the query may be linked to fewer spectra in the data base and the number of structural inferences made (assigned vertices passed) may decrease. However, as shown below, in drawing structural inferences, the accuracy of the predictions increases as A decreases.
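Returning to eqs 4 and 5, the cost of a complete feedback search can be checked with a short calculation. The sketch below takes one reading of eq 4, namely that in a well-balanced tree the first pass visits L vertices and each later subpath visits one vertex fewer than the previous one; the function names are illustrative only.

```python
import math


def feedback_comparisons(path_length: int) -> int:
    """Eq 4: vertices visited in a complete feedback search, L(L + 1)/2."""
    return path_length * (path_length + 1) // 2


def triangular_sum(path_length: int) -> int:
    """The same count summed level by level: L + (L - 1) + ... + 1."""
    return sum(range(1, path_length + 1))


def log2_library_size(n_spectra: int) -> float:
    """log2(Ntot), the quantity to which L is proportional in eq 5."""
    return math.log2(n_spectra)


for path_length in (7, 13, 14):
    print(path_length, feedback_comparisons(path_length), triangular_sum(path_length))
print(round(log2_library_size(219), 2))
```

With L = 13 the count is 91, which appears consistent with the “about 90 decisions” quoted in the Results for an average of 13 subpath searches; 242 such searches give 22 022, i.e., about 22 000.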


With the option to vary the value of the constant A, the range of program applications available to the user is extended. For direct transfer of structural inferences to CASE, high accuracy is called for; thus, a small value of A will be selected. When the program is used as a stand-alone infrared spectrum interpreter, the chemist may choose to sacrifice some accuracy in predictions for a greater breadth of ideas about the possible structural features present. If so, higher values of A will be used. The parameter Dt, the distance measure threshold, also allows the user to adjust the balance between accuracy and breadth. In applications to identify a group of spectra (and therefore compounds) with characteristics similar to those of a query, the number contained in the group retrieved increases as the value of A is increased. Clearly, the group of structures derived from a feedback search is more informative than the single structure of a first-pass search. Members of the group of structures can be ranked in similarity on the basis of their distances to the query.

The feedback search output for a sample query (ID no. 7511118) in the test set of 242 spectra (7) is shown in Figure 1. The entered search parameters A and Dt are output first, followed by a description of the outcome of the first-pass search: two structural inferences (hydroxyl and carbonyl assigned at vertices 13 and 59, respectively) and a link to reference spectrum ID no. 7404434 at vertex 272 (the distance between query and reference is 2686). The pathway of the first pass through the tree is shown in Figure 2 in bold type and summarized vertex-by-vertex in Figure 1. In manual mode, the program interrupts execution after this vertex-by-vertex summary. After inspection, the rejected vertices to be searched can be selected by the user. In the automatic mode, all vertices consistent with the parameters entered are searched without interruption.

The first vertex used as a “new” root is number 2, at which the smallest difference Δ = 331 was obtained. Interestingly, along this “subpath” the same structural inferences (hydroxyl and carbonyl) were made (Figure 1). The structure of the compound to which the query was linked at the end of this subpath was an even “better” reference (smaller distance) than the compound retrieved during the first pass. The three spectra, the query X and the two retrieved, A and B, together with their corresponding structures are shown in Figure 3. Each of the structures retrieved in the feedback search for query ID number 7511118 is shown in Figure 1 adjacent to the vertex from which it arises. The structure at the top is that of the query.
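The automatic mode described above can be sketched as follows. This is a hedged illustration, not the authors' program: the `Vertex` class, the function names, the Euclidean metric, and the toy data are all assumptions, and the plain distance d3 is used in place of the safety wall d3' (the A → 0 limit of eq 8). The difference Δ is taken as 100 × |d1 - d2| / min(d1, d2), following the description of Δt as a percentage of the smaller of d1 and d2.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Vertex:
    ident: int
    center: Sequence[float]           # average spectrum, or the spectrum itself at a leaf
    left: Optional["Vertex"] = None
    right: Optional["Vertex"] = None


def descend(query: Sequence[float], start: Vertex) -> Tuple[Vertex, List[Tuple[Vertex, float]]]:
    """Three-distance descent from `start`.

    Returns the vertex where the search ends and the rejected vertices,
    each paired with its percentage difference delta.
    """
    vertex, rejected = start, []
    while vertex.left is not None and vertex.right is not None:
        d1 = math.dist(query, vertex.left.center)
        d2 = math.dist(query, vertex.right.center)
        d3 = math.dist(vertex.left.center, vertex.right.center)  # plain d3, no safety wall
        if d3 < min(d1, d2):
            break  # d3 is the minimum: stop at this vertex
        smaller = min(d1, d2)
        delta = 100.0 * abs(d1 - d2) / smaller if smaller > 0 else math.inf
        if d1 <= d2:
            chosen, other = vertex.left, vertex.right
        else:
            chosen, other = vertex.right, vertex.left
        rejected.append((other, delta))
        vertex = chosen
    return vertex, rejected


def feedback_search(query: Sequence[float], root: Vertex, delta_t: float = 40.0):
    """First pass plus one subpath search per rejected vertex with delta <= delta_t.

    Returns (vertex id, distance to query) pairs ranked by distance.
    """
    end, rejected = descend(query, root)
    hits = {end.ident: math.dist(query, end.center)}
    for vertex, delta in rejected:
        if delta > delta_t:
            continue  # an "easy" decision: bypass this rejected vertex
        sub_end, _ = descend(query, vertex)
        hits[sub_end.ident] = math.dist(query, sub_end.center)
    return sorted(hits.items(), key=lambda item: item[1])


# Toy tree (hypothetical data): leaf 6 is an outlier of the right-hand cluster.
leaf4, leaf5 = Vertex(4, (-1.0, 0.0)), Vertex(5, (1.0, 0.0))
leaf6, leaf7 = Vertex(6, (3.0, 0.0)), Vertex(7, (9.0, 0.0))
v1 = Vertex(2, (0.0, 0.0), leaf4, leaf5)
v2 = Vertex(3, (6.0, 0.0), leaf6, leaf7)
root = Vertex(1, (3.0, 0.0), v1, v2)

first_pass_end, _ = descend((2.6, 0.0), root)
print(first_pass_end.ident)
print([ident for ident, _ in feedback_search((2.6, 0.0), root)])
```

In this toy tree the first pass ends at leaf 5 (distance 1.6), because the query is closer to the average of the left-hand cluster. The rejected right-hand cluster has Δ of about 31%, below the 40% threshold, so it is searched in the feedback step and yields leaf 6 at distance 0.4, which then heads the ranked list.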

RESULTS AND DISCUSSION

In the evaluation of the effectiveness of the feedback search, an earlier test set of 242 full-curve infrared spectra (7) was used. Each spectrum is stored as its first 100 complex Fourier coefficients and linked to its structure, which is represented in Wiswesser line-formula chemical notation (WLN). Similar structures and common structural features (substructures) are readily discerned by means of WLN representations. We focused on two applications, predicting structural features from the infrared spectrum of a query and identifying similar structures in the reference library, and compared the results of feedback and first-pass searches.

By use of the earlier generated hierarchical tree of 219 infrared spectra (7), an average of 14 vertices are considered during the first-pass search of a query. Thus, 14 decisions are made, 14 vertices are rejected, and a maximum of 14 new roots for the feedback search are generated. The automatic mode was called for searching the test set of 242 spectra. Each complete feedback search, consisting of a first-pass search and, on the average, 13 subpath searches, requires about 90 decisions (eq 4), or a total of about 3400 pathways examined and about 22 000 decisions made for the 242 queries.

Figure 2. A section of the hierarchical tree of 219 infrared spectra showing the features of the feedback search. The bold line and filled circles (chosen vertices) trace the first-pass search. Rejected vertices are shown as open circles. Two subpath searches, from vertices 2 and 16, are traced by the dotted lines. The best possible reference match (A) to the query was found by the feedback search. The first-pass search retrieved the second best match B.

Figure 3. The infrared spectra and structures corresponding to the query X (ID number 7511118), the reference match retrieved by the first-pass search A (ID number 7404434), and the best reference match B (ID number 7618573) retrieved in the feedback search, as described in Figures 1 and 2.

To study the effects of changing the safety wall width (constant A in eq 9) and the distance threshold (Dt) on the quality of the structural inferences made, seven values of the former and nine of the latter were examined. Thus, 63 complete feedback searches were needed for each of the 242 queries. The study of prediction ability utilized the same 24 structural features assigned in the tree of 219 reference entries as described earlier (Table I, ref 7). In preparation for an analysis of the results, each of the 242 queries was assigned a set of tags identifying each of the 24 structural inferences pertinent to that particular compound. The 63 separate feedback searches were then executed for each of the 242 queries. The results, which were automatically compiled, are summarized in Table I.

Table I. Comparison of the First-Pass and Feedback Searches in Drawing Structural Inferences

Dt     Search  A = 0      A = 1      A = 2      A = 3      A = 6      A = 15     A = 50
2000   FP      0/0/0      3/0/100    3/0/100    5/0/100    5/0/100    5/0/100    5/0/100
       FB      0/0/0      3/0/100    3/0/100    5/0/100    5/0/100    5/0/100    5/0/100
3000   FP      6/0/100    22/0/100   22/0/100   24/0/100   24/0/100   25/0/100   25/0/100
       FB      6/0/100    24/0/100   26/0/100   28/0/100   28/0/100   31/0/100   31/0/100
4000   FP      8/0/100    53/0/100   69/0/100   79/0/100   82/0/100   88/1/99    91/1/99
       FB      22/0/100   55/0/100   77/0/100   91/0/100   97/0/100   109/1/99   119/3/98
5000   FP      29/0/100   75/1/99    102/1/99   124/3/98   153/3/98   164/5/97   172/6/97
       FB      29/0/100   77/1/97    111/5/96   134/10/93  166/11/94  186/14/93  217/24/90
6000   FP      29/0/100   81/1/99    121/1/99   148/3/98   186/3/98   232/9/96   244/11/96
       FB      29/0/100   85/2/98    132/5/96   160/10/94  207/16/93  287/37/89  324/67/83
7000   FP      29/0/100   81/1/99    122/2/98   155/5/97   198/5/97   261/13/95  283/16/95
       FB      29/0/100   85/2/98    133/7/95   167/13/93  220/20/92  325/47/87  373/98/79
∞      FP      29/0/100   81/1/99    122/2/98   157/6/96   204/7/97   278/15/95  307/18/94
       FB      29/0/100   85/2/98    133/7/95   169/14/92  228/23/91  344/50/87  411/122/77

Each entry gives three numbers for one set of A and Dt values: the total number of correct inferences made using the 242 spectra of the test set, the total number of incorrect inferences, and the percent accuracy. FP rows record the results of the first-pass search; FB rows record the corresponding results using the feedback search.

As expected, given any set of parameters (A and Dt), the number of inferences made in the feedback search generally exceeds that of the first-pass search. Thus, visiting the additional vertices in the feedback search is advantageous, and, except at large safety wall widths and high distance thresholds, there is little or no loss in prediction accuracy. With A less than or equal to 6 and Dt less than or equal to 4000, accuracy is 100%. At larger values of A or Dt, the decrease in accuracy in the feedback search relative to the first-pass search need not be great; e.g., at A = 50 and Dt = 4000, the accuracies are 99 and 98%, respectively, and at A = 6 and Dt = 6000, 98 and 93%, respectively. It can be seen from the data in Table I that in both the first-pass and the feedback searches, a decrease in either safety wall width or distance threshold leads to a decrease in the number of inferences made but an increase in prediction accuracy.

As safety wall width decreases, the frequency of having d3' (the distance between V1 and V2) as the minimum distance (of d1, d2, and d3') at the visited vertex increases. Thus, the likelihood of terminating a query search before encountering a structurally assigned vertex increases, thereby decreasing the frequency of invalid inferences caused by “forcing” the query to one or the other vertex. It is also true that a valid inference may be missed, but that is the price of high accuracy. The distance threshold parameter acts similarly. Inferences made at assigned vertices with large values of Dt are less likely to be valid; therefore, excluding such assignments increases accuracy. (Again, it is possible that valid inferences will also be excluded.)

In applications where a lower level of accuracy may be acceptable, the data in Table I suggest optimal results with values of A ranging from 3 to 6 and Dt from 5000 to 6000, for either the first-pass or the feedback search. Optimal values of A and Dt can be expected to vary from tree to tree and must be determined empirically on the basis of results with a test set of spectra. However, if the same representation of spectra as in this work (100 complex Fourier coefficients) is used in the generation of a hierarchical tree from another data set, it is likely that the optimal values of A and Dt will closely correspond to those observed here for either the first-pass or the feedback search.

The consequence of varying the distance difference threshold Δt (the difference between d1 and d2 expressed as a percentage; see Figure 1), which determines whether a rejected vertex (first pass) is to be bypassed, was also examined. In one study, a comparison was made at Δt = 40 and Δt = 100. Except at large values of both A and Dt, where slight improvement in accuracy is observed by terminating searches at rejected vertices where Δ is large, the results are nearly identical. In the interest of search efficiency, a value of Δt = 40 is recommended.

Next, we examined the extent to which the feedback search improves the retrieval of the best possible (i.e., most similar) reference entry to a query not contained in the data base. (It should be noted that the reference spectrum identical with a query is always retrieved in the first pass (7, 9-13).) Clearly, in the feedback search, the number of reference entries returned can be expected to be greater than one, to which the first-pass search is restricted. The sequential search of a data base will, of course, always produce the best possible reference spectrum.
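The sequential search that serves as the baseline here can be sketched as a brute-force comparison of the query with every reference. The function name, the dictionary layout of the library, and the Euclidean metric are assumptions for illustration; the default of 10 results mirrors the rank-ordered list of 10 best matches used in the evaluation.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def sequential_search(
    query: Sequence[float],
    library: Dict[int, Sequence[float]],
    k: int = 10,
) -> List[Tuple[int, float]]:
    """Compare the query with every reference; return the k nearest as (id, distance)."""
    scored = ((ident, math.dist(query, vector)) for ident, vector in library.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Hypothetical four-entry library of two-coefficient "spectra".
library = {4: (-1.0, 0.0), 5: (1.0, 0.0), 6: (3.0, 0.0), 7: (9.0, 0.0)}
print([ident for ident, _ in sequential_search((2.6, 0.0), library, k=2)])
```

Every query costs as many distance calculations as there are references, which is the expense the hierarchical tree is meant to avoid.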
However, for large data bases, the sequential search is no longer feasible in terms of required time, especially if full-curve spectra are to be examined. In such cases, which are the rule rather than the exception in today's laboratories, methods such as hierarchical ordering of data are essential. However, adequate testing to determine the effectiveness of the method is a prerequisite of widespread application.

A rank-ordered list of the 10 best matches of each query was obtained by matching each of the 242 spectra (actually the first 100 complex Fourier coefficients) in the test set to all 219 reference spectra in the tree (sequential search). With this information, it was possible to evaluate the performance of the feedback search. Figure 4 summarizes the results of retrieving the most similar reference entry for each of the 242 queries. Performance is reported by distance intervals of 1000. The three numbers within each distance interval column are, from bottom to top, the number of most similar spectra retrieved in the first-pass search, in the feedback search, and in the sequential search, respectively. These data were obtained for values of A = 50 and Dt = 10000, which maximize the number of reference entries retrieved for each query.

Figure 4. The results of the retrieval of the best reference match to each of the 242 infrared spectra in the test set using both the first-pass search and the feedback search. Performance is reported by distance intervals of 1000, running from 2000 to “7000 and more”. From bottom to top, the numbers in the three sections of each column indicate the best matches retrieved in the first-pass search and in the feedback search and the actual number of best matches in the distance interval, respectively. The sums of all lowest, middle, and upper figures are 111, 211, and 242, respectively.

A comparison of the complete results for the two different search strategies reveals a sizable advantage for the feedback search. Whereas the first-pass search retrieved only 111 (46%) of the 242 most similar spectra, the feedback search raised that number to 211 (87%). The failure of the feedback search to match the performance of the sequential search derives most likely from one of two sources: the most similar reference is “not-so-similar” and of an uncommon type in the data base, or a relatively large number of similar reference entries is present.

The first source of failure can be easily rationalized by recalling that the retrieval pathway through a hierarchical tree is determined at each vertex by the distance between the query and each of two descendant nodes. If the “most similar” reference entry is of an uncommon type in the data base, it may occupy a position in the cluster close to its spatial boundary (an outlier). This arrangement can be illustrated in two-dimensional space as shown in Figure 5. Here, the distance d2 between the query X and the cluster V2 containing the best match may not reflect the distance dX,R between the query X and the best match R. Thus, the probability that the search will continue in the direction of the “right” cluster is diminished.

Figure 5. A two-dimensional cluster representation illustrating how the best possible match R to a query spectrum X can be missed if R is an outlier in a cluster. Since d1 < d2, the path toward the vertex (cluster) not containing the best match is followed, even though, in fact, dX,R < d1. The legend distinguishes cluster representations from objects.

In the case of a data base with a large number of similar compounds, which may be concentrated in a large cluster, the path through the tree, especially within this cluster, may require increasingly close calls at vertices. Because of the “similarity” of these subclusters, the 3-DCM may not provide the correct “decision” at each of these vertices. It is important to note that a failure under such circumstances may not be serious since the reference retrieved will still very likely bear a strong resemblance to the query (see structures of the prostaglandins retrieved in Figure 1).

For the chemist looking for “ideas” about the nature of a compound of unknown structure, the feedback search offers an advantage over the first-pass search. If all or most of the reference entries retrieved are of a similar structure type, e.g., prostaglandins, it is likely that the unknown is as well. If the compounds in the group retrieved are of quite different structure types, that may suggest the absence of compounds of similar structure type in the data base. However, the retrieved compounds may still reveal substructural features present in the unknown.
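The reasoning about a retrieved group of compounds can be made concrete with a small tally. This is only a sketch: the class labels are hypothetical annotations of the retrieved references, and any cut-off for what counts as “all or most” is left to the user, since the text gives none.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def class_consensus(classes: Iterable[str]) -> Tuple[Optional[str], float]:
    """Return the most frequent structure class and the fraction of retrieved entries in it."""
    counts = Counter(classes)
    if not counts:
        return None, 0.0
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())


# Hypothetical labels for five references retrieved in one feedback search.
retrieved = ["prostaglandin", "prostaglandin", "prostaglandin", "prostaglandin", "steroid"]
print(class_consensus(retrieved))
```

A high fraction supports assigning the dominant class to the unknown, while a low fraction suggests that no closely related structure type is present in the data base.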

CONCLUSIONS

The present study demonstrates that the feedback search extends the flexibility and the usefulness of the hierarchically organized data base. Relative to a first-pass search (a mode of execution the new program retains), the feedback search substantially increases the probability (to about 90%) of retrieving the most similar reference compound to a query not present in the data base. At the same time, retrieving a group of similar compounds in the feedback search increases the information content of the output for the chemist seeking insight into the nature of a compound of unknown structure. Large values of both search parameters A and Dt lead to maximum group size.

The accuracy of the structural inferences made at assigned vertices in the hierarchical tree of 219 infrared spectra has already been shown to be high in the first-pass search (7). Although the feedback search cannot improve on accuracy, it does increase the number of inferences made, in general with little or no loss in accuracy. The extent of the increase depends on the values selected by the user for the search parameters A and Dt, which in turn should reflect the nature of the application. A broad list of inferred structural features calls for a wide safety wall and a high distance threshold. This can lead to a 25% increase in the number of inferences reported, but the price is a decrease in reliability. The lowest accuracy observed was 77% with a very large safety wall (A = 50) and no distance threshold. For application to automated structure elucidation programs such as CASE (8), a narrow safety wall and low distance threshold are advisable. Table I suggests A = 6 and Dt < 5000 for the hierarchical tree in this study. Adjustment of these same parameters similarly influences the size of the group of “similar” compounds retrieved in the feedback search.

Anal. Chem. 1986, 58,3225-3230

As the compilation of spectral libraries proliferates in chemical laboratories and as they continue to grow in size, more attention is focused on the execution time required for searches. The logarithmic relationship between the size of a hierarchically ordered spectral data base and the average path length through the tree has already been noted. Thus, the logarithmic relationship between tree size and execution time that follows, offers a significant advantage in the processing of queries involving the data bases of 10K-100K entries and more, that can be expected to be commonplace in the near future (13). The adaptability of the feedback search to computers capable of parallel processing is likewise advantageous. As each descendent node decision is made, the first-pass and subpath searches could be executed concurrently. LITERATURE CITED Clerc, J. T.; Szekeiy, G. Trends Anal. Chem. 1983, 2 , 50-53. Zupan, J. Fresenlus‘ 2.Anal. Chem. 1982, 313, 466-472. Varmuza, K. fartern Recognition in Chemistry; Springer Verlag: Berlin, 1982; pp 157-160. Woodruff, H. B.; Smith, G. M. Anal. Chem. 1980, 5 2 , 2321-2327. Hippe, 2.Anai. Chlm. Acta 1983, 150, 11-21.


(6) Gribov, L. A.; Elyashberg, M. E.; Koldashov, V. N.; Pletnjov, I. V. Anal. Chim. Acta 1983, 148, 159-170.
(7) Zupan, J.; Munk, M. E. Anal. Chem. 1985, 57, 1609-1616.
(8) Munk, M. E.; Shelley, C. A.; Woodruff, H. B.; Trulson, M. O. Fresenius' Z. Anal. Chem. 1982, 313, 473-479.
(9) Zupan, J. Anal. Chim. Acta 1980, 122, 337-345.
(10) Delaney, M. F. Anal. Chem. 1981, 53, 2354-2356.
(11) Zupan, J. Clustering of Large Data Sets; Research Studies Press (Wiley): Chichester, 1982.
(12) Zupan, J. Anal. Chim. Acta 1982, 139, 143-153.
(13) Zupan, J.; Novic, M. Computer Supported Spectroscopic Data Bases; Zupan, J., Ed.; Ellis Horwood International Publishing Co.: Chichester, in press.

RECEIVED for review February 3, 1986. Resubmitted July 1, 1986. Accepted July 1, 1986. Presented in part at the VII International Conference on Computers in Chemical Research and Education (ICCCRE), Garmisch-Partenkirchen, June 1985. The authors acknowledge with gratitude the financial support of the US-Yugoslav Board for Scientific Research (Project Number 479), the Research Community of Slovenia, the National Institute of General Medical Sciences (USA, NIH Grant GM21703), and The Upjohn Company.

Characterization of a Diet Reference Material for 17 Elements

Nancy J. Miller-Ihli* and Wayne R. Wolf

Nutrient Composition Laboratory, U.S. Department of Agriculture, Beltsville, Maryland 20705

A freeze-dried diet reference material was prepared from commonly consumed everyday foods. Concentrations of major (Mg, Ca, Na, K, and P), minor (Mn, Zn, Fe, Cu, and Al), and trace elements (Cr, Ni, Co, Mo, As, Se, and Cd) in this diet material were determined by the authors and 10 collaborating laboratories using a total of 10 different analytical techniques. Good agreement between concentration values determined by the different laboratories enabled the authors to compute "recommended values" and uncertainties for these 17 elements. Values for proximates as well as several other nutrients were also reported by collaborators. This diet material is available to the scientific community as Reference Material 8431 through the National Bureau of Standards' Office of Standard Reference Materials.

The usefulness of reference materials (RMs) for accuracy transfer, method validation, technology transfer, and quality control monitoring has been well established (1-3). Several agencies routinely produce RMs, including the National Bureau of Standards (NBS), the Community Bureau of Reference (BCR), the International Atomic Energy Agency (IAEA), Agriculture Canada, the National Research Council of Canada, and the National Institute for Environmental Studies (NIES). In 1981 the Nutrient Composition Laboratory (NCL) of the U.S. Department of Agriculture developed a mixed diet RM because no mixed diet control material was commercially available from any of the RM manufacturers at that time, and previous research had shown that existing biological reference materials were not very suitable for the analysis of foods and diets (4). The NCL, which is involved in developing methods for the determination of a wide variety of nutrients in individual foods and diets, recognized that a mixed diet RM was essential for this work. A mixed diet material made from commonly consumed everyday foods was selected because a good RM must provide similar matrix effects and analyte concentrations and contain the same chemical form of the analyte as the real-world samples for which it will serve as a control.

A mixed diet RM was prepared (4) by using approximately 75 kg of mixed diet material containing commonly consumed foods such as fruit, cereal, bread, noodles, vegetables, salad, fish, and poultry. Extra sources of fat were excluded, resulting in a diet that was ~9% fat by weight (20% fat calories). The prepared foods were blended, freeze-dried, and reblended. Care was taken to avoid possible trace element contamination since this material was being developed as an inorganic RM. The final powdered material provided 50% recovery for a 30/60 mesh cut sieved through polyethylene sieves and was packaged into 280 30-g units. The final diet material was then characterized with regard to homogeneity for eight elements (Ca, Cu, Fe, K, Mg, Mn, Na, and Zn) to assure that it was sufficiently homogeneous to serve as an RM. Aliquots of this diet were then distributed to a range of experts in trace element analysis for the final overall characterization of the material for 17 elements. The results from the characterization are presented here along with the final recommended values for these 17 elements. This diet material (previously known as TDD-1D) has been accepted as a Reference Material by the National Bureau of Standards and is available to the scientific community as Reference Material 8431.

EXPERIMENTAL SECTION Diet RM Preparation. A typical diet menu was selected from a human study conducted at the Beltsville Human Nutrition Research Center (Table I). The diet menu included commonly consumed foods for the three meals consumed in a day. A detailed

This article not subject to U.S. Copyright. Published 1986 by the American Chemical Society