Anal. Chem. 1995, 67, 374-378

Neural Network Assisted Rapid Screening of Large Infrared Spectral Databases

Christoph Klawun and Charles L. Wilkins*

Department of Chemistry, University of California, Riverside, California 92521-0403

A new method of prefiltering applicable to spectral database searches has been developed. It employs a back-propagation neural network to classify 609 matrix isolation FT-IR spectra with respect to the presence or absence of 35 functional groups, serving as sortable bit string keys to the spectral library. These bit strings, or integers, are used to construct a binary search tree for 100% successful fast spectral retrieval. Compared with a sequential library search, this method yields a more than 25-fold increase in search speed and could easily be adapted to handle other types of spectral information. For larger databases, this advantage is expected to increase by a factor of n^0.5 relative to the number of spectra. Additionally, the back-propagation neural network training has been modified with an extended version of the flashcard algorithm for 100% successful training of 2651 matrix isolation FT-IR spectra.

Rapid growth in affordable computing power, memory, and disk storage has been a common phenomenon for quite a number of years and has resulted in convenient access to large spectral databases for the analytical chemist.1,2 However, the number of available stored spectra and their resolution have increased concurrently, reducing the anticipated increase in database search speed to a disappointing level. This is especially true for infrared spectra, where full curve spectral comparisons are frequently employed. Slow spectral retrieval algorithms become even more aggravating when large numbers of infrared spectra are generated by the use of separation techniques linked to infrared spectrometers.3-8 The growing number and variety of these hyphenated techniques make research in intelligent spectral database organization and fast access as important as ever. Therefore, reducing the time required to identify each separated compound of a mixture from its infrared spectrum offers the prospect of making the total analysis of unknown mixtures a less daunting task.

(1) Warr, W. A. In Recent Advances in Chemical Information II; Collier, H., Ed.; The Royal Society of Chemistry: Cambridge, U.K., 1993; pp 235-246.
(2) Warr, W. A. Anal. Chem. 1993, 65, 1045A-1050A.
(3) Wilkins, C. L. Anal. Chem. 1987, 59, 571A-581A.
(4) Ragunathan, N.; Krock, K. A.; Wilkins, C. L. Anal. Chem. 1993, 65, 1012-1016.
(5) Krock, K. A.; Ragunathan, N.; Wilkins, C. L. Anal. Chem. 1994, 66, 425-430.
(6) Krock, K. A.; Ragunathan, N.; Klawun, C.; Sasaki, T.; Wilkins, C. L. Analyst 1994, 119, 483-489.
(7) Bourne, S.; Haefner, A. M.; Norton, K. L.; Griffiths, P. R. Anal. Chem. 1990, 62, 2448-2452.
(8) Reedy, G. T.; Ettinger, D. G.; Schneider, J. F.; Bourne, S. Anal. Chem. 1985, 57, 1602-1609.


A number of researchers have worked on different methods to enhance spectral search speeds. Reducing wavenumber resolution or intensity data9,10 may eliminate too much information, while library compression based on principal component analysis11-15 has not yielded more than an 8-fold data reduction. Peak list searches16-19 allow for rapid database screening and are a standard procedure in mass spectrometry. However, infrared spectra contain broader bands, and their selection depends upon the peak recognition algorithm. Consequently, important peak shape information may be lost. More unconventional approaches include the use of data clustering,20 namely hierarchical clustering,21-27 to generate a search tree for fast retrieval, which also provides structural information for spectra previously not encountered. Sorted principal component analysis scores12 have been employed for unambiguous infrared spectrum retrieval, and neural networks have been used as a compact storage medium for IR spectra28 and protein sequences.29

This paper presents a new approach to rapid spectral database access. For linear database searches, the retrieval time grows proportionally with n, the number of database spectra. In contrast, a binary search through stored information has a time dependence of log2 n (best case) and is known to be the most efficient search algorithm in many applications. By their nature, spectral data do not lend themselves well to sorting, a prerequisite for binary searches. On the other hand, every spectrum in the database can be classified with respect to the presence ("1") or absence ("0") of specified functional groups in the compounds generating the spectra. A list of these can be represented as a bit string: a simple sortable integer. In order to produce such an integer key,

(9) Owens, P. M.; Isenhour, T. L. Anal. Chem. 1983, 55, 1548-1553.
(10) Jia, X.; Richards, J. A. Remote Sens. Environ. 1993, 43, 47-53.
(11) Hangac, G.; Wieboldt, R. C.; Lam, R. B.; Isenhour, T. L. Appl. Spectrosc. 1982, 36, 40-47.
(12) Wang, C. P.; Isenhour, T. L. Appl. Spectrosc. 1987, 41, 185-195.
(13) Harrington, P. B.; Isenhour, T. L. Appl. Spectrosc. 1987, 41, 449-453.
(14) Harrington, P. B.; Isenhour, T. L. Anal. Chem. 1988, 60, 2687-2692.
(15) Lo, S.; Brown, C. W. Appl. Spectrosc. 1991, 45, 1628-1632.
(16) Rasmussen, G. T.; Isenhour, T. L. Appl. Spectrosc. 1979, 33, 371-376.
(17) Lowry, S. R.; Huppler, D. A.; Anderson, C. R. J. Chem. Inf. Comput. Sci. 1985, 25, 235-241.
(18) Somberg, H. In Computer-Supported Spectroscopic Databases; Zupan, J., Ed.; Ellis Horwood: Chichester, U.K., 1986; pp 64-91.
(19) Pyo, D. Vib. Spectrosc. 1993, 5, 263-273.
(20) Scsibrany, H.; Varmuza, K. Fresenius J. Anal. Chem. 1992, 344, 220-222.
(21) Zupan, J. Anal. Chim. Acta 1982, 139, 143-153.
(22) Zupan, J.; Munk, M. E. Vestn. Slov. Kem. Drus. 1983, 30, 61-76.
(23) Novič, M.; Zupan, J. Anal. Chim. Acta 1985, 177, 23-33.
(24) Zupan, J.; Munk, M. E. Anal. Chem. 1985, 57, 1609-1616.
(25) Zupan, J.; Munk, M. E. Anal. Chem. 1986, 58, 3219-3225.
(26) Zupan, J. Mikrochim. Acta (Wien) 1986, 2, 245-260.
(27) Zupan, J.; Novič, M. In Computer-Supported Spectroscopic Databases; Zupan, J., Ed.; Ellis Horwood: Chichester, U.K., 1986; pp 42-63.
(28) Tanabe, K.; Tamura, T.; Uesaka, H. Appl. Spectrosc. 1992, 46, 807-810.
(29) Wu, C. H. Comput. Chem. 1993, 17, 219-227.

Table 1. Training Set Composition for 1067, 1584, and 2651 MI-FTIR Spectra^a

functional group             set 1   set 2   set 3
O-H                           318     354     672
N-H                           166     207     373
C-N                           281     329     610
X≡Y, X=Y=Z                     35      32      67
C=O                           586     789    1375
C-O                           681     937    1618
C=C                           244     300     544
aromatic C-C                  446     575    1021
halogen                       190     215     405
N-O                            53      49     102
CH3                           806    1266    2072
CH2                           800    1247    2047
primary alcohol                84     121     205
secondary alcohol              75      68     143
tertiary alcohol               21      19      40
phenol                         69      67     136
COOH                           80     104     184
primary amine                  85     125     210
secondary amine                41      40      81
tertiary amine                 56      55     111
amide                          76      76     152
ester, lactone                313     414     727
aldehyde                       68      62     130
ketone                         91     165     256
ether                         149     248     397
RHC=CH2                        69      75     144
RHC=CHR                        85     121     206
R2C=CH2                        30      30      60
R2C=CHR                        81      94     175
aromatic C6 ring              415     540     955
pyridine ring                  31      29      60
C-Cl (no COCl)                 83     108     191
Ar-COOR                        40      28      68
R-COOCH3                       62      59     121
R-COOC2H5                      71      71     142
CH3-COOR                       43      44      87
monosubstituted benzene       122     189     311
ortho-substituted benzene      79      78     157
meta-substituted benzene       31      31      62
para-substituted benzene       83     100     183

^a The first two sets do not overlap, and the third set is a combination of the first two sets. Total spectra: set 1, 1067; set 2, 1584; set 3, 2651.

back-propagation neural networks30,31 have been employed to predict functional groups from infrared spectra.32-36 Nevertheless, until the recent development of the flashcard algorithm,37 it had not been possible to obtain 100% successful training from a larger data set, a necessity for the retrieval of all database spectra. The flashcard algorithm overcomes the otherwise ubiquitous problem of local minima in back-propagation training by overrepresenting "difficult" examples.

EXPERIMENTAL SECTION

With this new flashcard algorithm, a back-propagation neural network was trained to recognize 35 different functional groups from a set of 609 matrix isolation FT-IR spectra with 752 data points each (see ref 37 for spectral selection, functional groups, and neural network details). The training parameters were one hidden layer with 25 nodes, learning rate η = 0.03, momentum α = 0.90, sigmoidal discrimination β = 1.0, and sigmoidal bias term θ = 0.0. The sigmoidal function Sig(x) in eq 1 illustrates the use of the terms β and θ:

Sig(x) = 1/(1 + exp[-β(x - θ)])    (1)
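For concreteness, eq 1 can be written as a one-line C function (C being one of the implementation languages mentioned below). This is an illustrative sketch only; with θ = 0.0, as used for this network, the bias term drops out:

```c
#include <math.h>

/* Sigmoidal activation with discrimination beta and bias theta (eq 1).
   With theta = 0.0, as used here, this reduces to the standard
   back-propagation transfer function 1/(1 + exp(-beta*x)). */
double sig(double x, double beta, double theta)
{
    return 1.0 / (1.0 + exp(-beta * (x - theta)));
}
```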

After 408 epochs (including additional efforts from overrepresenting the "difficult" spectra), the training was complete, with all output errors less than or equal to 0.30 and a total epoch error E of 0.0164. Outputs less than 0.5 are interpreted to mean that a specified functional group is absent; otherwise, the functional group is interpreted to be present. With these training parameters, the danger of overtraining remains fairly low: several training sessions starting with different randomized network weights led to essentially the same prediction rates for 1328 test set spectra.37 In addition, the 752-25-35 network architecture contains only 19 675 adjustable weights, compared to 457 968 different data points from the training set spectra, further diminishing the likelihood of chance correlations.

The network was programmed in Borland Pascal 7.0 and standard C. Compiled from the C code, it required 6.73 h of training time on an Intel 486DX-33 based computer with 8 MB RAM and a 256 kB RAM cache (20 ns), running under Linux 1.0.9, a Unix freeware package. On a Cray C90, the necessary training time dropped to 6.63 min real time, using one CPU. Comparing the two computing facilities, a single spectral example was processed in 98 ms with an Intel 486DX-33 and in 1.6 ms with a Cray C90.

For an estimate of how well the neural network can be trained with a larger number of spectra, three new training sets were compiled containing 1067, 1584, and 2651 matrix isolation FT-IR spectra, respectively. Many erroneous and suspicious spectra were excluded from a larger pool of spectra to form these training sets. The composition of the functional group set was changed as well, and their number was increased to 40 (Table 1). All three training sessions were conducted using a Cray C90 (single CPU), adding 1% random noise during the training, until all output errors for every spectrum dropped to ≤0.20. Table 2 shows the training parameters and times required for satisfactory training. For these training runs, the flashcard algorithm37 was modified to accommodate the increasing disparity between the present and absent functional group counts with larger training sets. This modification will be described in the next section.

(30) Rumelhart, D. E.; McClelland, J. L.; PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations; MIT Press: Cambridge, MA, 1986; pp 318-362.
(31) Rumelhart, D. E.; Hinton, G. E.; Williams, R. J. Nature 1986, 323, 533-536.
(32) Robb, E. W.; Munk, M. E. Mikrochim. Acta (Wien) 1990, 1, 131-155.
(33) Munk, M. E.; Madison, M. S.; Robb, E. W. Mikrochim. Acta (Wien) 1991, 2, 505-514.
(34) Fessenden, R. J.; Gyorgyi, L. J. Chem. Soc., Perkin Trans. 2 1991, 1755-1762.
(35) Smits, J. R. M.; Schoenmakers, P.; Stehmann, A.; Sijstermans, F.; Kateman, G. Chemom. Intell. Lab. Syst. 1993, 18, 27-39.
(36) Van Est, Q. C.; Schoenmakers, P. J.; Smits, J. R. M.; Nijssen, W. P. M. Vib. Spectrosc. 1993, 4, 263-272.
(37) Klawun, C.; Wilkins, C. L. J. Chem. Inf. Comput. Sci. 1994, 34, 984-993.
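As a minimal illustration of the 1% noise injection used in these training runs, the following C fragment perturbs a spectrum before each presentation. The paper does not specify the noise distribution or its normalization, so uniform noise on unit-normalized intensities is assumed here, and the function name is illustrative:

```c
#include <stdlib.h>

/* Add uniform random noise of amplitude +/- level to each point of a
   spectrum before a training presentation. Assumes intensities are
   normalized to full scale 1.0; e.g., add_noise(spec, 752, 0.01)
   corresponds to 1% added random noise. */
void add_noise(double *spectrum, int npoints, double level)
{
    for (int i = 0; i < npoints; i++) {
        double r = 2.0 * rand() / (double)RAND_MAX - 1.0;  /* in [-1, 1] */
        spectrum[i] += level * r;
    }
}
```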

RESULTS AND DISCUSSION

Binary Library Search. In order to facilitate binary searches with bit strings or integers (35 significant bits in three 16-bit integers), a binary search tree was constructed with classification

Table 2. Parameters for Training Sessions of 1067, 1584, and 2651 Spectra^a

spectra   hidden nodes   epochs   final epoch error, E   training time (min:s)^b
1067      30             419.8    0.01596                11:34
1584      35             310.9    0.01423                14:37
2651      45             447.9    0.01155                38:48

^a α = 0.9, η = 0.02, β = 1.2, θ = 0.0 for all networks. ^b Cray C90 (single CPU), real time.

Table 3. Binary Tree Pattern Distribution for 609 Spectra^a

l    m        l    m        l    m
1    370      5    3        9    0
2    60       6    2        10   2
3    14       7    1        11   1
4    1        8    1

^a l, number of spectra per unique pattern; m, number of nodes with this configuration.

Figure 1. Neural network (top left) generates hypothetical bit pattern 01010110 (binary) = 86 (decimal); filled circles represent present functionalities ("1"), an empty circle means an absent functionality ("0"). Searching of a binary tree with the bit pattern: large numbers inside the tree node ovals are reference bit patterns, small numbers below the nodes are spectral index numbers (see text for more detailed explanation).

bit strings from all 609 spectra. Because not all bit strings contained unique patterns, each distinct bit string was inserted into the search tree only once, and the search tree was constructed by assembling an ordered set of patterns, with accompanying pointers to indicate the presence of two or more spectra producing the same pattern. Thus, the original 609 spectral patterns can be reduced to 455. The logic of the binary search is best understood by reference to a hypothetical example employing 8-bit patterns rather than the 35-bit patterns actually used (Figure 1). In this example, the experimental spectrum presented to the neural network generates the pattern corresponding to the decimal number 86. That is then compared with the middle pattern in the ordered set, which has a value of 115 produced by presenting either spectrum 2289 or spectrum 1924 to the neural network. Because the value 86 is less than 115, the left-hand branch of the search tree is taken. The other branch is shown for completeness. In the second step of the search, the value of 46 is encountered at the table entry halfway between the middle value and the lowest value in the ordered table. This pattern represents the response of the network to spectrum 4289. The next comparison yields the desired match with 86, and it is found that three spectra, numbers 389, 1024, and 4238, cause the network to generate that response. Thus, complete spectral comparisons of only those three spectra are required to determine whether one of them corresponds to the unknown spectrum being sought. Table 3 shows how many bit strings are unique and how many occur more than once. If upon retrieval only a single spectral index number is found with a matching bit string, the "unknown" is identified unequivocally. If more than one spectrum results, the corresponding library spectra can be compared to the unknown using a Euclidean distance metric to determine the identity of the unknown.
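The retrieval logic just described can be sketched in C as follows. This is a schematic reconstruction rather than the authors' implementation: for simplicity, the 35-bit pattern is packed into a single 64-bit integer instead of the three 16-bit integers actually used, an ordered pattern table stands in for the explicit tree, and all type and function names are illustrative:

```c
#include <stdint.h>
#include <stddef.h>

#define NGROUPS 35

/* One entry of the ordered pattern table: a 35-bit classification key
   plus the indices of all library spectra producing this key. */
typedef struct {
    uint64_t key;        /* bit i set if functional group i is present */
    const int *spectra;  /* spectral index numbers sharing this pattern */
    int nspectra;
} PatternEntry;

/* Pack 35 network outputs into a sortable integer key: an output >= 0.5
   means the functional group is present ("1"), otherwise absent ("0"). */
uint64_t pack_key(const double outputs[NGROUPS])
{
    uint64_t key = 0;
    for (int i = 0; i < NGROUPS; i++)
        if (outputs[i] >= 0.5)
            key |= (uint64_t)1 << i;
    return key;
}

/* Binary search of the table, sorted ascending by key. Returns the
   matching entry, or NULL if the unknown's pattern is not in the library. */
const PatternEntry *find_pattern(const PatternEntry *table, int n, uint64_t key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (key == table[mid].key)
            return &table[mid];
        else if (key < table[mid].key)
            hi = mid - 1;   /* take the left-hand branch */
        else
            lo = mid + 1;   /* take the right-hand branch */
    }
    return NULL;
}
```

Only when the returned entry lists more than one spectral index are full spectral comparisons required, exactly as in the pattern-86 example above.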

Figure 2. Distribution of 2.1315 × 10^6 output node errors (609 spectra × 100 runs × 35 output nodes) for 100 retrieval runs with 1.0% added random noise; abscissa, error from 0.00 to 0.50. Very few errors lie between 0.30 and 0.50, and no errors > 0.50 are observed, showing a robust neural network.

The search tree retrieval efficiency was tested by generating 50 different binary trees with a random sequence of bit pattern insertions. An average tree height of 18.5 nodes (minimum, 15; maximum, 23) and an average search depth of 9.9 nodes (minimum, 8.9; maximum, 11.4) were found. For a perfectly balanced binary tree with 455 nodes, a tree height of 9 with an average search depth of 7.9 would be optimal. Using one of these search trees, an intralibrary search with all 609 training spectra plus 1.0% added random noise was carried out 100 times, testing the robustness of this retrieval algorithm. Each of the 2.1315 × 10^6 neural network output node values was less than 0.5, most of them below 0.01 (Figure 2). All spectra were retrieved correctly, either as a unique pattern or as the first hit if more than one spectrum per bit pattern made spectral comparison necessary. To achieve more accurate timing, the entire spectral collection was held in the computer memory, and it took on average 61.0 ms to retrieve a single spectrum (Intel 486DX-33 processor, compiled from Pascal code for DOS protected mode). In contrast, a sequential spectral search under the same conditions with a Euclidean distance metric required 1557 ms to produce the closest hit for an intralibrary search, using 10 hit list entries. Thus, the neural network prefilter yielded a more than 25-fold increase in library search speed. Table 4 shows timing details for the different subroutines.

A better way to measure the algorithm robustness, rather than adding 1% noise, would be to search a sizable number of duplicate spectra38 from a library of many more than 609 spectra, measured under a variety of conditions. Unfortunately, no truly large infrared spectral library was available at the time the algorithm was developed and tested, and the matrix isolation FT-IR library contains few duplicates but many poor quality or erroneous spectra.

(38) Stein, S. E.; Scott, D. R. J. Am. Soc. Mass Spectrom. 1994, 5, 859-866.

Table 4. Average Subroutine Times for Retrieving 1 out of 609 Spectra, Measured on an Intel 486DX-33

subroutine                     av time (ms)
neural net propagation         55.65
binary tree search             1.51
spectral comparison            3.87
total, neural net prefilter    61.0

spectral comparison            1548.0
hit list processing            9.0
total, sequential search       1557.0
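The sequential reference search timed in Table 4 amounts to a full-curve Euclidean distance scan over the entire library. A minimal C sketch of such a scan follows; the function names are illustrative, and the squared distance is used because omitting the square root does not change the ranking of hits:

```c
#include <float.h>

/* Squared Euclidean distance between two full-curve spectra. */
double dist2(const double *a, const double *b, int npoints)
{
    double d = 0.0;
    for (int i = 0; i < npoints; i++) {
        double diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

/* Linear scan of the whole library: O(n) full-curve comparisons, versus
   O(log2 n) key comparisons for the binary-tree prefilter. Returns the
   index of the closest library spectrum. */
int closest_hit(const double *unknown, const double **library,
                int nspectra, int npoints)
{
    int best = -1;
    double bestd = DBL_MAX;
    for (int j = 0; j < nspectra; j++) {
        double d = dist2(unknown, library[j], npoints);
        if (d < bestd) { bestd = d; best = j; }
    }
    return best;
}
```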

Large Training Sets. Neural network training with sets of 1067, 1584, and 2651 spectra gave preliminary results to estimate the efficacy of the proposed retrieval algorithm for a larger number of spectra. Initially, it turned out to be very difficult to train the neural network to convergence within a satisfactory time frame using the flashcard algorithm.37 With training sets of such considerable size, the imbalance between present and absent instances of any functional group becomes even more pronounced and moves further away from the ideal 50/50 distribution. Deep local minima were observed in about half of the training runs, resulting in extremely long training times or no convergence at all. In an attempt to iron out the training set imbalance, correction factors related to functional group abundance were applied to the individual output errors, leading to faster or slower individual learning rates for each functional group. If n is the total number of spectra in the training set and m_i the number of occurrences of the functional group i in the set, then the individual correction factor c_i can be calculated as outlined in eq 2, describing the deviation for a present functional group from the 50% abundance in either direction:

c_i = (n - m_i)/m_i    (2)

For a perfect 50% balance, c_i becomes 1.0. If

the functional group i is present in only 10% of the spectra, its output error is magnified for training purposes by a factor of c_i = 4.5 to approximate a balanced training set. The reverse logic applies to absent functional groups, which for the previous example appear in 90% of all cases. Here, the output error must be diminished by a factor of c_i = 4.5 in order to reduce their impact on the training progress. Unfortunately, this combined approach failed, driving the neural network into early and deep local minima for every training attempt. However, using c_i only to dampen output errors of absent functional groups worked extremely well, and training convergence was achieved for almost all training attempts. Because the threat of deep local minima diminishes with training progress, the correction factors were applied only until the number of large output errors (20-95%) dropped to below the number of spectra in the training set, indicating that the network training was well on its way to convergence. At the same time, the flashcard algorithm37 was activated for spectra with "opposite output errors" (>95%), repeatedly presenting all spectra with these characteristics to the network until their number was reduced by 7, and regular training was resumed. Figure 3 illustrates the training progress with a set of 2651 spectra: after 293 epochs, the flashcard algorithm enters a new phase, marked with the asterisk. If no opposite errors can be detected, the list of spectra associated with large errors (20-95%) is presented to


Figure 3. Training progress of 2651 spectra, plotted as epoch error versus epochs (see Table 2 for training parameters). Opposite errors (>95%) begin to be reduced after 203 epochs, as indicated by the arrow, and the asterisk marks the beginning of the reduction of large errors (20-95%).

the network until their number drops by 5. Training is then resumed with a regular epoch. This scheme is repeated until no more errors greater than 20% can be found for the entire training set. From the effort required for larger and larger training sets, a very coarse estimate suggests that when the number of spectra is increased by a factor of n, the number of hidden nodes, and thus the number of adjustable network weights, must grow only by a factor of approximately n^0.5. A more precise relationship cannot be established at this point. This touches on a fundamental and still unanswered question for neural networks: is it possible to determine a priori the number of weights necessary to achieve satisfactory modeling of a given training set? If the growth ratio of spectra to hidden nodes is as suggested above, the advantage of the proposed spectral retrieval algorithm over a linear database search will grow by n/n^0.5 = n^0.5 (n times more spectra, n^0.5 times more hidden nodes) for bigger databases, because the neural network propagation is by far the largest computational portion of the retrieval algorithm (see Table 4). Assuming that larger training sets converge after approximately the same number of epochs, the computational training effort increases by n × n^0.5 = n^1.5, but this expense occurs only once.
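The error-weighting scheme can be summarized in a short C sketch. This is an illustration of eq 2 and the dampening rule as described above, with hypothetical helper names; the surrounding back-propagation loop, the phase switching, and the flashcard lists are omitted, since the text does not specify those details fully:

```c
/* Abundance correction factor of eq 2 for functional group i, where n is
   the total number of training spectra and m the number of spectra
   containing the group. Equals 1.0 for a perfect 50/50 balance. */
double correction(int n, int m)
{
    return (double)(n - m) / (double)m;
}

/* Scale one output-node error during back-propagation. Following the
   scheme that finally converged, only the errors of ABSENT functional
   groups are scaled (dampened for rare groups, whose absent cases are
   over-represented); errors of present groups are left untouched,
   because magnifying them as well drove training into deep local minima. */
double weighted_error(double raw_error, int group_present, double ci)
{
    return group_present ? raw_error : raw_error / ci;
}
```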

CONCLUSION

For libraries larger than 609 spectra, the advantage of the neural network prefilter is expected to be even better, because the neural network capacity, i.e., the number of weights, only has to be increased by a modest amount, while the time for a sequential search grows linearly with database size. This advantage was estimated to increase by n^0.5 with n times more spectra, while the computational effort for training larger sets of spectra increases approximately by n^1.5. Also, the small bit strings can always be held in the computer memory, while large collections


of infrared spectra must reside on much slower hard disks. In addition, the neural network will generate suggestions regarding the presence or absence of functional groups when it encounters a spectrum not contained in the database. This method can easily be extended to handle different types of databases where data are amenable to neural network training, e.g., NMR and UV/visible spectra or images and sound. The advantage of neural network assisted prefiltering can become even more pronounced when the number of spectral data points used for the neural network input is reduced to an absolute minimum.39 Balancing the search tree and grouping more frequently retrieved patterns at the top of the binary tree will have much less effect on the overall retrieval speed, considering the small contribution of the tree search to the overall identification time (Table 4). Currently, research is underway to adapt a much larger IR database (>10 000 spectra) to this method and to carry out tests with real-world samples. With new extensions of the flashcard algorithm, it was

(39) Meyer, M.; Meyer, K.; Hobert, H. Anal. Chim. Acta 1993, 282, 407-415.


possible to achieve 100% successful training for substantially larger training sets than previously possible, paving the way to implementing the new spectral retrieval algorithm for sizable infrared libraries.

ACKNOWLEDGMENT

This work has been carried out with support from the National Science Foundation, Grant CHE-92-01277. Many computations were done using the Cray C90 at the San Diego Supercomputing Center. Preliminary findings were presented at the PittCon '94 Conference in Chicago, IL, March 1994, paper 645.

Received for review August 31, 1994. Accepted October 27, 1994.

AC940876M

Abstract published in Advance ACS Abstracts, December 1, 1994.