Anal. Chem. 1997, 69, 1392-1397
Preprocessing of HPLC Trace Impurity Patterns by Wavelet Packets for Pharmaceutical Fingerprinting Using Artificial Neural Networks

Elizabeth R. Collantes, Radu Duta, and William J. Welsh*
Department of Chemistry, University of Missouri–St. Louis, St. Louis, Missouri 63121

Walter L. Zielinski and James Brower
Division of Drug Analysis, U.S. Food and Drug Administration, St. Louis, Missouri 63101
The immediate objective of this research program is to evaluate several computer-based classifiers as potential tools for pharmaceutical fingerprinting based on analysis of HPLC trace organic impurity patterns. In the present study, wavelet packets (WPs) are investigated for use as a preprocessor of the chromatographic data taken from commercial samples of L-tryptophan (LT) to extract input data appropriate for classifying the samples according to manufacturer using artificial neural networks (ANNs) and the standard classifiers KNN and SIMCA. Using the Haar function, WP decompositions for levels L = 0-10 were generated for the trace impurity patterns of 253 chromatograms corresponding to LT samples that had been produced by six commercial manufacturers. Input sets of N = 20, 30, 40, and 50 inputs were constructed, each one consisting of the first N/2 WP coefficients and corresponding positions from the overall best level (L = 2). The number of hidden nodes in the ANNs was also varied to optimize performance. Optimal ANN performance based on percent correct classifications of test set data was achieved by ANN-30-30-6 (97%) and ANN-20-10-6 (94%), where the integers refer to the numbers of input, hidden, and output nodes, respectively. This performance equals or exceeds that obtained previously (Welsh, W. J.; et al. Anal. Chem. 1996, 68, 3473) using 46 inputs from a so-called Window preprocessor (93%). KNN performance with 20 inputs (97%) or 30 inputs (90%) from the WP preprocessor also exceeded that obtained from the Window preprocessor (85%), while SIMCA performance with 20 inputs (86%) or 30 inputs (82%) from the WP preprocessor was slightly inferior to that obtained from the Window preprocessor (87%). These results indicate that, at least for the ANN and KNN classifiers considered here, the WP preprocessor can yield superior performance with fewer inputs compared to the Window preprocessor.
Wavelet theory, initially proposed by Grossmann and Morlet,1 involves representing general functions or signals in terms of simpler, fixed building blocks of constant shape but at different scales and positions. A wavelet transform resembles the familiar sines and cosines in a Fourier transform (FT) in some respects.

(1) Grossmann, A.; Morlet, J. SIAM J. Math. Anal. 1984, 15, 723-736.
1392 Analytical Chemistry, Vol. 69, No. 7, April 1, 1997
But unlike a Fourier transform, which is localized in frequency but not in time, the wavelet transform has dual localization, both in scale (frequency) and in position (time). Consequently, only a relatively small number of coefficients are required to represent a given portion of a signal. These coefficients encode many of the essential features of the signal, thus enabling signals to be compressed many-fold with minimal loss of information. By virtue of this property, wavelets have been explored for their potential applicability not only in the realm of pure mathematics2 but likewise in fields as diverse as acoustics,3 fluid mechanics,4 and chemical analysis.5 But while developments in wavelets are concentrated in the areas of data compression,6 image processing,7 and time-frequency spectral estimation,8 the application of wavelets to classification has not been as fully explored.9

A primary objective of this research program is to evaluate several computer-based classifiers as potential tools for pharmaceutical fingerprinting based on analysis of HPLC trace organic impurity patterns. A recent study in our laboratory demonstrated the utility of artificial neural networks (ANNs) for this purpose by utilizing appropriately normalized data sets obtained from HPLC analysis of commercial L-tryptophan (LT) samples.10 The ANNs were trained to match a given LT sample with its corresponding manufacturer from among six LT manufacturers (Figure 1). The performance of the ANNs was compared with that of two standard chemometric classifiers, K-nearest neighbors (KNN) and soft independent modeling of class analogy (SIMCA), as well as with a panel of human experts. In that work, we introduced a preprocessing scheme termed the Window method, which converts the 899 data entries extracted from each LT chromatogram into an appropriate input file for the classifiers. Based on the excellent performance of the computer-based classifiers in general and of the ANNs in particular, it was concluded that preprocessing of the chromatographic data is important, if not essential, for minimizing the effects of noise and for extracting those essential features that maximize correct classification.

In light of the importance of preprocessing, the present study was undertaken to investigate the utility of wavelet packets to preprocess the same HPLC data, i.e., to extract the salient features from the LT chromatograms for classification purposes. As in our earlier study using the Window preprocessor,10 the WP-preprocessed data were fed as input into the ANN, KNN, and SIMCA classifiers to match each LT sample with its manufacturer. Our specific objectives were twofold: (1) to determine whether the performance of the ANNs and chemometric models can be maintained or even improved using WP-preprocessed data rather than Window-preprocessed data, and (2) to determine whether the WP transformation of the LT chromatographic data can reduce the input size without compromising the accuracy of the classification process. To ensure a fair comparison of the two preprocessing schemes for the present application, we evaluated the performance of the WP preprocessor using the same ANN, KNN, and SIMCA classifiers and the same LT samples as studied previously employing the Window preprocessor.10

Basically, a WP decomposition of an arbitrary function is an expansion into smooth localized contributions characterized by a position parameter and a scale parameter. The WP transform, or decomposition, as it is called, is based on dilation and translation of a basis function, Ψ, also known as the mother wavelet.11 That is, a WP can be viewed as a "bump" which undergoes (i) dilation, to squeeze or expand it by adjusting a scale parameter analogous to frequency, and (ii) translation, to shift it in time. The time axis is usually taken as the position parameter, although sometimes the wavelength axis is used in spectroscopic applications. This flexible time-frequency resolution enables the WP to capture and characterize locally the most relevant parts of a signal and, hence, to represent a signal adequately with a relatively small number of coefficients. A more rigorous discussion of the theory behind wavelets and their mathematical treatment may be found in a number of papers.12 A wide array of WPs are available, such as the Daubechies and Coifman series, offering varied degrees of compactness of support and regularity.13 The optimal choice for a particular application is, however, still empirical and remains an active area of research. In the present work, encouraging results obtained using the Haar function are presented. Efforts are continuing in this Laboratory to establish the unique combination of WP function and parameters that is best suited for preprocessing in the present application.

EXPERIMENTAL SECTION

The method followed in this study consists of a three-step sequence: (a) data set preparation, (b) preprocessing by WP transformation, and (c) classification (training and testing) by ANN, KNN, and SIMCA.

Data Set Preparation. The data set used in this study was the same series of 253 LT chromatograms employed previously by Welsh et al.10 This set was generated using statistical methods of experimental design (Table 1), so constructed to account for variations in the following four factors: (1) LT manufacturer (designated A-F), (2) LT production lot (lots 1 and 2), (3) HPLC column (columns X-Z), and (4) between-day repeatability (days 1 and 2). For each combination of manufacturer, lot, column, and

Figure 1. Representative chromatograms from each of the six commercial LT manufacturers designated A-F.

(2) Coifman, R.; Weiss, G. Bull. Am. Math. Soc. 1977, 83, 569-645.
(3) Kronland-Martinet, R.; Morlet, J.; Grossmann, A. Int. J. Pattern Recognit. Artif. Intell. 1987, 1, 97-126.
(4) Liandrat, J.; Moret-Bailly, F. Eur. J. Mech., B: Fluids 1990, 9, 1-19.
(5) Bos, M.; Vrielink, J. A. M. Chemom. Intell. Lab. Syst. 1994, 23, 115-122.
(6) Mallat, S. IEEE Trans. Pattern Anal. Mach. Intell. 1989, PAMI-11, 674-693.
(7) Mallat, S. IEEE Trans. Acoust., Speech, Signal Process. 1989, 37, 2091-2110.
(8) Bentley, P. M.; McDonnell, J. T. E. Electron. Commun. Eng. J. 1994, August, 175-186.
(9) Wunsch, P.; Laine, A. F. Pattern Recognit. 1995, 28, 1237-1249.
(10) Welsh, W. J.; Lin, W.; Tersigni, S. H.; Collantes, E. R.; Duta, R.; Carey, M. S.; Zielinski, W. L.; Brower, J.; Spencer, J. A.; Layloff, T. P. Anal. Chem. 1996, 68, 3473-3482.
(11) Tzanakou, E. M.; Uyeda, E.; Ray, R.; Sharma, A.; Ramanujan, R.; Dong, J. Simulation 1995, 64, 15-27.
(12) Daubechies, I. Commun. Pure Appl. Math. 1988, 41, 909-996. Ruskai, M. B., Beylkin, R., Coifman, R., Daubechies, I., Mallat, S., Meyer, Y., Raphael, L., Eds. Wavelets and Their Applications; Jones and Bartlett: Boston, MA, 1992. Mallat, S. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 11, 674-693.
(13) Jawerth, B.; Sweldens, W. SIAM Rev. 1994, 36, 377-412.

© 1997 American Chemical Society
Table 1. Composition of Factors Employed in the Experimental Design10

HPLC column   manufacturers   lots   run days   reps per day   subtotal
X, Vydac 1    6               2      2          5              120
Y, Vydac 2    6               2      2          3              72
Z, Waters     6               2      2          3              72

total number of chromatograms(a): 264

(a) Eleven chromatograms were disqualified, leaving 253 chromatograms for the study.
day, 3-5 replicate chromatograms were obtained. While the column packing and packing particle size of columns X-Z were identical, the vendors were not all the same (i.e., Vydac for X and Y, Waters for Z). The detector response (i.e., peak height h) and retention time (τ) of each HPLC scan were normalized as before10 to units of h′ and τ′, respectively. The so-called fingerprint region, extending from τ′ = 5.505 to 9.995 in increments of 0.005, contained 899 pairs of h′,τ′ data points. For purposes of training and testing of the classifiers, the 253 chromatograms in this study were partitioned into six separate combinations of training sets and test sets in such a way that (a) no chromatogram in the test set would encounter any of its replicates in the training set and (b) each unique combination of LT manufacturer, LT lot, and HPLC column was included in a test set at least once. This systematic reshuffling of the test set and training set data is tantamount to a "leave-n-out" procedure employed for cross-validation. Details on the handling of the LT samples, processing of the HPLC data, and training/testing of the classifiers have been reported previously.10

Wavelet Packet Transformation. WP transforms of the 253 chromatograms were carried out using the Haar function (D02) implemented in Wavelet Packet Laboratory for Windows (WPLW) version 1.0 (Digital Diagnostic Corp., Hamden, CT). The wavelet packet algorithm14 contained in WPLW operates on a data buffer of a length that is a power of 2; consequently, the 899 data points representing the chromatograms were padded with zeros to yield 1024 (i.e., 2^10) data points. WP transformation of the segment length of 1024 data points produced 11 levels (L = 0-10), each of which consisted of 2^L nodes. In turn, each node consisted of k coefficients with k = (segment length)/2^L. Thus, level 0 consists of a single node (2^0 = 1) containing the original signal, i.e., 1024 coefficients.
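This level/node/coefficient bookkeeping can be sketched in a few lines of NumPy. The function name and the 1/√2 normalization below are our own choices for illustration, not necessarily those of the WPLW implementation:

```python
import numpy as np

def haar_wp(signal, level):
    """Haar wavelet-packet decomposition to the given level: each node is
    split into a low-pass child (pairwise sums) and a high-pass child
    (pairwise differences), scaled by 1/sqrt(2) so signal energy is kept."""
    nodes = [np.asarray(signal, dtype=float)]
    for _ in range(level):
        children = []
        for node in nodes:
            children.append((node[0::2] + node[1::2]) / np.sqrt(2.0))  # approximation
            children.append((node[0::2] - node[1::2]) / np.sqrt(2.0))  # detail
        nodes = children
    return nodes  # 2**level nodes, each of length len(signal) / 2**level

# Pad an 899-point fingerprint trace with zeros to 1024 = 2**10, as in the text
trace = np.random.default_rng(0).random(899)   # stand-in for an h' trace
padded = np.pad(trace, (0, 1024 - trace.size))
level2 = haar_wp(padded, 2)                    # level 2: 4 nodes x 256 coefficients
coeffs = np.concatenate(level2)                # 1024 level-2 coefficients in node order
```

Because the filters are orthonormal, the 1024 level-2 coefficients carry exactly the energy of the padded trace, which is why a handful of large-magnitude coefficients can summarize the chromatogram.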
Level 1 consists of two nodes (2^1 = 2) arranged in order of increasing frequency, with each node containing 512 coefficients arranged in the same order as generated. Nodes for subsequent levels are derived in a similar manner. The results of the Haar transformation of a representative chromatogram for levels L = 0-10 are visualized in Figure 2, where the first 25 coefficients from each level were taken to reconstruct the chromatograms. It can be seen that the original signal (top row) is reproduced quite well by the lower levels (L = 1-4) using only these 25 coefficients. However, similarities with the original signal diminish at the higher levels (L > 4), indicating that more coefficients would be required for better reconstruction. Experimentation with the wavelet packet algorithm14 suggested two possible approaches for optimizing the wavelet representation

(14) Coifman, R. R.; Wickerhauser, M. V. Wavelets and Adapted Waveform Analysis; AK Peters: Wellesley, MA, 1993.
Figure 2. Reconstruction of a representative chromatogram using WP coefficients derived from the Haar function.
of the original signal: the "best level" and the "best basis". The best level and best basis are so designated since they provide the optimal compromise between maximum signal representation and minimum number of coefficients. The best level refers to a single level, while the best basis may refer to a single level or to a combination of levels. Inspection of the transformed signals for the Haar function (Figure 2) revealed that the best level for most of the chromatograms is level 2. Therefore, level 2 was taken as the source of inputs to the three classifiers in the present study. To generate the inputs, the first 10, 15, 20, and 25 values of the WP coefficients were selected from level 2 after sorting the coefficients from highest to lowest in magnitude. This procedure yielded four input data sets, each of which was fed separately to the classifiers. However, it was found that the coefficients alone contained insufficient information to discriminate among the six LT manufacturers. To supplement these coefficients, the position corresponding to each coefficient was included such that the four data sets now consisted of 20, 30, 40, and 50 inputs, respectively. The purpose of evaluating a range of inputs was to minimize their number without compromising classification performance. After WP preprocessing, each chromatogram is represented by a string of numbers composed of pairs of positions and coefficients ("energies") listed in order from the highest to lowest magnitude of the coefficient. Typical examples of 20-input strings for chromatograms were 0 253. 11 157. 10 87.4 1 73.6 12 73.5 0 72.2 6 48.6 5 45.9 10 -43.4 0 36.4 for manufacturer D and 10 93.3 11 40.8 9 39.8 9 -24.5 224 12.4 12 10.9 9 -10.6 11 10.3 10 10.3 108 6.4 for manufacturer E. These two examples illustrate that the
Table 2. Summary of Results from Runs 1-6 Using Different Classifiers and Different Numbers of Input Variables Obtained from the Haar Transformation

                            manufacturer
no. of
inputs  classifier  A      B      C      D      E      F      total     % correct(a)
20      ANN         44/44  38/38  38/42  36/41  43/44  40/44  239/253   94
        KNN         44/44  38/38  41/42  36/41  43/44  44/44  246/253   97
        SIMCA       41/44  34/38  37/42  34/41  43/44  29/44  218/253   86
30      ANN         43/44  37/38  42/42  39/41  44/44  41/44  246/253   97
        KNN         42/44  37/38  42/42  35/41  37/44  34/44  227/253   90
        SIMCA       40/44  37/38  36/42  34/41  41/44  19/44  207/253   82
40      ANN         41/44  34/38  41/42  32/41  38/44  38/44  224/253   88
        SIMCA       41/44  33/38  38/42  29/41  34/44  33/44  208/253   82
50      ANN         39/44  35/38  41/42  36/41  34/44  36/44  221/253   87

(a) The "% correct" refers to the ratio of chromatograms correctly classified to the number of chromatograms in the test set.
strings generated by WP preprocessing are quite distinct between manufacturers and that the values of the coefficients can vary both in magnitude and in sign. Both these features contribute substantially to the ability of the classifiers to discriminate among the six manufacturers. While the major focus of this study was to optimize the best level, some attention was directed at the coefficients from the best basis. Unfortunately, the poor performance (∼70% correct classifications) of these initial models formed from the best basis confirmed that the best level was preferable for the present application. Of course, optimization of the best basis by modification of the number of coefficients as well as other parameters (i.e., level, node, and position) would be expected to improve classification accuracy. Classification Models. The WP coefficients and positions produced by level 2 of the Haar function served as inputs to the classifiers. Two types of classifiers were compared: (1) nonlinear ANN models and (2) KNN and SIMCA representing standard chemometric approaches. The basic objective was to evaluate the WP preprocessor using these three classifiers by varying the number of input variables (i.e., 20, 30, 40, and 50) and then to compare the performance of these classification studies with those achieved previously10 using the Window preprocessor. ANN training and testing were carried out using Brainmaker Professional (California Scientific Software, Nevada City, CA). The KNN and SIMCA methods were implemented using the Pirouette software, version 1.1 (Infometrix, Inc., Seattle, WA). Artificial Neural Networks. The ANNs were constructed to contain three layers (input, hidden, output) of nodes with sigmoidal transfer functions and employed a feed-forward back-propagation (FFBP) learning algorithm implementing gradient descent minimization to adjust the weights.15 The output layer consisted of six nodes, one for each LT manufacturer. 
The register of each output node contained a floating point number spanning the interval 0-1; the output node with the highest numerical value within this range was taken as the LT manufacturer predicted by the ANN. A range of input nodes (i.e., 20, 30, 40, and 50) was investigated to search for the optimal architecture in this regard. To optimize ANN performance with respect to the number n of hidden nodes, n was varied between 10 and 20 for ANN-20-n-6 and between 20 and 35 for ANN-30-n-6 (where the three integers specify the numbers of input, hidden, and output nodes, respectively). In all other cases considered, the number of hidden nodes equaled the number of input nodes, i.e., ANN-20-20-6, ANN-30-30-6, ANN-40-40-6, and ANN-50-50-6. Starting from different randomized initial weights, several networks were trained to convergence of the target fitting parameters (set at 95% of correct facts and 0.2 error tolerance) using the training set data. Among these trained networks, the ANN selected for the classification studies was the one that gave the fewest misclassifications on the test set data.

(15) Zupan, J.; Gasteiger, J. Neural Networks for Chemists; VCH Publishers: New York, 1993.

KNN and SIMCA. The KNN method16 attempts to classify a test set sample by polling the class (i.e., LT manufacturer) of the K (K = 1, 2, 3, ...) nearest samples in the training set within an N-dimensional coordinate space, where N is the number of input variables characterizing each sample. The SIMCA method,17 on the other hand, creates a principal components model for each class in a training set to discriminate among classes. Training after normalization of the HPLC data indicated that K = 2 for KNN and three principal components (PC = 3) for SIMCA yielded the best performance with the test sets; thus, these values were chosen for the subsequent classification analysis.

RESULTS AND DISCUSSION

Inputs for the networks were taken exclusively from the various WP coefficients and positions obtained from level 2 of the Haar analyzing function. Considerable efforts were made to find the ANN architecture that was smallest in size yet still provided acceptable performance in the present application. As observed in our previous study,10 the classification performance of the ANNs was influenced by both the number of input neurons and hidden nodes.
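A bare-bones illustration of a three-layer sigmoidal FFBP network of the kind described above, trained by gradient descent on toy data, is sketched below. This is a generic sketch with our own variable names and toy labels, not the Brainmaker configuration or the LT data used in the study:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_ffbp(X, Y, n_hidden, epochs=3000, lr=0.5):
    """Three-layer feed-forward net trained by back-propagation with plain
    gradient descent; a constant-1 column supplies the bias weights."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    W1 = rng.normal(0.0, 0.5, (Xb.shape[1], n_hidden))
    W2 = rng.normal(0.0, 0.5, (n_hidden + 1, Y.shape[1]))
    losses = []
    for _ in range(epochs):
        H = sigmoid(Xb @ W1)                             # hidden activations
        Hb = np.hstack([H, np.ones((H.shape[0], 1))])
        O = sigmoid(Hb @ W2)                             # output activations
        losses.append(float(np.mean((O - Y) ** 2)))
        dO = (O - Y) * O * (1.0 - O)                     # output-layer delta
        dH = (dO @ W2[:-1].T) * H * (1.0 - H)            # hidden-layer delta
        W2 -= lr * Hb.T @ dO
        W1 -= lr * Xb.T @ dH
    return W1, W2, losses

def predict(X, W1, W2):
    """The winning output node is taken as the predicted class."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    Hb = np.hstack([sigmoid(Xb @ W1), np.ones((X.shape[0], 1))])
    return sigmoid(Hb @ W2).argmax(axis=1)

# Toy stand-in: 2 inputs, 2 one-hot output classes (XOR-style labels)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
W1, W2, losses = train_ffbp(X, Y, n_hidden=4)
```

The argmax over output-node activations mirrors the winner-take-all decision rule described in the text, and repeating the training from different random seeds mirrors the paper's practice of keeping the restart with the fewest test-set misclassifications.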
In general, adding more hidden nodes provides greater flexibility in the network encoding process.18 At the same time, adding too many hidden nodes can give rise to noise or a

(16) Kowalski, B. R.; Bender, C. F. J. Am. Chem. Soc. 1972, 94, 5632-5639.
(17) Wold, S. Pattern Recognit. 1976, 8, 127-139.
(18) Villemin, D.; Cherqaoui, D.; Mesbah, A. J. Chem. Inf. Comput. Sci. 1994, 34, 1288-1293.
(19) Long, J. R.; Mayfield, H. T.; Henley, M. V.; Kromann, P. R. Anal. Chem. 1991, 63, 1256-1261.
Figure 3. Performance (% correct classification) of selected ANN, KNN, and SIMCA classifiers using WP- and Window-preprocessed input data as well as unpreprocessed (899 input) data. Results for the human experts are also shown as an external standard.
“grandmothering” effect whereby the network memorizes the training set rather than encodes the generalized solution.19 Initial results indicated that the performance of ANN-20-20-6 and ANN-30-30-6 was superior to that of ANN-40-40-6 and ANN-50-50-6; consequently, only the former two were optimized further in terms of the number of hidden nodes. Optimal performance (97% correct classification over all manufacturers) was achieved with the test sets from a network employing 30 nodes in the input layer and 30 nodes in the hidden layer. In this ANN-30-30-6 architecture, perfect scores (i.e., 100% correct classification) were obtained for manufacturers C and E, while manufacturers A, B, D, and F were misclassified only 1, 1, 2, and 3 times, respectively. Another network architecture, ANN-20-10-6, averaged 94% correct classification over all manufacturers. The performance of several ANN, KNN, and SIMCA classifiers, in which the number of inputs derived from the WP preprocessor is varied, is delineated in Table 2. The results in Table 2 demonstrate that the ANNs require fewer inputs from the WP preprocessor than from the Window preprocessor to achieve at least equal success in classifying the LT samples. ANN performance using the WP preprocessor with either 20 inputs (94%) or 30 inputs (97%) equaled or surpassed that previously obtained10 using the Window preprocessor with 46 inputs (93%) and far exceeded that obtained previously10 (85%) using the full set of 899 unpreprocessed data points. The latter comparison translates to a 12-percentage-point (i.e., 97% vs 85%) advantage in classification accuracy, coupled with a nearly 30-fold reduction in input data, thus indicating that the WP preprocessor can extract the most relevant and discriminating features from the chromatograms and process them into a very efficient form for presentation to the classifiers.
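The KNN polling scheme described in the Experimental Section can be illustrated with a toy example. The implementation below is our own minimal sketch (Euclidean distance, K = 2), not the Pirouette routine used in the study:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=2):
    """Poll the classes of the k nearest training samples (Euclidean
    distance); on a tie, the class seen first (the closer one) wins,
    relying on Counter's stable ordering of equal counts."""
    dists = np.linalg.norm(train_X - x, axis=1)
    nearest = np.argsort(dists)[:k]             # indices, closest first
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy training set: two tight clusters standing in for two manufacturers
train_X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
train_y = ["A", "A", "B", "B"]
label = knn_predict(train_X, train_y, np.array([0.2, 0.1]), k=2)   # -> "A"
```

With the paper's 20- or 30-input WP feature vectors, `train_X` would be the training-set coefficient/position strings and `train_y` the manufacturer codes A-F.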
The prediction results using SIMCA and KNN with the 20-, 30-, and 40-input schemes (Table 2) revealed that the best performance for both KNN (97%) and SIMCA (86%) was achieved
Figure 4. Plot of the first two PC scores of chromatographic data from (a) unpreprocessed (899 input) points; (b) Window-preprocessed 46 data points; and (c) WP-preprocessed 30 data points. Each manufacturer is represented by a symbol as follows: filled square, A; filled triangle, B; open circle, C; open triangle, D; filled circle, E; open square, F.
using the 20-input scheme. Furthermore, KNN performance using the WP preprocessor with either 20 inputs (97%) or 30 inputs (90%) was clearly superior to that obtained previously10 using the Window preprocessor with 46 inputs (85%). SIMCA performances using the WP preprocessor with 20 inputs (86%) and 30 inputs (82%) were respectively equivalent to and slightly inferior to that obtained previously10 using the Window preprocessor with 46 inputs (87%). Figure 3 charts the performance of the ANN, KNN, and SIMCA classifiers when the WP-preprocessed inputs are implemented as compared with the unpreprocessed and Window-preprocessed inputs. The performance obtained from a panel of human experts in a previous study10 is also shown as an external standard for evaluating the computer-based methods. Prior to the ANN and chemometric classification study, a preliminary data exploration was performed using principal components analysis (PCA).20 Three separate data sets were submitted to PCA, where each data set consisted of the 253 LT chromatograms represented in one of the following ways: (1) as the full set of unpreprocessed 899 input values, (2) as the Window-preprocessed 46 input values, and (3) as the WP-preprocessed 30 input values. The major purpose was to compare these three preprocessing strategies by visual analysis of the scores plot derived from the PCA results. The scores plot, in which each LT sample is represented as a point within the space of the first two principal components PC1 and PC2, was inspected for evidence of any natural clustering among the LT samples according to manufacturer. The degree to which a cluster is discernible for each LT manufacturer generally reflects the likelihood of a good classification model. Each cluster should represent the chromatograms obtained for LT samples from a particular manufacturer over the various combinations of lot, HPLC column, and run day.
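The PCA exploration just described (scores on PC1/PC2, plus counting the PCs needed to reach a given cumulative variance) can be reproduced with a generic SVD-based sketch. The data matrix here is random filler standing in for the LT data, and the function names are ours:

```python
import numpy as np

def pca(X):
    """Mean-center X and diagonalize via SVD; returns the scores (sample
    coordinates on the PCs) and the fraction of variance each PC explains."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                       # column j = scores on PC(j+1)
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained

def n_pcs_for(explained, target=0.85):
    """Number of leading PCs needed to capture `target` of total variance."""
    return int(np.searchsorted(np.cumsum(explained), target) + 1)

# Stand-in for a 253-sample x 30-input WP-preprocessed data matrix
X = np.random.default_rng(1).normal(size=(253, 30))
scores, explained = pca(X)
pc12_share = explained[0] + explained[1]     # variance carried by PC1 + PC2
n85 = n_pcs_for(explained, 0.85)
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by manufacturer, gives a scores plot of the kind shown in Figure 4; `pc12_share` corresponds to the 59%/44%/67% figures quoted below, and `n85` to the 10-PC vs 5-PC comparison.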
The scores plots corresponding to the unpreprocessed 899 data inputs, the Window-preprocessed 46 data inputs, and the WP-preprocessed 30 data inputs are presented in Figure 4a-c, respectively. Inspection of the scores plots reveals a noticeable amount of clustering by manufacturer for the full set of unpreprocessed data inputs (Figure 4a) as well as for the Window-preprocessed data inputs (Figure 4b). However, the clustering for the WP-preprocessed data (Figure 4c) is dramatically more apparent and more resolved into distinct domains according to manufacturer. The appearance of separate domains for manufacturers A, C, and D is clearly visible. Separate domains for manufacturers E and F (and, to a lesser extent, B) would likewise be discernible except that the scale of PC1 and PC2 values was chosen to fit all manufacturers on a single scores plot. These results from the PCA scores plots are particularly noteworthy in that they underscore the overall superior performance of the WP preprocessor among the various preprocessing schemes considered in this classification study.

Another basis for comparison of the three preprocessing schemes from the PCA is to consider what proportion of the total (100%) variance among the LT samples is explained by PC1, PC2, PC3, etc. Returning to the scores plots (Figure 4), it can be seen that PC1 and PC2 taken together account for 59% of the total variance using the unpreprocessed 899 data inputs, 44% using the Window-preprocessed 46 data inputs, and 67% using the WP-preprocessed 30 data inputs. (The individual percentages for PC1 and PC2 are shown respectively along the x and y axes.) An alternative approach is to count the number of PCs required to capture a specified amount of the total variance, e.g., 85%. Using this criterion, the unpreprocessed data set and the Window-preprocessed data set each required 10 PCs, while the WP-preprocessed data set required only 5 PCs. These comparisons reveal an added advantage of the WP preprocessor: it requires fewer PCs than the other preprocessing schemes to capture sufficient information to classify the LT samples. Not only does WP preprocessing appear to work better with the classifiers to yield superior performance, it does so more efficiently from the standpoint of both fewer descriptors (i.e., only 20-30 data inputs for WP vs 46 for Window) and fewer PCs (i.e., 5 PCs for WP vs 10 PCs for Window to capture 85% of the variance).

(20) Kowalski, B. R.; Wold, S. Classification, Pattern Recognition and Reduction of Dimensionality; North-Holland: Amsterdam, 1982.

CONCLUDING REMARKS

In this investigation, we have demonstrated that WP preprocessing provides a fast and efficient method for encoding the subject LT chromatographic patterns into a highly reduced set of numerical inputs for the classification models. An important consideration when using wavelets is the proper selection of wavelet scale. The performance of every classifier using the WP-preprocessed data inputs was far superior to that obtained previously10 using the full chromatogram of 899 unpreprocessed data inputs. Likewise, the performance of ANN and KNN using the WP-preprocessed data was equal or superior to that obtained previously10 using the Window-preprocessed data, even though the number of inputs was far less (i.e., 20 or 30 vs 46). Although detailed results are not shown here, an important finding of this investigation is that the proper selection of WP features represents a critical factor in achieving accurate classification.
Another factor to be considered in the design of a WP preprocessor is the selection of the wavelet function. Although only the Haar function was thoroughly investigated in this study, other functional types may provide better results and are currently under investigation in this Laboratory.

ACKNOWLEDGMENT

The authors wish to express their appreciation to Drs. Samuel W. Page of the FDA Center for Food Safety and Nutrition, Washington, DC, and Robert Hill of the Centers for Disease Control (CDC), Atlanta, GA, for providing the samples of L-tryptophan bulk substance used in these studies and to Professor Grant Welland of the Department of Mathematics & Computer Science, University of Missouri–St. Louis, for helpful technical discussions about wavelets. This work was supported in part by the Center for Molecular Electronics of the University of Missouri–St. Louis and by a contract with the FDA Division of Drug Analysis, St. Louis, under the auspices of Dr. Thomas P. Layloff.

Received for review August 29, 1996. Accepted January 20, 1997.

AC9608836
Abstract published in Advance ACS Abstracts, February 15, 1997.