Anal. Chem. 1996, 68, 3473-3482
Pharmaceutical Fingerprinting: Evaluation of Neural Networks and Chemometric Techniques for Distinguishing among Same-Product Manufacturers
William J. Welsh,* Wangkan Lin,† Samuel H. Tersigni,† Elizabeth Collantes, Radu Duta, and Michael S. Carey
Department of Chemistry, University of Missouri-St. Louis, St. Louis, Missouri 63121
Walter L. Zielinski, James Brower, John A. Spencer, and Thomas P. Layloff
Division of Drug Analysis, U.S. Food and Drug Administration, St. Louis, Missouri 63101
The present study was undertaken to evaluate several computer-based classifiers as potential tools for pharmaceutical fingerprinting by utilizing normalized data obtained from HPLC trace organic impurity patterns. To assess the utility of this approach, samples of L-tryptophan (LT) drug substance were analyzed from commercial production lots of six different manufacturers. The performance of several artificial neural network (ANN) architectures was compared with that of two standard chemometric methods, K-nearest neighbors (KNN) and soft independent modeling of class analogy (SIMCA), as well as with a panel of human experts. The architecture of all three computer-based classifiers was varied with respect to the number of input variables. The ANNs were also optimized with respect to the number of nodes per hidden layer and to the number of hidden layers. A novel preprocessing scheme known as the Window method was devised for converting the output of 899 data entries extracted from each chromatogram into an appropriate input file for the classifiers. Analysis of the test set data revealed that an ANN with 46 inputs (i.e., ANN-46) was superior to all other classifiers evaluated, with 93% of the chromatograms correctly classified. Among the classifiers studied in detail, the order of performance was ANN-46 (93%) > SIMCA-46 (87%) > KNN-46 (85%) = ANN-899 (85%) > “human experts” (83%) > SIMCA-899 (78%) ≥ ANN-22 (77%) = KNN-22 (77%) ≥ KNN-899 (76%) > SIMCA-22 (73%). These results confirm that ANNs, particularly when used in conjunction with the Window preprocessing scheme, can provide a fast, accurate, and consistent methodology applicable to pharmaceutical fingerprinting. Particular attention was paid to variations in the HPLC patterns of same-manufacturer samples due to differences in LT production lots, HPLC columns, and even run-days to quantify how these factors might hinder correct classifications.
The results from these classification studies indicate that the chromatograms evidenced variations across LT manufacturers, across the three HPLC columns and, for one manufacturer, across lots. The extent of column-to-column variations is particularly noteworthy in that all three columns had identical specifications with respect to their stationary-phase characteristics and two of the three columns were from the same vendor.
© 1996 American Chemical Society
The discovery of fraudulent practices at a generic pharmaceutical firm in 1989 led the U.S. Food and Drug Administration (FDA) to conduct large-scale investigations into the generic pharmaceutical industry, from which over 1400 samples were received for analysis representing 700 purportedly paired innovator and generic submissions. As a more meaningful alternative to such comprehensive in-depth analyses using compendial methods, it was determined that instances of fraud might be directly detectable from analytical “fingerprints” which could demonstrate sameness or differences between samples.1-3 As a result, we initiated an effort to evaluate methods for establishing normalized analytical database libraries that could be used to reliably compare the sameness of products.1,4 Such methodology and data arrays might be used for monitoring within- and between-batch product consistency, for examining the effects of process changes in the production of pharmaceutical products,4 and for determining whether a product marketed today is the same as that originally approved.1 In the early stages of this endeavor, we concentrated on comparative spectroscopic, thermal analysis, and X-ray powder diffraction data and on compilations of physical attributes of products. Such data provide information on the macroscopic composition of products. It became clear, however, that information on the microscopic chemical composition of products provided by chromatographic trace organic impurity patterns represented an important component of the product fingerprint.
It is well established that comparisons of such patterns can often be used to reliably judge whether samples are the same or different and to determine precursor and degradation profiles in the bulk drug.5,6 In addition, comparisons of HPLC trace impurity fingerprints have been used to establish correlations between changes in the synthesis of given lots of a bulk pharmaceutical and occurrences of human pharmacogenic disease [isoxicam with Lyell’s syndrome;7 L-tryptophan with eosinophilia-myalgia syndrome8-12 (EMS)]. While HPLC is extremely useful in testing bulk pharmaceutical products for impurities, it is not a panacea for these detection efforts. One drawback is that HPLC trace impurity data are subject to concerns about repeatability and imprecision inasmuch as even HPLC columns that are nominally identical can exhibit variations in peak height and retention time for a given sample run under the same conditions (vide infra). Because of these concerns, appropriate consideration must be given to properly quantify and normalize the data.13
A major challenge associated with implementing pharmaceutical fingerprinting as a regulatory tool is the sheer volume of data that must be processed, analyzed, and archived. Systematic determination of “sameness” and “differences” among different pharmaceutical manufacturers of the same product or even among different lots from the same manufacturer would be an impossible task without recourse to computer-based classification methods. By virtue of their speed, objectivity, and reliability, such computer-based methods offer capabilities for large-scale screening of samples. At the same time, it is recognized that the basic concept of pharmaceutical fingerprinting by computer-based classification methods must be thoroughly demonstrated and tested before rendering any decision toward implementation as a regulatory tool. The present study was therefore undertaken as an initial step toward assessing the viability of computer-based pharmaceutical fingerprinting. A fundamental requirement for pharmaceutical fingerprinting is the ability of the classifier to match a given sample with its corresponding manufacturer from among other same-product manufacturers.
† Current address: Ethyl Corp., 500 Spring St., Richmond, VA 23217.
(1) Layloff, T. P. Pharm. Technol. 1991, 15, 146-148.
(2) Kirchhoefer, R. D. J. AOAC Int. 1992, 75, 577-580.
(3) Haddad, W. Pink Sheet 1989, (Aug 14), 6.
(4) Anon. Gold Sheet 1994, 28 (7), 1-10.
(5) Inman, E. L.; Tenbarge, H. I. J. Chromatogr. Sci. 1988, 26, 89-94.
(6) For example: Crofford, L. J.; et al. J. Clin. Invest. 1990, 86, 1757-1763.
Analytical Chemistry, Vol. 68, No. 19, October 1, 1996 3473
The following investigation was conceived to explore this aspect of pharmaceutical fingerprinting. Specifically, the performance of three computer-based classifiers was compared to each other and to human experts in terms of their ability to correctly classify commercial product samples according to manufacturer based solely on analysis of their HPLC trace impurity patterns. Choosing an example of relevance to contemporary pharmaceuticals, the analysis was conducted on samples of L-tryptophan (LT) drug substance from production lots of six different commercial LT manufacturers. Of particular significance, variations in the HPLC patterns of same-manufacturer samples due to differences in LT production lots, HPLC columns, and even run-days were included in the experimental design to quantify what role these factors might play in obfuscating correct classifications.
The three computer-based classifiers selected for this initial study were artificial neural networks (ANNs) together with two standard chemometric methods, K-nearest neighbors (KNN) and soft independent modeling of class analogy (SIMCA). ANNs are rapidly emerging in the field of analytical chemistry as a powerful tool for pattern searching, mapping, modeling/regression, and classifying/fingerprinting.14 Relevant to the present application, ANNs have been used for the detection of the C-13 NMR chemical shifts of dibenzofurans,15 for pattern recognition of camphoraceous and fruity odors of aliphatic alcohols,16 and for UV spectral library searching in photodiode array HPLC analysis.17
The choice of LT in the present investigation was based primarily on the availability of a wide selection of samples from all six commercial manufacturers. As a metabolic precursor of serotonin, LT had been marketed as an over-the-counter dietary supplement to treat insomnia, depression, and obesity. In 1989, the Centers for Disease Control (CDC) reported18 an outbreak of over 1600 cases of EMS (including 38 fatalities) associated with the ingestion of LT tablets which subsequently led to the banning and seizure of all LT products. While the causal agent for the EMS outbreak remains uncertain, evidence implicates one or more impurities in the bulk product rather than LT itself.10-12 By virtue of its investigation of this outbreak, the FDA acquired a large, comprehensive archive of LT products from all six commercial manufacturers. This archive, which included samples of LT drug substance for various production lots from each manufacturer, served as an ideal source of samples for the present study.
(7) Wall Street J. 1987, (Feb 19), 51.
(8) Anon. Morbidity Mortality Weekly Rep. 1990, 39, 589-591. J. Am. Med. Assoc. 1990, 62, 1656, 1698.
(9) Hertzman, P. A.; Blevins, W. L.; Mayer, J.; Greenfield, B.; Ting, M.; Gleich, G. J. N. Engl. J. Med. 1990, 322, 869-873.
(10) Slutsker, L.; Hoesly, F. C.; Miller, L.; Williams, L. P.; Watson, J. C.; Fleming, D. W. J. Am. Med. Assoc. 1990, 264, 213-217.
(11) Smith, M. J.; Mazzola, E. P.; Farrell, T. J.; Sphon, J. A.; Page, S. W.; Ashley, D.; Sirimanne, S. R.; Hill, R. H., Jr.; Needham, L. L. Tetrahedron Lett. 1991, 32, 991-994.
(12) Trucksess, M. W.; Thomas, F. S.; Page, S. W. J. Pharm. Sci. 1994, 83, 720-722.
(13) Otto, M. Anal. Chem. 1990, 62, 797A-802A.
Table 1. Composition of Experimental Design Employed in This Study
HPLC column                  manufacturers  lots  run days  reps/day  subtotal
X, Vydac 1                   6              2     2         5         120
Y, Vydac 2                   6              2     2         3         72
Z, Waters                    6              2     2         3         72
total no. of chromatograms                                            264
EXPERIMENTAL SECTION
HPLC Experimental Design. To prepare a well-balanced data set for model building, statistical methods of experimental design were applied for collection of the HPLC data. The experimental design was constructed to assess four variables: LT manufacturer, LT lot, HPLC column, and between-day repeatability (Table 1).
Hence, two separate production lots (designated lot 1 and lot 2) were randomly selected from each of the six LT manufacturers (designated A-F) to produce a total of 12 LT samples for the study. For each of these 12 samples, three to five replicate HPLC analyses were run on each of three HPLC columns (designated X, Y, and Z) on each of 2 days. Significantly, the column packing and packing particle size of columns X-Z were identical even though the vendors were not all the same: Vydac for X and Y, and Waters for Z. Of the 264 chromatograms comprising this experimental design (Table 1), 11 were disqualified from the subsequent classification analysis since they exhibited erratic base-line shifts that were deemed artifactual. These disqualified scans were randomly distributed among the LT manufacturers and lots as well as the HPLC columns; hence, the classification analysis could proceed using the remaining 253 chromatograms (96% of total) without disturbing the balance of the data set.
(14) For example: Goodacre, R.; Neal, M. J.; Kell, D. B. Anal. Chem. 1994, 66, 1070-1085. Ball, J. W.; Jurs, P. C. Anal. Chem. 1993, 65, 3615-3621. Blank, T. B.; Brown, S. D. Anal. Chem. 1993, 65, 3081-3089. Borggaard, C.; Thodberg, H. H. Anal. Chem. 1992, 64, 545-551. Wythoff, B. J.; Levine, S. P.; Tomellini, S. A. Anal. Chem. 1990, 62, 2702-2709.
(15) Clouser, D. L.; Jurs, P. Anal. Chim. Acta 1996, 321, 127-135.
(16) Chastrette, M.; Cretin, D.; Elaidi, C. J. Chem. Inf. Comput. Sci. 1996, 36, 108-113.
(17) Mittermayr, C. R.; Drouen, A. C. J. H.; Otto, M.; Grasserbauer, M. Anal. Chim. Acta 1994, 294, 227-242.
(18) Anon. Morbidity Mortality Weekly Rep. 1989, 38, 765-767, 785-788.
Figure 1. Representative HPLC scan indicating location of the early marker (M1), the late marker (M2), the LT-active manifold of peaks, and the so-called fingerprint region.
Chromatographic Experiments. The chromatographic apparatus and reagents were from standard commercial sources. The chromatograms were obtained using a Waters 600E solvent delivery system and WISP 712 autosampler coupled to a Hewlett-Packard HP1040M photodiode array detector and Chemstation. Three HPLC columns containing 5-µm C18 reverse-phase packing were used: two Vydac 25 cm × 4.6 mm i.d. columns and a Waters 15 cm × 3.9 mm i.d. column. The sample injection volume was 10.0 µL, and the flow rate was 1.00 mL/min. Chromatographic runs were conducted at ambient temperature, (24 ± 1) °C. The HP1040M photodiode array detector was set at a wavelength of 220 nm, with a bandwidth of ±2 nm; the detector automatically balanced the column eluant at the start of each chromatogram against a reference wavelength of 550 nm with a bandwidth of ±50 nm for noise reduction. A “marker” solution containing 30 µg/mL uracil and 15 µg/mL p-aminobenzoic acid butyl ester was prepared using 0.15 M ammonium hydroxide. LT samples were obtained from the FDA Center for Food Safety and Applied Nutrition (Washington, DC) and the CDC (Atlanta, GA). Each sample of LT drug substance was dissolved in the “marker” solution to a concentration of 20 mg/mL. Upon injection of 10.0 µL of sample, the chromatography was run for 2 min at 100% mobile phase A followed by a linear gradient to 100% mobile phase B over a 38-min period. Chromatographic data were collected every 0.005 min, producing ∼8000 data entries per chromatogram.
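As a quick check of the sampling arithmetic, the gradient program and the nominal data-point count can be sketched in Python. This is a minimal illustration, not the instrument method: the function and constant names are hypothetical, and it assumes that data collection spans exactly the 2-min hold plus the 38-min gradient.

```python
# Sketch of the gradient program and sampling arithmetic described in the text.
# Assumption: data collection spans the 2-min hold plus the 38-min gradient.
HOLD_MIN = 2.0           # isocratic hold at 100% mobile phase A
RAMP_MIN = 38.0          # linear ramp from 100% A to 100% B
SAMPLE_STEP_MIN = 0.005  # detector sampling interval


def percent_b(t_min: float) -> float:
    """Return the percentage of mobile phase B at t_min minutes after injection."""
    if t_min <= HOLD_MIN:
        return 0.0
    return min(100.0, 100.0 * (t_min - HOLD_MIN) / RAMP_MIN)


def points_per_run() -> int:
    """Return the nominal number of data entries collected per chromatogram."""
    return round((HOLD_MIN + RAMP_MIN) / SAMPLE_STEP_MIN)


print(percent_b(21.0), points_per_run())
```

Under these assumptions a 40-min run sampled every 0.005 min gives a count consistent with the ∼8000 data entries per chromatogram quoted above.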
Mobile phase A was prepared by diluting 1.0 mL of phosphoric acid (85%) to 1.0 L with Milli-Q water; mobile phase B was prepared by mixing 800 mL of acetonitrile with 200 mL of Milli-Q water containing 1.0 mL of phosphoric acid (85%). The mobile phases were filtered through 0.45-µm membranes prior to use. As shown in Figure 1, the “early marker” uracil (M1) and “late marker” p-aminobenzoic acid butyl ester (M2) were chosen to bracket the retention times (RTs) of the peaks associated with the LT samples and were used to normalize the retention data. Data files collected by the HP 1040M diode array detector were converted to ASCII format by a proprietary conversion program prior to classification analysis.
Data Collection and Normalization of Chromatograms. Inspection of the chromatograms for the LT samples revealed a “fingerprint region” (Figure 1), located between the LT-peak manifold and the M2 peak, which served as the source of input data for the ANN, KNN, and SIMCA classifiers as well as for the human experts. This region was selected for fingerprinting because it was rich in fine structure and quite distinct among the six LT manufacturers (Figure 2). To minimize the effects of variability in experimental conditions, the HPLC scans were first normalized with respect to both detector response (i.e., peak height h) and retention time (τ). The observed detector responses were normalized on a linear scale based on a value of h′ = 10 for the M2 peak. The observed retention times were normalized on a linear scale based on values of τ′ = 1 for M1 and τ′ = 10 for M2, according to
$$\tau'_i = (\tau_i - \tau_{M1})\,\frac{9}{\tau_{M2} - \tau_{M1}} + 1 \tag{1}$$
where τi, τM1, and τM2 are the observed retention times for the ith peak, the early marker M1, and the late marker M2, respectively. In terms of these normalized units, the fingerprint region extended from τ′ = 5.505 to 9.995 in increments of 0.005 and contained 899 pairs of h′, τ′ data entries.
The Window Preprocessing Scheme. Like the human brain itself, ANNs work best if taught to classify by recognizing general patterns instead of by “memorizing” (overfitting) the training set data.19 To avoid such overfitting of the training set, a Window preprocessing scheme was developed to convert the 899 data entries (i.e., τ′, h′ pairs) extracted from the fingerprint region of each chromatogram into a significantly smaller number of discrete input variables. The motivation behind creating the Window scheme was 3-fold: (1) to create a simple numerical representation for each chromatogram which was suitable as input for the computer-based classifiers; (2) to recognize and extract the discriminating features between the chromatograms of different-manufacturer samples; and (3) to compensate for the effects of column-to-column and lot-to-lot variations, as well as base-line noise, between the chromatograms of same-manufacturer samples. Differences in same-manufacturer chromatograms due to lot-to-lot variations will be evidenced primarily by peak height shifting, since the trace impurity patterns of two lots from the same LT manufacturer (in the absence of drug processing changes) should match in retention time (i.e., same impurities) but might differ in peak height (i.e., different impurity concentrations).
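Returning briefly to the normalization step, eq 1 and the peak height scaling can be sketched in Python as follows. The function names and the marker retention times in the usage line are hypothetical, and the height scaling assumes a baseline-corrected detector response (zero maps to zero), which the text does not state explicitly.

```python
def normalize_retention(tau: float, tau_m1: float, tau_m2: float) -> float:
    """Eq 1: map an observed retention time so that M1 -> 1 and M2 -> 10."""
    return (tau - tau_m1) * 9.0 / (tau_m2 - tau_m1) + 1.0


def normalize_height(h: float, h_m2: float) -> float:
    """Scale a detector response so that the M2 peak height becomes 10.

    Assumption: the response is baseline-corrected, so zero maps to zero.
    """
    return 10.0 * h / h_m2


# Normalized retention grid of the fingerprint region: 5.505 to 9.995 in
# steps of 0.005, generated from integers to avoid floating-point drift.
FINGERPRINT_GRID = [i / 1000 for i in range(5505, 9996, 5)]

# Hypothetical marker retention times (min), for illustration only.
print(normalize_retention(18.0, tau_m1=3.0, tau_m2=33.0), len(FINGERPRINT_GRID))
```

Building the grid from integers reproduces the 899 entries quoted for the fingerprint region without accumulating rounding error.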
Likewise, differences in same-manufacturer chromatograms due to column-to-column variations will be evidenced primarily by retention time shifting, since trace impurity patterns of the same sample from different columns (or from the same column run at different times) should match in peak heights (i.e., same detector response) but might differ in retention times (i.e., slight shifts in intercolumn solute-stationary phase partitioning). It should be noted, however, that slight peak height shifting, presumably inherent to HPLC, was sometimes observed in chromatograms obtained from replicate runs on the same LT sample.
(19) For recent reviews, see: Gasteiger, J.; Zupan, J. Angew. Chem., Int. Ed. Engl. 1993, 32, 503-527. Burns, J. A.; Whitesides, G. M. Chem. Rev. 1993, 93, 2583-2601. Erb, R. Pharm. Res. 1993, 10, 165-170. Brown, S. D.; Blank, T. B.; Sum, S. T.; Weyer, L. G. Anal. Chem. 1994, 66, 315R-359R.
Figure 2. Representative chromatogram from each of the six LT manufacturers (A-F) after normalization with respect to retention time and peak height.
Figure 3. Illustration of the Window preprocessing scheme which divides the fingerprint region of the chromatogram into 22 time windows.
In implementing the Window scheme, the fingerprint region (τ′ = 5.5-9.9) was first divided chronologically into 22 time windows of equal size (Figure 3). Each time window was then analyzed to locate the highest peak h′max within it. The resulting series of 22 h′max values was then converted to integer values based on a 0-5 scale according to the following hierarchical scheme: highest peak, 5; next three highest peaks, 4; next three highest peaks, 3; and next three highest peaks, 2. The 12 remaining h′max values were assigned 0 or 1 depending on being designated noise or not noise, respectively. To establish an acceptable criterion for noise vs not noise, a statistical analysis of variance (ANOVA) was performed on 10 peak heights h′ appearing after the M2 peak and thus presumed to be noise. A peak was judged not noise (and assigned a value of 1) if its h′max value was at least two standard deviations larger than the statistical mean of the 10 noise peaks (calculated at the 95% confidence level).
The Window preprocessing scheme was designed explicitly to compensate for the negative effects of lot-to-lot and column-to-column variations (and, to a lesser degree, repetition-to-repetition variations) on the performance of the classifiers. The use of time windows partially offsets the observed retention time shifts that arise from column-to-column variations; small shifts in τ′ will not likely alter the identification of an h′max with a specific window. Likewise, the conversion of the actual h′max values to a relative scale of integers (viz., 0-5) partially offsets the peak height shifts that arise from lot-to-lot variations; small shifts in the height of an h′max value will not likely alter its integer value.
Figure 4. Schematic of the feed-forward back-propagation ANN with 46 inputs employed in the present study. Input data are fed to the ANN by the Window preprocessor. Each output node designates a specific LT manufacturer.
Using the procedure described above, the Window scheme will convert the data set of 899 h′, τ′ pairs extracted from the fingerprint region into a string of 22 integers, with each string containing one 5, three 4’s, three 3’s, three 2’s, and a total of 12 1’s and 0’s (e.g., 0 1 0 2 4 1 0 5 3 0 1 2 1 4 0 2 0 3 0 4 1 3, representing a single chromatogram). Whereas the peak heights h′ are represented explicitly using an integer scale, the retention times τ′ are represented only implicitly via the sequential order of the integers. This string of 22 integers from each chromatogram was fed as inputs 1-22 into the ANN-22-n-6 networks, where the three integers (e.g., 22-n-6) refer to the number of nodes in the input, hidden, and output layers, respectively (Figure 4). For the ANN-46-n-6 networks, inputs 1-22 contained this same string of 22 integers while inputs 23-44 contained the number of peaks in the 22 time windows in sequential order. Inputs 45 and 46 contained, respectively, the cumulative number of peaks in the 22 time windows (i.e., the entire fingerprint region) and the cumulative number of peaks with height h′ > h′M2 (Figure 1).
Classification Models. The ANNs, constructed using BrainMaker Professional (California Scientific Software, Nevada City, CA), were fully connected feed-forward networks with sigmoidal transfer functions and a back-propagation learning algorithm implementing gradient descent minimization to adjust the weights. Preliminary evaluation yielded two basic ANN architectures, namely, ANN-22-n-6 and ANN-46-n-6. The output layer consisted of 6 nodes, one for each LT manufacturer.
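The Window scheme described above can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' program. Several details are assumptions made here: the largest data value in a window stands in for its highest peak (peak detection is not specified), points beyond the last window boundary are clamped into the last window, ties are broken by window order, and the per-window peak counts and the count of peaks taller than M2 are taken as given inputs. All function names are hypothetical.

```python
N_WINDOWS = 22


def window_maxima(points, lo=5.5, hi=9.9, n_windows=N_WINDOWS):
    """Return the largest normalized height h' found in each time window.

    points is an iterable of (tau_prime, h_prime) pairs.
    """
    width = (hi - lo) / n_windows
    maxima = [0.0] * n_windows
    for tau, h in points:
        # Clamp out-of-range points into the first or last window (assumption).
        idx = min(n_windows - 1, max(0, int((tau - lo) / width)))
        maxima[idx] = max(maxima[idx], h)
    return maxima


def window_codes(maxima, noise_mean, noise_sd):
    """Convert the window maxima to the 0-5 integer scale."""
    # Stable sort: equal maxima keep their chronological order.
    order = sorted(range(len(maxima)), key=lambda i: maxima[i], reverse=True)
    codes = [0] * len(maxima)
    for rank, i in enumerate(order):
        if rank == 0:
            codes[i] = 5          # highest peak
        elif rank <= 3:
            codes[i] = 4          # next three highest
        elif rank <= 6:
            codes[i] = 3          # next three highest
        elif rank <= 9:
            codes[i] = 2          # next three highest
        else:
            # Remaining windows: 1 if "not noise", 0 if "noise".
            codes[i] = 1 if maxima[i] >= noise_mean + 2 * noise_sd else 0
    return codes


def ann46_inputs(codes, peak_counts, n_above_m2):
    """Assemble the 46 inputs: 22 codes, 22 peak counts, and two totals."""
    return list(codes) + list(peak_counts) + [sum(peak_counts), n_above_m2]
```

Because only the rank of each window maximum matters for codes 2-5, modest lot-to-lot changes in peak height leave the integer string unchanged, which is the compensation effect the text describes.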
The register of each output node contained a floating-point number spanning the interval 0-1; the LT manufacturer predicted by the ANN was taken as the output node with the highest numerical value within this range. The number of nodes n in the hidden layer and, to a lesser extent, the number of hidden layers, were varied to search for the optimal architecture. Specifically, the number of nodes n per hidden layer was varied between 18 and 50 in ANN-46-n-6 to examine the influence of the hidden layers on performance. Similar ANNs with two hidden layers were also considered but failed to show any significant improvement in performance over the ANNs with a single hidden layer. The target-fitting parameters during training were typically set at 95% of correct facts and 0.2 error tolerance. The ANN was trained several times, starting from different initial random weights, and was optimized during training using two separate approaches. In the first and more conventional approach, the ANN was trained to convergence (i.e., training was terminated when the target-fitting parameters were reached for the training set). In the second approach, described elsewhere as cross-validation learning,20 training of the ANN was interrupted periodically and evaluated for performance using the test set data. In this case, the ANN was considered optimized and training completed when no further improvement with the test set data was achieved as measured by the root-mean-square error (RMSE) between the predicted and correct outputs. Other workers have found this latter approach preferable for network optimization in related studies.20 For the present application, the 22- and 46-input ANNs typically optimized within 45 and 14 iterations, respectively, regardless of which approach was employed.
Considering the two standard classifiers, the KNN method attempts to classify samples based on their spatial proximity to each other within an N-dimensional coordinate system where N is the number of input variables characterizing each sample. In contrast, the SIMCA method creates a principal components analysis (PCA) model for each class in a training set to discriminate among classes. KNN and SIMCA were implemented using the Pirouette software, Version 1.1 (Infometrix, Inc., Seattle, WA).
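To make the KNN rule concrete, a minimal Python sketch follows. It is not the Pirouette implementation: the function names are hypothetical, Euclidean distance is assumed as the proximity measure, and vote ties are resolved in favor of the label encountered first among the nearest neighbors. The default k = 4 and the vector length normalization mirror the settings reported later in this section.

```python
import math
from collections import Counter


def unit_length(vec):
    """Scale a vector to unit Euclidean length (vector length normalization)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def knn_predict(train, query, k=4):
    """Return the majority label among the k training vectors nearest to query.

    train is a list of (vector, label) pairs; distance is Euclidean (assumption).
    """
    q = unit_length(query)
    nearest = sorted(train, key=lambda item: math.dist(unit_length(item[0]), q))[:k]
    votes = Counter(label for _, label in nearest)
    # most_common keeps first-seen order among equal counts, so a tied vote
    # goes to the label of the closest tied neighbor.
    return votes.most_common(1)[0][0]


# Toy two-dimensional data, for illustration only.
toy = [([1.0, 0.0], "A"), ([0.9, 0.1], "A"), ([0.0, 1.0], "B"),
       ([0.1, 0.9], "B"), ([0.2, 0.8], "B")]
print(knn_predict(toy, [1.0, 0.05], k=3))
```

In the study itself each vector would hold the 22, 46, or 899 input variables of one chromatogram and each label one of the manufacturers A-F.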
Each data set of τ′, h′ pairs was pretreated prior to SIMCA and KNN analysis using a variety of established protocols (normalizing, mean-centering, normalizing followed by mean-centering, and autoscaling). Based on a preliminary comparison of these pretreatments, vector length normalization was deemed the most appropriate for the present data sets and was implemented throughout prior to SIMCA and KNN analysis. In addition, three principal components (PCs) for SIMCA and K = 4 for KNN were chosen for the subsequent classification study since these parameters yielded the best performance with the test sets.
(20) Xu, L.; Ball, J. W.; Dixon, S. L.; Jurs, P. C. Environ. Toxicol. Chem. 1994, 13, 841-851.
Training and Test Sets. The 253 chromatograms in this classification study included three to five replicates for every combination of LT manufacturer, LT lot, and HPLC column, run on two different days. It soon became apparent that placing chromatographic data from replicate runs of otherwise identical samples in both the training and test sets would unintentionally introduce a bias in favor of the computer-based classifiers when the test set chromatograms were analyzed. The classifiers virtually never misclassified a test set chromatogram if even just one of its replicates was found in the training set. To avoid this “replicate contamination”, the 253 chromatograms were partitioned into six separate combinations of training and test sets (called runs 1-6) in such a way that (1) no chromatogram in the test set would encounter any of its replicates in the training set, and (2) each unique combination of LT manufacturer, LT lot, and HPLC column was included in a test set an equal number of times. This “leave-n-out” procedure was equivalent to modified cross-validation inasmuch as each set of chromatograms was systematically omitted and then tested against the model based on the remaining chromatograms. The number of chromatograms in the training/test sets for runs 1-6, as well as the composition of the test sets in terms of LT manufacturer (A-F), LT lot (lot 1, lot 2), and HPLC column (X, Y, Z), are summarized in Table 2.
Table 2. Number of Chromatograms in Training Set/Test Set for Runs 1-6 per LT Manufacturer^a
LT manufacturer  run 1         run 2         run 3         run 4         run 5         run 6
A                34/10 (1, X)  38/6 (2, Z)   38/6 (1, Z)   38/6 (2, Y)   38/6 (1, Y)   34/10 (2, X)
B                29/9 (2, X)   32/6 (1, X)   32/6 (2, Z)   33/5 (1, Z)   32/6 (2, Y)   32/6 (1, Y)
C                36/6 (1, Y)   34/8 (2, X)   32/10 (1, X)  36/6 (2, Z)   36/6 (1, Z)   36/6 (2, Y)
D                35/6 (2, Y)   35/6 (1, Y)   32/9 (2, X)   32/9 (1, X)   35/6 (2, Z)   36/5 (1, Z)
E                38/6 (1, Z)   38/6 (2, Y)   38/6 (1, Y)   34/10 (2, X)  34/10 (1, X)  38/6 (2, Z)
F                38/6 (2, Z)   38/6 (1, Z)   38/6 (2, Y)   38/6 (1, Y)   34/10 (2, X)  34/10 (1, X)
totals           210/43        215/38        210/43        211/42        209/44        210/43
^a The composition of each test set chromatogram by lot (1 or 2) and HPLC column (X, Y, Z) is indicated in parentheses.
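The replicate-aware partitioning can be sketched in Python as follows. The rotation of (lot, column) combinations across runs is read off Table 2; the record field names (`mfr`, `lot`, `col`, `day`, `rep`) and function names are hypothetical, and the example uses the nominal 264-chromatogram design of Table 1 rather than the 253 chromatograms actually retained.

```python
MANUFACTURERS = "ABCDEF"
# (lot, column) combinations in the order they rotate through Table 2.
COMBOS = [(1, "X"), (2, "X"), (1, "Y"), (2, "Y"), (1, "Z"), (2, "Z")]
NOMINAL_REPS = {"X": 5, "Y": 3, "Z": 3}  # replicates per day, from Table 1


def held_out_groups(run: int) -> set:
    """Return the (manufacturer, lot, column) groups in the test set of a run (1-6)."""
    return {
        (mfr, *COMBOS[(i - (run - 1)) % len(COMBOS)])
        for i, mfr in enumerate(MANUFACTURERS)
    }


def split_by_group(chromatograms, run: int):
    """Split records so that all replicates of a held-out group go to the test set."""
    held_out = held_out_groups(run)
    train, test = [], []
    for rec in chromatograms:
        key = (rec["mfr"], rec["lot"], rec["col"])
        (test if key in held_out else train).append(rec)
    return train, test


# Nominal design: 6 manufacturers x 2 lots x 3 columns x 2 days x replicates.
design = [
    {"mfr": m, "lot": lot, "col": c, "day": d, "rep": r}
    for m in MANUFACTURERS for lot in (1, 2) for c in "XYZ"
    for d in (1, 2) for r in range(NOMINAL_REPS[c])
]
train, test = split_by_group(design, run=1)
print(len(design), len(train), len(test))
```

Splitting on the (manufacturer, lot, column) key rather than on individual chromatograms is what keeps every replicate of a test sample out of the training set.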
“Human Experts” Panel. A panel of 24 human expert volunteers was recruited from the pool of graduate students and postdoctoral students in the Department of Chemistry at the University of Missouri-St. Louis and from chemists at the FDA Division of Drug Analysis, St. Louis, MO. The volunteers were divided randomly into six groups of four people, one group each for runs 1-6. Each volunteer was presented six stacks of chromatograms, appropriately labeled A, B, ..., F according to the LT manufacturer, representing the training set for a particular run. The volunteer was then given the test set chromatograms appropriate to that run and instructed to match (i.e., classify) each chromatogram with one of the stacks of HPLC data labeled A-F. Each chromatogram in the test set was coded in a manner to ensure “double blind” test procedures. In order to recreate the information available to the computer-based approaches, the volunteer was permitted to view only the fingerprint region of each chromatogram.
RESULTS AND DISCUSSION
ANN Training and Optimization. Optimization of the ANNs was achieved using two approaches: (i) convergence of the training set and (ii) cross-validation training. Using either approach, training was completed usually within 12-20 iterations for ANN-46-n-6; 30-60 iterations for ANN-22-n-6; and 28-200 for ANN-899-n-6. The fast rate of learning for ANN-46-n-6, coupled with the smooth shape of the RMSE iteration curve for both the training and test sets, suggests that the level and quality of information fed by the Window preprocessing scheme to the ANN were appropriate to enable reliable classifications. With respect to the number of hidden nodes n, results from optimization of ANN-46-n-6 are summarized in Figure 5. For the training set, the RMSE exhibited a slight but gradual improvement with increasing n that appeared to level off beyond n = 36. For the test set, the RMSE reached a minimum, and correspondingly, the percent correct classified reached a maximum at n = 38. By virtue of its optimal performance during both training and testing, the ANN-46-38-6 network was chosen among the ANN-46 networks for all subsequent classification studies. The ANN-22-n-6 and ANN-899-n-6 networks were optimized in a similar manner.
Figure 5. Plot of percent correct classification and RMSE as a function of the number of hidden nodes n in ANN-46-n-6.
Classification Studies. The ANN, KNN, and SIMCA classifiers were each tested using 22, 46, and 899 input variables. Based on percent correct classified, the ANN-46 was superior to all other classifier-input variable combinations in terms of
Table 3. Summary of Results from Runs 1-6 Using Different Classifiers and Different Number of Input Variables^a
                                    LT manufacturer^b
no. of input variables  classifier  A      B      C      D      E      F      total    % correct classif ± SE
22                      ANN         41/44  35/38  31/42  7/41   39/44  42/44  195/253  77 ± 12
22                      KNN         44/44  38/38  30/42  0/41   38/44  44/44  194/253  77 ± 16
22                      SIMCA       39/44  38/38  23/42  9/41   34/44  42/44  185/253  73 ± 12
46                      ANN         44/44  38/38  37/42  35/41  38/44  44/44  236/253  93 ± 3
46                      KNN         44/44  38/38  31/42  20/41  39/44  43/44  215/253  85 ± 8
46                      SIMCA       42/44  37/38  38/42  28/41  37/44  39/44  221/253  87 ± 4
899                     ANN         44/44  37/38  35/42  23/41  36/44  40/44  215/253  85 ± 6
899                     KNN         42/44  36/38  35/42  0/41   35/44  44/44  192/253  76 ± 15
899                     SIMCA       40/44  34/38  29/42  21/41  35/44  40/44  199/253  78 ± 6
^a Given as the ratio of the number of chromatograms correctly classified to the number of chromatograms in the test set.
^b Correct classifications/subset size.
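The totals in Table 3 can be re-derived from the per-manufacturer counts with a short Python tally. The counts below are transcribed from the table; the names are hypothetical. The pooled percentage computed here (correct/253) is one plausible reading of the last column; because the table reports a standard error, the published values may instead be averages over runs 1-6, so agreement to within about one percentage point is all that should be expected.

```python
# Test-set sizes per LT manufacturer (A-F) and correct classifications,
# transcribed from Table 3.
SUBSET_SIZES = [44, 38, 42, 41, 44, 44]
CORRECT = {
    ("ANN", 22): [41, 35, 31, 7, 39, 42],
    ("KNN", 22): [44, 38, 30, 0, 38, 44],
    ("SIMCA", 22): [39, 38, 23, 9, 34, 42],
    ("ANN", 46): [44, 38, 37, 35, 38, 44],
    ("KNN", 46): [44, 38, 31, 20, 39, 43],
    ("SIMCA", 46): [42, 37, 38, 28, 37, 39],
    ("ANN", 899): [44, 37, 35, 23, 36, 40],
    ("KNN", 899): [42, 36, 35, 0, 35, 44],
    ("SIMCA", 899): [40, 34, 29, 21, 35, 40],
}


def pooled_percent(counts, sizes=SUBSET_SIZES):
    """Return the pooled percentage of correctly classified chromatograms."""
    return 100.0 * sum(counts) / sum(sizes)


for (name, n_inputs), counts in CORRECT.items():
    total = sum(SUBSET_SIZES)
    print(f"{name}-{n_inputs}: {sum(counts)}/{total} = {pooled_percent(counts):.1f}%")
```

The same dictionary also gives per-manufacturer rates directly, for example the fraction of manufacturer D chromatograms that each classifier missed.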
Table 4. Summary of Results from the 46-Node ANN for Runs 1-6^a
LT manufacturer  run 1  run 2  run 3  run 4  run 5  run 6  cumul total  % correct
A                10/10  6/6    6/6    6/6    6/6    10/10  44/44        100
B                9/9    6/6    6/6    5/5    6/6    6/6    38/38        100
C                6/6    8/8    5/10   6/6    6/6    6/6    37/42        88
D                5/6    6/6    9/9    5/9    6/6    4/5    35/41        85
E                6/6    5/6    4/6    10/10  7/10   6/6    38/44        86
F                6/6    6/6    6/6    6/6    10/10  10/10  44/44        100
cumul total      42/43  37/38  36/43  38/42  41/44  42/43  236/253
% correct        98     97     84     90     93     98                  93
^a The ratios refer to the number of correctly classified chromatograms relative to the total number in the test set.
predicting LT manufacturer (Table 3). The results for this network are more fully elaborated in Table 4 by run number.

Initial inspection of Table 3 reveals that every classifier correctly classified >70% of the test set chromatograms regardless of the number of input variables. At the same time, the number of input variables was a significant factor in the performance of the classifiers. Comparing the three levels of input variables studied in detail (i.e., 22, 46, and 899), the 46-input level consistently gave the best performance and yielded the lowest standard deviation of the mean (σm) for all three computer-based classifiers. By contrast, the 22-input level yielded the highest σm value for all three classifiers, indicating that it lacked sufficient input information to consistently discriminate among the six LT manufacturers. The 899-input level deprived the classifiers of preprocessing that would compensate for lot-to-lot and column-to-column variations and for chromatogram baseline noise.

The choice of classifier was another important factor in terms of performance. Specifically, ANNs were equal to or marginally superior to both KNN and SIMCA regardless of the number of input variables selected (Table 3). With 22 input variables, ANN (77%) equaled KNN and performed marginally better than SIMCA (73%), although the differences are not statistically significant. With 46 input variables, ANN (93%) outperformed both KNN (85%) and SIMCA (87%), with differences at the edge of statistical significance. For 899 input variables, ANN (85%) appeared superior to both KNN (76%) and SIMCA (78%), although once again the differences are not statistically significant. These consistent findings suggest that ANNs offer advantages over the standard classifiers KNN and SIMCA for applications in pharmaceutical fingerprinting. The superiority of ANNs over
KNN and SIMCA was also reported by other workers21,22 in evaluating chemical structural classifiers with a large mass spectral database and in GC/MS pattern recognition studies of jet fuel classes.

Another important basis for comparison of the classifiers is to observe their performance for each individual LT manufacturer. The percent correct classified for ANN-22, ANN-46, KNN-46, and SIMCA-46 on a per-manufacturer basis is depicted in Figure 6 using the mean of the results from runs 1-6. It is seen that ANN-46 had the smallest variation in performance across manufacturers A-F and the greatest number of perfect scores per manufacturer. Perfect scores were attained three times (manufacturers A, B, and F) by ANN-46, twice (manufacturers A and B) by KNN-46, and never by ANN-22 or SIMCA-46. Figure 6 also reveals that the percentage of misclassifications was the smallest for manufacturers A, B, and F and by far the largest for manufacturer D. Manufacturer D was misclassified (almost invariably as manufacturer A) over 80% of the time by ANN-22, over 50% of the time by KNN-46, and over 30% of the time by SIMCA-46.

Figure 6. Comparison of the performances of various computer-based classifiers over the six LT manufacturers (A-F).

(21) Werther, W.; Lohninger, H.; Stancl, F.; Varmuza, K. Chemom. Intell. Lab. Syst. 1994, 22, 63-76.
(22) Long, J. R.; Mayfield, H. T.; Henley, M. V.; Kromann, P. R. Anal. Chem. 1991, 63, 1256-1261.

Analytical Chemistry, Vol. 68, No. 19, October 1, 1996

To visualize these differences among the LT manufacturers, scores plots generated from the 46-input data were constructed from the SIMCA scores for each LT manufacturer. The first two principal components PC1 and PC2, which together explain 96% of the total variance, provided a convenient visual aid for identifying inhomogeneity in the data sets. Depending on the number of clusters observed, the cluster of points in a particular scores plot can represent the chromatograms obtained for samples from a given manufacturer over the various combinations of lot, HPLC column, and repetition. Evidence of segregation of points into separate clusters can reflect the degree of lot-to-lot and column-to-column variations among the chromatograms. If lot-to-lot and column-to-column variations for a given manufacturer are negligible, the points in the scores plot should reflect this by aggregating into a single cluster.

In fact, the scores plots for each manufacturer (Figure 7) revealed segregation of the points into three principal clusters (encircled) corresponding to the three HPLC columns (e.g., Figure 7, manufacturer A). This finding confirmed the existence of appreciable column-to-column variations among all of the samples regardless of LT manufacturer and lot. As expected, little or no segregation between replicate chromatograms was observed in the scores plots. With one important exception, none of the scores plots revealed evidence of appreciable clustering due to lot-to-lot variation (e.g., Figure 7, manufacturer A). Only the scores plot for manufacturer D (Figure 7) showed an additional level of segregation according to lot, yielding six identifiable miniclusters corresponding to all possible combinations of lot (lot 1 or lot 2) and HPLC column (X, Y, or Z). The scores plots of manufacturers A-F provide clear evidence that only D exhibits both lot-to-lot and column-to-column variations to a high degree. In contrast, the extent of lot-to-lot variation among the other manufacturers is negligible, even though a slight amount of segregation by lot can be discerned in some cases by a close examination of the scores plots. On the basis of the scores plot of manufacturer D, one can reasonably conclude that the lot-to-lot variations, together with the column-to-column variations, confused SIMCA and likely the other classifiers.
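To make the idea of a PC1/PC2 scores plot concrete, the following Python sketch projects mean-centered data onto its first two principal components. It is a generic PCA computed by power iteration with deflation, not the SIMCA software used in the study, and the input values are hypothetical numbers generated with a column-specific offset to mimic column-to-column variation. All function names are illustrative.

```python
import random


def center_columns(rows):
    """Subtract each variable's mean so PCA describes variation about the centroid."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def covariance(x):
    """Sample covariance matrix (p x p) of mean-centered data."""
    n, p = len(x), len(x[0])
    return [[sum(r[a] * r[b] for r in x) / (n - 1) for b in range(p)]
            for a in range(p)]


def leading_eigenpair(c, iters=1000, seed=0):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in c]
    norm = sum(vi * vi for vi in v) ** 0.5
    v = [vi / norm for vi in v]
    for _ in range(iters):
        w = [sum(cij * vj for cij, vj in zip(row, v)) for row in c]
        norm = sum(wi * wi for wi in w) ** 0.5
        if norm < 1e-12:  # essentially no variance left to extract
            break
        v = [wi / norm for wi in w]
    cv = [sum(cij * vj for cij, vj in zip(row, v)) for row in c]
    return sum(vi * ci for vi, ci in zip(v, cv)), v


def pca_scores(rows, n_components=2):
    """Return (scores, explained-variance ratios) for the leading components."""
    x = center_columns(rows)
    c = covariance(x)
    p = len(c)
    total_var = sum(c[j][j] for j in range(p))
    loadings, ratios = [], []
    for _ in range(n_components):
        lam, v = leading_eigenpair(c)
        loadings.append(v)
        ratios.append(lam / total_var)
        # Deflation: remove the component just found before extracting the next.
        c = [[c[a][b] - lam * v[a] * v[b] for b in range(p)] for a in range(p)]
    scores = [[sum(xi * vi for xi, vi in zip(r, v)) for v in loadings] for r in x]
    return scores, ratios


if __name__ == "__main__":
    # Hypothetical data: 3 HPLC columns x 4 replicates, 5 variables each.
    rng = random.Random(1)
    rows, labels = [], []
    for label, offset in (("X", 0.0), ("Y", 1.0), ("Z", 2.0)):
        for _ in range(4):
            rows.append([offset * (j + 1) + rng.gauss(0.0, 0.1) for j in range(5)])
            labels.append(label)
    scores, ratios = pca_scores(rows)
    for label, (pc1, pc2) in zip(labels, scores):
        print(f"{label}: PC1 = {pc1:7.3f}, PC2 = {pc2:7.3f}")
    print("explained variance ratios:", [round(r, 3) for r in ratios])
```

Plotting the PC1 and PC2 scores with one symbol per column would show the kind of clustering described for Figure 7; with real chromatographic data the number and tightness of the clusters would depend on the actual lot and column effects.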
It is seen that the combination of column-to-column and lot-to-lot variations can impair the performance of the classifiers and, thus, must be watched closely when HPLC data are used for pharmaceutical fingerprinting. It is worth noting, however, that the performance of ANN-46 was less sensitive than that of KNN-46 and SIMCA-46 to the influence of lot-to-lot and column-to-column variations.

A direct measure of the capabilities of the computer-based classifiers is to compare their performance with that of the human experts, as shown in Figure 8 for a representative number of classifiers. It is seen that the percent correct classifications for the human experts (83 ± 2%) was appreciably below ANN-46 (93 ± 3%), slightly below SIMCA-46 (87 ± 2%), KNN-46 (85 ± 4%), and ANN-899 (85 ± 2%), and above ANN-22 (77 ± 3%). The superiority of ANNs to human experts was also reported in a recent study by Rowe et al.,23 who compared a trained, optimized ANN with a human expert for classifying chromatographic peak shapes for 396 peak profiles. While the overall classification success rate (85%) for the ANN and the expert was about equal in this cited work, the ANN exhibited total objectivity and completed the task in 5.6 s compared with 8 h for the human expert.

The results depicted in Figure 8 also show that the classifiers exhibited a fair degree of variability in performance over runs 1-6. For example, both ANN-22 and ANN-46 performed poorly in run 3 but relatively well in run 2. Conversely, KNN-46, SIMCA-46, and the human experts performed more poorly in run 2 than in run 3. A more surprising finding was that the ANNs misclassified manufacturers C and E more often than D in run 3. This lack of uniformity from run to run is another indicator of the negative impact of lot-to-lot and column-to-column variations on the performance of the classifiers (including the human experts). In this regard, it must be recognized that each run contained a unique pair of lot-column combinations in the test set which differed from those combinations in the training set for a given LT manufacturer (Table 2). Consequently, poor performance by a specific classifier in a given run would imply that the classifier had difficulty recognizing that different lot-column combinations in the test set and the training set may belong to the same class (i.e., LT manufacturer). The observation of variations in performance from run to run further justified our use of multiple test sets (runs 1-6) for the purpose of evaluating the classifiers.

Sensitivity Analysis.
The results (Table 3; Figures 6-8) from these classification studies indicated that the chromatograms evidenced variability across manufacturers, across the three HPLC columns (X-Z) (Figure 7), and, in one case (manufacturer D), across lots. The extent of column-to-column variation is particularly noteworthy in that all three columns had identical specifications with respect to stationary-phase characteristics and two of the columns (X and Y) were from the same vendor.

The influence of LT manufacturer, lot, and HPLC column on the chromatographic results was quantified by carrying out an analysis of variance (ANOVA) on these three class variables. The ANOVA calculation, performed using the Statistical Analysis System (SAS) computer package, considered the retention time of the highest peak found in the fingerprint region. Tests were carried out at a 5% level of significance. The analysis showed highly significant variation among the manufacturers (F = 438.94, p = 0.0001) and among the columns (F = 13.89, p = 0.0001), but not between the lots (F = 1.58, p = 0.2096). These results substantiate the scores plot in Figure 7 in terms of (1) the presence of clusters corresponding to the three HPLC columns and (2) the absence of clusters corresponding to the lots for each manufacturer and column. An exception is manufacturer D, where clustering according to both column and lot was observed (Figure 7). A separate analysis of variance for manufacturer D indicated that the two lots were different from each other and that the observed clustering of lots within a column is significant (F = 383.91, p = 0.0001).

(23) Rowe, R. C.; Mulley, V. J.; Hughes, J. C.; Nabney, I. T.; Debenham, R. M. LC-GC 1994, 12, 690-698.
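The F ratios quoted above were obtained with SAS on three class variables. As a simplified illustration of what such an F ratio measures, the Python sketch below computes a one-way ANOVA F statistic for a single factor (HPLC column). The retention times are hypothetical placeholders, not data from the study, and the function name is illustrative.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    F is the between-group mean square divided by the within-group mean
    square; a large F means the group means differ by more than the
    scatter inside each group would suggest.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


if __name__ == "__main__":
    # Hypothetical retention times (min) of the tallest fingerprint-region
    # peak, grouped by HPLC column.
    retention = {
        "X": [12.1, 12.2, 12.0, 12.1],
        "Y": [12.4, 12.5, 12.4, 12.6],
        "Z": [12.9, 13.0, 12.8, 12.9],
    }
    f_ratio, df_b, df_w = one_way_f(list(retention.values()))
    print(f"F({df_b}, {df_w}) = {f_ratio:.2f}")
```

Converting F to a p value requires the F distribution with the stated degrees of freedom, which this sketch leaves to a statistics package or tables. A multi-factor analysis such as the one reported here would also partition the variance among manufacturer, lot, and column rather than treating one factor in isolation.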
Figure 7. Plot of SIMCA scores for manufacturers A-F.
CONCLUDING REMARKS

The present study has attempted to describe some of the key issues and pitfalls associated with using computer-based classifiers in conjunction with HPLC trace impurity data for pharmaceutical fingerprinting. Of critical significance for future work, these results stress the need to recognize the potential negative effects of both lot-to-lot and column-to-column variations when HPLC trace impurity profiles are employed for this purpose. Current efforts24 are aimed toward expanding the scope of this study in several directions, most notably by a more exhaustive examination of ANNs and other classifiers, by development of alternative preprocessing strategies (e.g., wavelet transforms), and by analysis of other drug formulations. Clearly, the progress reported here represents merely the first step toward realization of the ultimate goals of pharmaceutical fingerprinting, namely, to develop capabilities (1) for discriminating among sources of a drug (manufacturer-manufacturer variation), (2) for detecting changes in manufacturer processing methods and/or bulk materials (lot-to-lot variation), and (3) for confirming instances of pharmaceutical counterfeiting or fraud.

Figure 8. Comparison of the performances of various computer-based classifiers vs the panel of human experts over runs 1-6.

(24) Welsh, W. J.; Collantes, E.; Duta, R.; Zielinski, W. L.; Layloff, T. P., work in progress.
ACKNOWLEDGMENT

The authors express their sincere thanks to Drs. Samuel W. Page of the FDA Center for Food Safety and Nutrition, Washington, DC, and Robert Hill of the Centers for Disease Control (CDC) for providing the samples of L-tryptophan bulk substance used in these studies and for helpful technical discussions. This work was supported in part by the Center for Molecular Electronics of the University of Missouri-St. Louis and by a contract with the FDA Division of Drug Analysis, St. Louis.

Received for review November 30, 1995. Accepted July 15, 1996.

AC951164E

Abstract published in Advance ACS Abstracts, August 15, 1996.