Anal. Chem. 1999, 71, 727-735

Full Second-Order Chromatographic/Spectrometric Data Matrices for Automated Sample Identification and Component Analysis by Non-Data-Reducing Image Analysis

Niels-Peter Vest Nielsen, Jørn Smedsgaard,* and Jens Christian Frisvad

Department of Biotechnology, Building 221, Technical University of Denmark, DK-2800 Lyngby, Denmark

A data analysis method is proposed for identification and for confirmation of classification schemes, based on single- or multiple-wavelength chromatographic profiles. The proposed method works directly on the chromatographic data without data reduction procedures such as peak area or retention index calculation. Chromatographic matrices from analysis of previously identified samples are used for generating a reference chromatogram for each class, and unidentified samples are compared with all reference chromatograms by calculating a resemblance measure for each reference. Once the method is configured, subsequent sample identification is automatic. As an example of a further development, it is shown how the method allows identification of characteristic sample components by local similarity calculations, thus finding common components within a given class as well as component differences between classes from the reference chromatograms. This feature is a valuable aid in selecting components for further analysis. The identification method is demonstrated on two data sets: 212 isolates from 41 food-borne Penicillium species and 61 isolates from 6 soil-borne Penicillium species. Both data sets yielded over 90% agreement with accepted classifications. The method is highly accurate and may be used on all sorts of chromatographic profiles. Characteristic component analysis yielded results in good agreement with existing knowledge of characteristic components, but also succeeded in identifying new components as being characteristic.

Modern chromatographic instruments are capable of producing large quantities of data, particularly when spectrometric detection is used, generating second-order chromatograms, i.e., data matrices with full spectra collected for series of retention times. These chromatograms are the basis of some of the most powerful analytical methods, because they allow production of fingerprints of compounds found in highly complex samples.
* Corresponding author: (email) [email protected]; (fax) +45 45 88 49 22.
10.1021/ac9805652 CCC: $18.00. Published on Web 12/29/1998. © 1999 American Chemical Society

These fingerprints can be used as very effective tools for comparison, classification, identification, or typification of samples and have found widespread use in, for example, flavor research, forensic investigations (e.g., drug abuse and oil spill), and chemotaxonomy of microorganisms and plants.1-5 In general, hyphenated chromatographic methods, e.g., HPLC-diode array detection (UV-visible spectra) or GC/MS, are the most powerful tools available for screening and profiling samples. However, the full second-order data matrices are rarely used as collected, including all data points. The collected data are generally subjected to a substantial data reduction by identification of peaks of interest in the chromatograms, calculation of retention indices and peak areas, and, in the case of spectral data, extraction of the peak spectra. This is done to extract the relevant information (retention time, amount and character of the eluting components), thus reducing the amount of information to manageable proportions. Reducing data is possible, because much of the information in a chromatogram is irrelevant (noise) or redundant (several measurements on the same, well-separated peak). Thus, data reduction is a way of enhancing the measured data to bring out the relevant information in relation to the irrelevant, making data analysis faster and easier, and yielding more certain results. In recent years, the peak identification method has been automated, using computers to determine position and characteristics of the peaks in chromatograms.6

Data reduction, however, has an intrinsic problem: By reducing the amount of data, information is invariably lost. Irrelevant information should be removed, but in many cases, it is difficult to define exactly what information is relevant and what may be discarded. This is especially true for noisy data or samples with many components, and a data reduction as drastic as the peak identification approach must in these cases be applied carefully, to retain all, and only, the relevant information. In general, classification or identification of samples based on chromatographic profiles using traditional peak detection relies heavily on the analyst.
She or he has to judge exactly which data should be extracted to make sample classification or identification feasible. Information contained in peak shape, overlapping peaks, or peak shoulders can be useful but is mostly ignored. Furthermore, selection of peaks of interest is often biased toward peaks with spectra that look interesting but might be less relevant for the segregation of samples. Thus, peak identification and selection is done by estimate, extracting more or less the relevant data. This influences the results of the data analysis, because a suboptimal peak identification or selection method will result in biased extracted data. The nature of the bias is usually impossible to judge, due to the complex nature of the peak identification algorithms, of the peak selection process, and of the chromatograms themselves.

(1) Roberts, D.; Bertsch, W. J. High Resolut. Chromatogr., Chromatogr. Commun. 1987, 10, 244.
(2) Logan, B. K. Anal. Chim. Acta 1994, 288, 111.
(3) Larsen, T. O.; Frisvad, J. C. Mycol. Res. 1995, 99, 1167.
(4) Ramos, L. S. J. Chromatogr. Sci. 1994, 32, 219.
(5) White, R. L.; Wentzell, P. D.; Beasy, M. A.; Clark, D. S.; Grund, D. W. Anal. Chim. Acta 1993, 277, 333.
(6) Prazen, B. J.; Synovec, R. E.; Kowalski, B. R. Anal. Chem. 1998, 70, 218-225.

Analytical Chemistry, Vol. 71, No. 3, February 1, 1999 727

In this work, data matrices from hyphenated chromatographic analysis have been used to validate the chemotaxonomy of fungi. Within the taxonomy in the genus Penicillium, the analysis of profiles of secondary metabolites produced in cultures has proven to be very successful.7 Even though several hundred secondary metabolites produced by Penicillium species have been identified, most compounds found in the chromatographic profiles are still unknown.8 Each Penicillium species produces 6-10 families of secondary metabolites, with several metabolites from each family. Almost all chemotaxonomic classification is based on detection of selected metabolites with a known structure, neglecting the hundreds of minor, unidentified metabolites that are found in the chromatographic profile. Thus, this is a case where the samples are very complex, with an uncertain division between relevant and irrelevant components, making both identification and selection of relevant peaks a difficult task.

To overcome the problems of data reduction, methods from the field of image analysis are applied. Most instrumental software can plot chromatographic/spectrometric data as an image (often called an isoplot), where the intensity is represented as a shade of gray or color, with retention time as one axis and the spectral scale (e.g., wavelength) as the other, as shown in Figure 1.
Samples analyzed under similar conditions can very often be discriminated, grouped, or identified by visual inspection of these isoplots, and it should therefore be possible to analyze chromatograms by applying image analysis methods. Researchers in image analysis use non-data-reducing methods for tasks such as character recognition,9 and by implementing the well-known image analysis technique of template matching,10 a method for direct comparison of chromatograms may be developed.

Figure 1. Chromatograms viewed as images. (A) is a section of an HPLC-DAD chromatogram, and (B) is the isoplot of the same section, showing absorbance as gray levels in an image.

(7) Frisvad, J. C.; Thrane, U.; Filtenborg, O. In Fungal chemical taxonomy; Frisvad, J. C., Bridge, P. D., Arora, D. K., Eds.; Marcel Dekker: New York, 1998; pp 289-319.
(8) Mantle, P. G. In Penicillium and Acremonium. Biotechnology Handbook 1; Peberdy, J., Ed.; Plenum Press: New York, 1987; pp 161-243.
(9) Lee, S.-W.; Kim, Y. J. IEEE Trans. Pattern Anal. Machine Intell. 1996, 18, 1045-1050.
(10) Russ, J. C. The Image Processing Handbook; CRC Press: London, 1992.

We present in this paper a non-data-reducing method for classification validation and identification of chromatographic profiles. The basis of the method is that chromatograms from a set of previously classified samples exist, which can be combined to make a single, representative chromatogram for each class. Sample chromatograms are then compared with each representative chromatogram individually, and a resemblance measure is calculated for each class. The entire chromatographic data matrices may be used in the calculations, thus avoiding the problems connected with data reduction methods in general, and with peak identification methods in particular. The chromatograms must be recorded under similar chromatographic conditions, if
combination and comparison are to yield meaningful results. If, for example, some components elute in a different order in two chromatograms, the chromatograms cannot be combined into a single, representative chromatogram.

The method developed may be used as a basis for further development. In this work it is shown how the method allows identification of characteristic sample components by local similarity calculations. The proposed methods are very general in their basic assumptions, and customization for a specific problem is straightforward. The identification method has the added benefit of being automatic, i.e., requiring no operator interaction once the method is configured for a given problem.

Like traditional data analysis methods, the proposed methods will yield better results with a higher quality of input data, and thus some sort of data enhancing is required in order to obtain the best possible results. With the proposed methods, however, it is possible to leave a large amount of irrelevant information in the data without seriously affecting the results, as opposed to peak identification methods, where drastic data reduction is an integral part of the method. It is thus possible to perform only mild data enhancing (i.e., baseline correction) and to do it in such a way that the inevitable biasing of the results is deducible from the applied enhancement methods.

Figure 2. Flow chart illustrating the principle behind the identification method, by showing the data flow and the processing operations.

THEORY AND CALCULATIONS

In this work, the following nomenclature will be used: The chromatographic data matrix, containing data of the type (time,
wavelength, absorbance), is referred to as the chromatogram. The chromatographic trace at a single wavelength is called a chromatographic profile, while the set of readings for a given time point is called a spectrum. In equations, chromatograms and other matrices are indicated by upper case boldface roman letters (e.g., $\mathbf{C}$), while scalars are indicated by lower case italic letters (e.g., $i$). Scalars derived from matrices, such as the absorbance at a specific time/wavelength point in a chromatogram or the mean of the absorbances in a chromatogram, are indicated in normal-width font. For example, a specific absorbance reading is indicated using subscripts on the chromatogram variable, e.g., $C_{t,\lambda}$, where $t$ and $\lambda$ indicate the position in sample and wavelength counts.

Chromatograms that originate from samples that have been identified by some method(s) are called standard chromatograms. These standard chromatograms are used to generate representative chromatograms for each of the different groups. The representative chromatogram from each group is called a reference chromatogram. Chromatograms of unknown identity, which are to be identified using the proposed method, are referred to as sample chromatograms.

Principle. As described, the basis of the identification method is a combination of the information in chromatograms from a set of previously identified samples (the standard chromatograms) into a set of reference chromatograms, one for each class. To identify a sample, a chromatogram is collected (the sample chromatogram) and compared with the reference chromatograms. The sample is assigned to the class whose reference it resembles the most. Prior to combination and comparison, data-enhancing steps are applied to the chromatograms in order to standardize them and thus remove the effects of random variations between the chromatograms. The principle of the method is illustrated by the flow chart in Figure 2.
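The decision rule described above (compare the sample with every reference chromatogram and choose the class with the highest resemblance) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the names `identify` and `toy_similarity` are hypothetical, and the plain correlation used here is merely a stand-in for whatever resemblance measure is chosen.

```python
import numpy as np


def toy_similarity(a, b):
    """Stand-in resemblance measure: correlation of the flattened matrices."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])


def identify(sample, references, similarity):
    """Return the best-matching class name and the score for every class.

    `references` maps a class name to its reference chromatogram
    (time x wavelength array); `similarity` is any callable that returns
    a larger number for a closer match.
    """
    scores = {name: similarity(sample, ref) for name, ref in references.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Keeping the resemblance measure as a parameter mirrors the modular design mentioned in the text: the decision rule stays the same when the comparison step is exchanged.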
The exact nature of the combination, comparison, and (especially) data-enhancing steps is determined by the nature of the data, but by making the calculation procedures modular (i.e., independent of each other) a few, general procedures may be defined and the method customized by choosing among these. When implementing the identification procedure, the method of combination and comparison is devised first, and the appropriate data-enhancing steps are then deduced. In Figure 3, a detailed flow chart shows the procedures implemented for the data used in this work.

Figure 3. Detailed flow chart, showing the data-processing operations used in this work.

Two data sets (1 and 2) are used in this work. The procedures were developed on data set 1, but as the procedures are very general in their applicability, and the samples and analytical methods of the two data sets were similar, the same method and procedures could be used for data set 2. All procedures presented here are automated, so that samples may be identified without interaction with an operator.

Combination of Chromatograms. As an example, consider the standard chromatograms $\mathbf{C}^0, \mathbf{C}^1, \ldots, \mathbf{C}^n$. For combining the standards, it was decided to generate the median chromatogram $\mathbf{M}$, calculated as

$$M_{t,\lambda} = \left( C_{t,\lambda}^{\max(x \mid x \le n/2,\; x \in \mathbb{Z})} + C_{t,\lambda}^{\min(x \mid x \ge n/2,\; x \in \mathbb{Z})} \right) / 2 \qquad (1)$$

(step 2.b in Figure 3), where the standards are numbered according to absorbance at each time/wavelength combination separately:

$$C_{t,\lambda}^{0} \le C_{t,\lambda}^{1} \le \cdots \le C_{t,\lambda}^{n} \qquad (2)$$
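Equations 1 and 2 amount to a pointwise median over the stack of standard chromatograms. The NumPy sketch below spells out the two order-statistic indices explicitly. The function name `median_chromatogram` is hypothetical, and all standards are assumed to have already been brought to the same (time × wavelength) shape.

```python
import numpy as np


def median_chromatogram(standards):
    """Pointwise median of equally shaped chromatograms (eqs 1 and 2)."""
    # Eq 2: sort the standards separately at every (t, lambda) position.
    stack = np.sort(np.stack(standards), axis=0)
    n = stack.shape[0] - 1        # the standards are numbered 0..n
    lower = n // 2                # max{x in Z : x <= n/2}
    upper = (n + 1) // 2          # min{x in Z : x >= n/2}
    # Eq 1: average of the two middle order statistics (identical when n is even).
    return (stack[lower] + stack[upper]) / 2.0
```

For any number of standards this gives the same result as `np.median(np.stack(standards), axis=0)`; the explicit indices are kept only to show the correspondence with eq 1.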

The median combination method was chosen because the median value is more robust than the mean.11 However, when the median shifts from one standard chromatogram to another, a notch may appear in the median chromatogram. The notches are corrected for by smoothing the median chromatogram, by calculating

$$M_{t,\lambda}^{s} = \frac{1}{5} \sum_{i=t-2}^{t+2} M_{i,\lambda} \qquad (3)$$

(step 2.c in Figure 3), where $\mathbf{M}^{s}$ is the smoothed median chromatogram. This is known as applying a mean filter of width 5. Prior to the median calculation, each standard chromatogram is scaled to mean 0 and variance 1 by the operation

$$C_{t,\lambda}^{i} = \left( C_{t,\lambda}^{i,\mathrm{org}} - \bar{C}^{i,\mathrm{org}} \right) / s_{C^{i,\mathrm{org}}} \qquad (4)$$

(step 2.a in Figure 3), where $\mathbf{C}^{i,\mathrm{org}}$ is the original chromatogram number $i$, $\mathbf{C}^{i}$ is the scaled chromatogram, and $\bar{C}^{i,\mathrm{org}}$ and $s_{C^{i,\mathrm{org}}}$, respectively, are the mean and the standard deviation of all the values in $\mathbf{C}^{i,\mathrm{org}}$. The scaling is performed to remove differences in level and amplification in the chromatograms, which might influence the median operation.

(11) Press, W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. Numerical Recipes in C, 2nd ed.; Cambridge University Press: Cambridge, 1992.

Similarity Calculation. For calculating the similarity between two chromatograms, it was decided to use a correlation measure. The correlation between two data sets is a measure of the covariance of the data, i.e., the similarity in shape. This is appropriate because the Penicillium species are characterized by the presence and absence of components (metabolites), rather than the quantity of the components, making the peak spectra more important than the peak heights/areas. The correlations are calculated cophenetically (i.e., as correlations between two single vectors), as opposed to multivariately (i.e., as correlations between two sets of vectors), thus yielding a single correlation for an area of arbitrary extent on the time and wavelength axes. For two arbitrary data matrices $\mathbf{X}$ and $\mathbf{Y}$, both of dimension ($n_t \times n_\lambda$), the cophenetic correlation $\rho$ is calculated as

$$\rho = \frac{1}{n_\lambda n_t} \sum_{i=1}^{n_t} \sum_{j=1}^{n_\lambda} \left( \frac{X_{i,j} - \bar{X}}{s_X} \cdot \frac{Y_{i,j} - \bar{Y}}{s_Y} \right) \qquad (5)$$
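The scaling of eq 4, the width-5 mean filter of eq 3, and the correlation of eq 5 each take only a few lines of NumPy. The sketch below is illustrative and rests on assumptions that the text does not settle. The population standard deviation (`ddof=0`) is used so that the $1/(n_\lambda n_t)$ normalization of eq 5 yields a value in $[-1, 1]$. The first and last two time points of the mean filter are handled by repeating the edge rows. All function names are hypothetical.

```python
import numpy as np


def standardize(c):
    """Eq 4: scale a whole chromatogram to mean 0 and variance 1.

    Assumes the population standard deviation (ddof=0).
    """
    return (c - c.mean()) / c.std()


def mean_filter5(m):
    """Eq 3: mean filter of width 5 along the time axis (axis 0).

    Edge handling is an assumption: the first and last rows are repeated.
    """
    n = m.shape[0]
    padded = np.pad(m, ((2, 2), (0, 0)), mode="edge")
    return sum(padded[k:k + n] for k in range(5)) / 5.0


def cophenetic_correlation(x, y):
    """Eq 5: one correlation over all elements of two equally shaped matrices."""
    return float(np.mean(standardize(x) * standardize(y)))


def similarity(x, y):
    """Cube of the correlation; it keeps the sign and the [-1, 1] range."""
    return cophenetic_correlation(x, y) ** 3
```

Because the standardization is built into `cophenetic_correlation`, a change of offset or gain in either matrix leaves the result unchanged, which is the shape-only behaviour the text asks for.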

By examining the chromatographic data used in this study, it was found that only very high correlations (>0.85-0.95) reflected what was visually judged as a good correlation. It was therefore decided to use the cube of $\rho$ as similarity measure instead of $\rho$, to emphasize the high correlations in relation to the lower ones. Other functions of $\rho$ may be used, but the cube was chosen because it retains the feature that $-1 \le \rho \le 1$. Thus, in the chromatograms, few very similar areas will receive greater weight than a larger number of areas with more dubious similarity.

Equation 5 shows that a scaling of the data sets to mean 0 and variance 1 is inherent in the correlation coefficient calculation. This means that the correlation between two chromatograms cannot be used as similarity measure, because the tall peaks would be given a large weight in the correlation, whereas the smallest peaks would have almost none. The correlation is therefore calculated locally, and the mean of the local correlations is used as similarity measure instead. The actual calculation of the similarity score is incorporated into data-enhancing step d (step 1.d in Figure 3), described below.

Data Enhancing. The data-enhancing procedures developed for data set 1 consist of four separate steps: region extraction, baseline correction, height scaling, and chromatographic aligning, performed in that order.

Step a: Region Extraction. This means extracting the part of the chromatograms that contains relevant information and "cutting away" the parts that contain irrelevant information, e.g., only noise. How this is done is extremely data dependent and is therefore described together with the experimental procedures in the next section.

Step b: Baseline Correction. This is performed in order to eliminate the effect of variations in base signal level during the analysis and in diode sensitivity in the detector array. The chromatograms also contain systematic baseline variations due
to shift in eluent composition during the analysis. In this case, it is necessary to correct all three types of baseline variations: Random variations between the diodes in the detector array can seriously affect the correlation calculation for noise-only areas or small peaks, especially if the eluting substance only shows absorption at a few wavelengths. Variations during the analysis will prevent the height scaling data-enhancing step from giving optimal results. The baseline correction method used has previously been described by Nielsen et al.12 One wavelength is corrected at a time, by first finding the minimum point in a window of specified width on the time axis for all possible window placements. Data points that were found as minimum more than twice were considered to be baseline points, and an estimated baseline for the current wavelength was created by linear interpolation between the detected baseline points. The resulting piecewise linear function was subtracted from the profile at the current wavelength, yielding a baseline corrected profile.

Step c: Height Scaling. This is introduced in order to correct the large differences in peak height often seen in chromatograms from fungal extracts. As mentioned earlier, Penicillium species are characterized by qualitative differences in occurrence of components, rather than by quantitative ones. Differences in peak height are therefore of minor importance and should thus be removed in order to obtain data with a higher content of relevant data. This data enhancement is important for the reference chromatogram generation because it makes the standard chromatograms more comparable and, thus, the reference chromatogram more representative. The peak heights are homogenized by taking the logarithm of all points in the chromatograms, according to the formula

$$C_{t,\lambda}^{\log} = \log\left( C_{t,\lambda} - \min(\mathbf{C}) + 1 \right) \qquad (7)$$

where $\mathbf{C}$ is a chromatogram and $\mathbf{C}^{\log}$ is the chromatogram after the logarithmic transformation. By subtracting the minimum value in $\mathbf{C}$ and adding 1, all arguments to the logarithmic function are greater than or equal to 1, thus yielding a chromatogram where the taller peaks are lowered relative to the smaller peaks.

Step d: Chromatographic Aligning. This means adjusting the chromatograms on the time scale in order to line up identical portions. This is necessary because minute differences in chemical and physical conditions between HPLC runs produce a shift in retention time from one run to another for all peaks. This is obviously a problem for the direct data analysis method presented here, because the basic premise of the chromatogram combination method is that chromatograms contain the same information in the same positions. The shifts in peak positions are corrected using correlation-optimized warping (COW).12 This method aligns one chromatogram with another by dividing the one chromatogram into small sections that may each be stretched or compressed to a specified degree. Dynamic programming (a global optimization method) is then used to determine what combination of stretching and compression (i.e., warping) yields the best possible alignment.

(12) Nielsen, N.-P. V.; Carstensen, J. M.; Smedsgaard, J. J. Chromatogr., A 1998, 805, 17-35.
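Steps b and c above can be sketched as follows. This is a hedged reading of the verbal description, not the published implementation of ref 12. A point is counted once for every window placement in which it is the minimum, and it becomes a baseline point when that count exceeds 2. Outside the first and last baseline point the estimate is held constant (an assumption), the natural logarithm is used (the base is not stated in the text), and the function names are hypothetical.

```python
import numpy as np


def baseline_correct_profile(profile, window):
    """Step b for one wavelength: subtract a piecewise linear baseline."""
    n = profile.size
    counts = np.zeros(n, dtype=int)
    # Slide the window over every possible placement and record its minimum.
    for start in range(n - window + 1):
        counts[start + int(np.argmin(profile[start:start + window]))] += 1
    points = np.flatnonzero(counts > 2)   # "found as minimum more than twice"
    if points.size == 0:                  # assumption: leave the profile as is
        return profile.copy()
    # np.interp holds the first/last baseline value constant beyond the ends.
    baseline = np.interp(np.arange(n), points, profile[points])
    return profile - baseline


def baseline_correct(chromatogram, window):
    """Apply step b to every wavelength column of a (time x wavelength) array."""
    columns = [baseline_correct_profile(chromatogram[:, j], window)
               for j in range(chromatogram.shape[1])]
    return np.column_stack(columns)


def log_height_scale(c):
    """Step c, eq 7: log(C - min(C) + 1), so every argument is >= 1."""
    return np.log(c - c.min() + 1.0)
```

The window must be wider than the widest peak, otherwise a minimum inside a peak can be accepted as a baseline point. A monotonic, noise-free drift gives very few baseline points under this rule, so the sketch behaves best on profiles that contain some ripple or noise, as real detector signals do.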

This is done by calculating the similarity between the two chromatograms for each section, for all possible placements of that section. The cubed cophenetic correlation coefficient is used as similarity measure, as described above. (It was found that the expressions given in ref 12 for calculating the extent of the end regions, which are warped separately, only work well when the target chromatogram is the longer of the two chromatograms. In this work, the following expression was used instead for the target chromatogram end region extent $r_2$ (using the nomenclature of ref 12): $r_2 = L_T - r_1$.) In the combination procedure, one of the standard chromatograms is chosen as target chromatogram and COW is used to align the other standard chromatograms with it, one by one. For identification, COW is used to align each reference chromatogram in turn with the sample chromatogram, returning an aligned reference chromatogram and a similarity score, given by the mean of the COW section similarities. The COW method is thus used as a simultaneous aligning and similarity calculating routine, and steps 1 and 2 in the identification procedure (Figure 2) are thus combined into one (step 1.d, Figure 3).

Additional Applications: Characteristic Component Analysis. With the described identification method it is also possible to analyze classes and samples to identify components that are typical or atypical. This may be done by calculating the chromatographic resemblance locally, which allows the chromatograms to be segmented into similar and dissimilar sections. By calculating the cubed cophenetic correlation coefficient locally between the standard chromatograms, a similarity curve for each possible pairing of standards is generated. A local area size must be specified, and each point on the similarity curve is then calculated as the similarity score between the two chromatograms for the local area centered on that point. If, at a given point, all correlations are large (e.g., >0.9) and a peak is present at that point in the chromatogram, then the component producing that peak is present in all standards and is thus characteristic of that group. Groups can also be compared by generating the similarity curve for the two reference chromatograms. By comparing the results of characteristic component analysis and group similarity analysis, it is possible to determine what components distinguish two groups from each other.
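Both the aligning step and the similarity curves described above rest on the same cubed correlation, evaluated section by section or window by window. The sketch below is a deliberately simplified illustration and not the COW algorithm of ref 12. It warps a single 1-D profile rather than a full chromatogram, requires both profiles to have the same length with a last index divisible by the section length, ignores the separately warped end regions, and scores flat sections as 0. All names are hypothetical.

```python
import numpy as np


def cubed_correlation(a, b):
    """Cube of the correlation; 0.0 if either input is constant (assumption)."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    sa, sb = a.std(), b.std()
    if sa == 0.0 or sb == 0.0:
        return 0.0
    rho = np.mean((a - a.mean()) * (b - b.mean())) / (sa * sb)
    return float(rho ** 3)


def _stretch(piece, n_points):
    """Linearly resample one section to n_points samples."""
    new = np.linspace(0.0, piece.size - 1, n_points)
    return np.interp(new, np.arange(piece.size), piece)


def cow_align(sample, target, section_len, slack=1):
    """Warp `sample` onto `target`; return (aligned, mean section similarity)."""
    sample = np.asarray(sample, dtype=float)
    target = np.asarray(target, dtype=float)
    last = target.size - 1
    if (sample.size != target.size or last == 0 or last % section_len
            or slack >= section_len):
        raise ValueError("lengths not supported by this simplified sketch")
    n_sec = last // section_len
    # layers[i][x] = (best summed similarity, previous boundary) when the
    # i-th section boundary of the sample is placed at target index x.
    layers = [{0: (0.0, -1)}]
    for i in range(1, n_sec + 1):
        piece = sample[(i - 1) * section_len:i * section_len + 1]
        left = n_sec - i
        layer = {}
        for x_prev, (score, _) in layers[-1].items():
            for u in range(-slack, slack + 1):
                x = x_prev + section_len + u
                # Keep only boundaries from which the end point stays reachable.
                if not (last - left * (section_len + slack) <= x
                        <= last - left * (section_len - slack)):
                    continue
                total = score + cubed_correlation(
                    _stretch(piece, x - x_prev + 1), target[x_prev:x + 1])
                if x not in layer or total > layer[x][0]:
                    layer[x] = (total, x_prev)
        layers.append(layer)
    # Backtrack the optimal boundaries, then rebuild the warped profile.
    bounds = [last]
    for i in range(n_sec, 0, -1):
        bounds.append(layers[i][bounds[-1]][1])
    bounds.reverse()
    aligned = np.empty_like(target)
    for i in range(n_sec):
        piece = sample[i * section_len:(i + 1) * section_len + 1]
        lo, hi = bounds[i], bounds[i + 1]
        aligned[lo:hi + 1] = _stretch(piece, hi - lo + 1)
    return aligned, layers[n_sec][last][0] / n_sec


def similarity_curve(a, b, half_width):
    """Cubed correlation in a window centered on each time point (axis 0).

    Points closer than half_width to either end are left as NaN (assumption).
    """
    n = a.shape[0]
    curve = np.full(n, np.nan)
    for t in range(half_width, n - half_width):
        window = slice(t - half_width, t + half_width + 1)
        curve[t] = cubed_correlation(a[window], b[window])
    return curve
```

With `slack=0` the profile is returned unwarped, which reproduces the "no warping allowed" comparison, and a larger slack can only raise the summed section similarity because the unwarped path remains among the candidates.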

EXPERIMENTAL SECTION

For the two data sets used, full chromatographic matrices were collected from HPLC analyses of plug extracts13 from fungal cultures. The HPLC analyses were performed as standard reversed-phase chromatography using gradient elution. All fungal isolates were selected from the fungal collection at The Department of Biotechnology (IBT) to represent the greatest possible diversity in isolation source and geographical locations. The isolates have been identified using all taxonomic characters, yielding a high confidence identification. The collected chromatograms were used in this work, regardless of the quality of the analysis or the representativity or growth of the isolates. The data were used directly from the data files stored by the Hewlett-Packard Chemstation software. The identification and component analysis methods were implemented using Borland C++ Builder Professional on a Hewlett-Packard 266 MHz Pentium II PC (see Appendix).

Data Set 1. A total of 212 isolates from 41 species of Penicillium subgenus Penicillium cultivated on YES and analyzed as described in ref 13 were used. As a part of a single study, UV-visible spectra were collected from 200 to 600 nm with a 4-nm resolution every 0.8 s, resulting in a data matrix containing 101 chromatographic traces and approximately 3300 spectra from each isolate. In all the collected chromatograms, the first eluting peak is preceded only by noise and by a characteristic injection noise pattern. It was therefore decided to remove the injection noise pattern and the noise-only region preceding it, by letting an automated algorithm identify the injection noise pattern, thereby determining the cutting point. The eluting metabolites only showed absorbance at wavelengths 200-440 nm, and wavelengths 440-600 nm could thus be discarded. Finally, the sample preparation method meant that a characteristic peak, indicating the elution of ergosterol, appeared at the end of each chromatogram. Again using an automated algorithm, this peak was identified and removed, together with the noise-only region succeeding it. The remaining part of the chromatogram has a higher relevant information content, and as an added benefit, the smaller size speeds up subsequent processing operations.

Data Set 2. This data set consists of 61 isolates from 6 Penicillium species within the Penicillium herquei group. It represents a more difficult scenario with respect to taxonomy, and the samples have been cultivated on varying substrates (or combinations of substrates) and analyzed by varying methods13,14 on different HPLC instruments and columns over a longer period of time (more than one year). It was not possible to discard any wavelengths from these chromatograms, because some spectra extended over the full spectral range. As there was no characteristic peak at the end of the chromatograms, it was impossible to identify a discardable section. Finally, the algorithm for removing the injection noise was modified slightly, because of a slight overlap between the noise pattern and the first-eluting peaks.

RESULTS AND DISCUSSION

In Figure 4, the effect of data enhancing is demonstrated on two of the chromatograms used in this work. The baseline correction window width was set at 400 data points, as this is greater than the width of the widest peaks in any of the chromatograms used in this work. Thus, the algorithm is prevented from finding baseline points inside peaks. The COW section length was chosen to be 20 sample points, because this is the approximate width of the smallest peaks of interest (see ref 12). The permitted warpings were -1, 0, and 1 sample interval. It is clearly seen how the different procedures enhance the data in different ways, combining to give a much improved starting point for combination or comparison of the chromatograms.

Figure 4. The 200-nm profile from analysis of two isolates of P. aethiopicum (black and gray line). (A) shows the recorded profiles, (B) shows the profiles after region extraction and baseline correction, (C) shows the profiles after logarithmic height scaling, and (D) shows the profiles after aligning. The effects of baseline correction and aligning are most clearly seen in the region above 2500 sample intervals.

(13) Smedsgaard, J. J. Chromatogr., A 1997, 760, 264-270.
(14) Frisvad, J. C.; Thrane, U. J. Chromatogr. 1987, 404, 195-214.

The identification method was tested by calculating a reference chromatogram for each species of fungi and then identifying all the chromatograms into one of these species. For data set 1 this meant identifying 212 chromatograms into 41 groups, and for data set 2 identifying 61 chromatograms into 6 groups. The two data sets were analyzed using full cross validation: Any given sample chromatogram to be tested was not used to calculate the reference chromatogram of the species to which it was known to belong. Rather, it was compared with a reference chromatogram generated only from the other chromatograms of that particular species. Because the sample chromatogram was not used to generate the reference chromatogram with which it is compared, it can be considered as a new sample to be compared with a database constructed from the other 211 (data set 1) or 60 (data set 2) chromatograms, exactly as if a new sample of unknown origin was to be identified. Using this method, all samples from the two data sets were identified.

To judge the significance of the data-enhancing operations, identification runs were made both with and without data enhancing. Without data enhancing, the similarity was calculated by using COW with no warping allowed. The results of the runs for data set 1 are shown in Table 1 and for data set 2 in Table 2. It is seen that the numbers of correctly identified samples are very high: 91.5 and 90.2%. To attain this level of confidence, it has hitherto been necessary to combine the results of several identification methods and to rely on experienced mycologists for the interpretation of the results. The method is also seen to be robust with respect to varying quality in the data, as the more heterogeneous data set 2 produced results almost as good as data set 1. This indicates that it is not necessary to hold chromatographic conditions absolutely constant from one run to the other. Normal good laboratory practice is sufficient to produce good results.
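The cross-validation scheme described above can be sketched as a leave-one-out loop in which the tested chromatogram is removed only from the standards of its own class. The sketch is illustrative rather than the authors' code. The helper names are hypothetical, and the median reference and plain correlation are simple stand-ins for the full combination and similarity procedures.

```python
import numpy as np


def median_reference(members):
    """Stand-in reference generation: pointwise median of the standards."""
    return np.median(np.stack(members), axis=0)


def correlation(a, b):
    """Stand-in similarity: correlation of the flattened arrays."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])


def leave_one_out(chromatograms, labels, make_reference, similarity):
    """Return the number of misidentified samples under full cross validation."""
    errors = 0
    for i, (sample, truth) in enumerate(zip(chromatograms, labels)):
        scores = {}
        for cls in set(labels):
            # The tested sample never contributes to any reference, which
            # only changes the reference of the class it belongs to.
            members = [c for j, (c, lab) in enumerate(zip(chromatograms, labels))
                       if lab == cls and j != i]
            if members:
                scores[cls] = similarity(sample, make_reference(members))
        predicted = max(scores, key=scores.get)
        errors += int(predicted != truth)
    return errors
```

A class with a single standard has no reference left when that standard is tested, so it is skipped for that sample; the data sets in this work have at least three isolates per species, so this case does not arise there.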

Table 1. Results of Identification of Data Set 1 with and without Data Enhancing

                                 misidentifications
group                 samples    with data enhancing    without data enhancing
P. aethiopicum        4          0                      1
P. albocoremium       5          1                      1
P. allii              5          0                      0
P. atramentosum       6          1                      3
P. aurantiogriseum    5          0                      1
P. aurantiovirens     6          3                      2
P. brevicompactum     7          0                      1
P. chrysogenum        6          0                      1
P. clavigerum         6          1                      6
P. concentricum       4          1                      3
P. coprobium          6          0                      0
P. coprophilum        6          1                      3
P. crustosum          6          0                      1
P. cyclopium          6          2                      6
P. digitatum          5          0                      2
P. dipodomyicola      6          1                      2
P. dipodomyis         6          0                      2
P. discolor           5          0                      1
P. echinulatum        6          0                      3
P. expansum           6          0                      2
P. flavigenum         5          1                      3
P. freii              5          1                      1
P. gladioli           5          1                      3
P. glandicola         4          0                      3
P. griseofulvum       5          0                      4
P. hirsutum           5          1                      1
P. hordei             5          0                      0
P. italicum           4          0                      1
P. melanoconidium     4          0                      1
P. neoechinulatum     5          0                      1
P. olsonii            5          0                      1
P. oxalicum           5          1                      2
P. palitans           4          0                      1
P. polonicum          6          0                      3
P. scabrosum          6          0                      5
P. sclerotigenum      4          0                      4
P. solitum            7          2                      4
P. tricolor           4          0                      4
P. venetum            5          0                      1
P. viridicatum        3          0                      0
P. vulpinum           4          0                      2
total                 212        18                     86
% correct ID                     91.5                   59.4

Table 2. Results of Identification of Data Set 2 with and without Data Enhancing

                               misidentifications
group                samples   with data enhancing   without data enhancing
P. herquei                16                     0                        1
P. coralligerum            4                     1                        0
P. atrovenetum            10                     1                        4
P. estinogenum             3                     0                        2
P. raistrickii             8                     3                        3
P. soppii                 20                     1                        7
total                     61                     6                       17
% correct ID                                  90.2                     72.1
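The percentages in the bottom rows of Tables 1 and 2 follow directly from the totals. The short Python check below recomputes them; the helper name `percent_correct` is introduced here for illustration only.

```python
def percent_correct(n_samples, n_misidentified):
    """Share of correctly identified samples, in percent."""
    return 100.0 * (n_samples - n_misidentified) / n_samples

# (data set, samples, misidentified with enhancing, misidentified without enhancing)
totals = [("data set 1", 212, 18, 86), ("data set 2", 61, 6, 17)]

for name, n, with_enh, without_enh in totals:
    print(f"{name}: {percent_correct(n, with_enh):.1f}% with, "
          f"{percent_correct(n, without_enh):.1f}% without data enhancing")
```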

The identification runs without data enhancing clearly show that data enhancing is necessary to exploit the full potential of the identification method. It is interesting that the identifications deteriorate much more for data set 1 than for data set 2, although data set 1 is the more regular of the two. This is probably because data set 2 contains far fewer groups than data set 1, so that random variations in similarity score have a much greater chance of accidentally producing the correct identification.

An initial morphological identification was performed for all cultures used; however, the chromatograms were not screened to assess their quality. As morphological identification within the studied group of fungi is very difficult, errors in the species assigned to the chromatograms are possible and cannot be considered faults of the presented identification method. A reexamination of data set 1 revealed that a chemotaxonomic expert could not determine the species from the chromatogram for 14 of the 18 misidentified isolates. A further examination using all mycological methods revealed that, of these 14 isolates, 5 were actually the species that they were classified as, and not the species to which they had originally been assigned; 3 had, since the compilation of the data set, been assigned to an entirely new species; and 1 was a very atypical example of its species. Finally, 5 samples were so low in component concentration that identification was uncertain. The chromatograms from the remaining 4 of the 18 isolates, which the expert considered identifiable, were all of the species Penicillium cyclopium. Thus, the errors due to inadequacy of the identification method constitute only 4 out of 212 samples, or 1.9%.

For data set 2, three of the six misidentifications were due to problems with the chromatograms: two samples appear to have been contaminated by other species of fungi, and one sample was very low in component concentration. The remaining three errors could not be attributed to any simple characteristics of the chromatograms. In this case, 3 out of 61 samples (4.9%) were thus misidentified because of inadequacies in the identification method. It is not surprising that the method performs less well on data set 2 than on data set 1, considering the greater variation in data set 2. By simply screening the chromatograms, a misidentification level of less than 5% can thus be reached, which is an extremely good result.

An interesting conclusion that may be drawn from the data analysis above is that, although some species included several atypical samples, the typical samples were usually identified correctly. This means that the method is very robust to deficiencies in the standard chromatograms. This is relevant to the choice of standards from which to generate a reference. Very poor chromatograms should obviously be excluded, particularly chromatograms with recognized analytical faults, but the standards need not be perfect specimens of the group they represent in order to produce a representative reference. It is advisable to construct the library references from samples that reflect the variation expected within each class, excluding only the very atypical ones. In practice, all available standards should be included by default, and a standard should be excluded only under strong evidence of anomaly.

The identification method can also be used to calculate the similarities between a new chromatogram and chromatograms stored from previous analyses, to recognize whether a similar chromatogram has been encountered before. In this case, the similarity coefficient has to be supplemented by a manual inspection of the chromatograms.
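The lookup of previously stored chromatograms mentioned above might be sketched as follows. This is an illustration under assumptions, not the published procedure: the stored chromatograms are taken to be aligned arrays of the same shape as the query, the Pearson correlation stands in for the resemblance measure, and `rank_stored_chromatograms` is a hypothetical name.

```python
import numpy as np

def rank_stored_chromatograms(query, library, top_k=5):
    """Return the top_k (name, similarity) pairs, most similar first.

    library maps a run name to a stored chromatogram with the same shape as query.
    """
    q = np.asarray(query, dtype=float).ravel()
    scored = []
    for name, stored in library.items():
        s = np.asarray(stored, dtype=float).ravel()
        # assumed stand-in for the resemblance measure: Pearson correlation
        scored.append((name, float(np.corrcoef(q, s)[0, 1])))
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

The ranked list only narrows the search; as noted in the text, the candidates it returns still need to be inspected manually.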


Figure 5. Sum profiles for the reference chromatograms for the species P. aethiopicum (A) and P. coprophilum (B). Shown below each profile is the minimum similarity profile for the standard chromatograms from which the reference is generated. For P. coprophilum, only the correctly identified standards were used to generate the similarity profile. The portions of the profiles where the similarity is greater than 0.9 are indicated by bars, and the portions corresponding to peaks are marked with asterisks. (C) shows the two references aligned and their similarity curve. Only peaks with similarity greater than 0.9 that were marked in both (A) and (B) are marked in (C).

In this work, the similarity scores have only been analyzed relatively. Analysis of the absolute values is very complex, because classes may have different levels of homogeneity, making a score of, for example, 0.7 a low similarity in one class and a high similarity in another. Except for very well behaved data sets, this makes it impossible to specify a threshold similarity value indicating whether an identification is positive or tentative.

A study of the distribution of the similarity scores for the samples used in this work suggests that it may be possible to construct a heuristic identification confidence measure. The absolute value of the identifying (i.e., largest) similarity should obviously be included in the heuristic to indicate the degree of similarity. The similarity scores of a sample for classes other than the one to which it belongs approximately follow a Gaussian distribution, so the separation of the identifying similarity from the bulk of the similarity scores may yield information regarding the relative confidence of the identification. Finally, the separation of the identifying similarity from the next-largest similarity scores indicates whether the sample has been identified as belonging to a specific class or rather to a group of related classes. Work is being carried out to determine whether a heuristic of the kind described can indeed yield information regarding the confidence of an identification.

Additional Applications: Characteristic Component Analysis. By studying the similarity coefficients as a function of the chromatographic time scale, it is possible to identify the sections of the chromatograms responsible for either the grouping of samples or their separation. To illustrate this technique (characteristic component analysis, CCA), some of the characteristic components were identified for the two related species Penicillium aethiopicum and Penicillium coprophilum by comparing the two reference chromatograms. The local area size used for the similarity calculation was 20 sample points (i.e., the approximate peak width). Figure 5 shows the reference chromatograms and a similarity curve showing the minimum pairwise similarity among the standard chromatograms found for each point on the time axis. Sections with a similarity greater than 0.9 were checked for the presence of peaks, and the characteristic peaks are indicated. The peak UV spectra (corrected for the logarithmic transformation used during data enhancing) were then examined and the characteristic components identified if possible.
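The local similarity calculation behind characteristic component analysis can be sketched in Python as follows. The sketch makes several assumptions that are not taken from the paper: the standards are already aligned arrays whose first axis is retention time, the local resemblance is the Pearson correlation inside a window of 20 points, and flat windows are scored as 0. All function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def local_similarity(a, b, window=20):
    """Correlation of a and b inside a window centred on each time point."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n, half = a.shape[0], window // 2
    out = np.zeros(n)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half)
        sa, sb = a[lo:hi].ravel(), b[lo:hi].ravel()
        if sa.std() > 0 and sb.std() > 0:       # flat windows stay at 0 (assumption)
            out[t] = np.corrcoef(sa, sb)[0, 1]
    return out

def minimum_similarity_profile(standards, window=20):
    """Minimum pairwise local similarity among the standards at each time point."""
    profiles = [local_similarity(a, b, window) for a, b in combinations(standards, 2)]
    return np.min(profiles, axis=0)

def characteristic_regions(profile, threshold=0.9):
    """(start, stop) index pairs where profile exceeds threshold; stop is exclusive."""
    above = (np.asarray(profile) > threshold).astype(int)
    edges = np.flatnonzero(np.diff(np.concatenate(([0], above, [0]))))
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]
```

Regions returned by `characteristic_regions` are only candidates. As in the text, each one still has to be checked for the presence of a peak before a component is called characteristic.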

Penicillium species are characterized by the presence of biosynthetic families of metabolites, and with one exception, the components found to be characteristic all belong to metabolic families known to be characteristic of the species.7 For P. aethiopicum, metabolites from 4 of the 14 biosynthetic families known to be characteristic were found in this study, and for P. coprophilum, 6 out of 10 families were found. The usual definitions of characteristic biosynthetic families require only that the majority of the isolates produce components of that family, not that all isolates do, as was required in this work. This is probably the cause of the discrepancy, although it is also likely that a better correspondence could be found by experimenting with threshold values other than 0.9.

In a previous study,15 characteristic components were identified manually, resulting in seven families for P. aethiopicum and six families for P. coprophilum. The four P. aethiopicum families found in this study are a subset of the seven families found in ref 15, but for P. coprophilum, three of the families from this study were among those not selected. This shows that the method presented here is in some cases able to detect similarities that are missed by manual examination.

In P. aethiopicum, one characteristic component was found that does not belong to any known metabolic family. A closer examination showed that this component does indeed appear to be characteristic. The component has not hitherto attracted attention because its spectrum is not very dynamic, meaning that no conspicuous absorption maxima are seen. The method presented here is not biased toward “exciting” spectra and has thus been able to contribute to the knowledge of characteristic components for P. aethiopicum.

The analysis of similarity between the two species shows three common components. Two of these correspond to the two metabolic families that the species are known to have in common.
The third correspondence is between a known component of P. coprophilum and the new characteristic component found in P. aethiopicum. However, the known P. coprophilum metabolite has never been reported to be produced by P. aethiopicum, and the metabolites are therefore probably different, despite the similarity in retention time and spectrum. This emphasizes the need for a manual evaluation of the results from component analysis in the form presented here.

(15) Svendsen, A.; Frisvad, J. C. Mycol. Res. 1994, 98, 1317-1328.

CONCLUSION

It has been shown that it is possible to obtain a high-confidence identification of samples from automated, direct comparison of full chromatograms under very general assumptions, using modular data-enhancing procedures. The nature of the samples influences only some of the data-enhancing procedures, and it is easy to determine whether and how to alter or exclude individual procedures. The method has been demonstrated on microbiological metabolite data but is applicable to any chromatographic identification problem.

Further development of the method could include a detailed study of the similarity scores produced by the method. In this work, only the highest similarity is used to indicate the identification of a sample, but studying a sample’s similarity scores for the other groups might yield information about the relationship between the different groups. It is also possible that a classification method could be formulated using the direct comparison approach described in this work. Work is under way in this area, but no conclusions have been reached yet.

Characteristic component analysis is shown to yield results in agreement with previous work. In its present form, manual evaluation of the results is required, and parameters must be set by estimate. Further development is required, but this initial study of the method definitely shows promise.

In the future, the methods of identification and component analysis may be implemented together to form a system for, for example, database searching, as is known from mass spectrometry. Speedups could easily be obtained by initially using only a few profiles or a single profile, or by considering only the parts of the chromatograms known to be most important. Adjusting the method to use, for example, GC/MS data should also be straightforward.

This work is considered by the authors to be a first attempt at a new approach to chromatographic data analysis, namely, the attempt to use all the information available. Computers are now powerful enough to process entire chromatograms, which frees us from the constraints imposed by the need to reduce the amount of data to manageable proportions. Research should continue in this direction, as we feel that information-retaining methods could be a benefit in all areas of data analysis.
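As a possible starting point for the study of similarity scores proposed above, the three ingredients of the confidence heuristic discussed earlier (the largest similarity, its separation from the bulk of the scores, and its separation from the runner-up) could be collected as in the Python sketch below. The particular statistics, a z-score against the remaining scores and a plain difference to the second-largest score, are assumptions made for illustration; the paper does not specify the heuristic. The function needs scores for at least two classes.

```python
import numpy as np

def identification_confidence(scores):
    """Summarize one sample's similarity scores.

    scores maps a class name to the sample's similarity with that class reference.
    """
    names = list(scores)
    values = np.array([scores[name] for name in names], dtype=float)
    order = np.argsort(values)[::-1]            # indices from largest to smallest
    best, runner_up = values[order[0]], values[order[1]]
    rest = values[order[1:]]                    # scores for all non-identifying classes
    spread = rest.std(ddof=1) if rest.size > 1 else 0.0
    # separation from the bulk, in units of the spread of the remaining scores
    z = (best - rest.mean()) / spread if spread > 0 else float("inf")
    return {
        "identified_as": names[order[0]],
        "best_similarity": float(best),
        "z_separation": float(z),
        "gap_to_runner_up": float(best - runner_up),
    }
```

A large `z_separation` together with a small `gap_to_runner_up` would correspond to the case described in the text where a sample is assigned to a group of related classes rather than to one specific class.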
ACKNOWLEDGMENT

The authors thank the Centre for Identification and Characterization of Filamentous Fungi, Denmark, for funding the research presented in this paper.

APPENDIX

A program (running under WIN95 or Windows NT 4) that can perform most of the described preprocessing, aligning, and calculation of similarities will be available from our home page at http://www.ibt.dtu.dk/mycology/cow by the end of January. The program can only process data files in ASCII text format, as we have removed the reading of HP Chemstation DAD1.UV files because of the strict confidentiality policy of Hewlett-Packard (Waldbronn, Germany) regarding their data file structure.

Received for review May 26, 1998. Accepted October 29, 1998. AC9805652
