Multiscale Cluster Analysis - ACS Publications

ACS2GO © 2019. ← → → ←. loading. To add this web app to the home screen open the browser option menu and tap on Add to homescreen...
4 downloads 0 Views 135KB Size
Anal. Chem. 1999, 71, 3092-3100

Multiscale Cluster Analysis Bjørn K. Alsberg*

Department of Computer Science, University of Wales, Aberystwyth Ceredigion, SY23 3DB, UK

This article describes how the concept of multiresolution is used with cluster analysis of spectral data. Multiresolution analysis progressively increases the resolution of a spectrum profile by adding levels of details contained in scales obtained from a discrete wavelet transform. At each resolution level a cluster analysis is performed on all the spectra profiles in the data set. This allows the relating of changes in the cluster pattern to various broad and narrow features in the spectral data profiles. The analysis also provides an approximate location of the important features in the original wavenumber domain. The aim of cluster analysis is to obtain information about the “natural patterns” of multivariate samples (objects). There are no universally accepted definitions of what a cluster is and therefore the results from a cluster analysis will depend on selecting the right cluster model for the problem at hand.1 Many popular clustering algorithms detect globular clusters, whereas others find cluster structures that are, e.g., linear, planar, or disjoint.2,3 Usually, the cluster analysis is performed on the original, full-resolution representation of the data profiles. This means that all available detail in the data profiles is used in the cluster analysis. In many cases, however, this is not necessary or even desirable. To illustrate: In order to distinguish an elephant from a dog, the level of detail obtained from using a magnifying glass is not needed. In fact, it would probably be possible to distinguish these two animals from a distance of more than a kilometer (i.e., whether it is an elephant or not). Only the broad and overall features of these animals would be necessary for a classification. However, if a wolf stands next to the dog, it would be necessary to incorporate a higher level of detail to distinguish between them. In this article it is argued that a similar situation exists for cluster analysis of spectral profiles. In general, the same applies to other profile examples such as, e.g., chromatograms and timevarying signals; however, they will not be discussed in the present article. Sometimes it is necessary to zoom in on various levels of detail in the search for interesting cluster patterns. In other words, the

spectra should be compared at the optimal resolution. At large scales only the broad components (low resolution) are detected, whereas at the finest scales (high resolution) more narrow patterns in the signal can be observed. Taking a frequency approach to the different resolution levels, it is possible to associate low-frequency components with low resolution levels and high-frequency components with high resolution levels. This combination of wavelength/wavenumber (or time) axis and frequency is accomplished describing spectral profiles in a wavenumber-frequency (WF) domain or a time-frequency domain (TF) for a time axis. By performing the cluster analysis in the WF or TF domain, it is possible to obtain information that would be difficult to obtain using only the time/spectral domain. This new approach allows the investigator to obtain answers to the following questions: (1) In what way do the cluster patterns depend on the frequency content in the spectral data profiles? (2) Are broad or narrow features involved in the separation and change of the area (i.e., variance) of the clusters? (3) Where in the original spectral profile domain are these features that give rise to the observed cluster pattern approximately located? 2 THEORY 1. Wavelet Transforms. Wavelet transforms4-6 constitute a group of methods that can be used to study nonstationary signals, i.e., signals where the statistical properties are not constant with time. In this article, analysis of intensities sampled at different wavenumbers rather than at different time intervals will be used. However, for the general introduction (see below) to wavelets, time is kept as the axis since it provides a better intuitive basis for understanding the concepts discussed. To obtain information about a signal in the TF domain it is necessary to construct bases that have localization in both time and frequency. However, there is an important restriction to how much resolution a basis function has along the time and frequency domain directions simultaneously. This restriction is determined by the Heisenberg uncertainty principle:

(∆t)(∆f) g * Corresponding author. Tel.: 44-1970-622537. Fax: 44-1970-622455. E-mail: [email protected]. (1) Everitt, B. Cluster Analysis, 3rd ed.; Edward Arnold, Halsted Press: New York, 1993. (2) Massart, D.; Kaufman, L. The Interpretation of Analytical Chemical Data by the Use of Cluster Analysis. Chemical Analysis, A Series of Monographs on Analytical Chemistry and its Applications, vol. 65; John Wiley & Sons: New York, 1983. (3) Bezdek, J. C. Pattern recognition with fuzzy objective algorithms. In Advanced Applications in Pattern Recognition; Nadler, M., Ed.; Plenum Publishing: New York, 1981.

3092 Analytical Chemistry, Vol. 71, No. 15, August 1, 1999

1 4π

(1)

Here, ∆t is the uncertainty in the time localization and ∆f is the uncertainty in the frequency localization. This restricts the (4) Morlet, J. Proc. 51st. Annu. Meet. Soc. Explor. Geophys., Los Angeles, 1981. (5) Grossmann, A.; Morlet, J. SIAM J. Math. Anal. 1984, 15, 723-736. (6) Mallat, S. A Wavelet Tour of Signal Processing. Academic Press: San Diego, 1998. 10.1021/ac9811672 CCC: $18.00

© 1999 American Chemical Society Published on Web 06/25/1999

“partition” of the TF domain by the analyzing basis functions. One of the central ideas for classical wavelet transforms is to have better resolution in time for high-frequency components and low time resolution for low frequency components. The continuous wavelet transform (CWT)7,8 is based on using a single wavelet function which is localized in time and frequency space. To capture time variations of the signal the wavelet function is shifted along time. To capture changes in frequency the wavelet is scaled. In contrast to the Gabor wavelet,9 the size of the wavelet changes rather than the number of oscillations of the wavelet. Scale is similar, but not identical, to frequency. As described below, it has a certain range in the frequency region. A set of scaled and translated versions of the original wavelet ψ(t) are made

ψa,b )

1

ψ

||xa||

(t -a b)

(2)

such that the time-frequency atoms ψa,b cover the TF plane, a is the scaling parameter, and b is the shift along the time axis. A signal f is projected onto this basis:

Wf(a,b) ) 〈f|ψa,b〉 )

∫f(t)||x1a||ψ(t -a b) dt

(3)

identified by a set of coefficients, ci, that relate the scaling (φ(t)) and wavelet (ψ(t)) functions: N-1

∑ c φ(2t - k)

(5)

∑ (-1) c φ(2t + k - N + 1)

(6)

φ(t) )

k

k)0

N-1

ψ(t) )

k

k

k)0

Here, N is the number of nonzero coefficients ck. From this equation it is sufficient to know the ci coefficients since both φ(t) and ψ(t) can be constructed from these coefficients. Each lowpass filtering of a signal produces a smoother signal with less information content. Assume that the signal at level j can be decomposed in terms of the scaling functions φ(2jt - k) (which span space Vj). After low-pass filtering a smoother version of the signal is obtained that can be represented in the smoother basis φ(2j-1t - k) (which spans space Vj-1). The detail lost in the smoothing is the difference between the two spaces Vj and Vj-1. This detail information is spanned by the wavelet bases ψ(2j-1t k) belonging to space Wj-1. The relationship between the two spaces is therefore

Vj ) Vj-1 x Wj-1 Here, Wf(a,b) is the result from the wavelet transform. It is also possible to interpret this equation as a series of convolutions of the function f with different scaled versions of the original wavelet

Wf(a,b) ) f X ψa(b)

(7)

where x is the direct sum. Inserting for Vj-1 in eq 7 to the lowest resolution level V0 gives:

(4) j-1

Vj ) V0 x Wk k)0

where X is the convolution operator and ψa(b) is the wavelet at scale a. In 1985 Yves Meyer made the important discovery10 that it was possible to construct orthonormal wavelet bases. In 1988 Ingrid Daubechies presented11 a framework for constructing orthonormal bases that also had compact support. On the basis of Daubechies’ work, Stephane Mallat invented multiresolution theory where a fast and nonredundant wavelet transform was demonstrated.12 This wavelet transform is usually referred to as the Fast Wavelet Transform (FWT) and requires only O(log2 n) operations (n is the number of elements in the data vector). The FWT is performed by a set of contiguous bandpass filters spread logarithmically over the frequency space. The bandpass filters are constructed using a combination of low- and high-pass filters which are referred to as quadrature mirror filters (QMF).13,14 The low-pass (LP) filter corresponds to the scaling function, and the high-pass (HP) filter corresponds to the wavelet function. The LP and HP filters are (7) Goupillaud, P.; Grossmann, A.; Morlet, J. Geoexploration 1984, 23(1), 85102. (8) Goupillaud, P.; Grossmann, A.; Morlet, J. Geophysics 1984, 49(5), 669. (9) Gabor, D. J. of the IEE 1946, 93, 429-457. (10) Meyer, Y. Congr. Int. Phys. Math. July, 1989, Chapter 79, 38-47. (11) Daubechies, I. Comm. Pure Appl. Math. 1988, 41(7), 909-96. (12) Mallat, S. IEEE, Trans. Pattern Anal. Machine Intell. 1989, 11(7), 674693. (13) Croisier, A.; Esteban, D.; Galand, C. Int. Conf. on Info. Sciences and Systems, Patras, Greece, 1976, pp 443-446. (14) Esteban, D.; Galand, C. Proc. 1977 IEEE Int. Conf. Acoustic, Speech, and Signal Processing, Hartford, 1977, pp 191-95.

(8)

In this paper, the cluster analysis is performed on each space Vj. When a new detail space Wj is added, it is possible to study how this level of detail affects the observed cluster structure. The WaveLab15 toolbox for Matlab was used for all the wavelet computations performed in this paper. 2. Finding the Optimal Wavelet. Unlike the Fourier transform, there are several basis functions to select from when using wavelet transforms. This means there must be a criterion for choosing the optimal wavelet. In this article the criterion is based on the compression ability of the analyzing wavelet. Thus, the optimal wavelet is defined to be the one that gives the smallest number of coefficients that are needed to describe the data. The procedure used in this article is as follows: (1) A representative spectrum of the data set is chosen. One typical representative is the mean vector xj which has a corresponding wavelet coefficient vector w j. (2) Sort (from high to low) the squared elements w j 2j and insert the result into a vector f. i (3) Calculate ui ) log(Σj)1 fj). Note that it is here assumed that vector indices start at no. 1 (following MATLAB notation). (4) The u vectors are normalized such that the largest element in all the vectors is set to one and the lowest to zero. (15) Buckheit, J.; Chen, S.; Crutchfield, J.; Donoho, D.; Gao, H.; Johnstone, I.; Kolaczyk, E.; Scargle, J.; Young, K.; Yu, T. Wavelab. http://playfair.Stanford.EDU/wavelab/ (accessed 1996).

Analytical Chemistry, Vol. 71, No. 15, August 1, 1999

3093

(5) The sum of all the elements for each of the normalized u vectors is calculated. This gives an indication of the area, Ek, below each curve. These curves are monotonously decreasing with only positive values, and the one with the smallest area Ek corresponds to an estimate of the optimal wavelet for a particular data set. For both data sets in this article the following six wavelets were tested for: Haar, Beylkin, Coiflet, Daubechies, Symmlet, and Vaidyanathan with a varying number of vanishing moments. 3. Fourier Versus Multiscale Cluster Analysis. Before discussing the use of wavelet transforms with cluster analysis, it is natural to explore whether it would be beneficial to use the Fourier transform instead. This is indeed possible if the important features are not localized in the time/wavenumber domain. The idea behind Fourier cluster analysis is to use a subset of all the possible frequencies in the signal in the reconstruction of the spectra. In other words: How is the cluster analysis affected by applying various bandpass filters to the spectra? One simple method is to employ a low-pass filter to the spectra where a parameter R indicates the upper limit to the frequencies retained in the filtered signal. For each value of R all frequencies above it are smoothed to zero using a Gaussian function. A cluster analysis of the filtered data set is performed, and the cluster structure is observed. Sometimes global frequencies can affect the cluster pattern; however, it is more likely to observe dependencies of a more localized nature. For such cluster-pattern dependencies, the Fourier transform is unable to localize this since its basis functions have no localization in time, only in frequency. For the Fourier cluster analysis performed in this article the following filter Q has been used:

Q(ω) )

{

(ω-R)2

e

1

δ

ω>R

Let xT be a row vector containing a spectrum in a data matrix X. Let xˆ (k) be the kth reconstruction using the dyadic scales [0... k]:

(9)

ωeR

Here, R is the location of the Gaussian shape used, ω is the frequency, and δ is the width of the Gaussian peak. The conceptual relation between classical, Fourier, and multiscale cluster analyses is illustrated in Figure 1. 4. Multiresolution in Cluster Analysis. As seen in eq 7, a signal can be built by adding the detail information between different scales. The lowest scale is the coarsest approximation of the signal. In the method suggested the cluster structure of a data set at each successive addition of a scale of the spectral domain (e.g., infrared, Raman, or UV) is investigated. Similar approaches to cluster analysis and classification have been done in image analysis,16-19 in mammogram screening20 and climatology (historical flood-level time series).21 (16) Starck, J.-L.; Murtagh, F.; Bijaoui, A. Image processing and data analysis. The multiscale approach; Cambridge University Press: Cambridge, U.K., 1998. (17) Murtagh, F.; Starck, J. Pattern Recognition 1998, 31(7), 847-855. (18) Murtagh, F. J. Classification 1998, 15(2), 161-183. (19) Simon, U.; Berndtgen, M. WaveStat - Cluster analysis of image data and wavelet coefficients. In Fourteenth International Conference on Pattern Recognition, vols 1 and 2; Brisbane, Australia, Aug 16-20 1998, 1998. Jain, V. S. A. K., Lovell, B., Eds.; pp 1622-1625. (20) Kalman, B.; Reinus, W.; Kwasny, S.; Laine, A.; Kotner, L. Academic Radiology 1997, 4(6), 405-414. (21) Fraedrich, K.; Jiang, J.; Gerstengarbe, F.; Werner, P. Int. J. Climatology 1997, 17(12), 1301-1315.

3094 Analytical Chemistry, Vol. 71, No. 15, August 1, 1999

Figure 1. Multiscale or wavelet clustering can be seen as something between classical cluster analysis in the time domain and cluster analysis in the frequency domain. In classical (time) cluster analysis there is an excellent resolution in time, but no resolution in frequency. In Fourier clustering we have the opposite situation. In multiscale cluster analysis information from both the time and frequency domains are incorporated.

k 2i-1



(k)

) φ0 +

∑ ∑ w ψ(2 t - j) i

ij

(10)

i)0 j)0

φ0 is the coefficient associated with the scaling function at the lowest scale 0 and wij is a wavelet coefficient for scale i at location j in the spectrum. ψ(2it-j) is a wavelet basis function. This reconstruction can be written in a matrix equation as follows:

xˆ (k) ) Bw

(11)

Here, w contains all the wavelet coefficients and the single coefficient associated with the scaling function at the lowest scale. The structure of this vector from FWT is usually as follows:

wT ) [φ0 wT0 wT1 ...wTj ...0 0]

(12)

Here, wTj is the transposed wavelet coefficient vector for scale j and “... 0” indicates that all coefficients above scale j are set to zero. Using the reconstruction formula in eq 10, a set of data matrices, X(1), X(2), ..., X(k), is obtained which all contain reconstructed spectra using [0, 1, ... k] scales. Each data matrix is subjected to a cluster analysis (see Figure 2). In this scheme, it is now possible to detect, e.g., whether some clusters are better separated or change their areas at certain scales. Since scale

Figure 2. Illustration of the basic operations performed in the multiscale clustering approach. First, each row in the data matrix (top) is subjected to a Fast Wavelet Transform (FWT). This gives a wavelet coefficient matrix which has the structure indicated (middle) by vertical lines placed at increased (power of 2, dyadic) separation. This is a result of the decimation and dyadic sampling in the frequency domain by the FWT. In the example the matrix has 4 different scales from 0 to 3. The first column in the matrix is a projection onto a scaling function (not a wavelet function) and will always be included in the reconstructions. All coefficients above the scale used in a particular reconstruction are set to zero. Each of the matrices (bottom) from the reconstruction are subjected to a cluster analysis to detect how the cluster pattern changes with respect to scale.

information is related to frequency information, it will tell whether the features involved are broad (low-frequency) or narrow (including higher frequency components). Spectral Regions Associated with the Observed Cluster Patterns. In addition to determining the frequency content of important features, it also important to localize these features in the wavenumber (or time) domain. Since the cluster analysis is performed after adding subsequent scales, it is possible to detect at which scales something interesting happens to the cluster structure. When going from a reconstruction involving k - 1 to k scales, it is interesting to find which regions of the wavenumber domain are associated with a change in the cluster pattern. If a dyadic discrete wavelet transform is used, there are 2k wavelet coefficients in the added scale. As described above, each coefficient is associated with a wavelet that has a certain localization along the wavenumber (or time) axis. If the wavelet function for a region is not contributing to the change in the observed cluster pattern, it is expected that setting the corresponding coefficient to zero will not change the main observed cluster pattern. If all coefficients are necessary to maintain the observed cluster pattern, this indicates that there may be a nonlocalized frequency that is important, i.e., the same effect would have been detected by performing a Fourier cluster analysis (see above). One way to

Figure 3. Illustration of the masking principle using scale 2 with a block size of 2 as an example. Shaded areas are where a masking value of 1 is used, i.e., the wavelet coefficient in this area is used. The white areas for scale 2 signify that these coefficients are not used in this region. Note however that all coefficients smaller than scale 2 are used in the reconstruction. For each of the reconstructions, a DFA is performed, and the cluster pattern is observed.

find the important wavenumber (or time) regions in the cluster analysis is to look through all the possible combinations of wavelet coefficients being multiplied with zero or one for the scale in question. In total there are 22k binary mask vectors m(i) that can be multiplied with a wavelet coefficient vector wk at scale k: (k) (i) v(i) j ) wj m j

(13)

Here, v(i) is the new wavelet coefficient vector for a certain masking combination i. Thus the structure of the new wavelet coefficient w(i) is as follows: T (w(i))T ) [φ0 wT0 wT1 ...(v(i) j ) ...0 0]

(14)

) [φ0 wT0 wT1 ...(wj.m(i))T...0 0] Here, . indicates element-wise multiplication between two vectors. This masking operation is applied to all the wavelet coefficient vectors of the spectra in the data set and then reconstructed. All the wavelet coefficients above scale k are set to zero and will therefore not contribute to the reconstructed spectrum. For each combination i, a cluster analysis of the reconstructed data is performed, see Figure 3 for an illustration of the method. It is now necessary to measure the effect the various combinations have on the pattern studied. For instance, cases in which clusters are separated/merged, depending on whether scale k is added or not, can be measured by calculating the cluster overlap. Of course, it is vital to have decided on which objects belong to a Analytical Chemistry, Vol. 71, No. 15, August 1, 1999

3095

cluster. In other words, the analysis is temporarily turned into a supervised rather than an unsupervised problem. Consequently, methods like CART22 and discriminant PLS23-25 and Quinlan’s C4.5 algorithm26 can be used in the analysis of which regions are important for the change in, e.g., overlap or relative cluster distance with respect to which wavelet coefficients are set to zero. In this article, however, all the rules for the two data sets studied were easy to detect manually, and thus, the use of these advanced methods was not necessary. Through the masking analysis it is possible to detect the important variables (i.e., wavelet coefficients which are located in the wavenumber domain) that are always associated with maintaining, e.g., zero overlap between selected clusters. In a masking analysis it is not necessary to examine the real values of the wavelet coefficients, only the binary masking matrix. It should also be stressed that it is, of course, not necessary to use all 2k coefficients in the search. For the higher scales the number of possible combinations is very large. One way to reduce this number is to look at larger blocks of coefficients, rather than the individual coefficients, and investigate the combinations of whole wavelet coefficient blocks being set to zero or one. In this study, the size of the blocks has been restricted to be a power of 2. Thus, 2k

the total number of mask vectors is 2 p where p is the block size. 3 METHODS AND MATERIALS 1. Data Collection. The details of the sample preparations and experimental conditions for the data set used can be found in refs 27 and 28. Ten-microliter aliquots of bacterial suspensions were evenly applied onto a sand-blasted aluminium plate. Prior to analysis, the samples were oven-dried at 50 °C for 30 min. Samples were run in triplicate. The FT-IR instrument used was the Bruker IFS28 FT-IR spectrometer (Bruker Spectrospin Ltd. Coventry, UK) equipped with an MCT (mercury-cadmium-telluride) detector cooled with liquid N2. The aluminum plate was then loaded onto the motorized stage of a reflectance TLC accessory. The wavenumber range is 4000-600 cm-1. Spectra were acquired at a rate of 20 s-1. The spectral resolution used was 4 cm-1. Two hundred and fifty-six spectra were co-added and averaged. The digital sampling level was set to produce 882 data points. Pre-processing. Regions in the spectra dominated by CO2 (2403-2272 cm-1 and 683-656 cm-1) were filtered out to follow the spectral trend of neighboring data points (although the contribution to CO2 in this data set was hardly detectable in these regions). The spectra were normalized so that the smallest absorbance was set to 0 and the highest to +1 for each spectrum. (22) Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and Regression Trees; Wadsworth & Brooks/Cole Advanced Books & Software: Pacific Grove, CA, 1984. (23) Martens, H.; Naes, T. Multivariate Calibration; John Wiley & Sons: New York, 1989. (24) Alsberg, B. K.; Goodacre, R.; Rowland, J. J.; Kell, D. B. Anal. Chim. Acta 1997, 348(1-3), 389-407. (25) Alsberg, B. K.; Kell, D.; Goodacre, R. Anal. Chem. 1998, 70(19), 41264133. (26) Quinlan, J. C4.5 Programs for Machine Learning; Morgan Kaufmann Publishers, Inc., 1993. (27) Goodacre, R.; Hiom, S.; Cheeseman, S.; Murdoch, D.; Weightman, A.; Wade, W. Curr. Microbiol. 1996, 32(2), 77-84. (28) Alsberg, B. K.; Wade, W.; Goodacre, R. Appl. Spectrosc. 1998, 52(6), 823832.

3096 Analytical Chemistry, Vol. 71, No. 15, August 1, 1999

Figure 4. Results from a multiscale cluster analysis of the UTI data set.

It was further necessary to employ a detrend operation with a linear function increasing from 4000-600 cm-1. The UTI Data Set. This data set consists of three different bacterial species called E. coli, P. mirabilis, and P. aeruginosa which in this article are referred to as clusters 1, 2, and 3, respectively. Twenty-two E. coli (Ea-Eq), fifteen P. mirabilis (PaPj), and fifteen P. aeruginosa (Aa-Aj) were isolated from the urine of patients with urinary tract infection (UTI) and prepared as described previously.29 In total, there are 148 (4 × 37) infrared spectra in this data set. The Eubacterium Data Set. This data set consists of four different species of Eubacterium, E. timidum, E. infirmum, E. exiguum, and E. tardum, which are referred to in this article as clusters 1, 2, 3, and 4, respectively. Four replicates for each sample are used. Four E. timidum (Ta-Te), four E. infirmum (1a-1d), four E. exiguum (2a-2e), five E. tardum (Na-Ne), and five eubacterial hospital isolates (Ha-He) were prepared as described previously.27 In total there are 88 infrared spectra (4 × 22 samples) in this data set. 2. Data Analysis. Measures of Cluster Structure. To demonstrate the feasibility of the multiscale clustering approach, the two data sets used in this article have information from taxonomic classification associated with each infrared spectrum. This information is used here as an independent check on the results from the multiscale clustering operation. To analyze the change in cluster structure over different scale additions, three variables are monitored for each cluster analysis: (1) relative area of each cluster, (2) relative distances between the clusters, and (3) overlap between the clusters. The reason relative rather than absolute cluster areas and distances are used is because the resulting DFA spaces for the different wavelet scales have different magnitudes of the eigenvalues. This is caused by the fact that using a smaller than maximum number of scales in the reconstruction of the spectra removes variance by making them smoother. Thus, it is not possible to compare the absolute DFA scores from different (29) Goodacre, R.; Timmins, M.; Burton, R.; Kaderbhai, N.; Woodward, A.; Kell, D.; Rooney, P. Microbiology 1998, 144(5), 1157-1170.

Figure 5. Convex hulls of the four clusters in 2D DFA space using a varying number of wavelet scales in the reconstruction (the Eubacterium data set).

wavelet reconstruction levels. This is the main reason for not including score values in Figures 4 and 5. It should be noted that only 2D DFA spaces are analyzed in this article (see explanation for why this is sufficient below) which makes it possible to use the convex hull of the 2D region containing the spectra belonging to a taxonomic class to define a cluster. The convex hull can be defined as follows: Let P1 and P2 be any two points in a convex hull region. Then all points Pj falling on a straight line from P1 to P2 must also be within the convex hull region. This means that, e.g., a triangle and a circle are all convex hull regions whereas an “F” shaped region is not. The area of a cluster is here defined as the area of the 2D convex hull polygon for each cluster divided by the area of the convex hull polygon of the whole data set. Distances between clusters are with respect to the centers of the convex hulls. The overlap pij between two clusters i and j is defined as

pij )

#(Ci)∩#(Cj) #(Ci ∪ Cj)

× 100%

(15)

where the operator # returns the number of elements in a set C. Ci and Cj designate the set of objects for clusters i and j, respectively. The Scale Dendrogram. The dendrogram is an efficient way to summarize cluster structure in agglomerative cluster analysis techniques. Initially all objects begin by forming individual clusters. Close clusters are gradually merged until finally all objects are members of a single cluster. The merging process can be followed graphically in the dendrogram at various distance levels. A similar approach can be employed for multiscale cluster analysis to summarize how clusters agglomerate when wavelet scales are removed from the reconstruction of spectra. In this instance the scale, rather than distance, is used to construct the scale dendrogram. Scale dendrograms are used in this paper to supplement the 2D DFA plots showing the cluster structure for the two data sets investigated.

Discriminant Function Analysis. In this paper the discriminant functional analysis (DFA)30-32 method is used to detect cluster patterns. DFA can be used in both a supervised and unsupervised mode. In the unsupervised mode, the dependent variable contains information about the replicate numbers. No information about the external taxonomic class membership information is used in this mode. In both modes, however, it is possible to discuss the separation of clusters with respect to wavelet scale and wavenumber regions. One reason for using data sets in which each object has a well-defined class/cluster membership was to simplify the evaluation of the results from the new multiscale clustering strategy. Discriminant function analysis (also referred to as canonical variates analysis (CVA)) is a multivariate statistical technique that separates objects (samples) into groups or classes by minimizing the within-group variance and maximizing the between-group variance. The principle of DFA is similar to that used in principal components analysis (PCA), but the objective of DFA is to find latent variables that maximize the ratio of the between-group to within-group variances, rather than finding latent variables that maximize the total variance. To find the discriminant functions (DF) direction the within-sample matrix of sums of squares and cross products, W, and the total sample matrix of sums of squares and cross products, T, are first computed. The between-group matrix is computed as: B ) T - W and the eigenvectors of (BW)-1 correspond to the DFA components onto which the data objects are projected. To make the input variables to the DFA independent, PCA is chosen as a prepossessing method. The A first principal components (PCs) scores are used as inputs to the DFA routine. The remaining PCs are usually due to random “noise” in the data and can be ignored without reducing the amount of useful information representing the data. It should be noted that since different levels of wavelet scale reconstruction are investigated, it is not possible to use a fixed number of PCs for all reconstructions. The reason for this is that reconstructions using a small number of scales, will not have as many significant latent variables as reconstructions using more scales. In all analyses described in this paper, the PCs extracted from the different reconstructions account for more than 99.99% of the total variance; however, the actual number of PC scores will vary with the number of scales used in the wavelet reconstruction. The DFA algorithm used here was implemented in MATLAB (The Math-Works, Inc., Natick, MA) following the description by Manly.30 4 RESULTS 1. The UTI Data Set. The maximum number of scales that can be extracted from a data set in the discrete wavelet transform is dependent on the length of the input data vectors. For all the data sets described in this article, the lengths of the data vectors are 1024 (zeros have been added to the 882 length vectors from the FT-IR instrument). The total number of scales is 10 because log2(1024) ) 10. The 10 scales are: 0, 1, 2, ..., 9. Since the zeroth (30) Manly, B. Multivariate Statistical Methods: A Primer; Chapman and Hall: London, U.K., 1994. (31) MacFie, H.; Gutteridge, C.; Norris, J. J. Gen. Microbiol. 1978, 104, 67-74. (32) Windig, W.; Meuzelaar, H.; Haws, B.; Campbell, W.; Asay, K. J. Anal. Appl. Pyrolysis 1983, 5(3), 183-198.

Analytical Chemistry, Vol. 71, No. 15, August 1, 1999

3097

Table 1. Relative Areas of Clusters in 2D DFA Score Space for Each Addition of a Wavelet Scale (The UTI Data Set) scale added

cluster 1

cluster 2

cluster 3

1 2 3 4 5 6 7 8 9

24 19 23 11 9 5 4 4 4

35 24 20 14 9 4 4 3 3

48 53 17 15 7 4 4 4 4

scale is only a constant, the changes in cluster patterns are observed from scale no. 1 (contains 2 wavelet coefficients). For all data sets in the article, 9 different reconstructions were made using the following set of scales: 0 - 1, 0 - 2, 0 - 3, 0 - 4, 0 5, 0 - 6, 0 - 7, 0 - 8, and 0 - 9. The UTI data set contains 3 clusters, and the external taxonomic information is used directly here in the DFA analyses. Since only three classes are investigated, the number of significant DFA dimensions is two. The optimal wavelet for this data set was found to be Coiflet 5. The results from the multiscale cluster analysis of this data set are summarized in Figure 4. A scale dendrogram for the same analysis is shown in the upper part of Figure 7. Here it is observed that after adding scale 3, clusters 1, 2, and 3 separate into nonoverlapping clusters. Adding more scales does not affect the overlapping between the clusters. From Table 1 we see that the relative areas of the clusters are becoming smaller as we add more scales, in particular after adding scale 4. Finding Important Regions. Where in the spectral domain is the feature located which is responsible for the separation of clusters 1, 2, and 3 after adding scale 3? To answer this question, a systematic variation of mask vectors is applied to scale 3. This results in a total of 223 ) 256 possible masking vectors and thus 256 different DFA analyses. The overlap between each of the three clusters is measured in all 256 combinations. Forty-eight of the combinations produced DFA spaces with no overlap between the clusters. The important wavelet coefficient was found to be no. 5 in scale 3 which covers the region 2024-1534 cm-1. 2. The Eubacterium Data Set. For this data set the DFA is used in an unsupervised mode. The class variables contain information about the twenty-two different replicate sets used in the recording of the spectra. By using the contribution to cluster separation as a criterion, it was found that only the two first DFA dimensions were significant. Therefore, none of the dimensions above two were used in the analyses described below. The optimal wavelet for this data is Symmlet 9. For each of the nine scale reconstructions, a DFA was performed and two DFA components were extracted. Figure 5 shows the convex hull polygons for the four clusters in the DFA scores space. A scale dendrogram for the same analysis is shown in the lower part of Figure 7. Each subplot in Figure 5 corresponds to a different scale reconstruction. This figure shows that going from scale 0 to 1 causes cluster 1 to separate from all the other clusters. This means that there is information in scale no. 1 that makes cluster 1 different from the others. The fact that this 3098 Analytical Chemistry, Vol. 71, No. 15, August 1, 1999

Figure 6. Summary plot of masking analyses performed on the Eubacterium data set. (A) Region associated with separation of cluster 1 from the other clusters when scale no. 1 is added. (B) Region associated with separation of cluster 2 from the other clusters when scale no. 3 is added. (C) Region associated with the significant reduction of the relative area of cluster 1 when scale no. 5 is added.

Figure 7. Scale dendrograms for the two data sets.

separation happens at scale 1 tells us that it must be a very broad feature. In the section below titled Finding Important Regions, an analysis is presented that enables an approximate localization of this feature. After subsequent scale additions, cluster 1 remains separated from all the other clusters. Cluster 2 separates out next, at scale 3. Clusters 3 and 4 separate after addition of scale 5. Adding more scales to the spectra does not improve the separation. This demonstrates clearly that there are optimal choices for how much detail (i.e., scales) should be included to achieve total separation of the clusters. Here, the optimal number of scales is 5, since this is the smallest number of scales needed to achieve total separation between the clusters. See Table 4 for a list of percentages of overlap between the various clusters at different scale reconstructions. Adding more scales makes the clusters more compact, as can be seen from Table 3. Cluster 1 requires more scales, as compared with the others, to become compact. In particular, after adding scale 5, there is a significant decrease in the relative area of cluster 1 from 31% to 9%. Below, the masking method is used to find the approximate localization of the feature which is associated with this area reduction.

Table 2. Percentage Overlap between Clusters 2D DFA Score Space for Each Addition of a Wavelet Scale (The UTI Data Set) scale added

(1,2)

(1,3)

(2,3)

1 2 3 4 5 6 7 8 9

41 4 0 0 0 0 0 0 0

57 46 0 0 0 0 0 0 0

26 13 0 0 0 0 0 0 0

Table 3. Relative Areas of Clusters in 2D DFA Score Space for Each Addition of a Wavelet Scale (The Eubacterium Data Set)

coefficients, and there are 223 ) 256 different binary combinations. A systematic search was performed in which a DFA analysis was performed for each of the 256 masking combinations. The overlap of cluster 2 with any of the other clusters was recorded for each combination. In 74 of these combinations it was found that cluster 2 does not overlap with any other cluster. In all of these 74 combinations, it was found that wavelet coefficient no. 3 for scale 3 was always present. The region corresponding to this wavelet coefficient number is 3012-2522 cm-1 and is shown in Figure 6B. For all the scales up to scale no. 4, cluster 1 has a large withingroup variance, as can be seen from the large relative area of the cluster. Above we saw that after adding scale 5 there is a significant decrease in the area (from 31% to 9%). What regions in the spectral domain influence this? Using a window length of 25 4

scale added

cluster 1

cluster 2

cluster 3

cluster 4

1 2 3 4 5 6 7 8 9

29 24 30 31 9 6 3 3 4

22 21 9 6 3 2 4 4 4

27 42 17 8 6 5 5 5 4

9 10 4 2 3 5 3 3 2

Table 4. Percentage Overlap between Clusters in 2D DFA Score Space for Each Addition of a Wavelet Scale (The Eubacterium Data Set) scale added

(1,2)

(1,3)

(1,4)

(2,3)

(2,4)

(3,4)

1 2 3 4 5 6 7 8 9

0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0

27 56 0 0 0 0 0 0 0

11 6 0 0 0 0 0 0 0

30 32 14 16 0 0 0 0

Finding Important Regions. Above it was demonstrated that cluster 1 separates out from the other clusters after adding scale 1. There are 2 coefficients in scale 1. Four different masking vectors are obtained which make it possible to deduce that the last coefficient is always associated with a separation. When the masking vector is [0 0] (for mask vector coefficients m1, m2) there is no separation of cluster 1. However, when the spectra are using the masking vectors [0 1] or [1 1], a separation of cluster 1 is observed. This indicates that there is a broad feature in the the second half of the wavenumber region that makes cluster 1 significantly different from the others. The broad feature is localized in a region corresponding to 2024-52 cm-1 (actually the region is over 2024-600 cm-1 since 600 cm-1 is the lower detection limit for the IR instrument used). The region of the feature is drawn as a thick line in Figure 6A together with the standard deviation spectrum of the data. When scale 3 is added, we see that cluster 2 separates out from the other clusters. The masking method is used to find the region which is necessary for this to happen. Scale 3 has 8

4 at scale 5 we obtain 2 ) 256 different masking combinations. For each of the combinations the area of cluster 1 is computed. A significant decrease in the area is here set to be less than 10%. Sixty-eight of the 256 combinations produced a cluster 1 with an area less than 10%. All these combinations had wavelet coefficients nos. 21, 22, 23, and 24 always present. These coefficients together cover the following wavenumbers: 1530-1040 cm-1. The region is shown in Figure 6C. Results from Fourier Clustering. When Fourier clustering is performed on this data set, similar, but not identical, results are seen. DFA of a data set reconstructed using only 0-0.0029 Hz (space Fourier analysis is performed with 4 cm-1 used as the sampling frequency) shows a separation of both clusters 1 and 2 from the others. This separation holds for higher frequencies added except for on two occasions in which the frequencies are 0-0.0118 Hz (some overlap with clusters 1 and 3) and 0-0.0265 Hz (an overlap between clusters 1 and 4). As also observed for the multiscale cluster analysis it can be seen here that clusters 3 and 4 are separated at higher frequencies (after frequencies greater than 0.0177 Hz are added). The reasons for the differences in the results of the two methods may be explained by the fact that the filter shapes used in the two transforms are different and that they sample the frequency domain differently. Another effect is that the Fourier bases do not have any localization in the time (wavenumber) direction. However, both wavelet and Fourier cluster analysis demonstrate that clusters 1 and 2 are more dominated by broad (lowfrequency) features compared with clusters 3 and 4. 5 DISCUSSION As demonstrated, the multiscale approach provides a way to zoom in on important regions in both the wavenumber (or time) and frequency direction in cluster analysis. The approach is demonstrated for cluster analysis of infrared spectra, but other types of spectra like UV and Raman could have been used as well. Since this is a general approach, it will be useful in most areas where cluster analysis of smooth profiles is concerned, for instance: (1) Biological taxonomy where infrared and Raman spectroscopy is used to produce whole-cell fingerprinting.33-36 The multiscale approach may provide a starting point for identification (33) Timmins, E. M.; Howell, S. A.; Alsberg, B. K.; Noble, W. C.; Goodacre, R. J. Clin. Microbiol. 1998, 36(2), 367-374.

Analytical Chemistry, Vol. 71, No. 15, August 1, 1999

3099

of compounds or cellular components associated with formation of interesting clusters. (2) Multivariate image analysis (MIA)37 in which cluster analysis is sometimes used to identify interesting areas of pixels in the original image. Using a multiscale approach, it is possible to locate spatial features which exist at different spectral-resolution levels. (3) Environmental monitoring38-41 in which identification of important spectral regions may be used to create fast and cheap instruments. (4) Molecular dynamics analysis42-44 in which the data contain dynamics involving a wide range of time scales. In the folding of a protein, for instance, there are time scales from (34) Haag, H.; Gremlich, H. U.; Bergmann, R.; Sanglier, J. J. J. Microbiol. Methods 1996, 27(2-3), 157-163. (35) Grossman, E. L.; Mii, H. S.; Zhang, C. L.; Yancey, T. E. J. Sediment. Res. Sect. A 1996, 66A(5), 1011-1022. (36) Vandermei, H. C.; Naumann, D.; Busscher, H. J. Arch. Oral Biol. 1993, 38(11), 1013-1019. (37) Geladi, P.; Grahn, H. Multivariate Image Analysis; John Wiley & Sons: New York, 1996. (38) Saucy, D. A.; Anderson, J. R.; Buseck, P. R. J. Geophys. Res., Atmos. 1991, 96(D4), 7407-7414. (39) Vangrieken, R.; Artaxo, P.; Bernard, P.; Leysen, L.; Otten, P.; Storms, H.; vanput, A.; Wouters, L.; Xhoffer, C. Chem. Anal. (Warsaw) 1990, 35(13), 75-89. (40) Samson, S. A.; Lewis, D. T. Geoderma 1991, 50(1-2), 1-11. (41) Mendez, J.; Quejido, A. J.; Perezpastor, R.; Perezgarcia, M. Anal. Chim. Acta 1993, 283(1), 528-537.

3100 Analytical Chemistry, Vol. 71, No. 15, August 1, 1999

seconds and even minutes down to femto- and picoseconds. Since all the time scales are interconnected, it is to be expected that any comparison of conformations using cluster analysis would benefit from a multiscale approach. Further work is needed to investigate other methods for extracting information from the TF domain such as continuous wavelet, wavelet packets, and Wiegner-Ville transforms.6 These methods allow more flexibility with respect to the sampling of the time-frequency domain. ACKNOWLEDGMENT The author wishes to thank the Realise Our Potential Award (ROPA) for financial support. The author also thanks Dr. Roy Goodacre for kindly providing the data sets and Professor Douglas B. Kell for useful discussions and support of the research.

Received for review October 23, 1998. Accepted April 26, 1999. AC9811672

(42) McCammon, J.; Harvey, S. Dynamics of Proteins and Nucleic Acids; Cambridge University Press: Cambridge, 1988. (43) Hu, Y.; Fleming, R.; Freed, K. Chem. Phys. 1991, 158(2-3), 395-408. (44) Rey, A.; Skolnick, J. Chem. Phys. 1991, 158(2-3), 199-219.