Anal. Chem. 1997, 69, 3370-3374
Analysis of Spectroscopic Imaging Data by Fuzzy C-Means Clustering
James R. Mansfield,* Michael G. Sowa, Gordon B. Scarth, Rajmund L. Somorjai, and Henry H. Mantsch
Institute for Biodiagnostics, National Research Council Canada, 435 Ellice Avenue, Winnipeg, Manitoba, Canada R3B 1Y6
A novel method of analyzing spectroscopic imaging data is presented. A fuzzy C-means clustering algorithm has been applied to the analysis of near-infrared spectroscopic imaging data acquired with the combination of a CCD camera and a liquid crystal tunable filter. The use of fuzzy C-means clustering dramatically increased the information obtained from near-IR spectroscopic images and allowed for the detection of small subregions of the image that contained novel and unanticipated spectral features, without the need for a priori knowledge of the chemical composition of the sample. Two illustrative samples were analyzed, one composed of four different inks printed on label paper and the other containing indocyanine green and human blood patches. The regions containing the different constituents were clearly demarcated and their mean spectra determined. The mean spectra of the second sample were shown to match those obtained using a scanning near-IR spectrometer. In addition to probing the spatial and spectral characteristics of the samples, the fuzzy C-means clustering analysis also helped improve the signal-to-noise ratio of the spectra.
Spectroscopic imaging methodologies and data are becoming increasingly common in analytical laboratories, whether it be magnetic resonance (MR) (1, 2), infrared (3-5), Raman (6-8), fluorescence (7, 9), and optical (9) microscopy, or near-IR (10-16) or visible (10, 12, 15) based imaging. The volume of information contained in spectroscopic images can make standard data processing techniques cumbersome. Furthermore, there are few techniques that can demarcate which regions of a spectroscopic image contain similar spectra without a priori knowledge of either the spectral data or the sample’s composition (5, 17).
The objective of analyzing spectroscopic images is not only to determine what the spectrum is at any particular pixel in the image but also to determine which regions of the image contain similar spectra, i.e., what regions of the image contain chemically related compounds. To this end, we have implemented a fuzzy C-means clustering (18) algorithm to separate the spectra on the basis of their shape. Since fuzzy C-means clustering is an unsupervised classifier, it is model-free, requiring no prior knowledge of the constituent chemical species or their spatial distribution.
Because cluster analysis groups spectra according to their shape, regions with distinct spectral features are separated into distinct clusters. In addition, clustering can differentiate varying levels of absorbance. Thus, the spectra of individual regions with similar concentrations of constituents typically cluster together. These characteristics of the analysis lend themselves to a variety of applications in material and surface evaluation. For instance, the uniformity of a material deposited on a transparent or reflective substrate can be rapidly determined by this method.
Fuzzy clustering analysis (FCA) tends to group together similar spectra such that the dissimilarity between the resulting group averages is maximized. The results of an FCA include, for each cluster, the cluster centroid (i.e., the weighted mean spectrum for the cluster) and the corresponding cluster membership map (i.e., the spatial distribution of the cluster). Taken together, they answer two commonly posed questions about spectroscopic imaging: where did the different types of spectra occur (shown by the cluster membership maps), and what were the spectral characteristics (depicted by the cluster centroids)?
Associated with each cluster is a fuzzy membership map, which contains at each pixel (i.e., spectrum) a membership value ranging from 0 (no membership) to 1 (full membership). In contrast to hard clustering techniques, which only allow either a 0 (not belonging to the cluster) or 1 (belonging) membership, fuzzy memberships increase the reliability of the analysis. As an example of this, consider pixels from areas which lie along a border between two chemically distinct regions and whose spectra contain contributions from two or more constituents with varying concentrations. The spectra along the border do not match a specific constituent uniquely, and thus the spectra do not belong exclusively to any one constituent cluster. The fuzzy memberships, which are integral to the fuzzy clustering method, are ideally suited to handle this situation and, therefore, allow a more specific, hence robust, classification of the spectroscopic images. Ideally, this analysis assigns distinct constituents to separate clusters, and the membership functions can be interpreted as the relative constituent concentrations.
In addition to identifying in which regions of the spectroscopic image certain spectra were found, a fuzzy clustering analysis may also improve the spectral quality obtained by averaging those spectra which are the most similar. This is, in effect, a means of combining spectra, with a concomitant increase in their signal-to-noise level.
In this article, two examples of a fuzzy C-means clustering analysis of spectroscopic imaging data are presented. These analyses exhibit the region selection, spectral selection, novelty detection, and signal enhancement features which are this methodology’s hallmarks (19). It is important to note, however, that this analysis methodology could easily be applied to any existing spectroscopic imaging technique, whether it be spectroscopic microscopy imaging using confocal, multiplexed (4, 6-10), or sample-mapping technologies (3, 5) or astronomical (13, 14), airborne (15), industrial (8), or medical (11, 12, 16) visible/near-IR or MRI (1, 2) based spectroscopic imaging.
(1) DeLaPaz, R. L. Curr. Opin. Neurol. 1995, 8, 430-436.
(2) Maudsley, A. A.; Lin, E.; Weiner, M. W. Magn. Reson. Imaging 1992, 10, 471-485.
(3) Choo, L. P.; Wetzel, D. L.; Halliday, W. C.; Jackson, M.; LeVine, S. M.; Mantsch, H. H. Biophys. J. 1996, 71, 1672-1679.
(4) Treado, P. J.; Morris, M. D. Appl. Spectrosc. Rev. 1994, 29, 1-38.
(5) Guilment, J.; Markel, S.; Windig, W. Appl. Spectrosc. 1994, 48, 320-326.
(6) Morris, H. R.; Hoyt, C. C.; Miller, P.; Treado, P. Appl. Spectrosc. 1996, 50, 805-811.
(7) Morris, H. R.; Hoyt, C. C.; Treado, P. Appl. Spectrosc. 1994, 48, 857-866.
(8) Schaeberle, M. D.; Karakatsanis, C. G.; Lau, C. J.; Treado, P. J. Anal. Chem. 1995, 67, 4316-4321.
(9) Farkas, D. L.; Baxter, G.; DeBiasio, R. L.; Gough, A.; Nederlof, M. A.; Pane, D.; Pane, J.; Patek, D. R.; Ryan, K. W.; Taylor, D. L. Annu. Rev. Physiol. 1993, 55, 785-817.
(10) Hoyt, C. C. Adv. Imaging 1995, 4, 53-55.
(11) Lodder, R. A. Eur. J. Pharm. Sci. 1994, 2, 84.
(12) Frostig, R. D.; Lieke, E. E.; Ts’o, D. Y.; Grinvald, A. Proc. Natl. Acad. Sci. U.S.A. 1990, 87, 6082-6086.
(13) Eckart, A.; Genzel, R.; Hofmann, R.; Sams, B. J.; Tacconi-Garman, L. E. Astrophys. J. 1995, 445, L23-L26.
(14) Carlson, R.; Smythe, W.; Baines, K.; Barbinis, E.; Becker, K.; Burns, R.; Calcutt, S.; Calvin, W.; Clark, R.; Danielson, G.; Davies, A.; Drossart, P.; Encrenaz, T.; Fanale, F.; Granahan, J.; Hansen, G.; Herrera, P.; Hibbitts, C.; Hui, J.; Irwin, P.; Johnson, T.; Kamp, L.; Dieffer, H.; Leader, F.; Lellouch, E.; Lopes-Gautier, R.; Matson, D.; McCord, T.; Mehlman, R.; Ocampo, A.; Orton, G.; Roos-Serote, M.; Segura, M.; Shirley, J.; Soderblom, L.; Stevenson, A.; Taylor, F.; Torson, J.; Weir, A.; Weissman, P. Science 1996, 274, 385-388.
(15) Rowan, L. C.; Bowers, T. L.; Crowley, J. K. Econ. Geol. 1995, 90, 1966-1982.
(16) Afromowitz, M. A.; Callis, J. B.; Heimbach, D. M.; DeSoto, L. A.; Norton, M. K. IEEE Trans. Biomed. Eng. 1988, 35, 842-850.
(17) Wienke, D.; van den Broek, W.; Buydens, L. Anal. Chem. 1995, 67, 3760-3766.
(18) Bezdek, J. C. Pattern Recognition with Fuzzy Objective Function Algorithms; Plenum Press: New York, 1981.
S0003-2700(97)00206-0 CCC: $14.00 © 1997 American Chemical Society
3370 Analytical Chemistry, Vol. 69, No. 16, August 15, 1997
EXPERIMENTAL SECTION
Near-IR spectroscopic images were collected using a Photometrics Series 200 CCD camera consisting of a 512 × 512 back-illuminated CCD element and a 14-bit A/D converter (Photometrics, Tucson, AZ). The images were collected as 256 × 256 arrays, binning the CCD in 2 × 2 squares.
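As a small illustration of the 2 × 2 binning described above, the following Python sketch sums each non-overlapping 2 × 2 block of a frame, so that a 512 × 512 frame becomes a 256 × 256 image. The function name `bin_2x2`, the use of summation rather than averaging, and the uniform test frame are assumptions made for illustration; on a CCD camera the binning is normally performed in hardware during readout rather than in software.

```python
def bin_2x2(frame):
    """Sum non-overlapping 2 x 2 blocks of a 2-D list of pixel counts."""
    rows, cols = len(frame), len(frame[0])
    if rows % 2 or cols % 2:
        raise ValueError("frame dimensions must be even")
    return [
        [
            frame[r][c] + frame[r][c + 1] + frame[r + 1][c] + frame[r + 1][c + 1]
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


# Hypothetical 512 x 512 frame in which every pixel holds one count.
frame = [[1] * 512 for _ in range(512)]
binned = bin_2x2(frame)
print(len(binned), len(binned[0]), binned[0][0])  # 256 256 4
```

Each binned pixel collects the signal of four detector pixels, at the cost of halving the spatial resolution in each direction.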
The camera was fitted with a Nikon Micro AF60 lens with the f-stop set to 8, and the surfaces of the samples were evenly illuminated using a Bencher CopyMate II copy stand equipped with quartz lamps. A liquid crystal tunable filter (LCTF) unit from Cambridge Research Instruments (Cambridge, MA) was used to scan from 650 to 1050 nm at 10 nm intervals. Color CCD images were collected with a Sony CCD-IRIS/RGB color video camera equipped with a zoom lens and captured using a Creative Labs Video Blaster RT300 image capture card. A Perstorp Analytical Model 6500 NIRSystems (Perstorp Analytical, Silver Spring, MD) scanning near-infrared spectrometer equipped with a bifurcated randomized fiber-optic bundle (Fiberguide, Stirling, NJ) was used to collect reflectance spectra.
Spectroscopic images were collected from two samples: a piece of blotting paper with a patch of dried indocyanine green dye and dried human blood, and a plastic box mounted with a paper label on which are printed four different types of inks. Each was collected using nearly identical parameters, with only the exposure time being varied in order to adjust for the differing reflectance properties of the samples. The spectra were acquired by capturing an image with the bandpass of the LCTF sequentially set to each of the data points collected.
Data Analysis. Each sequence of raw reflectance images was converted to an absorbance scale by ratioing against a sequence of images of a Kodak “Gray Card” white surface (Eastman Kodak, Rochester, NY) using $-\log(I_S/I_R)$, where $I_S$ is the intensity of a sample image pixel and $I_R$ is the intensity of the corresponding pixel from the reference image taken at the same wavelength as the sample image. This is the inverse of the more commonly used absorbance scale conversion and was done in order to produce spectra such that a higher absorbance of light (and, therefore, a lower reflectivity) gives a positive peak in the tracing, as is seen in transmission spectra. The spectroscopic imaging data were analyzed using “EvIdent”, version 2.0, an in-house software package developed in the Interactive Data Language (IDL, Research Systems Inc., Boulder, CO) for the analysis of temporal image data (19-21). EvIdent contains the fuzzy C-means clustering algorithm within a graphical user interface. The clustering calculations were performed on a Silicon Graphics Challenge series server (SGI, Mountain View, CA), and each took from 30 s to 3 min to complete.
(19) Scarth, G. B.; McIntyre, M.; Wowk, B.; Somorjai, R. L. Proc. Int. Soc. Magn. Res. Med. 1995, 238.
Spectroscopic imaging data are two-dimensional arrays of spectra, with each pixel containing a spectrum. Homogeneous patterns in the spectra are identified by clustering such that the differences in the intracluster spectra are minimized, while simultaneously maximizing the intercluster spectral differences. The fuzzy C-means algorithm in these analyses uses iteration to converge to a solution. During each iteration, the fuzzy cluster centroids, $v_{kl}$ (in this case, the weighted means of the spectra), and the fuzzy cluster memberships, $u_{ki}$, are updated for each of the C clusters as follows, where the sums over i in eq 1 run over all N pixels (spectra) of the image:
$$v_{kl} = \frac{\sum_{i=1}^{N} (u_{ki})^{m} X_{il}}{\sum_{i=1}^{N} (u_{ki})^{m}} \tag{1}$$
$$u_{ki} = \left[ \sum_{j=1}^{C} \left( \frac{d_{ki}}{d_{ji}} \right)^{2/(m-1)} \right]^{-1} \tag{2}$$
$$d_{ki} = \left( \sum_{l=1}^{n} (X_{il} - v_{kl})^{2} \right)^{1/2} \tag{3}$$
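The absorbance conversion and the iterative updates of eqs 1-3 can be sketched in plain Python as follows. This is a minimal illustration and not the EvIdent implementation: the function names, the base-10 logarithm, the random initialization of the memberships, the default fuzzy index m = 2, the convergence tolerance, and the small floor placed on the distances are all assumptions made here. Spectra are passed as a list of pixels, each of which is a list of absorbances.

```python
import math
import random


def to_absorbance(sample, reference):
    """Per-pixel -log10(I_S / I_R); base 10 is an assumption."""
    return [-math.log10(s / r) for s, r in zip(sample, reference)]


def fuzzy_c_means(X, C, m=2.0, tol=1e-5, max_iter=200, seed=0):
    """Cluster spectra X (pixels x spectral points) into C fuzzy clusters.

    Returns (v, u): v[k][l] are the centroids, u[k][i] the memberships.
    """
    rng = random.Random(seed)
    N, n = len(X), len(X[0])
    # Random initial memberships, normalized so each pixel sums to 1.
    u = [[rng.random() for _ in range(N)] for _ in range(C)]
    for i in range(N):
        total = sum(u[k][i] for k in range(C))
        for k in range(C):
            u[k][i] /= total

    for _ in range(max_iter):
        # Eq 1: centroids are membership-weighted mean spectra.
        v = []
        for k in range(C):
            w = [u[k][i] ** m for i in range(N)]
            wsum = sum(w)
            v.append([sum(w[i] * X[i][l] for i in range(N)) / wsum
                      for l in range(n)])
        # Eq 3: Euclidean distance from each centroid to each pixel
        # (floored to avoid division by zero in eq 2).
        d = [[max(math.dist(X[i], v[k]), 1e-12) for i in range(N)]
             for k in range(C)]
        # Eq 2: updated memberships.
        new_u = [[1.0 / sum((d[k][i] / d[j][i]) ** (2.0 / (m - 1.0))
                            for j in range(C))
                  for i in range(N)]
                 for k in range(C)]
        change = max(abs(new_u[k][i] - u[k][i])
                     for k in range(C) for i in range(N))
        u = new_u
        if change < tol:  # stop once the memberships no longer change
            break
    return v, u


# Toy data (made up): two families of flat five-point "spectra" with noise.
rng = random.Random(1)
X = [[0.1 + rng.gauss(0, 0.01) for _ in range(5)] for _ in range(20)]
X += [[0.8 + rng.gauss(0, 0.01) for _ in range(5)] for _ in range(20)]
v, u = fuzzy_c_means(X, C=2)
```

In this toy case the two centroids settle near the two absorbance levels, and the memberships of every pixel sum to 1 by construction of eq 2.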
Here, $d_{ki}$ represents the Euclidean distance between cluster centroid k and data pixel i, m is the fuzzy index, n is the number of data points in each spectrum, and $X_{il}$ is the absorbance at pixel i and spectral data point l. The iteration process terminates when the magnitude of the change in the cluster membership values decreases below a set threshold.
Fuzzy clustering assigns a spectrum i to a cluster k with a fuzzy membership value $u_{ki}$: $0 \le u_{ki} \le 1$. The sum of all memberships for each pixel, $\sum_{k} u_{ki}$, is constrained to 1. This is analogous to, but not equivalent to, the probability that pixel i belongs to cluster k. The threshold level which determined the boundaries of the clusters was arbitrarily set to 0.975 for all calculations. The raw spectra were introduced to the clustering routine with no normalization or baseline correction.
(20) Sowa, M. G.; Mansfield, J. R.; Scarth, G. B.; Mantsch, H. H. Appl. Spectrosc. 1997, 51, 143-152.
(21) Mansfield, J. R.; Sowa, M. J.; Scarth, G. B.; Somorjai, R. L.; Mantsch, H. H. Comp. Med. Imaging Graph., in press.
RESULTS AND DISCUSSION
The spectra obtained in the near-IR region between 650 and 1050 nm of most samples generally contain broad peaks and are rather featureless. Nevertheless, the clustering routine used in these analyses was successful in assigning the different spectral features to individual clusters.
To test the spectral quality obtained in the spectroscopic images, an aliquot of a concentrated indocyanine green solution and an aliquot of fresh human blood were dried on a thick piece of blotting paper, and a spectroscopic image was collected. A color CCD image of the sample can be seen in the bottom left corner of Figure 1. Spectroscopic images were classified into three fuzzy clusters, the results of which can be seen superimposed on the spectroscopic image, with the first class being shown in green, the second in red, and the third in yellow. The boundaries of the clusters agree closely with the boundaries seen upon visual inspection of the images. The average spectrum, or centroid, of each cluster is displayed in the appropriate color on the right side of Figure 1.
Figure 1. Fuzzy C-means clustering into three groups. (A) Three clusters superimposed on an absorbance image. Cluster 1 is shown in green, cluster 2 in red, and cluster 3 in yellow. (B) Color CCD image of the sample. (C) Three centroid spectra with the color of each trace corresponding to the color displayed in (A).
To compare the spectral quality, reflectance spectra were taken in the same wavelength region using a scanning NIRSystems spectrometer with a randomized fiber bundle. A comparison between the centroid spectra of the clusters (right side) and the reflectance spectra (left side) can be seen in Figure 2. The spectra correspond closely in terms of both absorbance and band shape in the region of 730-1050 nm. At wavelengths shorter than 730 nm, there is an apparent decrease in the absorbance of the spectroscopic image’s spectra, whereas the absorbance of the scanning spectrometer’s spectra increases. However, it is at this point that the CCD camera/liquid crystal tunable filter (LCTF) system’s throughput drops to below 25% of the maximum efficiency of the system. The centroid spectrum of indocyanine green (Figure 2, solid trace) matches the expected spectrum with an absorbance maximum at 800 nm. The centroid spectrum of blood (Figure 2, dotted trace) displays a broad absorbance centered at 900 nm, arising from a low-energy electronic transition of oxygenated hemoglobin. The two spectra shown as dashed traces are 100% lines, since the white surface of the blotting paper was used as a background for both the reflectance spectra and the spectroscopic images. The reflectance spectrum of the background material over the 700-1050 nm range displays a maximum peak-to-peak noise level of 280 µA, whereas the spectroscopic imaging centroid spectrum exhibits 11 mA of noise over the same wavelength range.
Figure 2. Comparison of spectral quality. The left-hand side shows the spectra taken with a scanning near-IR spectrometer equipped with a fiber-optic probe. The right-hand side shows the centroid spectra from the fuzzy C-means clustering analysis.
Figure 3 shows representative raw spectra extracted from various locations of the spectroscopic image. These spectra clearly show the same absorbance features exhibited by the clustering centroid spectrum from the regions in which they are located. Both a single reflectance spectrum and the complete spectroscopic image (65 536 spectra) were acquired in approximately 40 s.
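To make the link between membership values, cluster maps, and regional spectra concrete, the following Python sketch selects, for one cluster, the pixels whose membership is at least 0.975 and returns the mean spectrum of those pixels, together with a simple peak-to-peak measure. It assumes that the 0.975 threshold quoted in the Data Analysis section is applied directly to the membership values; the function names, the data layout (one membership value and one spectrum per pixel), and the toy numbers are hypothetical, and the text does not describe how EvIdent builds its maps or measures noise.

```python
def cluster_region(memberships, spectra, threshold=0.975):
    """Return (pixel indices, mean spectrum) for one cluster.

    memberships[i] is the fuzzy membership of pixel i in the cluster and
    spectra[i] is that pixel's spectrum (a list of absorbances).
    """
    members = [i for i, u in enumerate(memberships) if u >= threshold]
    if not members:
        return members, []
    n_points = len(spectra[0])
    mean = [sum(spectra[i][l] for i in members) / len(members)
            for l in range(n_points)]
    return members, mean


def peak_to_peak(spectrum):
    """Largest minus smallest value; a crude noise estimate for a flat line."""
    return max(spectrum) - min(spectrum)


# Toy example: four pixels, three spectral points each.
u_k = [0.99, 0.98, 0.60, 0.01]
spectra = [[0.50, 0.52, 0.51],
           [0.52, 0.50, 0.51],
           [0.30, 0.31, 0.29],   # border pixel, below the threshold
           [0.00, 0.01, 0.00]]
members, mean = cluster_region(u_k, spectra)
print(members)  # [0, 1]
```

The border pixel with membership 0.60 is left out of the region, which is the behavior described for pixels lying between two chemically distinct areas.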
Figure 3. Representative raw spectra. The left-hand side shows the raw spectra extracted from various points in the spectroscopic image.
Figure 4. Fuzzy C-means clustering into five groups. (A) Color CCD image of the sample. (B) Five clusters superimposed on an absorbance image. Cluster 1 is shown in green, cluster 2 in yellow, cluster 3 in red, cluster 4 in blue, and cluster 5 in white. (C) Five centroid spectra with the color of each trace corresponding to the color displayed in (B).
The second test sample was the cover of a box used to store a bandpass filter. The label on the box is made of paper, and there are, by visual inspection, at least four different inks printed on it. There is a red ink printed on the label (see Figure 4, color CCD image), two different types of ball point pen inks, and the ink from a stamp pad. The results of a fuzzy C-means clustering of the spectra can be seen superimposed on a 660 nm absorbance image (Figure 4, bottom left), along with their associated centroid spectra (Figure 4, right). There are clearly four different spectra shown in Figure 4, with the two spectra illustrated in yellow and green differing only in intensity and the spectrum in white being essentially a 100% line, or a spectrum of the background paper. The pen and stamp pad inks each clearly show different spectra (Figure 4, blue, red, and yellow/green traces) and spatially correspond to the locations of the inks in the absorbance image. The regions containing red ink, which reflects red light and has no near-IR absorbance features, clustered as a part of the white paper background.
The centroid spectra from the yellow and green clusters represent the spectrum of the same ink differing only in intensity. This may be a desirable result when the concentration distribution or the uniformity of a species on a surface must be rapidly determined. In this case, the yellow and green clusters represent different thicknesses of the same ink deposited on the surface. However, when only compositional differences need to be distinguished, normalization of the spectra prior to FCA eliminates clustering based on differences in absolute spectral intensity. In this case, spectral normalization prior to FCA results in the yellow and green cluster maps coalescing to form a single cluster, thus solely distinguishing the three inks and the paper from one another.
The cluster memberships vary from 130 pixels (Figure 4, green tracing) to 26 756 pixels (Figure 4, white tracing). This methodology is, therefore, able to extract small regions of a spectroscopic image whose spectra differ from that of a large number of background pixels.
CONCLUSION
Like many forms of spectroscopic imaging, the spectra obtained using the combination of a liquid crystal tunable filter and a CCD camera, although generally displaying the spectral features of the sample at that point, are poor compared with those which could be acquired using more common single spectrum techniques. However, the combined advantages of spatially resolving spectral features and the multiplexed method of collecting tens of thousands of spectra simultaneously are enticing. Combining spectroscopic imaging techniques with an unsupervised classification scheme, such as fuzzy C-means clustering, can both increase the signal-to-noise levels in the centroid spectra and identify which regions of the spectroscopic image contain which spectral features. The effect of averaging hundreds or even thousands of spectra into one centroid spectrum for each fuzzy region is similar to the effect achieved by coadding scans in a spectrometer, except that, in the fuzzy clustering methodology, only those spectra which are most similar are combined.
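The signal-averaging argument above can be checked numerically. The sketch below is an illustration with made-up numbers rather than data from this study: it averages N noisy readings of a constant absorbance many times over and compares the scatter of those averages with the single-reading noise. For independent noise the standard deviation is expected to fall roughly as 1/√N. The noise level, the value of N, the number of trials, and the function name are all assumptions.

```python
import random
import statistics


def averaged_noise(n_spectra, sigma=0.01, trials=300, seed=0):
    """Standard deviation of the mean of n_spectra noisy readings."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.5, sigma) for _ in range(n_spectra))
        for _ in range(trials)
    ]
    return statistics.stdev(means)


single = averaged_noise(1)      # noise of a single reading
coadded = averaged_noise(100)   # noise after averaging 100 readings
print(f"noise reduction for N = 100: {single / coadded:.1f}x")
```

With N = 100 the printed ratio should be close to 10, which is the same gain expected from coadding 100 scans in a conventional spectrometer.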
Thus, spectra are obtained from this spectroscopic imaging modality that are comparable to those obtained using a conventional scanning spectrometer, in comparable acquisition times. Therefore, the primary disadvantage of spectroscopic imaging modalities, viz., their poor signal-to-noise properties, is circumvented. Simultaneously, the primary advantage of spectroscopic imaging, viz., access to spatial information, is enhanced by the unsupervised selection of regions of the image whose spectra are most similar. All of this can then be presented in an information-rich, graphical format which can clearly display which regions of the spectroscopic image contain which spectral features.
Additionally, the heterogeneity of chemical species within the field of view, as well as their identity, could be rapidly screened in a completely unsupervised fashion by matching the centroid spectra selected by the cluster analysis with compound identification libraries. Point defects in materials would be identified as outlier clusters, consisting of very few pixels which are highly localized spatially. In complex multicomponent systems, fuzzy C-means clustering would provide a superior means of distinguishing both constituent and compositional variations across the spectral image.
The use of fuzzy C-means clustering dramatically increases the information obtained from near-IR spectroscopic images and allows for the detection of small subregions of the image that contain novel and unanticipated spectral features, without the need for a priori knowledge of the chemical composition or spectral features of the sample. Furthermore, this data processing methodology need not be limited to this one application. This methodology could easily be extended to existing spectroscopic imaging modalities such as Raman microscopic imaging, mid-IR microscope sample mapping methodologies, astronomical LCTF spectroscopic imaging, or magnetic resonance spectroscopic imaging.
ACKNOWLEDGMENT
This work is issued as NRCC No. 34788.
Received for review February 20, 1997. Accepted June 5, 1997.X
AC970206R
X Abstract published in Advance ACS Abstracts, July 15, 1997.