Application of Fuzzy c-Means Clustering in Data Analysis of

May 1, 2009 - CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Da...
193 downloads 12 Views 2MB Size
Anal. Chem. 2009, 81, 4468–4475

Application of Fuzzy c-Means Clustering in Data Analysis of Metabolomics Xiang Li,† Xin Lu,† Jing Tian,†,‡ Peng Gao,† Hongwei Kong,† and Guowang Xu*,† CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian 116023, and Department of Modern Technology, Dalian Polytechnic University, Dalian 116034, China Fuzzy c-means (FCM) clustering is an unsupervised method derived from fuzzy logic that is suitable for solving multiclass and ambiguous clustering problems. In this study, FCM clustering is applied to cluster metabolomics data. FCM is performed directly on the data matrix to generate a membership matrix which represents the degree of association the samples have with each cluster. The method is parametrized with the number of clusters (C) and the fuzziness coefficient (m), which denotes the degree of fuzziness in the algorithm. Both have been optimized by combining FCM with partial least-squares (PLS) using the membership matrix as the Y matrix in the PLS model. The quality parameters R2Y and Q2 of the PLS model have been used to monitor and optimize C and m. Data of metabolic profiles from three gene types of Escherichia coli were used to demonstrate the method above. Different multivariable analysis methods have been compared. Principal component analysis failed to model the metabolite data, while partial least-squares discriminant analysis yielded results with overfitting. On the basis of the optimized parameters, the FCM was able to reveal main phenotype changes and individual characters of three gene types of E. coli. Coupled with PLS, FCM provides a powerful research tool for metabolomics with improved visualization, accurate classification, and outlier estimation. Multivariate analysis has been proven to be a useful and powerful tool for the analysis of complex and multidimensional data from metabolomics research.1,2 Large amounts of data are generated by investigating the metabolic response to pathophysiologic stimuli or genetic modification in body fluids, tissues, etc.3 As nuclear magnetic resonance spectroscopy (NMR)4,5 and mass * To whom correspondence should be addressed. Phone/fax: 0086-41184379559. E-mail: [email protected]. † Chinese Academy of Sciences. ‡ Dalian Polytechnic University. (1) Trygg, J.; Holmes, E.; Lundstedt, T. J. Proteome Res. 2007, 6, 469–479. (2) van der Greef, J.; Smilde, A. K. J. Chemom. 2005, 19, 376–386. (3) Nicholson, J. K.; Connelly, J.; Lindon, J. C.; Holmes, E. Nat. Rev. Drug Discovery 2002, 1, 153–161. (4) Brindle, J. T.; Antti, H.; Holmes, E.; Tranter, G.; Nicholson, J. K.; Bethell, H. W. L.; Clarke, S.; Schofield, P. M.; McKilligin, E.; Mosedale, D. E.; Grainger, D. J. Nat. Med. 2002, 8, 1439–1444. (5) Cloarec, O.; Dumas, M. E.; Craig, A.; Barton, R. H.; Trygg, J.; Hudson, J.; Blancher, C.; Gauguier, D.; Lindon, J. C.; Holmes, E.; Nicholson, J. Anal. Chem. 2005, 77, 1282–1289.

4468

Analytical Chemistry, Vol. 81, No. 11, June 1, 2009

spectrometry (MS)6-8 are most frequently used to analyze the metabolites, a data structure of many more variables than observations is usually gained in metabolomics. Thus, multivariate analysis is more necessary to extract useful information in the complex data set for classification as well as exploratory data analysis. Principal component analysis (PCA), partial least-squares (PLS), and partial least-squares discriminant analysis (PLS-DA) are now widely used in metabolomics data analysis. PCA compresses the raw data X (N × M) into a few principal components. N is the number of observations, and M is the number of variables (in this case metabolites). It is useful to visualize metabolomics data, gain classification information, and find outliers of the data.1 However, not all of the data collected in metabolomics can be decomposed by PCA into a few principal components, especially if the differences between groups are minor and obscured by other covariates. PLS utilizes response values, grouping information, or other metadata assigned to observations as predicted vector/matrix Y and extracts principal components explaining the maximum covariance between X and Y in the data model. If Y is a vector, the model is called PLS1, while in PLS2 Y is a matrix which we do not distinguish here. PLS-DA models the group membership using dummy values or contrasts for each observation. A typical problem of PLS/PLSDA is the possibility of overfitting.9 Since a predetermined group membership is assigned to each sample, the modeling result greatly relies on this input, which is usually based on sample information and estimation of the analyst. Thus, any misclassification and ambiguous estimation will distort the model and misguide the explanation of the data. Fuzzy c-means (FCM) clustering10 is one of the unsupervised methods for clustering based on fuzzy logic. In FCM, one sample can be assigned to more than one class instead of only one. The membership of each sample is calculated and then represented by a membership value between 0 and 1, instead of 0 or 1 in the hard clustering. Thus, fuzziness of the clustering tolerates ambiguity of the classification and directly correlates the informa(6) Weckwerth, W.; Loureiro, M. E.; Wenzel, K.; Fiehn, O. Proc. Natl. Acad. Sci. U.S.A. 2004, 101, 7809–7814. (7) Crockford, D. J.; Lindon, J. C.; Cloarec, O.; Plumb, R. S.; Bruce, S. J.; Zirah, S.; Rainville, P.; Stumpf, C. L.; Johnson, K.; Holmes, E.; Nicholson, J. K. Anal. Chem. 2006, 78, 4398–4408. (8) Dettmer, K.; Aronov, P. A.; Hammock, B. D. Mass Spectrom. Rev. 2007, 26, 51–78. (9) Defernez, M.; Kemsley, E. K. TrAC, Trends Anal. Chem. 1997, 16, 216– 221. (10) Bezdek, J. C.; Ehrlich, R.; Full, W. Comput. Geosci. 1984, 10, 191–203. 10.1021/ac900353t CCC: $40.75  2009 American Chemical Society Published on Web 05/01/2009

tion in X without artificial assignment. As only raw data are used and no supervised method is applied, the chance for overfitting is avoided in FCM. In former research, FCM has been proven to be a useful and efficient tool for data analysis. By incorporating a priori knowledge of the data within the clustering algorithm, FCM was shown to be efficient and reliable for classification of organic compounds using piezoelectric chemical sensor array data of 14 analytes.11 When other automated techniques failed, robust and repeatable results could be gained by fuzzy clustering of segment dynamic neuroreceptor SPET images, without inherent subjectivity of the traditional method which required definition of manual drawing regions of interest (ROI).12 FCM was also used to distinguish tumor tissue on the basis of its FT-IR maps.13 Both PCA and FCM had been used to explain the results from a multiarray chemical sensor (MACS) by visualization and quantitation of different tastes on a personal digital assistant (PDA). In addition, dimensional reduced data by PCA were quantitatively analyzed using the FCM algorithm for classifying taste patterns in the liquids.14 FCM could also be used in process monitoring in combination with PLS.15 Process drifts due to seasonal fluctuations could also be monitored by FCM, in which data dimensionality first had been reduced by PCA.16 29Si NMR data of calcium aluminosilicate glasses were clustered by both hard clustering and FCM. By optimizing the degree of fuzziness, valuable additional information on outliers could be gained by FCM results.17 In other research, FCM was used to cluster chemical compounds in combinatorial chemistry. A total of 627 alcohols were characterized by 50 semiempirical descriptors. The descriptors matrix was compressed by PCA first. In the FCM algorithm, the cluster number was predetermined and different fuzziness coefficients were used with two different distances. Validation of the cluster analysis was performed by PLS, in which the membership matrix was used as the Y matrix and the descriptors matrix was used as the X matrix. An optimal result could be gained by the method above.18 In our research herein, FCM was used to cluster three gene types of Escherichia coli on the basis of their metabolic profiles. The fuzziness coefficient and cluster number of FCM were both optimized by PLS, in which the X matrix contains the metabolite data and the Y matrix reflects the membership matrix of FCM. FCM clustered the samples successfully, revealing main phenotype changes in metabolic profiles. With a hypo-optimal result, the same clustering as reflected by the gene types was gained. Outliers could be easily distinguished, and variances within and between groups were revealed. Additionally, significantly changed (11) Barko´, G.; Abonyi, J.; Hlavay, J. Anal. Chim. Acta 1999, 398, 219–226. (12) Acton, P. D.; Pilowsky, L. S.; Kung, H. F.; Ell, P. J. Eur. J. Nucl. Med. 1999, 26, 581–590. (13) Richter, T.; Steiner, G.; Abu-Id, M. H.; Salzer, R.; Bergmann, R.; Rodig, H.; Johannsen, B. Vib. Spectrosc. 2002, 28, 103–110. (14) Kim, J.-D.; Byun, H.-G.; Kim, D.-J.; Ham, Y.-K.; Jung, W.-S.; Yoon, C.-O. Talanta 2006, 70, 546–555. (15) Teppola, P.; Mujunen, S. P.; Minkkinen, P. Chemom. Intell. Lab. Syst. 1998, 41, 95–103. (16) Teppola, P.; Mujunen, S. P.; Minkkinen, P. Chemom. Intell. Lab. Syst. 1999, 45, 23–38. (17) Ehrentreich, F.; Nofz, M.; Bartel, H.-G. Chemom. Intell. Lab. Syst. 1992, 15, 61–73. (18) Linusson, A.; Wold, S.; Norde´n, B. Chemom. Intell. Lab. Syst. 1998, 44, 213–227.

metabolites could be discovered on the basis of the FCM clustering. THEORY Different from hard clustering, in fuzzy clustering one sample can belong to more than one cluster. Considering a data matrix X (N × M), xkj is the jth metabolite’s intensity value in the kth sample. The number of metabolites is M. Initially, all of the N samples are assigned to predetermined C clusters. The FCM algorithm minimizes the weighted sum of distances dik between the kth sample and the ith cluster center vi, i.e., minimizes the generalized least-squared errors function:10 C

N

∑ ∑ (u

J(U, V) )

dik2

m

ik)

(1)

i)1 k)1

where dik2 ) xk - νi

| |

2

) (xk - νi)TA(xk - νi)

N

vi )

∑ (u

(2)

N

∑ (u

m ik) xk/

k)1

ik)

m

(3)

k)1

The Euclidean distance has most frequently been used, so the A matrix (M × M) in eq 2 is set to the identity matrix I.17 The degree of membership of the kth sample to the ith cluster is defined as the membership value (uik), which is between 0 and 1. A large membership value implies a high degree of membership. Membership values close to 1 mean that the sample belongs to the corresponding cluster. The sum of membership values of the kth sample for each of the clusters should satisfy the following equation: C

∑u

ik

) 1,

∀k,

0 e uik e 1

(4)

i)1

m (1 < m < ∞) in eqs 1 and 3 denotes the degree of fuzziness in the algorithm, which is called the fuzziness coefficient. When m moves toward infinity, the result of minimization leads to uik ) 1/C for all of the samples and cluster centers. That means all of the samples have equal membership values. When m is 1, eq 1 will degenerate to the objective function of hard clustering, in which membership values of all samples equal 0 or 1. 10 The initiation of the algorithm demands predetermination of the cluster number C, fuzziness coefficient m, and an initial membership matrix U which is randomly initialized here. The cluster centers vi are calculated by eq 3. New membership values are calculated by C

uik ) 1/

∑ j )1

( ) dik djk

2(m-1)

(5)

Iterative calculations based on eqs 2, 3, and 5 are then performed until the stop criterion is reached, i.e., minimization of eq 1. Since the important parameters C and m are predetermined, their optimization plays a crucial role in the clustering perforAnalytical Chemistry, Vol. 81, No. 11, June 1, 2009

4469

mance. PLS is applied to optimize the FCM parameters by using the membership matrix U as input to PLS. The results of FCM (the membership matrix U) are used as the predictor matrix Y in PLS. The data matrix clustered by FCM is further used as the X matrix in PLS. PLS extracts principal components of a data matrix X that explain the correlation with a predictor matrix Y; thus, the clustering results of FCM will be used to explain and predict the data matrix. The PLS model quality is evaluated by internal validation which computes the parameters R2Y and Q2. R2Y reflects the explained variance of the Y matrix, and Q2 represents the cross-validated predictive capability. When different settings of FCM parameters are tested, different clustering results will be modeled by PLS. These models give a serial of R2Y and Q2, indicating the fitness of the clustering. When R2Y and Q2 are maximal, the corresponding clustering results fit the data best. Thus, by comparing R2Y and Q2 of the PLS model for different settings and combinations of C and m, the optimal parameters could be defined. EXPERIMENTAL SECTION Sample Preparation. Three gene types of E. coli, DC101∆ackApta, DC100∆sdhAB, and E.coli1.1566 (wild type) were used in the research. DC100∆sdhAB was the mutant of deletion of the succinodehydrogenase (SDH) gene, and DC101∆ackA-pta was the acetate kinase gene knocked out mutant. Each of the types was planted in three parallel broth cultures to investigate biological variability. Three parallel extractions were taken from each culture for further derivatization and analysis. At the late log phase, 3 mL of culture solution was quenched by 12 mL of acetonitrile at -80 °C for 15 min. Then each mixture was centrifuged at 10000 rpm for 10 min to discard the supernatant. The remainder was extracted by 3.75 mL of -45 °C chloroform/methanol/water (2:2:1). The extracted supernatant (1 mL) was lyophilized and then dissolved in 50 µL of pyridine and derivatized by bis(trimethylsilyl)trifluoroacetamide (BSTFA) (60 µL, 75 °C; 45 min). All of the samples were analyzed within 48 h after derivatization. To be simple, the samples were nominated as follows: The first part in the name represents the gene type, “DC101” for DC101∆ackA-pta, “DC100” for DC100∆sdhAB, and “Wild” for E.coli1.1566. The second part in the name represents the serial number of the culture, and the third part represents the serial number of the extraction. For example, the second culture and third extraction for DC101∆ackA-pta would be written as DC1012-3. Experimental Conditions. Metabolic profiles of the samples were collected by a Leco Pegasus 4D GC × GC-TOFMS instrument (Leco Corp., St. Joseph, MI) equipped with an Agilent 6890N GC instrument. Column 1 was a 30 m × 250 µm × 0.25 µm DB-5 (J&W Scientific, Folsom, CA), and column 2 was a 1.6 m × 100 µm × 0.10 µm DB-1701 (J&W). Helium (99.9995%) was used as the carrier gas in constant pressure mode of 600 kPa. An Agilent 7683B autosampler (Agilent, Palo Alto, CA) injected 1 µL of the sample at a split ratio of 1:10. The column 1 oven began at 70 °C with a hold time of 5 min and then ramped at 8 °C/min to 270 °C and held for 10 min. The column 2 oven was at a constant 20 °C higher temperature than the column 1 oven, and had the same temperature program as the column 1 oven. The temperatures of the GC inlet and transfer line were 270 °C. The modulation period was 4 s. The solvent delay time was 6 min. The ion source 4470

Analytical Chemistry, Vol. 81, No. 11, June 1, 2009

was set to 220 °C. The detector voltage was 1650 V, and the electron energy was -70 V. Mass spectra of m/z 33-500 were collected at an acquisition rate of 50 spectra/s (50 Hz). Data Processing. Multivariate analysis methods, e.g., PCA, PLS/PLS-DA, and FCM, require a data matrix that represents each sample by a row and each metabolite as one column. Since the GC × GC-TOFMS instrument acquires a three-dimensional array for each sample, data reduction and transformation are necessary.19 Only data from m/z 73 were used in the data analysis, because the ion of m/z 73 was the common fragment ion of trimethylsilylated metabolites. Comma-separated value (CSV) files of the raw GC × GC-TOFMS data were exported by the Leco ChromaTOF software V.3.25 (Leco). Then the CSV files were converted into two-dimensional matrices and exported as a binary files by an in-house conversion program. Peak alignment between different chromatograms was performed on these binary files by an in-house software “GC × GC workstation”.20 A program written in Matlab (Mathworks, Natick, MA) was used to merge the peak tables generated by the GC × GC workstation into one data matrix. Total ion intensities (TIC) of each chromatogram value were calculated, and peak volumes in each chromatogram were normalized to the respective TIC value. PCA, PLS, and PLS-DA were performed by the SIMCA-P 11 version (Umetrics AB, Umeå, Sweden). FCM was implemented using the fuzzy logic toolbox in Matlab and programs written in-house. Mass spectral similarity searches were executed by NIST MS search 2.0 (NIST/EPA/ NIH Mass Spectral Library, NIST 05). RESULTS AND DISCUSSION A typical GC × GC-TOFMS chromatogram of m/z 73 of E.coli1.1566 (Wild) as extracted using the in-house software GC × GC workstation can be seen in Figure 1. The chromatograms of the other two gene types are similar. After peak alignment, peak table merging, and normalization, the final metabolite matrix included 27 samples and 888 metabolites. Since the metabolites in the samples had a great range of concentration, the influence of concentration in multivariate analysis needs to be considered. Therefore, before PCA, PLS (PLS-DA), and FCM were applied, the variables in data matrix X (N × M) were first scaled by range scaling as follows:

xij′ )

xij - min(xij) max(xij) - min(xij)

j ) 1, 2,..., M

(6)

After the range scaling, each variable was between 0 and 1, making the concentration ranges of metabolites comparable. Thus, the influence of the concentration range could be decreased. PCA and PLS-DA Results. PCA was first used to analyze the data. PCA explained variance of the X matrix (R2X) was 0.548 if three principal components were used for the model. This means only 54.8% of the information could be explained by the first three components. In the score plot of PCA (Figure 2A), the wild type was very different from the two gene knockout types, but DC101 and DC100 were overlaid with each other. This means (19) Pierce, K. M.; Hope, J. L.; Hoggard, J. C.; Synovec, R. E. Talanta 2006, 70, 797–804. (20) Qiu, Y. Q.; Lu, X.; Pang, T.; Zhu, S. K.; Kong, H. W.; Xu, G. W. J. Pharm. Biomed. Anal. 2007, 43, 1721–1727.

Figure 1. A typical GC × GC-TOFMS chromatogram of m/z 73 of E.coli1.1566 from the in-house software GC × GC workstation.

that the three gene types could not be distinguished very well on the basis of the results of PCA. Thus, PCA did not have enough ability to model the data for investigating the metabolic differences in the three gene types. Because genotype information of the samples was known, it was convenient to apply PLS-DA. PLS-DA modeled the data similarly to PLS but with a different setting of the Y matrix. The group assignment of each sample was coded by three columns in the PLS-DA Y matrix, namely, (1, 0, 0), (0, 1, 0), and (0, 0, 1), which represent gene types DC101, DC100, and Wild, respectively. The result of PLS-DA was much better than that of PCA. A three-component model was finally selected on the basis of the R2Y and Q2 values: cumulative R2Y and cumulative Q2 values should be large, while Q2 of each new component should be larger than a significance limit.21 R2Y and Q2 were 0.921 and 0.801, respectively, indicating good correlation and crossvalidated predictive capability. The score plot of PLS-DA is shown in Figure 2B. Three gene types of E. coli could be differentiated in the score plot, with the wild type farther away from the other two. Relative distances between the groups in the score plot indicated the relative degree of metabolic differences. This implied that the wild type was metabolically different from the other two types, also showing a smaller within-group variance. One of the samples, DC101-2-3, was far away from the DC101 cluster in the score plot while being close to three samples of DC100-1. DC100-1 was also distant from the DC100 cluster while showing little variation between replicates. It was not possible to determine whether only DC101-2-3 or both DC100-1 and DC1012-3 were outliers. Overfitting Test. Although PLS-DA could model the data and classify the samples quite well, the possibility of overfitting could not be neglected. Former research revealed that overfitting could distort real group relationships and impair the reliability of multivariable analysis.9 Therefore, it was important to validate the model and avoid overfitting. Cross-validation22 is widely used in model validation, e.g., to determine the number of components. (21) Eriksson, L.; Johansson, E.; Kettaneh-Wold, N.; Trygg, J.; Wikstro¨m, C.; Wold, S. Multi- and megavariate data analysissprinciples and applications; Umetrics AB: Umea˚, 2006. (22) Wakeling, I. N.; Morris, E. J. J. Chemom. 1993, 7, 291–304.

However, using Q2 solely is not proper since the influence of background correlation23 should be considered. Background correlation means chance correlation between variable matrix X and response Y vectors. The permutation test21,23 and randomization test24,25 can be used for model validation and PLS component selection. Both methods set up a series of models based on randomly scrambled Y vectors. Thus, we used only the permutation test to monitor overfitting. Each of the three PLSDA Y vectors has one permutation plot as shown in Figure 2C. The intercepts of the plots show the background correlation. The R2 intercept (R2-int) and Q2 intercept (Q2-int) reflect the average R2 and Q2 values of PLS-DA/PLS, respectively, when the random Y vectors (having no relationship with the real Y) are used as predictive vectors. The R2-int values of the scrambled Y vectors of DC101, DC100, and Wild were 0.566, 0.545, and 0.515, respectively; the Q2-int values of the scrambled Y vectors of DC101, DC100, and Wild were -0.306, -0.271, and -0.333, respectively. Former research suggested that a near-zero slope and a high intercept of the regression line in the permutation plot indicate the possibility of overfitting and extra needed principal components.23 It had also been recommended that the R2 intercept should be less than 0.3-0.4, and the Q2 intercept less than 0.05.21 With the above criteria, the Q2 intercepts of our model were all less than 0.05 while the R2 intercepts were all larger than 0.4, indicating evident overfitting.21 Despite the model’s large R2Y value and good cross-validation results (Q2 value), the model was not absolutely reliable. The clustering trend generated by PLS-DA might be partially caused by the setting of Y vectors, which enlarged the “differences” and noises of the data to adapt the assumptive classification given in the Y vectors. Parameter Optimization of the FCM. FCM was then used to model the data and classify the samples on the basis of the metabolic profiles. After the range scaling, the metabolite matrix (23) Lindgren, F.; Hansen, B.; Karcher, W.; Sjo ¨stro ¨m, M.; Eriksson, L. J. Chemom. 1996, 10, 521–532. (24) Wiklund, S.; Nilsson, D.; Eriksson, L.; Sjo¨stro ¨m, M.; Wold, S.; Faber, K. J. Chemom. 2007, 21, 427–439. (25) Faber, N. M.; Rajko´, R. Anal. Chim. Acta 2007, 595, 98–106.

Analytical Chemistry, Vol. 81, No. 11, June 1, 2009

4471

Figure 3. Optimizations of the fuzziness coefficient (m) and cluster number (C) by PLS model parameters R2Y and Q2. (A) Fixed C to 3 and changed m from 1.1 to 1.7. The number of components in all of the PLS models is 3. (B) Fixed m to 1.4 and changed C from 2 to 6. The number of components in the model is as follows: 3 when C ) 2 and 3, 4 when C ) 4, 4 when C ) 5, and 6 when C ) 6.

Figure 2. (A) PCA score plot of modeling the data with three principal components. PLS-DA results of modeling the data: (B) score plot of PLS-DA, (C) permutation test plot of PLS-DA on the first Y vector.

was sent to the FCM program as mentioned in the section “Theory”. The membership matrix U calculated by FCM was used as the predictive matrix in PLS. Model parameters of R2Y and Q2 were used to choose the number of PLS components as described above. Since the permutation test is not applicable for FCM (see ref 23), only cross-validation was used to test for overfitting. 4472

Analytical Chemistry, Vol. 81, No. 11, June 1, 2009

Different settings of C and m values in FCM were tested to optimize the model on the basis of the PLS quality parameters. As the samples could be classified into three groups according to their gene types, first the cluster number was fixed to 3 and different m values were tested. The number of components in all of the PLS models is 3. From the plot of the R2Y and Q2 values (Figure 3A), it was clearly visible that the R2Y and Q2 values increased rapidly with increasing m value until m was about 1.5. After that the change was only minimal. The increased explanative ability and predictive capability of the model showed that FCM could fit the data better with a higher degree of fuzziness. When the membership values uij with m g 1.5 were investigated, the uij values of two of the clusters were identical, meaning that they completely overlap, which indicates that the process is out of control. This was also the reason why R2Y and Q2 changed little after m g 1.5. To avoid out-ofcontrol situations, the m value was chosen as 1.4. Then we fixed the m value to optimize the cluster number C. Five different cluster numbers were investigated; see Figure 3B. Q2 decreased rapidly when the cluster number increased, and R2Y did not evidently change. This indicates that although the explanative ability of FCM for the raw data was not influenced greatly with the redundant cluster center, the predictive capability declined prominently. It was also observed that, with evidently redundant clusters (C ) 5, 6), the membership values of one sample were near 0 in most clusters; thus, the clustering was near hard. Besides above, the number of components in the model increased from 3 (when C ) 2 and 3) to 6 (when C ) 6). The dimensionality of the model increased when C became larger, implying the possibility of overfitting increased.9 This indicates that when an unreasonable cluster number is

Figure 4. Results with a fuzziness coefficient of 1.4 and a cluster number of 2: (A) scatter plot of membership values in class 1, (B) scatter plot of membership values in class 2. The samples were arranged as follows: 1-9 for DC101-1-1 to DC101-3-3, 10-18 for DC100-1-1 to DC100-3-3, and 19-27 for Wild-1-1 to Wild-3-3. To compare the classification results of FCM with real gene types, three gene types were partitioned by dashed lines.

used, the algorithm could still converge to fit the data itself, but provided poor predictive performance and became overfitting, giving an indicator for selecting the number of clusters. Furthermore, when m was optimized, the number of components in the model did not change, but it changed greatly when C was optimized. Q2 changed from 0.952 for C ) 2 to 0.713 for C ) 6 (Figure 3B), and it changed from 0.843 for m ) 1.1 to 0.954 for m ) 1.7 (Figure 3A). Considering the ranges of changes above, FCM seemed more sensitive to the setting of the cluster number C of the data in metabolomics. FCM Classification. A fuzziness coefficient of 1.4 and a cluster number of 2 gave the best result of the PLS model. The membership values of two classes are shown as scatter plots in Figure 4. A membership value close to 1 means that it belongs to the corresponding class. Therefore, the gene type Wild belonged to class 1, and gene types DC101 and DC100 belonged to class 2. DC101 and DC100 could not be distinguished within these two classes. Similar to the scatter plots of the membership values, partial overlap of DC101 and DC100 occurred in the score plot of PCA and they could be evidently distinguished from Wild (Figure 2A). This implies that phenotype changes between wild and gene knockout types were the main differential factor in metabolic profiles of the three E. coli types. It was suggested that the metabolites in the two types of gene knockout E. coli have some overall analogical changes when they are compared with the wild

Figure 5. Results with a fuzziness coefficient of 1.4 and a cluster number of 3: (A) scatter plot of membership values in class 1, (B) scatter plot of membership values in class 2, (C) scatter plot of membership values in class 3. The samples were arranged as in Figure 4. To compare the classification results of FCM with real gene types, three gene types were partitioned by dashed lines.

type. Furthermore, the similar clustering trends in PCA and FCM suggested different unsupervised methods may reach the same conclusion. Although the optimal cluster number of FCM was different from the sample information (i.e., gene types in our research), it was interesting to investigate the hypo-optimal result, whose cluster number was 3 and fuzziness coefficient was 1.4. In the scatter plots of membership values (Figure 5), the samples from Wild had membership values of near 1 in class 1 and near 0 in class 2 and class 3; thus, the gene type Wild belonged to class 1. DC101 and DC100 mainly belonged to class 3 and class 2, respectively (membership values were more than 0.5). It should be noticed that membership values of samples from DC100 Analytical Chemistry, Vol. 81, No. 11, June 1, 2009

4473

Table 1. Significantly Changed Metabolites Identified by PLS with the Membership Matrix as Y metabolite

Rt,1/s

Rt,2/s

VIP

[DC101]a

[DC100]

[Wild]

stearic acid, 1-monoglyceride*# lactic acid*# unknownb *# unknown*# unknown*# unknown*# unknown*# C6 sugard *# unknown*# uridine derivative*# talonic acid, lactone*# mannonic acid, lactone*# unknown*# myristic acid*# unknown# D-arabinonic acid, lactone*# unknown*# galacturonic acid*# tryptamine*# galactose derivative#

1976 512 1112 1204 1456 768 784 1012 1308 1716 1760 1832 1192 1324 1720 1304 1832 2108 1312 1512

1.48 1.02 1.42 1.26 1.20 1.14 1.04 2.94 3.1 1.54 1.06 1.70 1.72 1.02 0.98 1.12 2.14 1.44 0.92 1.26

3.41 3.24 3.13 3.12 2.90 2.89 2.89 2.88 2.84 2.83 2.82 2.82 2.80 2.79 2.76 2.76 2.76 2.68 2.61 2.57

-c 0.81 ± 0.00 67.67 ± 25.61 7.35 ± 1.46 5.08 ± 0.00 2.58 ± 0.05 1.41 ± 0.91 2.11 ± 0.00 8.77 ± 5.89 1.16 ± 0.35

3.28 ± 0.00 4.40 ± 0.00 1.43 ± 0.00 0.73 ± 0.15 16.52 ± 3.12 1.72 ± 0.40 0.74 ± 0.13 2.98 ± 0.89 -

5.03 ± 0.25 0.61 ± 0.05 10.00 ± 1.97 1.22 ± 0.20 2.97 ± 0.55 103.4 ± 5.68 18.00 ± 1.68 4.39 ± 0.44 1.40 ± 0.23 1.90 ± 0.24 1.06 ± 0.08 28.21 ± 2.55 1.44 ± 0.10 3.56 ± 0.35 2.17 ± 0.24 5.04 ± 0.17 5.57 ± 1.48 2.47 ± 0.43 17.00 ± 2.40 1.31 ± 0.02

a The relative peak volumes were normalized to the TIC of each chromatogram in the corresponding gene type. All of the relative peak volumes shown in the table were multiplied by 103. Statistical significance: (*) p < 0.05 for DC101 and wild types, (#) p < 0.05 for DC100 and wild types. b When the similarity in NIST MS search 2.0 of the peak was less than 750, the corresponding metabolite was considered as unknown. c Not detected, meaning the signal was lower than the limit of detection. d Metabolite identified as a C6 sugar, while the similarity was not more than 750.

were not more than 0.8 in class 2, and they were more than 0.1 in class 3. DC101 also had a situation similar to that of DC100. This revealed that DC101 and DC100 did not belong to class 2 or class 3 clearly and completely, and the boundary of the two classes was fuzzy. DC101 belonged to class 2 partly, and DC100 also belonged to class 3 partly. They were different from the wild type, which almost absolutely belonged to class 1. The fuzzy boundary of clustering in DC101 and DC100 revealed that these two types had some similarity in metabolites, and both of them were very different from the wild type. This deduction was also consistent with the optimal result of FCM above. It was clear that DC101-2-3 was different from other samples of the DC101 type. It had low membership values (less than 0.5) in both class 2 and class 3. Therefore, it could not be clustered according to the membership values and should be considered as an outlier. Furthermore, culture DC100-1 was different from DC100-2 and DC100-3, showing culture-to-culture variances. The membership values of the samples in DC100-1 were all larger than 0.5 in class 2 and smaller than 0.4 in class 3, similar to the other DC100 cultures; thus, they should not be considered as outliers. Compared with PLS-DA (Figure 2B), in which it was not possible to determine whether DC100-1 was an outlier, FCM was more suitable and sensitive for outlier estimation with the data herein. Not considering the outlier, the results of hypo-optimal FCM were almost the same as the real gene types; i.e., three classes of FCM clustering represented three gene types, respectively. In metabolomics research, supervised methods usually set a dummy matrix as the Y matrix, e.g., in PLS-DA or PLS. The dummy matrix is set on the basis of sample information or group information, possibly incorporating an artificial partitioning. Considering that the values of Y vectors within one group are identical, using group information may distort the real relationships in the samples, namely, enlarge the differences between “groups” (which are estimated by the analyst) and reduce the differences within 4474

Analytical Chemistry, Vol. 81, No. 11, June 1, 2009

“groups”. As an unsupervised method, FCM avoids this bias, revealing the real relationships between samples without artificial judgments. With the concept of fuzziness, ambiguous grouping can be tolerated. FCM could also be optimized on the basis of straightforward parameters. As the results above showed, optimal FCM clustering revealed the dominant differences among three gene types: the phenotype changes between wildness and gene knockout types. In addition, using the hypo-optimal FCM clustering results, the metabolite changes between gene knockout and the wild type could also be identified without bias by hard cluster assignment. Application of FCM in Discovering Significantly Changed Metabolites. Compared with the wild type of E. coli, many metabolites in the gene knockout types have been changed. Thus, it is necessary to discover the most significantly changed metabolites for investigating the influences of gene knockout on the metabolic pathways. The significantly changed metabolites could be identified on the basis of the clustering results of FCM combined with PLS. The variables whose VIP (variable importance in the projection)21 values of the PLS model were in the top 20 and within 95% confidence in the VIP plot were selected as candidates for significantly changed metabolites. After exclusion of variables whose standard deviations were larger than the mean values, the remainder were considered as the most significant and then searched by NIST MS search 2.0. The results are listed in Table 1. The 10 most significantly changed metabolites were positively identified. The results of a paired t test are also shown in Table 1. Most of the metabolites discovered by FCM were significantly different between DC101/DC100 and Wild on a 0.05 significance level. These altered metabolites could reflect the most important variances of metabolic pathways in gene knockout types. Detailed investigations and explanation of these variances will be implemented in further research. Unsupervised methods, such as PCA, sometimes have a low ability of data explanation, as in our experiment. Using supervised

methods (e.g., PLS or PLS-DA) based on group information, artificial partitioning of groups or ambiguous grouping might lead to false modeling or overfitting, either of which will distort data analysis. The method we provided here combines unsupervised (FCM) and supervised (PLS) methods. The important parameters of FCM can be optimized by a straightforward approach, and the results of FCM can also be visualized conveniently. As an unsupervised method, overfitting is limited in FCM. Followed by PLS, the method can also give information of significantly changed metabolites which cannot be gained by FCM itself. The obtained significantly changed metabolites reflect the changes of metabolites by using more accurate classification of the FCM, so they will be more objective and informative. It can be expected FCM can be used in metabolomics, especially when hard clustering fails. In the meantime, FCM will play a very important role in the metabolomics data analysis of drug administration or time trajectory issues involving nonlinear measurements and relationships. CONCLUSIONS A method combining FCM with PLS is developed to deal with metabolomics data objectively and effectively. FCM is demonstrated to be a powerful tool to reveal details of variances within

and between groups, without the problem of artificial estimation and false classification. Detailed information of metabolic changes can be found through significantly changed metabolites discovered on the basis of fuzzy clustering. The flexibility of FCM, embodied both in optimization and in fuzziness, makes the method superior in dealing with multiclass and ambiguous grouping problems. Further applications of the method will be attempted, for example, in disease typing, pharmaceutical metabolism, time trajectory problems, etc. ACKNOWLEDGMENT These studies have been supported by the National Basic Research Program (Grants 2004CB719605 and 2007CB707802) and “863” project (Grants 2006AA02Z240 and 2006AA02Z342) of the State Ministry of Science and Technology of China and Grants 20775078 and 20776029 from the National Natural Science Foundation of China.

Received for review February 15, 2009. Accepted April 7, 2009. AC900353T

Analytical Chemistry, Vol. 81, No. 11, June 1, 2009

4475