Anal. Chem. 2008, 80, 7562–7570
Analysis of Metabolomic Data Using Support Vector Machines

Sankar Mahadevan,† Sirish L. Shah,† Thomas J. Marrie,‡ and Carolyn M. Slupsky*,‡
Department of Chemical and Materials Engineering and Department of Medicine, University of Alberta, Edmonton, Canada

Metabolomics is an emerging field providing insight into physiological processes. It is an effective tool for investigating disease diagnosis or conducting toxicological studies by observing changes in metabolite concentrations in various biofluids. Multivariate statistical analysis is generally employed with nuclear magnetic resonance (NMR) or mass spectrometry (MS) data to determine differences between groups (for instance, diseased vs healthy). Characteristic predictive models may be built from a set of training data and subsequently used to predict whether new test data fall under a specific class. In this study, metabolomic data were obtained by 1H NMR spectroscopy on urine samples from healthy subjects (male and female) and from patients infected with Streptococcus pneumoniae. We compare the performance of traditional PLS-DA multivariate analysis with that of support vector machines (SVMs), a technique widely used in genome studies, on two case studies: (1) a case where nearly complete distinction may be seen (healthy versus pneumonia) and (2) a case where distinction is more ambiguous (male versus female). We show that SVMs are superior to PLS-DA in both cases in terms of predictive accuracy with the fewest features: with fewer features, SVMs yield a better predictive model than PLS-DA.
Metabolomics can be defined as the field of science that deals with the measurement of metabolites in an organism for the study of physiological processes and their responses to various stimuli such as infection, disease, or drug use.1 The field of metabolomics has shown great promise for early diagnosis of diseases and for preclinical screening of candidate drugs in the pharmaceutical industry.2 To carry out such studies, analytical techniques such as NMR spectroscopy and mass spectrometry are combined with statistical techniques such as multivariate analysis and machine learning tools.1 In general, metabolomic NMR data consist of observations (study subjects) with associated variables (bins, metabolites, etc.).

* To whom correspondence should be addressed. E-mail: cslupsky@ualberta.ca. † Department of Chemical and Materials Engineering. ‡ Department of Medicine.
(1) Nicholson, J.; Lindon, J.; Holmes, E. Xenobiotica 1999, 29, 1181–1189.
(2) Lindon, J.; et al. Toxicol. Appl. Pharmacol. 2003, 187, 137–146.

The main task in the analysis of this type of data is to extract meaningful information and hopefully facilitate an understanding
of the complex biological processes. In general, for metabolomic and other such chemometric data, the distinguishing factors between any two classes (for instance, diseased versus healthy) are a combination of variables or features rather than a single variable. Thus, multivariate analysis, using techniques such as principal components analysis (PCA),3,4 clustering analysis, or partial least-squares-discriminant analysis (PLS-DA),5 is necessary to discriminate between different classes of observations. In general, there are two types of multivariate analysis techniques: unsupervised, such as PCA, and supervised, such as PLS-DA.6 Supervised multivariate analysis uses class information to build predictive models. Machine learning algorithms are a more recent class of multivariate analysis techniques. These algorithms can be trained to learn rules and form patterns from the input data and subsequently be applied to analyze new data. Machine learning tools often provide as good or better classification than PCA or PLS-DA. In particular, Bayesian classifiers, hidden Markov models, and evolutionary algorithms such as genetic algorithms, simulated annealing, and particle swarm optimization7 have been used to efficiently and effectively interpret biological data such as microarray databases, protein structures, and networks. In this work, the applicability of one such machine learning technique, namely support vector machines (SVMs), is explored to analyze and classify metabolomic data. SVMs are known to have excellent generalization abilities compared to other statistical multivariate methods such as PCA or PLS-DA. Unlike PCA and PLS-DA, which assume linearity, SVMs can be extended to nonlinear cases with the help of kernels. For nearly a decade, SVMs have been used in the field of bioinformatics for classifying and evaluating gene expression microarray data8-11 and identifying
(3) Holmes, E.; Antti, H. Analyst 2002, 127, 1549–1557.
(4) Holmes, E.; Nicholson, J.; Nicholls, A.; Lindon, J.; Connor, S.; Polley, S.; Connelly, J. Chemom. Intell. Lab. Syst. 1998, 44, 245–255.
(5) Keun, H.; Ebbels, T.; Antti, H.; Bollard, M.; Beckonert, O.; Holmes, E. Anal. Chim. Acta 2003, 490, 265–276.
(6) Eriksson, L.; Johansson, E.; Wold, H.; Wold, S. Introduction to Multi and Megavariate Analysis using Projection Methods (PCA and PLS); Umetrics Academy: Sweden, 1999.
(7) Baldi, P.; Brunak, S. Bioinformatics: The Machine Learning Approach, Second Edition (Adaptive Computation and Machine Learning); The MIT Press: Cambridge, MA, 2001.
(8) Furey, T.; Cristianini, N.; Duffy, N.; Bednarski, D.; Schummer, M.; Haussler, D. Bioinformatics 2000, 10, 906–914.
(9) Mukherjee, S.; Tamayo, P.; Mesirov, J.; Slonim, D.; Verri, A.; Poggio, T. Support vector machine classification of microarray data, 1999.
(10) Brown, M.; Grundy, W.; Lin, D.; Cristianini, N.; Sugnet, C.; Furey, T.; Ares, J.; Haussler, D. Proc. Natl. Acad. Sci. U.S.A. 2000, 97, 262–267.
(11) Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Machine Learning 2002, 46, 389–422.
10.1021/ac800954c CCC: $40.75 © 2008 American Chemical Society. Published on Web 09/04/2008.
protein homologies.12 SVMs are a relatively new supervised machine learning technique for the classification of data. The foundation for SVMs was laid in 1982 by Vapnik,13 and the method was formally proposed in 1992.14 SVMs have seen great success in handwritten digit recognition of US postal codes,15 as they are quite robust when it comes to handling noisy data and are generally not susceptible to the presence of outliers. The basic principle of SVMs, which are essentially binary support vector classifiers, is as follows: given a set of data with two classes, an optimal linear classifier is constructed in the form of a hyperplane which has the maximum margin (the simultaneous minimization of the empirical classification error and maximization of the geometric margin). The margin of a classifier is defined as the width up to which the boundary can be extended on both sides before it hits one of the data points. The points onto which this margin hits are called the "support vectors". In the case of data sets that are not linearly separable, the original data is mapped into a higher dimensional feature space and a linear classifier is constructed in this feature space (this is known as the "kernel trick"), which is equivalent to constructing a nonlinear classifier in the original input space. This mapping is implicitly given by the kernel function. Consider a training data set x_i ∈ R^n, i = 1, ..., m, where each of the x_i falls in one of the two categories y_i ∈ {-1, 1}. For this case, SVMs determine the hyperplane, the parameters of which are given by (w, b), as obtained from the solution of the following convex optimization problem:14
min_{w,b,ξ}  (1/2) wᵀw + c ∑_{i=1}^{m} ξ_i        (1)

subject to  y_i(wᵀx_i + b) ≥ 1 − ξ_i,  ξ_i ≥ 0

Here c is the regularization parameter, which trades off the training accuracy against the prediction term, and ξ_i, known as the slack variable, is a measure of the number of misclassifications. The inclusion of the regularization term reduces the problem of overfitting, which is the fundamental principle of soft margin classifiers.16 The optimization problem in eq 1 can be solved. The expression for the maximum margin classifier involves the term 〈x_i, x_j〉 (inner product). Let φ be the map that takes the data to the feature space; the expression for the maximum margin classifier will then involve 〈φ(x_i), φ(x_j)〉. Next, we define a kernel function K such that K(x_i, x_j) = φ(x_i) · φ(x_j). This kernel function is all that is required in the training phase to solve the optimization problem, thus removing the need for explicit knowledge of the mapping function φ.

(12) Jaakkola, T.; Diekhans, M.; Haussler, D. Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology; AAAI Press: Heidelberg, Germany, 1999.
(13) Vapnik, V. Estimation of Dependences Based on Empirical Data; Springer Verlag: New York, 1982.
(14) Boser, B.; Guyon, I.; Vapnik, V. Proc. 5th Annu. ACM Workshop Comput. Learn. Theory 1992, 144–152.
(15) Bottou, L.; Cortes, C.; Denker, J.; Drucker, H.; Guyon, I.; Jackel, L.; LeCun, Y.; Muller, U.; Sackinger, E.; Simard, P.; Vapnik, V. Proc. 12th IAPR Int. Conf. Pattern Recognition 1994, 2, 77–82.
(16) Cortes, C.; Vapnik, V. Machine Learn. 1995, 20, 273–297.

The
commonly used kernel functions are as follows: (1) linear kernel, x_iᵀx_j; (2) polynomial kernel, (γ x_iᵀx_j + constant)^d, γ > 0, where d is the degree of the polynomial; (3) radial basis function kernel, exp(−γ||x_i − x_j||²); (4) sigmoidal kernel, tanh(γ x_iᵀx_j + constant). The appropriate parameters for the different kernels are chosen by performing a grid search. For a radial basis function (rbf) kernel, the values of γ and the regularization parameter c are varied; for a polynomial kernel, γ, c, and d are varied; while for the linear kernel only the value of c is varied. The use of a sigmoidal kernel is generally avoided, as it does not obey Mercer's theorem (which any kernel function should satisfy) for all values of γ. A detailed account of SVMs may be found in the literature.17 SVMs were originally developed for handling only binary class problems. However, most practical real-world problems are multiclass, and solving these is much harder than solving binary class problems. Several algorithms have been proposed for solving multiclass problems.18-21 Once classification is achieved, it is important to identify the discriminating features that cause the categorization, to enable an in-depth understanding of the disease mechanisms or pathways involved in the case of metabolomic data. This can be achieved through feature selection, which involves identifying the optimum subset of the variables in the data set that gives the best separation or classification accuracy. Feature selection algorithms in machine learning can be broadly classified into two categories: (1) the filter approach and (2) the wrapper approach. The former is independent of the actual classifier algorithm and is mainly based on a ranking system; univariate correlation scores such as Fisher scores fall under this category. In the wrapper method, feature selection is done in conjunction with the training phase.
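The grid search over kernel parameters described above can be sketched as follows. This is a minimal illustration using scikit-learn rather than the LIBSVM/MATLAB tooling used in this work; the data and the log2 grid ranges are stand-in assumptions, not the paper's actual settings.

```python
# Hedged sketch of kernel-parameter selection: vary gamma and the
# regularization parameter c (called C below) on a log2 grid for an rbf
# kernel, and keep the pair with the best cross-validation accuracy.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, (30, 5)),   # class -1 (synthetic)
               rng.normal(+1.0, 1.0, (30, 5))])  # class +1 (synthetic)
y = np.array([-1] * 30 + [1] * 30)

grid = {"gamma": [2.0 ** k for k in range(-13, -2, 2)],  # 2^-13 ... 2^-3
        "C":     [2.0 ** k for k in range(1, 12, 2)]}    # 2^1  ... 2^11
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=3).fit(X, y)
print(search.best_params_)   # winning (gamma, C) pair
```

The same pattern extends to a polynomial kernel by adding `degree` to the grid, or reduces to varying only `C` for a linear kernel.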
A subset of the variables is chosen, and the performance of the classifier is evaluated on this subset. The subset of variables that gives the best classifier performance is chosen for the final analysis. Detailed reviews of feature selection methods may be found in the literature.22,23 SVM recursive feature elimination (SVM-RFE) is a wrapper approach that uses the norm of the weights w to rank the variables. Initially, all the data are taken and a classifier is computed. The norm of w is then computed for each of the features, and the feature with the smallest norm is eliminated. This process is repeated until all the features are ranked. A more elaborate version of this algorithm is described by Guyon et al.11

(17) Burges, C. J. Data Min. Knowledge Discovery 1998, 2, 121–167.
(18) Krebel, U. Pairwise Classification and Support Vector Machines. In Advances in Kernel Methods-Support Vector Learning; Scholkopf, B., Burges, C., Smola, A., Eds.; MIT Press: Cambridge, MA, 1999.
(19) Lee, Y.; Lin, Y.; Wahba. J. Am. Stat. Assoc. 2004, 99, 67–81.
(20) Crammer, K.; Singer, Y. J. Machine Learn. Res. 2001, 2, 265–292.
(21) Weston, J.; Watkins, C. Multi-class support vector machines, 1998.
(22) Kohavi, R.; John, G. Artif. Intell. 1997, 97, 273–324.
(23) Chen, Y.; Lin, C. Combining SVMs with various feature selection strategies. In Feature Extraction, Foundations and Applications; Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L., Eds.; Physica-Verlag, Springer: NTU, Taiwan, 2006.

Variable importance to projection (VIP) is a measure of the importance of the terms in the model (i.e., the variables in the data) with respect to their correlation with the responses (y). VIP values are usually computed from all of the extracted components. VIP is computed as follows:
VIP_kp = p ∑_{i=1}^{k} w_ip²(SSY_{i−1} − SSY_i) / (SSY_0 − SSY_k)
where k is the number of extracted components, p is the dimension of the system, SSY_i is the cumulative sum of squares explained by the ith component, and w_ip² is the squared PLS weight of that term. In this work, we show how SVMs can be used with metabolomic data (using both spectral binning and targeted profiling) and how they perform better than the techniques currently in use (PCA and PLS-DA). We also compare the performance of a wrapper approach (SVM recursive feature elimination) with the VIP (variable importance to projection) approach of PLS-DA.6

METHODS

Populations: Written informed consent was obtained from each subject before entering this study, and the institutional ethics committees approved the protocols outlined below.

Healthy volunteers: For the comparison of gender, 30 male and 30 female subjects aged 19-69 years who self-identified as healthy volunteered to participate in this study as previously described.24 Multiple urine samples were taken, either as a first void or at 5 p.m., over four nonconsecutive days from each subject, for a total of 352 samples, of which 194 were from females and 158 from males. It should be noted that the genders were matched in terms of age and other metadata parameters such as diet and alcohol consumption; details of the age-matching statistical analysis are included in the Supporting Information. Hence, these metadata parameters do not have any impact on the classification between the two groups. For the comparison between healthy volunteers and the pneumonia group, one urine sample obtained from each of 59 subjects was used.

Pneumonia volunteers: A total of 59 patients infected with Streptococcus pneumoniae, as determined through cultures of blood, sputum, cerebrospinal fluid, bronchoalveolar lavage, endotracheal tube secretions, ascites, or a combination of any of these, constituted the pneumococcal group. Information regarding the metadata parameters for the pneumonia volunteers was not available.
However, it is likely that most of the patients were on analgesics and antibiotics.

Sample handling: Upon acquisition of urine samples, sodium azide was added to a final concentration of approximately 0.02% to prevent bacterial growth. Urine was stored frozen at -80 °C until sample preparation and data acquisition. Urine samples were prepared for NMR analysis as described previously.24

NMR Spectral Data. NMR spectra were acquired using standard techniques on a Varian 600 MHz spectrometer.24 Standard preprocessing steps such as phasing and baseline correction were applied to each spectrum. Standard binning was performed with a bin width of 0.025 ppm between 0.2 and 10.0 ppm, with the regions 4.65-4.86, 5.5-6.0, and 7.33-7.39/8.34-8.44 ppm excluded to account for water, urea, and imidazole, respectively, and integrals were measured.

(24) Slupsky, C.; Rankin, K.; Wagner, J.; Fu, H.; Chang, D.; Weljie, A.; Saude, E.; Lix, B.; Adamko, D.; Shah, S.; Greiner, R.; Sykes, B.; Marrie, T. Anal. Chem. 2007, 79, 6995–7004.

Each
integral region was then normalized to the total area. Standard binning reduced the dimension of the system to 360 bins (variables).

Concentration Data. Concentration data were obtained by quantification of known urinary components (metabolites) in the NMR spectra by fitting the NMR peaks using Chenomx NMRSuite 4.0 (Chenomx Inc., Edmonton, Canada). Peak areas reflect the concentration of each particular compound. Calibration is done a priori for all compounds in the database with respect to a known concentration of the reference compound (DSS). Unlike spectral data, this type of data is of lower dimension, since the number of metabolites quantified is on the order of 100. The precision and accuracy of the metabolite concentrations thus determined, as well as the repeatability of sample preparation, have already been demonstrated in earlier work.24 The main advantage of performing the metabolomic analysis on concentration data is that we are able to preselect the metabolites that are important and remove unwanted ones (metabolites related to analgesics and antibiotics were not included in the controls-pneumonia case study). However, it should be noted that this work does not focus on providing a diagnostic for detecting a particular phenotype; its main focus is to introduce support vector machines as an efficient tool for metabolomic analysis. An inherent assumption of both analyses was that none of the samples were outliers.

Data Analysis. Multivariate data analyses (PLS-DA and SVMs) were performed on binned NMR spectral data that had been mean centered and normalized with respect to the total area, using MATLAB. The MATLAB code used to generate the results is provided in the Supporting Information as a downloadable zip archive. Targeted profiling data were obtained after measuring the concentrations of metabolites in the samples.
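The preprocessing just described (total-area normalization of the bins followed by mean centering) can be sketched in a few lines of numpy. The data here are random stand-ins for the 360-bin urine spectra, not actual measurements.

```python
# Hedged sketch: normalize binned spectral intensities to the total spectral
# area per sample, then mean-center each bin across samples.
import numpy as np

rng = np.random.default_rng(7)
bins = np.abs(rng.normal(size=(10, 360)))   # 10 samples x 360 bins (fake)

# Each row (sample) is divided by its total area so rows sum to 1.
normalized = bins / bins.sum(axis=1, keepdims=True)
# Each column (bin) then has its across-sample mean subtracted.
centered = normalized - normalized.mean(axis=0)

print(normalized.sum(axis=1))   # each entry is 1.0 after normalization
```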
Mean centering and unit variance scaling were applied for PLS-DA and SVM analysis. PLS-toolbox 4.1 from Eigenvector Research was used for PLS-DA. PLS-DA can be considered the regression extension of PCA; details of the algorithm have been described elsewhere.6

Computation of Classification Accuracy. Classification accuracy was computed based on k-fold cross validation by dividing each data set comparison into k random subsets. For each computation, one of the k subsets was set aside as the test set and the remaining k - 1 subsets were pooled to form the training set. Each of the k subsets was therefore treated once as the test set, generating k accuracy values, and the classification accuracy was calculated as their mean. To reduce the variance of the accuracy estimate, the process was repeated many times and the mean value taken. Two types of accuracy rates are reported: (1) 4-fold cross validation, calculated by randomly dividing the data set into four subsets and taking the mean of the four accuracy values; this process was repeated 100 times and the mean accuracy computed. The resulting estimate generally has a smaller variance than leave-one-out cross validation (LOOCV) but is pessimistically biased. (2) LOOCV, which is the same as n-fold cross validation, where n is the total number of samples. This method produces a good estimate of the true accuracy rate
with less bias. When comparing classification accuracy rates obtained from two different methods, the LOOCV method is preferred since it does not involve the formation of random subsets as the k-fold cross-validation approach does; thus, a definite measure of the accuracy estimate is obtained. Kernel parameters were chosen using a 3-fold cross-validation approach by dividing the training data set randomly into three subsets: the classifier was trained on two of the subsets and tested on the third. The set of parameters that gave the best cross-validation accuracy was chosen to perform the actual classification, using the LIBSVM interface in MATLAB.25 The same statistics for mean centering and scaling to unit variance were used for both the training and test data. In other words, the mean and standard deviation of the training data were used to mean-center and scale the test data, since the test data set is assumed to be fully blinded and its statistics therefore unknown.

Feature Selection. Feature selection using SVM-RFE was carried out to identify the significant biomarkers causing the classification. For selection of the k best features, feature selection was performed on 100 randomly partitioned sets of training and test data, selecting k features each time and thus generating 100 subsets of k features each. The frequency with which each original feature appeared among the 100 subsets was calculated, and the k features with the maximum frequency were selected. Further details regarding the kernel and the parameters used for SVM-RFE are included in the Supporting Information. The SVM-RFE algorithm was implemented using the Spider feature selection objects.29

Test of Model Overfit. Once the model was trained, it was tested for whether it overfit the data.
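The accuracy protocol above (repeated k-fold cross validation, with each test fold scaled using the statistics of its training folds only) can be sketched as follows. The classifier, data, and repeat count are illustrative placeholders, not this study's actual configuration.

```python
# Hedged sketch of repeated k-fold cross-validation accuracy where the test
# fold is mean-centered/scaled with the training folds' mean and std only,
# mimicking a fully blinded test set.
import numpy as np
from sklearn.svm import SVC

def repeated_kfold_accuracy(X, y, k=4, repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(repeats):
        idx = rng.permutation(len(y))
        for fold in np.array_split(idx, k):
            train = np.setdiff1d(idx, fold)
            mu = X[train].mean(axis=0)
            sd = X[train].std(axis=0) + 1e-12     # guard constant features
            clf = SVC(kernel="linear").fit((X[train] - mu) / sd, y[train])
            accs.append(clf.score((X[fold] - mu) / sd, y[fold]))
    return float(np.mean(accs))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-1.0, 1.0, (40, 4)),    # synthetic class -1
               rng.normal(+1.0, 1.0, (40, 4))])   # synthetic class +1
y = np.array([-1] * 40 + [1] * 40)
print(round(repeated_kfold_accuracy(X, y), 3))
```

Setting `k = len(y)` recovers LOOCV, for which no repetition is needed since the folds are not random.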
One way to do this is to have a validation set with known class labels and check whether it gives an accuracy rate comparable to that of the training data. Another method is an R2/Q2 validation plot, which helps assess the risk that the current model is spurious, i.e., that the model merely fits the training set well but does not predict Y well for new observations. The R2 value is the percent variation of the training set that can be explained by the model. The Q2 value is a cross-validated measure of R2. The formulas for computing the R2 and Q2 values are as follows:
R² = 1 − ∑_{i=1}^{n} (y_i − ŷ_i)² / ∑_{i=1}^{n} (y_i − ȳ)²

Q² = 1 − ∑_{i=1}^{n} (y_i − ŷ_(i))² / ∑_{i=1}^{n} (y_i − ȳ)²

In the above equations, n represents the total number of samples, y_i is the actual class, ȳ is the mean of the actual classes, ŷ_i is the predicted class when all the samples are used for model building, and ŷ_(i) is the predicted class when all the samples
except the ith sample are used for model building. This validation compares the goodness of fit of the original model with the goodness of fit of several models based on data where the order of the Y-observations has been randomly permuted while the X-matrix is kept intact. The criteria for model validity are as follows: (1) all the Q2 values for the permuted data sets are lower than the Q2 value for the actual data set (if this is not the case, the model is capable of fitting any data set well, which indicates overfitting); (2) the regression line (the line joining the actual Q2 point to the centroid of the cluster of permuted Q2 values) has a negative intercept on the y-axis.

RESULTS

To demonstrate the usefulness of SVMs in classifying both binned and targeted profiling metabolomic data, two case studies are presented: (1) a case where nearly complete distinction may be seen (healthy versus pneumonia) and (2) a case where distinction is not as easily seen (male versus female). To determine how well SVM and PLS-DA discriminate the data, 59 healthy subjects were compared to 59 subjects with pneumococcal pneumonia, with a class vector Y of length 118 in which -1 and 1 represent healthy and pneumonia, respectively. The variables consisted of either 360 bins or 82 metabolites derived from the urinary NMR spectral profiles. Figure 1 shows a visual comparison of the controls-pneumonia spectral and concentration data separated using PLS-DA and SVMs. Traditionally, most classification methods for biological data involve visualization of the observations and of where they lie with respect to one another, because the majority of methods involve projection of the data set onto a lower dimensional space (usually 2D or 3D), as in PCA or PLS. In SVMs, however, separation of the data set involves mapping to a higher dimensional feature space, and hence visualization of the resulting data is difficult.
This problem may be overcome by plotting the decision function of each sample, where the sign of the decision function determines the class of the test sample. Here, we plotted the decision function against a dummy variable; in this case, a random Gaussian variable was chosen as the dummy variable. Figure 1a is a plot of the separation achieved between control and pneumonia data based on spectral binning using SVM, and Figure 1b is a plot of the separation achieved with PLS-DA. Table 1 shows the classification accuracy rates for both methods using 4-fold cross validation as well as LOOCV. While the visual separation appears slightly better for the SVM, the classification accuracy rates are similar, with SVMs being marginally better. Classification using targeted profiling data produces slightly different results. Figure 1c shows a visual comparison of the raw targeted profiling data using SVMs, and Figure 1d shows the corresponding comparison using PLS-DA. Table 1 shows the classification accuracy rates for both methods. It is clear that visual separation of the raw data by PLS-DA is not as good as that using SVMs. Moreover, the classification accuracy is poor compared to SVMs for both 4-fold cross validation and LOOCV.

(25) Chang, C.; Lin, C. Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm, 2001.

To determine whether SVMs perform better in a case where separation is not as apparent, we analyzed the urine profiles of
Figure 1. Comparison of PLS-DA and SVM multivariate analysis techniques on controls-pneumonia data. Control data (solid circles, n = 59); pneumococcal data (hollow circles, n = 59). (a) SVM spectral data; (b) PLS-DA spectral data; (c) SVM concentration data; (d) PLS-DA concentration data.

Table 1. Case Studies: Classification Accuracy Rates

case study                                  method   parameters                   4-fold CV (mean %)   LOOCV (mean %)
1a: control-pneumonia spectral data         SVM      rbf, γ = 2^-13, c = 2^11     95.65                96.61
                                            PLS-DA   k = 3                        94.44                95.76
1b: control-pneumonia concentration data    SVM      rbf, γ = 2^-9, c = 2^10      93.02                95.76
                                            PLS-DA   k = 2                        91.80                92.37
2a: male-female spectral data               SVM      rbf, γ = 2^-9, c = 2^11      83.65                86.93
                                            PLS-DA   k = 10                       84.44                87.50
2b: male-female concentration data          SVM      rbf, γ = 2^-6.6, c = 2^11    90.90                92.04
                                            PLS-DA   k = 5                        90.13                91.19

352 individuals to see if we could separate male from female. In this case, healthy human urinary profiles were investigated to assess the differences associated with gender. The data set used here is the same as previously published24 and consists of processed NMR spectra from 158 male and 194 female samples. Standard binning was performed on the spectra to reduce the dimension of the system to 360. A class vector Y of length 352, with -1 and 1 representing males and females, respectively, was assigned. Metabolite concentration data were obtained by quantification of 72 known urinary components (metabolites) in the NMR spectra. The concentration data consisted of 352 samples from 158 males and 194 females; each observation has as associated variables the concentration values (measured in µmolar) of the 72 metabolites. This class vector Y likewise had a length of 352, with -1 and 1 representing males and females, respectively. Table 1 shows classification accuracies for gender separation (both spectral and concentration data). As with the normal-
pneumonia case, SVM appears to perform marginally better than PLS-DA for the concentration data; for the spectral data, however, PLS-DA appeared slightly better. Figure 2 shows the visualization plots for both binned and concentration data. The visualization plot for the SVM concentration data (Figure 2a) is better than that of PLS-DA (Figure 2b); however, that of the SVM spectral data (Figure 2c) is only as good as that of the PLS-DA spectral data (Figure 2d). It should be noted that the actual PLS-DA model for this spectral gender data was built using 10 components, whereas only two components were used in the visualization plot; this explains the very poor performance of the PLS-DA visualization. SVM visualization plots, on the other hand, are independent of the dimension of the data set and of the kernel used to build the model. In order to show that none of the models overfit the data, the R2-Q2 figures were plotted along with the separation plots for all the data sets, as in Figures 1 and 2. This plot can be explained as follows: the vertical axis represents the R2/Q2 value
Figure 2. Comparison of PLS-DA and SVM multivariate analysis techniques on gender separation data. Male data (solid circles, n = 158); female data (hollow circles, n = 194). (a) SVM spectral data; (b) PLS-DA spectral data; (c) SVM concentration data; (d) PLS-DA concentration data.
for the original model (far to the right) and for the cluster of Y-permuted models further to the left. The horizontal axis shows the correlation between the permuted Y-vectors and the original Y-vector; the original Y has a correlation of 1.0 with itself, defining the high point on the horizontal axis. In all the plots (Figures 1 and 2), the R2 and Q2 values for the permuted models are lower than those of the original model, and the Q2 regression line has a negative intercept in all cases. Hence, it can be concluded that none of the models overfit the data. Moreover, the R2/Q2 values and the R2/Q2 regression line slopes of SVM-RFE are higher than those of the PLS-DA based VIP method. Orthogonal partial least-squares-discriminant analysis (OPLS-DA) is a major development of the PLS-DA technique, proposed to handle class-orthogonal variation in the data matrix.27,28 OPLS-DA augments classification performance only in cases where individual classes exhibit divergence in within-class variation; in terms of regression prediction results, OPLS-DA is nearly equivalent to PLS-DA.27 The main advantage of the OPLS-DA method, however, is the greater transparency of the generated model, because OPLS-DA can separate the predictive from the nonpredictive (orthogonal) variation.

(26) Ambroise, C.; McLachlan, G. Proc. Natl. Acad. Sci. U.S.A. 2002, 99, 6562–6566.
(27) Bylesjo, M.; Rantalainen, M.; Cloarec, O.; Nicholson, J.; Holmes, E.; Trygg, J. J. Chemom. 2006, 20, 341–351.
(28) Trygg, J.; Wold, S. J. Chemom. 2002, 16, 119–128.
(29) Weston, J.; Elisseeff, A.; BakIr, G.; Sinz, F. Software available at http://www.kyb.tuebingen.mpg.de/bs/people/spider/main.html, 2006.

In order to verify these statements, the OPLS-DA algorithm
was coded in MATLAB and applied to the male-female concentration data set. A summary of the OPLS-DA results is included in the Supporting Information. The classification performance was observed to be very similar to that of PLS-DA. The top 10 features (metabolites) contributing most to the separation in PLS-DA and OPLS-DA were also compared, and the lists of features and their ordering were likewise very similar. Hence, it can be concluded that OPLS-DA performed very similarly to PLS-DA, possibly because of the absence of divergence in within-class variation between the two classes.

Feature Selection. In order to extract meaningful information about the features (bins or metabolites) giving rise to the separation, the PLS-DA based VIP method and the SVM-RFE method were implemented and compared. The goal was to find the optimum number of features resulting in the best classification. For the PLS-DA based feature selection, this was done as follows: based on the PLS loadings, the VIP values were calculated for each feature, and the features were arranged in decreasing order of VIP (i.e., decreasing order of importance). Starting from the most important feature, features were included sequentially, and the corresponding accuracy rate for each set of features was calculated. SVM-RFE, in contrast, is a backward elimination procedure in which the entire feature set is used at the beginning. Depending on the weights of the individual features, each feature was eliminated sequentially and the corresponding accuracy rate for that set of features was calculated.
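The SVM-RFE backward elimination just described can be sketched as follows. This is a minimal scikit-learn illustration, not the Spider/MATLAB implementation used in this work; the data, dimensions, and informative features are stand-ins.

```python
# Hedged sketch of SVM-RFE: repeatedly train a linear SVM, discard the
# feature with the smallest squared weight, and record the elimination order;
# then sweep accuracy with the top-k ranked features, as in the Figure 3 curves.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 8))
y = np.where(X[:, 0] + 0.8 * X[:, 3] > 0, 1, -1)  # features 0 and 3 matter

def svm_rfe(X, y):
    remaining, ranking = list(range(X.shape[1])), []
    while remaining:
        w = SVC(kernel="linear").fit(X[:, remaining], y).coef_[0]
        ranking.append(remaining.pop(int(np.argmin(w ** 2))))
    return ranking[::-1]          # most important feature first

rank = svm_rfe(X, y)
curve = [cross_val_score(SVC(kernel="linear"), X[:, rank[:k + 1]], y,
                         cv=4).mean() for k in range(X.shape[1])]
print(rank[:2], int(np.argmax(curve)) + 1)  # top features, optimum count
```

The index of the maximum of `curve` plays the role of the optimum number of features discussed below.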
Figure 3. Comparison of feature selection results using SVM-RFE (solid line) and PLS-DA VIP (dashed line): (a) controls-pneumonia spectral data; (b) controls-pneumonia concentration data; (c) gender spectral data; (d) gender concentration data.

Table 2. Significant Features (CP = Controls-Pneumonia Data Set, MF = Gender Data Set)

CP, PLS-DA
  spectral (bin regions): 3.025-3.05, 2.675-2.70, 4.025-4.05, 9.1-9.125, 2.7-2.725
  concentration (metabolites, rank): citrate (1), O-acetylcarnitine (3), trigonelline (5), fumarate (4), 1-methylnicotinamide (2)
CP, SVM-RFE
  spectral (bin regions): 3.025-3.05, 2.675-2.70, 1.25-1.275, 2.525-2.55, 2.5-2.525
  concentration (metabolites, rank): citrate (1), O-acetylcarnitine (2), trigonelline (6), fumarate (3), 1-methylnicotinamide (7)
MF, PLS-DA
  spectral (bin regions): 3.9-3.925, 2.525-2.55, 2.675-2.7, 2.5-2.525, 3-3.025
  concentration (metabolites, rank): carnitine (1), citrate (4), succinate (5), creatinine (-), fumarate (11)
MF, SVM-RFE
  spectral (bin regions): 4.025-4.05, 6.95-6.975, 7.15-7.175, 3-3.025, 3.9-3.925
  concentration (metabolites, rank): carnitine (1), citrate (2), succinate (6), creatinine (8), fumarate (9)

For the case of healthy versus pneumococcal pneumonia spectral data, the classification accuracy rate was plotted against the corresponding number of features (Figure 3a). From Figure 3a, it can be seen that the feature selection curve initially rises steeply, attains a maximum, then becomes stable, and decreases slightly at the end. This is because the first few features are the most important in terms of classification. The point at which the curve reaches a maximum corresponds to the optimum number of features. Further addition of features slightly degrades the classifier performance because of the addition of redundant "noise" to the system. SVM-RFE significantly outperforms PLS-DA: SVM-RFE gives a classification accuracy rate of 99.4% with as few as 30 features, whereas the maximum accuracy rate achieved by PLS-DA is 98.4% and requires 50 features. A few of the significant bins (ppm regions) identified by both methods are listed in Table 2 in order of importance. Table 2 compares
the significant features (metabolites for concentration data and bin regions for spectral data) identified by PLS-DA and SVM-RFE for each of the four case studies. For the case of healthy versus pneumococcal pneumonia concentration data (Figure 3b), SVM-RFE again significantly outperforms PLS-DA: SVM-RFE gives a classification accuracy rate of 98.5% with as few as eight features, whereas the maximum accuracy rate achieved by PLS-DA is 95%. A few of the significant metabolites identified in this concentration data by both methods are listed in Table 2, with their rank order of importance in parentheses. Both methods perform better than when no feature selection is done. Hence it can be concluded that feature selection reduces the dimensionality of the system as well as increases the separation between the two classes. In the case of healthy versus pneumococcal pneumonia, very good separation was achieved. However, in the case of data that
is more overlapped, the differences in performance and feature selection are more pronounced. Figure 3c shows that SVM-RFE outperforms PLS-DA when classifying the gender spectral data. When all the features are used for classification, the accuracy rate is 84%. This low accuracy rate is likely due to the presence of irrelevant features, or "noise", that can distort the classification. As these irrelevant features are removed sequentially, the accuracy improves until it attains a maximum of 89% at 40 bins. Further removal of features reduces the accuracy rate because the model no longer has sufficient information to classify the data. In the case of PLS-DA, the maximum accuracy rate achieved is approximately 86%, and almost 70 features are required to attain it. For the gender spectral data, some of the significant bin regions identified by both PLS-DA and SVM-RFE are listed in Table 2. Bin regions 3.9-3.925 ppm (creatine), 3.0-3.025 ppm, and a few others were found to be important by both the SVM-RFE and PLS-DA based methods, although their rank order of importance differed. The metabolites to which these bin regions correspond can be identified from the literature; however, for some of the bin regions this mapping is not unique because of severe overlap of the metabolite resonances in these regions. Several metabolites, such as creatinine, citrate, and carnitine, were found to be significant for classification in both the concentration and spectral data. Further investigation of these bin regions may lead to potentially significant biomarkers that would be missed in the concentration data. For the gender concentration data, Figure 3d shows that both methods give a similar maximum classification accuracy rate of 91%.
Nevertheless, on careful observation, it can be seen that SVM-RFE achieves this maximum at 30 features, whereas PLS-DA requires 45 features to do so. Moreover, when the number of features selected is low (the initial phase of the graph), SVM-RFE performs better than PLS-DA. A few of the significant metabolites are listed in Table 2 with their rank order of importance in parentheses. Further study is needed to investigate the biological significance of these metabolites and to understand why they were found to be important for the classification.
DISCUSSION
The tabulated results suggest that SVM performs marginally better than PLS-DA in terms of the leave-one-out classification accuracy rate on the entire data set. Cross validation is preferred over computing accuracy rates on a single blinded data set, since the latter approach is biased: it cannot be guaranteed that the test set has been sampled from the same population distribution as the training set. Moreover, if a different blinded set were randomly chosen from the original data set, the same accuracy rate would not be obtained every time. Thus, comparison of classification accuracy rates based on a cross-validation approach is preferred over comparison on a single blinded data set. In general, the LOOCV method generates a better estimate than 4-fold cross validation because the latter gives a pessimistically biased estimate (the estimate is always lower than the true value). In terms of classification on the full data set, SVM performs only marginally better than PLS-DA. However, the main advantage
Table 3. Case Studies: Summary of Feature Selection Results

case study                                   method   optimum no. of features   accuracy rate (%)
1a: controls-pneumonia spectral data         SVM      30                        99.4
                                             PLS-DA   50                        98.4
1b: controls-pneumonia concentration data    SVM      8                         98
                                             PLS-DA   7                         95
2a: male-female spectral data                SVM      40                        89
                                             PLS-DA   70                        86
2b: male-female concentration data           SVM      30                        91
                                             PLS-DA   45                        91
of SVM over PLS-DA is that, after feature selection, SVM generates a better predictive model than PLS-DA from the subset of features describing the phenotype. The SVM-RFE method gives a much better classification accuracy rate with fewer features than the PLS-DA based VIP method. This is important when a number of markers are being used for disease diagnostics. The feature selection results for the two case studies are summarized in Table 3. Table 2 shows that the order of importance of the metabolites ranked by SVM and PLS-DA is very different. These metabolite subsets are crucial for the development of a predictive model for disease diagnostics, and better classification accuracy with the fewest possible features means that SVMs are more efficient than PLS-DA for diagnostic purposes. Another advantage of SVMs over PLS-DA is that PLS-DA is essentially a linear classifier, whereas SVMs can handle nonlinear cases through appropriate kernel functions. Moreover, SVMs have excellent generalization performance because of the regularization parameter C. Furthermore, PLS-DA carries an inherent assumption that the predicted class (y) values of each class are normally distributed. This assumption is used to calculate the prediction probabilities of the different classes and the classification threshold according to the Bayesian method: a normal distribution is fitted to the predicted class (y) values of each class, and the value at which a sample is equally likely to belong to either class (in a binary class problem) is chosen as the classification threshold. In order to verify this normality assumption, the histogram of predicted class values for the controls class in case study 1 is plotted in Figure 4. It can be clearly seen that the assumption of normality is not valid. In the case of SVMs, there are no such assumptions about the data set distribution.
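The Bayesian threshold calculation just described can be sketched numerically as follows. This is an illustrative Python example using SciPy with simulated predicted class values (the study's actual PLS-DA outputs are not reproduced here); the class means, spreads, and sample sizes are placeholder assumptions.

```python
import numpy as np
from scipy.stats import norm, shapiro
from scipy.optimize import brentq

rng = np.random.default_rng(1)
# Simulated stand-ins for the PLS-DA predicted class (y) values of each class
y_pred_controls = rng.normal(loc=0.1, scale=0.15, size=50)
y_pred_pneumonia = rng.normal(loc=0.9, scale=0.20, size=50)

# Fit a normal distribution to each class's predicted values (the PLS-DA assumption)
m0, s0 = norm.fit(y_pred_controls)
m1, s1 = norm.fit(y_pred_pneumonia)

# Bayesian threshold: the point between the two means where the fitted densities
# are equal, i.e., where a sample is equally likely to belong to either class
diff = lambda t: norm.pdf(t, m0, s0) - norm.pdf(t, m1, s1)
threshold = brentq(diff, m0, m1)

# The normality assumption itself can be checked, e.g., with a Shapiro-Wilk test
stat, pval = shapiro(y_pred_controls)
print("threshold:", round(threshold, 3), " Shapiro-Wilk p:", round(pval, 3))
```

Samples whose predicted y falls below the threshold would be assigned to the controls class, and above it to the other class; a small Shapiro-Wilk p-value on the real predicted values would, as Figure 4 suggests, cast doubt on the normality assumption underlying this threshold.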
Clearly, further study is needed to investigate the biological significance of these metabolites. However, it is of interest that the ordering of the metabolites differs between PLS-DA and SVMs. Moreover, the PLS-DA based method may have missed one or more important distinguishing factors. One such factor is creatinine. It has been shown24 that creatinine is related to body mass, and since males in general have a higher body mass than females, it would seem appropriate to assume that creatinine is a key differentiator between males and females. PLS-DA did not indicate creatinine as an important indicator, whereas SVMs did. A detailed analysis of all the significant metabolites identified by either method is beyond the scope
of this work and will be described elsewhere.24 It should be noted that this work does not aim to provide a diagnostic for detecting a particular phenotype; its main focus is to introduce support vector machines as an efficient tool for metabolomic analysis.

Figure 4. Histogram of the PLS-DA predicted class (y) values for the controls class in the controls-pneumonia case study. The PLS-DA model is built on the controls-pneumonia data, the predicted class (y) values for the controls class are computed, and their histogram is plotted.

CONCLUSIONS
In this work, we have shown that support vector machines (SVMs) are an effective tool for analyzing metabolomic data. Their classification performance, and their ability to achieve it with fewer features, is better than that of the traditionally used PLS-DA technique. Moreover, SVMs generate better predictive models after appropriate feature selection, which serves to identify the potential biomarkers, or significant distinguishing features, that drive the classification. With fewer features selected, SVM-RFE gives a much better classification accuracy rate than the PLS-DA based VIP method. The superior performance of SVM-RFE over the PLS-DA based method was demonstrated with two case studies. Unlike PLS-DA, SVM does not require any implicit assumptions about the distribution of the data set. However, for metabolomic data sets where there is significant divergence in the within-class variation between the two classes, OPLS-DA might perform better than PLS-DA. Further work is needed to compare the performance of SVM and OPLS-DA in such cases.
ACKNOWLEDGMENT
The authors thank Ms. Shana Regush for preparing and acquiring the NMR data and the Canadian National High Field NMR Centre (NANUC) for use of the facilities for collection of the NMR data. The authors also express their gratitude to Chenomx Inc. and Varian Inc. for their support. The authors also acknowledge Allison McGeer for providing the S. pneumoniae samples, Kathryn Rankin and Hao Fu for analyzing them, and NSERC for providing financial support. Part of this work was supported by an establishment grant from the Alberta Heritage Foundation for Medical Research and by The Lung Association of Alberta and the Northwest Territories. This work was also supported by funds from Western Economic Development and Alberta Advanced Education and Technology. The authors wish to thank all volunteers and patients for providing urine samples for this study.

SUPPORTING INFORMATION AVAILABLE
Details regarding the kernel and the parameters used for SVM-RFE; the MATLAB code used to generate the results; a summary of the OPLS-DA results; and details of the age-matching statistical analysis. This material is available free of charge via the Internet at http://pubs.acs.org.

Received for review June 4, 2008. Accepted July 22, 2008.

AC800954C