
Performance Evaluation of Algorithms for the Classification of Metabolic 1H-NMR Fingerprints

Jochen Hochrein†, Matthias S. Klein†, Helena U. Zacharias†, Juan Li‡, Gene Wijffels‡, Horst Joachim Schirra$, Rainer Spang†, Peter J. Oefner†, Wolfram Gronwald†*



† Institute of Functional Genomics, University of Regensburg, Josef-Engert-Strasse 9, 93053 Regensburg, Germany

‡ CSIRO Livestock Industries, Queensland Bioscience Precinct, 306 Carmody Rd., St Lucia, Qld 4067, Australia

$ The University of Queensland, Centre for Advanced Imaging, Brisbane, Qld 4072, Australia

*Contact: Wolfram Gronwald, phone: +49-941-943-5015, fax: +49-941-943-5020, email: [email protected]


Abstract

Non-targeted metabolite fingerprinting is increasingly applied to biomedical classification. The choice of classification algorithm may have a considerable impact on outcome. In this study, employing nested cross-validation for assessing predictive performance, six binary classification algorithms in combination with different strategies for data-driven feature selection were systematically compared on five datasets of urine, serum, plasma, and milk one-dimensional fingerprints obtained by proton nuclear magnetic resonance (NMR) spectroscopy. Support Vector Machines and Random Forests combined with t-score based feature filtering performed well on most datasets, while the performance of the other tested methods varied between datasets.

Keywords: Metabolomics, NMR, Classification, Cross-validation, Fingerprinting


Introduction

The comprehensive analysis of small molecules present in body fluids and tissues as a function of age, gender, health status, nutrition, and genetic background is the general aim of metabolomics and includes screening for differentially produced metabolites, estimation of fold changes, and classification of samples.1 Proton nuclear magnetic resonance (1H-NMR) spectroscopy is particularly suited for non-targeted metabolite fingerprinting, as it allows the simultaneous detection of almost all proton-containing metabolites present at sufficient concentrations in a single experiment. Typical biological specimens such as urine and serum contain hundreds to thousands of endogenous metabolites and xenobiotics,2 leading to a correspondingly large number of features in the NMR spectra. After data preprocessing, including proper scaling and normalization,3 mostly multivariate statistics are employed.4 Typical applications include the identification of clusters of samples that share common features and the classification of samples into known classes of disease. Here, we focus on sample classification using supervised techniques from machine learning. In NMR-based metabolomic studies, to date, predominantly algorithms based on Partial Least Squares-Discriminant Analysis (PLS-DA)5 have been employed. However, other algorithms may also yield suitable results.

A number of classifiers have been developed for the classification of biomedical samples based on genome-wide expression data, including linear discriminant analysis, nearest-neighbor classifiers, and classification trees.6 In principle, such classifiers may also be applied to metabolite fingerprints, but to date no systematic evaluation of their performance on high-dimensional metabolomics data has been conducted. Performance may vary significantly depending on the type of -omics data employed, owing to differences in the joint distribution of features. For the same reason, conclusions drawn from the performance of eight different binary machine-learning algorithms on a comparatively low-dimensional dataset of 63 pre-selected metabolites, as published previously,7 may not necessarily hold for high-dimensional NMR or hyphenated mass spectrometric fingerprints comprising hundreds to several thousand metabolite features.

Here, we compare the performance of six binary classifiers applied to five different high-dimensional urine, plasma, serum, and milk 1H-NMR fingerprinting datasets, in terms of prediction accuracy, area under the receiver operating characteristic (ROC) curve, as well as the number, signal intensity, and chemical identity of the features selected. The algorithms tested included Elastic Net,8 Nearest Shrunken Centroids,9 Partial Least Squares-Discriminant Analysis (PLS-DA),5 Random Forests (RF),10 Support Vector Machines (SVM) employing linear or radial kernels,11 and Top Scoring Pairs (TSP).12 Some classification algorithms can, in principle, handle high-dimensional feature vectors directly, but it has often proven helpful to combine them with data-driven feature selection procedures, which differ from the a priori pre-selection of metabolites to be measured. A large variety of feature selection algorithms is available in the literature. Methods are usually grouped into (i) filter approaches that select subsets of the entire dataset upstream of the actual classification, (ii) wrapper methods that use the actual classification algorithm to score subsets of features according to their predictive power, and (iii) embedded methods that perform variable selection in the process of training.13 Of the classifiers tested, the Elastic Net conducts an embedded internal feature selection procedure, while the Nearest Shrunken Centroids classifier belongs to the class of wrapper methods. An example of a filter approach is a t-score based feature selection strategy.14


Experimental Section

Evaluation Strategy

Predictive performance was evaluated in cross-validation (CV) by computing the rate of correct classifications and the average number of selected variables. Moreover, the CV classification performance was assessed by receiver operating characteristic (ROC) analysis, calculating the area under the ROC curve (AUC). Most of the algorithms contain internal parameters that are subject to calibration. Typically, these parameters control model complexity and are used to balance over- and under-fitting. Proper calibration can exert a higher impact on predictive performance than the choice of the underlying algorithm. It is important that neither feature selection nor parameter calibration bias the performance evaluation.15 To ensure this, every algorithm was tested in a leave-five-out nested cross-validation scheme. Figure 1 schematically shows the three-fold nested cross-validation scheme used for the evaluation of the SVM-based approaches. Note that the data in each loop are split iteratively into training and test data to ensure that all data of a given loop are used once for testing. The outermost loop, shown at the top of Figure 1, contains all data. The current training data of the outermost loop define the data of the middle loop, which are split again iteratively into test and training data. The same procedure is repeated with the innermost loop. The nested cross-validation approach employed here has been shown to yield an almost unbiased assessment of the true classification error.16

Calibration of internal parameters. The innermost loop was used to calibrate internal parameters, such as the cost parameter C and the width of the kernel function γ for SVMs with a radial basis function kernel. If more than one parameter had to be optimized, a grid-search procedure was performed as previously proposed.17 For each grid point, i.e., parameter setting, a complete cross-validation was run with the classification algorithm in question, and the best performing parameter setting was then applied to the middle loop. For classifiers that did not require calibration of internal parameters, the innermost loop was omitted.

Classifier sparsity. In the middle loop, the minimum number of features needed for optimal classification was determined. In the work presented here, the term "features" refers to the spectral bins of the 1H-NMR spectra employed. The middle loop is not required for recursive SVM (R-SVM) based feature selection, because R-SVM conducts an internal determination of the optimal feature number. Likewise, for classifiers employing a fixed number of features, such as Top Scoring Pairs, the middle loop was not used. For the optimization of classifier sparsity, the data of the middle loop were iteratively split into training and test data, and, using the current training data, features were ranked, for example, according to their t-statistic. Starting with a single feature, the number of features was incremented by one at a time, and the optimum number of features was determined, as described above, by performing a complete cross-validation for each selection within the middle loop. Predictive performance was evaluated in terms of prediction accuracy. Note that the innermost loop is nested within the middle loop; therefore, for every split into test and training data in the middle loop, an individual set of internal parameters is obtained from the innermost loop. These are averaged to determine the overall best internal parameters for this cross-validation run. As a consequence, an individually calibrated set of internal parameters is obtained for every feature number.

Validation of classifiers. In the case of t-score based feature selection, an optimized feature number with the corresponding calibrated internal parameters was obtained from the two nested inner loops for every training set of the outermost loop. With these parameters, features were selected on the current training data, and the classifier was trained with the respective feature selection and classification algorithm. In the case of R-SVM based feature selection, only calibrated internal parameters were transferred to the outermost loop, whereas the optimum number of features and the features themselves were determined by the R-SVM approach on the current training data of the outermost loop. The different algorithms were then validated on the current test data of the outermost loop, again performing a complete cross-validation within this loop. This procedure ensures that the test data of the outermost loop were not used for parameter calibration or classifier training. Predictive performance was analyzed in terms of prediction accuracy and area under the receiver operating characteristic (ROC) curve.
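The scheme can be condensed into code. The following is a minimal R sketch of the outermost and innermost loops for a radial-kernel SVM; the middle loop for feature-number optimization is omitted for brevity, and the function name `nested_cv`, the inputs `X` and `y`, and the parameter grids are illustrative assumptions, not the original implementation.

```r
library(e1071)  # svm() and tune.svm(), cf. the Classifiers section

# Illustrative leave-five-out nested CV for a radial-kernel SVM.
# X: samples-x-bins matrix, y: two-level factor; grids are examples.
nested_cv <- function(X, y, fold_size = 5,
                      costs = 2^(-2:4), gammas = 2^(-6:-3)) {
  n <- nrow(X)
  folds <- split(sample(n), ceiling(seq_len(n) / fold_size))
  correct <- 0
  for (test_idx in folds) {                      # outermost loop
    train_idx <- setdiff(seq_len(n), test_idx)
    # innermost loop: calibrate C and gamma on the training data only
    tuned <- tune.svm(X[train_idx, ], y[train_idx],
                      cost = costs, gamma = gammas, kernel = "radial")
    fit <- svm(X[train_idx, ], y[train_idx], kernel = "radial",
               cost  = tuned$best.parameters$cost,
               gamma = tuned$best.parameters$gamma)
    pred <- predict(fit, X[test_idx, , drop = FALSE])
    correct <- correct + sum(pred == y[test_idx])
  }
  correct / n   # cross-validated prediction accuracy
}
```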

Preprocessing of Datasets

The general goal of data preprocessing is to minimize contributions from unwanted biases and experimental variance. Initial data processing may impact classification performance, as shown in a previous publication.3 Hence, all datasets were subjected to a comparable preprocessing routine, starting with equidistant binning of the spectra, which is still the most widely used method. Alternative approaches include peak alignment18,19 and methods using the full spectral resolution, employing statistical correlation spectroscopy20 or orthogonal projection to latent structures.21 Alternatively, in so-called targeted profiling approaches, a pre-selected set of metabolites is quantified from the 1D22 or 2D23,24 spectra, and subsequent data analysis is then based on these quantitative values. Next, spectra were normalized employing variance stabilization normalization (VSN),25 a normalization procedure previously shown to be well suited for 1H-NMR metabolite fingerprints.3 Analysis of typical metabolic data shows that, for signals of strong and medium intensity, the variance is proportional to the mean. For values close to the detection limit, however, the variance no longer decreases and stays rather constant. VSN tackles exactly this problem by applying the inverse hyperbolic sine. For large intensity values, the normalization behaves like a logarithmic transformation, thereby removing heteroscedasticity from the data. At the same time, the variance of small values remains unchanged due to the linear behavior of the transformation for small input values. An additional advantage in comparison to a log-transformation is the ability of VSN to cope with negative input values.
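The behavior just described is easy to verify numerically. The following R fragment shows how the inverse hyperbolic sine (arsinh) interpolates between a linear function near zero and a logarithm for large intensities; the numbers are arbitrary, and the actual per-spectrum calibration was done with the Bioconductor vsn package (see the ADPKD dataset description below).

```r
# arsinh underlies VSN: roughly linear near zero, logarithmic for
# large values, and, unlike log, defined for negative intensities.
x <- c(-5, 0.1, 1, 10, 1000)
cbind(x,
      arsinh = asinh(x),           # base-R inverse hyperbolic sine
      log2x  = log(2 * abs(x)))    # asymptote of asinh(x) for large x
# The full calibration (affine mapping per spectrum plus arsinh) is
# available as, e.g., vsn::justvsn() applied to the bin matrix.
```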

Datasets

The first dataset used to evaluate the performance of the classifiers comprised in-house generated 1H-NMR spectra of urine specimens collected from n=54 patients suffering from autosomal dominant polycystic kidney disease (ADPKD), including n=35 patients receiving medication against hypertension (group 1A) and n=19 non-medicated patients (group 1B), as well as spectra from n=46 healthy volunteers (group 2).1 The 701 spectral bins of this dataset, which excluded the spectral regions containing the broad urea signal and the water artifact, had a width of 0.01 ppm. The data had been normalized relative to creatinine, followed by VSN transformation.25 The Bioconductor VSN R-package applied here combines methods that correct for between-sample variation by linearly mapping all spectra to the first spectrum with a subsequent adjustment of the variance of the data. This reduces biases due to differences in fluid intake and ensures stable variances across feature intensities. For the evaluation of binary classifiers, groups 1A and 1B were merged.

The second dataset was retrieved from the MetaboAnalyst website (http://www.metaboanalyst.ca/MetaboAnalyst/faces/Docs/Format.jsp) with no further preprocessing. This second dataset comprised n=25 urinary NMR fingerprints each from patients affected with glomerulonephritis (GN) and from healthy volunteers.26 The n=200 spectral bins had a width of 0.04 ppm. To ensure comparability across spectra, the providers had normalized each spectrum to an integral of one.

The third dataset comprised a total of n=80 serum NMR fingerprints obtained from n=40 sheep before and after 48 h of road transport (RT).27 One-dimensional 1H-NMR data were acquired employing a CPMG pulse sequence for the suppression of broad signals originating from macromolecules. The 881 spectral bins of this dataset excluded the spectral region containing the water artifact and had a width of 0.01 ppm. The data for each spectrum were normalized to an integral of one to account for dehydration effects during road transport, followed by VSN transformation.

The fourth dataset included NMR fingerprints of n=105 EDTA-plasma specimens that had been drawn from patients 24 hours after cardiac surgery with use of cardiopulmonary bypass. Of the 105 patients, 33 developed post-operative acute kidney injury (AKI) and 72 did not. The EDTA-plasma specimens were ultrafiltrated with a cutoff of 10 kD to remove macromolecules. Subsequently, 400 µL of the ultrafiltrate were mixed with 200 µL of 0.1 mol/L phosphate buffer at pH 7.4 and 50 µL of 29.02 mmol/L 3-trimethylsilyl-2,2,3,3-tetradeuteropropionate (TSP) in deuterium oxide as internal standard (Sigma-Aldrich, Taufkirchen, Germany). 1D NMR experiments were carried out on a 600 MHz Avance III spectrometer (Bruker BioSpin, Rheinstetten, Germany) at 298 K as described previously.24 The 863 spectral bins of this dataset, which excluded the spectral regions containing the water artifact and the glycerol signals originating from the filtering device, had a width of 0.01 ppm. Data were further pre-processed as described for the third dataset. Metabolite identification was performed by the generation of representative 1D 1H and high-resolution 1H-13C heteronuclear single-quantum coherence (HSQC) spectra, aided by corresponding 2D 1H-13C heteronuclear multiple-bond correlation (HMBC) and 2D 1H-1H total correlation spectroscopy (TOCSY) spectra. For signal assignment, these spectra were overlaid with reference spectra of pure compounds taken mainly from the commercially available Bruker Biofluid Reference Compound Database BBIOREFCODE 2-0-3, using AMIX 3.9.13 (Bruker BioSpin, Rheinstetten). Written declarations of informed consent had been obtained from all study participants before inclusion.


The fifth dataset included n=35 milk specimens collected in the last third of lactation from two breeds of dairy cows (DCs).28 Seventeen specimens from highly productive Brown Swiss cows were compared to 18 specimens from Simmental cows, with average 305-d milk yields of 9,200 kg and 8,300 kg, respectively. Milk samples were defatted by centrifugation, followed by ultrafiltration with a 10-kD cutoff. NMR data were acquired and preprocessed as described for the fourth dataset, resulting in a total of 838 spectral bins. Note that the excluded region containing the water artifact was adjusted separately for each dataset, leading to different bin numbers across datasets.

Classifiers

In total, six binary classifiers were tested: Elastic Net,8 Nearest Shrunken Centroids,9 Partial Least Squares-Discriminant Analysis,5 Random Forests,10 Support Vector Machines,11 and Top Scoring Pairs.12 To our knowledge, of these only SVM, PLS-DA, and Random Forests have been used in metabolomics before.7,29,30

Elastic Net. The Elastic Net8 is a high-dimensional linear penalized-likelihood classifier that employs both an L1- and an L2-penalty term for controlling model complexity. While the L1 penalty results in an internal feature selection (most features receive weights of zero), the additional L2 penalty ensures that the remaining features form clusters of strongly correlated features. Owing to its built-in variable selection, no upstream feature selection has to be performed. We used the implementation provided in the R-library elasticnet.31
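A minimal usage sketch of the elasticnet library on simulated data is given below. Treating the binary classification as a penalized regression on ±1-coded labels thresholded at zero is our illustrative reading, and the λ and L1-fraction values are arbitrary examples, not the calibrated ones reported later.

```r
library(elasticnet)  # enet() of Zou & Hastie, cf. ref 31

# Simulated toy data: 40 samples, 100 bins, 5 informative bins.
set.seed(1)
X <- matrix(rnorm(40 * 100), nrow = 40)
y <- rep(c(-1, 1), each = 20)
X[y == 1, 1:5] <- X[y == 1, 1:5] + 1
fit <- enet(X, y, lambda = 0.001)       # lambda: the L2 penalty
# s: fraction of the L1 path used; smaller s => fewer non-zero weights
scores <- predict(fit, X, s = 0.5, mode = "fraction", type = "fit")$fit
mean(sign(scores) == y)                 # resubstitution accuracy
```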

Nearest Shrunken Centroids. This classifier was originally developed for the classification of gene expression profiles.9 The algorithm estimates class centroids for each class of the training data. Beforehand, the data are modified by a shrinkage procedure so that only informative features contribute to the class centroids. Classification is based on a modified squared distance to the centroids: an unseen object is assigned to the class with the nearest centroid. For computation, the R-library pamr was used.32
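A minimal pamr sketch on simulated data follows. Note that, unlike most of the other R libraries used here, pamr expects features in rows and samples in columns; the threshold value is an arbitrary example for the shrinkage parameter Δ.

```r
library(pamr)  # nearest shrunken centroids, cf. ref 32

set.seed(1)
x <- matrix(rnorm(100 * 40), nrow = 100)   # 100 bins x 40 samples
y <- factor(rep(c("case", "control"), each = 20))
x[1:5, y == "case"] <- x[1:5, y == "case"] + 1
fit <- pamr.train(list(x = x, y = y))
# The shrinkage threshold (Delta in the text) controls how many bins
# keep a non-zero contribution to the class centroids:
pamr.predict(fit, x, threshold = 1.0)      # predicted class labels
```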

Partial Least Squares-Discriminant Analysis. Partial Least Squares-Discriminant Analysis (PLS-DA) is widely used in metabolomics,4,33,34 but less frequently in other -omics fields. PLS-DA linearly condenses features into a small set of mutually orthogonal latent variables that are subsequently used in a low-dimensional classification approach. The functions we employed for classification using PLS-DA can be found in the R-library caret.35 In this approach, a PLS-DA model is fitted to the training data, and the class probabilities for an unseen sample are estimated from the PLS-DA model via the softmax function.
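A corresponding caret sketch is shown below; `ncomp = 4` is an example value, whereas in the study the number of latent variables was itself calibrated in the nested cross-validation.

```r
library(caret)  # provides plsda(), cf. ref 35 (requires the pls package)

set.seed(1)
X <- matrix(rnorm(40 * 100), nrow = 40)
y <- factor(rep(c("case", "control"), each = 20))
X[y == "case", 1:5] <- X[y == "case", 1:5] + 1
fit <- plsda(X, y, ncomp = 4, probMethod = "softmax")
predict(fit, X[1:3, ], type = "prob")   # softmax class probabilities
```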

Random Forests. Random Forests are used in many different applications ranging from metabolomics to economics.10 A Random Forest is an ensemble of tree predictors. Each tree is constructed from a different bootstrap sample of the training data, and the split at each node is based on a random selection of input variables. The class labels ascribed to unseen samples are the result of a majority vote. For the classification of the datasets, we used the R-library randomForest.36
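A minimal randomForest sketch follows, using the two tuning parameters reported later in the calibration overview (Ntree and mtry); the values shown are examples on simulated data.

```r
library(randomForest)  # cf. ref 36

set.seed(1)
X <- matrix(rnorm(40 * 100), nrow = 40)
y <- factor(rep(c("case", "control"), each = 20))
X[y == "case", 1:5] <- X[y == "case", 1:5] + 1
fit <- randomForest(X, y, ntree = 300, mtry = 6)
fit$confusion            # out-of-bag confusion matrix
predict(fit, X[1:3, ])   # majority-vote class labels
```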

Support Vector Machines. Support Vector Machines (SVM) are a very popular family of classifiers that belong to the so-called large-margin classifiers, a term stemming from the maximization of the distance between the different classes of the training data.11 The decision surface is a hyperplane constructed in a high-dimensional vector space. For the calculation of the separating hyperplane, SVMs incorporate kernel functions, which implicitly provide a mapping of the data to a high-dimensional space. The cost of training errors and the kernel function are controlled by parameters that need to be optimized by suitable methods. Here, we used both linear and radial basis function kernels. Note that in the linear case, SVMs are closely related to the Elastic Net without the L1 penalty. SVM-based classifications were implemented using the R-library e1071.37
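The sketch below contrasts the two kernels on simulated data; the cost and γ values are examples, and the explicit weight vector shown for the linear kernel is the quantity that the R-SVM procedure (Variable Selection section) ranks features by.

```r
library(e1071)  # cf. ref 37

set.seed(1)
X <- matrix(rnorm(40 * 100), nrow = 40)
y <- factor(rep(c("case", "control"), each = 20))
X[y == "case", 1:5] <- X[y == "case", 1:5] + 1
svm_lin <- svm(X, y, kernel = "linear", cost = 1)
svm_rad <- svm(X, y, kernel = "radial", cost = 8, gamma = 2^-6)
# For the linear kernel, the separating hyperplane's weight vector is
# recoverable explicitly from the support vectors:
w <- t(svm_lin$coefs) %*% svm_lin$SV
table(predicted = predict(svm_rad, X), true = y)
```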

Top Scoring Pairs. The Top Scoring Pairs classifier12 leaves all model flexibility to feature selection and almost none to parameter fitting. Only two different features are selected, and classification is based on thresholding their ratio without any further weighting. Surprisingly, this extreme and simplistic concept performs well in whole-genome expression analysis. We used the R-library tspair.38 The Top Scoring Pairs classifier possesses an inherent feature selection; however, initial tests yielded unsatisfactory results. Therefore, we combined it with t-test based feature selection.
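For illustration, a base-R version of the pair score follows. This is our compact reading of the TSP idea rather than the tspair implementation, and the O(p^2) pair search is only feasible after the t-test pre-filtering mentioned above.

```r
# Score each feature pair (i, j) by how strongly the probability of
# observing x_i < x_j differs between the two classes; the top pair
# then classifies a new sample by which ordering it exhibits.
tsp_score <- function(X, y) {
  cls <- levels(y); p <- ncol(X)
  best <- c(score = 0, i = NA, j = NA)
  for (i in 1:(p - 1)) for (j in (i + 1):p) {
    lt <- X[, i] < X[, j]
    s  <- abs(mean(lt[y == cls[1]]) - mean(lt[y == cls[2]]))
    if (s > best["score"]) best <- c(score = s, i = i, j = j)
  }
  best
}
```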

Variable Selection

For the purpose of variable selection, we used two different approaches to achieve an optimal balance between over- and under-fitting. The first one, R-SVM,14,39 is used in the analysis of mass spectrometry and microarray data. A recursive procedure calculates the contribution of each feature to the definition of the separating hyperplane in a linear SVM model. Starting from all features, the features are ranked according to their contribution and recursively eliminated from the classification process until the optimal set of features is obtained. For this, R-SVM performs an internal cross-validation. For the evaluation of R-SVM, we used the R-code of Lu (2005).
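The recursion can be sketched in a few lines. This is a deliberately simplified reading of the published R-SVM code (halving the feature set by squared weight), omitting the internal cross-validation that chooses the final feature number.

```r
library(e1071)

# Simplified recursive elimination in the spirit of R-SVM (refs 14, 39):
# fit a linear SVM, rank bins by squared weight, keep the better half.
rsvm_rank <- function(X, y, keep_min = 2) {
  active <- seq_len(ncol(X))
  while (length(active) > keep_min) {
    fit <- svm(X[, active, drop = FALSE], y, kernel = "linear")
    w   <- as.vector(t(fit$coefs) %*% fit$SV)   # hyperplane weights
    ord <- order(w^2, decreasing = TRUE)
    active <- active[ord[seq_len(ceiling(length(active) / 2))]]
  }
  active   # indices of the surviving spectral bins
}
```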

The second method employs a filtering of features based on t-values. In brief, the features are first ranked in the middle loop of the nested cross-validation scheme by means of a t-test. Starting with the most discriminative feature, more and more features are included successively, using a step size of one, until the optimal predictive results are obtained for the test samples of the middle loop. Next, employing the t-test based ranking scheme again, the previously obtained optimal number of features is selected in the outermost loop of the cross-validation.
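A sketch of the ranking step is given below; in the full procedure this ranking is recomputed on the training data of every split, and the cut-off is chosen by the middle loop.

```r
# Rank spectral bins by the absolute two-sample t statistic computed
# on training data only; the most discriminative bins come first.
t_rank <- function(X, y) {
  cls <- levels(y)
  tvals <- apply(X, 2, function(bin)
    abs(t.test(bin[y == cls[1]], bin[y == cls[2]])$statistic))
  order(tvals, decreasing = TRUE)
}
# Feature sets of size 1, 2, 3, ... are then taken from the top of
# this ranking until the middle-loop accuracy stops improving.
```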

Parameter Calibration and Cross-Validation

All parameters used in each algorithm, including the number of selected features, were individually calibrated in the inner loops of a nested cross-validation. Note that for each split of the outer cross-validation loop, a different set of parameters was obtained by the inner cross-validation. Therefore, an overview of the average values obtained for the involved parameters is given below for every combination of feature selection and classification algorithm. We use the following notation: "A+B" denotes the combination of feature selection method "A" and classifier "B".

• Elastic Net: The penalty parameter λ was optimized to 0.00100, 0.00055, 0.00001, 1.00000, and 0.00100 for the ADPKD, GN, RT, AKI, and DC datasets, respectively.

• Nearest Shrunken Centroids: The shrinkage parameter Δ controls the number of selected features, which are given in Table 3.

• PLS-DA: The best classification accuracy was achieved with 4, 5, 5, 3, and 5 latent variables for the ADPKD, GN, RT, AKI, and DC datasets, respectively.

• R-SVM+SVM(linear): The cost parameter was optimized to C = 4.0, 0.5, 2^-4, 16.0, and 0.25 for the ADPKD, GN, RT, AKI, and DC datasets, respectively.

• R-SVM+SVM(radial): Cost values of C = 8.0, 2.0, 2.0, 8.0, and 16.0 and kernel widths of γ = 2^-5, 0.046875, 2^-4, 2^-6, and 2^-6 were obtained for the ADPKD, GN, RT, AKI, and DC datasets, respectively.

• t-test+PLS-DA: The best classification accuracy was achieved with 7, 3, 6, 7, and 5 latent variables for the ADPKD, GN, RT, AKI, and DC datasets, respectively.

• t-test+RF: Numbers of trees of Ntree = 300, 200, 200, 300, and 200 and numbers of variables per node of mtry = 6, 1, 1, 2, and 3.5 were obtained for the ADPKD, GN, RT, AKI, and DC datasets, respectively.

• t-test+SVM(linear): The cost parameter was optimized to C = 0.125, 0.25, 2^-4, 1.0, and 1.25 for the ADPKD, GN, RT, AKI, and DC datasets, respectively.

• t-test+SVM(radial): Cost values of C = 8.0, 2.0, 1.0, 6.0, and 6.0 and kernel widths of γ = 2^-6, 2^-4, 2^-4, 2^-4, and 2^-5 were obtained for the ADPKD, GN, RT, AKI, and DC datasets, respectively.

• t-test+Top Scoring Pairs: As described above, this classifier needs no tuning of hyperparameters.


Results

Table 1 gives the classification accuracies obtained for the fingerprinting datasets employing the different combinations of classification algorithms and feature selection methods. Of the different combinations, t-test+RF yielded the best average rank (2.6), with accuracy values of 77%, 98%, 95%, 84%, and 97% for the ADPKD, GN, RT, AKI, and DC datasets, respectively. With an average rank of 3.2, t-test+SVM(radial) also performed well. The classifiers with the worst average ranks were R-SVM+SVM(radial), with a value of 8.6, and t-test+Top Scoring Pairs, with a value of 7.4. Analysis of the corresponding ranks listed in Table 1 shows that t-test+RF and t-test+SVM(radial) performed well on most of the datasets, while R-SVM+SVM(radial) performed poorly in most cases. In contrast, the dataset used had a considerable impact on the performance of PLS-DA and, to a lesser extent, of Elastic Net, which achieved average ranks of 5.0 and 3.6, respectively.

The performance of the different algorithms was also evaluated based on receiver operating characteristic (ROC) curves (Figure 2). Table 2 shows the calculated area under the curve (AUC) values for the classification of the datasets. With regard to AUC, t-test+SVM(radial) performed best on average, with AUC values of 0.938, 1.000, 0.943, 0.879, and 0.967 for the ADPKD, GN, RT, AKI, and DC datasets, respectively (average rank 3.2). With an average rank of 3.4, t-test+RF also performed well in terms of AUC. The worst performing algorithm was R-SVM+SVM(radial), with AUC values of 0.693, 0.770, 0.930, 0.773, and 0.503 for the ADPKD, GN, RT, AKI, and DC datasets, respectively (average rank 9.2).

Table 3 lists the average numbers of selected features per method. Average values are given because the optimum number of features and the actual feature selection are determined individually for each split into training and test data of the outer cross-validation loop. In addition, for the ADPKD dataset, the number of selected features identical to those selected by the best performing method (t-test+RF) is given. With the exception of Elastic Net, the different methods tested predominantly selected features also used by t-test+RF. For the Elastic Net, only 10 of the 21 selected features were identical to those of the best performing method. In contrast, large variations between the different methods were observed in the number of features used. The TSP classifier, for example, uses by definition only two features, whereas PLS-DA takes all of them into account. However, the number of selected features per se is not a quality criterion for a given algorithm.

Next, we analyzed the chemical identity of the features selected from the ADPKD dataset by the different approaches. As Table 4 shows, all methods, with the exception of PLS-DA, which used all features, selected their variables from a common pool of 67 of the total 701 features. Of the most frequently selected features, none was chosen by all methods, while 4, 11, and 20 features were used by 9, 8, and 7 methods, respectively. The penultimate column of Table 4 gives the approximate intensities of the NMR signals contributing to the different features. Metabolite assignments, performed as described previously,1 are given in the last column.

Finally, we evaluated whether certain algorithms are prone to produce high accuracy values merely by chance. To that end, the statistical significance of the obtained classification accuracies was estimated.40 We randomly permuted the class labels of the training data prior to training, thereby removing any correlation between label and data. Consequently, no real biology can be learned from these randomized data; nevertheless, good classification accuracies can occur by chance. Repeating the whole cross-validation procedure with the re-labeled data gives the distribution of random classification accuracies. Comparing the diagnostic accuracy obtained for the original non-permuted data with this distribution yields the significance of the accuracy. From the GN dataset, 1000 differently permuted datasets were generated before re-testing all investigated algorithms. The parameters and numbers of features used by each algorithm were fixed to the average optimal values obtained previously for the non-permuted data. The results are given in Table 1. All algorithms gave average classification accuracies slightly below 50% on the permuted data. None of the classifiers, when trained on the 1000 randomized datasets, produced prediction accuracies equal to or higher than those obtained on the non-permuted data. Therefore, the probability of obtaining the results for the non-permuted dataset by pure chance was estimated to be less than 0.1% (p < 0.001).
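The permutation scheme can be condensed as follows; `cv_accuracy` is a placeholder for any of the evaluated pipelines, and the p-value estimator shown is the standard conservative (k+1)/(n+1) form, consistent with the p < 0.001 bound reported above for 1000 permutations.

```r
# Label-permutation test: shuffling y breaks the label-data link, so
# repeated CV on permuted data yields the null accuracy distribution.
perm_test <- function(X, y, cv_accuracy, n_perm = 1000) {
  observed <- cv_accuracy(X, y)
  null_acc <- replicate(n_perm, cv_accuracy(X, sample(y)))
  p <- (1 + sum(null_acc >= observed)) / (1 + n_perm)
  list(observed = observed, p.value = p)
}
```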


Discussion

The aim of this work was to systematically assess the applicability to proton NMR fingerprinting data of a set of classification and feature selection algorithms commonly used for microarray-based gene expression and metabolomic data analysis. The choice of algorithms was also driven by the ease of subsequent identification of group-separating variables. For that reason, methods such as Neural Networks and Self-Organizing Maps were not considered. Obviously, the selection of classification algorithms compared is far from exhaustive, and the same applies to the data-driven feature selection strategies evaluated, other representatives of which include genetic algorithms41 and statistical scores such as the Mann-Whitney score.42 Nevertheless, the study provides, to our knowledge, the first systematic evaluation of feature selection and classification algorithms in the analysis of high-dimensional NMR fingerprinting data.

From the results obtained, it is obvious that, on average, t-test+RF performed best with regard to prediction accuracy. Considering the AUC values, t-test+SVM(radial) showed the best overall performance across datasets. However, none of the algorithms tested performed best in all instances. In agreement with our results, excellent performance of t-test based feature selection has been observed previously in a systematic comparison of eight different feature selection methods on four different gene expression datasets.13 In an evaluation of different classification algorithms on metabolomic data, Eisner et al. (2011)7 ranked an SVM-based classifier top among eight algorithms that were evaluated on a comparatively low-dimensional set of 63 pre-selected urinary metabolites. In contrast to the contribution presented here, Eisner et al. had not included Random Forests and tested only an SVM with a linear kernel. Based on the work of Hsu et al. (2003),17 a radial basis function kernel would probably have yielded improved accuracy, provided that proper tuning of the kernel and cost parameters had been performed. In comparison to a linear kernel,1 use of a radial basis function kernel improved in our hands the prediction accuracy for the ADPKD dataset from 85% to 90%. Analysis of the ADPKD data further revealed that the features selected by t-test+SVM(radial) and t-test+RF included mostly variables also selected by other methods, with the exception of the Elastic Net (Table 4).

The classifiers with the worst average ranks and correspondingly low prediction accuracies across all datasets tested were t-test+Top Scoring Pairs, R-SVM+SVM(linear), and R-SVM+SVM(radial). While these methods shared variables with the better performing methods, their comparatively restrictive selection of features apparently eliminated variables required for a good group separation. In addition, the features selected by R-SVM on the ADPKD dataset corresponded mostly to signals of macromolecules of relatively weak intensity (Table 4), which might lead to unstable prediction results. At least for the metabolomic datasets investigated here, R-SVM was not well suited for feature selection. Top Scoring Pairs used only two features, the pair with the best internal score. While intriguingly simple, in its current form Top Scoring Pairs appears not to be well suited for the analysis of metabolomic data. A possible alternative might be the k-Top Scoring Pairs classifier.43 A classifier that showed average performance across datasets was Nearest Shrunken Centroids, finishing neither among the best nor among the worst performing methods.

The impact of the investigated dataset on the performance of the classifiers is obvious in the case of PLS-DA. It performed very well on the GN and RT datasets, while for the other datasets only average performance was achieved. For the ADPKD, AKI, and DC datasets, its performance improved when combined with t-test feature selection, while the opposite was true for the GN and RT data. Consequently, no general recommendation can be made on the benefit of combining PLS-DA with a preceding feature selection step.


On average, t-test+PLS-DA and Elastic Net performed well in our hands, albeit inferior to t-test+SVM(radial) and t-test+RF. Interestingly, as far as the ADPKD dataset is concerned, Elastic Net required on average only half the number of features of t-test+SVM(radial), while yielding the second best prediction accuracy. In addition, as can be seen from Table 4, only 10 of the features used by Elastic Net were identical to those selected by t-test+SVM(radial). Table 4 also reveals that 11 of the variables selected by Elastic Net were not used by any other method, except PLS-DA, which used all features. In contrast to most other methods, predictions performed by Elastic Net relied mostly on signals originating from low-molecular-weight metabolites and only to a minor extent on signals of macromolecular origin. A detailed analysis of the prediction model obtained for the ADPKD dataset showed that Elastic Net not only selected features that separated medicated patients (group 1A) from healthy controls (group 2), but also took into account features that discriminated non-medicated patients (group 1B) from healthy controls.1 The latter differences were rather small and, thus, often ignored by other algorithms, which compromised their classification performance for early disease stages.

Further inspection of Table 4 revealed several compounds of exogenous origin, i.e., theobromine, catechol, and salicylic acid. Examination of the patient data showed that none of these substances had been given to ADPKD patients as part of their regular medication. Therefore, they are most likely of dietary origin. Theobromine, for example, is found in chocolate and cocoa, while fruits, vegetables, spices, and nuts are dietary sources of salicylic acid.

From Table 4 it is obvious that the choice of feature selection and classification algorithm affects the number and nature of the compounds identified, which in turn may lead to different conclusions regarding the metabolic pathways involved in the origin and progression of disease. In this context, it is important to keep in mind that measures of the abundance of metabolites in body fluids and tissues of affected individuals and controls carry only limited informative value with regard to the underlying pathobiochemical mechanisms, the identification of which usually demands the targeted perturbation of metabolic pathways and the use of additional modeling algorithms.

Generally, the classifiers performed better on the GN and RT datasets than on the others. Apart from the fact that different diseases and metabolic conditions were investigated, it cannot be ruled out that the publicly available NMR spectra of the 25 controls and 25 severe GN cases, which constituted only a fraction of the complete dataset of 162 samples,26 were the most representative of either group. The results for the AKI dataset, consisting of 33 AKI patients and 72 non-AKI patients, were in most cases slightly less accurate. Typically, the severity of AKI is defined by the increase in serum creatinine over baseline as stage I (increase >50%), II (>100%), and III (>200%), respectively.44 Of the 33 AKI patients included in the dataset, 25 had been classified as AKIN stage 1, while only three and five had been classified as AKIN stages 2 and 3, respectively. Therefore, the predominance of patients presenting with only slightly impaired kidney function might explain the worse performance of the classifiers on this dataset. Nevertheless, despite the obvious differences between the five datasets, in most cases the same methods were ranked top and bottom, respectively. The ROC curves (Figure 2) show that, irrespective of the dataset used, the SVM-based methods in combination with t-test based feature selection performed rather similarly.


Conclusions

In conclusion, feature selection via t-test improved sample classification performance for all datasets. However, this does not guarantee that sequential t-test based feature filtering will always find the optimal set of predictive variables. The internal variable selection of Elastic Net also performed well. Across the datasets investigated, t-test+SVM(radial) and t-test+RF showed the best performance in terms of prediction accuracy and AUC values. Advantages of Support Vector Machine based approaches include the finding of a global minimum and a simple geometric interpretation.11 The main advantages of the Random Forests methodology are its robustness against over-fitting and its suitability for parallel computing.10,36

Acknowledgements

The authors thank Drs. Raoul Zeltner, Bernd-Detlef Schulze and Kai-Uwe Eckardt for providing the urine specimens used for generating the ADPKD dataset, as well as Drs. Dominic Niemeyer, Andrew Fisher and Drewe Ferguson for providing the serum samples used to generate the sheep transport dataset. The authors also thank Drs. Gunnar Schley, Carsten Willam and Kai-Uwe Eckardt for providing the plasma samples of the AKI dataset, as well as Drs. Gregor Sigl, Steffi Wiedemann and the late Heinrich H. D. Meyer for providing the milk samples of the dairy cow dataset. The authors are grateful to Dr. Claudio Lottaz for helpful discussions regarding statistics and to Ms. Nadine Nürnberger and Ms. Caridad Louis for assistance in sample preparation.

Funding

Financial support was provided by the German Federal Ministry of Education and Research (Grant no. 01 ER 0821) and the German Research Foundation (KFO262). Additional funding came from BayGene and the ReForM program of the Medical Faculty at the University of Regensburg. The authors of this study declare that they do not have anything to disclose regarding funding from industry or conflict of interest with respect to this manuscript.


References

(1) Gronwald, W.; Klein, M. S.; Zeltner, R.; Schulze, B.-D.; Reinhold, S. W.; Deutschmann, M.; Immervoll, A.-K.; Böger, C. A.; Banas, B.; Eckardt, K.-U.; Oefner, P. J. Detection of Autosomal Polycystic Kidney Disease Using NMR Spectroscopic Fingerprints of Urine. Kidney Int. 2011, 79, 1244-1253.
(2) Holmes, E.; Foxall, P. J. D.; Spraul, M.; Farrant, R. D.; Nicholson, J. K.; Lindon, J. C. 750 MHz 1H NMR Spectroscopy Characterisation of the Complex Metabolic Pattern of Urine from Patients with Inborn Errors of Metabolism: 2-Hydroxyglutaric Aciduria and Maple Syrup Urine Disease. J. Pharm. Biomed. Anal. 1997, 15, 1647-1659.
(3) Kohl, S. M.; Klein, M. S.; Hochrein, J.; Oefner, P. J.; Spang, R.; Gronwald, W. State-of-the-Art Data Normalization Methods Improve NMR-Based Metabolomic Analysis. Metabolomics 2012, 8, 146-160.
(4) Wishart, D. S. Computational Approaches to Metabolomics. Methods Mol. Biol. 2010, 593, 283-313.
(5) Barker, M.; Rayens, W. Partial Least Squares for Discrimination. J. Chemometrics 2003, 17, 166-173.
(6) Dudoit, S.; Fridlyand, J.; Speed, T. P. Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. J. Am. Stat. Assoc. 2002, 97, 77-87.
(7) Eisner, R.; Stretch, C.; Eastman, T.; Xia, J.; Hau, D.; Damaraju, S.; Greiner, R.; Wishart, D. S.; Baracos, V. E. Learning to Predict Cancer-Associated Skeletal Muscle Wasting from 1H-NMR Profiles of Urinary Metabolites. Metabolomics 2011, 7, 25-34.
(8) Zou, H.; Hastie, T. Regularization and Variable Selection via the Elastic Net. J. R. Stat. Soc. B 2005, 67, 301-320.
(9) Tibshirani, R.; Hastie, T.; Narasimhan, B.; Chu, G. Diagnosis of Multiple Cancer Types by Shrunken Centroids of Gene Expression. Proc. Natl. Acad. Sci. USA 2002, 99, 6567-6572.
(10) Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5-32.
(11) Burges, C. J. C. A Tutorial on Support Vector Machines for Pattern Recognition. Data Min. Knowl. Disc. 1998, 2, 121-167.
(12) Geman, D.; d'Avignon, C.; Naiman, D. Q.; Winslow, R. L. Classifying Gene Expression Profiles from Pairwise mRNA Comparisons. Stat. Appl. Genet. Mol. Biol. 2004, 3, Article 19.
(13) Haury, A.-C.; Gestraud, P.; Vert, J.-P. The Influence of Feature Selection Methods on Accuracy, Stability and Interpretability of Molecular Signatures. PLoS ONE 2011, 6, e28210.
(14) Zhang, X.; Lu, X.; Shi, Q.; Xu, X. Q.; Leung, H. C.; Harris, L. N.; Iglehart, J. D.; Miron, A.; Liu, J. S.; Wong, W. H. Recursive SVM Feature Selection and Sample Classification for Mass-Spectrometry and Microarray Data. BMC Bioinformatics 2006, 7, 197.
(15) Ambroise, C.; McLachlan, G. Selection Bias in Gene Extraction on the Basis of Microarray Gene-Expression Data. Proc. Natl. Acad. Sci. USA 2002, 99, 6562-6566.
(16) Varma, S.; Simon, R. Bias in Error Estimation when Using Cross-Validation for Model Selection. BMC Bioinformatics 2006, 7, 91.
(17) Hsu, C.-W.; Chang, C.-C.; Lin, C.-J. A Practical Guide to Support Vector Classification. http://www.csie.ntu.edu.tw/~cjlin/libsvm/ 2003.
(18) Forshed, J.; Schuppe-Koistinen, I.; Jacobsson, S. P. Peak Alignment of NMR Signals by Means of a Genetic Algorithm. Anal. Chim. Acta 2003, 487, 189-199.
(19) Stoyanova, R.; Nicholls, A. W.; Nicholson, J. K.; Lindon, J. C.; Brown, T. R. Automatic Alignment of Individual Peaks in Large High-Resolution Spectral Data Sets. J. Magn. Reson. 2004, 170 (2), 329-335.
(20) Cloarec, O.; Dumas, M.-E.; Craig, A.; Barton, R.; Trygg, J.; Hudson, J.; Blancher, C.; Gauguier, D.; Lindon, J. C.; Holmes, E.; Nicholson, J. K. Statistical Total Correlation Spectroscopy: An Exploratory Approach for Latent Biomarker Identification from Metabolomic 1H NMR Data Sets. Anal. Chem. 2005, 77, 1282-1289.
(21) Cloarec, O.; Dumas, M. E.; Trygg, J.; Craig, A.; Barton, R. H.; Lindon, J. C.; Nicholson, J. K.; Holmes, E. Evaluation of the Orthogonal Projection on Latent Structure Model Limitations Caused by Chemical Shift Variability and Improved Visualization of Biomarker Changes in 1H NMR Spectroscopic Metabonomic Studies. Anal. Chem. 2005, 77, 517-526.
(22) Weljie, A. M.; Newton, J.; Mercier, P.; Carlson, E.; Slupsky, C. M. Targeted Profiling: Quantitative Analysis of 1H NMR Metabolomics Data. Anal. Chem. 2006, 78, 4430-4442.
(23) Lewis, I. A.; Schommer, S. C.; Hodis, B.; Robb, K. A.; Tonelli, M.; Westler, W. M.; Sussman, M. R.; Markley, J. L. Method for Determining Molar Concentrations of Metabolites in Complex Solutions from Two-Dimensional 1H-13C NMR Spectra. Anal. Chem. 2007, 79, 9385-9390.
(24) Gronwald, W.; Klein, M. S.; Kaspar, H.; Fagerer, S.; Nürnberger, N.; Dettmer, K.; Bertsch, T.; Oefner, P. J. Urinary Metabolite Quantification Employing 2D NMR Spectroscopy. Anal. Chem. 2008, 80, 9288-9297.
(25) Huber, W.; Heydebreck, A. V.; Sültmann, H.; Poustka, A.; Vingron, M. Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression. Bioinformatics 2002, 18, S96-S104.
(26) Psihogios, N. G.; Kalaitzidis, R. G.; Dimou, S.; Seferiadis, K. I.; Siamopoulos, K. C.; Bairaktari, E. T. Evaluation of Tubulointerstitial Lesions' Severity in Patients with Glomerulonephritides: An NMR-Based Metabonomic Study. J. Proteome Res. 2007, 6, 3760-3770.
(27) Li, J.; Wijffels, G.; Yu, Y.; Nielsen, L. K.; Niemeyer, D. O.; Fisher, A. D.; Ferguson, D. M.; Schirra, H. J. Altered Fatty Acid Metabolism in Long Duration Road Transport: An NMR-Based Metabonomics Study in Sheep. J. Proteome Res. 2011, 10, 1073-1087.
(28) Klein, M. S.; Almstetter, M.; Schlamberger, G.; Nürnberger, N.; Dettmer, K.; Oefner, P. J.; Meyer, H. H. D.; Wiedemann, S.; Gronwald, W. Nuclear Magnetic Resonance and Mass Spectrometry-Based Milk Metabolomics in Dairy Cows During Early and Late Lactation. J. Dairy Sci. 2010, 93, 1539-1550.
(29) Madsen, R.; Lundstedt, T.; Trygg, J. Chemometrics in Metabolomics - A Review in Human Disease Diagnosis. Anal. Chim. Acta 2010, 659, 23-33.
(30) Truong, Y.; Lin, X.; Beecher, C. Learning a Complex Metabolomic Data Set Using Random Forests and Support Vector Machines. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2004, 835-840.
(31) Zou, H.; Hastie, T. elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA. http://www.stat.umn.edu/~hzou 2008.
(32) Hastie, T.; Tibshirani, R.; Narasimhan, B.; Chu, G. pamr: Pam: Prediction Analysis for Microarrays. http://CRAN.R-project.org/package=pamr 2010.
(33) Dieterle, F.; Riefke, B.; Schlotterbeck, G.; Ross, A.; Senn, H.; Amberg, A. NMR and MS Methods for Metabolomics. Methods Mol. Biol. 2011, 691, 385-415.
(34) Holmes, E.; Antti, H. Chemometric Contributions to the Evolution of Metabonomics: Mathematical Solutions to Characterising and Interpreting Complex Biological NMR Spectra. Analyst 2002, 127, 1549-1557.
(35) Kuhn, M. Building Predictive Models in R Using the caret Package. J. Stat. Softw. 2008, 28, 1-26.
(36) Liaw, A.; Wiener, M. Classification and Regression by randomForest. R News 2002, 2/3, 18-22.
(37) Dimitriadou, E.; Hornik, K.; Leisch, F.; Meyer, D.; Weingessel, A. e1071: Misc Functions of the Department of Statistics (e1071), TU Wien. http://CRAN.R-project.org/package=e1071 [R package version 1.5-24] 2010.
(38) Leek, J. The tspair Package for Finding Top Scoring Pair Classifiers in R. Bioinformatics 2009, 25, 1203-1204.
(39) Lu, X. R-SVM: Recursive Sample Classification and Gene Selection with SVM for Microarray Data. http://www.stanford.edu/group/wonglab/RSVMpage/R-SVM.html 2005.
(40) Mukherjee, S.; Golland, P.; Panchenko, P. Permutation Tests for Classification. Massachusetts Institute of Technology 2003, AI Memo 2003-19.
(41) Jirapech-Umpai, T.; Aitken, S. Feature Selection and Classification for Microarray Data Analysis: Evolutionary Methods for Identifying Predictive Genes. BMC Bioinformatics 2005, 6, 148.
(42) Kote-Jarai, Z.; Matthews, L.; Osorio, A.; Shanley, S.; Giddings, I.; Moreews, F.; Locke, I.; Evans, D.; Eccles, D.; Collaborators, T. C. C.; Williams, R.; Girolami, M.; Campbell, C.; Eeles, R. Accurate Prediction of BRCA1 and BRCA2 Heterozygous Genotype Using Expression Profiling after Induced DNA Damage. Clin. Cancer Res. 2006, 12, 3896-3901.
(43) Tan, A. C.; Naiman, D. Q.; Xu, L.; Winslow, R. L.; Geman, D. Simple Decision Rules for Classifying Human Cancers from Gene Expression Profiles. Bioinformatics 2005, 21, 3896-3904.
(44) Bellomo, R.; Ronco, C.; Kellum, J. A.; Mehta, R. L.; Palevsky, P.; ADQI Workgroup. Acute Renal Failure - Definition, Outcome Measures, Animal Models, Fluid Therapy and Information Technology Needs: The Second International Consensus Conference of the Acute Dialysis Quality Initiative (ADQI) Group. Critical Care 2004, 8, R204-R212.


Tables

Table 1. Classification performance in terms of prediction accuracy.

Classifier                 | ADPKD(1,5) | GN(1,6) | RT(1,7) | AKI(1,8) | DC(1,9) | ADPKD(2,5) | GN(2,6) | RT(2,7) | AKI(2,8) | DC(2,9) | Average rank(3) | Accuracy(4) (random GN)
Elastic Net                |  85 |  96 | 96 | 71 | 97 |  2 |  3 |  2 | 10 |  1 | 3.6 | 47
Nearest Shrunken Centroids |  77 |  96 | 93 | 74 | 91 |  5 |  3 |  8 |  7 |  4 | 5.4 | 48
PLS-DA                     |  72 | 100 | 98 | 74 | 69 |  8 |  1 |  1 |  7 |  8 | 5.0 | 49
R-SVM+SVM(lin.)            |  74 |  84 | 95 | 79 | 49 |  7 |  9 |  4 |  3 | 10 | 6.6 | 47
R-SVM+SVM(rad.)            |  69 |  72 | 85 | 78 | 51 |  9 | 10 | 10 |  5 |  9 | 8.6 | 45
t-test+PLS-DA              |  80 |  96 | 95 | 76 | 97 |  4 |  3 |  4 |  6 |  1 | 3.6 | 48
t-test+RF                  |  77 |  98 | 95 | 84 | 97 |  5 |  2 |  4 |  1 |  1 | 2.6 | 49
t-test+SVM(lin.)           |  85 |  94 | 96 | 74 | 91 |  2 |  7 |  2 |  7 |  4 | 4.4 | 49
t-test+SVM(rad.)           |  90 |  96 | 95 | 81 | 89 |  1 |  3 |  4 |  2 |  6 | 3.2 | 48
t-test+Top Scoring Pairs   |  67 |  88 | 89 | 79 | 74 | 10 |  8 |  9 |  3 |  7 | 7.4 | 50

Each value was computed from a nested leave-five-out cross-validation. (1) Columns 2 to 6 contain the prediction accuracies in % for each classifier. (2) Columns 7 to 11 list the respective rank of each classifier depending on the accuracy achieved. (3) Column 12 gives the average rank obtained across all investigated datasets. (4) The last column gives the average prediction accuracies obtained from the randomized GN data. (5) Human urinary ADPKD data; (6) human urinary glomerulonephritis (GN) data; (7) serum data from Australian sheep subjected to road transport (RT); (8) human plasma data from an unpublished study on acute kidney injury (AKI) after cardiac surgery; (9) milk data obtained from two breeds of dairy cows (DCs).
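The ranking underlying columns 7 to 12 is simple to reproduce. Below is a minimal R sketch, assuming ties share the best rank (note the two ADPKD accuracies of 85% both ranking 2nd) and using only the first three rows of Table 1 as example input; the published ranks are, of course, taken over all ten classifiers.

# Example accuracies taken from the first three rows of Table 1
acc <- rbind("Elastic Net"                = c(85, 96, 96, 71, 97),
             "Nearest Shrunken Centroids" = c(77, 96, 93, 74, 91),
             "PLS-DA"                     = c(72, 100, 98, 74, 69))
colnames(acc) <- c("ADPKD", "GN", "RT", "AKI", "DC")
ranks <- apply(-acc, 2, rank, ties.method = "min")  # rank 1 = highest accuracy
rowMeans(ranks)  # average rank across the five datasets (column 12)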


Table 2. Classification performance in terms of AUC values.

Classifier                 | ADPKD(1,3) | GN(1,4)  | RT(1,5) | AKI(1,6) | DC(1,7) | ADPKD(2,3) | GN(2,4) | RT(2,5) | AKI(2,6) | DC(2,7) | Average rank
Elastic Net                | 0.921 | 0.995 | 0.956 | 0.776 | 0.997 |  3 |  5 | 3 |  8 |  2 | 4.2
Nearest Shrunken Centroids | 0.807 | 0.984 | 0.923 | 0.760 | 0.983 |  7 |  7 | 9 | 10 |  4 | 7.4
PLS-DA                     | 0.806 | 1.000 | 0.978 | 0.816 | 0.706 |  8 |  1 | 1 |  6 |  7 | 4.6
R-SVM+SVM(lin.)            | 0.818 | 0.926 | 0.959 | 0.782 | 0.458 |  6 |  8 | 2 |  7 | 10 | 6.6
R-SVM+SVM(rad.)            | 0.693 | 0.770 | 0.930 | 0.773 | 0.503 | 10 | 10 | 8 |  9 |  9 | 9.2
t-test+PLS-DA              | 0.883 | 0.997 | 0.954 | 0.850 | 1.000 |  4 |  4 | 4 |  5 |  1 | 3.6
t-test+RF                  | 0.825 | 0.998 | 0.946 | 0.881 | 0.997 |  5 |  3 | 6 |  1 |  2 | 3.4
t-test+SVM(lin.)           | 0.929 | 0.989 | 0.954 | 0.861 | 0.954 |  2 |  6 | 4 |  4 |  6 | 4.4
t-test+SVM(rad.)           | 0.938 | 1.000 | 0.943 | 0.879 | 0.967 |  1 |  1 | 7 |  2 |  5 | 3.2
t-test+Top Scoring Pairs   | 0.774 | 0.880 | 0.946 | 0.866 | 0.571 |  9 |  9 | 6 |  3 |  8 | 7.0

Each value was computed from a nested leave-five-out cross-validation. (1) Columns 2 to 6 show the AUC values. (2) Columns 7 to 11 show the respective rank of each classifier depending on AUC; the corresponding ROC curves are given in Figure 2. (3) Human urinary ADPKD data; (4) human urinary glomerulonephritis (GN) data; (5) serum data from Australian sheep subjected to road transport (RT); (6) human plasma data from an unpublished study on acute kidney injury (AKI) after cardiac surgery; (7) milk data obtained from two breeds of dairy cows (DCs).
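AUC values of this kind can be computed directly from the rank-sum (Mann-Whitney) relation between the decision values of the two classes, with no dedicated package required. The sketch below is one such implementation with made-up decision values standing in for the pooled cross-validated predictions; it is not taken from the study's code.

# AUC via the Mann-Whitney rank-sum identity
auc <- function(scores, labels) {   # labels: TRUE = positive class
  r <- rank(scores)                 # average ranks handle tied scores
  n_pos <- sum(labels); n_neg <- sum(!labels)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc(scores = c(0.9, 0.8, 0.4, 0.35, 0.1),
    labels = c(TRUE, TRUE, FALSE, TRUE, FALSE))  # 0.833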


Table 3. Feature selection.

Classifier                 | Features(1) (ADPKD(3)) | Features(1) (GN(4)) | Features(1) (RT(5)) | Features(1) (AKI(6)) | Features(1) (DC(7)) | Identical with t-test+RF(2) (ADPKD(3))
Elastic Net                |  21   |  10  |  20   |  14 |   4 | 10
Nearest Shrunken Centroids |  14   |  18  | 484.5 |  12 |   1 | 13
PLS-DA                     | 701   | 200  | 881   | 863 | 838 | 45
R-SVM+SVM(linear)          |  26   |  11  |  27.5 |   5 |   6 | 20
R-SVM+SVM(radial)          |  22.5 |   6.5|   5   |   5 |   6 | 19
t-test+PLS-DA              |  45   |  20  |  39   | 113 |  12 | 45
t-test+RF                  |  45   |   7  |   9   |   7 |   6 | -
t-test+SVM(linear)         |  48.5 |   9.5|  70.5 |   2 |  15 | 45
t-test+SVM(radial)         |  45.5 |  20  |   6   |   2 |   5 | 45
t-test+Top Scoring Pairs   |   2   |   2  |   2   |   2 |   2 | 2

Each value was computed from a nested leave-five-out cross-validation. (1) Columns 2 to 6 give the average number of selected variables used for classification. (2) Column 7 gives the number of selected features identical to those selected by t-test+RF, which was on average the best-performing classifier across all datasets (calculated only for the ADPKD data). (3) Human urinary ADPKD data; (4) human urinary glomerulonephritis (GN) data; (5) serum data from Australian sheep subjected to road transport (RT); (6) human plasma data from an unpublished study on acute kidney injury (AKI) after cardiac surgery; (7) milk data obtained from two breeds of dairy cows (DCs).
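To illustrate the t-score-based filtering that precedes the t-test+ classifiers in the tables above, the following sketch pairs it with a Random Forest (the randomForest package) on simulated data. The cutoff of 45 bins mirrors the average feature count reported for t-test+RF on the ADPKD data; in the study itself, the number of retained features was tuned within the nested cross-validation rather than fixed.

library(randomForest)
set.seed(1)
# Simulated stand-in for a binned NMR fingerprint matrix (40 samples x 200 bins)
X <- matrix(rnorm(40 * 200), nrow = 40)
y <- factor(rep(c("case", "control"), each = 20))
# Rank bins by the absolute t-score of a two-sample t-test between the classes
t_scores <- apply(X, 2, function(bin)
  abs(t.test(bin[y == "case"], bin[y == "control"])$statistic))
keep <- order(t_scores, decreasing = TRUE)[1:45]  # retain the 45 top-ranked bins
rf <- randomForest(x = X[, keep], y = y)          # fit RF on the filtered bins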


Table 4. Selected features of the ADPKD dataset.

[The table body is a binary matrix with the columns: Feature (chemical shift of the bin in ppm); one 0/1 column per method (EN, NSC, PLSDA, RSVM(lin.), RSVM(rad.), t+PLSDA, t+RF, t+SVM(lin.), t+SVM(rad.), TSP); Sum; NMR intensity(1); and Metabolite(2). The matrix itself could not be recovered from the extracted text; only the metabolite assignments of the selected features are reproduced below.]

Metabolites assigned to the selected features include: tartaric acid; threonine/guanidinosuccinic acid(3); protein signals (numerous bins); 2-methylbutyrylglycine/3-methyl-2-oxovaleric acid/2-methylbutyroyl-carnitine(3); proteins/bile acids(3); formate; 6-hydroxynicotinic acid; sucrose; citrate; 3-hydroxy-isovalerate; catechol/salicylic acid(3); hydroxyacetone/N-acetyltyrosine(3); threonine; D-saccharate; D-saccharate/glucosan(3); carbohydrates; methanol; alanine; trigonelline; theobromine; dimethylamine/citrate(3); nicotinamide riboside; methylxanthine; tyrosine; dihydroxyacetone/N-acetyltyrosine(3); pseudouridine/threonine(3); histidine/1-methylhistidine/carnitine/D-xylose(3); methylmalonic acid/isethionic acid/piperidine/ethanolamine/1-methylhistidine(3); acetyl-L-carnitine; citrate/acetyl-L-carnitine(3); 3-hydroxy-3-methylglutaric acid/3-hydroxypropionic acid/glutamine(3); 4-guanidinobutyric acid; and several features without assignment (n.a.(4)).

Features selected by the various methods are marked by a "1". The column "Sum" counts how often a given feature was selected by the different methods; features are ordered according to these counts, and only features selected by at least two methods are listed. Abbreviations used: EN, Elastic Net; NSC, Nearest Shrunken Centroids; PLSDA, PLS-DA; RSVM(lin.), R-SVM+SVM(linear); RSVM(rad.), R-SVM+SVM(radial); t+PLSDA, t-test+PLS-DA; t+RF, t-test+RF; t+SVM(lin.), t-test+SVM(linear); t+SVM(rad.), t-test+SVM(radial); TSP, t-test+Top Scoring Pairs. (1) Given are the approximate average NMR intensities of the respective bins as determined by manual inspection: bins containing strong signals clearly above the noise level are marked by an S, bins with weak signals barely exceeding the noise by an M, and bins containing mostly noise by a W. (2) Metabolite assignments of a given feature. (3) Assignments where a signal could be attributed to more than one metabolite or where, in the case of very weak signals, an unambiguous assignment was not possible; if a feature could be assigned to more than one metabolite, all possible assignments are given. (4) Features for which no assignment was obtained are marked by n.a.
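The "Sum" column and the row ordering of Table 4 amount to a simple consensus count over the per-method selections. A small sketch with a hypothetical 0/1 selection matrix (the real matrix is the table body above):

# Hypothetical 0/1 selection matrix: rows = NMR bins, columns = methods
set.seed(1)
sel <- matrix(rbinom(60, 1, 0.4), nrow = 12,
              dimnames = list(paste0("bin_", 1:12), paste0("method_", 1:5)))
sums <- rowSums(sel)                          # the "Sum" column
consensus <- sel[sums >= 2, , drop = FALSE]   # keep bins chosen by >= 2 methods
consensus[order(rowSums(consensus), decreasing = TRUE), ]  # order by count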


Figure Legends

Figure 1. Schematic representation of a 3-fold nested cross-validation scheme. Each loop is represented by a bar of different length. The innermost loop, indicated by the shortest bar, was used to calibrate internal parameters, such as the cost parameter C and the kernel width γ in the case of SVMs with a radial basis function kernel. In the next loop, classifier sparsity was optimized, while the different algorithms were validated in the outermost loop. The data in each loop are split iteratively into training and test data (shown in grey and green, respectively) to ensure that all data of a given loop are used once for testing. The outermost loop, shown at the top of Figure 1, contains all data. The current training data of the outermost loop define the data of the middle loop, where they are again split iteratively into training and test data. The same procedure is repeated in the innermost loop. The arrows indicate the transfer of the optimized parameters from the innermost loop to the middle and outermost loops, respectively.
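A compressed R sketch of this splitting logic is given below. It uses simulated data in place of the real fingerprints and e1071's tune.svm as a stand-in for the two inner loops; it illustrates the principle and is not the study's actual implementation.

library(e1071)
set.seed(1)
X <- matrix(rnorm(40 * 100), nrow = 40)            # toy fingerprint matrix
y <- factor(rep(c("case", "control"), each = 20))
folds <- split(sample(seq_len(nrow(X))), rep(1:8, each = 5))  # leave-five-out
acc <- sapply(folds, function(test) {
  train <- setdiff(seq_len(nrow(X)), test)
  # inner loops: calibrate cost C and kernel width gamma on training data only
  tuned <- tune.svm(X[train, ], y[train], kernel = "radial",
                    cost = 2^(0:4), gamma = 2^(-7:-3))
  # outer loop: the held-out samples are used only for validation
  mean(predict(tuned$best.model, X[test, ]) == y[test])
})
mean(acc)  # cross-validated prediction accuracy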

Figure 2. Receiver operating characteristics for the different algorithms and datasets investigated. A = Elastic Net, B = Nearest Shrunken Centroids, C = PLS-DA, D = R-SVM+SVM(linear), E = R-SVM+SVM(radial), F = t-Test+PLS-DA, G = t-Test+RF, H = t-Test+SVM(linear), I = t-Test+SVM(radial), K = t-Test+Top Scoring Pairs. ROC curves for the GN, ADPKD, AKI, DC and RT datasets are indicated by orange, red, green, blue and black lines, respectively. The corresponding AUC values are given in Table 2.


Synopsis

The classification of samples is of great importance in biomedicine, and the choice of classification algorithm can have a considerable impact on the outcome. In this study, six binary classification algorithms were analyzed on metabolomic NMR fingerprints in combination with different strategies for data-driven feature selection. The best overall performance was obtained for Random Forests and for a Support Vector Machine with a radial basis function kernel, each in combination with t-score-based feature filtering.

[Figure 1 graphic: nested bars labeled "Validation", "Test", and "Training", with the annotations "Tune classifier sparsity, e.g. optimal number of features" and "Tune inherent parameters of classifier".]