Anal. Chem., Just Accepted Manuscript. DOI: 10.1021/acs.analchem.8b05820. Publication Date (Web): January 31, 2019.


Comprehensive and empirical evaluation of machine learning algorithms for small molecule LC retention time prediction

Robbin Bouwmeester,†,‡ Lennart Martens,†,‡ and Sven Degroeve∗,†,‡

†VIB-UGent Center for Medical Biotechnology, VIB, Ghent, Belgium
‡Department of Biochemistry, Ghent University, Ghent, Belgium
E-mail: [email protected]

Abstract

Liquid chromatography is a core component of almost all mass spectrometric analyses of (bio)molecules. Because of the high-throughput nature of mass spectrometric analyses, the interpretation of these chromatographic data increasingly relies on informatics solutions that attempt to predict an analyte's retention time. The key components of such predictive algorithms are the features these are supplied with, and the actual machine learning algorithm used to fit the model parameters. We have therefore evaluated the performance of seven machine learning algorithms on 36 distinct metabolomics data sets, using two distinct feature sets. Interestingly, the results show that no single learning algorithm performs optimally for all data sets, with different types of algorithms achieving top performance for different types of analytes or different protocols. Our results thus show that an evaluation of machine learning algorithms for retention time prediction is needed to find a suitable algorithm for specific analytes or protocols. Importantly, however, our results also show that blending different types of models together decreases the error on outliers, indicating that the


combination of several approaches holds substantial promise for the development of more generic, high-performing algorithms.

Introduction

Mass Spectrometry (MS) coupled to Liquid Chromatography (LC) is a popular technique for the high-throughput analysis of the metabolome and lipidome, as it separates analytes based on their physicochemical properties.1,2 This is important, because analytes compete for charges during ionization, leading to a strong bias against low abundant analytes.3 Moreover, LC is also capable of separating isobaric analytes, which plays a particularly important role in lipidomics.4 LC-coupling thus provides analyte information that is complementary to the mass-over-charge (m/z) measurement of the MS. In many cases, high performance LC is used, where a solvent (the mobile phase) is pumped over a column (the stationary phase) under high pressure. The time an analyte takes to travel across the column is then determined by the degree of analyte interaction with the stationary and mobile phases, respectively, and is called the retention time (tR). However, this LC retention time is usually not incorporated in the downstream analysis, because it is either unknown a priori, or only known a priori for a very specific experimental setup. To fill this knowledge gap, researchers typically use regression models to predict retention times for known metabolite structures. These tR predictions are based on a mapping between known structural, chemical and physical descriptors (or features) of the metabolites and their experimentally observed retention times (the targets). Such predictions have previously been applied in both targeted and untargeted MS experiments to aid the analysis of lipids,4 metabolites3,5,6 and peptides.7–11 For targeted MS the predicted tR has been applied to reduce the number of experiments needed to study specific analytes of interest,12 while in untargeted MS these predictions


have been used to differentiate between isobaric lipids,4 to filter false identifications for small metabolites (< 400 Da),5 and to increase the number of confidently identified peptides.13 As shown in the literature, retention time prediction can be a useful source of information for an MS experiment, but due to the many different experimental setups used across labs, the modeling procedure is not trivial.14,15 This is because differences in setup significantly influence retention times, thus resulting in non-transferable knowledge between setups. To alleviate this lack of transferability of models, calibration approaches between different setups have been published, but these are only applicable to a limited set of metabolites14 or setups.15 As a result, researchers generally fit a new model for every setup, which requires substantial effort and data each time. Moreover, multiple modeling decisions have to be made for each new model that is trained, which leads to substantial heterogeneity and possibly suboptimal modeling choices.

Retention time prediction of peptides is, in most models, based on the primary sequence of the peptide, with the first widely used models dating back to the 1980s.16 Models for peptide retention time prediction are still being developed,7–11 and the field has recently gained interest due to applications in data independent LC-MS.17 Prediction of the Gas Chromatography (GC) retention time for small molecules is well documented and often applied in downstream analysis due to its high accuracy.18,19 Here we describe the tR prediction of metabolites using a mapping between the experimental tR and physicochemical descriptors.

The regression model used for tR prediction is generally fitted using a machine learning algorithm, and there is a large variety of algorithms to choose from.20 For tR prediction, the support vector regression model is the most popular option,4,7,10,21–25 but other types of machine learning algorithms have been used as well, including linear regression, neural networks and random forest.3,26,27 The choice of algorithm is often guided by the researcher's existing experience with particular machine learning algorithms, which means that suboptimal approaches are often chosen by default. Indeed, many tR prediction publications


do not even justify the choice of their algorithm. The field would therefore benefit from a comprehensive overview of the performance of different machine learning algorithms for tR prediction, tested on a variety of experimental setups. We therefore evaluate here the performance of seven machine learning algorithms, applied to 36 distinct metabolomics data sets, using both a comprehensive feature set and a minimal feature set. Our results show that the choice of one machine learning algorithm over another can significantly influence performance, and that this choice depends on the characteristics of the data set.

Experimental section

Machine learning algorithms

A diverse set of seven machine learning algorithms is evaluated for their ability to compute accurate tR prediction models from relatively limited amounts of source data. These algorithms all produce a regression model that relies on molecular descriptors (features) to predict the tR (target). They were selected based on the fundamental differences in their prediction models, and on their popularity within both the LC and machine learning communities. Of these seven algorithms, two employ linear models, four employ non-linear models, and one features a hyperparameter that allows it to employ either a linear or a non-linear model.

Linear models assume the relation between features and target to be linear, and as such cannot capture more complex relations. However, these models are typically much more robust against overfitting the data. Two popular linear models were chosen, which differ mainly in the way the linear model parameters are fitted to the data: Bayesian Ridge Regression (BRR)28 and Least Absolute Shrinkage and Selection Operator regression (LASSO).29

In contrast, non-linear models are able to model more complex relationships between


features and target, but this flexibility comes at the cost of an increased risk of overfitting, especially on small data sets with many noisy features. Four such models were chosen here: a feed-forward Artificial Neural Network (ANN), and three decision tree ensemble models that differ in the way the decision trees are constructed: through Adaptive Boosting (AB),30 Gradient Boosting (GB)31 or bagging in a Random Forest (RF).32 The final selected model is a Support Vector Regression (SVR),33 which takes either the form of a linear SVR (LSVR) if no kernel is applied, or the form of a non-linear SVR (SVR) if a Radial Basis Function (RBF) kernel is applied.
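For illustration, these seven algorithm families correspond to standard scikit-learn estimators roughly as follows. This is a minimal sketch only: the actual hyperparameter search spaces are given in Code Listing S-1, and the GB models in this work were trained with XGBoost (see Code and dataset), so the classes and settings below are illustrative stand-ins rather than the configuration used here.

from sklearn.linear_model import BayesianRidge, Lasso
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import (AdaBoostRegressor, GradientBoostingRegressor,
                              RandomForestRegressor)
from sklearn.svm import SVR

# Hyperparameters below are scikit-learn defaults or simple placeholders;
# the evaluated models were tuned by random search (see Hyperparameter
# optimization and Code Listing S-1).
models = {
    "BRR": BayesianRidge(),                         # linear, Bayesian fit
    "LASSO": Lasso(),                               # linear, L1 shrinkage
    "ANN": MLPRegressor(hidden_layer_sizes=(50,)),  # feed-forward network
    "AB": AdaBoostRegressor(),                      # boosted decision trees
    "GB": GradientBoostingRegressor(),              # gradient-boosted trees
    "RF": RandomForestRegressor(),                  # bagged decision trees
    "SVR": SVR(kernel="rbf"),                       # kernel="linear" -> LSVR
}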

Datasets

The machine learning models were fitted on 36 publicly available LC-MS datasets (Table S-1): 19 were obtained from MoNA (http://mona.fiehnlab.ucdavis.edu/), 16 from PredRet,14 and one from Aicheler et al.4 Each dataset contained measurements for at least 40 unique analytes. These 36 datasets were acquired in different labs, using different experimental setups. Across all datasets, 8305 molecules were observed, of which 6759 are unique. These molecules cover a broad range of masses and chemical compounds: from 59.07 Da to 2406.65 Da (see Figure S-1 for the mass distribution), and from acetamide to lipids. Duplicated molecules in the same dataset were removed based on their SMILES representation.34 The most frequently used LC type is Reverse Phase LC (RPLC), with 33 datasets (Table S-2).

Molecular descriptors (features)

The RDKit35 library is used to convert the SMILES representation of the molecules into 196 features. A selection of 151 features is made from this list, and these 151 are used for training (see Tables S-3 and S-4 for the selected and removed features, respectively). Selection of these 151 features is based on a filter for the standard deviation of a feature across the different molecules (stdev > 0.01), and on the Pearson correlation between features (r2 < 0.96). In addition, a minimal subset of eleven features (obtained from Aicheler et al.4) was


selected to evaluate the potential for overfitting on the larger feature set. These eleven features include an estimate of the hydrophobicity using the octanol/water partition coefficient (MolLogP, slogp_VSA1 and slogp_VSA2), molar refractivity (SMR), estimated surface area (LabuteASA and TPSA), average molecular weight (AMW), polarizability based on molar refractivity (smr_VSA1 and smr_VSA2), and electrostatic interactions (peoe_VSA1 and peoe_VSA2).
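A sketch of this descriptor pipeline is shown below: canonical-SMILES deduplication, RDKit descriptor calculation, and the two filters described above. Details the text leaves open, such as which member of a highly correlated feature pair is dropped, are assumptions here.

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

def featurize(smiles_list):
    # Deduplicate on the canonical SMILES representation
    canonical = sorted({Chem.MolToSmiles(Chem.MolFromSmiles(s))
                        for s in smiles_list})
    # Compute all available RDKit descriptors for each unique molecule
    rows = [{name: fn(Chem.MolFromSmiles(s))
             for name, fn in Descriptors.descList} for s in canonical]
    feats = pd.DataFrame(rows, index=canonical)
    # Filter 1: drop near-constant descriptors (stdev <= 0.01)
    feats = feats.loc[:, feats.std() > 0.01]
    # Filter 2: drop one feature of every highly correlated pair (r2 >= 0.96);
    # dropping the second member of the pair is an assumption
    corr2 = feats.corr() ** 2
    dropped = set()
    cols = list(feats.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a not in dropped and b not in dropped and corr2.at[a, b] >= 0.96:
                dropped.add(b)
    return feats.drop(columns=sorted(dropped))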

Performance metrics

The generalization performance of a fitted regression model is evaluated using three different performance metrics: Mean Absolute Error (MAE), Median Absolute Error (MedAE), and Pearson correlation (r). Let $\hat{y}$ be the predictions of the model and $y$ the experimentally observed retention times for all $n$ molecules in a dataset; then the MAE is calculated using the following equation:

$$\mathrm{MAE} = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n} \qquad (1)$$

While the MAE can give a good indication of performance, it is sensitive to outliers. For this reason the related but more robust Median Absolute Error is used here as the main metric:

$$\mathrm{MedAE} = \mathrm{median}(|y - \hat{y}|) \qquad (2)$$

The MAE or MedAE can be hard to compare between different classes of problems, especially when the error depends on the range of the elution times. Therefore, in addition to the previous two metrics, the Pearson correlation is calculated as well:

$$r = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}\,\sqrt{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}} \qquad (3)$$


Even though the MAE, MedAE, and correlation can be calculated for each dataset, a direct comparison of these metrics across different datasets is not possible. For instance, a prediction error of ten seconds will have more impact when the run time is 129 seconds (as is the case for dataset RIKEN) than when the run time is 4089 seconds (as is the case for dataset Taguchi). This can be alleviated by normalization, specifically by dividing the error by the retention time of the last detected analyte:

$$\mathrm{MedAE}_{\mathrm{normalized}} = \frac{\mathrm{MedAE}(y, \hat{y})}{\max(y)} \qquad (4)$$
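The four metrics (eqns. 1 to 4) reduce to a few lines of code; a minimal sketch with NumPy and SciPy:

import numpy as np
from scipy.stats import pearsonr

def evaluate(y, y_hat):
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    abs_err = np.abs(y - y_hat)
    mae = abs_err.mean()              # eqn. 1
    medae = np.median(abs_err)        # eqn. 2
    r, _ = pearsonr(y, y_hat)         # eqn. 3
    medae_norm = medae / y.max()      # eqn. 4: divide by the retention
                                      # time of the last detected analyte
    return mae, medae, r, medae_norm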

Learning Curves

As tR prediction is specific to an experimental condition, the number of annotated molecules (the training set) available for fitting (or training) a regression model is typically limited. It is therefore important to investigate how different regression models perform for different training set sizes. In order to compute the learning curves (which plot performance versus training set size), sufficiently large datasets are necessary. We therefore used datasets that contained at least 300 unique molecules for this specific analysis. The datasets that meet this requirement are Eawag_XBridgeC18, FEM_long, RIKEN, Stravs and Taguchi_12.

To investigate the reproducibility of the obtained results, each experiment is repeated ten times with different random seeds. For each dataset, 160 unique molecules are selected at random to form the training set. The remaining molecules in that dataset constitute the test set, which is used to evaluate model performance. The training set of 160 molecules is then sampled for molecule sets of increasing sizes. The smallest such training subset contains 20 molecules, sampled at random without replacement. Regression models are optimized and fitted on this training set, and the resulting models are evaluated on the test set. In the next iteration, another 20 molecules are sampled at random without replacement from the remaining 140 selected


molecules, and these are added to the training set. A model is again optimized on this larger training set, and evaluated on the same original test set. This procedure is repeated until the training set contains 160 molecules.
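This sampling protocol can be summarized in a short sketch; fit_and_score is a hypothetical stand-in for the hyperparameter optimization, model fitting and MedAE evaluation described in the following sections.

import numpy as np

def learning_curve(X, y, fit_and_score, seed):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    pool, test = order[:160], order[160:]   # fixed test set for this repeat
    scores = []
    for n in range(20, 161, 20):            # training sizes 20, 40, ..., 160
        train = pool[:n]                    # grows without replacement
        scores.append(fit_and_score(X[train], y[train], X[test], y[test]))
    return scores

# Repeated ten times with different random seeds, as described above:
# curves = [learning_curve(X, y, fit_and_score, seed) for seed in range(10)]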

Algorithm performance evaluation

The generalization performance of each algorithm is evaluated using 10-fold Cross-Validation (10CV), in which all unique molecules are randomly assigned to one of ten subsets of equal size, and each subset is in turn assigned as the test set while the remaining nine subsets are used as the training set. This process is repeated ten times, such that each subset is used once for testing.

Hyperparameter optimization

Learning algorithms have hyperparameters that control how the model parameters are fitted, and these hyperparameters need to be set by the user before fitting the data. Here, the hyperparameters of the learning algorithms are optimized using values that are randomly drawn from a prespecified distribution (see Code Listing S-1 for the definition of these distributions). Random optimization is generally able to find near-optimal parameters significantly faster than a grid search, because it is hypothesized to sample more of the parameter space for a given number of iterations.36 For each machine learning algorithm, a total of 100 randomly selected hyperparameter sets are evaluated using a 10CV on the nine folds from the training set (described in the previous section). This nested CV strategy means that the initial test fold is never used for hyperparameter or model parameter optimization; the test fold is used only for the evaluation of the machine learning algorithms. The hyperparameter set with the best mean absolute error in Cross-Validation (CV) is then used for fitting the complete training set.
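This nested strategy maps naturally onto scikit-learn's model selection utilities. A sketch for the SVR case, where the parameter distributions and the stand-in data are placeholders (the distributions actually used are in Code Listing S-1):

import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X, y = rng.normal(size=(200, 151)), rng.normal(size=200)  # stand-in data

# Inner loop: 100 random hyperparameter sets, scored by 10CV on the
# nine training folds only (mean absolute error).
search = RandomizedSearchCV(
    SVR(),
    {"C": loguniform(1e-2, 1e3), "gamma": loguniform(1e-4, 1e0)},
    n_iter=100, cv=10, scoring="neg_mean_absolute_error", random_state=0)

# Outer loop: the held-out fold never influences the tuning.
outer_scores = cross_val_score(
    search, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=0),
    scoring="neg_median_absolute_error")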


Code and dataset

The code to generate the models uses the following libraries: scikit-learn V0.18.0,37 Pandas V0.19.0,38 RDKit V2016.03.1,35 and XGBoost V0.4.39 The code used to generate the regression models, to make predictions, and to produce the figures is available at: https://github.com/RobbinBouwmeester/tRPredictionOverview

Results and discussion

The SVR algorithm is one of the most frequently applied algorithms for LC tR prediction.4,7,10,21–25 Even though a large variety of machine learning algorithms is available to researchers for training tR prediction models, there can be significant performance differences between them. In this section we show this by evaluating the performance of different machine learning algorithms on different tR prediction tasks.

Prediction performance versus training set size

First, the ability of the different learning algorithms to generalize from training sets of different sizes is investigated. Figure 1 shows learning curves for each of the regression models on the five datasets with more than 300 molecules. As expected, the learning curves show that all regression models benefit from larger training sets. For the five datasets used here, the largest generalization performance gain is typically observed when doubling the dataset size from 20 to 40 training molecules. Still, most models show optimal performance when trained on the largest sample size of 160 molecules, with Adaptive Boosting as the notable exception. This is because Adaptive Boosting tends to overfit on the larger training sets, which is particularly evident for the RIKEN dataset.

Although no clear performance difference between the linear and non-linear prediction models can be observed for the Eawag_XBridgeC18 and Stravs datasets, there is a large performance increase for non-linear models on the FEM_long, RIKEN and Taguchi datasets as


training sets increase in size. For these latter datasets the GB algorithm clearly performs best overall, except for the Taguchi dataset, where ANN clearly outperforms the other algorithms. This shows that the optimal algorithm is dataset dependent, and that boosting a decision tree forest (GB) performs significantly better than bagging a forest (RF) for tR prediction. The standard deviation of the evaluation procedure is plotted in Figure S-2 and shows how the variation decreases as the training set size increases. Between the algorithms there is no clear difference in the variation, not even between the linear and non-linear models. Similar results are observed when evaluating with the MAE and the Pearson correlation (see Figures S-3 and S-4, respectively). Aggregating results over the different datasets (Figures S-5 and S-6) shows no clear difference in performance between the different algorithms. However, our results show that when evaluating on individual datasets, these performance differences are highly significant.

Prediction performance of the different algorithms on all datasets

In this section, performance differences between the different regression models are evaluated on all 36 datasets. For each dataset, all unique molecules in that dataset are used in a nested 10CV approach to assess the generalization performance. The detailed CV results for mean absolute error, median absolute error, and Pearson correlation can be found in Tables S-5, S-6, and S-7, respectively. The predictions and their corresponding observed retention times are plotted in Figures S-7 through S-12. The time to train a single model is below 810 seconds for each algorithm and dataset, and can be considered fast (Table S-8).

Figure 2(a) shows the number of times an algorithm had the lowest median absolute error on any of the 36 datasets. The GB algorithm clearly stands out here, as it is the best performing algorithm for thirteen datasets. Linear models such as BRR and LASSO are among the worst performing algorithms, achieving best performance on only one or two datasets. Note that the tree-based RF algorithm is able to fit non-linear relations, but is nevertheless among the lowest performing algorithms, together with the linear models.


Figure 1: Learning curves for the five datasets that have at least 300 training examples (Eawag_XBridgeC18, FEM_long, RIKEN, Stravs, Taguchi_12). The median absolute error in seconds (vertical axis) is plotted against the number of training examples (horizontal axis) for each algorithm (BRR, LASSO, ANN, AB, GB, RF, SVR). Training and testing for the points in the learning curve is repeated ten times, and the mean is plotted.

Figure 2: The frequency with which an algorithm scored the lowest (a) and highest (b) median absolute error on a dataset. Panel (c) shows the mean rank over all datasets when the performance is sorted on median absolute error.


Figure 2(b) shows the number of times an algorithm had the highest median absolute error on a dataset. The GB algorithm again performs significantly better than the other algorithms, displaying worst performance for just one dataset (UniToyama_Atlantis). The SVR (linear/RBF) model follows closely, having the worst median error on only three datasets. The AB algorithm is ranked worst for eight out of 36 datasets, while achieving best performance for five datasets. This indicates that AB can achieve competitive performance, but is susceptible to overfitting. Overall, this comparison shows that GB will not always perform best, but that out of all algorithms it is the most likely to show the best generalization performance, while being the least likely to show the worst performance. Note that, to the authors' knowledge, the GB algorithm has not previously been applied to train models for tR prediction.

Even though there is a clear difference between algorithms regarding best and worst performance, the mean rank for the different algorithms is more similar (Figure 2(c)). This similarity in mean rank indicates that high performing algorithms (GB, ANN, SVR) can at times perform poorly (but not the worst) on some of the datasets. These results show that different algorithms will show different performance on different datasets, thus illustrating the importance of evaluating multiple algorithms for individual datasets.
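Given a table of per-dataset median absolute errors, the Figure 2 summaries reduce to a few pandas operations. A sketch, where medae is a hypothetical DataFrame with the 36 datasets as rows and the seven algorithms as columns:

import pandas as pd

def summarize(medae: pd.DataFrame):
    best_counts = medae.idxmin(axis=1).value_counts()   # Figure 2(a)
    worst_counts = medae.idxmax(axis=1).value_counts()  # Figure 2(b)
    mean_rank = medae.rank(axis=1).mean()               # Figure 2(c)
    return best_counts, worst_counts, mean_rank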

Pairwise performance comparison of the algorithms

To obtain more insight into the robust performance of GB across the datasets, we compared its performance against the other learning algorithms for each of the 36 datasets. Figure 3 shows the difference in median absolute error normalized to the maximum elution time for each dataset (eqn. 4). The GB algorithm improves the normalized median error by 0.74% to 1.51% on average across all datasets compared to the other six algorithms. For example, in the case of Taguchi a 1% improvement would translate to a 40.89 second lower median absolute error. The datasets Matsuura, Matsuura_15 and Beck are consistently a worse fit for GB when compared to the other algorithms. For completeness, the pairwise comparisons are also


made for the AB, BRR, ANN, RF and SVR models (Figure S-13). These differences in normalized median absolute error range from 0.01% to 0.77% and are overall much lower than the differences observed in Figure 3 for GB. All algorithms perform significantly better than a baseline linear model that uses only the molecular weight as a feature; the improvement over this baseline is around 4% to 5.5% in the median absolute error (Figure S-14).

Even though these differences in performance are small, they can have a big impact on downstream analysis. As an example, we analyzed the capability to discriminate between metabolites for different tR error rates. From the error rate we can fit error windows and calculate the relative overlap between these windows (Figure S-15). Ideally, the overlap is as small as possible, allowing maximum discrimination. Our analysis shows that small error differences (e.g. 1% relative median error) can have an impact greater than 5% on the relative overlap between analytes. Furthermore, the difference between the algorithms with the lowest and highest discrimination power is large in terms of metabolite separability. This difference usually results in an overlap that ranges from 6.2% to 68.2%, with most differences around 10%. This concept of linking error rates to overlapping analytes shows that there is a clear difference between the evaluated algorithms, and that choosing a random algorithm rather than the most suitable one can thus lead to much more overlap between predicted analyte retention windows, and therefore much poorer separation capability.

Figure 4 shows the effect of choosing the best performing algorithm instead of GB for each dataset. For the thirteen datasets where the GB model was best, the performance increase is, of course, zero. For the remaining datasets, choosing the highest performing model improved the relative median error by 0.72% on average (Figure 4(a)). Because of its popularity, the same comparison was made for SVR, where choosing the best performing algorithm rather than SVR resulted in an average performance improvement of 1.84% (Figure 4(b)). This again shows that evaluating multiple algorithms to select the best model can significantly lower the prediction error, and that the popularity of SVR seems unwarranted in light of the overall superior performance of GB.
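The window-overlap analysis above (Figure S-15) can be made concrete with a small sketch. The window construction here, a symmetric window whose half-width is the relative error rate times the total elution time, is one plausible reading of the procedure, not necessarily the implementation used in this work:

import numpy as np

def relative_overlap(t_pred, rel_error):
    # Symmetric retention window per analyte; the half-width derived from
    # the relative error rate scaled by the run length is an assumption.
    t = np.sort(np.asarray(t_pred, dtype=float))
    half_width = rel_error * t.max()
    lo, hi = t - half_width, t + half_width
    # Fraction of analyte pairs (i < j) whose windows intersect
    n = len(t)
    overlapping = sum(lo[j] <= hi[i] for i in range(n) for j in range(i + 1, n))
    return overlapping / (n * (n - 1) / 2)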


Figure 3: Difference in the median absolute error per dataset (relative to the total elution time, in %) for the GB models compared to AB, ANN, BRR, SVR, RF and LASSO (average improvements of 0.74%, 1.08%, 1.08%, 1.12%, 1.48% and 1.51%, respectively). Positive numbers indicate that the median absolute error was lowered (improved) by the indicated amount when using GB. Negative numbers indicate a lower median absolute error for the algorithm GB is compared to.

Figure 4: Comparison of the normalized median error between the best performing model out of the selected seven per dataset and the GB (a, average improvement 0.72%) and SVR (b, average improvement 1.84%) models. A difference of zero indicates that the algorithm under evaluation performed best on that dataset; non-zero values indicate the reduction in error when choosing one of the other algorithms.

Combining multiple prediction algorithms

Instead of selecting the best learning algorithm for a dataset, one might also consider combining the predictions of several models to compute what are called blended predictions. This approach can only achieve better performance than the best performing model in the blend if the predictions computed by the different models are sufficiently uncorrelated. Figure S-16 shows the correlation between the errors of the different learning algorithms. This correlation is high (r > 0.82) between the three algorithms that fit a continuous mathematical function (SVR, BRR and LASSO), and it is also high (r > 0.73) between the three tree-based algorithms (GB, AB and RF). However, the correlation is much lower between these two classes of algorithms (r < 0.59). Finally, the ANN model shows very low correlation with any of the other algorithms (r < 0.58).

A very simple blending strategy was implemented based on these correlations, which averages the predictions of SVR, ANN and GB. While the effect of blending will be minor


for molecules that show similar tR predictions for the three blended models, the expected effect on molecules that show sufficiently different tR predictions is a reduction of outlying predictions (i.e. predictions with large error). This was investigated by looking at the percentage of molecules (over all 36 datasets) with a prediction error lower than a certain threshold. Figure 5 shows different values for this threshold on the horizontal axis, with the corresponding percentage of molecules with a prediction error below the threshold on the vertical axis. Figure 5 shows that the averaged (blended) predictions perform better than the single best model, except for molecules with the smallest prediction errors (errors < 2.5%). For individual datasets, both the mean and median absolute error rank improve when blending, and are better than for the GB models (Table S-9). So, even though the difference between the blended model and GB is small, these results show that outlying tR predictions can be reduced by even a simple blending strategy. It is likely that more advanced blending strategies can further reduce the prediction error.

Figure 5 also shows that only about half of all metabolites are predicted with high accuracy (error < 5%). For instance, for GB 50% of all molecules fall within a 4.1% threshold. Beyond this point, higher thresholds provide relatively smaller gains: for GB, 80% of the metabolites fall within a threshold of 17.7%.
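The blending used here is deliberately simple, as the sketch below shows; more advanced strategies (e.g. weighting each model by its cross-validated error) are left open by the text.

import numpy as np

def blend(pred_svr, pred_ann, pred_gb):
    # Equal-weight average of the three complementary model families;
    # disagreeing models pull outlying predictions toward the consensus.
    return np.mean([pred_svr, pred_ann, pred_gb], axis=0)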

Comparison between SVR and gradient boosting models

The SVR algorithm is one of the most popular algorithms for tR prediction, but the results presented here show that the GB algorithm is more accurate for most datasets. The boosting algorithm in GB works by decreasing the bias of an ensemble of weak learners that have low variance and high bias. The downside of this is that there need to be enough training examples to effectively decrease the bias.

Figure 6 shows a comparison between the performance of both algorithms while taking the size of the dataset into account.


Figure 5: Percentage of metabolites with predictions under an error threshold, plotted against that error threshold (relative to the total elution time), across all 36 datasets. Blended predictions are calculated as the average prediction for a metabolite by GB, ANN and SVR.

The 36 datasets are split into two equal-sized groups: one with a low, and one with a high number of training examples. This division was made based on a threshold of at least 100 training examples for the high-number group. Thirteen out of the 18 datasets in the group with a high number of examples have a lower median absolute error for GB than for SVR (Fisher's exact test p-value < 0.02), while nine out of 18 datasets in the group with a low number of examples have a lower median absolute error for GB than for SVR (Fisher's exact test p-value = 1). The mean absolute error shows the same trend: 15 out of 18 datasets with a high number of examples perform better with the GB-trained model (Fisher's exact test p-value < 0.0002), while 10 out of 18 datasets with a low number of examples perform better with GB (Fisher's exact test p-value = 0.74).

Figure 6: Comparison of the median and mean absolute error between GB and SVR for every dataset. The difference in the median and mean absolute error between GB and SVR, relative to the total elution time, is plotted against the total number of examples in the dataset. The vertical line indicates no performance difference, and the horizontal line indicates the 100-example cut-off between a low and a high number of examples in the dataset.

Effect of a reduced feature set

In this section we investigate possible overfitting by the machine learning algorithms on the larger feature set. Overfitting is a problem for any machine learning task, but due to the relatively high number of features (151) compared to the number of training examples, it becomes a serious concern for some datasets. To detect overfitting on the 151 features, the performance of models trained on all 151 features is compared with that of models trained on the minimal subset of eleven features, as overfitting is less likely to occur in the latter.

Figure 7 shows that the three datasets (Matsuura, Matsuura_15 and Beck) that were consistently a worse fit for GB in Figure 3 achieve a higher performance with the minimal set of eleven features. This higher performance for the minimal feature set indicates overfitting on the 151 features. However, limiting the number of features decreased the performance for most datasets (23 out of 36), with on average a 0.42% lower median absolute error for GB models trained on all 151 features. This shows that the larger feature set is still preferred


over the minimal feature set for most datasets.


Figure 7: Difference in the median absolute error per dataset for GB trained with eleven or with 151 features (average improvement: 0.42%). Positive numbers mean that the median absolute error was lowered by the indicated amount when applying GB with 151 features.

The pairwise comparison between the other learning algorithms fitted on the reduced feature set (Figures S-17 and S-18) supports the same conclusions as Figure 3: the GB algorithm outperforms the other algorithms, with an improved median absolute error of 0.94% to 2.75% on average. This shows that the earlier conclusions drawn with the complete feature set are not due to overfitting of some of the algorithms on the large set of features.

Feature relevance

The performance evaluations conducted here have shown that, overall, GB models perform best. The relevance of each feature in all GB models is therefore investigated in more detail in this section. The F-score provided by GB reflects this feature relevance, and is computed as the number of times that feature was selected to split the training examples during decision


tree construction. Figure 8 shows the highest scoring features based on the F-score. MolLogP, which describes hydrophobicity through an estimate of the octanol/water partition coefficient, is the most important feature across all datasets. This is the result of the large proportion of datasets based on reverse-phase LC, which separates analytes based on hydrophobicity. Furthermore, this feature is computed as a linear combination of 68 atomic contributions.40 It is thus a complex feature condensed from several features, which makes a direct comparison with less complex features difficult. From a pragmatic standpoint, however, MolLogP remains an important feature that should be considered in any model for retention time prediction.

The remaining features generally describe structural features of molecules in addition to chemical features (Chi4v, kappa, EState, PEOE and SlogP). The relevance of these features can be explained by interactions of reactive groups (e.g. cyclic carbons) with the solid phase of the column, and by similarity between molecules: high structural and chemical similarity between molecules is a good indicator that molecules will have the same retention time. Figure 8 also shows that the importance of features is often shared between datasets, but that a substantial proportion of the features used in the models is nevertheless shared by only a few datasets. These differences in feature importance underline that knowledge of features cannot be directly transferred from one experimental setup to another.
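A sketch of how such normalized F-scores can be extracted from a fitted gradient boosting model. This assumes a recent XGBoost, where the split counts are exposed as the "weight" importance type, and a model trained with named features; the XGBoost V0.4 used in this work exposed the same counts through get_fscore().

import pandas as pd

def normalized_fscores(model, feature_names):
    # "weight" counts how often each feature is chosen to split a node,
    # i.e. the F-score discussed above; features never used get 0.
    raw = model.get_booster().get_score(importance_type="weight")
    scores = pd.Series({name: raw.get(name, 0.0) for name in feature_names})
    return scores / scores.max()   # normalize by the per-dataset maximum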


Figure 8: Normalized F-score from GB per dataset. F-scores were normalized by dividing all features by the maximum F-score per dataset. The highest scoring features by mean F-score over all datasets include MolLogP (0.796), Chi4v (0.557), BalabanJ (0.398) and MinAbsEStateIndex (0.37); any feature with a mean F-score below 0.15 is excluded.

Conclusion

When training an LC tR predictor, researchers make multiple decisions that can influence the performance and interpretability of the final model. The most notable such decisions are the number of training examples to include, the machine learning algorithm to use, and the


molecular features to calculate. For all datasets analysed here, a fairly accurate model could only be obtained when the number of training examples was at least 40. However, different datasets and algorithms require a different number of training examples to achieve the highest possible performance. Generally, the GB algorithm delivers the best performance, given sufficient training examples (>100). Models trained with GB generally keep improving as the number of training instances increases, where other algorithms converge in their performance more quickly.

Cross-validation confirms that GB is the most likely candidate to deliver the best performance, and the least likely to give the worst performance. However, for 23 out of 36 datasets the best performance was obtained using another algorithm. This shows the importance of testing different algorithms before choosing the learning algorithm that is used to generate the final model. The algorithms GB, ANN and SVR generate complementary models and provide a good starting point for either selecting the single most suitable algorithm or blending multiple algorithms.

The feature importance analysis of the GB models shows that feature relevance is also dataset dependent. Selecting features based on other experimental setups can therefore result in a suboptimal model, and in the initial stages of creating a model a researcher should be careful when excluding features. The GB algorithm is generally able to select those features that are important to achieve a high performance without overfitting. However, the best overall performance can be achieved by blending algorithms, which results in a lower overall prediction error. The blending technique applied here is very simplistic, but it already indicates that more advanced blending techniques are worth investigating.


Acknowledgement

This project was made possible by MASSTRPLAN. MASSTRPLAN received funding from the Marie Sklodowska-Curie EU Framework for Research and Innovation Horizon 2020, under Grant Agreement No. 675132.

References

(1) Malviya, R.; Bansal, V.; Prakash Pal, O.; Kumar Sharma, P. Journal of Global Pharma Technology 2010, 2, 22–26.
(2) Bird, I. M. BMJ: British Medical Journal 1989, 299, 783.
(3) Wolfer, A. M.; Lozano, S.; Umbdenstock, T.; Croixmarie, V.; Arrault, A.; Vayer, P. Metabolomics 2016, 12, 8.
(4) Aicheler, F.; Li, J.; Hoene, M.; Lehmann, R.; Xu, G.; Kohlbacher, O. Analytical Chemistry 2015, 87, 7698–7704.
(5) Creek, D. J.; Jankevics, A.; Breitling, R.; Watson, D. G.; Barrett, M. P.; Burgess, K. E. V. Analytical Chemistry 2011, 83, 8703–8710.
(6) Wilson, I. D.; Plumb, R.; Granger, J.; Major, H.; Williams, R.; Lenz, E. M. Journal of Chromatography B 2005, 817, 67–76.
(7) Lu, W.; Liu, X.; Liu, S.; Cao, W.; Zhang, Y.; Yang, P. Scientific Reports 2017, 7, 43959.
(8) Spicer, V.; Yamchuk, A.; Cortens, J.; Sousa, S.; Ens, W.; Standing, K. G.; Wilkins, J. A.; Krokhin, O. V. Analytical Chemistry 2007, 79, 8762–8768.
(9) Klammer, A. A.; Yi, X.; MacCoss, M. J.; Noble, W. S. Analytical Chemistry 2007, 79, 6111–6118.
(10) Moruz, L.; Tomazela, D.; Käll, L. Journal of Proteome Research 2010, 9, 5209–5216.
(11) Palmblad, M.; Ramström, M.; Markides, K. E.; Håkansson, P.; Bergquist, J. Analytical Chemistry 2002, 74, 5826–5830, PMID: 12463368.
(12) Bertsch, A.; Jung, S.; Zerck, A.; Pfeifer, N.; Nahnsen, S.; Henneges, C.; Nordheim, A.; Kohlbacher, O. Journal of Proteome Research 2010, 9, 2696–2704.
(13) Strittmatter, E. F.; Kangas, L. J.; Petritis, K.; Mottaz, H. M.; Anderson, G. A.; Shen, Y.; Jacobs, J. M.; Camp, D. G.; Smith, R. D. Journal of Proteome Research 2004, 3, 760–769.
(14) Stanstrup, J.; Neumann, S.; Vrhovsek, U. Analytical Chemistry 2015, 87, 9421–9428.
(15) Boswell, P. G.; Schellenberg, J. R.; Carr, P. W.; Cohen, J. D.; Hegeman, A. D. Journal of Chromatography A 2011, 1218, 6742–6749.
(16) Meek, J. L. Proceedings of the National Academy of Sciences 1980, 77, 1632–1636.
(17) Moruz, L.; Käll, L. Mass Spectrometry Reviews 2017, 36, 615–623.
(18) Katritzky, A. R.; Ignatchenko, E. S.; Barcock, R. A.; Lobanov, V. S.; Karelson, M. Analytical Chemistry 1994, 66, 1799–1807.
(19) Zellner, B. d.; Bicchi, C.; Dugo, P.; Rubiolo, P.; Dugo, G.; Mondello, L. Flavour and Fragrance Journal 2008, 23, 297–314.
(20) Domingos, P. Communications of the ACM 2012, 55, 78–87.
(21) Song, M.; Breneman, C. M.; Bi, J.; Sukumar, N.; Bennett, K. P.; Cramer, S.; Tugcu, N. Journal of Chemical Information and Computer Sciences 2002, 42, 1347–1357.
(22) Golmohammadi, H.; Dashtbozorgi, Z.; Vander Heyden, Y. Chromatographia 2015, 78, 7–19.
(23) Lei, B.; Li, S.; Xi, L.; Li, J.; Liu, H.; Yao, X. Journal of Chromatography A 2009, 1216, 4434–4439.
(24) Chen, J.; Yang, T.; Cramer, S. M. Journal of Chromatography A 2008, 1177, 207–214.
(25) Luan, F.; Xue, C.; Zhang, R.; Zhao, C.; Liu, M.; Hu, Z.; Fan, B. Analytica Chimica Acta 2005, 537, 101–110.
(26) Ma, C.; Ren, Y.; Yang, J.; Ren, Z.; Yang, H.; Liu, S. Analytical Chemistry 2018, 90, 10881–10888.
(27) Cao, M.; Fraser, K.; Huege, J.; Featonby, T.; Rasmussen, S.; Jones, C. Metabolomics 2015, 11, 696–706.
(28) Hoerl, A. E.; Kennard, R. W. Technometrics 1970, 12, 55–67.
(29) Tibshirani, R. Journal of the Royal Statistical Society, Series B 1996, 58, 267–288.
(30) Freund, Y.; Schapire, R. E. Journal of Computer and System Sciences 1997, 55, 119–139.
(31) Mason, L.; Baxter, J.; Bartlett, P.; Frean, M. Advances in Neural Information Processing Systems (NIPS) 1999.
(32) Breiman, L. Machine Learning 2001, 45, 5–32.
(33) Cortes, C.; Vapnik, V. Machine Learning 1995, 20, 273–297.
(34) Weininger, D. Journal of Chemical Information and Computer Sciences 1988, 28, 31–36.
(35) Landrum, G. The RDKit 2016.09.1 documentation. 2016; http://www.rdkit.org/docs/.
(36) Bergstra, J.; Bengio, Y. Journal of Machine Learning Research 2012, 13, 281–305.
(37) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Journal of Machine Learning Research 2011, 12, 2825–2830.
(38) McKinney, W. Python for High Performance and Scientific Computing 2011, 1–9.
(39) Chen, T.; Guestrin, C. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016, 785–794.
(40) Wildman, S. A.; Crippen, G. M. Journal of Chemical Information and Computer Sciences 1999, 39, 868–873.

Supporting Information Available

The Supporting Information is available free of charge on the ACS Publications website at DOI: XXXX. Experimental metadata of the datasets and physicochemical information on the analytes in these datasets (Tables S-1 and S-2 and Figure S-1); standard deviation of the learning curves (Figure S-2); molecular descriptors included and excluded in the fitting procedure (Tables S-3 and S-4); exact mean absolute error, median absolute error, and Pearson correlation for each dataset (Tables S-5, S-6 and S-7); learning curves for the mean absolute error and Pearson correlation (Figures S-3 and S-4); aggregated learning curves for the five datasets (Figures S-5 and S-6); scatter plots of predicted and observed retention times for each dataset and algorithm (Figures S-7 through S-12); difference in relative median error between the algorithms (Figure S-13); baseline linear model compared to the other algorithms (Figure S-14); simulation of algorithm error rates and the relative overlap of error windows (Figure S-15); Pearson correlation between the errors of the algorithms (Figure S-16); performance comparison of the algorithms on a reduced feature set (Figures S-17 and S-18); calculation cost in seconds for each algorithm (Table S-8); aggregated performance ranking of the algorithms (Table S-9); code listing with the hyperparameter space for each algorithm (Listing S-1).


Author information

Corresponding Author
E-mail: [email protected]; Albert Baertsoenkaai 3, 9000 Gent, Belgium

ORCID
Lennart Martens: 0000-0003-4277-658X
Robbin Bouwmeester: 0000-0001-6807-7029

Author Contributions
S.D. and R.B. designed the experiments. R.B. wrote the code for training the models and evaluating them. S.D., L.M. and R.B. analyzed the results. All authors contributed to the manuscript.
