
Article

J. Proteome Res., Just Accepted Manuscript. DOI: 10.1021/acs.jproteome.8b00724. Publication Date (Web): February 25, 2019.


CiRCus: a framework to enable classification of complex high-throughput experiments

Florian Seefried 1, Tobias Schmidt 1, Maria Reinecke 1,2, Stephanie Heinzlmeir 1, Bernhard Kuster 1,2,3, Mathias Wilhelm 1,*

1 Chair of Proteomics and Bioanalytics, Technical University of Munich, Freising, Germany
2 German Cancer Research Center and German Cancer Consortium, Heidelberg, Germany
3 Bavarian Center for Biomolecular Mass Spectrometry, Freising, Germany

* Corresponding author: Mathias Wilhelm ([email protected]), Tel.: +49 8161 714202, Fax: +49 9161 715931



Abstract

Despite the increasing use of high-throughput experiments in molecular biology, methods for evaluating and classifying the acquired results have not kept pace, requiring significant manual effort. Here we present CiRCus, a framework to generate custom machine learning models for classifying results from high-throughput proteomics binding experiments. We describe the experimental procedure that guided the layout of this framework as well as its use on an example dataset of 557,166 protein:drug binding curves, achieving an AUC of 0.9987. Applying our classifier leaves only 6% of the data for manual investigation. CiRCus bundles two applications: a minimal interface for labeling a training dataset (CindeR) and an interface for generating random forest classifiers with optional optimization of pre-trained models (CurveClassification). CiRCus is available at https://github.com/kusterlab, accompanied by an in-depth user manual and video tutorials.

Keywords

Labeling, Classification, Proteomics, Kinobeads, Competition binding, Machine learning

Introduction

The emerging application of high-throughput experiments in the field of molecular biology fosters explorative and unbiased experiments. The downside of these explorative approaches is that the proportion of interesting observations in the data is typically low 1. One classical approach to finding interesting changes in the observed data is downstream statistical analysis 2, 3. However, this is not always suitable, especially when multiple parameters of the underlying data need to be investigated simultaneously in order to assess the overall quality and significance of the results. This is apparent, e.g., in proteomic applications where protein or peptide intensities


are investigated as a function of compound concentration, temperature or time 4, 5. Here, manual inspection and annotation of the data are often used, which impairs reproducibility and consistency. One particular example where such a procedure is necessary is the classification of results obtained from Kinobeads experiments, which enable the target deconvolution of small-molecule kinase drugs 1, 5, 6. These experiments can reveal binding partners of small molecules and subsequently hint at potential explanations for a drug's mode of action as well as potential side effects 6. Usually such experiments are performed by varying the drug concentration to model the drug-target interaction as a log-logistic function 7, 8.
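The binding model itself is not written out in the text; for orientation, the standard four-parameter log-logistic form (assumed here to match the models in refs 7, 8, with our own symbol names) is:

```latex
% Four-parameter log-logistic dose-response model (notation is ours):
f(c) = b + \frac{t - b}{1 + \left( c / \mathrm{EC}_{50} \right)^{s}}
```

where c is the compound concentration, t and b are the top and bottom plateaus, s is the slope and EC50 is the concentration of half-maximal response.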

This allows the estimation of binding parameters (i.e. the concentration of half-maximal response, EC50) for many protein-drug interactions in a single experiment. In addition, each concentration-dependent binding curve is characterized by multiple parameters enabling the assessment of the certainty of the observed interactions (e.g. the number of observed peptides).

Classification is a well-established data mining technique that is trained on manually labeled examples and subsequently applied to large datasets. This process avoids the inconsistencies likely generated during manual data inspection and allows filtering based on adaptive criteria. In contrast to applying manual filter criteria, this approach makes use of the learned separation in a high-dimensional space. Here we present a framework, termed CiRCus (CindeR and CurveClassification), which can be used to train and apply a classifier and is readily usable by wet-lab scientists with a basic understanding of machine learning. Two applications are presented and evaluated on a large annotated dataset: one for easy training data generation and one for model learning, refinement and application. We provide an extensive user manual and video tutorial for both. Thus, CiRCus makes the generation and application of high-quality predictors conveniently accessible, allowing its use in the field of proteomics, where the number of data points in a single experiment far exceeds what can be manually investigated and where simple cutoffs are not sufficient as filter criteria.


Software Tools / Methods

CindeR

This application is intended to help with the labeling of datasets. It can be used to generate a training dataset for the developed classification framework or, alternatively, for manual labeling. Briefly, any csv file can be uploaded and visualized using custom plot functions (e.g. https://github.com/kusterlab/curveClassification_shiny). Labeling is performed by swiping or pressing left or right. The resulting data can then be downloaded for subsequent model training. Furthermore, this application can also be extended for multi-class labeling (see SI); examples of such an extension for four and five classes are given on the dev and multiClass_Labeling branches of the GitHub project, respectively. A detailed user manual and video tutorial can be found at https://github.com/kusterlab/cindeR/tree/master/manual.
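To illustrate what such a custom plot function can look like, the following minimal R sketch draws one dose-response curve for a single observation. The column names and the one-row-argument signature are illustrative assumptions, not CindeR's documented interface; the manual linked above defines the exact contract.

```r
# Hypothetical plot function for a single observation (one data-frame row).
# Columns such as rel_intensity_3 are assumed for illustration only.
plotCurve <- function(observation) {
  conc <- c(3, 10, 30, 100, 300, 1000, 3000, 30000)  # nM
  resp <- as.numeric(observation[paste0("rel_intensity_", conc)])
  plot(conc, resp, log = "x", type = "b",
       xlab = "Concentration [nM]", ylab = "Relative intensity",
       main = observation[["Gene.names"]], ylim = c(0, max(1.2, resp)))
  abline(h = 1, lty = 2, col = "grey")  # DMSO control level
}
```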

CurveClassification package and shiny app

The CurveClassification tool developed here consists of two parts: an R package that adds functionality to the mlr package 9 and an accompanying shiny 10 web application that integrates this R package and allows the generation, evaluation and use of learned classifiers in a guided fashion. The main feature of the package is the implementation of a class called WrappedCombiModel (Figure S-1) that integrates feature calculation functions, the used dataset and the model parameters. The feature calculation, or generation, functions can in principle perform any kind of preprocessing of the data that leads to the generation of additional features used for later training. While preprocessing functions that center and scale the data or calculate principal components are already implemented in several packages 9, 11, the feature generation functions added here are more useful when more general preprocessing is required (e.g. normalization of the data to a certain column or a least squares fit based on several columns). Once a trained model of this class is generated, those functions are stored and called automatically before prediction on every new dataset. Therefore, no separate preprocessing is required and raw data can be fed directly to such a model. In addition, a retrain function was implemented that reuses all prior parameters of the model and extends or replaces its data foundation, which can be used to further optimize an existing model.

The shiny application is based on this package and provides a guided interface for generating a classifier. It consists of seven tabs, each intended for a certain task. The main features of this application are the integration of feature generation functions, the training and retraining of a model, the visualization of observations and the prediction of new data. CurveClassification allows users to perform a variety of tasks online, including but not limited to feature generation and exploration, model generation, model optimization and prediction. As in the CindeR app, custom plot functions can be uploaded to visualize the acquired data. A detailed user manual as well as links to video tutorials are available at https://github.com/kusterlab/curveClassification_shiny/tree/master/manual.
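In this design, a feature generation function is simply an R function that maps the raw data frame to the same data frame with additional columns. A minimal sketch for the normalization example above, with hypothetical column names:

```r
# Hypothetical feature generation function: normalize every LFQ intensity
# column to the DMSO control column and add the result as rel_* columns.
normalizeToControl <- function(df) {
  conc_cols <- grep("^LFQ_intensity_[0-9]+$", names(df), value = TRUE)
  for (col in setdiff(conc_cols, "LFQ_intensity_0")) {
    df[[paste0("rel_", col)]] <- df[[col]] / df[["LFQ_intensity_0"]]
  }
  df
}
```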

Data

The data used in this study were published by Klaeger et al. 1 and comprise 243 competition binding assays (experiments) using the Kinobeads technology 5, totaling 557,166 binding curves. The published proteinGroups.txt files were obtained from PRIDE. For the purpose of this study, only observations with a recorded LFQ intensity at all available concentrations were used. The original data were manually labelled considering the following features: normalized LFQ intensity, unique peptides and MS/MS count at the concentrations 0 nM, 3 nM, 10 nM, 30 nM, 100 nM, 300 nM, 1000 nM, 3000 nM and 30000 nM (0 nM represents the DMSO control). In addition, the quantile of the intensity of the DMSO control in each experiment was considered. This resulted in 5,083 (1.8 %) high-confidence binding curves that showed a concentration-dependent effect (low-confidence targets were not considered as targets). This dataset was divided into a training set (80 % of all observations) and a test set (20 % of all observations).
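The filtering and splitting step could look as follows in R; the exact column naming in the published proteinGroups.txt files is an assumption (MaxQuant typically writes one "LFQ intensity ..." column per sample), as is the use of data.table:

```r
library(data.table)

pg <- fread("proteinGroups.txt")  # obtained from PRIDE
lfq_cols <- grep("^LFQ intensity", names(pg), value = TRUE)

# keep only observations with a recorded LFQ intensity at every concentration
complete <- pg[rowSums(pg[, ..lfq_cols] > 0) == length(lfq_cols)]

# random 80/20 train/test split over observations
set.seed(42)
train_idx <- sample(nrow(complete), floor(0.8 * nrow(complete)))
train <- complete[train_idx]
test  <- complete[-train_idx]
```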


Estimation of the learning curve

To estimate the learning curve, and thus the minimum number of targets required to generate an initial model, competition binding experiments were randomly sampled iteratively, starting from one experiment up to 194 experiments (80 % of all). The number of targets used for training thus varied in each step. During every step of the iteration, a separate model was trained and its performance was evaluated on the observations from the remaining 20 % of experiments that were never sampled. This procedure was repeated ten times in order to estimate the variance of the model performance for a given number of targets.
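A condensed sketch of this procedure, assuming a data frame d with an experiment column and a logical label column; trainAndEvaluate() is a hypothetical helper wrapping model training and evaluation, not a function of the framework:

```r
# Sketch of the learning-curve estimation (10 repetitions).
learningCurve <- function(d, reps = 10) {
  exps <- unique(d$experiment)
  n_train <- floor(0.8 * length(exps))  # 194 of 243 experiments
  out <- list()
  for (r in seq_len(reps)) {
    shuffled <- sample(exps)
    test_d <- d[d$experiment %in% shuffled[-seq_len(n_train)], ]
    for (k in seq_len(n_train)) {
      train_d <- d[d$experiment %in% shuffled[seq_len(k)], ]
      out[[length(out) + 1]] <- data.frame(
        rep = r, n_experiments = k, n_targets = sum(train_d$label),
        perf = trainAndEvaluate(train_d, test_d))  # hypothetical helper
    }
  }
  do.call(rbind, out)
}
```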

In-silico evaluation to find the minimal number of concentrations

To evaluate the minimum number of concentrations necessary for a random forest to distinguish targets from non-targets, all 247 combinations of concentrations possible for the dataset were analyzed. The available concentrations are 3 nM, 10 nM, 30 nM, 100 nM, 300 nM, 1000 nM, 3000 nM and 30000 nM. Since some of the features used depend on the particular responses at the respective concentrations (e.g. the slope of the linear fit), these features were recalculated for each of the 247 in-silico evaluations.
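The count of 247 corresponds to every subset of the eight concentrations that contains at least two concentrations (2^8 - 1 - 8 = 247); under that assumption, the enumeration is straightforward in R:

```r
concs <- c(3, 10, 30, 100, 300, 1000, 3000, 30000)  # nM

# all subsets with at least two concentrations
combos <- unlist(lapply(2:length(concs),
                        function(k) combn(concs, k, simplify = FALSE)),
                 recursive = FALSE)
length(combos)  # 247
```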

Computational Experiments / Results and Discussion

CiRCus is a combination of the two web applications CindeR and CurveClassification, designed to generate classification models for datasets in which multiple parameters need to be evaluated simultaneously (Figure 1). CindeR is intended to accelerate the manual labeling of data in order to provide a training dataset for CurveClassification. Any kind of csv file can be uploaded in combination with a plot function that visualizes the relevant parameters of the data. Every observation is subsequently displayed and can be labeled by a keystroke. This results in a dataset that contains additional class labels (True or False). The CurveClassification tool is intended as a guided framework for the generation and application of specific classifiers. Again, any kind of csv file can be uploaded, which can be used either to generate a classification model with or without data-specific preprocessing (feature generation function) or to optimize an already existing model. The generated models can be used to predict new data. Those predictions can be manually reevaluated and used to further optimize the model, if required. Moreover, every observation in the dataset can be visualized at every stage of model generation and application (using custom R plot functions), allowing easy exploration of model performance and of positive predictions on new data. Furthermore, a visualization of the nearest neighbors of a selected observation is available, which can be used during model generation to evaluate the consistency of the manually labeled training dataset. We briefly introduce the experiments that led to the design of the applications presented here and subsequently show an example process.

Training data labeling

One key requirement for supervised learning is the availability of large amounts of labeled data. However, the generation of such data cannot be meaningfully automated and thus requires manual inspection, a tedious process. CindeR provides an easy and minimalistic interface for fast labeling of data (Figure 1). For this purpose, data can be uploaded to a web interface and visualized using custom R plot functions. A single keystroke (or swipe) is used for fast labeling, allowing the evaluation of hundreds of observations in a short time. The results can be downloaded and aggregated across many users to ensure unbiased data generation.


Figure 1. Layout overview and workflow. CiRCus is a combination of the two web applications CindeR and CurveClassification to generate a classification model. CindeR: Any kind of csv file can be uploaded in combination with a plot function that visualizes the relevant properties of the data. Every observation is subsequently displayed and can be labeled by a keystroke (or swipe). This results in a dataset that contains additional class labels. CurveClassification: Any kind of csv file can be uploaded, which can be used either to generate a classification model with or without data-specific preprocessing (feature generation function) or to optimize an already existing model. The generated models can be used to predict new data. Those predictions can be manually reevaluated and used to further optimize the model if required. Furthermore, different options for the visualization of observations are available at every stage.

Classification model setup

Given the large number of available classification algorithms, the first task was to identify suitable algorithms for the classification of strongly skewed data (i.e. roughly one positive per 100 negative examples), exemplified here on dose-response data. Thus, we compared 42 classification algorithms based on the feature set that was used for manual labeling (see Methods; Figure S-2) in order to find an optimal algorithm. The algorithms were compared using the following six measures: true positive rate (TPR), false positive rate (FPR), precision (PPV), accuracy (ACC), area under the receiver operating characteristic (ROC) curve (AUC) and run time. Each training was evaluated by a 10-fold cross-validation to obtain stable estimates of performance. The baseline for the model evaluation was the comparison to a simple algorithm ("featureless" 9) that always predicts the class with the highest proportion. Algorithms belonging to the family of random forests and support vector machine (SVM) based algorithms showed a high probability that predicted true values were actually true (PPV), a high rate of covering all positive classes (TPR), a low fraction of negative examples classified as positive (FPR) and overall good performance (ACC, AUC) (Figure S-2). However, because training time scales less than quadratically with dataset size for random forests but cubically for SVMs 12, 13, a random forest 14 was selected as the classification algorithm (number of trees = 750; number of randomly sampled features considered when computing the best node split = √(number of features)) (see SI). Furthermore, for the dataset used, the performance gain obtained through hyperparameter optimization was negligible for random forests but not for SVMs, an additional advantage of the random forest (Figure S-3).
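A sketch of the selected setup using the mlr package 9; the task and column names are assumptions, and setting mtry to the square root of the number of features is our reading of the parenthesis above:

```r
library(mlr)

# Random forest learner with the parameters described in the text.
task <- makeClassifTask(data = train, target = "label", positive = "TRUE")
lrn  <- makeLearner("classif.randomForest", predict.type = "prob",
                    ntree = 750,
                    mtry  = floor(sqrt(getTaskNFeats(task))))

# 10-fold cross-validation with the measures used in the comparison
rdesc <- makeResampleDesc("CV", iters = 10)
res   <- resample(lrn, task, rdesc,
                  measures = list(tpr, fpr, ppv, acc, auc, timetrain))
```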


Comparison of sampling methods

Sampling is an advantageous step for data with highly skewed labels, as it balances the true and false observations during training. We therefore investigated the need for balancing using three sampling methods: undersampling (random undersampling of the majority class), oversampling (replication of the minority class) and SMOTE (oversampling with a generative model) 15. Because every sampling method has parameters that cannot be derived analytically, we used a Bayesian optimization approach (see SI) 14, 16 to find the parameters that achieve the highest true positive rate.
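In mlr, the three strategies can be attached to a learner as wrappers. The sketch below plugs in the optimal parameters from Table 1; mapping the paper's parameter names onto mlr's usw.rate/osw.rate/sw.rate/sw.nn arguments is our assumption:

```r
# Balancing wrappers around the random forest learner from above.
lrn_under <- makeUndersampleWrapper(lrn, usw.rate = 0.05)  # keep 5% of majority
lrn_over  <- makeOversampleWrapper(lrn, osw.rate = 59)     # replicate minority
lrn_smote <- makeSMOTEWrapper(lrn, sw.rate = 91.8, sw.nn = 20)
```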

Table 1. Comparison of three different sampling methods, each with its optimal hyperparameters, to a reference (no sampling).

| Method | Parameter | TPR | FPR | ACC | PPV | AUC | Run time [h] |
|---|---|---|---|---|---|---|---|
| Reference | - | 78.4 | 0.1 | 99.5 | 94.3 | 99.78 | 2.2 |
| Undersampling | undersampling rate: 0.05 | 96.4 | 1.1 | 98.8 | 61.0 | 99.82 | 0.06 |
| SMOTE | number of NN: 20; oversampling rate: 91.8 | 91.8 | 0.4 | 99.4 | 80.1 | 99.83 | 4.7 |
| Oversampling | oversampling rate: 59.0 | 83.9 | 0.1 | 99.6 | 91.7 | 99.86 | 2.4 |

The shown values are mean values of a ten-fold cross-validation based on the training dataset.

Table 1 summarizes the obtained results and shows that SMOTE and undersampling perform better than the model without sampling or with oversampling. Investigating the resulting probability distributions for SMOTE and undersampling, split by their initial class label (Figure 2), makes it apparent that the models differ. Since the classifier was required not to lose any positive observations during automated classification, the threshold was estimated from a 10-fold cross-validation during training to achieve a TPR of 0.99. Comparing the performance of the models with the adjusted threshold revealed that, under those conditions, the overall model performance was similar 17. This implies that the type of sampling is of lesser importance, since the optimal threshold can be estimated through cross-validation. Thus, the undersampling method with an undersampling rate of 0.05 was chosen, owing to its computation time, which is orders of magnitude lower.
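A hedged reconstruction of this threshold estimation, reusing the cross-validated resample result res from the earlier sketch (assumed here to have been re-run with lrn_under):

```r
# Find the probability threshold that retains 99% of the cross-validated
# positives, then apply it to the test-set predictions.
probs <- getPredictionProbabilities(res$pred)  # P(class == "TRUE")
truth <- getPredictionTruth(res$pred)

pos <- sort(probs[truth == "TRUE"], decreasing = TRUE)
thr <- pos[ceiling(0.99 * length(pos))]

mod <- train(lrn_under, task)
pred_test <- setThreshold(predict(mod, newdata = test), thr)
calculateConfusionMatrix(pred_test)
```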

Figure 2. Comparison of the target probability distribution of the test set predicted with a model generated with undersampling (blue) or SMOTE (orange) for each of the three class labels available in the dataset. The dashed lines indicate the threshold that was estimated from a 10-fold cross-validation to achieve a TPR of 99 %.


Model performance evaluation

Encouraged by a model with a threshold optimized to achieve a TPR of 0.99, we investigated its performance on the holdout dataset containing 57,261 observations (Table 2).

Table 2: Confusion matrix of the test set predicted with the model that is based on the feature set that was used for the manual classification of the dataset (see Methods).

| Actual class | Predicted: False | Predicted: True |
|---|---|---|
| False | 53,828 | 2,418 |
| True | 7 | 1,008 |

The applied threshold was estimated from a 10-fold cross-validation to achieve a TPR of 99 %.


Figure 3: Receiver operating characteristic of the test set predicted with the model that is based on the feature set that was used for the manual classification of the dataset (see Methods).

The initial model already showed impressive performance (Figure 3), with an AUC of 0.998. If no false classifications are to be included in the final result, this model still requires manually checking 3,426 positively predicted curves (Table 2) to filter out false positive identifications (false negatives are avoided through threshold adjustment), representing 6.0 % of the full test dataset. Further investigation of the false positives led to the realization that the initial feature set, containing normalized LFQ intensity, MS/MS and unique peptide counts over the concentration range in combination with the intensity quantile of the DMSO control, did not provide enough information to optimally differentiate positive and negative observations. Both curves shown in Figure 4 were classified as positive by the initial model but were not manually annotated as high-confidence targets. Such a sanity check of falsely predicted observations can be done conveniently using the visualization features of CurveClassification. Clicking a single observation in the prediction results of the train and test datasets visualizes this observation (using the custom R plot function). The data table can be filtered and ordered, allowing fast evaluation of specific groups. Additionally, a nearest-neighbor visualization is integrated, in which clicking a single observation visualizes the selected observation together with the closest observations (using a Euclidean distance based on all features used for model training) that were annotated as True and False. This option was found useful for detecting inconsistencies in the manually annotated data, which can potentially mislead the classifier during training.
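The nearest-neighbor lookup behind this visualization amounts to the following sketch; the helper and the label column are ours, and only the Euclidean distance over the training features is taken from the text:

```r
# For observation i, return the indices of the closest observations that
# were annotated True and False, by Euclidean distance over the features.
nearestNeighbours <- function(i, data, feature_cols) {
  x <- as.matrix(data[, feature_cols])
  d <- sqrt(rowSums(sweep(x, 2, x[i, ])^2))  # distance of every row to row i
  d[i] <- Inf                                # exclude the observation itself
  list(closest_true  = which.min(ifelse(data$label == "TRUE",  d, Inf)),
       closest_false = which.min(ifelse(data$label == "FALSE", d, Inf)))
}
```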


Figure 4. Example of two dose-response curves that were predicted as positive by the initial model (Threshold: 0.18) with an intermediate probability (Original probability). The model obtained after feature engineering (final model; Threshold: 0.33) classifies these curves as negative (Final probability).

Feature engineering

Machine learning algorithms derive their decisions from the given features and/or their combinations, but every classical machine learning algorithm can be improved by calculating new features using domain knowledge. This process of finding "new" features is called feature engineering. For example, the dataset investigated here contained additional labels describing relevant properties of the corresponding protein. The annotation "kinase" describes whether a protein belongs to the class of protein kinases, whereas "direct binder" indicates whether a protein is known for its ability to be enriched with the experimental setup. The last annotation used is "co-enriched", which describes whether a protein is known to be enriched indirectly (i.e. as a complex partner) through another (direct binder) protein.


The criteria an observation has to fulfill are less stringent for proteins known to be kinases than for proteins for which no information was (yet) available. Thus, these already available annotations were also integrated as features. Using them in addition to the initial feature set increased the performance of the model evaluated using cross-validation (initial features AUC: 0.9982; initial features + annotation AUC: 0.9988; Figure 5). Another disadvantageous property of classical random forests is that features are treated in isolation, so the order of relative responses at different concentrations is not taken into account explicitly. A simple way to encode such a connection between features is a linear model fit, which captures whether a binding curve tends to increase or decrease while simultaneously capturing the steepness via the slope of the line. Another parameter obtainable from a least squares fit is its coefficient of determination (R2). In the initial data, the R2 of a log-logistic curve fit was taken into account for classification (as a measure of curve quality), and it correlates with the R2 of the corresponding linear fit with a Pearson correlation of 0.608 (Figure S-4). Therefore, the R2 and the slope obtained from a linear fit were also used as features for the random forest model. The main advantage of using a linear fit over a log-logistic fit is its substantially lower runtime, while at the same time being less dependent on the overall curve shape. An additional feature generated in this context was the area under the curve (auc). The idea is that non-targets should show a higher area under the curve than targets, because increasing inhibitor concentrations should lead to a decrease of relative intensity for targets. The effect of including this feature was hardly detectable and not statistically significant (initial features + annotation + linear model fit AUC: 0.9990; initial features + annotation + linear model fit + auc AUC: 0.9990) (Figure 5), but manual review of the predicted false positives using the CurveClassification interface revealed that the probability of those curves is more similar to true positive binding curves than to true negatives. All of these features can be calculated using the feature generation function methodology introduced above (see Methods); examples of such functions are available on GitHub.
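The slope, R2 and auc features could be produced by a feature generation function along the following lines (trapezoidal integration over log10 concentration; the column layout is an assumption):

```r
# Hypothetical feature generation function adding the slope and R2 of a
# linear fit plus the trapezoidal area under each binding curve.
curveFeatures <- function(df,
                          conc = c(3, 10, 30, 100, 300, 1000, 3000, 30000)) {
  resp <- as.matrix(df[, paste0("rel_intensity_", conc)])
  lc <- log10(conc)
  feats <- t(apply(resp, 1, function(y) {
    fit <- lm(y ~ lc)
    c(slope = unname(coef(fit)[2]),
      r2    = summary(fit)$r.squared,
      auc   = sum(diff(lc) * (head(y, -1) + tail(y, -1)) / 2))  # trapezoid
  }))
  cbind(df, feats)
}
```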


Figure 5. Performance of the model during feature engineering, shown as true positive rate (TPR) and area under the ROC curve (AUC). The data are based on a 10-fold cross-validation with the same seed for every feature set. The star indicates a p-value (obtained from a pairwise t-test) below 0.05 and thus significance.

The final model, utilizing the initial features as well as the protein annotations, linear fit and auc, showed the best results (Table 3). This highlights that additional feature engineering, incorporating knowledge implicitly or explicitly used during manual labeling, can improve overall model performance. The FPR of the final model is 1.6 %, which necessitates the manual review of only 3.3 % of the full dataset if false positive observations are to be removed, significantly reducing manual intervention overall. Furthermore, the final model was applied to an independent internal dataset (see SI), which could be classified with similar performance (AUC: 0.996, ACC: 0.997, PPV: 0.784, FPR: 0.003, TPR: 0.982). Thus, the final model generalizes well and can be applied to datasets generated with a similar experimental setup.

Table 3. Confusion matrix of the test set predicted with the model that is based on the engineered feature set.

| Actual class | Predicted: False | Predicted: True |
|---|---|---|
| False | 55,354 | 892 |
| True | 12 | 1,003 |

The applied threshold was estimated from a 10-fold cross-validation to achieve a TPR of 99 %.


Learning curve estimation

The main bottleneck of the proposed classification procedure remains the labeling of the initial training dataset. Thus, we investigated the minimal number of targets needed to generate an initial model with adequate performance, as well as the ability to improve such a model by adding more data. Furthermore, the time required for experimental data acquisition and sample preparation, which is mainly driven by the number of samples (e.g. the number of evaluated concentrations), was optimized by in-silico evaluations. The experiment to estimate the learning curve (see Methods) for the observed performance measures (TPR, FPR, ACC, PPV and AUC) revealed that the variance of these measures decreased drastically after around 400 targets. The absolute gain in the performance measures per additional target was negligible beyond 400 targets (Figure 6). This implies that around 400 positive observations are sufficient to generate an initial model (under the constraint that the dataset is skewed and positive observations represent the minority class). An initial model generated in this way can be used to predict new data. A subsequent manual evaluation of the positive predictions can then be used to weed out false predictions, resulting in an additional dataset that can be used to refine the initially obtained model (Figure 1).


Figure 6: Performance of the model with the final feature set as a function of the number of targets in the training dataset. The maximum number of targets per bin represents the upper end of the range of positive observations in the corresponding bin. The red line indicates the median performance of the 50 models with the highest number of targets.

The in-silico evaluation has to be assessed from two angles, since these experiments can reveal interaction partners as well as the binding parameters of the investigated compound. While the first task profits from concentrations orders of magnitude higher than the inflection point, the opposite holds true for the latter. Thus, one extreme concentration could be included for the purpose of interaction detection, and five to six concentrations equally distributed over the expected EC50 range could be used to determine the binding parameters (Figure S-5). Hence, such an in-silico evaluation can be used to optimize the experimental workflow and minimize experimental effort while ensuring similar accuracy for the relevant experimental results. As an example, if the main focus of an experiment is the detection of novel interactions, two concentrations (in combination with a DMSO control) are in principle sufficient to identify them. While the concentrations 3 nM and 30000 nM are suitable to detect interactions, they are not suitable to estimate the corresponding EC50. If this is also relevant (e.g. to estimate the biological importance of interactions), a good selection would be 100 nM and 1000 nM, which allows for both identification and estimation of the EC50.

Conclusion

In this study we introduced a framework consisting of two web applications, CindeR and CurveClassification, for the classification of experimental observations, and demonstrated its use on data obtained from high-throughput competition binding experiments. The comparison of different classification algorithms revealed that, while in principle many algorithms are suitable for this task, the good scaling properties combined with high performance led to the selection of random forests over e.g. SVMs. Furthermore, we investigated different sampling methods, concluding that when the threshold of the model is optimized, the choice of sampling method is of minor importance. Undersampling was selected due to its good overall performance and the resulting fast model training. While this evaluation was done on one particular dataset, we have successfully used CiRCus in the form presented here in a variety of different projects, suggesting that the decisions made here generalize to some extent (for data with properties similar to the example data).

We showed on earlier published data how both applications can be used to train a model with high initial performance (AUC 0.998; ACC 0.958; PPV 0.294; TPR 0.993). This model could already be used to minimize the required manual effort, since only 6.0 % of the full dataset required manual review to weed out false positives. We were able to further increase its performance by additional feature engineering (AUC 0.999; ACC 0.984; PPV 0.530; TPR 0.988). However, the initial feature selection and the subsequent feature engineering can be tedious if relevant experimental features are not known, e.g. when using data that were not generated in-house. The CiRCus framework allows the generation of high-quality classifiers in a guided fashion. In addition, it provides a data-specific visualization option, which allows for a convenient evaluation of falsely classified observations during model generation and an easy exploration of positively predicted observations from new data. Furthermore, CiRCus was applied to two similar internal datasets (protein and peptide intensity as a function of concentration), plus one external credit card fraud dataset 18 (see SI), and showed similar performance. Thus, the applicability of the shiny app is not limited to competition binding experiments; it can likely be used for other experiments (e.g. pulldowns) where metadata in combination with varying protein and peptide intensities caused by various factors need to be evaluated simultaneously.


ASSOCIATED CONTENT

HoldoutDataset.csv: Internal independent dataset generated with an experimental setup similar to that of Klaeger et al. 1

SupplementFigures.pdf:
Figure S-1: Structure of an R object of class WrappedCombiModel.
Figure S-2: Comparison of 42 classification algorithms based on six performance measures.
Figure S-3: FPR obtained by the Bayesian hyperparameter optimization of the random forest parameters mtry (number of randomly sampled features considered when computing the best node split) and ntree (number of trees).
Figure S-4: Scatterplot of the R2 obtained from a log-logistic model (x-axis) and the R2 obtained from a linear fit (y-axis).
Figure S-5: Average relative error of the pKd estimation vs. classification cluster resulting from the in-silico evaluation of all 247 concentration combinations.

SupplementInformation: Hyperparameters of the applied random forest; Bayesian optimization for hyperparameter search; multi-label classification; external dataset.


ACKNOWLEDGMENT The authors wish to thank numerous colleagues including D. Zolg, S. Petzoldt, J. Rechenberger, J. Zecha, P. Samaras, M. Frejno and the entire Kuster team for fruitful discussions.

ABBREVIATIONS ACC, accuracy; AUC, area under the receiver operating characteristic curve; DMSO, dimethyl sulfoxide; FN, false negative; FP, false positive; FPR, false positive rate; LFQ, label-free quantification; NN, nearest neighbors; PPV, precision; ROC, receiver operating characteristic; SVM, support vector machine; auc, area under the dose-response curve

NOTES M.W. and B.K. are founders and shareholders of OmicScouts, a company providing proteomic and chemical proteomic services. They have no operational role in the company.

References


1. Klaeger, S.; Heinzlmeir, S.; Wilhelm, M.; Polzer, H.; Vick, B.; Koenig, P.-A.; Reinecke, M.; Ruprecht, B.; Petzoldt, S.; Meng, C., The target landscape of clinical kinase drugs. Science 2017, 358 (6367), eaan4368.
2. Käll, L.; Vitek, O., Computational Mass Spectrometry-Based Proteomics. PLoS Computational Biology 2011, 7 (12), e1002277.
3. Kumar, C.; Mann, M., Bioinformatics analysis of mass spectrometry-based proteomics data sets. FEBS Letters 2009, 583 (11), 1703-1712.
4. Jensen, A. J.; Molina, D. M.; Lundbäck, T., CETSA: a target engagement assay with potential to transform drug discovery. Future Medicinal Chemistry 2015, 7 (8), 975-978.
5. Bantscheff, M.; Eberhard, D.; Abraham, Y.; Bastuck, S.; Boesche, M.; Hobson, S.; Mathieson, T.; Perrin, J.; Raida, M.; Rau, C.; Reader, V.; Sweetman, G.; Bauer, A.; Bouwmeester, T.; Hopf, C.; Kruse, U.; Neubauer, G.; Ramsden, N.; Rick, J.; Kuster, B.; Drewes, G., Quantitative chemical proteomics reveals mechanisms of action of clinical ABL kinase inhibitors. Nature Biotechnology 2007, 25, 1035.
6. Klaeger, S.; Gohlke, B.; Perrin, J.; Gupta, V.; Heinzlmeir, S.; Helm, D.; Qiao, H.; Bergamini, G.; Handa, H.; Savitski, M. M., Chemical proteomics reveals ferrochelatase as a common off-target of kinase inhibitors. ACS Chemical Biology 2016, 11 (5), 1245-1254.
7. Clark, N. A.; Hafner, M.; Kouril, M.; Williams, E. H.; Muhlich, J. L.; Pilarczyk, M.; Niepel, M.; Sorger, P. K.; Medvedovic, M., GRcalculator: an online tool for calculating and mining dose-response data. BMC Cancer 2017, 17, 698.
8. Ritz, C.; Baty, F.; Streibig, J. C.; Gerhard, D., Dose-response analysis using R. PLoS ONE 2015, 10 (12), e0146021.
9. Bischl, B.; Lang, M.; Kotthoff, L.; Schiffner, J.; Richter, J.; Studerus, E.; Casalicchio, G.; Jones, Z. M., mlr: Machine Learning in R. The Journal of Machine Learning Research 2016, 17 (1), 5938-5942.
10. Chang, W.; Cheng, J.; Allaire, J.; Xie, Y.; McPherson, J., shiny: Web application framework for R [Computer software]. https://CRAN.R-project.org/package=shiny, 2017.
11. Kuhn, M., caret package. Journal of Statistical Software 2008, 28 (5), 1-26.
12. Bottou, L.; Lin, C.-J., Support vector machine solvers. Large Scale Kernel Machines 2007, 3 (1), 301-320.
13. Louppe, G., Understanding random forests: From theory to practice. arXiv preprint arXiv:1407.7502, 2014.
14. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P., Random Forest: A Classification and Regression Tool for Compound Classification and QSAR Modeling. Journal of Chemical Information and Computer Sciences 2003, 43 (6), 1947-1958.
15. Chawla, N. V.; Bowyer, K. W.; Hall, L. O.; Kegelmeyer, W. P., SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 2002, 16, 321-357.
16. Bischl, B.; Richter, J.; Bossek, J.; Horn, D.; Thomas, J.; Lang, M., mlrMBO: A modular framework for model-based optimization of expensive black-box functions. arXiv preprint arXiv:1703.03373, 2017.
17. Prati, R. C.; Batista, G. E.; Monard, M. C., Data mining with imbalanced class distributions: concepts and methods. In IICAI, 2009; pp 359-376.
18. Pozzolo, A. D.; Caelen, O.; Johnson, R. A.; Bontempi, G., Calibrating Probability with Undersampling for Unbalanced Classification. In 2015 IEEE Symposium Series on Computational Intelligence, 2015; pp 159-166.
