Modelling Kinase Inhibition Using Highly Confident Data Sets


Modelling Kinase Inhibition Using Highly Confident Data Sets Sorin Avram, Alina Bora, Liliana Halip, and Ramona Curpan J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00729 • Publication Date (Web): 30 Apr 2018 Downloaded from http://pubs.acs.org on May 1, 2018

Journal of Chemical Information and Modeling

Modelling Kinase Inhibition Using Highly Confident Data Sets Sorin Avram*, Alina Bora, Liliana Halip, Ramona Curpăn* Department of Computational Chemistry Institute of Chemistry Timisoara of Romanian Academy 24 Mihai Viteazu Avenue 300223-Timişoara, Romania

ABSTRACT

Protein kinases form a consistent class of promising drug targets, and several efforts have been made to predict the activity of small molecules against a representative part of the kinome. This study continues our previous work (Bora, A.; Avram, S.; Ciucanu, I.; Raica, M.; Avram, S., Predictive Models for Fast and Effective Profiling of Kinase Inhibitors. J. Chem. Inf. Model. 2016, 56, 895−905; www.chembioinf.ro), aiming to build and measure the performance of ligand-based kinase inhibitor prediction models. Here, we analyzed kinase-inhibitor pairs with multiple activity points extracted from the ChEMBL database and identified the main sources of inconsistency. Our results indicate that lower IC50 values are usually less affected by errors and reflect more accurately the structure-activity relationship of the molecules against the target, which makes them well suited for quantitative structure-activity relationship (QSAR) studies. Further, we modeled the activity of 104 kinases using unbiased target-specific activity points. The performance of predictors built on extended-connectivity fingerprints (ECFP4) and 2D-pharmacophore fingerprints (PFPs) is compared by means of tolerance intervals (TIs, 95%/95%) in virtual screening (VS) and classification (CLS) tasks, using external random (RandSets) and diversity-based (DivSets) test sets. We found that each encoding outperforms the other on different kinases in VS, and that PFP-models perform consistently better in classifying actives (higher sensitivity). Next, we combined the two encodings into a single one (PFPECFP) and demonstrate that, especially in VS (as indicated by the exponential receiver-operating curve enrichment metric, eROCE), model performance increased for the vast majority of kinases compared to the individual fingerprint models. These findings are more pronounced in the more challenging DivSets than in RandSets. The current paper explores the boundaries of inhibitor predictors for individual kinases to enhance VS and ultimately aid the discovery of novel compounds with desirable polypharmacology.

INTRODUCTION

Learning algorithms have a wide spectrum of applications in various fields, including drug discovery.1-6 Machine learning has been successfully used numerous times to identify novel compounds active against a desired protein target.7 Powerful prediction models are able to promote valuable compounds in the virtual screening (VS) of large chemical libraries8 and also to discriminate between active and inactive compounds in classification tasks.9 In order to assess their usefulness, the prediction capabilities need to be estimated in real-life applications, e.g., in classification, i.e., the separation of actives from inactives by accurate class-labelling, or in VS, i.e., the prioritization of actives by ranking a large chemical library according to a scoring function. In drug discovery, a series of prediction models (predictors) have been developed to tackle important targets, such as kinases.10-14

The human kinome comprises more than 500 kinases,15, 16 and many are involved in a plethora of biological processes that determine several pathological conditions, including cancers.17 For drug development, the design of selective compounds across several classes of kinases is a necessary and challenging research field.18 Several computational models have been developed to predict kinase ligands (in many cases inhibitors)14, 19-21 at a nearly kinome-wide level.14 Activity data (Ki, IC50, Kd, etc.) extracted from ChEMBLdb,22, 23 PubChem Bioassay24 or Kinase Knowledgebase (KKB)25 have been used to train and test supervised machine learning models.10, 14 Among these, random forest has been shown to generally outperform alternative methods (e.g., k-nearest neighbors, naive Bayes, support-vector machines, deep neural networks, etc.) in the kinome-wide profiling of kinase inhibitors,6, 14 offering practical advantages, e.g., it runs efficiently on large data sets, handles large numbers of input variables of any type (discrete, continuous or binary) without requiring variable selection or normalization, and gives good results with standard parameters.6, 14 During the past decade, molecular encodings such as extended-connectivity fingerprints (ECFPs)26 have been widely adopted and validated in similarity search27, 28 and lately also in machine learning applications, including structure-activity predictions against kinases.19, 20 Other molecular descriptors, such as pharmacophore-based encodings,29 are emerging as promising descriptors in scaffold-hopping predictions.14, 30 Simple implementations, such as ChemAxon’s atom-pair based 2D pharmacophore fingerprints (PFPs),31 are represented as counts of all pharmacophore feature pairs at molecular path lengths ranging from, e.g., 2 to 10. Comparison studies of two-centric 2D-pharmacophore implementations and extended-connectivity fingerprints have focused mainly on similarity search28 and clustering32 methods, and to a smaller extent on large-scale supervised machine learning. Continuous efforts are undertaken to standardize and
mine databases containing bioactivity results,24, 33, 34 which are supplied to cheminformatics in general, and to supervised learning in particular.

The activity of a compound tested against kinases depends upon several factors related to the assays. For example, according to Bain et al.,35 the stability of the compound, its cell permeability and biological reactivity (causing local accumulation at preferential targets), as well as the millimolar ATP concentration encountered in cells, might lead to large variations in IC50 values in cellular assays compared to biochemical ones. Mixing the outcomes of these determinations might negatively impact the class definitions of actives and inactives in machine learning, generating poor performance and misleading predictions.

The current paper continues our recent efforts14 to build and extensively validate ligand-based models for the prediction of kinase inhibitors. Here, we aim to i) review the inhibitory activity data (IC50) extracted from the ChEMBL database, concerning a relevant group of protein kinases with consistent inhibitory data;14 ii) identify and remove several sources of inconsistent activities; iii) use confident, target-specific inhibitory data to train classification models based on different molecular descriptors; iv) evaluate the prediction power measured in CLS and VS tasks on challenging external test sets; v) provide grounds for statistically sound prediction comparisons.

Materials and Methods

Activity data retrieval and class definitions. One hundred seven UniProt IDs36 of human kinases recently employed for modelling kinase inhibitors14 were used to search for standardized IC50 values in ChEMBL version 22_1.22, 23, 37 We found 123718 non-zero activity points (APs) of single protein targets, with relationship of type “>=”, “>” and “=”. Throughout
this paper, we will refer to this data set as the initial set of APs. Further, APs were grouped according to unique compound-target pairs, based on the molecule’s ChEMBL ID and the target’s UniProt ID. Pairs with multiple APs considered inconsistent were submitted to analysis (see next section). Only corrected and biochemical assay outcomes were further used, and the geometric mean was computed to obtain a single IC50 value for each compound-kinase pair. The compounds in the kinase sets were split into two groups according to their inhibition activity: actives (IC50 ≤ 10 µM) and inactives (IC50 > 10 µM), followed by the adjustment of the IC50 threshold of the actives (for each kinase set) based on the distribution of the inhibitory values. The adjusted IC50 thresholds (see Excel file in Supporting Information) were computed using the package “hotspot”38, 39 in the R statistical environment,40 as previously performed by Bora et al.14

Inconsistent activity analysis. In the initial set of APs, 1122 compound-kinase pairs for which multiple IC50 determinations are available, with values spread over two orders of magnitude, were considered inconsistent and retained for manual assay inspection. The extreme APs were analyzed based on the information found in the assay “Description” field provided by ChEMBLdb. The activity information and compound structures were traced back to the source articles. APs were tagged according to several characteristics (many related to the assay) which could affect the IC50 values: cellular assays (Cellular), tissue assays (Tissue), mutant kinase targets (Mutant), patent data (Patents), specification of ATP concentrations > 100 µM (ATP), specification of reaction times > 1 hour (Time), curation errors which can be corrected (Errors), and ambiguous activity information (Ambiguous). A last category, i.e., BioChemical, defined APs which were not included in the previous groups.
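The pair-level processing described above (grouping APs by compound-kinase pair, flagging pairs whose IC50 values spread over two orders of magnitude, and averaging with the geometric mean) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function name, the tuple layout, the use of nM units and the reading of "two orders of magnitude" as a max/min ratio of at least 100 are all assumptions made for illustration. The sketch also only flags inconsistent pairs, whereas in the study such pairs were inspected manually and filtered before averaging, and the 10 µM cutoff was later adjusted per kinase.

```python
import math
from collections import defaultdict

ACTIVE_THRESHOLD_NM = 10_000.0  # initial 10 µM cutoff, expressed in nM (assumed unit)
INCONSISTENCY_FOLD = 100.0      # assumed reading of "two orders of magnitude"


def summarize_pairs(activity_points):
    """Summarize (compound_id, target_id, ic50_nm) tuples per compound-target pair."""
    grouped = defaultdict(list)
    for compound_id, target_id, ic50_nm in activity_points:
        grouped[(compound_id, target_id)].append(ic50_nm)

    summary = {}
    for pair, values in grouped.items():
        # Flag pairs whose extreme IC50 values differ by at least 100-fold.
        inconsistent = max(values) / min(values) >= INCONSISTENCY_FOLD
        # Geometric mean: exponential of the mean of the log-transformed values.
        geo_mean = math.exp(sum(math.log(v) for v in values) / len(values))
        label = "active" if geo_mean <= ACTIVE_THRESHOLD_NM else "inactive"
        summary[pair] = {
            "geo_mean_nm": geo_mean,
            "inconsistent": inconsistent,
            "label": label,
        }
    return summary
```

The geometric mean is used here, as in the text, because IC50 values are distributed over several orders of magnitude and an arithmetic mean would be dominated by the largest value.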

Balanced kinase sets of compounds. A total of 33037 compounds (unique ChEMBL22, 23 molecular identifiers) were obtained after the removal of inconsistent APs. Additionally, we used the PubChemKinIna data set,14 which comprises 38957 compounds considered generally inactive against kinases based on high-throughput screening results. All compounds were standardized using ChemAxon’s Standardizer41 (salt removal, largest disconnected fragment kept, functional group transformations and basic aromatization scheme used).14 Of the 107 kinases, three sets contained < 50 actives and were excluded, i.e., MAP3K7 (mitogen-activated protein kinase kinase kinase 7, UniProt ID O43318), ATR (Serine/threonine-protein kinase ATR, UniProt ID Q13535) and PLK2 (Serine/threonine-protein kinase PLK2, UniProt ID Q9NYY3), and 104 kinase sets were supplied to supervised machine learning. For each set of kinase inhibitors, we selected an equal number of non-inhibitors from the available inactives in the corresponding ChEMBL kinase set merged with all PubChemKinIna inactives. The selection was based on ChemAxon’s topological fingerprints41 and Tanimoto similarity.42 Thereby, we obtained balanced kinase sets (freely available on www.chembioinf.ro, and briefly described in the Excel file in Supporting Information) containing actives and their most similar inactives in equal numbers.

Training and Test sets. The balanced kinase sets were divided into 80% training and 20% test sets, fifty times, according to two splitting criteria, which generated the random (RandSets) and the diverse sets (DivSets). RandSets were obtained by a random selection of compounds (proportional class labels) using “createDataPartition” from the package “caret”43 available in the R statistical environment.40
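The similarity-based selection of inactives for the balanced kinase sets can be sketched as follows. The text states only that the selection relied on topological fingerprints and Tanimoto similarity; the exact ranking rule is not given. The sketch therefore assumes, purely for illustration, that each inactive is scored by its highest Tanimoto similarity to any active and that the top-scoring inactives are kept. Fingerprints are represented as Python sets of on-bit indices, and all function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def select_similar_inactives(actives, inactives):
    """Pick as many inactives as there are actives, most similar first.

    actives, inactives: dicts mapping compound id -> set of on-bit indices.
    Assumed rule: score each inactive by its maximum similarity to any active.
    """
    scored = []
    for inactive_id, inactive_fp in inactives.items():
        best = max(tanimoto(inactive_fp, active_fp) for active_fp in actives.values())
        scored.append((best, inactive_id))
    # Highest similarity first; ties are broken by id to keep the result deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [inactive_id for _, inactive_id in scored[: len(actives)]]
```

Choosing the inactives closest to the actives makes the classification task harder than random sampling would, since the two classes then occupy neighbouring regions of chemical space.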

In DivSets, we aimed to maximize the diversity of the actives in the external test set while randomly selecting inactives. Within each kinase set, the actives were clustered using the sphere exclusion algorithm implemented in ChemAxon’s jkluster program (dissimilarity radius of 0.3),44 ChemAxon’s topological fingerprints and Tanimoto distance. The clusters were sorted in decreasing order of size and added to the training sets until 80% of the actives in the kinase set was reached (see Excel file in Supporting Information). In some cases, in order to assign exactly 80% of the actives to the training set, the necessary number of actives were randomly selected from the smallest training cluster, i.e., the largest test set cluster. Thereby, the smallest (and more numerous) clusters, covering more diverse actives, were included in the test set, as exemplified for ROCK1 (Rho-associated protein kinase 1) in Figure S1 and Figure S2 in Supporting Information.

The 20% test sets, containing equally sized classes, were evaluated for CLS. In order to create real-life VS conditions,14 each test set was merged with the PubChemKinIna compounds, excluding the inactives used for training the corresponding model. Virtual screening conditions are assured by the large number of inactives encompassed in PubChemKinIna (~39000).14 Depending on the kinase set, the predictors are challenged to prioritize 10 to 200 actives, for the vast majority of the kinases. Thereby, each model generated in the current study was evaluated separately on independent CLS and VS sets.

Molecular descriptors. Using ChemAxon’s Java API41 we computed the following molecular encodings: ChemAxon’s molecular fingerprints (used for similarity search and clustering), ECFP4 (1024-bit and 512-bit length), and pharmacophore fingerprints (PFPs, standardized to the 210-long format described by Bora et al.).14 Additionally, PFP and ECFP4 were joined into PFPECFP of length 722, by simply appending the 512-bit version of ECFP4 to the 210-long PFP vector.
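The joining of the two encodings amounts to a plain vector concatenation, which the following Python sketch illustrates. It assumes the PFP counts and ECFP4 bits have already been computed elsewhere (the study used ChemAxon's Java API for this); the function name and the list-based representation are hypothetical.

```python
def join_fingerprints(pfp_counts, ecfp_bits, pfp_length=210, ecfp_length=512):
    """Append the 512-bit ECFP4 vector to the 210-long PFP count vector."""
    if len(pfp_counts) != pfp_length:
        raise ValueError(f"expected {pfp_length} PFP values, got {len(pfp_counts)}")
    if len(ecfp_bits) != ecfp_length:
        raise ValueError(f"expected {ecfp_length} ECFP4 bits, got {len(ecfp_bits)}")
    # PFP counts come first, ECFP4 bits second, giving 210 + 512 = 722 descriptors.
    return list(pfp_counts) + list(ecfp_bits)
```

Because random forest handles count and binary variables together without normalization, the mixed vector can be supplied to the learner directly.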

Learning algorithm. Random forest models were computed based on conditional inference trees using the function “cforest” (with default parameters: ntrees = 500, mtry = square root of the total number of variables) from the package “party”45 available in the R statistical environment.40 The response classes were set as the majority class of each node and class probabilities were computed from the conditional distribution of the response.45

Performance intervals. Tolerance intervals (TIs) describe the spread of values within confidence limits computed for a certain percentile of the distribution. Here, tolerance intervals were computed for each evaluation parameter based on 50 samples (resulting from the resampling of the training and test sets) using the function “nptol.int” (Wilks non-parametric approach)46 in the package “tolerance”, available in the R statistical environment.40 The one-sided tolerance interval for 95% of the data was computed at the 95% confidence level (95%/95% TI).

Evaluation parameters. A set of evaluation parameters (see Table 1) were computed (using our in-house program ETCIv1.6)14 to describe the CLS (and discrimination) power of the models, as well as their VS performance (the early enrichment in actives). Model performance in the CLS test sets (equal number of class samples) is assessed through sensitivity (Se, the fraction of actives correctly predicted), specificity (Sp, the fraction of inactives correctly predicted), accuracy (Acc, the fraction of correct predictions) and the area under the receiver operating curve (AUC),47 and in the VS test sets, through the exponential receiver operating curve enrichment (eROCE)27, 33, 48 and the true positive rate (TPR) at 0.5% and 1% false positives (FPs).14, 33, 49 The eROCE parameter computes identical values (with the same meaning) as the Boltzmann Enhanced Discrimination of ROC (BEDROC)50 parameter in VS conditions, but offers important advantages, i.e., ease of computation (a simple function of the false positive rate),
increased robustness also in non-typical VS scenarios and a wider application in comparing results from different studies and data sets.48

Table 1. Description of the evaluation parameters used to assess the classification and virtual screening capacities of the predictors.

Evaluation type | Name | Equation[a]
Classification (CLS) | Sensitivity | Se = TP / (TP + FN)
Classification (CLS) | Specificity | Sp = TN / (TN + FP)
Classification (CLS) | Accuracy | Acc = (TP + TN) / (TP + FN + TN + FP)
Classification (CLS) | Area under the receiver operating curve | $AUC = 1 - \frac{1}{N_+} \sum_{i=1}^{N_+} FPR_i$
Virtual Screening (VS) | Exponential ROC Enrichment (α = 20) | $eROCE = \frac{1}{N_+} \sum_{i=1}^{N_+} e^{-\alpha \, FPR_i}$
Virtual Screening (VS) | True Positive Rate at x% of the False Positives | TPRx = TPR at x% FPs, where x = 0.5% or 1%

[a] TP is the number of correctly predicted actives (true positives), TN is the number of correctly predicted inactives (true negatives), FP is the number of incorrectly identified actives (false positives), FN is the number of incorrectly identified inactives (false negatives), TPR is the fraction of correctly predicted actives, FPRi is the ratio of the number of mispredicted inactives to the total number of inactives when the ith active was retrieved in the ranking list, and N+ is the number of actives.
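As an illustration of how the parameters in Table 1 relate to a ranked list, the following Python sketch computes them from confusion-matrix counts and from a list of class labels ordered by decreasing model score. It is not the in-house program ETCIv1.6; the function names are hypothetical, the AUC and eROCE expressions follow the equations as reconstructed in Table 1, and tied scores are assumed to have been resolved into a strict ranking beforehand.

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Return (Se, Sp, Acc) from confusion-matrix counts."""
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    return se, sp, acc


def fpr_at_actives(ranked_labels):
    """FPR_i for each active: fraction of inactives ranked above the i-th active.

    ranked_labels: list of 1 (active) / 0 (inactive), best-scored compound first.
    """
    n_inactives = ranked_labels.count(0)
    fprs = []
    inactives_seen = 0
    for label in ranked_labels:
        if label == 1:
            fprs.append(inactives_seen / n_inactives)
        else:
            inactives_seen += 1
    return fprs


def auc(ranked_labels):
    """AUC = 1 - mean(FPR_i) over all actives."""
    fprs = fpr_at_actives(ranked_labels)
    return 1.0 - sum(fprs) / len(fprs)


def eroce(ranked_labels, alpha=20.0):
    """eROCE = mean(exp(-alpha * FPR_i)) over all actives (alpha = 20 in Table 1)."""
    fprs = fpr_at_actives(ranked_labels)
    return sum(math.exp(-alpha * f) for f in fprs) / len(fprs)


def tpr_at_fpr(ranked_labels, max_fpr):
    """Fraction of actives retrieved before the false positive rate exceeds max_fpr."""
    fprs = fpr_at_actives(ranked_labels)
    return sum(1 for f in fprs if f <= max_fpr) / len(fprs)
```

For example, `tpr_at_fpr(ranked_labels, 0.005)` corresponds to TPR0.5 and `tpr_at_fpr(ranked_labels, 0.01)` to TPR1. The exponential weight in eROCE makes actives retrieved at low false positive rates count far more than those retrieved late, which is why it reflects early enrichment.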

RESULTS AND DISCUSSIONS

We begin by reporting our results on the inconsistent APs related to kinase inhibition, and in the following sections describe the output of an extensive model-evaluation study based on homogeneous activity data. The results of the current study comprise 77633 investigated inhibitory APs involving 33037 compounds and covering 107 kinases. We built and evaluated (the VS and CLS performance of) 31200 random forest models (104 kinase sets × 3 types of molecular encodings × 2 splitting schemes, i.e., DivSets and RandSets, × 50 resamplings). The results are discussed in terms of tolerance intervals (TIs, i.e., the range in which at least 95% of the evaluation values fall with a 95% level of confidence) computed for 7 evaluation metrics describing the potential of the approaches.

Inconsistencies in activity points

We found 1122 compound-kinase pairs (covering 3381 APs from 1407 assays in 778 documents) showing inconsistencies (multiple activity determinations whose values spread over two orders of magnitude). We considered several possible causes, which defined categories of APs (see Materials and Methods above), and manually inspected the lowest and the highest IC50 values, as well as the corresponding assay properties in the published articles providing the APs.

Table 2. Counts (and percentages) of APs (IC50 values) in categories describing a total of 1122 compound-kinase pairs with inconsistent APs; categorized counts of kinase APs found in the initial set of APs.

Categories | Inconsistency analysis set: Highest IC50 | Inconsistency analysis set: Lowest IC50 | Inconsistency analysis set: Total | Initial set of APs
Tissue | 113 (93%) | 8 (7%) | 121 | 1015
Error | 77 (90%) | 9 (10%) | 86 | 378
ATP | 39 (80%) | 10 (20%) | 49 | 2983
Cellular | 661 (77%) | 192 (23%) | 853 | 21653
Time | 112 (73%) | 42 (27%) | 154 | 5884
Ambiguous | 47 (71%) | 19 (29%) | 66 | 302
Mutant | 18 (45%) | 22 (55%) | 40 | 1026
Patent | 22 (27%) | 59 (73%) | 81 | 23824
BioChemical | 118 (15%) | 663 (85%) | 781 | 77255

In Table 2 we report, within each category, the percentage of APs found as the highest and lowest IC50 values in the inconsistency sets. The highest IC50 values were encountered in activity determinations performed on tissues (whole blood in general, which are not flagged as tissue-based APs in the corresponding field in ChEMBL; Tissue 93%) and cells (also often not correspondingly assigned; Cellular 77%), and in APs with specified reaction times > 1 hour (Time 73%) or high ATP concentrations > 100 µM (ATP 80%). Errors (Error) originating from activity value conversions affected the higher IC50 APs in 90% of the cases. Of the 781 APs from target-specific biochemical assays (BioChemical), ~85% were found as the lowest IC50 values. Activity determinations extracted from patents could not be verified because of limited access to the assay protocols and were separately categorized. In 73% of the cases studied herein, Patents APs were encountered among the lower IC50 values. These results suggest that the lower IC50 values are usually less affected by errors, avoid factors related to the cells or tissues, and reflect more accurately the structure-activity relationship of the molecules against the target.

The analysis of the inconsistent data resulted in the categorization of entire assays, which included many more APs. This enabled us to extend the tagging to the entire initial data set of 123718 APs (Table 2). Further, APs of type Cellular, Tissue, Mutant, Patents, ATP, Time and Ambiguous were filtered out. Erroneous APs were corrected and added to BioChemical APs in a
total of 77633 (63%) activity determinations, which were further used in modelling the kinase inhibitory activity.

Commonly, inconsistent activity pairs are identified and filtered out to avoid biased data in mining and modelling studies.51 However, the source of these extreme APs might affect more data. Applying the percentage of Cellular and Tissue APs found in the inconsistency sets (i.e., 70% of the highest IC50 values) to the 22668 APs (Cellular and Tissue) in the initial set results in 15868 APs. This means that, of a total of 105382 kinase-compound pairs found in the initial set of APs, about 15% of the IC50 values also reflect assay factors related to the cellular (or tissue) conditions, which can lead to the misdefinition of actives and might be a source of apparent activity cliffs.

Along with IC50, Ki determinations could also be used to enrich the kinase inhibitory data sets. We found 64354 Ki APs referring to non-zero, single (non-mutant) protein inhibition, in non-patent assays, with relationship of type “>=”, “>” and “=”. Approximately 90% of these resulted from a single kinase panel study, i.e., that of Metz et al.52 According to the comparative review by Sutherland et al.53 of four large kinase profiling panels (which includes the Metz data set), the agreement for active compounds is only 37%. Without neglecting the value of kinase panel assays, this low percentage, in addition to approximations regarding the conversion between the two activity types (Cheng-Prusoff equations),54 has persuaded us, at least for the purpose of the current study, to leave out Ki determinations. In the future, the revised IC50 APs assembled herein might help to identify and explain possible inconsistencies in kinase-profiling assay results.

Comparison between molecular encodings

The performances of the predictors built on different fingerprints were compared in terms of the TIs (95%/95%) of the evaluation parameters. Herein, we consider that, for a given kinase,
model performance using one fingerprint is superior to that using another fingerprint if its TI (computed for a particular metric) is higher and the two TIs do not overlap. The evaluation results are shown in Figures 1-8.

PFP and ECFP-based model performance

In VS testing on DivSets, eROCE shows PFP- and ECFP-based models to outperform each other equally, each on 34 kinases, but a superior TPR0.5 performance of ECFP-models was encountered in 37 cases (compared to PFP-models in 15 kinases). We found superior PFP-models in classifying kinase inhibitors for 78 and 43 of the kinases, in terms of Se and Acc, respectively. The results obtained on the RandSets are more homogeneous, with only a few superior TIs: in the VS parameters and Sp for ECFP models, and in Se for PFP models. In terms of predicting inactives (Sp), there is no kinase for which PFP outperformed ECFP predictors (Figure 1 and Table S1 in Supporting Information).

Considering the above findings, PFP and ECFP encodings seem to perform in a complementary manner: in VS evaluation, the two encodings perform better on different kinases, and in CLS, PFP tends to better classify actives while ECFP better classifies inactives. These features are further exploited by joining the two fingerprints into PFPECFP (of length 722).
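The superiority criterion used throughout this comparison (a higher TI that does not overlap the other TI) can be expressed as a short Python sketch. The class and function names are hypothetical, and the numerical intervals in the usage note are invented for illustration only; they are not results from this study.

```python
from typing import NamedTuple


class ToleranceInterval(NamedTuple):
    lower: float
    upper: float


def is_superior(ti_a, ti_b):
    """True if ti_a lies entirely above ti_b, i.e., is higher and does not overlap."""
    return ti_a.lower > ti_b.upper


def count_superior(tis):
    """For each encoding, count how many of the other encodings it outperforms.

    tis: dict mapping encoding name -> ToleranceInterval for one kinase and one metric.
    """
    return {
        name: sum(
            1
            for other, other_ti in tis.items()
            if other != name and is_superior(ti, other_ti)
        )
        for name, ti in tis.items()
    }
```

With made-up intervals such as PFP (0.40, 0.50), ECFP (0.20, 0.35) and PFPECFP (0.55, 0.70), `count_superior` returns 1, 0 and 2, respectively, which is the kind of per-kinase count plotted in the stacked barplots of Figure 1.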

Figure 1. Stacked barplots describing, for each of the 104 kinases, counts of superior TIs between three fingerprint models, i.e., PFP (green), ECFP (red) and PFPECFP (blue), in DivSets. E.g., in the case of ABL1, in terms of eROCE, PFPECFP-models are superior to both PFP and ECFP (count of 2), while PFP-models perform superior to ECFP-models (count of 1); in the case of AKT1, eROCE TIs indicate that ECFP- and PFPECFP-models are superior to PFP-models.

PFPECFP model performance

We found that PFPECFP models enhanced especially the VS on DivSets, as suggested by eROCE: for 28 kinases, PFPECFP performed superior to both individual encodings, and on 62 better than at least one of them (Figure 1). In CLS, PFPECFP performed better than either PFP or ECFP in classifying actives for 71 kinases. Some improvement can be seen in the classification of inactives, but non-overlapping TIs could not be detected for PFPECFP. In contrast to DivSets, the evaluation of RandSets reveals only a few cases of superior TIs among the three encodings used in modeling kinase inhibitors (see Table S1 in Supporting Information). For example, PFPECFP outperformed the individual fingerprints for 7, 9 and 7 kinases in terms of Sp, eROCE and TPR0.5, respectively.

DivSets versus RandSets

Commonly, in drug discovery, a data set containing active and inactive compounds against a target is randomly split into training and test sets preserving proportional class samples.14, 27, 33 Here, we compared this approach, i.e., RandSets, to DivSets, for which only inactives are randomly sampled, while actives are kept constant, so that models learning from low-diversity sets of actives are challenged to predict highly diverse kinase inhibitors.

In comparison to DivSets, RandSets evaluations resulted in consistently higher scores. Subtracting separately the lower and the upper TI values of DivSets from the RandSets results, for all kinases, enables a direct comparison between the two types of splitting (Table S2 and S3 in Supporting Information). In terms of the lower TI limits, in PFPECFP modelling, the smallest difference is encountered for Sp (< 0.103) and the largest for Se (middle 80% range between 0.064 and 0.422, across all kinases). In VS, the difference in TPR0.5 enrichment spans between 0.235 and 0.55, with 80% values bounded between 0.028 and 0.371. In terms of the upper TI
values, we see a narrow gap between the Se values (in 80% of the kinases ranging from -0.086 to 0.041). A consequence of the random splitting of actives is that similar molecules are supplied to both training and testing, which improves model predictions compared to DivSets testing, as indicated by the wider TIs (in general, due to higher upper limits).

Further, we computed the Pearson correlation coefficient between DivSets and RandSets results for each evaluation metric (see Table S4 in Supporting Information). The outcome revealed a consistently strong positive correlation of the lower TI margins and, in general, a moderate positive one of the upper TI margins. The largest difference in correlation values between the two TI limits was found for Se, Acc and AUC in PFP and PFPECFP models, and for the VS parameters in ECFP models. Regardless of the fingerprint, the capacity to classify inactive compounds is highly correlated between the DivSets and RandSets (Pearson coefficient values around 0.9), indicating that the prediction of inactive compounds might be only loosely dependent on the diversity of the actives in the data set.

Our results suggest that, in spite of significant correlation in the lower TI values (between 0.669 for PFPECFP’s Se and 0.939 for ECFP’s Sp; see Table S4 in Supporting Information), DivSets evaluations are in general more severe compared to RandSets. DivSets provide a way to fully exploit the data set by testing the boundaries of the modelling tools, which used on average 39% ± 9% of the total number of clusters for training. In DivSets, actives assigned for testing have Tanimoto values < 0.7 compared to the ones the model is trained on, and cover the majority of the clusters in the kinase sets (see Figure S1 and S2 in Supporting Information). Thus, compared to RandSets, DivSets might indicate a more realistic capacity of the method to identify both novel and diverse classes of kinase inhibitors in real-life scenarios.
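The cluster-based assignment of actives that underlies DivSets (largest clusters to training until 80% of the actives is reached, remaining small clusters to testing) can be sketched in Python as follows. The sketch assumes the sphere exclusion clusters have already been computed and are passed in as lists of compound ids; the function name, the rounding of the 80% target and the fixed random seed are assumptions made for illustration and do not reproduce the original implementation.

```python
import random


def diversity_split(clusters, train_fraction=0.8, seed=0):
    """Assign clustered actives to training and test sets, largest clusters first.

    clusters: list of lists of active compound ids (one inner list per cluster).
    Returns (train_ids, test_ids).
    """
    rng = random.Random(seed)
    n_total = sum(len(cluster) for cluster in clusters)
    n_train_target = round(train_fraction * n_total)

    train, test = [], []
    for cluster in sorted(clusters, key=len, reverse=True):
        room = n_train_target - len(train)
        if room >= len(cluster):
            # The whole cluster still fits into the training set.
            train.extend(cluster)
        elif room > 0:
            # Boundary cluster: randomly pick just enough members to reach the target.
            members = list(cluster)
            rng.shuffle(members)
            train.extend(members[:room])
            test.extend(members[room:])
        else:
            # Training target reached: all remaining (smaller) clusters go to testing.
            test.extend(cluster)
    return train, test
```

Because the many small clusters end up in the test set, the test actives span more structural classes than the training actives, which is what makes DivSets more demanding than RandSets.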

Figure 2. Tolerance intervals (TIs) and medians of exponential ROC Enrichment (eROCE) values computed for PFP-, ECFP4- and PFPECFP-based models.

Figure 3. Tolerance intervals (TIs) and medians of TPRs at 0.5% FPs (TPR0.5) values computed for PFP-, ECFP4- and PFPECFP-based models.

Figure 4. Tolerance intervals (TIs) and medians of TPRs at 1% FPs (TPR1) values computed for PFP-, ECFP4- and PFPECFP-based models.

Figure 5. Tolerance intervals (TIs) and medians of sensitivity (Se) values computed for PFP-, ECFP4- and PFPECFP-based models.

Figure 6. Tolerance intervals (TIs) and medians of specificity (Sp) values computed for PFP-, ECFP4- and PFPECFP-based models.

Figure 7. Tolerance intervals (TIs) and medians of accuracy (Acc) values computed for PFP-, ECFP4- and PFPECFP-based models.

Figure 8. Tolerance intervals (TIs) and medians of the area under the receiver operating characteristic curve (AUC) values computed for PFP-, ECFP4- and PFPECFP-based models.

Virtual screening and classification performances

The TIs shown in Figures 2-8 indicate the VS and CLS performance computed for each kinase. Compared to the average or median values of the evaluation results, the lower margin of the TIs is a more severe criterion for judging the predictive power of the models. Thus, we first report the lower TI values computed on the DivSets. In VS tasks, PFPECFP-modelling resulted in 71 kinases for which > 50% of the actives were retrieved before the first 0.5% of inactives in the ranking list (TPR0.5 > 0.5). Such high early enrichment was achieved for only 43 kinases in PFP-modelling and 55 kinases in ECFP-modelling. In terms of eROCE, scores > 0.5 were found for 92, 87 and 68 kinases in PFPECFP-, PFP- and ECFP-based modelling, respectively.
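To make the early-enrichment criterion concrete, the sketch below computes the fraction of actives retrieved before a given fraction of the inactives is exceeded in a score-ranked list (TPR0.5 for a fraction of 0.005, TPR1 for 0.01). It is an illustrative sketch under stated assumptions rather than the evaluation code used here: the function name `tpr_at_fp_fraction`, the 0/1 label encoding and the toy scores are hypothetical, and tied scores are simply kept in input order.

```python
import math
from typing import Sequence


def tpr_at_fp_fraction(
    scores: Sequence[float],
    labels: Sequence[int],
    fp_fraction: float = 0.005,
) -> float:
    """Fraction of actives ranked ahead of the allowed number of inactives.

    labels: 1 for actives, 0 for inactives (assumed encoding).
    fp_fraction: tolerated fraction of inactives, e.g. 0.005 for TPR0.5.
    """
    n_actives = sum(1 for y in labels if y == 1)
    n_inactives = len(labels) - n_actives
    if n_actives == 0 or n_inactives == 0:
        raise ValueError("both actives and inactives are required")

    # Number of inactives tolerated before the ranking is cut off;
    # the small epsilon guards against floating-point round-off.
    max_fp = math.floor(fp_fraction * n_inactives + 1e-9)

    # Rank compounds by decreasing score (ties keep their input order).
    ranking = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    false_positives = 0
    for _, label in ranking:
        if label == 1:
            hits += 1
        else:
            false_positives += 1
            if false_positives > max_fp:
                break
    return hits / n_actives


# Hypothetical toy data: 4 actives and 400 inactives.
active_scores = [0.99, 0.98, 0.50, 0.10]
inactive_scores = [0.985, 0.97, 0.96] + [0.30] * 397
scores = active_scores + inactive_scores
labels = [1] * len(active_scores) + [0] * len(inactive_scores)

print(tpr_at_fp_fraction(scores, labels, 0.005))  # TPR0.5
print(tpr_at_fp_fraction(scores, labels, 0.01))   # TPR1
```

With 400 inactives, a fraction of 0.005 tolerates two inactives, so a kinase would count towards the "TPR0.5 > 0.5" tally only if more than half of its test actives are ranked ahead of the third inactive.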

In CLS tasks, for PFP and PFPECFP models we found 64 kinases for which both Se and Sp values are > 0.5. For ECFP, the same classification rates are reached for only 24 kinases. The discriminative power, also computed on the CLS test sets, gave AUC ≥ 0.7 for 84, 80 and 73 kinases in PFPECFP-, PFP- and ECFP-based models, respectively. The AUC scores are generally high, suggesting that the probability scores computed by the models provide good class separation.

The analysis of the upper TI limits can highlight models with poor performance. Regardless of the molecular encoding employed, for RPS6KA3, CAMK2D, CDK5, CDK7, CLK1, MAPK9, MAPK1, AXL, PTK2B, FYN, ZAP70, CSNK1D and PLK1, the early enrichment in approximately the top 0.5% of a chemical library could not exceed 50%. For eROCE, upper TI limits < 0.5 were found only for CDK5 and PTK2B, independent of the fingerprint type. Both Se < 0.5 and Sp < 0.5 were found for PRKCA, PRKCZ, RPS6KA3, CAMK2D, CDK5, CLK1, MAPK1, AXL, PTK2B, FYN, ZAP70, MAP3K8 and PLK1. One should bear in mind that models built for these kinases are still capable of successfully identifying kinase inhibitors, as suggested by the RandSets results (Figures 2-8). Briefly, in RandSets, PFPECFP models reached average eROCE and TPR0.5 values > 0.7 for 98 kinases. Moreover, also for PFPECFP models, 90 kinases show both median Se and Sp > 0.8, and 96 kinases have AUC > 0.9. These results indicate a very good VS and CLS capacity of the PFPECFP models using target-specific clean kinase inhibitory APs.

Conclusions

Within the limits of the kinase sets studied herein, we have shown that differences between cellular (and tissue-based) and biochemical assays account for the majority of the extreme inhibitory values. In
QSAR modelling, ideally, the dependent variable should reflect as accurately as possible the activity of a series of compounds against the target of interest (without being affected by other assay properties); we therefore recommend the use of biochemical assay data for QSAR modelling, and also for the optimization and comparison of docking-based methods. The identification of activity cliffs might also be affected on the same grounds. Furthermore, the results obtained herein suggest that taking the minimum activity value, when multiple values are available, can in general be a simple way to avoid biased compound-target interaction values. Of course, for drug discovery, the successful prediction of cell-based kinase-inhibitory determinations can be an even more valuable asset in hit or even lead identification and should be explored in future studies.

Here, we used 95%/95% tolerance intervals to estimate the potential of two-dimensional molecular fingerprint-based models. We found that each molecular encoding works better than the other for some kinases. Moreover, PFP-based models perform better than ECFP-based models in learning to predict (diverse) actives. We show that combining the two fingerprints into PFPECFP enhances the models, adding the features of the two fingerprints and widening the spectrum of kinase targets that can be predicted with high performance. Instead of reporting an average performance of the models, we have focused on estimating the lowest performance one would expect in predicting diverse kinase inhibitors (for each of the 104 kinases) of Tanimoto similarity < 0.7 compared to known actives (those used in model building). In this sense, the DivSets evaluation showed in general optimistic results for the majority of the kinases, both in VS and CLS. We expect our results to roughly approximate the capacity of two-dimensional molecular fingerprint-based models to identify and prioritize novel and diverse kinase inhibitors in
chemical libraries. Finally, we consider that the current results can contribute to a better evaluation of the boundaries of kinase inhibitor predictors in general, and we provide the PFPECFP models trained on the entire kinase-specific data sets compiled in the Kinase Inhibition Predictor (KIP), an application freely available at www.chembioinf.ro.

ASSOCIATED CONTENT

Supporting Information. Contains figures and tables supporting the description of the results (PDF), and a brief description of the balanced kinase sets (EXCEL). This information is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author

* [email protected], [email protected]

Notes

The authors declare no competing financial interest.

Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

ACKNOWLEDGMENT

This work was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS-UEFISCDI, project number PN-II-RU-TE-2014-4-0422 and
Romanian Academy, Institute of Chemistry Timişoara, project number 1.2.4/2017. All authors are indebted to ChemAxon Ltd for providing access to their software.

ABBREVIATIONS

RandSets, Random Sets; DivSets, Diversity Sets; CLS, classification; VS, virtual screening; AP, activity point; TI, tolerance interval.

Figure 1. Stacked barplots describing, for each of the 104 kinases, counts of superior TIs between three fingerprint models, i.e., PFP (green), ECFP (red) and PFPECFP (blue), in DivSets. For example, in the case of ABL1, in terms of eROCE, PFPECFP models are superior to both PFP and ECFP models (count of 2), while PFP models are superior to ECFP models (count of 1); in the case of AKT1, eROCE TIs indicate that ECFP and PFPECFP models are superior to PFP models.
