
Pharmaceutical Modeling

A new approach for drug target and bioactivity prediction: the Multi-fingerprint Similarity Search aLgorithm (MuSSeL)
Domenico Alberga, Daniela Trisciuzzi, Michele Montaruli, Francesco Leonetti, Giuseppe Felice Mangiatordi, and Orazio Nicolotti
J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.8b00698 • Publication Date (Web): 28 Nov 2018


Journal of Chemical Information and Modeling

A New Approach For Drug Target and Bioactivity Prediction: the Multi-fingerprint Similarity Search aLgorithm (MuSSeL)

Domenico Alberga, Daniela Trisciuzzi, Michele Montaruli, Francesco Leonetti, Giuseppe Felice Mangiatordi and Orazio Nicolotti* Dipartimento di Farmacia-Scienze del Farmaco, Università degli Studi di Bari “Aldo Moro”, Via E. Orabona, 4, I-70126 Bari, Italy

ABSTRACT
We present MuSSeL, a multi-fingerprint similarity search algorithm able to predict putative drug targets for a given query small molecule and to return a quantitative assessment of its bioactivity in terms of Ki or IC50 values. Predictions are made automatically by exploiting a large collection of high-quality experimental bioactivity data available from ChEMBL (version 22.1) and by combining, in a consensus-like approach, the predictions resulting from similarity searches performed with 13 different fingerprint definitions. Importantly, the proposed algorithm is also effective in detecting and handling activity cliffs. A calibration set including small molecules present in the latest version of ChEMBL (version 23) was employed to tune the algorithm parameters. Three randomly built external sets were then used to assess model performance. The potential of MuSSeL was further tested in a prospective exercise: the prediction of five bioactive compounds taken from articles published in Journal of Medicinal Chemistry just a few months ago. The paper emphasizes the importance of multi-fingerprint consensus strategies for increasing the confidence of predictions made by similarity search algorithms and provides a fast and easy-to-run tool for drug target and bioactivity prediction.

* Author to whom correspondence should be addressed; e-mail: [email protected]; telephone: +39-080-5442551; fax: +39-080-5442230


INTRODUCTION
Nowadays the identification of novel bioactive compounds is rationally driven by predictive in silico approaches, whose application is especially important in the early stages of research programs.1 In this respect, a wide range of consolidated computer-aided protocols is available to guide the design of new molecules towards specific and known biological targets.2 For instance, molecular docking and molecular dynamics are widely acknowledged as successful computational techniques to forecast and explain the key interactions established between proteins with solved three-dimensional structures and small molecules, thus allowing the prioritization of potential drug candidates on the basis of shape complementarity and interaction strength.3 These structure-based approaches are particularly pursued in hit-to-lead optimization, provided that at least one biological target has already been validated.4 In the present work, we flipped this view by developing a computational approach aimed at identifying the putative drug targets of drug-like molecules. This strategy can be paramount for repurposing known drugs or for optimizing side or off-target actions.5–8 To this end, several methods have been developed so far. They can be roughly classified as: i) molecular similarity based approaches;9–16 ii) network-based models;17–30 and iii) advanced machine learning methods.31–42 Each of these approaches has its own pros and cons, extensively reviewed elsewhere.5,7,43 In the present work, we developed a new molecular similarity based strategy to rationally disclose the latent although causative relationships connecting small drug-like molecules to potential protein drug targets. This approach relies on the fair assumption that if a given compound binds a given biological target, similar compounds will behave the same way.44 From a practical point of view, compounds can be encoded by employing various molecular notations, with molecular fingerprints (FPs)45 and physicochemical descriptors46 being perhaps the most popular,47 from which a similarity value is then calculated using different metrics such as the well-known Tanimoto coefficient.47,48 Notably, FP-based similarity search algorithms have shown both high performance and accuracy in the prediction of potential biological targets.14,15,49


However, almost all the models reported in the literature work as typical binary classifiers, returning drug target interaction predictions in the form of dichotomous values that indicate solely whether a compound is active or inactive.15,49 Unlike these, our approach, in addition to the drug target prediction, can also provide an estimate of the bioactivity in terms of Ki or IC50 values. In particular, predictions are based on a multidimensional structural description whose level of detail depends on the number of FP notations explicitly employed.50 This smooths the intrinsic limitations of single measures of similarity, which can rarely contain enough information to match both molecular and biological features. A striking example is the loss of accuracy due to the occurrence of activity cliffs, which flag a surprisingly large gap between molecular and biological similarity.51,52 Keeping this in mind, we carried out an intensive retrospective cross-analysis of known bioactive compounds and their corresponding biological counterparts with a twofold purpose: i) to link newly designed small drug-like molecules to highly probable drug targets, and ii) to repurpose known drugs towards apparently unrelated diseases, explicitly accounting for their potential toxicity and/or unwanted side effects.
Building on these assumptions, in this work we exploited a large collection of bioactivity data included in ChEMBL22.1,53–55 a manually curated database of drug-like bioactive molecules covering the most comprehensive spectrum of available bioactivity data ever seen in a public repository.56 In brief, the proposed algorithm, named MuSSeL (Multi-fingerprint Similarity Search aLgorithm), involves the calculation of 13 different types of FPs and consists of two main steps: i) the first concerns the selection of drug targets biased by the query compound; ii) the second allows the prediction of Ki or IC50 values towards each selected drug target, explicitly accounting for potential activity cliffs. The algorithm was tuned and extensively challenged by using two calibration sets, one for predicting Ki and the other for predicting IC50, made of very recently published bioactive molecules available in the latest ChEMBL23 but missing in the second-to-last
ChEMBL22.1, which was used as a benchmark. The potential use of MuSSeL was illustrated by predicting five bioactive compounds whose structures and data have been published in Journal of Medicinal Chemistry since the beginning of this year. Notably, the encouraging results strongly emphasize how applying multi-fingerprint similarity to high-quality curated data greatly increases the predictive potential of the models.

METHODS
Construction of Ki and IC50 databases
The whole ChEMBL22.1 database, containing 2,036,512 compound records, was downloaded by means of an in-house Python script. The database was mined retaining only those entries complying with the following criteria: i) data referring to assays conducted on human targets (‘target_organism’ = ‘Homo sapiens’); ii) data annotated as direct binding (‘assay_type’ = ‘B’); iii) entries free of warnings in the field ‘data_validity_comment’. In doing so, 694,532 compounds were retrieved from ChEMBL22.1. These selected entries were then split into two pools on the basis of the bioactivity attributes. As a result, the first and second pools contained entries annotated exclusively with Ki (239,400 entries) and IC50 (455,132 entries) measures, respectively. For the ease of data treatment, the relation symbol (‘>’ or ‘ 0.8), only one of them was chosen randomly. Building on this criterion, ECFP4, FCFP4, FP4, CDK and EXT were excluded from further calculations. As a result, the following analyses focus on a pool of 13 selected FPs. All the correlations among the FPs are reported in Tables S1 and S2 of Supporting Information for the Ki and IC50 databases, respectively.
Similarity threshold values
In order to fairly compare the employed FPs, 10 million comparisons between pairs randomly selected from the Ki and IC50 databases were generated. Following the approach recently suggested by Maggiora et al.,70 the Tc distributions of the 13 different FPs were analyzed to designate a statistically significant similarity threshold Tcm%, which indicates, for each considered FP, the value of Tc met or exceeded by the percentage m% of comparisons. For instance, as far as the Ki database is concerned, Tc0.05 implies a threshold value of 0.83 for MACCS and of 0.86 for SUB. The Tc0.05
values for the 13 FPs are reported in Table S3 of Supporting Information. The 13 probability distributions for the Ki database are shown in Figure S1.
Activity cliffs detection
Considering all the drug targets of the Ki or IC50 pool, the occurrence of possible activity cliffs (AC) was taken into account employing each FP one at a time. In this respect, an AC was detected, for a given FP type, when a given pair of molecules returned a large difference in experimental Ki or IC50 binding values (that is Δcliffs ≥ 1 logarithmic unit) despite a remarkable molecular similarity, namely Tc ≥ Tc0.005%.
Similarity search prediction algorithm
The prediction algorithm is structured into two stages, as illustrated in Figure 1. As a first step, in order to designate the eligible targets, a given query compound is compared to all the entries associated with each specific drug target in the Ki or IC50 pool. As a second step, the bioactivity of the query can be predicted in terms of Ki or IC50 for each selected drug target. The two stages are detailed in the following sections.
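The percentile-based threshold Tcm% and the activity cliff criterion described above can be illustrated with a minimal Python sketch. This is not the authors' in-house code: the function names, the tuple layout of `pairs`, and the convention that activities are already expressed in logarithmic units are assumptions made here for illustration only.

```python
import math


def similarity_threshold(tc_values, m_percent):
    """Return the Tc value met or exceeded by m_percent % of the comparisons."""
    ranked = sorted(tc_values, reverse=True)
    # Number of comparisons allowed at or above the threshold (at least one).
    n_top = max(1, math.floor(len(ranked) * m_percent / 100.0))
    return ranked[n_top - 1]


def find_activity_cliffs(pairs, tc_cliff_threshold, delta_cliffs=1.0):
    """Flag pairs that are very similar yet differ strongly in activity.

    pairs: iterable of (id_a, id_b, tc, act_a, act_b), with activities in
    logarithmic units (hypothetical layout chosen for this sketch).
    """
    cliffs = []
    for id_a, id_b, tc, act_a, act_b in pairs:
        if tc >= tc_cliff_threshold and abs(act_a - act_b) >= delta_cliffs:
            cliffs.append((id_a, id_b))
    return cliffs


# Toy data: 1000 evenly spaced Tc values and three molecule pairs.
toy_tc = [i / 1000 for i in range(1000)]
toy_pairs = [
    ("a", "b", 0.95, 8.0, 6.5),  # similar, 1.5 log units apart: cliff
    ("a", "c", 0.95, 8.0, 7.6),  # similar, 0.4 log units apart: no cliff
    ("b", "c", 0.40, 6.5, 9.0),  # dissimilar: never a cliff
]
threshold_5_percent = similarity_threshold(toy_tc, 5)
cliff_pairs = find_activity_cliffs(toy_pairs, tc_cliff_threshold=0.9)
```

In practice one threshold would be computed per fingerprint type from its own Tc distribution, so that a given m% is comparable across fingerprints whose raw Tc values are distributed differently.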

[Figure 1 flowchart. Drug target and bioactivity prediction algorithm: the query is submitted to the drug target prediction stage for each target T1, T2, …, TN; if a target Ti is predicted (YES) the bioactivity prediction stage follows, otherwise (NO) the procedure exits for that target.]

Figure 1. The prediction algorithm is structured into two stages: i) drug target prediction and ii) bioactivity prediction. The query compound is screened against all the possible targets Ti. The bioactivity prediction is then performed taking into account only the predicted targets.

Stage 1: Drug target prediction
The drug target prediction algorithm is detailed in Figure 2. Based on each of the 13 FP definitions, the Tc value between the query compound and all the entries associated with each drug target is calculated. Importantly, a drug target is considered eligible for a given query if, for a minimum number of FP types (hereafter referred to as Nf), there is at least one entry associated with the drug target whose Tc is at least equal to a certain adjustable similarity threshold Tcm%T. If this condition holds true, a score (SC) is assigned to the drug target as follows:

$$SC = \sum_{\substack{i=1 \\ Tc_i^{max} \geq Tc_{m\%}^{T}}}^{13} Tc_i^{max}$$

where $Tc_i^{max}$ is the maximum Tc value, based on the i-th FP type, between the query and the molecules associated with the drug target, provided that $Tc_i^{max} \geq Tc_{m\%}^{T}$. Finally, the selected drug targets are ranked according to the assigned SC values.
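The eligibility rule and the SC score can be sketched in a few lines of Python. The code below is only an illustration under stated assumptions: the maximum Tc per fingerprint is taken as already computed, the dictionary layout and function names are hypothetical, and "FP_X" with its 0.80 threshold is an invented placeholder (the MACCS and SUB thresholds are the Tc0.05 values quoted for the Ki database).

```python
def score_target(tc_max_by_fp, thresholds, n_f):
    """Return SC for one target, or None if the target is not eligible.

    tc_max_by_fp: {fingerprint name: max Tc between query and target ligands}
    thresholds:   {fingerprint name: similarity threshold for that fingerprint}
    n_f:          minimum number of fingerprints that must reach the threshold
    """
    passing = [tc for fp, tc in tc_max_by_fp.items() if tc >= thresholds[fp]]
    if len(passing) < n_f:
        return None
    # SC sums the maximum Tc only over the fingerprints that pass.
    return sum(passing)


def rank_targets(query_tc, thresholds, n_f):
    """Rank eligible targets by decreasing SC."""
    scored = []
    for target, tc_max_by_fp in query_tc.items():
        sc = score_target(tc_max_by_fp, thresholds, n_f)
        if sc is not None:
            scored.append((target, sc))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy example with three fingerprints instead of 13.
thresholds = {"MACCS": 0.83, "SUB": 0.86, "FP_X": 0.80}  # FP_X is hypothetical
query_tc = {
    "T1": {"MACCS": 0.90, "SUB": 0.88, "FP_X": 0.70},  # 2 fingerprints pass
    "T2": {"MACCS": 0.85, "SUB": 0.50, "FP_X": 0.60},  # 1 fingerprint passes
    "T3": {"MACCS": 0.95, "SUB": 0.90, "FP_X": 0.85},  # 3 fingerprints pass
}
ranking = rank_targets(query_tc, thresholds, n_f=2)
```

With Nf = 2, target T2 is discarded, and T3 outranks T1 because three fingerprints contribute to its score. Raising Nf makes the consensus stricter and shortens the list of eligible targets.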


Figure 2. Drug target prediction workflow.

Stage 2: Bioactivity prediction
Considering those drug targets selected at the first stage, the respective bioactivity of the query molecule can be quantitatively predicted in terms of Ki or IC50 as shown in Figure 3.
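A minimal Python sketch of this stage follows the workflow summarized for Figure 3: per fingerprint, the top k most similar non-cliff neighbours above the cut-off give a Tc-weighted average; a prediction is returned only if at least Nstored fingerprints contribute and the stored values span less than one logarithmic unit. The data layout and names are hypothetical, and one point is an assumption of this sketch rather than something the text settles: a fingerprint is dropped when fewer than k neighbours qualify.

```python
def predict_bioactivity(neighbors_by_fp, cutoff_by_fp, k=2, n_stored=3,
                        max_spread=1.0):
    """Consensus bioactivity (log units) for one query/target pair, or None.

    neighbors_by_fp: {fp: [(tc, activity, is_cliff), ...]} for the target.
    cutoff_by_fp:    {fp: similarity cut-off for the prediction stage}.
    """
    stored = []
    for fp, neighbors in neighbors_by_fp.items():
        usable = [(tc, act) for tc, act, is_cliff in neighbors
                  if tc > cutoff_by_fp[fp] and not is_cliff]
        if len(usable) < k:  # assumption: require k qualifying neighbours
            continue
        top_k = sorted(usable, key=lambda n: n[0], reverse=True)[:k]
        weight_sum = sum(tc for tc, _ in top_k)
        # Tc-weighted average of the experimental activities.
        stored.append(sum(tc * act for tc, act in top_k) / weight_sum)
    if len(stored) < n_stored:
        return None  # too few fingerprints agree on a prediction
    if max(stored) - min(stored) >= max_spread:
        return None  # stored values disagree by one log unit or more
    return sum(stored) / len(stored)


# Toy example with three hypothetical fingerprints A, B and C.
cutoffs = {"A": 0.8, "B": 0.8, "C": 0.8}
neighbors = {
    "A": [(0.90, 7.0, False), (0.90, 8.0, False)],
    "B": [(0.95, 7.2, False), (0.85, 7.2, False), (0.99, 3.0, True)],
    "C": [(0.90, 7.8, False), (0.90, 7.8, False)],
}
prediction = predict_bioactivity(neighbors, cutoffs, k=2, n_stored=3)
```

In the toy data the most similar neighbour under fingerprint B is flagged as an activity cliff and is therefore ignored; keeping it would pull the stored value for B far from the others and block the prediction.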


Figure 3. Bioactivity prediction workflow.
Based on each of the 13 FP definitions, the query is scanned to find the top k most similar molecules associated with each selected drug target, provided that their Tc is larger than a certain cut-off Tcm%P and that they are not flagged as AC. If these conditions hold true, the Tc-weighted average of the experimental bioactivity values of the top k most similar molecules is stored; otherwise the respective FP type is no longer considered. Finally, the prediction of the Ki or IC50 of the query with respect to the selected drug target is performed if: i) there is a minimum number of FPs (that is at least equal to Nstored) returning a given prediction and ii) the difference Δstored between the maximum and minimum of the stored values is less than one logarithmic unit. If both these conditions are met, the
Ki or IC50 value of the query compound for the selected target is predicted as the average value of the Nstored data.
Monitoring parameters for performance evaluation
The performance of the multi-fingerprint similarity search algorithm was assessed on the Ki pool on one side and on the IC50 pool on the other. First, the accuracy of the drug target prediction was inferred by computing some monitoring parameters. In particular, the percentage of the calibration set molecules for which the correct drug target was predicted (that is pright) and the percentage of the molecules returning the correct drug target ranked within the top-5 positions (that is p5) were computed. Then, the accuracy of the bioactivity prediction was quantified by calculating the coefficient of determination (that is r2) and the mean absolute error (MAE) between experimental and predicted Ki or IC50, defined as follows:

$$r^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - y_{avg})^2}$$

$$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

where yi is the experimental bioactivity value of the i-th chemical; ŷi is the Ki or IC50 of the i-th chemical predicted by the algorithm; and yavg is the mean of the experimental values of the compounds for which the algorithm returns a prediction. Notice that all the response values were reported in logarithmic units. In order to avoid the inclusion of outliers, both r2 and MAE values were determined after removing the 5% of chemicals showing the highest residual values, as recently suggested by Roy et al.71

RESULTS AND DISCUSSION
Here, we report the monitoring parameters described in the previous section (pright, p5, r2, MAE) and discuss how tuning the algorithm parameters (Tcm%T, Nf, k, Nstored, Tcm%P) and applying the AC-based filter affect model performance.
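The r2 and MAE monitoring parameters defined in the Methods, including the removal of the 5% of chemicals with the highest residuals, can be sketched as follows. The function name is hypothetical, and computing yavg on the compounds retained after trimming is an assumption of this sketch, since the text does not state the order of the two operations.

```python
def trimmed_metrics(y_true, y_pred, trim_fraction=0.05):
    """Return (r2, mae) after discarding the largest absolute residuals.

    y_true, y_pred: experimental and predicted values in logarithmic units.
    trim_fraction:  share of compounds with the highest residuals to drop.
    """
    pairs = sorted(zip(y_true, y_pred), key=lambda p: abs(p[0] - p[1]))
    n_drop = int(round(trim_fraction * len(pairs)))
    kept = pairs[:len(pairs) - n_drop]
    n = len(kept)
    # Assumption: the mean is taken over the retained compounds only.
    y_avg = sum(y for y, _ in kept) / n
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in kept)
    ss_tot = sum((y - y_avg) ** 2 for y, _ in kept)
    r2 = 1.0 - ss_res / ss_tot
    mae = sum(abs(y - y_hat) for y, y_hat in kept) / n
    return r2, mae


# Toy data: 20 compounds, constant error of 0.1 plus one gross outlier.
y_exp = [5.0 + 0.1 * i for i in range(20)]
y_hat = [y + 0.1 for y in y_exp]
y_hat[-1] = y_exp[-1] + 3.0  # outlier removed by the 5% trimming
r2, mae = trimmed_metrics(y_exp, y_hat)
```

With 20 compounds the 5% trimming discards exactly one of them, here the outlier, so the MAE reflects only the constant error of 0.1 logarithmic units.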


Ki based prediction
The performance of the drug target prediction algorithm was quantified in terms of its ability to associate the right target with the compounds in the Ki calibration set, which contains 251 entries spanning 95 drug targets. In the case of multiple predictions, the right experimental drug target has to be among the top-ranked computed drug targets to avoid a drop in accuracy. In this respect, Table 2 reports the number of molecules in the Ki calibration set for which the algorithm returns the correct drug target (that is Nright) and the number of molecules for which the correct drug target is ranked within the top-5 positions (that is N5), along with the corresponding percentages (that are pright and p5), for different similarity thresholds (that are Tcm%T values equal to Tc0.10%, Tc0.05%, Tc0.02% or Tc0.01%) and for different numbers of FP notations (that is Nf ranging from 1 to 13).

Table 2. Performance of the drug target prediction algorithm for the Ki calibration set molecules based on the number and percentage of the correctly predicted drug targets (Nright and pright) and the number and percentage of the correct drug targets predicted at top-5 positions (N5 and p5) at various similarity thresholds (that is Tcm%T) and considering a different number of fingerprints (that is Nf). The best parameter combination is Tcm%T = Tc0.05% with Nf = 3.

      Tcm%T = Tc0.10%              Tcm%T = Tc0.05%              Tcm%T = Tc0.02%              Tcm%T = Tc0.01%
Nf    Nright  pright  N5    p5     Nright  pright  N5    p5     Nright  pright  N5    p5     Nright  pright  N5    p5
1     200     79.68   32    12.75  184     73.31   66    26.29  153     60.96   96    38.25  120     47.81   104   41.43
2     191     76.10   67    26.69  165     65.74   93    37.05  122     48.61   105   41.83  95      37.85   92    36.65
3     182     72.51   86    34.26  154     61.35   110   43.82  109     43.43   97    38.65  80      31.87   78    31.08
4     167     66.53   102   40.64  138     54.98   110   43.82  101     40.24   94    37.45  73      29.08   72    28.69
5     158     62.95   105   41.83  129     51.39   105   41.83  94      37.45   88    35.06  62      24.70   62    24.70
6     146     58.17   104   41.43  125     49.80   106   42.23  82      32.67   79    31.47  56      22.31   56    22.31
7     136     54.18   102   40.64  117     46.61   104   41.43  73      29.08   70    27.89  44      17.53   44    17.53
8     130     51.79   105   41.83  110     43.82   100   39.84  68      27.09   65    25.90  38      15.14   38    15.14
9     120     47.81   99    39.44  100     39.84   90    35.86  57      22.71   55    21.91  33      13.15   33    13.15
10    116     46.22   100   39.84  90      35.86   81    32.27  51      20.32   50    19.92  31      12.35   31    12.35
11    104     41.43   93    37.05  78      31.08   72    28.69  39      15.54   39    15.54  23      9.16    23    9.16
12    93      37.05   83    33.07  62      24.70   60    23.90  31      12.35   31    12.35  17      6.77    17    6.77
13    72      28.69   67    26.69  43      17.13   42    16.73  20      7.97    20    7.97   11      4.38    11    4.38

It becomes clear from Table 2 that, as Nf increases, Nright decreases while N5 increases. Actually, the increase of N5 holds true only while Nf is lower than a certain value, irrespective of Tcm%T. For instance, this trend is observed for Nf lower than 4 in the case of Tcm%T = Tc0.05%. Thus, enhancing the chance of finding the correct drug target in the top-5 requires a more detailed structural description, and thus a larger consensus, up to a certain threshold beyond which there is instead a drop.


Moreover, the value of Nf required for accurate drug target detection decreases as the similarity cut-off Tcm%T increases. Considering for instance Tcm%T = Tc0.05%, the right drug target is almost always ranked within the top-5 positions (that is Nright ≃ N5) only when employing a high number of FP notations (that is Nf > 11). On the other hand, the same result is achieved with a more limited FP description (that is Nf > 5) in the case of Tcm%T = Tc0.02%. Among all the possible combinations, the largest value of N5 (110) is obtained setting Nf equal to 3 or 4 and Tcm%T = Tc0.05%. Notice that, between these two cases, Nf = 3 shows the highest value of Nright (154 vs 138). Based on this rationale, Tcm%T = Tc0.05% and Nf equal to 3 were chosen to challenge the drug target prediction. Using these settings, the bioactivity prediction stage was then challenged by selecting only those molecules whose drug target was correctly predicted (that is Nright). For the sake of clarity, the similarity cut-off chosen for the drug target prediction stage (that is Tcm%T = Tc0.05%) was also applied as threshold for the bioactivity prediction stage (that is Tcm%P = Tc0.05%). The performances of the similarity search algorithm for the prediction of Ki are summarized in Table 3, where r2 and MAE, along with the number and the percentage of the compounds for which the algorithm returns a prediction (that are Npred and ppred, respectively), are reported considering the number k of top similar molecules ranging from 2 to 4 and the value of Nstored ranging from 1 to 10. Moreover, we analyzed how the application of the AC-based filter, which requires a large similarity (that is Tcm%Cliffs at least equal to Tc0.005%) but also a large bioactivity difference (that is Δcliffs higher than one logarithmic unit), can affect model performance.
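The rationale used above to choose Tcm%T and Nf (largest N5, with ties broken by the larger Nright) can be reproduced directly from the Table 2 values. The sketch below is illustrative only; the dictionary layout and the function name are hypothetical, while the numbers are copied from Table 2.

```python
# Nright and N5 from Table 2; list position i corresponds to Nf = i + 1.
NRIGHT = {
    "Tc0.10%": [200, 191, 182, 167, 158, 146, 136, 130, 120, 116, 104, 93, 72],
    "Tc0.05%": [184, 165, 154, 138, 129, 125, 117, 110, 100, 90, 78, 62, 43],
    "Tc0.02%": [153, 122, 109, 101, 94, 82, 73, 68, 57, 51, 39, 31, 20],
    "Tc0.01%": [120, 95, 80, 73, 62, 56, 44, 38, 33, 31, 23, 17, 11],
}
N5 = {
    "Tc0.10%": [32, 67, 86, 102, 105, 104, 102, 105, 99, 100, 93, 83, 67],
    "Tc0.05%": [66, 93, 110, 110, 105, 106, 104, 100, 90, 81, 72, 60, 42],
    "Tc0.02%": [96, 105, 97, 94, 88, 79, 70, 65, 55, 50, 39, 31, 20],
    "Tc0.01%": [104, 92, 78, 72, 62, 56, 44, 38, 33, 31, 23, 17, 11],
}


def best_setting(nright, n5):
    """Pick (threshold, Nf) with the largest N5, then the largest Nright."""
    candidates = [
        (n5[thr][i], nright[thr][i], thr, i + 1)
        for thr in n5
        for i in range(len(n5[thr]))
    ]
    top_n5, top_nright, thr, n_f = max(candidates, key=lambda c: (c[0], c[1]))
    return thr, n_f, top_n5, top_nright


chosen = best_setting(NRIGHT, N5)
```

The tie between Nf = 3 and Nf = 4 at N5 = 110 is resolved in favour of Nf = 3 because it retains more correctly predicted targets (154 against 138).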


Table 3. Performance of the bioactivity prediction algorithm for the compounds of the Ki calibration set whose drug target was correctly predicted by the target prediction algorithm setting Tcm%T equal to Tc0.05% and Nf equal to 3 (Nright in Table 2) in terms of r2, MAE and ppred, considering Tcm%P equal to Tc0.05% with the application of the AC-based filter (that is Tcm%Cliffs ≥ Tc0.005% and Δcliffs ≥ 1). The best parameter combination is k = 2 with Nstored = 3.

         k = 2                       k = 3                       k = 4
Nstored  Npred  ppred  r2     MAE    Npred  ppred  r2     MAE    Npred  ppred  r2     MAE
1        96     62.3   0.487  0.866  92     59.7   0.662  0.788  82     53.2   0.641  0.751
2        76     49.4   0.682  0.712  71     46.1   0.669  0.787  59     38.3   0.605  0.772
3        62     40.3   0.738  0.657  58     37.7   0.605  0.837  43     27.9   0.537  0.83
4        49     31.8   0.722  0.675  45     29.2   0.68   0.751  33     21.4   0.495  0.823
5        41     26.6   0.701  0.64   34     22.1   0.612  0.736  23     14.9   0.577  0.659
6        36     23.4   0.68   0.681  24     15.6   0.744  0.643  17     11.0   0.600  0.603
7        25     16.2   0.72   0.549  18     11.7   0.713  0.592  12     7.8    0.786  0.492
8        20     13.0   0.794  0.488  15     9.7    0.781  0.548  6      3.9    0.863  0.424
9        16     10.4   0.849  0.418  11     7.1    0.703  0.619  5      3.2    0.955  0.273
10       11     7.1    0.900  0.406  9      5.8    0.764  0.534  5      3.2    0.955  0.273

A general trend can be inferred: as Nstored increases, the performance of the model improves in terms of both r2 and MAE at the expense of its coverage (that is ppred). In other words, a compromise is required between the accuracy of the model and its ability to provide a prediction for a large number of compounds.72 Among the different parameter combinations, we chose the one ensuring the lowest MAE value while retaining high model coverage (that is ppred > 40%). As a result, we selected a value of k and Nstored equal to 2 and 3, respectively. It is worth comparing the performance of the bioactivity prediction without and with the application of the AC-based filter once the selected monitoring parameters were set. Importantly, including the AC detection based filter allowed a significant improvement of MAE from 0.715 to 0.657 while increasing ppred from 31.2% to 40.3% at the same time. Notice that not storing Ki values coming from molecules labeled as AC removes excessively large or small Ki values that could make Δstored exceed one logarithmic unit, thus preventing the prediction. In other words, fewer molecules are pruned by the Δstored-based filter, thus increasing Npred. Finally, the performance of the drug target and bioactivity prediction algorithm was challenged by screening three external sets (see Methods section for methodological details), using the parameters previously selected (Tcm%T = Tc0.05% and Nf = 3 for the drug target fishing stage; k = 2, Tcm%P = Tc0.05% and Nstored = 3 for the Ki prediction stage; Tcm%Cliffs ≥ Tc0.005% and Δcliffs ≥ 1 for the AC-based filter). The results are reported in Table 4 and Table 5.
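The accuracy/coverage compromise described above (lowest MAE among the combinations with ppred > 40%) can be checked against the Table 3 values with a short Python sketch. The data layout and the function name are hypothetical; the numbers are copied from Table 3.

```python
# ppred and MAE from Table 3; list position i corresponds to Nstored = i + 1.
PPRED = {
    2: [62.3, 49.4, 40.3, 31.8, 26.6, 23.4, 16.2, 13.0, 10.4, 7.1],
    3: [59.7, 46.1, 37.7, 29.2, 22.1, 15.6, 11.7, 9.7, 7.1, 5.8],
    4: [53.2, 38.3, 27.9, 21.4, 14.9, 11.0, 7.8, 3.9, 3.2, 3.2],
}
MAE = {
    2: [0.866, 0.712, 0.657, 0.675, 0.64, 0.681, 0.549, 0.488, 0.418, 0.406],
    3: [0.788, 0.787, 0.837, 0.751, 0.736, 0.643, 0.592, 0.548, 0.619, 0.534],
    4: [0.751, 0.772, 0.83, 0.823, 0.659, 0.603, 0.492, 0.424, 0.273, 0.273],
}


def select_parameters(ppred, mae, min_coverage=40.0):
    """Return (k, Nstored, MAE) with the lowest MAE among ppred > min_coverage."""
    eligible = [
        (mae[k][i], k, i + 1)
        for k in ppred
        for i, coverage in enumerate(ppred[k])
        if coverage > min_coverage
    ]
    best_mae, best_k, best_n_stored = min(eligible)
    return best_k, best_n_stored, best_mae


selected = select_parameters(PPRED, MAE)
```

Only six combinations clear the 40% coverage bar, and k = 2 with Nstored = 3 has the lowest MAE among them; lower MAE values exist in the table but only at much smaller coverage.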


Table 4. Performance of the drug target prediction algorithm for the three Ki external sets considering the number and percentage of the correctly predicted drug targets (Nright and pright) and the number and percentage of correct drug targets predicted at top-5 positions (N5 and p5), setting Tcm%T = Tc0.05% and Nf = 3.

Set    Nright  pright  N5    p5
EXT1   188     74.9    141   56.2
EXT2   168     66.9    122   48.6
EXT3   167     66.5    129   51.4

Table 5. Performance of the bioactivity prediction algorithm for the compounds of the three Ki external sets whose drug target was correctly predicted by the target prediction algorithm (Nright in Table 4), considering r2, MAE, ppred and the percentage of the predicted compounds for which the prediction error is less than one logarithmic unit (p1), for k = 2, Tcm%P = Tc0.05%, Nstored = 3, Tcm%Cliffs ≥ Tc0.005% and Δcliffs ≥ 1.

Set    Npred  ppred  r2     MAE    p1
EXT1   100    53.2   0.693  0.473  84.0
EXT2   58     34.5   0.742  0.509  81.0
EXT3   76     45.5   0.688  0.571  77.6

Satisfactorily, in all three external sets both the percentage of drug targets correctly predicted (that is pright) and the percentage of compounds for which the correct drug target is ranked within the top-5 positions (that is p5) are comparable to those found for the calibration set. The same is found for the Ki prediction in terms of the percentage of predicted compounds (that is ppred). Notably, in all cases the percentage of the predicted compounds for which the prediction error is less than one logarithmic unit (p1) is at least 75%, r2 ranges from 0.688 to 0.742 and satisfactory MAE values (from 0.473 to 0.571) were obtained.

IC50 based prediction
Also in the case of the IC50 prediction, the drug target fishing algorithm is effective if it is able to predict the correct drug target for the query molecule and if this target is ranked within the top-5 positions. Table 6 reports the number of the IC50 calibration set molecules for which the algorithm returns the correct drug target (that is Nright) and the number of molecules for which the correct drug target is ranked within the top-5 positions (that is N5), along with the corresponding percentages (that are pright and p5), for different similarity thresholds (that are Tcm%T values equal to Tc0.10%, Tc0.05%, Tc0.02% or Tc0.01%) and for different numbers of FP notations (that is Nf ranging from 1 to 13).


Table 6. Performance of the drug target prediction algorithm for the IC50 calibration set molecules, in terms of the number and percentage of correctly predicted drug targets (Nright and pright) and the number and percentage of correct drug targets predicted within the top-5 positions (N5 and p5), at various similarity thresholds (Tcm%T) and for different numbers of fingerprints (Nf). The best parameter combination is reported in bold.

| Nf | Nright (Tc0.10%) | pright (Tc0.10%) | N5 (Tc0.10%) | p5 (Tc0.10%) | Nright (Tc0.05%) | pright (Tc0.05%) | N5 (Tc0.05%) | p5 (Tc0.05%) | Nright (Tc0.02%) | pright (Tc0.02%) | N5 (Tc0.02%) | p5 (Tc0.02%) | Nright (Tc0.01%) | pright (Tc0.01%) | N5 (Tc0.01%) | p5 (Tc0.01%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 556 | 74.4 | 26 | 3.5 | 479 | 64.1 | 44 | 5.9 | 376 | 50.3 | 123 | 16.5 | 289 | 38.7 | 182 | 24.4 |
| 2 | 486 | 65.1 | 56 | 7.5 | 412 | 55.2 | 139 | 18.6 | 307 | 41.1 | 214 | 28.6 | 228 | 30.5 | 204 | 27.3 |
| 3 | 442 | 59.2 | 100 | 13.4 | 370 | 49.5 | 188 | 25.2 | 268 | 35.9 | 218 | 29.2 | 195 | 26.1 | 182 | 24.4 |
| 4 | 406 | 54.4 | 154 | 20.6 | 340 | 45.5 | 211 | 28.2 | 240 | 32.1 | 212 | 28.4 | 168 | 22.5 | 160 | 21.4 |
| 5 | 374 | 50.1 | 185 | 24.8 | 309 | 41.4 | 220 | 29.5 | 221 | 29.6 | 203 | 27.2 | 140 | 18.7 | 134 | 17.9 |
| 6 | 342 | 45.8 | 204 | 27.3 | **284** | **38.0** | **226** | **30.3** | 200 | 26.8 | 185 | 24.8 | 120 | 16.1 | 117 | 15.7 |
| 7 | 319 | 42.7 | 211 | 28.2 | 258 | 34.5 | 214 | 28.6 | 175 | 23.4 | 163 | 21.8 | 111 | 14.9 | 108 | 14.5 |
| 8 | 293 | 39.2 | 213 | 28.5 | 229 | 30.7 | 196 | 26.2 | 158 | 21.2 | 149 | 19.9 | 102 | 13.7 | 99 | 13.3 |
| 9 | 259 | 34.7 | 209 | 28.0 | 211 | 28.2 | 190 | 25.4 | 139 | 18.6 | 132 | 17.7 | 94 | 12.6 | 91 | 12.2 |
| 10 | 229 | 30.7 | 196 | 26.2 | 179 | 24.0 | 161 | 21.6 | 120 | 16.1 | 115 | 15.4 | 77 | 10.3 | 74 | 9.9 |
| 11 | 201 | 26.9 | 179 | 24.0 | 156 | 20.9 | 142 | 19.0 | 104 | 13.9 | 100 | 13.4 | 63 | 8.4 | 61 | 8.2 |
| 12 | 166 | 22.2 | 153 | 20.5 | 133 | 17.8 | 125 | 16.7 | 85 | 11.4 | 81 | 10.8 | 50 | 6.7 | 49 | 6.6 |
| 13 | 129 | 17.3 | 121 | 16.2 | 97 | 13.0 | 94 | 12.6 | 59 | 7.9 | 58 | 7.8 | 34 | 4.6 | 33 | 4.4 |
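The parameter choice made from Table 6 (the combination giving the highest N5) can be checked with a few lines of Python. The N5 values are transcribed from Table 6; the helper names `best_combination` and `optimum_per_threshold` are hypothetical and serve only this illustration.

```python
# N5 values transcribed from Table 6; list position i holds Nf = i + 1.
N5 = {
    "Tc0.10%": [26, 56, 100, 154, 185, 204, 211, 213, 209, 196, 179, 153, 121],
    "Tc0.05%": [44, 139, 188, 211, 220, 226, 214, 196, 190, 161, 142, 125, 94],
    "Tc0.02%": [123, 214, 218, 212, 203, 185, 163, 149, 132, 115, 100, 81, 58],
    "Tc0.01%": [182, 204, 182, 160, 134, 117, 108, 99, 91, 74, 61, 49, 33],
}


def best_combination(n5_table):
    """Return (threshold, Nf, N5) for the overall largest N5."""
    best = None
    for threshold, values in n5_table.items():
        for nf, n5 in enumerate(values, start=1):
            if best is None or n5 > best[2]:
                best = (threshold, nf, n5)
    return best


def optimum_per_threshold(n5_table):
    """Return, for each threshold, the Nf at which N5 peaks."""
    return {
        threshold: values.index(max(values)) + 1
        for threshold, values in n5_table.items()
    }


best = best_combination(N5)
optima = optimum_per_threshold(N5)
print(best)    # ('Tc0.05%', 6, 226)
print(optima)  # {'Tc0.10%': 8, 'Tc0.05%': 6, 'Tc0.02%': 3, 'Tc0.01%': 2}
```

The per-threshold optima also show that the Nf at which N5 peaks shifts to smaller values as the similarity threshold becomes stricter.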

In parallel with what was described above, increasing Nf raises N5 at first and lowers Nright, irrespective of the explored Tcm%T value. Again, N5 reaches a maximum as Nf increases up to a certain value (6 in the case of Tcm%T equal to Tc0.05%), beyond which it decreases. It is worth noting that the pright values for the IC50 drug target prediction (Table 6) are lower than those found for the Ki drug target prediction (Table 2), especially if high p5 values are required. This is probably due to the presence of a larger number of drug targets associated with only a limited number (fewer than 50) of IC50-annotated compounds, which makes predictions more difficult. As seen earlier, the Nf required for accurate drug target detection decreases as the similarity cut-off increases. Considering for instance a Tcm%T value equal to Tc0.05%, almost all the molecules are correctly predicted with the respective drug target ranked within the top-5 positions (Nright ≃ N5) only if Nf = 13. On the other hand, with a Tcm%T value equal to Tc0.02%, the same result is achieved by setting Nf > 10. In particular, the parameter combination Tc0.05% and Nf = 6 provides the highest value of N5 (226). For this reason, this combination was preferred for submitting to bioactivity prediction those molecules whose drug target was correctly predicted by the algorithm (Nright). In analogy with the Ki case, it is reasonable to extend to the bioactivity prediction stage the


same similarity cut-off employed for the target prediction stage (Tcm%P = Tcm%T). The performance of the similarity search algorithm for the prediction of IC50 is summarized in Table 7 for a number k of top similar molecules ranging from 2 to 4 and a number of stored bioactivity data Nstored ranging from 1 to 10.

Table 7. Performance of the bioactivity prediction algorithm for the compounds of the IC50 calibration set whose drug target was correctly predicted by the target prediction algorithm setting Tcm%T equal to Tc0.05% and Nf equal to 6 (Nright in Table 6), in terms of r2, MAE and ppred, considering Tcm%P equal to Tc0.05% with the application of the AC based filter (Tcm%Cliffs ≥ Tc0.005% and cliffs ≥ 1). The best parameter combination is reported in bold.

| Nstored | Npred (k=2) | ppred (k=2) | r2 (k=2) | MAE (k=2) | Npred (k=3) | ppred (k=3) | r2 (k=3) | MAE (k=3) | Npred (k=4) | ppred (k=4) | r2 (k=4) | MAE (k=4) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 172 | 60.6 | 0.257 | 1.038 | 164 | 57.7 | 0.363 | 0.992 | 147 | 51.8 | 0.378 | 0.973 |
| 2 | 131 | 46.1 | 0.469 | 0.893 | 96 | 33.8 | 0.542 | 0.843 | 82 | 28.9 | 0.500 | 0.859 |
| 3 | 98 | 34.5 | 0.390 | 0.921 | 62 | 21.8 | 0.544 | 0.831 | 52 | 18.3 | 0.432 | 0.905 |
| 4 | 77 | 27.1 | 0.432 | 0.867 | 46 | 16.2 | 0.634 | 0.701 | 38 | 13.4 | 0.505 | 0.816 |
| 5 | **59** | **20.8** | **0.529** | **0.786** | 40 | 14.1 | 0.635 | 0.725 | 29 | 10.2 | 0.595 | 0.718 |
| 6 | 46 | 16.2 | 0.614 | 0.689 | 29 | 10.2 | 0.553 | 0.737 | 25 | 8.8 | 0.702 | 0.636 |
| 7 | 41 | 14.4 | 0.598 | 0.679 | 23 | 8.1 | 0.683 | 0.635 | 21 | 7.4 | 0.750 | 0.592 |
| 8 | 35 | 12.3 | 0.637 | 0.650 | 18 | 6.3 | 0.764 | 0.502 | 19 | 6.7 | 0.650 | 0.630 |
| 9 | 31 | 10.9 | 0.629 | 0.655 | 17 | 6.0 | 0.753 | 0.534 | 17 | 6.0 | 0.625 | 0.693 |
| 10 | 30 | 10.6 | 0.621 | 0.675 | 16 | 5.6 | 0.748 | 0.560 | 16 | 5.6 | 0.571 | 0.717 |

As far as k = 2 and k = 3 are concerned, the same general trend found for Ki can also be observed for the IC50 prediction. In particular, the performance of the algorithm in terms of both r2 and MAE improves as Nstored increases or on switching from k = 2 to k = 3, at the expense of ppred, which accounts for the model coverage. From a broader point of view, it is clear from Table 7 that the performance for the IC50 prediction is not as good as for the Ki case. This is probably due to the fact that, unlike Ki, which is a true equilibrium constant, IC50 is an intrinsically less reliable measure, as it depends on various experimental conditions such as the concentration of the enzyme.73,74 Among the tested parameters, our choice fell on those (k = 2, Nstored = 5) ensuring the lowest MAE while retaining sufficient model coverage (in this case ppred > 20%). Note that, owing to the generally lower performance of the IC50 based model, a less restrictive ppred threshold (20% instead of 40%) was set.
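The selection rule stated above (lowest MAE among the combinations with ppred above a coverage threshold) can be reproduced from the Table 7 values with a short Python sketch. The ppred and MAE values are transcribed from Table 7; the function name `select_parameters` is hypothetical.

```python
# ppred (%) and MAE from Table 7; list position i holds Nstored = i + 1.
PPRED = {
    2: [60.6, 46.1, 34.5, 27.1, 20.8, 16.2, 14.4, 12.3, 10.9, 10.6],
    3: [57.7, 33.8, 21.8, 16.2, 14.1, 10.2, 8.1, 6.3, 6.0, 5.6],
    4: [51.8, 28.9, 18.3, 13.4, 10.2, 8.8, 7.4, 6.7, 6.0, 5.6],
}
MAE = {
    2: [1.038, 0.893, 0.921, 0.867, 0.786, 0.689, 0.679, 0.650, 0.655, 0.675],
    3: [0.992, 0.843, 0.831, 0.701, 0.725, 0.737, 0.635, 0.502, 0.534, 0.560],
    4: [0.973, 0.859, 0.905, 0.816, 0.718, 0.636, 0.592, 0.630, 0.693, 0.717],
}


def select_parameters(ppred, mae, min_coverage):
    """Return (k, Nstored, MAE) with the lowest MAE among the
    combinations whose coverage ppred exceeds `min_coverage`."""
    candidates = [
        (mae[k][i], k, i + 1)
        for k in ppred
        for i, coverage in enumerate(ppred[k])
        if coverage > min_coverage
    ]
    if not candidates:
        raise ValueError("no parameter combination reaches the coverage")
    best_mae, best_k, best_nstored = min(candidates)
    return best_k, best_nstored, best_mae


choice_20 = select_parameters(PPRED, MAE, min_coverage=20.0)
choice_40 = select_parameters(PPRED, MAE, min_coverage=40.0)
print(choice_20)  # (2, 5, 0.786)
print(choice_40)  # (2, 2, 0.893)
```

With the 20% threshold the rule returns k = 2 and Nstored = 5, as chosen in the text; applying the stricter 40% threshold to the same IC50 data would instead force a combination with a noticeably higher MAE.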


Also in this case, the performance of the target and bioactivity prediction algorithm was challenged by submitting to the model three randomly built external sets (see Methods section), using the parameters optimized on the calibration set (Tcm%T = Tc0.05% and Nf = 6 for the drug target fishing stage; k = 2, Tcm%P = Tc0.05% and Nstored = 5 for the IC50 prediction stage; Tcm%Cliffs ≥ Tc0.005% and cliffs ≥ 1 for the AC based filter). The results are reported in Table 8 and Table 9.

Table 8. Performance of the drug target prediction algorithm for the three IC50 external sets, in terms of the number and percentage of correctly predicted drug targets (Nright and pright) and the number and percentage of correct drug targets predicted within the top-5 positions (N5 and p5), setting Tcm%T = Tc0.05% and Nf = 6.

| Set | Nright | pright | N5 | p5 |
|---|---|---|---|---|
| EXT1 | 126 | 42.0 | 105 | 35.0 |
| EXT2 | 122 | 40.7 | 95 | 31.7 |
| EXT3 | 106 | 35.3 | 88 | 29.3 |

Table 9. Performance of the bioactivity prediction algorithm for the compounds of the three IC50 external sets whose drug target was correctly predicted by the target prediction algorithm (Nright in Table 8), in terms of r2, MAE, ppred and the percentage of predicted compounds for which the prediction error is less than one logarithmic unit (p1), for k = 2, Tcm%P = Tc0.05%, Nstored = 5, Tcm%Cliffs ≥ Tc0.005% and cliffs ≥ 1.

| Set | Npred | ppred | r2 | MAE | p1 |
|---|---|---|---|---|---|
| EXT1 | 25 | 19.8 | 0.744 | 0.439 | 80.0 |
| EXT2 | 27 | 22.1 | 0.816 | 0.435 | 85.2 |
| EXT3 | 27 | 25.5 | 0.796 | 0.441 | 77.8 |
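For readers who wish to see how the error measures in Table 9 relate to raw predictions, the following Python sketch computes MAE and p1 on log-scaled activities and recomputes ppred from the counts in Tables 8 and 9. The function names and the five activity values are hypothetical and purely illustrative; expressing activities as pIC50 is an assumption. Only the Npred and Nright counts come from the tables.

```python
def mean_absolute_error(experimental, predicted):
    """Mean absolute difference between experimental and predicted values."""
    errors = [abs(e - p) for e, p in zip(experimental, predicted)]
    return sum(errors) / len(errors)


def percent_within(experimental, predicted, tolerance=1.0):
    """Percentage of predictions whose absolute error is below `tolerance`
    (p1 when the tolerance is one logarithmic unit)."""
    hits = sum(1 for e, p in zip(experimental, predicted)
               if abs(e - p) < tolerance)
    return 100.0 * hits / len(experimental)


# Hypothetical log-scaled activities (e.g. pIC50), for illustration only.
experimental = [6.2, 7.5, 8.1, 5.4, 9.0]
predicted = [6.0, 7.9, 6.8, 5.5, 8.7]

mae = mean_absolute_error(experimental, predicted)
p1 = percent_within(experimental, predicted)
print(round(mae, 2), p1)  # 0.46 80.0

# ppred as the share of correctly targeted compounds (Nright, Table 8)
# that actually receive a bioactivity prediction (Npred, Table 9).
n_right = {"EXT1": 126, "EXT2": 122, "EXT3": 106}
n_pred = {"EXT1": 25, "EXT2": 27, "EXT3": 27}
ppred = {name: round(100.0 * n_pred[name] / n_right[name], 1)
         for name in n_right}
print(ppred)  # {'EXT1': 19.8, 'EXT2': 22.1, 'EXT3': 25.5}
```

The recomputed ppred values match those listed in Table 9, which confirms that coverage is measured relative to the correctly targeted compounds rather than to the whole external set.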

Satisfactorily, in all three external sets both the percentage of drug targets correctly predicted (pright) and the percentage of compounds for which the correct drug target is ranked within the top-5 positions (p5) are comparable to those found for the calibration set. The same holds for the IC50 prediction in terms of the percentage of predicted compounds (ppred). Notably, in all cases the percentage of predicted compounds for which the prediction error is less than one logarithmic unit is at least 75%, thus confirming the high quality of the drug target and bioactivity prediction.

Case studies

To challenge the real predictive strength of our algorithm, we used as queries five representative compounds published in the Journal of Medicinal Chemistry since the beginning of 2018,75–79 whose structures, along with relevant experimental data, are reported in Figure 4. At the time of writing, no entry is available for these five compounds even in the latest release of ChEMBL and consequently


no hits were found after a compound search through its open platform.55 This prospective exercise is thus aimed at showing the potential of our algorithm in finding possible drug targets through a reverse screening approach, which has become an integral part of drug discovery pipelines.80 For ease of reading, a case-by-case discussion follows. As shown in Figure 4, the first compound is a 2-amino-4-methylquinazoline derivative behaving as a highly potent phosphatidylinositol 3-kinase inhibitor with an experimental IC50 = 0.96 nM.75 Interestingly, MuSSeL returned four possible targets for this query, the top one being the phosphatidylinositol 3-kinase p110-alpha subunit. The others were the serine/threonine-protein kinase mTOR, the phosphatidylinositol 4-kinase alpha subunit and the phosphatidylinositol 3-kinase p110-gamma subunit.

The second compound contains the N-[4-(quinolin-4-yloxy)phenyl]benzenesulfonamide scaffold and was experimentally tested as a very selective inhibitor of the AXL kinases with an IC50 = 28 nM.76 Satisfactorily, MuSSeL returned a spectrum of thirteen targets, the top one being the tyrosine receptor kinase UFO, the name initially designated for AXL kinases.81 Based on the phenylpiperazine privileged structure, the third compound showed high affinity towards dopamine D4 receptors (Ki = 2.88 nM) and other D2-like dopamine subtypes but also over M1−M5 muscarinic acetylcholine receptors. As reported in the original paper,77 this compound was also able to hit some off-targets, namely alpha1a-, alpha1b-, beta1- and beta2-adrenergic receptors, sigma1, DAT and SART. Interestingly, MuSSeL returned thirty-five putative targets, which cover most of those experimentally tested. As a fourth instance, we used as a query the first affinity-based probe for the human adenosine A2A receptor.78 Three possible targets were identified in our predictive screening, the best scored being the adenosine A2A receptor. As the last case study, we predicted the targets of a subnanomolar naphthyridinone inhibitor (IC50 = 0.11 nM) of phosphodiesterase type 4.79 Using this compound as a query, three possible targets were found, namely the MAP kinase-activated protein kinase 2, the cyclin-dependent kinase 2 and the phosphodiesterase 4B. Notably, a wealth of information concerning these case studies is available in the Supplementary Materials, in a preliminary format provided by MuSSeL. As shown, a


number of secondary potential targets could also be mapped. These additional data could be very helpful to prevent unwanted adverse side effects, which can increase the attrition rate in clinical trials due to toxicity,82 or to address desired off-target activity, which could provide ways of repositioning molecules to treat different diseases.83 For the sake of completeness, we also used the five representative compounds shown in Figure 4 as queries to screen the SwissTargetPrediction webserver.49 A very good overlap was observed between the results obtained by MuSSeL and those of SwissTargetPrediction. For a more informed view, the interested reader is referred to the Supplementary Materials, which also contain the reports provided by the SwissTargetPrediction webserver.

Figure 4. Chemical structures and data of the compounds (1−5) discussed as case studies. Compound 1 is a phosphatidylinositol 3-kinase inhibitor, IC50 = 0.96 nM.75 Compound 2 is an AXL receptor tyrosine kinase inhibitor, IC50 = 28 nM.76 Compound 3 is a dopamine D4 receptor ligand, Ki = 2.88 nM.77 Compound 4 is a human adenosine A2A receptor probe, Ki = 3.89 nM.78 Compound 5 is a phosphodiesterase type 4 inhibitor, IC50 = 0.11 nM.79

Overall remarks

Our workflow implements strategies aimed at stemming the typical flaws of similarity search models, which are mostly based on a single FP or a limited number of FPs.13,14 The algorithm proposed herein instead considers an ensemble of 13 different FP types enabling multiple similarity measures, each one capturing an independent space dimension where different k top similar


molecules can be simultaneously sampled. In doing so, a double consensus is applied for making predictions. On one side, the amount of structural information obtained by sampling 13 different FPs allows accurate drug target prediction. On the other, the chance of exploring even diverse k top similar molecules for each FP allows reliable bioactivity prediction. In particular, as far as the target prediction stage is concerned, the model associates a drug target with a query molecule only if, for a sufficient number of FPs (Nf), there is a molecule highly similar to the query (Tcm%T > Tc0.05%). Importantly, the predictions on the calibration sets (Table 2 and Table 6) clearly show that the quality of the target prediction increases with Nf, but at the expense of the model coverage. Similarly, MuSSeL returns a Ki or IC50 value only if there are bioactivity data coming from at least Nstored different FP definitions, provided that those predictions are within one logarithmic unit and thus highly consistent. Also in this case, increasing Nstored, and thus the number of FPs, means obtaining a larger consensus (Table 3 and Table 7). Thus, the quality of the model in terms of MAE and r2 increases, although again at the expense of its coverage. In other words, the model provides two tunable parameters (Nf and Nstored) for obtaining an acceptable trade-off between prediction accuracy and coverage. Interestingly, our approach can effectively deal with activity cliffs, whose occurrence is challenged by employing a multi-fingerprint algorithm.51 Importantly, the inclusion of the AC based filter further improves the quality of the predictions, making the models highly trustworthy. It should be emphasized that good models are those reflecting a larger consensus basis, which is ensured by sampling a higher number of FPs rather than of k top similar compounds.
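The double consensus described above can be illustrated with a small, self-contained Python sketch. It is not the MuSSeL implementation: fingerprints are modelled as plain sets of on-bits, the per-fingerprint Tanimoto cut-offs stand in for the percentile-based Tcm% thresholds, ranking targets by the number of supporting fingerprints is an assumption, and the "within one logarithmic unit" condition is interpreted here as a maximum spread between per-fingerprint predictions. All function names and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def predict_targets(query_fps, reference, thresholds, nf):
    """Return targets supported by at least `nf` fingerprint types,
    ordered by decreasing number of supporting fingerprints."""
    support = defaultdict(set)
    for fp_name, query_bits in query_fps.items():
        for target, ref_fps in reference:
            if tanimoto(query_bits, ref_fps[fp_name]) >= thresholds[fp_name]:
                support[target].add(fp_name)
    ranked = sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [target for target, fps in ranked if len(fps) >= nf]


def consensus_activity(per_fp_predictions, nstored, max_spread=1.0):
    """Average the per-fingerprint predictions (log units) only if at least
    `nstored` fingerprints contribute and their values agree within
    `max_spread`; otherwise return None (no prediction)."""
    values = list(per_fp_predictions.values())
    if len(values) < nstored:
        return None
    if max(values) - min(values) > max_spread:
        return None
    return mean(values)


# Toy query and reference molecules described by three fingerprint types.
query = {"fp_a": {1, 2, 3, 4}, "fp_b": {10, 11, 12}, "fp_c": {20, 21}}
reference = [
    ("T1", {"fp_a": {1, 2, 3, 4}, "fp_b": {10, 11, 12, 13}, "fp_c": {20, 21}}),
    ("T2", {"fp_a": {1, 2, 3, 9}, "fp_b": {30, 31}, "fp_c": {40}}),
]
strict = {name: 0.7 for name in query}
loose = {name: 0.5 for name in query}

print(predict_targets(query, reference, strict, nf=2))  # ['T1']
print(predict_targets(query, reference, loose, nf=1))   # ['T1', 'T2']

agreeing = {"fp_a": 7.9, "fp_b": 8.3, "fp_c": 8.1}
print(consensus_activity(agreeing, nstored=3))
print(consensus_activity(agreeing, nstored=4))           # None
print(consensus_activity({"fp_a": 6.0, "fp_b": 8.3}, nstored=2))  # None
```

Raising `nf` or `nstored` in this sketch removes weakly supported answers, which is the same accuracy-versus-coverage trade-off that the two tunable parameters control in the text.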
CONCLUSIONS

In this study, a multi-FP similarity search algorithm, MuSSeL, was developed for the prediction of all the possible drug targets with which a query molecule can interact, along with quantitative measures of bioactivity in terms of Ki and IC50 values. MuSSeL exploits as a benchmark the large collection of freely accessible drug bioactivity data provided by ChEMBL and combines, in a consensus-like fashion, the predictions coming from a similarity search conducted


using 13 different FP types. The models proposed herein showed very promising performance and can offer a useful and easy-to-run tool capable of pairing novel compounds with putative drug targets as well as of repurposing known drugs for apparently unrelated diseases, explicitly accounting for their potential toxicity and/or unwanted side effects.

SUPPORTING INFORMATION

“ki_targets.txt” and “ic50_targets.txt”: lists of the Ki and IC50 targets included in the model along with the number of compounds for each target. “ki_calibration.txt”, “ic50_calibration.txt”, “ki_random1.txt”, “ki_random2.txt”, “ki_random3.txt”, “ic50_random1.txt”, “ic50_random2.txt” and “ic50_random3.txt”: molecules included in the Ki and IC50 calibration and random external sets, respectively. “JMC_01_report_MuSSeL.pdf”, “JMC_02_report_MuSSeL.pdf”, “JMC_03_report_MuSSeL.pdf”, “JMC_04_report_MuSSeL.pdf” and “JMC_05_report_MuSSeL.pdf”: preliminary format standard reports of MuSSeL for the compounds of Figure 4. “JMC_01_report_SWISS.pdf”, “JMC_02_report_SWISS.pdf”, “JMC_03_report_SWISS.pdf”, “JMC_04_report_SWISS.pdf” and “JMC_05_report_SWISS.pdf”: standard reports of the SwissTargetPrediction webserver for the compounds of Figure 4.

CONFLICT OF INTEREST

The authors declare no conflict of interest.

REFERENCES

(1) Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J. Docking and Scoring in Virtual Screening for Drug Discovery: Methods and Applications. Nat. Rev. Drug Discov. 2004, 3 (11), 935–949.
(2) Macalino, S. J. Y.; Gosu, V.; Hong, S.; Choi, S. Role of Computer-Aided Drug Design in Modern Drug Discovery. Arch. Pharm. Res. 2015, 38 (9), 1686–1701.
(3) Śledź, P.; Caflisch, A. Protein Structure-Based Drug Design: From Docking to Molecular Dynamics. Curr. Opin. Struct. Biol. 2018, 48, 93–102.
(4) Santos, R.; Ursu, O.; Gaulton, A.; Bento, A. P.; Donadi, R. S.; Bologa, C. G.; Karlsson, A.; Al-Lazikani, B.; Hersey, A.; Oprea, T. I.; Overington, J. P. A Comprehensive Map of Molecular Drug Targets. Nat. Rev. Drug Discov. 2017, 16 (1), 19–34.
(5) Cereto-Massagué, A.; Ojeda, M. J.; Valls, C.; Mulero, M.; Pujadas, G.; Garcia-Vallve, S. Tools for in Silico Target Fishing. Methods 2015, 71, 98–103.


(6) Lavecchia, A.; Cerchia, C. In Silico Methods to Address Polypharmacology: Current Status, Applications and Future Perspectives. Drug Discov. Today 2016, 21 (2), 288–298.
(7) Chen, X.; Yan, C. C.; Zhang, X.; Zhang, X.; Dai, F.; Yin, J.; Zhang, Y. Drug–Target Interaction Prediction: Databases, Web Servers and Computational Models. Brief. Bioinform. 2016, 17 (4), 696–712.
(8) Sam, E.; Athri, P. Web-Based Drug Repurposing Tools: A Survey. Brief. Bioinform. 2017.
(9) Dunkel, M.; Günther, S.; Ahmed, J.; Wittig, B.; Preissner, R. SuperPred: Drug Classification and Target Prediction. Nucleic Acids Res. 2008, 36 (Web Server issue), W55–W59.
(10) Keiser, M. J.; Setola, V.; Irwin, J. J.; Laggner, C.; Abbas, A. I.; Hufeisen, S. J.; Jensen, N. H.; Kuijer, M. B.; Matos, R. C.; Tran, T. B.; Whaley, R.; Glennon, R. A.; Hert, J.; Thomas, K. L. H.; Edwards, D. D.; Shoichet, B. K. Predicting New Molecular Targets for Known Drugs. Nature 2009, 462 (7270), 175–181.
(11) Gong, J.; Cai, C.; Liu, X.; Ku, X.; Jiang, H.; Gao, D.; Li, H. ChemMapper: A Versatile Web Server for Exploring Pharmacology and Chemical Structure Association Based on Molecular 3D Similarity Method. Bioinforma. Oxf. Engl. 2013, 29 (14), 1827–1829.
(12) Wang, L.; Ma, C.; Wipf, P.; Liu, H.; Su, W.; Xie, X.-Q. TargetHunter: An in Silico Target Identification Tool for Predicting Therapeutic Potential of Small Organic Molecules Based on Chemogenomic Database. AAPS J. 2013, 15 (2), 395–406.
(13) Gfeller, D.; Michielin, O.; Zoete, V. Shaping the Interaction Landscape of Bioactive Molecules. Bioinforma. Oxf. Engl. 2013, 29 (23), 3073–3079.
(14) Huang, T.; Mi, H.; Lin, C.; Zhao, L.; Zhong, L. L. D.; Liu, F.; Zhang, G.; Lu, A.; Bian, Z. MOST: Most-Similar Ligand Based Approach to Target Prediction. BMC Bioinformatics 2017, 18, 165.
(15) Awale, M.; Reymond, J.-L. The Polypharmacology Browser: A Web-Based Multi-Fingerprint Target Prediction Tool Using ChEMBL Bioactivity Data. J. Cheminformatics 2017, 9, 11.
(16) Irwin, J. J.; Gaskins, G.; Sterling, T.; Mysinger, M. M.; Keiser, M. J. Predicted Biological Activity of Purchasable Chemical Space. J. Chem. Inf. Model. 2018, 58 (1), 148–164.
(17) Wang, Y.; Zeng, J. Predicting Drug-Target Interactions Using Restricted Boltzmann Machines. Bioinformatics 2013, 29 (13), i126–i134.
(18) Cheng, F.; Liu, C.; Jiang, J.; Lu, W.; Li, W.; Liu, G.; Zhou, W.; Huang, J.; Tang, Y. Prediction of Drug-Target Interactions and Drug Repositioning via Network-Based Inference. PLOS Comput. Biol. 2012, 8 (5), e1002503.
(19) Chen, H.; Zhang, Z. A Semi-Supervised Method for Drug-Target Interaction Prediction with Consistency in Networks. PLOS ONE 2013, 8 (5), e62975.
(20) Chen, X.; Liu, M.-X.; Yan, G.-Y. Drug-Target Interaction Prediction by Random Walk on the Heterogeneous Network. Mol. Biosyst. 2012, 8 (7), 1970–1978.
(21) Mei, J.-P.; Kwoh, C.-K.; Yang, P.; Li, X.-L.; Zheng, J. Drug-Target Interaction Prediction by Learning from Local Information and Neighbors. Bioinforma. Oxf. Engl. 2013, 29 (2), 238–245.
(22) Alaimo, S.; Pulvirenti, A.; Giugno, R.; Ferro, A. Drug-Target Interaction Prediction through Domain-Tuned Network-Based Inference. Bioinforma. Oxf. Engl. 2013, 29 (16), 2004–2008.
(23) Shi, J.-Y.; Yiu, S.-M.; Li, Y.; Leung, H. C. M.; Chin, F. Y. L. Predicting Drug–Target Interaction for New Drugs Using Enhanced Similarity Measures and Super-Target Clustering. Methods 2015, 83, 98–104.
(24) Seal, A.; Ahn, Y.-Y.; Wild, D. J. Optimizing Drug–Target Interaction Prediction Based on Random Walk on Heterogeneous Networks. J. Cheminformatics 2015, 7.
(25) Yan, X.-Y.; Zhang, S.-W.; Zhang, S.-Y. Prediction of Drug–Target Interaction by Label Propagation with Mutual Interaction Information Derived from Heterogeneous Network. Mol. Biosyst. 2016, 12 (2), 520–531.
(26) Ba-alawi, W.; Soufan, O.; Essack, M.; Kalnis, P.; Bajic, V. B. DASPfind: New Efficient Method to Predict Drug–Target Interactions. J. Cheminformatics 2016, 8, 15.


(27) Hao, M.; Bryant, S. H.; Wang, Y. Predicting Drug-Target Interactions by Dual-Network Integrated Logistic Matrix Factorization. Sci. Rep. 2017, 7.
(28) Olayan, R. S.; Ashoor, H.; Bajic, V. B. DDR: Efficient Computational Method to Predict Drug-Target Interactions Using Graph Mining and Machine Learning Approaches. Bioinforma. Oxf. Engl. 2018, 34 (7), 1164–1173.
(29) Peragovics, Á.; Simon, Z.; Tombor, L.; Jelinek, B.; Hári, P.; Czobor, P.; Málnási-Csizmadia, A. Virtual Affinity Fingerprints for Target Fishing: A New Application of Drug Profile Matching. J. Chem. Inf. Model. 2013, 53 (1), 103–113.
(30) Lu, Y.; Guo, Y.; Korhonen, A. Link Prediction in Drug-Target Interactions Network Using Similarity Indices. BMC Bioinformatics 2017, 18.
(31) Koutsoukas, A.; Lowe, R.; Kalantarmotamedi, Y.; Mussa, H. Y.; Klaffke, W.; Mitchell, J. B. O.; Glen, R. C.; Bender, A. In Silico Target Predictions: Defining a Benchmarking Data Set and Comparison of Performance of the Multiclass Naïve Bayes and Parzen-Rosenblatt Window. J. Chem. Inf. Model. 2013, 53 (8), 1957–1966.
(32) Abdo, A.; Leclère, V.; Jacques, P.; Salim, N.; Pupin, M. Prediction of New Bioactive Molecules Using a Bayesian Belief Network. J. Chem. Inf. Model. 2014, 54 (1), 30–36.
(33) Rayhan, F.; Ahmed, S.; Shatabda, S.; Farid, D. M.; Mousavian, Z.; Dehzangi, A.; Rahman, M. S. IDTI-ESBoost: Identification of Drug Target Interaction Using Evolutionary and Structural Features with Boosting. Sci. Rep. 2017, 7 (1), 17731.
(34) Pahikkala, T.; Airola, A.; Pietilä, S.; Shakyawar, S.; Szwajda, A.; Tang, J.; Aittokallio, T. Toward More Realistic Drug-Target Interaction Predictions. Brief. Bioinform. 2015, 16 (2), 325–337.
(35) Reker, D.; Schneider, P.; Schneider, G.; Brown, J. B. Active Learning for Computational Chemogenomics. Future Med. Chem. 2017, 9 (4), 381–402.
(36) Liu, Y.; Wu, M.; Miao, C.; Zhao, P.; Li, X.-L. Neighborhood Regularized Logistic Matrix Factorization for Drug-Target Interaction Prediction. PLOS Comput. Biol. 2016, 12 (2), e1004760.
(37) Yao, Z.-J.; Dong, J.; Che, Y.-J.; Zhu, M.-F.; Wen, M.; Wang, N.-N.; Wang, S.; Lu, A.-P.; Cao, D.-S. TargetNet: A Web Service for Predicting Potential Drug-Target Interaction Profiling via Multi-Target SAR Models. J. Comput. Aided Mol. Des. 2016, 30 (5), 413–424.
(38) Lee, K.; Lee, M.; Kim, D. Utilizing Random Forest QSAR Models with Optimized Parameters for Target Identification and Its Application to Target-Fishing Server. BMC Bioinformatics 2017, 18 (Suppl 16), 567.
(39) Keum, J.; Nam, H. SELF-BLM: Prediction of Drug-Target Interactions via Self-Training SVM. PLOS ONE 2017, 12 (2), e0171839.
(40) Mervin, L. H.; Bulusu, K. C.; Kalash, L.; Afzal, A. M.; Svensson, F.; Firth, M. A.; Barrett, I.; Engkvist, O.; Bender, A. Orthologue Chemical Space and Its Influence on Target Prediction. Bioinformatics 2018, 34 (1), 72–79.
(41) Wen, M.; Zhang, Z.; Niu, S.; Sha, H.; Yang, R.; Yun, Y.; Lu, H. Deep-Learning-Based Drug–Target Interaction Prediction. J. Proteome Res. 2017, 16 (4), 1401–1409.
(42) Yuan, Q.; Gao, J.; Wu, D.; Zhang, S.; Mamitsuka, H.; Zhu, S. DrugE-Rank: Improving Drug-Target Interaction Prediction of New Candidate Drugs or Targets by Ensemble Learning to Rank. Bioinforma. Oxf. Engl. 2016, 32 (12), i18–i27.
(43) Murtazalieva, K. A.; Druzhilovskiy, D. S.; Goel, R. K.; Sastry, G. N.; Poroikov, V. V. How Good Are Publicly Available Web Services That Predict Bioactivity Profiles for Drug Repurposing? SAR QSAR Environ. Res. 2017, 28 (10), 843–862.
(44) Matter, H. Selecting Optimally Diverse Compounds from Structure Databases: A Validation Study of Two-Dimensional and Three-Dimensional Molecular Descriptors. J. Med. Chem. 1997, 40 (8), 1219–1229.
(45) Cereto-Massagué, A.; Ojeda, M. J.; Valls, C.; Mulero, M.; Garcia-Vallvé, S.; Pujadas, G. Molecular Fingerprint Similarity Search in Virtual Screening. Methods 2015, 71, 58–63.
(46) Roy, K.; Kar, S.; Das, R. N. Chapter 2 - Chemical Information and Descriptors. In Understanding the Basics of QSAR for Applications in Pharmaceutical Sciences and Risk Assessment; Academic Press: Boston, 2015; pp 47–80.
(47) Willett, P.; Barnard, J. M.; Downs, G. M. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci. 1998, 38 (6), 983–996.
(48) Tanimoto, T. T. IBM Internal Report 17th. 1957.
(49) Gfeller, D.; Grosdidier, A.; Wirth, M.; Daina, A.; Michielin, O.; Zoete, V. SwissTargetPrediction: A Web Server for Target Prediction of Bioactive Small Molecules. Nucleic Acids Res. 2014, 42 (Web Server issue), W32–W38.
(50) Floris, M.; Manganaro, A.; Nicolotti, O.; Medda, R.; Mangiatordi, G. F.; Benfenati, E. A Generalizable Definition of Chemical Similarity for Read-Across. J. Cheminformatics 2014, 6 (1), 39.
(51) Cruz-Monteagudo, M.; Medina-Franco, J. L.; Pérez-Castillo, Y.; Nicolotti, O.; Cordeiro, M. N. D. S.; Borges, F. Activity Cliffs in Drug Discovery: Dr Jekyll or Mr Hyde? Drug Discov. Today 2014, 19 (8), 1069–1080.
(52) Bajorath, J. Representation and Identification of Activity Cliffs. Expert Opin. Drug Discov. 2017, 12 (9), 879–883.
(53) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40 (Database issue), D1100–D1107.
(54) Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L. J.; Chambers, J.; Davies, M.; Krüger, F. A.; Light, Y.; Mak, L.; McGlinchey, S.; Nowotka, M.; Papadatos, G.; Santos, R.; Overington, J. P. The ChEMBL Bioactivity Database: An Update. Nucleic Acids Res. 2014, 42 (Database issue), D1083–D1090.
(55) Gaulton, A.; Hersey, A.; Nowotka, M.; Bento, A. P.; Chambers, J.; Mendez, D.; Mutowo, P.; Atkinson, F.; Bellis, L. J.; Cibrián-Uhalte, E.; Davies, M.; Dedman, N.; Karlsson, A.; Magariños, M. P.; Overington, J. P.; Papadatos, G.; Smit, I.; Leach, A. R. The ChEMBL Database in 2017. Nucleic Acids Res. 2017, 45 (D1), D945–D954.
(56) Bender, A. Databases: Compound bioactivities go public https://www.nature.com/articles/nchembio.354 (accessed May 15, 2018).
(57) Fourches, D.; Muratov, E.; Tropsha, A. Curation of Chemogenomics Data. Nat. Chem. Biol. 2015, 11 (8), 535.
(58) Mangiatordi, G. F.; Alberga, D.; Altomare, C. D.; Carotti, A.; Catto, M.; Cellamare, S.; Gadaleta, D.; Lattanzi, G.; Leonetti, F.; Pisani, L.; Stefanachi, A.; Trisciuzzi, D.; Nicolotti, O. Mind the Gap! A Journey towards Computational Toxicology. Mol. Inform. 2016, 35 (8–9), 294–308.
(59) Landrum, G. RDKit: Open-Source Cheminformatics. 2006.
(60) O’Boyle, N. M.; Morley, C.; Hutchison, G. R. Pybel: A Python Wrapper for the OpenBabel Cheminformatics Toolkit. Chem. Cent. J. 2008, 2, 5.
(61) Steinbeck, C.; Han, Y.; Kuhn, S.; Horlacher, O.; Luttmann, E.; Willighagen, E. The Chemistry Development Kit (CDK): An Open-Source Java Library for Chemo- and Bioinformatics. J. Chem. Inf. Comput. Sci. 2003, 43 (2), 493–500.
(62) Steinbeck, C.; Hoppe, C.; Kuhn, S.; Floris, M.; Guha, R.; Willighagen, E. L. Recent Developments of the Chemistry Development Kit (CDK) - An Open-Source Java Library for Chemo- and Bioinformatics http://www.ingentaconnect.com/content/ben/cpd/2006/00000012/00000017/art00005 (accessed May 15, 2018).
(63) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50 (5), 742–754.
(64) Gobbi, A.; Poppinger, D. Genetic Optimization of Combinatorial Libraries. Biotechnol. Bioeng. 1998, 61 (1), 47–54.
(65) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications. J. Chem. Inf. Comput. Sci. 1985, 25 (2), 64–73.
(66) Nilakantan, R.; Bauman, N.; Dixon, J. S.; Venkataraghavan, R. Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors. J. Chem. Inf. Comput. Sci. 1987, 27 (2), 82–85.
(67) PubChem Substructure Fingerprint v1.3. ftp://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.txt.
(68) Klekota, J.; Roth, F. P. Chemical Substructures That Enrich for Biological Activity. Bioinforma. Oxf. Engl. 2008, 24 (21), 2518–2525.
(69) Riniker, S.; Landrum, G. A. Open-Source Platform to Benchmark Fingerprints for Ligand-Based Virtual Screening. J. Cheminformatics 2013, 5 (1), 26.
(70) Maggiora, G.; Vogt, M.; Stumpfe, D.; Bajorath, J. Molecular Similarity in Medicinal Chemistry. J. Med. Chem. 2014, 57 (8), 3186–3204.
(71) Roy, K.; Das, R. N.; Ambure, P.; Aher, R. B. Be Aware of Error Measures. Further Studies on Validation of Predictive QSAR Models. Chemom. Intell. Lab. Syst. 2016, 152, 18–33.
(72) Gissi, A.; Gadaleta, D.; Floris, M.; Olla, S.; Carotti, A.; Novellino, E.; Benfenati, E.; Nicolotti, O. An Alternative QSAR-Based Approach for Predicting the Bioconcentration Factor for Regulatory Purposes. ALTEX 2014, 31 (1), 23–36.
(73) Cha, S. Tight-Binding Inhibitors-I. Kinetic Behavior. Biochem. Pharmacol. 1975, 24 (23), 2177–2185.
(74) Kalliokoski, T.; Kramer, C.; Vulpetti, A.; Gedeck, P. Comparability of Mixed IC50 Data – A Statistical Analysis. PLOS ONE 2013, 8 (4), e61007.
(75) Lin, S.; Wang, C.; Ji, M.; Wu, D.; Lv, Y.; Zhang, K.; Dong, Y.; Jin, J.; Chen, J.; Zhang, J.; Sheng, L.; Li, Y.; Chen, X.; Xu, H. Discovery and Optimization of 2-Amino-4-Methylquinazoline Derivatives as Highly Potent Phosphatidylinositol 3-Kinase Inhibitors for Cancer Treatment. J. Med. Chem. 2018, 61 (14), 6087–6109.
(76) Szabadkai, I.; Torka, R.; Garamvölgyi, R.; Baska, F.; Gyulavári, P.; Boros, S.; Illyés, E.; Choidas, A.; Ullrich, A.; Őrfi, L. Discovery of N-[4-(Quinolin-4-Yloxy)Phenyl]Benzenesulfonamides as Novel AXL Kinase Inhibitors. J. Med. Chem. 2018, 61 (14), 6277–6292.
(77) Del Bello, F.; Bonifazi, A.; Giorgioni, G.; Cifani, C.; Micioni Di Bonaventura, M. V.; Petrelli, R.; Piergentili, A.; Fontana, S.; Mammoli, V.; Yano, H.; Matucci, R.; Vistoli, G.; Quaglia, W. 1-[3-(4-Butylpiperidin-1-Yl)Propyl]-1,2,3,4-Tetrahydroquinolin-2-One (77-LH-28-1) as a Model for the Rational Design of a Novel Class of Brain Penetrant Ligands with High Affinity and Selectivity for Dopamine D4 Receptor. J. Med. Chem. 2018, 61 (8), 3712–3725.
(78) Yang, X.; Michiels, T. J. M.; de Jong, C.; Soethoudt, M.; Dekker, N.; Gordon, E.; van der Stelt, M.; Heitman, L. H.; van der Es, D.; IJzerman, A. P. An Affinity-Based Probe for the Human Adenosine A2A Receptor. J. Med. Chem. 2018, 61 (17), 7892–7901.
(79) Roberts, R. S.; Sevilla, S.; Ferrer, M.; Taltavull, J.; Hernández, B.; Segarra, V.; Gràcia, J.; Lehner, M. D.; Gavaldà, A.; Andrés, M.; Cabedo, J.; Vilella, D.; Eichhorn, P.; Calama, E.; Carcasona, C.; Miralpeix, M. 4-Amino-7,8-Dihydro-1,6-Naphthyridin-5(6H)-Ones as Inhaled Phosphodiesterase Type 4 (PDE4) Inhibitors: Structural Biology and Structure–Activity Relationships. J. Med. Chem. 2018, 61 (6), 2472–2489.
(80) Ziegler, S.; Pries, V.; Hedberg, C.; Waldmann, H. Target Identification for Small Bioactive Molecules: Finding the Needle in the Haystack. Angew. Chem. Int. Ed. 2013, 52 (10), 2744–2792.
(81) Janssen, J. W.; Schulz, A. S.; Steenvoorden, A. C.; Schmidberger, M.; Strehl, S.; Ambros, P. F.; Bartram, C. R. A Novel Putative Tyrosine Kinase Receptor with Oncogenic Potential. Oncogene 1991, 6 (11), 2113–2120.
(82) Lounkine, E.; Keiser, M. J.; Whitebread, S.; Mikhailov, D.; Hamon, J.; Jenkins, J. L.; Lavan, P.; Weber, E.; Doak, A. K.; Côté, S.; Shoichet, B. K.; Urban, L. Large-Scale Prediction and Testing of Drug Activity on Side-Effect Targets. Nature 2012, 486 (7403), 361–367.
(83) Chong, C. R.; Sullivan, D. J., Jr. New uses for old drugs https://www.nature.com/articles/448645a (accessed Oct 4, 2018).


Table of Contents Graphic A New Approach For Drug Target and Bioactivity Prediction: the Multi-fingerprint Similarity Search aLgorithm (MuSSeL) Domenico Alberga, Daniela Trisciuzzi, Michele Montaruli, Francesco Leonetti, Giuseppe Felice Mangiatordi and Orazio Nicolotti
