Influence of Descriptor Implementation on Compound Ranking Based


Pharmaceutical Modeling

Influence of Descriptor Implementation on Compound Ranking Based on Multi-Parameter Assessment Ekaterina A Sosnina, Dmitry I. Osolodkin, Eugene V. Radchenko, Sergey Sosnin, and Vladimir A. Palyulin J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.7b00734 • Publication Date (Web): 24 Apr 2018 Downloaded from http://pubs.acs.org on April 25, 2018



Journal of Chemical Information and Modeling

Influence of Descriptor Implementation on Compound Ranking Based on Multi-Parameter Assessment

Ekaterina A. Sosnina,†,‡,¶ Dmitry I. Osolodkin,†,§,∥ Eugene V. Radchenko,†,¶ Sergey Sosnin,‡,¶ and Vladimir A. Palyulin*,†,¶

†Department of Chemistry, Lomonosov Moscow State University, Moscow 119991, Russia
‡Center for Computational and Data-Intensive Science and Engineering, Skolkovo Institute of Science and Technology, Moscow 143026, Russia
¶Institute of Physiologically Active Compounds RAS, Chernogolovka 142432, Russia
§Chumakov Institute of Poliomyelitis and Viral Encephalitides, Chumakov FSC R&D IBP RAS, Moscow 108819, Russia
∥Sechenov First Moscow State Medical University, Moscow 119991, Russia

E-mail: [email protected]. Phone: +7-495-939-39-69. Fax: +7-495-939-02-90

Abstract

Most of the common molecular descriptors have numerous different implementations. This can influence the results of compound prioritization based on the multi-parameter assessment (MPA) approach, which allows a medicinal chemist to simultaneously analyze and achieve the desired balance of the diverse and often conflicting molecular and pharmacological properties. In this study, we analyzed the feasibility of using different implementations of common descriptors (logP, logS, TPSA, logBB, hERG, nHBA) interchangeably in predesigned sets of requirements in the course of multi-parameter compound optimization. The influence of methods of descriptor calculation, continuity and discreteness of their values and applicability domains, as well as of the nature of desirability functions in an MPA profile, was examined in terms of the stability of MPA compound ranking. It was shown that the interchangeable use of different methods of descriptor calculation is reliably acceptable only for continuously distributed parameters transformed by a smooth desirability function. If a descriptor in an MPA scheme is discretely distributed, only the implementation that was used for building the scoring profile may be used for assessment. An inconsistency of assessment due to different applicability domains of descriptors was also demonstrated.

Keywords: drug development, multi-parameter assessment, ranking, descriptor implementations, applicability domain

Introduction

To prioritize the most relevant compounds for a particular task, multi-parameter assessment (MPA) schemes may be used to rank the compounds according to the properties represented by molecular descriptors. 1–3 The first widely recognised attempt to design such schemes was made by C. Lipinski with his famous rule of five, designed to decrease the risks associated with the investigation of poorly bioavailable, nonselective, and/or synthetically unfeasible compounds in preclinical development. 4 A limitation of the approach is the absence of predefined criteria for choosing the set of descriptors that should underlie the final assessment, 5 as well as the particular descriptor implementations. 6

The desirability function approach is widely used in MPA techniques for drug discovery and development. 7,8 It is based on the idea that the acceptability of a drug candidate with multiple quality characteristics, represented by numerical molecular descriptors, depends to a certain extent on each of them. 9 In an MPA workflow the value of each molecular descriptor is transformed into the range [0, 1] with the help of a project-specific desirability function (DF), derived empirically from the descriptor distribution for a set of compounds meeting given requirements. Then the normalized values are combined using the arithmetic or geometric mean into a desirability index (DI), which also ranges from zero (the compound does not have the required properties and is not suitable for the task) to one (the compound is a perfect candidate for further investigation). 9,10 Thus, it is possible to fold the influence of multiple descriptors into a single aggregated function and to optimize a single objective function instead of multiple ones. A set of DFs forms a scoring profile (SP). SPs may be designed based on the available experimental data to reflect the most suitable features of compounds, such as the ability to penetrate into the CNS, 11,12 to be administered as oral drugs, 13 or to use a specific drug delivery system. 14

A well-known example of an SP is the Quantitative Estimate of Drug-likeness (QED), 13 which assesses the similarity of a compound to the approved oral drugs based on eight molecular descriptors: molecular weight (MW), lipophilicity (logP), number of hydrogen bond donors (HBD) and acceptors (HBA), polar surface area (PSA), number of rotatable bonds (ROTB) and aromatic rings (AROM), and number of undesirable substructures (ALERTs). The DF for each descriptor corresponds to the distribution of its values for approved oral drugs.

Another approach is Probabilistic Scoring, 15 based on scoring functions that correspond to DFs. 16 In this approach the probability for a compound to meet the experience-based criteria is calculated taking into account the statistical uncertainty of the descriptors.

In our previous studies we have shown that the use of different atomic charge calculation schemes influences the quality and predictive ability of 3D QSAR models 17,18 and the conformational space of small molecules explored by molecular dynamics simulation. 19 More advanced schemes, which better describe the electrostatic potential of a molecule, improve the outcomes and highlight the importance of correct descriptor choice for MPA.

In this study we evaluate how the use of different implementations of a number of common descriptors (logP, logS, TPSA, logBB, hERG, nHBA) affects the compound assessment. It is important to understand (1) to what extent a method of descriptor calculation influences the MPA results, (2) whether the result is stable with respect to the use of different descriptor implementations, and (3) how the DI ranking of compounds depends on descriptor variation. The objective of the investigation is not to assess the accuracy of scoring profiles or descriptor calculation methods, but only the influence of changes in descriptor implementation on the ranking of compounds.

We investigated predesigned SPs to assess the possibility of using different methods of descriptor calculation interchangeably. Several datasets were prepared from open structure databases by random sampling and different descriptor filters. The DIs of the structures were calculated based on predesigned SPs and different descriptor calculation techniques and then compared to each other using Spearman's rank correlation coefficient.
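The combination of normalized desirabilities into a DI, as described earlier in this section, can be illustrated with a minimal sketch. The function name `desirability_index` and the example values are hypothetical and are not taken from the study or from StarDrop; the sketch only shows how the arithmetic and geometric means behave differently when one descriptor is fully undesirable.

```python
import math


def desirability_index(desirabilities, method="geometric"):
    """Combine per-descriptor desirabilities (each in [0, 1]) into one DI.

    Hypothetical helper for illustration; not the implementation used in the study.
    """
    values = list(desirabilities)
    if not values:
        raise ValueError("at least one desirability value is required")
    if any(not 0.0 <= d <= 1.0 for d in values):
        raise ValueError("desirabilities must lie in [0, 1]")
    if method == "arithmetic":
        return sum(values) / len(values)
    if method == "geometric":
        # A single zero desirability forces the geometric-mean DI to zero.
        return math.prod(values) ** (1.0 / len(values))
    raise ValueError(f"unknown method: {method}")


# Hypothetical desirabilities of one compound for three descriptors.
example = [0.9, 0.8, 0.0]
print("arithmetic DI:", desirability_index(example, "arithmetic"))
print("geometric DI:", desirability_index(example, "geometric"))
```

With the geometric mean, a compound that fails one requirement completely cannot be rescued by good values elsewhere, whereas the arithmetic mean allows such compensation.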

Methods

Database preparation

Several open databases were used in the study with different property filters applied to them to create sample sets (Table 1): ZINC, 20,21 ZINClick, 22,23 Commercial Compound Collection (CoCoCo), 24,25 GDB-17, 26,27 and published compounds assessed as glycogen synthase kinase 3β (GSK-3β) inhibitors (extended from ref. 28, also used in refs. 29 and 30). ZINC and CoCoCo represent commercially available compounds for medicinal chemistry with rather different preselection rules, ZINClick is a generated set of synthetically accessible triazoles, GDB-17 is a generated "chemical universe" set of all theoretically conceivable compounds containing up to 17 atoms of C, N, O, S, and halogens, and the published GSK-3β inhibitors were chosen as compounds that were studied in real medicinal chemistry projects and constructed by more or less rational design approaches against a certain target. Structure canonicalization and processing were carried out manually and automatically using the ChemAxon Standardizer (Dearomatize, Clean 2D and Transform Nitro options were selected). 31 A detailed description and the list of compounds for each dataset are given in the Supporting information File SI1 and File SI2, respectively.

Table 1: Datasets used in this work.

| Dataset | Number of structures | Source and filters for selection |
|---|---|---|
| CoCoCo DL1, CoCoCo DL2 | 1500, 1500 | Random sample of CoCoCo-SC ASINEX library (Drug-like: MW < 400 g/mol and TPSA < 120 Å²ᵃ) |
| CoCoCo nDL1, CoCoCo nDL2 | 1500, 1500 | Random sample of CoCoCo-SC ASINEX library (Drug-like: MW > 450 g/mol and TPSA > 125 Å²ᵃ) |
| GDB17 1500, GDB17 3000 | 1500, 3000 | Random sample of GDB-17-Set |
| GDB17LL 1500, GDB17LL 3000 | 1500, 3000 | Random sample of GDB-17 Lead-like Set (MW 100-350 g/mol, clogP 1-3) |
| GDB17LLnoSR 1500, GDB17LLnoSR 3000 | 1500, 3000 | Random sample of GDB-17 Lead-like Set (MW 100-350 g/mol, clogP 1-3) without small rings (3-4 ring atoms) |
| GSK1904 | 1904 | GSK3β inhibitors extracted from articles and patents with numerical IC50 value |
| GSK3015 | 3515 | Structures extracted from articles and patents with boolean info on inhibitory activity against GSK3β (active/inactive at 10 µM) |
| ZC All1 | 1500 | Random sample of "All" subset of ZINClick v.13 |
| ZC All2 | 1500 | Random sample from updates in "All" subset of ZINClick v.15 |
| ZC Div | 1000 | Random sample of "Diversity-set" subset of ZINClick v.13 |
| ZC DL1 | 1500 | Random sample of "Drug-Like" subset of ZINClick v.13 |
| ZC DL2 | 1500 | Random sample from updates in "Drug-Like" subset of ZINClick v.15 |
| ZINC1, ZINC2 | 1500, 1500 | Randomly selected subset of ZINC (MW > 450 g/mol and TPSA > 130 Å²ᵃ) |

ᵃ As calculated by Instant JChem.
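The dataset preparation in Table 1 combines a property filter with random sampling. A minimal sketch of that idea follows, using the MW < 400 g/mol and TPSA < 120 Å² thresholds from the table as defaults. The function `sample_filtered`, the record keys `mw` and `tpsa`, and the synthetic `library` are assumptions made for illustration only; the actual sampling procedure is described in Supporting information File SI1.

```python
import random


def sample_filtered(records, n, max_mw=400.0, max_tpsa=120.0, seed=0):
    """Randomly sample n records that pass simple MW/TPSA upper-bound filters.

    Hypothetical helper: each record is a dict with precomputed 'mw' and 'tpsa'.
    """
    passed = [r for r in records if r["mw"] < max_mw and r["tpsa"] < max_tpsa]
    if len(passed) < n:
        raise ValueError(f"only {len(passed)} records pass the filters, {n} requested")
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(passed, n)


# Synthetic library with made-up property values (not real compounds).
library = [{"id": i, "mw": 250.0 + i, "tpsa": 40.0 + (i % 100)} for i in range(300)]
subset = sample_filtered(library, 100)
print(len(subset), "records sampled")
```

Fixing the seed makes the sample reproducible, which matters when the same subset must be re-scored with several descriptor implementations.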


Calculation of Descriptors

All descriptors (Table 2) were calculated for all the compounds in StarDrop version 6.0.1 32 using the default schemes. In addition, logS, logP, logBB, hERG pIC50, TPSA, and nHBA were calculated according to other schemes, differing in underlying technique and predictive ability. Detailed descriptions of the methods are provided in the Supporting information File SI1.

Table 2: Variable descriptors and schemes of their calculation.

| Calculation scheme | Description | References |
|---|---|---|
| Solubility, logS | | |
| logS_ALOGPS2 | ANNᵃ; 24 E-state indices and some other topological parameters including indicator variables for aliphatic hydrocarbons and aromaticity. | 33–36 |
| logS_ALOGPS3 | ASNNᵇ; 330 descriptors: atom and bond-type E-state indices, numbers of hydrogen and non-hydrogen atoms. | 37 |
| logS_SD.6 | RBFᶜ technique for prediction of the logarithm of the intrinsic aqueous solubility for neutral compounds; over 100 descriptors of structure size and counts of different atomic or specific fragments. | 15,32 |
| logS_SD.pH | RBFᶜ technique for prediction of the logarithm of the apparent solubility at pH 7.4 for ionised compounds. For neutral compounds the logS_SD.6 model is used; for charged ones a separate model is used based on 28 substructure and property based descriptors. | 15,32 |
| logS_SD.WS | RBFᶜ technique; 167 different 2D SMARTS based descriptors (substructure and property). | 32,38 |
| Lipophilicity, logP | | |
| logP_ALOGPS2 | Property-based method. ASNNᵇ; 75 descriptors: atom and bond type E-state indices, number of hydrogen and non-hydrogen atoms. | 39,40 |
| logP_ALOGPS3 | Property-based method. ASNNᵇ; 330 descriptors: atom and bond type E-state indices, number of hydrogen and non-hydrogen atoms. | 37 |
| logP_DLT.X | Atom-based method utilizing the XLOGP model and improved CDK descriptors, including correction factors for some intramolecular interactions and certain adjustments. | 41–43 |
| logP_DLT.A | Atom-based Ghose-Crippen approach based on CDK descriptors without correction factors. | 41,44 |
| logP_SD.6 | RBF technique; 100 2D-descriptors including atom and functionality counts. | 15,32 |
| logP_JChem | Fragmental ChemAxon logP model with pool of the fragments for training set based on [36], [37] and Physprop database. | 45–47 |
| logP_XLOGP3 | Latest version of the XLOGP atom-based method, improved by taking the known logP value of a structural analog into account. | 48,49 |
| Inhibition of human ether-a-go-go-related gene potassium channel, hERG pIC50 | | |
| hERG_SD.6 | Non-linear Gaussian Processes; only data from patch-clamp measurements in mammalian cells; 158 substructure and property based descriptors. | 15,32 |
| hERG_MSU | Artificial neural networks based on fragmental descriptors. | 50 |
| Blood-brain barrier penetration, logBB | | |
| logBB_SD.5 | Non-linear RBFᶜ; 29 substructure and property based descriptors. | 32,51 |
| logBB_SD.6 | Non-linear RBFᶜ; 36 substructure and property based descriptors; expanded applicability domain of the model. | 15,32 |
| logBB_MSU | Artificial neural networks based on fragmental descriptors. | 52 |
| Topological polar surface area, TPSA | | |
| TPSA_DLT | CDK descriptors based on Ertl's approach. | 41,53 |
| TPSA_JChem | Ertl's approach. | 45,53 |
| TPSA_SD.QED | Ertl's approach. | 32,53 |
| TPSA_SD.6 | Ertl's approach excluding phosphorus and sulfur as acceptors. | 32,53 |
| Number of hydrogen bond acceptors, nHBA | | |
| nHBA_DLT | CDK descriptors which count any oxygen and any nitrogen where the formal charge of the atom is non-positive. Exceptions are ether oxygen adjacent to at least one aromatic carbon and an oxygen adjacent to a nitrogen or a nitrogen adjacent to an oxygen. | 41 |
| nHBA_JChem | Count of oxygen and nitrogen atoms except the cases when their direct neighbors are connected to another atom by a double bond or if they are parts of an aromatic system. | 45 |
| nHBA_SD.QED | QED Scoring Profile model based on atoms with non-positive charge. | 32 |
| nHBA_SD.6 | Count of nitrogen and oxygen atoms. | 15,32 |

ᵃ ANN, Artificial Neural Networks; ᵇ ASNN, Associative Neural Network; ᶜ RBF, Radial Basis Function.

Desirability Indices

StarDrop software was used as the core of the workflow. For each compound in each set all available combinations of descriptors were used to calculate the DIs according to the following SPs: (1) "QED StarDrop Properties" (QED, prioritizes compounds similar to approved oral drugs), (2) "Intravenous Non CNS" (INCNS, CNS non-penetrating compounds designed for intravenous administration), and (3) "Oral CNS" (OCNS, CNS penetrating orally available compounds).

QED is based on a combination of the DFs for molecular weight, lipophilicity, number of hydrogen bond donors and acceptors, polar surface area, number of rotatable bonds and aromatic rings, and the number of alerts for undesirable substructures based on ref. 13. INCNS and OCNS are calculated based on the values of solubility, lipophilicity, blood-brain barrier penetration, CYP2C9 affinity, and inhibition value of the human ether-a-go-go-related gene potassium channel (hERG), as well as the categorical assessment of blood-brain barrier penetration, human intestinal absorption, P-glycoprotein transport, plasma protein binding, and CYP2D6 affinity. Both SPs prioritize the compounds with logP values between 0 and 3.5. Compounds with logS > 1 are prioritized by both profiles, but this descriptor is more important for OCNS. The HIA+ category, as well as the other descriptors, is significantly prioritized in OCNS compared with INCNS.

Hereafter the following word code will be used to represent a DI with a certain combination of methods for descriptor calculation: OCNS[logS,logP,logBB,hERG] and INCNS[logS,logP,logBB,hERG] for the OCNS and INCNS DIs, respectively, and QED[logP,TPSA,nHBA] for the QED DI, where logS, logP, hERG, logBB, TPSA, and nHBA are the methods for descriptor calculation described in Table 2. The desirability of each descriptor was calculated according to the predefined StarDrop DFs, 32 and their geometric mean was taken as the final DI value.
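To illustrate how the shape of a DF interacts with the choice of descriptor implementation, the sketch below compares a step-shaped and a smooth DF for logP and combines the result with other desirabilities by the geometric mean. Only the preferred logP range of 0 to 3.5 comes from the text; the penalty value 0.2, the bell-shaped function, its width, and the two logP values are assumptions chosen for illustration and do not reproduce the StarDrop DFs.

```python
import math


def step_df(logp, low=0.0, high=3.5, inside=1.0, outside=0.2):
    """Step-shaped DF: full desirability inside [low, high], fixed penalty outside.

    The 'outside' value is an arbitrary illustrative choice.
    """
    return inside if low <= logp <= high else outside


def smooth_df(logp, center=1.75, width=1.75):
    """Smooth bell-shaped DF centred on the middle of the preferred range (illustrative)."""
    return math.exp(-0.5 * ((logp - center) / width) ** 2)


def geometric_di(desirabilities):
    """Geometric mean of desirabilities, used here as the DI."""
    values = list(desirabilities)
    return math.prod(values) ** (1.0 / len(values))


# Two hypothetical logP implementations that disagree slightly for one compound.
logp_a, logp_b = 3.4, 3.6
others = [0.8, 0.9]  # made-up desirabilities of the remaining descriptors

for name, df in (("step", step_df), ("smooth", smooth_df)):
    di_a = geometric_di([df(logp_a)] + others)
    di_b = geometric_di([df(logp_b)] + others)
    print(f"{name} DF: DI(a) = {di_a:.3f}, DI(b) = {di_b:.3f}, |diff| = {abs(di_a - di_b):.3f}")
```

Under these assumptions a difference of 0.2 log units between implementations moves the compound across the step boundary and changes its DI sharply, while the smooth DF changes it only slightly.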

Compound ranking

Compounds in each dataset were ranked according to the DI values in ascending order, from 1 (the compound with the lowest DI value in the set) to N (the best-scored compound), where N is the number of structures in the set. Compounds with equal DI values received the same rank, equal to the arithmetic mean of their ordinal ranks.
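The ranking rule above (ascending ranks, ties sharing the mean of their ordinal ranks) corresponds to the "average" method of `scipy.stats.rankdata`. The short sketch below applies it to made-up DI values; it is an illustration of the rule and is not the script from the Supporting information.

```python
import numpy as np
from scipy.stats import rankdata

# Hypothetical DI values for five compounds; two of them are tied.
di = np.array([0.12, 0.85, 0.40, 0.40, 0.97])

# Rank 1 is the lowest DI and rank N the best-scored compound;
# the tied compounds share the mean of ordinal ranks 2 and 3.
ranks = rankdata(di, method="average")
print(ranks)
```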


Pairwise correlation of ranks

Spearman's rank correlation coefficient r_s was used to estimate the consistency of rankings and to evaluate the dependence of compound ranking on descriptor variation. Because some structures possess the same DI value, a correction factor that takes the tied ranks into account might be relevant in the calculation. For this reason, both Spearman's rank correlation coefficient (eq. 1) and its modified version with the correction factor T by Woodbury 54,55 (eq. 2) were computed.

$$ r_s = 1 - \frac{6 \sum_{r=1}^{N} (a_r - b_r)^2}{N^3 - N} \tag{1} $$

$$ r_{sw} = 1 - \frac{6 \left[ \sum_{r=1}^{N} (a_r - b_r)^2 + T \right]}{N^3 - N}, \qquad T = \frac{\sum_{i=1}^{M_A} \left( n_{A,i}^3 - n_{A,i} \right) + \sum_{j=1}^{M_B} \left( n_{B,j}^3 - n_{B,j} \right)}{12} \tag{2} $$

where a_r and b_r are the ranks of the rth DI value from calculations A and B; r = 1, 2, ..., N, with N equal to the number of compounds in the set; T is the correction factor for tied ranks; and n_{A,i} and n_{B,j} are the numbers of values in each of the M_A and M_B groups of ties observed in calculations A and B.

We found that the presence of tied ranks does not affect the correlation. In our study, the correlation coefficient values calculated with and without the correction factor T are equal up to 6 decimal places. As suggested to be a general rule in [56], the correction factor is not necessary if the number of tied ranks is less than 25 percent of the total number of pairs; a larger share of ties may be reached only in a small database with low structural diversity. Therefore, the original Spearman's correlation coefficient was used in the study to estimate the consistency of rankings. Rank correlation matrices were visualized as heat maps (Supporting information File SI3). The Python 3 script based on NumPy v.1.11.2 for the calculation of correlation coefficients and matrix visualization is available in the Supporting information File SI4 and File SI5.

The Pearson correlation coefficients (R²) for different descriptor implementations (Supporting information File SI6) were calculated with the pearsonr function from SciPy 1.0.0.
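As an illustration of eq. 1 and eq. 2, the following minimal sketch computes both coefficients from two lists of DI values. It is written directly from the formulas as given in this section and is not the authors' script from File SI4 and File SI5; the function names and the example DI values are hypothetical.

```python
import numpy as np
from scipy.stats import rankdata


def tie_term(ranks):
    """Sum of (n^3 - n) over groups of equal ranks; untied values contribute zero."""
    _, counts = np.unique(ranks, return_counts=True)
    return float(np.sum(counts ** 3 - counts))


def spearman_plain(di_a, di_b):
    """Spearman's r_s according to eq. 1 (no tie correction)."""
    a = rankdata(di_a, method="average")
    b = rankdata(di_b, method="average")
    n = a.size
    d2 = float(np.sum((a - b) ** 2))
    return 1.0 - 6.0 * d2 / (n ** 3 - n)


def spearman_tie_corrected(di_a, di_b):
    """Tie-corrected r_sw according to eq. 2, with T built from both rankings."""
    a = rankdata(di_a, method="average")
    b = rankdata(di_b, method="average")
    n = a.size
    d2 = float(np.sum((a - b) ** 2))
    t = (tie_term(a) + tie_term(b)) / 12.0
    return 1.0 - 6.0 * (d2 + t) / (n ** 3 - n)


# Hypothetical DI values from two descriptor combinations; calculation A has one tie.
di_a = [0.1, 0.2, 0.2, 0.4]
di_b = [0.1, 0.2, 0.3, 0.4]
print(spearman_plain(di_a, di_b), spearman_tie_corrected(di_a, di_b))
```

In this four-compound toy case the single tie shifts the coefficient noticeably; for sets of 1000 or more compounds with few ties the term T is small relative to N³ − N, which is consistent with the agreement to 6 decimal places reported above.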

Results and Discussion

Data Structure and Interpretation

Ranking based on MPA was performed for all datasets using all possible combinations of descriptors in order to check how the ranking depends on the descriptor calculation scheme. Three DIs (QED, OCNS, INCNS) were calculated for every compound using the descriptors and SPs described in the Methods section. Then the compounds were ranked within each dataset based on the DI values. Variation of compound ranks upon the change of descriptor values was assessed by Spearman's correlation coefficient and visualized as correlation matrix heatmaps (Supporting information File SI3, Figures S1–S78).

Symmetric square matrices of Spearman's rank correlation coefficient values were obtained. Each element of a matrix is located at the intersection of two SP schemes with certain descriptor calculation methods, and every row or column is labeled with the name of the corresponding combination (Figure 1). The numbers of rows and columns are equal to the number of investigated combinations of descriptor calculation methods. Thus, the size of the matrices is 112×112 in the case of the QED SP and 140×140 in the cases of INCNS and OCNS. Rank correlation for the QED DI is considerably better than for the INCNS and OCNS DIs: the rank correlation coefficients lie in the ranges [0.6, 1.0] and [0.0, 1.0], respectively.
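The size of the QED matrix follows from the number of implementations listed in Table 2 (7 for logP, 4 for TPSA, 4 for nHBA). The sketch below enumerates these combinations and builds a rank correlation matrix of the matching size. The DI values are random numbers standing in for real data, the underscore label format is an assumption, and Spearman's coefficient is computed here as the Pearson correlation of ranks, which coincides with eq. 1 only when there are no ties.

```python
from itertools import product

import numpy as np
from scipy.stats import rankdata

# Implementation names as listed in Table 2.
LOGP = ["ALOGPS2", "ALOGPS3", "DLT.X", "DLT.A", "SD.6", "JChem", "XLOGP3"]
TPSA = ["DLT", "JChem", "SD.QED", "SD.6"]
NHBA = ["DLT", "JChem", "SD.QED", "SD.6"]

# One label per combination of descriptor implementations (label format is illustrative).
labels = [f"QED[logP_{p},TPSA_{t},nHBA_{h}]" for p, t, h in product(LOGP, TPSA, NHBA)]
print(len(labels))  # 7 * 4 * 4 combinations

# Synthetic DI table: 50 hypothetical compounds (rows) x combinations (columns).
rng = np.random.default_rng(0)
di = rng.random((50, len(labels)))

# Rank compounds within each column, then correlate the columns pairwise.
ranks = rankdata(di, method="average", axis=0)
matrix = np.corrcoef(ranks, rowvar=False)
print(matrix.shape)
```

Each row and column of `matrix` corresponds to one entry of `labels`, so the matrix is symmetric with ones on the diagonal, as in the heatmaps described above.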

Figure 1: Graphical representation of rank correlation of OCNS DIs for the ZINC2 dataset: (a) colored correlation matrix map; (b) scaled up fragment of the matrix.

Influence of a DF shape on a DI

As noted before, the rank correlation for the QED DI is much better than for the INCNS and OCNS DIs (Figure 2). The difference is noticeable even when the calculation method is changed only for a single descriptor: r_s = 0.81 between INCNS[logS_SD.6, logP_SD.6, logBB_SD.6, hERG_SD.6] and INCNS[logS_SD.6, logP_ALOGPS3, logBB_SD.6, hERG_SD.6], and r_s = 0.99 between QED[logP_SD.6, TPSA_SD.6, nHBA_SD.6] and QED[logP_ALOGPS3, TPSA_SD.6, nHBA_SD.6] (Figure 3). This difference may be attributed to the method of descriptor value transformation by the DF. The DFs are derived from the empirical distributions of descriptors for sets of undoubtedly acceptable compounds, for example, approved drugs. The architecture of the functions is arbitrarily defined by a researcher and may be represented as a continuous or a piecewise function. Therefore, the shape of the DF is responsible for the DI value (Figure 3).

The OCNS and INCNS DIs are based on step functions that do not provide a smooth transformation of descriptor values. As a result, the rank correlation for the INCNS DI may be weak when different methods for descriptor calculation are used. Variations in a single descriptor may lead to pronounced fluctuations in the DI value (Figure 3a-c). The shape of a desirability function is reflected in the plots of the pairwise comparison of DI ranks (Figure 3c, f). It is clearly visible when only one descriptor is implemented by different methods (Figure 4). The predicted ranks are split into well-correlated groups, and the number of

ACS Paragon Plus Environment

logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5)

nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6)

logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3)

PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(JChem) logP(JChem) logP(JChem) logP(JChem) 
logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3)

PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6) PSA(DLT) PSA(DLT) PSA(DLT) PSA(DLT) PSA(JChem) PSA(JChem) PSA(JChem) PSA(JChem) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD-QED) PSA(SD6) PSA(SD6) PSA(SD6) PSA(SD6)

logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS)

logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(ALOGPS2) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(AlOGPS3) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD6) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-pH) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS) logS(SD-WS)

nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6) nHBA(DLT) nHBA(JChem) nHBA(SD-QED) nHBA(SD6)

0.70

0.75

0.7

0.80

0.85

0.8

0.90

0.9

0.95

Page 12 of 33

hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6)

Journal of Chemical Information and Modeling

logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(ALOGPS2) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(AlOGPS3) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTX) logP(DLTA) logP(DLTA) logP(DLTA) logP(DLTA) logP(SD6) logP(SD6) logP(SD6) logP(SD6) logP(JChem) logP(JChem) logP(JChem) logP(JChem) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3) logP(XLOGP3)

logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5) logBB(SD6) logBB(SD6) logBB(MSU) logBB(SD5)

1.00

hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6) hERG(SD6) hERG(MSU) hERG(SD6) hERG(SD6)

0.0

1.0

0.2

0.0

(a)

0.4

0.6

0.5

0.8

1.0

1.0

(b)

Figure 2: Rank correlation matrix maps for QED (a) and INCNS (b) based on ZC DL2 dataset. groups is equal to the number of “steps” of the desirability function. Small variations in the descriptor values near the step boundaries can dramatically affect the resulting ranking. If the uncertainty of descriptor calculation is taken into account, the picture becomes more scattered (Figure 3d). The DFs for QED DI are sigmoid or asymmetric double-sigmoid functions converting the descriptor values into the desirability scores in a smooth manner. This means that the DF transforms descriptor values into the desirability scores more predictably and the final DI values do not significantly depend on descriptor calculation methods. According to the correlation maps, even when descriptors vary, the QED values remain stable (Figure 3a,e,f). Therefor, in MPA it is preferable to use DFs with a smooth conversion of values. Step functions are less effective due to a significant discrimination of close values upon transformation.
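The sensitivity of step DFs to small descriptor differences can be illustrated with a minimal sketch. The cut-offs, scores, and sigmoid parameters below are hypothetical and are not taken from the INCNS or QED profiles; the point is only that a 0.1 log unit disagreement between two logP implementations can cost half of the desirability score under a step DF while barely moving a smooth one.

```python
import math


def step_df(logp):
    """Hypothetical step DF: rigid cut-offs chosen only for illustration."""
    if logp <= 3.0:
        return 1.0
    if logp <= 5.0:
        return 0.5
    return 0.1


def sigmoid_df(logp, midpoint=4.0, slope=1.5):
    """Hypothetical smooth DF: a decreasing sigmoid with illustrative parameters."""
    return 1.0 / (1.0 + math.exp(slope * (logp - midpoint)))


# Two logP implementations that disagree by only 0.1 log units for one compound,
# with the two values falling on opposite sides of the 3.0 cut-off.
logp_method_a = 2.95
logp_method_b = 3.05

step_gap = abs(step_df(logp_method_a) - step_df(logp_method_b))
smooth_gap = abs(sigmoid_df(logp_method_a) - sigmoid_df(logp_method_b))

print(f"step DF score difference:   {step_gap:.3f}")
print(f"smooth DF score difference: {smooth_gap:.3f}")
```

For a sigmoid of this form the score difference is bounded by slope/4 times the descriptor difference, so the smooth DF cannot amplify a small disagreement the way a cut-off does.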

Figure 3: Influence of a DF on a DI value for structures from the GSK3515 dataset. (a) Correlation of logP values calculated by the logP_SD.6 and logP_ALOGPS3 methods. (b, e) LogP DF (red line) for the INCNS (b) and QED (e) SPs with the distribution of logP values calculated by the logP_SD.6 (black) and logP_ALOGPS3 (green) methods. (c, f) Rank correlation for the INCNS and QED DIs, respectively, without taking uncertainty into account. (d, g) Rank correlation for the INCNS and QED DIs with uncertainty taken into account. All descriptors except logP are calculated using StarDrop defaults.
[Values annotated in the panels: R² = 0.67; r_s = 0.81, 0.88, 0.99, and 0.99.]

Step functions and descriptor uncertainty

Including uncertainty in a DI calculation in StarDrop is supposed to smooth the outcome of rigid cut-offs, or step DFs. In our case the uncertainty arises from deviations of the descriptor values caused by the use of QSPR or QSAR methods for their calculation. The rank correlation for the results obtained without and with accounting for uncertainty differs considerably: r_s = 0.73 (Figure 3c) and r_s = 0.88 (Figure 3d), respectively, between INCNS[logS_SD.6, logP_SD.6, logBB_SD.6, hERG_SD.6] and INCNS[logS_SD.6, logP_ALOGPS3, logBB_SD.6, hERG_SD.6]. It was shown that accurate estimation of uncertainty improves the results of multiparameter optimisation (ref 57). However, such an improvement of correlation is not enough to make different methods of descriptor calculation completely interchangeable. This is evident on the INCNS correlation matrix maps (Supporting Information File SI3, Figures S34–S65) in the case of the StarDrop and ALOGPS methods. For other descriptor calculation methods, it was not possible to take the uncertainty into account because standard deviations were unavailable. In the case of smooth DFs, the uncertainty does not make a meaningful contribution: r_s = 0.99 in both cases (Figure 3f,g). Thus, the SPs based on smooth DFs may be considered sufficiently accurate.

Figure 4: Influence of the DF shape on the resulting DI ranks based on different methods of logP calculation for the GSK1904 dataset.
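One way to see why uncertainty softens a step DF is to replace the score of a single predicted value by its expectation over a normal distribution centred on that value. The sketch below does this for a hypothetical three-level step DF; it is an illustration under that assumption and is not claimed to reproduce how StarDrop handles uncertainty. The cut-offs, scores, and standard deviation are invented for the example.

```python
import math

# Hypothetical step DF given as (upper bound, score) pairs in increasing order.
STEPS = [(3.0, 1.0), (5.0, 0.5), (math.inf, 0.1)]


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of Normal(mean, sd)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def step_df(value):
    """Score of a value under the rigid step DF."""
    for upper, score in STEPS:
        if value <= upper:
            return score


def expected_step_df(value, sd):
    """Expected step-DF score when the prediction is treated as Normal(value, sd)."""
    if sd <= 0.0:
        return step_df(value)
    total, lower_cdf = 0.0, 0.0
    for upper, score in STEPS:
        upper_cdf = 1.0 if math.isinf(upper) else normal_cdf(upper, value, sd)
        total += score * (upper_cdf - lower_cdf)  # probability mass in this step
        lower_cdf = upper_cdf
    return total


# Two predictions on opposite sides of the 3.0 cut-off.
hard_gap = abs(step_df(2.95) - step_df(3.05))
soft_gap = abs(expected_step_df(2.95, 0.5) - expected_step_df(3.05, 0.5))
print(f"gap without uncertainty: {hard_gap:.3f}")
print(f"gap with sd = 0.5:       {soft_gap:.3f}")
```

The expectation turns each cut-off into a smooth transition whose width is set by the standard deviation, which is consistent with the partial recovery of rank correlation reported above; it also needs a standard deviation for every prediction, which is why the approach was limited to methods that report one.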

Contribution of a single descriptor

Importance of a descriptor in an SP. In some cases the DIs based on step DFs correlate extremely well: r_s = 0.99 between DIs differing only in the method of logBB calculation, INCNS[logS_SD.6, logP_SD.6, logBB_MSU, hERG_SD.6] and INCNS[logS_SD.6, logP_SD.6, logBB_SD.6, hERG_SD.6] (Figure 5). On the basis of the aforementioned results, one might expect the calculation methods for this descriptor to correlate perfectly. However, logBB_MSU and logBB_SD.6 are essentially uncorrelated: R² = 0.04. The reason is the small contribution of the logBB descriptor to the INCNS DI: the DF causes only small changes in the desirability scores, so this descriptor is of minor importance. Although the logBB values calculated by the two models are uncorrelated, the values of the corresponding INCNS DIs are almost the same.
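The effect of a low-importance descriptor can be reproduced on synthetic data. In the sketch below everything is an assumption made for illustration: the two "logBB methods" are independent random numbers, the DI is taken as a simple product of two desirability scores, and both DFs are invented. Spearman's coefficient is computed as the Pearson correlation of average ranks.

```python
import math
import random


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def logp_score(logp):
    """Hypothetical high-importance DF: steep sigmoid over the data range."""
    return 1.0 / (1.0 + math.exp(1.5 * (logp - 3.0)))


def logbb_score(logbb):
    """Hypothetical low-importance DF: scores stay between 0.88 and 0.92."""
    return 0.9 + 0.02 * math.tanh(logbb)


random.seed(0)
n = 500
logp = [random.uniform(-1.0, 6.0) for _ in range(n)]
logbb_a = [random.uniform(-1.5, 1.5) for _ in range(n)]  # stand-in for method A
logbb_b = [random.uniform(-1.5, 1.5) for _ in range(n)]  # independent stand-in for method B

di_a = [logp_score(p) * logbb_score(b) for p, b in zip(logp, logbb_a)]
di_b = [logp_score(p) * logbb_score(b) for p, b in zip(logp, logbb_b)]

r2_logbb = pearson(logbb_a, logbb_b) ** 2
rs_di = spearman(di_a, di_b)
print(f"R^2 between logBB values: {r2_logbb:.3f}")
print(f"r_s between DI rankings:  {rs_di:.3f}")
```

Because the logBB score can move the product by only a few percent, the ranking is governed almost entirely by the other descriptor, even though the two logBB inputs share no information.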

Figure 5: Influence of a descriptor with low importance (logBB) on rank correlation for the GSK1904 dataset: (a) rank correlation of INCNS DIs calculated using the default StarDrop methods and different logBB descriptors; (b) logBB DF (red line) and distribution of the logBB values calculated by the logBB_SD.6 (black) and logBB_MSU (green) methods; (c) correlation of the logBB values calculated by the logBB_SD.6 and logBB_MSU methods.

On the other hand, when the importance of a descriptor is high and the SP is represented by step functions, even a small deviation of a descriptor value can significantly change the DI value (Figure 6). For example, the logP values for the GSK1904 dataset are well correlated (R² = 0.75), but the rank correlation of the DIs based on step functions is too low for them to be used interchangeably: r_s = 0.86 between DIs differing only in the logP calculation scheme, INCNS[logS_SD.6, logP_SD.6, logBB_SD.6, hERG_SD.6] and INCNS[logS_SD.6, logP_XLOGP3, logBB_SD.6, hERG_SD.6]. Thus, if a descriptor contributes significantly to an SP, it has to be calculated by the method used in the development of this SP; otherwise the DI ranking becomes unstable. When the importance of a descriptor is negligible, one method of descriptor calculation can be exchanged for another even if they correlate poorly, although the usefulness of such a descriptor in an MPA scheme is questionable. Perhaps an SP using descriptors of such minor importance should be refined, because even strong fluctuations of the descriptor values do not affect the outcome.

Figure 6: Influence of a descriptor with high importance (logP) on rank correlation for the GSK1904 dataset: (a) correlation of the logP values calculated by the logP_SD.6 and logP_XLOGP3 methods; (b) logP DF (red line) and distribution of the logP values calculated by the logP_SD.6 (black) and logP_XLOGP3 (green) methods; (c) rank correlation of INCNS DIs calculated using the default StarDrop methods and different logP descriptors.

Variability of descriptor values in a dataset. Different methods of descriptor calculation may be fully interchangeable if they provide values that fall in a flat or almost flat region of a DF. For example, the rank correlation of the INCNS[logS_SD.6, logP_SD.6, logBB_SD.6, hERG_SD.6] and INCNS[logS_SD.6, logP_JChem, logBB_SD.6, hERG_SD.6] DIs for the GDB17LL 1500 dataset is good (r_s = 0.92, Figure 7g) because almost all logP values are transformed into the same desirability value (Figure 7e). However, for the GDB17 1500 compounds, which span a wider range of logP values, the rank correlation of the same DIs is dramatically worse (r_s = 0.42). A similar situation occurs for the QED SP, since the logP values are close to the maximum of the DF.
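The role of the value range can be made concrete by asking, for a given dataset, what fraction of compounds receives the same desirability score from two descriptor implementations. The sketch below uses a hypothetical step DF and two tiny invented datasets; none of the numbers come from the GDB17 sets.

```python
def step_df(logp):
    """Hypothetical step DF with illustrative cut-offs."""
    if logp <= 3.0:
        return 1.0
    if logp <= 5.0:
        return 0.5
    return 0.1


def score_agreement(values_a, values_b, df):
    """Fraction of compounds given the same desirability score by both methods."""
    same = sum(1 for a, b in zip(values_a, values_b) if df(a) == df(b))
    return same / len(values_a)


# Narrow-range dataset: the methods disagree, but every value lies on one plateau.
narrow_a = [1.0, 1.5, 2.0, 2.5]
narrow_b = [0.4, 2.1, 1.2, 2.9]

# Wide-range dataset: disagreements of similar size straddle the cut-offs.
wide_a = [2.5, 3.5, 4.8, 5.5]
wide_b = [3.2, 2.8, 5.3, 4.9]

narrow_agreement = score_agreement(narrow_a, narrow_b, step_df)
wide_agreement = score_agreement(wide_a, wide_b, step_df)
print(f"narrow range: {narrow_agreement:.2f}")
print(f"wide range:   {wide_agreement:.2f}")
```

A check of this kind says nothing about which implementation is more accurate; it only indicates whether the choice can matter for a particular dataset and profile.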

Influence of specific descriptor implementation

Continuous descriptors. The method of descriptor calculation may affect the final DI score even when a smooth sigmoid or double-sigmoid DF is used. This can be illustrated by comparing rankings based on different methods of logP calculation (Figure 8a). The overall rank correlation coefficients are good (r_s not less than 0.85) but vary depending on the descriptor calculation methods. The distributions of logP values calculated by different methods are similar for most of them, except logP_DLT.X and logP_DLT.A (Figure 8b). Larger differences between logP value distributions lead to lower correlation between the DI rankings based on these methods. The lowest rank correlation coefficients correspond to the logP_DLT.A method. It is one of the earliest methods for lipophilicity prediction, and its accuracy was limited by the lack of training data. A similar but less prominent drop is observed with logP_DLT.X. Both models are atom-based, but the latter includes correction factors for some intramolecular interactions and certain adjustments proposed by the CDK developers. The quality of these models is insufficient, so they are generally not used nowadays. The remaining models (either substructure- or property-based) are considered state of the art (refs 58, 59) and correlate well with each other. The logP values provided by them are distributed similarly (Figure 8b), and the DIs based on these models also correlate well. Since these models are based on different approaches, it may be concluded that the result of DI calculation does not depend on the underlying assumptions.

Figure 7: Influence of descriptor value ranges on the DI value for the GDB17LL 1500 (a–d) and GDB17 1500 (e–h) datasets. (a, e) Correlation of logP values calculated by the logP_SD.6 and logP_JChem methods; (b, f) logP QED DF (green dotted line) and INCNS DF (blue dashed line) with the distribution of logP values calculated by the logP_SD.6 (green) and logP_JChem (black) methods; (c, g) rank correlation for the INCNS DIs; (d, h) rank correlation for the QED DIs. All descriptors except logP are calculated using StarDrop defaults.
[Values annotated in the panels: R² = 0.76 and 0.34; r_s = 0.92, 0.42, 0.94, and 0.99.]

Figure 8: Interchangeability of logP descriptors of the QED SP for the GSK3515 dataset. (a) Rank correlation matrix map with the logP methods indicated. (b) Distribution of logP values calculated by logP_SD.6 (black), logP_ALOGPS2 (green), logP_ALOGPS3 (light green), logP_JChem (orange), logP_XLOGP3 (blue), logP_DLT.X (pink dashed), and logP_DLT.A (purple dashed); QED DF (red dotted).

Discrete descriptors. Counts of different molecular features are present in the scoring profiles along with continuously distributed descriptors. These discrete descriptors also significantly influence the ranking results. For example, a strong influence of the nHBA descriptor on the DI ranking is obvious for CoCoCo nDL2 (Figure 9a). Different QED DI outcomes are obtained despite the use of smooth sigmoid DFs. As with the logP influence discussed above, this occurs because of the low correlation between the methods (Figure 9b). The nHBA values calculated by different schemes are very different and not correlated. As a result, the QED DIs based on them also do not correlate. The number of hydrogen bond acceptors is one of the discrete descriptors that take only a certain set of values. Other examples of such descriptors are the numbers of hydrogen bond donors or aromatic rings. They are determined by the structure of a compound and calculated directly from the molecular graph. Given that such descriptors may be defined in numerous ways reflecting a subjective understanding of the feature, the results of their calculation may be different (Figure 10) and not correlated (Figure 9b). This makes the compound assessment strongly dependent on the discrete descriptor calculation method and limits the interchangeability of such methods. In MPA, only the method used for training of the SP gives acceptable results.

Figure 9: Interchangeability of nHBA descriptors of the QED SP for the CoCoCo nDL2 dataset. (a) Rank correlation matrix map with the nHBA methods indicated. The nHBA count methods nHBA_DLT, nHBA_JChem, nHBA_SD.QED, and nHBA_SD.6 are cycled in every row and column. (b) Distribution of nHBA values calculated by nHBA_DLT (purple), nHBA_JChem (orange), nHBA_SD.QED (blue), and nHBA_SD.6 (black); QED DF (red dotted).
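How two reasonable definitions of a hydrogen bond acceptor can disagree on the same structure is easy to show on a toy example. In the sketch below the first rule is the familiar Lipinski-style count of all nitrogen and oxygen atoms; the second, stricter rule is hypothetical and does not correspond to any of the packages compared in this work. The `Atom` record and its `lone_pair_delocalized` flag are likewise invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element: str
    formal_charge: int = 0
    lone_pair_delocalized: bool = False  # e.g. amide or sulfonamide nitrogen


def count_hba_n_and_o(atoms):
    """Lipinski-style count: every nitrogen and oxygen atom is an acceptor."""
    return sum(1 for atom in atoms if atom.element in ("N", "O"))


def count_hba_strict(atoms):
    """Hypothetical stricter rule: skip cationic atoms and delocalized nitrogens."""
    count = 0
    for atom in atoms:
        if atom.element not in ("N", "O") or atom.formal_charge > 0:
            continue
        if atom.element == "N" and atom.lone_pair_delocalized:
            continue
        count += 1
    return count


# Heteroatoms of a toy molecule carrying a nitro group and a primary sulfonamide.
toy_heteroatoms = [
    Atom("N", formal_charge=1),             # nitro N+
    Atom("O", formal_charge=-1),            # nitro O-
    Atom("O"),                              # nitro O
    Atom("S"),                              # sulfonyl S, not counted by either rule
    Atom("O"),                              # sulfonyl O
    Atom("O"),                              # sulfonyl O
    Atom("N", lone_pair_delocalized=True),  # sulfonamide N
]

n_and_o_count = count_hba_n_and_o(toy_heteroatoms)
strict_count = count_hba_strict(toy_heteroatoms)
print(f"N and O count:    {n_and_o_count}")
print(f"stricter rule:    {strict_count}")
```

Since the QED DF for nHBA is evaluated at integer counts, a difference of two acceptors between definitions shifts every affected compound to a different point on the curve, which a continuous descriptor with a small numerical disagreement would not do.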

Inconsistency of DIs outside the descriptor applicability domain

Descriptor calculation methods are interchangeable if the values provided by different methods are close. However, methods that give similar values for one dataset may give different results for another. For example, the distributions of the logP values calculated by logP_SD.6, logP_ALOGPS2, logP_ALOGPS3, logP_JChem, and logP_XLOGP3 are very similar for the CoCoCo DL2 dataset but different for the CoCoCo nDL2 dataset (Figure 11). The same situation is observed for the distributions of nHBA values (Figure 12).

Figure 10: Example of difference in the nHBA calculation by different methods. Counts shown for the depicted structure: nHBA_SD.6 = 12, nHBA_SD.QED = 8, nHBA_JChem = 7, nHBA_DLT = 9.

0.6

300 0.4

200

0.2

100

0.8

400

0.6

300 0.4

200

0.2

100

0 0 −2 −1 0 1 2 3 4 5 6 7 8 logP

0 0 −2 −1 0 1 2 3 4 5 6 7 8 logP

(a)

(b)

0.8

400

0.6

300 0.4

200

0.2

100

500 Frequency

500

1 Desirability score

1

0.8

400

0.6

300 0.4

200

0.2

100

0 0 −2 −1 0 1 2 3 4 5 6 7 8 logP

Desirability score

400

500

Desirability score

0.8

Frequency

Frequency

500

1 Desirability score

1

Frequency

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

0 0 −2 −1 0 1 2 3 4 5 6 7 8 logP

(c)

(d)

Figure 11: Distribution of logP values of compounds from the CoCoCo DL2 (a), CoCoCo nDL2 (b), GDB17LL 1500 (c), and GDB17 1500 (d) datasets calculated by logP_SD.6 (black), logP_ALOGPS2 (green), logP_ALOGPS3 (light green), logP_JChem (orange), and logP_XLOGP3 (blue). QED DF (green dotted line), INCNS and OCNS DFs (red dot-dashed line).

The reason lies in the content of the datasets. The CoCoCo DL2 dataset contains 2 compounds with MW125 Å. Structures with such characteristics are less likely to be involved in the creation of descriptor calculation schemes. Since QSAR models demonstrate better predictivity for compounds belonging to their applicability domain, the descriptor values for compounds outside the applicability domain differ and correlate poorly.

Figure 12: Distribution of nHBA values of compounds from the CoCoCo DL2 (a), CoCoCo nDL2 (b), GDB17LL 1500 (c), and GDB17 1500 (d) datasets calculated by nHBA_SD.6 (black), nHBA_SD.QED (cyan), nHBA_JChem (orange), and nHBA_DLT (purple). QED DF (green dotted line).

A similar result was observed for the GDB17 datasets. Although the GDB17 1500 logP values computed by different methods were distributed similarly, they were consistently lower than ordinary drug-like values (Figure 11d) and could not be used interchangeably. On the other hand, logP values for the GDB17LL 1500 dataset fell within a narrow interval coinciding with the flat section of the DF, and were therefore interchangeable because they are transformed into close desirability values. Thus, the decrease of rank correlation upon use of different descriptor implementations appears not only for large MW molecules but also for small ones, and depends on the models' applicability domain.

As a result, the interchangeable use of different methods for descriptor calculation is highly dependent on the nature of the compounds. A consistent multi-parameter assessment may be obtained only for compounds belonging to the common drug-like (e.g., Lipinski-compliant) chemical space. The multi-parameter assessment of structures that lie outside the descriptor applicability domain is less reliable, regardless of the descriptor type (continuous or discrete). For such compounds the DI is unstable and varies strongly with the descriptor calculation technique.
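The role of the flat section of a DF, noted above for the GDB17LL 1500 dataset, can be sketched with a piecewise-linear desirability function. This is only an illustration under stated assumptions: the trapezoid shape, the function name `trapezoid_desirability`, its breakpoints, and the logP values are hypothetical and do not reproduce the QED, INCNS, or OCNS DFs.

```python
def trapezoid_desirability(x, lo_zero, lo_one, hi_one, hi_zero):
    """Piecewise-linear desirability function (hypothetical shape).

    Returns 0 outside (lo_zero, hi_zero), 1 on the flat section
    [lo_one, hi_one], and a linear ramp in between.
    """
    if x <= lo_zero or x >= hi_zero:
        return 0.0
    if x < lo_one:
        return (x - lo_zero) / (lo_one - lo_zero)
    if x <= hi_one:
        return 1.0
    return (hi_zero - x) / (hi_zero - hi_one)


# Hypothetical breakpoints for a logP desirability function.
params = dict(lo_zero=-1.0, lo_one=1.0, hi_one=3.0, hi_zero=5.0)

# Compound 1: both implementations place logP on the flat section.
flat_a = trapezoid_desirability(1.5, **params)
flat_b = trapezoid_desirability(2.5, **params)

# Compound 2: the same 1.0 log unit disagreement, but on the ramp.
ramp_a = trapezoid_desirability(3.5, **params)
ramp_b = trapezoid_desirability(4.5, **params)

print(flat_a, flat_b, ramp_a, ramp_b)
```

The same disagreement between implementations leaves the desirability unchanged on the flat section but changes it substantially on the ramp, which is why interchangeability depends on where the compounds of a dataset fall relative to the DF.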

Conclusions

A ranking based on a multi-parameter assessment approach is a convenient tool for compound assessment in drug design. It allows one to analyze simultaneously a multitude of different descriptors representing various characteristics of the compounds in order to reach their desired balance for a certain task. We analyzed the feasibility of using different methods of descriptor calculation interchangeably in predesigned SPs. We investigated the stability of compound assessment determined by profiles designed for drug-likeness and for oral and intravenous CNS availability. The influence of the DF type, the continuity or discreteness of descriptors, and their applicability domain was studied.

When different methods of descriptor calculation are used, compound assessment is highly dependent on the accuracy of assessment profile determination and representation. Thus, the representation of the functions forming an SP affects the assessment of compounds and hence their ranking. Using smooth functions is highly recommended in multi-parameter assessment. Using non-smooth step functions greatly reduces the stability of assessment and ranking. Accounting for the uncertainty of descriptor calculation may slightly improve the assessment results in this case, but will not be sufficient for reliable use.

We have shown that descriptors having low importance in SPs do not make a significant contribution to an MPA ranking. Therefore, in such cases the choice of descriptor calculation method and significant differences in descriptor values have almost no influence on the results.

Our findings indicate that there is no difference between the current state-of-the-art methods of calculation of continuously distributed descriptors, such as logP, in the multi-parameter assessment paradigm. These calculation schemes are largely interchangeable. On the other hand, calculation schemes for discrete count-based descriptors, such as nHBA, give variable results. Therefore, only the method originally used for training the SP gives acceptable results. We also demonstrated the inconsistency of ranking when the evaluated compounds do not belong to the common drug-like chemical space. The corresponding DI values and ranks are unstable and vary with the method of descriptor calculation regardless of the structure of the SP.
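The recommendation to prefer smooth functions over step functions can be illustrated with a toy comparison of rank correlations. The sketch below is not the procedure or data of this study: it scores a single descriptor instead of a full SP, the step and logistic DFs and their parameters are hypothetical, the logP values are invented, and the Spearman coefficient is computed as the Pearson correlation of average ranks so that ties produced by the step function are handled.

```python
import math


def average_ranks(values):
    """Rank values (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)  # undefined if all values in x or y are tied


def step_df(x, threshold=5.0):
    """Hypothetical step DF: fully desirable up to the threshold."""
    return 1.0 if x <= threshold else 0.0


def smooth_df(x, midpoint=5.0, width=1.0):
    """Hypothetical smooth (logistic) DF centred on the same threshold."""
    return 1.0 / (1.0 + math.exp((x - midpoint) / width))


# Invented logP values for five compounds from two implementations.
impl_a = [4.2, 4.8, 4.9, 5.1, 5.6]
impl_b = [4.4, 5.1, 4.7, 4.9, 5.8]

rho_step = spearman([step_df(v) for v in impl_a], [step_df(v) for v in impl_b])
rho_smooth = spearman([smooth_df(v) for v in impl_a], [smooth_df(v) for v in impl_b])
print(f"step DF: {rho_step:.3f}, smooth DF: {rho_smooth:.3f}")
```

With these invented values, small disagreements near the threshold flip the step score between 0 and 1 and reduce the rank correlation much more than under the smooth function, which is consistent with the qualitative conclusion stated above.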

Acknowledgement

The authors thank Optibrium Ltd. for providing a free trial StarDrop license for this study. We are grateful to Matthew Segall for providing helpful advice and support in StarDrop usage. We are grateful to the ChemAxon company for kindly providing the academic licenses for the software for structure data management, search, and prediction. This work was supported by the Russian Foundation for Basic Research (project 15–03–09084).

Abbreviations

MPA, multi-parameter assessment; QED, Quantitative Estimate of Drug-likeness; SP, scoring profile; DI, desirability index; DF, desirability function; ASNN, Associative Neural Network; ANN, Artificial Neural Networks; RBF, Radial Basis Function.

Supporting Information

File SI1: pdf file with a description of the datasets and methods;
File SI2: csv file containing the investigated datasets;
File SI3: pdf file with colored correlation matrix maps of ranks for the QED DIs (Figure S2–S33), INCNS DIs (Figure S34–S65), and OCNS DIs (Figure S66–S97);
File SI4 and File SI5: py files containing the Python 3 script for drawing the correlation matrices based on csv files with INCNS/OCNS and QED DIs values;
File SI6: pdf file with graphs of pairwise comparison of descriptor values and their distributions.

This information is available free of charge via the Internet at http://pubs.acs.org
