In Silico Prediction of Volume of Distribution in ... - ACS Publications

Sep 7, 2016 - ABSTRACT: We present three in silico volume of distribution at steady state. (VDss) models generated on a training set comprising 1096 c...
1 downloads 10 Views 595KB Size
Article pubs.acs.org/jcim

In Silico Prediction of Volume of Distribution in Humans. Extensive Data Set and the Exploration of Linear and Nonlinear Methods Coupled with Molecular Interaction Fields Descriptors Franco Lombardo* and Yankang Jing Modelling, Computation and Molecular Properties Group, Biogen, 225 Binney Street, Cambridge, Massachusetts 02142, United States S Supporting Information *

ABSTRACT: We present three in silico volume of distribution at steady state (VDss) models generated on a training set comprising 1096 compounds, which goes well beyond the conventional drug space delineated by the Rule of 5 or similar approaches. We have performed a careful selection of descriptors and kept a homogeneous Molecular Interaction Field-based descriptor set and linear (Partial Least Squares, PLS) and nonlinear (Random Forest, RF) models. We have tested the models, which we deem orthogonal in nature due to different descriptors and statistical approaches, with good results. In particular we tested the RF model, via a leave-class-out approach and by using a set of 34 additional compounds not used for training. We report comparable results against in vivo scaling approaches with geometric mean-fold error at or below 2 (for a set of 60 compounds with animal data available) and discuss the predictive performance based on the ionization states of the compounds. Lastly, we report the findings using a two-tier approach (classification followed by regression) based on VDss ranges, in an attempt to improve the prediction of compounds with very high VDss. We would recommend, overall, the RF model, with 33 descriptors, as the primary choice for VDss prediction in humans.



other words “high” does not mean “good” or “problem solved”, and “low” does not mean “poor” or “problematic efficacy”. Furthermore, VDss is largely determined by physicochemical properties, as repeatedly shown by in vitro2,3,9and in silico models,4−7 as well as analysis of trends reported by Obach et al.10 and based on computed parameters for a large and diverse set of compounds. Binding dictates a large proportion of the VDss outcome, although transporters may be invoked, and an analysis by Waters and Lombardo,11 utilizing the Øie-Tozer equation behavior, pointed out this aspect. In several cases the application of this approach, empirical but tested by many VDss prediction applications,12 yielded an aberrant fraction unbound in tissues (1 < f ut < 0). These findings illustrated that some influence, especially on anti-infectives, may be exerted by transporters even though Smith et al.1 tend to consider it minimal. In general it may be agreed upon the fact that physicochemical properties, translating into binding and passive diffusion, largely determine the outcome of a compound VDss value. The VDss value thus cannot be generally categorized as “good” or “bad”, while it could be categorized as low, moderate, and high in comparison, for example, to total body water estimated to be, in mammals, about 0.6−0.7 L/kg. The general lack of physical reality of VD is clearly exemplified, and very

INTRODUCTION Volume of distribution, in the form of volume of distribution at steady state (VDss), is essentially a proportionality constant that connects the amount (dose) of a drug in the body to its concentration found in plasma and tissues, generally with plasma as the main compartment examined. VDss, together with clearance, determines the frequency of dosing since they both determine the half-life of a compound in plasma (t1/2), when the terminal phase VD (VDβ) is used, or the mean residence time in the body (MRT) if VDss is used. Several misconceptions, however, surround its use, and Smith et al.1 have very recently published a thorough perspective article on the subject. We summarize some of the points raised by the authors as an introduction to the application and physicochemical determinants of VDss, while other references on the subject of application, physicochemical determinants, and prediction of VDss2−8 may also be consulted. The first point Smith et al.1 illustrated is the aforementioned role of VDss in determining, with clearance, the frequency of dosing, and a compound with a lower VDss may have to be dosed more frequently (at a lower individual dose) than a compound with a larger VDss. We would like to add that it is important not to consider a certain threshold value, typically expressed in liters per kg of weight, and label a compound “good” or “bad” depending on such threshold. There is no guarantee that a compound will reach a site of action and be pharmacologically active or not, on the basis of its VDss. In © 2016 American Chemical Society

Received: January 29, 2016 Published: September 7, 2016 2042

DOI: 10.1021/acs.jcim.6b00044 J. Chem. Inf. Model. 2016, 56, 2042−2052

Article

Journal of Chemical Information and Modeling Table 1. List and Type of the 40 Descriptors Used in This Worka descriptor

type

model

[S;s] total cationic charge total anionic charge species abundance CW7-8b WN6b WO1-4b W7-8b %FU4-10c DIFF PSA logD 8.5 logD 9.5 logD 10.5 logD 11 logD 11.5 logD 12 logD 12.5 logD 8

SMARTS, aliphatic and aromatic S total positive charge (+) at pH 7.4

RF_33 RF_33, PLS_32

total of negative charge (−) at pH 7.4 abundance of charged species at pH 7.4 hydrophilic volume per surface unit H-acceptor regions-amide NH probe H-bond donor regions-carbonyl O probe polar and strong HBD and HBA regions- H2O probe percentage of neutral species at each pH diffusivity (modified Stokes−Einstein equation) polar surface area calculated logD at pH 8.5 calculated logD at pH 9.5 calculated logD at pH 10.5 calculated logD at pH 11 calculated logD at pH 11.5 calculated logD at pH 12 calculated logD at pH 12.5 calculated logD at pH 8

logD 9

calculated logD at pH 9

logD 10

calculated logD at pH 10

WN5b

H-acceptor regions-amide NH probe

WN1b logD 5 logD 6 logD 7.5 DRDRAC DRDRDO FLEX

H-acceptor regions-amide NH probe calculated logD at pH 5 calculated logD at pH 6 calculated logD at pH 7.5 3D pharmacophoric descriptors: dry, dry, acceptor 3D pharmacophoric descriptors: dry, dry, donor molecular flexibility: distance between atom ’i’ in one chosen conformer of 50 and atom ’i’ in all other 49 random conformers

RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32 RF_33, PLS_32, PLS_11 RF_33, PLS_32, PLS_11 RF_33, PLS_32, PLS_11 RF_33, PLS_32, PLS_11 PLS_11 PLS_11 PLS_11 PLS_11 PLS_11 PLS_11 PLS_11

a A more detailed explanation can be found at http://www.moldiscovery.com/software/vsplus/. bIncreasing number after descriptor refers to higher interactive energy. cCalculated at each unit in the interval reported.

This set of data covers a physicochemical, pharmacological, structural, and naturally VDss space which is unprecedented, to the best of our knowledge. These data, available as Supporting Information with the original references and annotations, could also be used to attempt to further the accuracy of VDss prediction methods and to study its physicochemical underpinnings in greater detail. In doing so we also kept an eye on the simplicity, and we considered the physical meaning and interpretation of descriptors. The latter attribute, perhaps at odds with other authors in the field who may term certain in silico models as “black boxes”, is not an absolute or even a main requirement when an accurate and rapid prediction is achievable. We offer the argument that even apparently “simple” multiple linear regressions may contain descriptors that are uninterpretable in “synthetic terms” by medicinal chemists and that chemical translatability may be inferred by the results utilizing diverse structures even when an explicit interpretation “descriptor to structure” is not possible. We do consider, as an added value, the “portability” and simplicity of descriptors calculation, and we prefer, but not at the cost of lowering the performance significantly, to maintain a general uniformity on the software package used. We similarly consider as valuable its applicability within fairly large

directly, by its extremely large range, which grossly exceeds, in many cases, any physically realistic volume. The VDss data in this work range from a very low value 0.04 L/kg (indocyanine green), to an extremely high one 700 L/kg (hydroxychloroquine). That is nearly 5 orders of magnitude. Nevertheless, all the compounds used in this work have reached, at a minimum, Phase I clinical stage, with a great many number of them going forward and into clinical practice. This observation attests to the fact that they were not considered good or bad, solely on the basis of VDss which was certainly examined preclinically. Smith et al.,1 on the subject of prediction, modulation, and application, stated that structural modifications can be applied to increase or decrease VDss and the duration of action while maintaining clearance. They cautioned the reader, however, that modulation of VDss, largely through charge and lipophilicity changes, is not straightforward. Thus, the availability of fairly accurate in silico methods, which are able to (i) guide medicinal chemistry efforts by providing a very rapid answer and (ii) operate in virtual mode or, in general, (iii) without recourse to any experimental data and prior to synthesis, as well as a filter for library design, is a welcome addition to the armory of in silico drug design tools, and the exploration of its limits of applicability and accuracy is a worthwhile effort. 2043

DOI: 10.1021/acs.jcim.6b00044 J. Chem. Inf. Model. 2016, 56, 2042−2052

Article

Journal of Chemical Information and Modeling computation and IT networks, such as the ones generally present in industry.

begin our discussion with Table 2, where we report the results of our LCO testing.

RESULTS AND DISCUSSION We begin by reporting our initial results, based upon a single model approach vs the approach discussed later of using a classification scheme, before quantitative prediction, in a twotier modeling effort. All methods were applied across all compounds, which were obtained utilizing a very large and carefully assembled data set comprising 1096 compounds as the overall training set, on the basis of stringent data quality criteria and through the use of primary references. They are basic requirements for any modeling effort, and they have been at the very core of our long-standing interest in modeling human PK data.2−5,8,10−14 Single Model Approach. The selection of descriptors, detailed in the Experimental Section, led to two sets which we would define “orthogonal”, since there are only 4 continuous descriptors common to the two sets of 33 and 11 descriptors respectively, which were examined across models. Furthermore, on the basis of elimination and testing process we identified three models, characterized by different statistical methods: (i) a method based on PLS statistics but with only the set of 11 descriptors (model “PLS_11”) and 6 latent variables, (ii) a Partial Least Squares (PLS) model utilizing 6 latent variables and the same 32 continuous descriptors as in the RF model (model “PLS_32”), and (iii) a Random Forest (RF) model utilizing 32 continuous molecular interaction fields descriptors plus one binary descriptor SMARTS [S;s] (model “RF_33”), as reported in Table 1. We reasoned that especially the first and the third models could be considered “orthogonal” to each other because, while sharing a common training set, they only share a few descriptors (3 logD values in the pH range of 8−10 plus a descriptor related to the compound H-bond acceptor character). They also utilize different statistical methods, PLS being a linear method, and RF being a nonlinear method. In addition, we have identified a reasonably small set of descriptors, considering the vast range of structural diversity, well beyond the confines of the Rule of 5, spanning over 1,000 compounds and, as mentioned, nearly 5 orders of magnitude in VDss values. In these efforts we utilized the entire range of compounds, regardless of their actual VDss value. Validation Using Leave-Class-Out (LCO). We have adopted, as in past work,3,5 the leave-class-out (LCO) method which ensures that molecules in the test set are not analogs of compounds in the training set. This method seeks to validate a model by removing all analogs in a class, building a new model, and then predicting the (close) analogs in that class. This approach may be “pessimistic” toward the estimate of prediction accuracy, as suggested by Sheridan,15 since it will have no structural analogs of a given compound in the training set used to predict the new one. We think, however, that both a random or rational selection (the latter based on clusters),16 which will contain analogs even in a cross-validation scheme, may both offer an optimistic view of the estimate of prediction, as pointed out by Sheridan.15 This is part of the reason why we prefer to test our models with a LCO approach, rather than via a random or rational selection. The additional aim is to try to reproduce the challenge of a model eventually called upon to predict relatively close analogs, without the benefit of human data on those analogs in the training set. Armed with these considerations we

Table 2. Leave-Class-Out Statistics for the Three VDss Models



LCO classa α-adrenergics (17) benzodiazepines (24) β-adrenergics (30) β-lactams antibiotics (cephalosporins) (40) β-lactams antibiotics (penams and penems) (37) calcium channel blockers (10) fluoroquinolones (20) morphinans (14) NSAIDs (17) nucleotides/nucleosides (41) steroids (34) tricyclic and tetracyclic antidepressants (14) overall GMFE (298) % < 2-fold % < 3-fold

consensus PLS_11 and RF_33

PLS_11

PLS_32

RF_33

2.51 2.03 1.76 2.08

2 2.01 1.78 1.94

1.84 1.88 1.7 1.64

2.04 1.91 1.71 1.77

1.73

1.74

1.42

1.44

2.44

2.54

2.16

2.27

1.67 2.01 2.65 2.04

1.92 2.1 2.49 2.25

3.08 2.27 1.75 2.35

2.21 2.07 2.12 2.16

2.27 3.22

2.16 3.69

2.38 2.24

2.3 2.68

2.08 55% 79%

2.08 57% 89%

1.96 62% 80%

1.96 61% 83%

a

The number of compounds in each class is given in boldface in parentheses.

Sheridan suggested a “time-split” as a useful type of crossvalidation, using test data which would have been generated by a screening group, after the model was built,17 even though it will contain analogs, by definition, in an industrial setting. The particular nature of human data, however, is such that it can only be assembled through careful scrutiny of literature reports to have a large and diverse test set. The result is that it does not allow the adoption of a “time-split” approach while either a LCO or a random or rational selection may be used. It is not possible to assemble a large and diverse proprietary data set with i.v. human data, under any circumstances, form a single source. A recent initiative stemming from an industrial consortium, and encompassing several major pharmaceutical companies, could only report on a relatively small set of 18 compounds with human i.v. data,18where the aim of that work was an assessment of human PK scaling methods from animal and in vitro data. Clearly, no proprietary structure could be disclosed, making any computational work impossible. All three models performed fairly well on a relatively large (total) test set of 298 compounds counting all analogs across all classes, with a 10% higher accuracy (GMFE 1.96 vs 2.08) for the RF_33 vs either PLS_32 or PLS_11. In all these cases the prediction was performed by a model which did not have any of the analogs in each class in the respective training set. However, the RF_33 model had a below average performance when predicting fluoroquinolones (GMFE > 3), while PLS models performed much better on the fluoroquinolones (GMFE of 1.67 and 1.92, respectively). The two PLS models yielded a GMFE higher than 3 when predicting tricyclic and tetracyclic antidepressant (3.22 for PLS_11 and 3.69 for PLS_32 respectively), but the RF_33 model performed significantly better on tricyclic and tetracyclic antidepressants, with a GMFE 2044

DOI: 10.1021/acs.jcim.6b00044 J. Chem. Inf. Model. 2016, 56, 2042−2052

Article

Journal of Chemical Information and Modeling Table 3. Analysis of Performance According to Charge Type for the Random Forest Method (OOB_RF_33)

GMFE % < 2-fold % < 3-fold max FEa compdb a

all

cationic

anionic

neutral

zwitterionic

1096 2.1 61 76 95 kahalalide F

407 2.2 58 74 95 kahalalide F

227 1.9 66 82 42 maxipost

354 2.3 62 76 51 YM155

108 1.9 64 80 16 tigecycline

Maximum fold error. bCompound yielding the maximum fold error.

of 2.24. We note that this behavior is reminiscent of the performance of the PLS model using 11 descriptors in the work of Berellini et al.,5 (GMFE 3.0, N = 8) for the antidepressant class. At the same time the RF algorithm was also adopted in that paper to build models. In that case the RF model was built using with a very large number of descriptors, totaling 280 from different sources, and yielding a GMFE of 3.7 (N = 12) for fluoroquinolones, while the PLS model (11 descriptors) reported in that work yielded a GMFE of 1.9 for the same class of compounds. The antidepressants were predicted better by the RF model and worse by the PLS model (11 descriptors) reported by those authors. Thus, the difference in the number of compounds in the respective classes between the present and previous report seems immaterial. Furthermore, a similar behavior was noted in the work reported by Lombardo et al.,4 where a RF model did show a significantly better prediction for antidepressants (GMFE = 1.85) vs fluoroquinolones (GMFE = 2.94), although it was built on a much smaller data set and with classes having significantly fewer analogs. It could be speculated that the involvement of potential transporters may have a differential impact in different classes of compounds,19 and those effects, assuming they were actually present, would likely not be modeled well. However, this is a possibility which would be hard to demonstrate, and it may be equally operating for other compounds, even though well predicted by the models, given the structural promiscuity of transporter recognition. Waters and Lombardo11 did show, through the use of the Øie-Tozer equation, that aberrant results may fairly frequently be seen in the estimation of fraction unbound in tissues, f ut, and this finding may well be ascribed to the involvement of transporters. They pointed out the possibility of such effects for many anti-infective drugs, but those compounds, at least in the cited work, were shown to largely belong to beta-lactam antibiotics. The latter classes, however, represented by cephalosporin or penams and penems antibiotics, are predicted well across all methods in the present work as well as in the two previous papers cited. Thus, transporter involvement may not necessarily be a reason for the different behavior or be only sporadically present. We also recognize that we have utilized a single conformation, in gas phase, and flexibility, as a descriptor, was only utilized in the PLS_11 model. For very complex molecules, with multiple functional and polar or nonpolar groups, the possibility of conformational changes may be another “lurking” variable. This may have to be explored with an ensemble of conformations and perhaps a distribution function and with or without solvation contributions. In the case of fluoroquinolone antibiotics, we note that the distribution of error is higher for RF_33 than for PLS_11 or PLS_32, although sparfloxacin and gemifloxacin, the analogs with the largest error, are predicted quite poorly by all three models. Upon elimination of the three compounds with the

largest error (lomefloxacin as the third compound) the RF_33 model would still yield a GMFE of about 2.74. In other words, they are not contributing an exceptionally large error. Furthermore, a test based on out-of-bag (OOB) prediction for charge classes yielded a fairly good prediction, and the results for fluoroquinolones could be ascribed to a structural class effect. Zwitterionic compounds are overall predicted quite well, as shown in Table 3, but sparfloxacin and gemifloxacin were predicted with a fold error > 3. This is observed despite the presence of many other analogs which would likely be present in at least a fraction of the trees which had not been trained on those two compounds. The analysis of charge classes is generally useful because it offers an assessment of the performance of the model and of its applicability toward the (charge) class of a compound, which could reliably be estimated or determined via in silico or in vitro physicochemical methods. We observed that each class has a very large outlier, with kahalalide F among all compounds and in the neutral class, yielding a fold error of 95. All in all the prediction is showing a GMFE of 2.1 across classes (all compounds from OOB statistics), which represents a good performance without serious deviations from class to class. Lastly, on the topic of structural features of fluoroquinolones, it has been suggested, on the basis of molecular dynamics simulations, that these molecules may cross the membranes stacked in an antiparallel arrangement, thus reducing charges and polarity.20 In fact, in the model PLS_11 there are no explicit charge terms in the descriptor space, and thus those zwitterionic molecules would not have a positive and negative charge term, which may be consistent with the suggested stacked arrangement. We have also, in the course of this work, attempted to use specific fragments such as pyridines or atoms such as F as additional (binary) descriptors (data not shown), but no improvement was noted. Regardless of a possible explanation based on the in vivo behavior of sparfloxacin and gemifloxacin, we have treated the compounds as charged in the RF_33 or PLS_32 models and did not remove them from any statistics as possible outliers. The tricyclic and tetracyclic antidepressant class, on the contrary, is predicted quite well by RF_33 but above 3-fold error for either PLS_11 or PLS_32. Levoprotiline and maprotiline are predicted quite poorly (fold error >6) by all methods, but they show as very significant outliers in the statistics for model RF_33, as the next highest fold error (FE) belongs to nortriptyline with a fold error of 3.03. This compound is also the next worst outlier for PLS_11 and PLS_32 but with a fold error >6. The same behavior was reported by Berellini et al.5 with the RF model predicting this class better than the corresponding PLS model based on 11 descriptors and using 6 latent variables. The other PLS model in the cited paper, with 95 descriptors from various sources and 5 latent variables, did predict this class well with a GMFE of 1.7. 2045

DOI: 10.1021/acs.jcim.6b00044 J. Chem. Inf. Model. 2016, 56, 2042−2052

Article

Journal of Chemical Information and Modeling Table 4. Comparative Analysis of Methods Performance Using an External Test Seta predicted VDss (L/kg) compound

CAS no.

VDss_h (L/kg)

RF_33

PLS_11 (this work)

PLS_11 (2009)b

PLS_32

consensus RF_33 and PLS_11

16-acetyl gitoxin 3-O-demethylfortimicin A ACHT 4-10 aspoxicillin astromicin cannabinol carocainide cefuzonam CG200745c CGP15720A CGP31608 cimetropium CPW 86-363 dihydroergosine dixyrazine dobesilic acidc edoxaban epothilone folate glibornuride glyceryl 1-nitrate GPX-150c guanylcisteine heptaminol iopentol iotrolan metaproterenol micronomicin pinacidil pipemidic acid proquazone tedizolid temocillin vestipitant viqualine GMFE % < 2-fold % < 3-fold max FEd

7242-07-1 74842-47-0 4037-01-8 63358-49-6 55779-06-1 521-35-7 66203-00-7 82219-78-1 936221-33-9 73998-69-3 107740-67-0 150521-16-7 84080-55-7 7288-61-1 2470-73-7 88-46-0 480449-70-5 958646-17-8 26944-48-9 624-43-1 236095-29-7 40454-21-5 372-66-7 89797-00-2 79770-24-4 586-06-1 52093-21-7 60560-33-0 51940-44-4 22760-18-5 856866-72-3 66148-78-5 334476-46-9 72714-74-0

0.78 0.27 0.73 0.27 0.23 16.98 0.80 0.20 0.45 0.35 0.23 1.12 0.22 4.02 4.06 0.12 1.53 0.09 0.27 0.76 15.45 0.25 2.14 0.20 0.23 6.03 0.15 1.10 0.88 0.52 0.91 0.11 3.80 24.40

1.27 0.38 0.26 0.25 0.39 5.84 1.76 0.25 2.80 0.77 0.44 1.12 0.17 1.63 3.29 0.48 1.72 0.19 0.58 0.75 2.96 0.71 3.68 0.42 0.34 1.46 0.43 1.05 1.03 3.10 1.08 0.21 1.45 10.94 2.0 50 85 6

1.14 1.15 0.42 0.22 1.44 2.49 2.29 0.67 1.63 0.68 0.51 0.57 0.26 3.47 2.94 0.56 1.50 0.05 0.37 0.95 1.52 0.60 1.55 0.57 0.17 2.25 1.84 1.25 1.38 1.31 1.44 0.26 3.38 4.04 2.3 53 73 12

1.28 0.75 0.20 0.22 0.94 2.71 2.37 0.58 2.56 0.72 0.47 0.58 0.10 2.83 3.20 0.78 1.37 0.02 0.44 1.19 1.32 0.62 1.38 0.53 0.20 2.19 1.32 1.22 0.81 0.97 1.40 0.08 3.45 3.68 2.4 47 73 12

0.74 3.07 1.11 0.28 3.31 2.03 2.53 0.35 2.94 1.11 0.73 0.45 0.15 3.16 3.32 0.43 1.51 0.04 0.34 0.75 1.68 1.83 1.44 0.46 0.07 2.74 2.63 1.36 1.03 1.43 1.48 0.18 3.00 3.41 2.6 47 62 18

1.20 0.32 0.71 0.33 0.30 3.64 2.13 1.27 1.74 1.20 0.56 0.82 0.37 2.55 3.11 0.52 1.61 0.12 0.47 0.85 2.24 0.66 2.61 0.49 0.26 1.86 1.14 1.15 1.21 2.20 1.26 0.23 2.41 7.49 2.5 56 71 8

a The set comprises compounds found in the literature after the work was started. None of the compounds was utilized in any training model. bThese predicted values were calculated by the method reported by Berellini et al.5 cThese compounds yielded a fold error >3 in all models. dMaximum fold error.

We have also examined the application of a consensus model by averaging the results of the RF_33 and PLS_11, as shown in Table 2. The resulting performance was generally superior for the RF_33 model. An advantage of the consensus model could be represented by the “orthogonality” of the two methods (different descriptors and algorithm), where a close agreement may offer a higher degree of confidence to the users as to the accuracy of prediction. We asked ourselves the question of whether an agreement between predicted values from each model within a 2-fold error, unbiased by any prior knowledge based on in vivo or in vitro data, could be interpreted as a more accurate prediction than in the absence of such agreement. That is, if a convergence between seemingly orthogonal models may offer an advantage. Unfortunately, that was not the case, and we could not establish a good correlation, linking a < 2-fold error between the two methods with an overall better prediction of the human value by the consensus model. We

conclude that the convergence of the two predictive methods, at least within a factor of 2 from each other, does not necessarily translate in higher confidence for the prediction. We also note that the RF_33 offers a significantly better performance than either PLS model with respect to nonsteroidal anti-inflammatory drugs (NSAIDs) with a GMFE of 1.75 against 2.65 and 2.49 for PLS_11 and PLS_32, respectively. This was also the case in the work of Berellini et al.5 where, with a number of compounds almost identical to our external test set, the RF model was reported performing better than either PLS models reported, even though they showed a smaller gap with the RF model than in this case. We wish to emphasize, in closing the discussion of the LCO approach, that the ensemble of the LCO models represents, in fact, an external test set on nearly 30% of the compounds in the training set. This is because each of the analog classes with N compounds was in turn removed, the model recast with 1096-N 2046

DOI: 10.1021/acs.jcim.6b00044 J. Chem. Inf. Model. 2016, 56, 2042−2052

Article

Journal of Chemical Information and Modeling

or GPX150 within a 3-fold error. The first two have a fairly low VDss of 0.45 and 0.12 L/kg, respectively, and thus they may be difficult to predict accurately. However, the model RF_33 returns an excellent prediction for aspoxicillin and cefuzonam, for example, with VDss values of 0.27 and 0.2 L/kg, respectively. All three models developed in the present work plus PLS_11, which was developed earlier, yielded a very good prediction for compound CPW 86-363 (experimental VDss 0.22 L/kg). The underestimation of the compounds having the three largest VDss values in this external set, by all methods, is puzzling. The RF_33 model does return the largest predicted value in each case, and within a factor of 3 in two out of three cases, but it does not yield what we would consider to be an accurate prediction, i.e. a fold error ≤2. It is, nevertheless, the most accurate of the 4 models tested overall, and it yielded a significantly lower maximum fold error value than all the others, coupled with the lowest GMFE. This is in apparent contrast with the good performance of the tricyclic and tetracyclic antidepressant class in the LCO, since these basic and lipophilic compounds generally have large VDss values. However, a close inspection of the class reveals that maprotiline (VDss 45 L/kg) and nortryptiline (VDss 22 L/kg) are predicted with a 6.4- and 3.0-fold error, respectively. Thus, even the better method may underestimate very large values of VDss. Comparison to Animal Scaling Model. We note that several methods based on in vivo animal data12 perform better than in silico models in charge class tests, as it may be expected, but the higher performing methods (with GMFE values of 1.3− 1.5) seem generally to require data in nonhuman primates or otherwise multiple species. Obviously, against a possibly higher performance, there is also a large difference in the amount of resources needed, aside from ethical considerations, between in silico and in vivo methods of human PK prediction. In vivo methods may therefore be reserved for compounds that have very strong profiles for progression into development, positioning in silico models as (i) corroborating orthogonal predictors of VDss from structure only, (ii) useful tools for virtual screening and simulations on structural changes, and (iii) firstline predictors to prioritize in vivo work on compounds not fully profiled. Following from the latter point regarding in vivo scaling methods, we turned our attention to a comparative analysis between in vivo scaling methods and our RF_33 model, in analogy with previous work on VDss4 using proprietary compounds, and to more recent modeling efforts, on clearance, using a larger set of literature compounds.14 The methods used for comparison are widely used animal scaling models, where of course in vivo and in vitro data (e.g., fraction unbound in plasma) have to be generated, while with in silico models all descriptors are generated from the compound structure only. We base this part of the discussion on Table 5, where we have listed the results of the RF_33, PLS_11, and PLS_32 models plus the consensus (average) between the first and second model. The four models examined for the animal scaling methods are the four best methods recommended by Lombardo et al.12 They comprise a multiple linear regression (MLR) method based on the logarithm of experimental VDss in rat and dog, which is similar to the method reported by Wajima et al.,21 but without a cross-term in our case, and the monkey proportionality method which is simply the value of the in vivo VDss in monkey from intravenous studies, initially indicated by Ward and Smith22 and later found to hold as one

compounds, and the prediction performed on the N compounds belonging to that class only. The results offer, at the same time, great structural diversity among classes but a high degree of similarity within one class and thus a high bar for predictive performance based on a very rugged approach as discussed in the first few paragraphs of this section. Furthermore, the total number of compounds, owing to the large data set, did not vary dramatically in each case and went from 1054 after removal of 41 nucleosides and nucleotides to 1086, after removal of calcium-channel blockers (dihydropyridines). This is, in our opinion, a favorable indication of the stability of the models. Validation Using External Data. We had, after this work had started, continued to search the literature to expand the data set, and we had found additional compounds, which we decided to use as an additional data set. The results, reported in Table 4, also include the performance of the PLS model based on 11 descriptors reported previously.5 The performance of the RF_33 is slightly better than all the other models with a GMFE value of 2.0 vs a range of 2.3 to 2.6. The PLS model with 11 descriptors reported previously yielded a 2.4 GMFE value. One of the first observations was that, among those compounds, metaproterenol (a β-adrenergic compound) should have been predicted accurately, considering that we had predicted quite well that class without the presence of any such compound in the LCO test (Table 2), and here the models would all have the entire set of 30 β-adrenergic compounds in the training set of 1096 values. The PLS_32 model yields a better result in this case but still above the 2-fold threshold, and all other models do worse, with the RF_33 returning a value with a 4.1-fold error. The training set used encompasses many compounds with a high to very high VDss (Figure 1), although overall a minority (102/1096 at or above

Figure 1. Distribution of compounds according to threshold values.

10 L/kg), and a very large structural diversity. All three of the compounds with a VDss above 10 L/kg (cannabinol, GPX-150 and viqualine), in this external test set, are underpredicted by all methods with fold errors varying from 2.2 (RF_33 for viqualine) to 7.1 (PLS_32 for viqualine). Furthermore, no method seemed to be able to predict CG200745, dobesilic acid, 2047

DOI: 10.1021/acs.jcim.6b00044 J. Chem. Inf. Model. 2016, 56, 2042−2052

Article

Journal of Chemical Information and Modeling Table 5. Comparative Analysis of VDss in Silico Prediction vs Animal Scaling Methodsa in vivo best methods GMFE % < 2-fold % < 3-fold max FE a

in silico methods

MLR

WAJIMA

monkey prop.

av of best methods

RF_33

PLS_32

PLS_11

consensus RF_33 and PLS_11

1.7 72 87 7

1.7 78 90 22

1.6 77 92 15

1.6 80 93 23

1.6 80 90 7

1.8 68 78 7

1.9 67 83 7

1.7 73 83 6

Same data set of 60 compounds for all methods. Data are from ref 12.

of the best methods by Lombardo et al.12 These methods (together with the Øie-Tozer approach which will be discussed later) were found to be the most accurate scaling methods based on animal data from multiple species. It can be seen that they are all well below a fold error of 2, although with a somewhat variable performance regarding their maximum fold error. Their average also shows a very good performance as it may be expected but with the largest maximum fold error. However, they achieve this performance on the basis of data obtained through laborious in vitro and in vivo animal studies. The in silico predictions are also very good, and they were obtained after removal of all 60 compounds from the data set, followed by rebuilding of each model and prediction. It should be noted, for accuracy, that animal-based models are of course ̈ toward any of the compounds administered. This is not “naive” the case for a computational method since, even the removal of the 60 compounds being discussed, from a newly trained model, does not ensure that the model would not have seen at least some analogs of the 60 compounds, with a large data set. Nevertheless, we note that the RF_33 model compares well with the average of best methods from in vivo studies, and it yields a lower maximum-fold error. The other two, although yielding a good GMFE, do not perform as well in terms of accuracy at the 2-fold error threshold, and thus we would lean toward the use of RF_33, with the caveat discussed above on the presence of analogs. We had to treat the Øie-Tozer method, Table 6, separately because the data set reported had only 38 compounds for this

performance of the models is also compared against this in vivo method which utilizes data from all three animal species listed. It is noted that the latter method does offer the best performance, but it may not be widely applicable, needing data in nonhuman primates as well as in rat and dog. The RF_33 model (again these 38 compounds were excluded from the training set) still performed the best since it (i) yielded a GMFE of 1.7, (ii) had a fairly high observed accuracy, at the 2fold error threshold (76%), (iii) was only slightly inferior to the in vivo model, and yielded (iv) a relatively low maximum-fold error of 7, although the in vivo model yielded a value of 3 for the latter. The other two in silico models did not perform as well although the PLS_11 may be viewed as being only slightly inferior to the RF_33 model. OOB Prediction of Compounds with High and Low VDss Values. We explored this aspect, searching for a possible solution, by looking at the out-of-bag (OOB) performance of the RF_33 model, for the training set, across compounds having a very large VDss, defined as being above 20 L/kg, and indeed we found that for the 38 compounds in that particular range, the GMFE for the RF_33 model was 8.3. It is likely that their low representation (38/1096) did not allow the models to train well on them. These very high values may be due to specific interactions with cells, membranes, and subcellular organelles, and the accurate prediction of these compounds may require more sophisticated and specific approaches, while yielding a doubtful outcome in any statistical model. Again, a conformational (and logD preference) may be imparting a recognition-binding feature that one conformer may not bring out. Hydroxychloroquine, the compound having by far the largest VDss in the set with a VDss of 700 L/kg, yields a fold error of 66.2. At the other end of the range of values we have examined the performance of the model against compounds with fairly low VDss. For compounds with a VDss value at or below 1 L/kg, a threshold slightly above the total body water of 0.6−0.7 L/kg, and which comprise about one-half of the 1096 compounds in the training set, we obtained a GMFE of 2.0 in the OOB test. It may be possible, by duplicating some or all of the compounds with very large VDss values, to “balance” the training set toward a more homogeneous distribution or to shift it, by choosing fewer low VDss compounds, to reach a diminished “bias” toward the low end. We have not attempted to test this hypothesis, with either RF or PLS models. An alternative approach may be to divide the compounds in classes, attempt to classify their volume accurately, and then use a quantitative model specific for each class, in a two-tier approach: i.e. a qualitative classification across a threshold value followed by two independent regression methods based solely on each set of compounds. Two-Tier Modeling Approach. The consideration that, (i) the RF_33 model utilizing all compounds, molecules with large VDss values (≥20 L/kg) yielded a very poor performance

Table 6. Comparative Analysis against the Øie-Tozer Methoda in silico methods

PLS_11

consensus PLS_11 and RF_33

1.5 79

1.7 76

1.8 66

1.9 71

1.7 71

95

89

82

84

89

3

7

7

6

6

Øie-Tozer (ratdog-monkey)a GMFE % < 2fold % < 3fold max FE

RF_33 PLS_32

a

Same test set of 38 compounds. Same 38 compounds utilized for the Øie-Tozer test in ref 12. Part of the 60 compounds set used for modeling efforts shown in Table 5.

method, which generally requires VDss and f up for several species, to maximize accuracy, and it utilizes species-dependent parameters. The reason for the use of a smaller set is that, for 22 out of the 60 compounds, an aberrant fraction unbound in tissues from animal ( f ut, necessary for its application) yielded aberrant results with a f ut either >1 or