Article pubs.acs.org/jcim
Structural and Physico-Chemical Interpretation (SPCI) of QSAR Models and Its Comparison with Matched Molecular Pair Analysis
Pavel Polishchuk,*,†,‡ Oleg Tinkov,§ Tatiana Khristova,‡,∥ Ludmila Ognichenko,‡ Anna Kosinskaya,‡ Alexandre Varnek,∥,⊥ and Victor Kuz’min‡
† Institute of Molecular and Translational Medicine, Faculty of Medicine and Dentistry, Palacký University and University Hospital in Olomouc, Hněvotínská 1333/5, 779 00 Olomouc, Czech Republic
‡ A. V. Bogatsky Physico-Chemical Institute of National Academy of Sciences of Ukraine, Lustdorfskaya doroga 86, 65080 Odessa, Ukraine
§ T. G. Shevchenko Transdniestria State University, ul. 25 Oktyabrya 107, 3300 Tiraspol, Transdniestria, Republic of Moldova
∥ Laboratoire de Chémoinformatique, UMR 7140 CNRS, Université de Strasbourg, 1 rue Blaise Pascal, 67000 Strasbourg, France
⊥ Laboratory of Chemoinformatics and Molecular Modeling, Butlerov Institut of Chemistry, Kazan Federal University, Kremlevskaya 18, Kazan, Russia
Supporting Information
ABSTRACT: This paper describes the Structural and Physico-Chemical Interpretation (SPCI) approach, which is an extension of a recently reported method for interpretation of quantitative structure−activity relationship (QSAR) models. This approach can efficiently be used to reveal structural motifs and the major physicochemical factors affecting the investigated properties. Its efficacy was demonstrated both on the classical Free−Wilson data set and on several data sets with different end points (permeability of the blood−brain barrier, fibrinogen receptor antagonists, acute oral toxicity). Structure−activity patterns extracted from QSAR models with SPCI were in good correspondence with experimentally observed relationships and molecular docking, regardless of the machine learning method used. Comparison of SPCI with the matched molecular pair (MMP) method clearly shows an advantage of our approach over MMP, especially for small or structurally diverse data sets. The developed approach has been implemented in the SPCI software tool with a graphical user interface, which is publicly available at http://qsar4u.com/pages/sirms_qsar.php.
Received: June 21, 2016. Published: July 15, 2016.
© 2016 American Chemical Society

■ INTRODUCTION
Currently, a number of linear and nonlinear techniques are used for quantitative structure−activity relationship (QSAR) modeling. Most of these, including support vector machine (SVM), random forest (RF), and neural net models, have high predictive ability and have become gold standards. However, these models are difficult to interpret and hence often considered “black boxes”. Recent trends in QSAR modeling, on the other hand, demonstrate the clear value of model interpretation to better understand structure−property relationships.1,2 Different techniques were developed in the past to interpret specific RF,3,4 SVM4,5 and neural net6,7 models. Several method-independent approaches based on calculation of partial derivatives or local gradients of descriptors8−10 have been suggested. The main idea of these approaches is to assess the contribution of descriptors and then to transfer this knowledge to the structural level, e.g., by color coding.9,11−13 The latter requires the use of only interpretable descriptors (e.g., fragmental descriptors, group signatures, etc.), which limits the application of these techniques.
Interpretation of QSAR models is important not only to reveal underlying structure−property relationships but also for additional knowledge-based validation of models: interpreted results should correspond to known facts acquired via science. If these contradict each other, then there are two possible scenarios: (i) the model is wrong and should be discarded or (ii) the accepted knowledge is insufficient and should be extended or reconsidered on the basis of the model interpretation results. Interpretation of QSAR models should thus become a routine postprocessing step. In order to achieve this, more universal approaches to model interpretation should be developed that can be applied to models based on different descriptors and machine learning methods.
Several articles have been published recently on new approaches for the interpretation of QSAR models that allow us to estimate the contributions of structural fragments from QSAR models built by any machine learning method.14−17 Many of these approaches utilize the idea of matched molecular pair (MMP) analysis18 to calculate atom or fragment contributions. Sushko et al.16 implemented this idea explicitly in the prediction-driven MMP approach to calculate the effect of molecular transformations from QSAR models. Riniker and Landrum17 proposed similarity maps to visualize atomic contributions from QSAR models based on fingerprints and any machine learning method. They proposed estimating the
DOI: 10.1021/acs.jcim.6b00371 J. Chem. Inf. Model. 2016, 56, 1455−1469
Journal of Chemical Information and Modeling
different physicochemical factors. This extended interpretation approach, which integrates structural and physicochemical interpretation (SPCI) of QSAR models, has been applied to classification and regression tasks with different end points. Another goal of this study is to compare the results of structural interpretation of QSAR models with MMP analysis, which is a very popular tool for finding structure−property relationships because of its simplicity and clarity. Possible limitations, advantages, and disadvantages of QSAR interpretation and MMP approaches are discussed in this paper as well. Physicochemical Interpretation. The main goal of physicochemical interpretation is estimation of fragment contributions to the investigated property in terms of different physicochemical factors, such as electrostatic interactions, hydrophobic interactions, H-bonding, etc. For physicochemical interpretation, compounds should be represented by several types of descriptors. Each descriptor type represents a chemical structure from a certain physicochemical viewpoint (e.g., electrostatic, hydrophobic, H-bonding, etc.). To estimate the contribution of a physicochemical property of fragment C (see Figure 1), only descriptors of a given type are calculated for structure B, whereas the values of other types of descriptors are identical to those for the initial compound A. The difference between predicted property values for A and B is the contribution of a selected physicochemical factor to the overall contribution of fragment C. This procedure should be repeated independently for each type of descriptor in order to estimate the contributions from all of the physicochemical factors. It is important to note that the results of interpretation of QSAR models are markedly influenced by the data set used. Interpretation of results should not be considered absolute but as reflecting the relative importance of selected fragments. 
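The descriptor-swapping procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_model`, its weights, the factor names, and the descriptor values are all hypothetical stand-ins for a trained QSAR model f(S) and for real descriptor blocks of compound A and of structure B.

```python
def toy_model(blocks):
    """Hypothetical f(S): a weighted sum over descriptor blocks, one block per factor."""
    weights = {"electrostatic": 0.5, "hydrophobic": 1.0,
               "dispersive": 0.2, "h_bonding": -0.8}
    return sum(weights[name] * sum(values) for name, values in blocks.items())


def factor_contribution(model, blocks_a, blocks_b, factor):
    """Contribution of one physicochemical factor to the contribution of fragment C.

    Only the descriptors of the chosen type are taken from B; all other
    descriptor types keep the values calculated for the initial compound A.
    """
    mixed = dict(blocks_a)
    mixed[factor] = blocks_b[factor]
    return model(blocks_a) - model(mixed)


# Hypothetical descriptor blocks for compound A and for B (A without fragment C).
blocks_a = {"electrostatic": [2, 1], "hydrophobic": [3, 0],
            "dispersive": [1, 1], "h_bonding": [2, 2]}
blocks_b = {"electrostatic": [1, 1], "hydrophobic": [3, 0],
            "dispersive": [1, 0], "h_bonding": [0, 1]}

contributions = {name: factor_contribution(toy_model, blocks_a, blocks_b, name)
                 for name in blocks_a}
overall = toy_model(blocks_a) - toy_model(blocks_b)
```

In this toy case the factor contributions add up to the overall contribution only because the stand-in model is linear; for nonlinear models no such additivity should be expected.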
The simplest example is when the data set contains weak and strong binders, in which case some fragments can have zero contribution. However, this does not mean that these fragments are not important for binding and receptor recognition. They are just less important than other fragments. If nonbinders are added to the same data set of weak and strong binders, the fragments that previously had zero contributions may become more important with high positive contributions. This can be compared to positive and negative coefficients relative to the intercept in an ordinary linear model. Understanding the overall fragment contribution obtained as a result of structural interpretation is relatively simple: the higher the contribution, the more important the fragment is. Interpretation of fragment contributions in terms of physicochemical factors is somewhat different and shows the relative contribution of different physicochemical terms to the overall contribution. Thus, if a fragment has a low overall contribution, it will probably have a low contribution of physicochemical factors. However, that does not mean that such factors are not important for the investigated property. Theoretically, it is possible that a fragment with a low overall contribution may have big contributions from physicochemical factors that have opposite signs (e.g., a big positive contribution of an electrostatic term may be compensated by a big negative contribution of a hydrophobic term). Strategies for Fragment Selection and Assembly. Let us define local and global interpretation analysis. Local analysis is performed for one particular compound possessing several putative binding groups. It is presumed to answer the question of which of these groups is relatively more important. Global analysis consists of grouping fragment contributions of different
contribution of each atom as a difference between the values predicted for a particular compound and for the same compound after removal of the atom. Polishchuk et al.14 proposed a more general approach to estimate not only atom but fragment contributions, where whole fragments are virtually removed from molecules. Somewhat different ideas were used in the work of Webb et al.,15 who proposed the feature combination network approach for interpretation of binary classification models. In Figure 1, we briefly recapitulate our earlier proposed technique for structural interpretation of QSAR models:14 the
Figure 1. Structural and Physico-Chemical Interpretation (SPCI) of QSAR models. A is the compound of interest, C is the fragment of interest whose contribution is to be calculated, and B is the part of A remaining after removal of C. f(S) is a QSAR model returning a predicted value for the specified structure S. W(C) is the overall contribution of fragment C. AE, AH, AD, and AHB are descriptors of compound A representing electrostatic, hydrophobic, dispersive, and H-bonding terms, respectively. BHB is a descriptor of compound B representing the H-bonding term. WHB(C) is the contribution of fragment C regarding H-bonding effects.
contribution of fragment C of compound A can be estimated as the difference between the predicted activity values for compound A and virtual compound B (the antifragment of C) obtained by removal of C from A. This simple procedure can be used for the interpretation of both regression and binary classification QSAR models based on any combination of machine learning methods and descriptors. In the case of regression models, predicted numerical values are used for the calculation of fragment contributions, and thus, the contributions have the same units as the investigated property value and reflect the change in the value of the investigated property with the addition of certain fragments. In the case of binary classification models, predicted probabilities of belonging to the active class of compounds are used. Thus, the fragment contributions are probabilities to change class upon addition of those fragments. The developed approach for structural interpretation can estimate contributions of scaffolds and linkers as well as contributions of single substituents. After removal of the linker or scaffold, the remaining structure will consist of two or more disconnected fragments. This creates a certain limitation, since not all descriptors can be calculated for such multifragment structures, which can be chemically not meaningful. However, simplex descriptors and fingerprints can handle such structures perfectly. Therefore, in this study we used simplex descriptors because they provide great flexibility and opportunity to analyze the contributions of any fragments. The described structural interpretation approach answers the questions “How does a certain fragment influence an investigated property?” and “What is its contribution?”, but it fails to explain the reason for this. 
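As a minimal sketch of the structural interpretation step, the contribution W(C) = f(A) − f(B) can be computed with any callable model. Everything concrete here is an assumption made for illustration: molecules are represented as lists of fragment labels, `describe` is a hypothetical descriptor function, and `toy_model` is a hand-written stand-in for a trained QSAR model.

```python
from collections import Counter


def describe(fragments):
    """Hypothetical descriptor function: counts of each fragment label."""
    return Counter(fragments)


def toy_model(descriptors):
    """Hypothetical stand-in for f(S); the interaction term makes it nonadditive."""
    weights = {"COOH": -1.0, "CF3": 0.8, "phenyl": 0.3}
    value = 5.0 + sum(weights.get(k, 0.0) * n for k, n in descriptors.items())
    if descriptors["COOH"] and descriptors["phenyl"]:
        value -= 0.5
    return value


def fragment_contribution(model, molecule, fragment):
    """W(C) = f(A) - f(B), where B is A after removal of one occurrence of C."""
    remaining = list(molecule)
    remaining.remove(fragment)
    return model(describe(molecule)) - model(describe(remaining))


compound_a = ["phenyl", "COOH", "CF3"]
w_cooh = fragment_contribution(toy_model, compound_a, "COOH")
w_cf3 = fragment_contribution(toy_model, compound_a, "CF3")
```

Because the stand-in model contains an interaction term, the contribution of the carboxyl label depends on whether a phenyl label is present, which is the kind of context dependence a nonadditive model can express.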
In this study, we report an extension of the previously developed structural interpretation approach14 that can help shed light on a mechanistic interpretation of QSAR models in terms of contributions of
Table 1. Different Strategies for Fragment Selection and Assembly

| scenario | do specific interactions exist that cannot be disregarded? | is the position of a ligand toward its target known? | fragment selection and grouping |
|---|---|---|---|
| 1 | NO (e.g., passive diffusion through membranes, solubility, lipophilicity, etc.) | NOT RELEVANT | manual selection on the basis of researcher experience |
| 2 | YES (e.g., ligand−receptor interactions) | YES | selection according to the ligand’s pose relative to the target |
| 3 | YES (e.g., ligand−receptor interactions) | NO | recommendation: select fragments from homogeneous sets of compounds that have a common scaffold and presumably act by the same mechanism |
molecules and comparing them with the aim of explaining or extending experimentally observed trends in a structure−property relationship. Whereas local analysis can be made for any compound of a data set, the results of global interpretation depend on the fragment selection and grouping strategy (Table 1). The simplest case (scenario 1 in Table 1) occurs when a specific orientation of compounds relative to their target is not expected or can be disregarded (e.g., solubility, lipophilicity, passive diffusion through membranes, etc.). Fragment selection and grouping can be guided by general considerations and depend on the decision of the researcher. If the orientation of compounds relative to their target is important and known from experimental studies or modeling (scenario 2 in Table 1), then this information should be taken into account during fragment selection and grouping. For example, if it is known that certain fragments of investigated compounds form H-bonds with the same amino acid residue, then these fragments can be grouped and analyzed together in order to obtain relevant interpretation results. In the worst case (scenario 3 in Table 1), when the orientation of the investigated compounds is important but unknown, analysis of only homogeneous sets of compounds is recommended. This can also be performed by MMP analysis18 or SAR matrices,19 tacitly assuming the identical interaction mode for all of the investigated compounds. However, data sets that comprise compounds with unknown but different mechanisms of action are most common and therefore should be carefully analyzed. This last case is what justified the term “black box”.
Statistical Assessment of Calculated Contributions. Appropriate standard parametric/nonparametric statistical tests, such as the t test, Wilcoxon rank test, etc., can be applied to test the statistical significance of contributions in the case of global analysis.
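As a rough illustration of testing a group of contributions against zero, the sketch below uses an exact two-sided sign test. This is a deliberately simpler nonparametric stand-in for the t test and Wilcoxon test named above, chosen because it needs only the standard library; the function name and the contribution values are hypothetical.

```python
from math import comb


def sign_test_p_value(contributions):
    """Exact two-sided sign test of the hypothesis that the median contribution is zero."""
    nonzero = [c for c in contributions if c != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    positives = sum(1 for c in nonzero if c > 0)
    tail = min(positives, n - positives)
    # Probability of a split at least this extreme under a fair coin, doubled.
    p = 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


# Hypothetical contributions of one fragment across ten compounds.
group = [-0.6, -0.4, -0.9, -0.2, -0.5, 0.1, -0.3, -0.7, -0.8, -0.1]
p_value = sign_test_p_value(group)
```

The sign test ignores the magnitudes of the contributions, so it is less powerful than the Wilcoxon test; it is shown here only to make the idea of a group-level significance check concrete.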
However, this does not take into account model error, which may be quite large and affect the calculated contributions. We propose not only to apply standard statistical tests but also to compare calculated contributions to the model error estimated from cross-validation or external testing, or to express contributions in units of model error (i.e., units of root-mean-square error (RMSE)) and set a reasonable threshold value to separate significant and nonsignificant contributions.
Comparison of the MMP and SPCI Approaches to Interpretation of QSAR Models. Both the MMP and SPCI approaches have their own advantages and disadvantages. Some of their characteristics are listed in Table 2.

Table 2. Comparison of the MMP and SPCI Approaches

| | SPCI | MMP |
|---|---|---|
| tasks | regression and classification | regression and classification |
| result | structural and physicochemical interpretation | structural interpretation only |
| data set size | any, if predictive models can be built | large enough |
| data set diversity | any, if predictive models can be built | should contain matched pairs, which may limit diversity |
| accuracy | depends on experimental and prediction errors | depends on experimental errors and the number of pairs |
| ranking of fragments | all fragments can be ranked simultaneously (“infinite” molecular series) | only fragments appearing in molecular pairs (or series) can be ranked |
| context dependence | yes, if nonadditive models are used | yes |

The choice of an appropriate approach depends on the data set and task. However, combining them as complementary may provide different insights into an investigated problem. SPCI of QSAR models works on any data set, independent of size and structural diversity, if predictive models can be built. MMP can be applied mainly to big data sets containing sufficient molecular pairs, which implicitly limits the diversity of the data set. The accuracy of QSAR interpretation depends on experimental and prediction errors. However, for accurate
models, the prediction error is usually close to the experimental one. The results of the MMP approach are affected by experimental errors and the number of pairs.20 It is important to note whether the experimental values were measured using the same assays or conditions, since in the case that they are different, this can substantially affect the end-point values as a result of systematic errors and the introduction of additional noise in the data set. In our experience, QSAR models for such data sets are usually poor, and only after selection of compounds tested under similar conditions do the QSAR models become predictive. Such a curation step is often absent in MMP analysis, which may distort the results. Therefore, the data should be curated before MMP analysis, or its heterogeneity should be taken into account during estimation of the statistical significance of the results.20 SPCI of QSAR models can rank any arbitrary set of fragments in one series, while MMP is restricted to pairwise comparison. Thus, if one finds by MMP that A < B and A < C, it is impossible to estimate the relationship between B and C if there are no B/C pairs. Even if two pairs A < B and B < C are found, there is no guarantee that A < C. To solve this issue and improve the reliability of pairwise relationships, an extension of MMP was proposed: molecular matched series where fragments attached to the same scaffolds are ranked at once.21 However, the length of the series is obviously limited, and longer series may be less statistically significant. From the MMP point of view, the results of SPCI of a QSAR model look like those obtained from an “infinite” molecular series. Data Sets. Structural interpretation methodology is very similar to Free−Wilson analysis. 
For this reason, we applied the approach developed for structural interpretation to the original data set taken from the publication of Free and Wilson22 to demonstrate the conformity of our results with a classical, well-proven approach. Three other data sets were also used: (i) compounds with known blood−brain barrier permeability by passive diffusion (BBB±) to represent the first scenario in Table 1; (ii) antagonists of fibrinogen receptors with known affinity values (pIC50) and previously established docking poses obtained in our earlier study23 in order to represent the second scenario in Table 1; (iii) compounds with measured acute oral toxicity in rats (pLD50), representing the third case in Table 1. The first of these tasks is a classification problem, and the other two are regression problems.
Free−Wilson Data Set. This contained 29 compounds with associated LD50 values expressed in mg/10 g after intraperitoneal injection in mice (Figure 2).22 The weight units were not converted into molar units in order to ensure that the results would be comparable with published values of contributions.
Figure 3. (top) Binding pattern of tirofiban, a commercial antagonist of the fibrinogen receptor (PDB code 2VDM). (bottom) General representation of antagonists of the fibrinogen receptor, which have an Asp mimetic and an Arg mimetic joined by a linker, and several examples of the corresponding fragments.
Figure 2. Structures of compounds of the data set from the Free and Wilson paper:22 R = H, CH3; R1 = H, CH3, C2H5; R2 = N(CH3)2, N(C2H5)2, morpholino; R3 = H, phenyl; R4 = nothing, −CONH−.
Blood−Brain Barrier (BBB) Permeability Data Set. Information on 321 compounds with associated data on their BBB permeabilities (178 permeable and 143 nonpermeable) was collected from various articles.24−32 We confirmed by checking the literature that all of the selected compounds cross the BBB by passive diffusion and not by active transport. This data set illustrates the first scenario in Table 1, where possible specific interactions can be disregarded.
Data Set of Fibrinogen Receptor Antagonists. The Arg-Gly-Asp (RGD) sequence of fibrinogen is responsible for the interaction of fibrinogen with its receptor and subsequent thrombus formation.33,34 For a long time, researchers focused on the development of antagonists of the fibrinogen receptor that mimic the RGD sequence.35−37 The data set consisted of 325 antagonists of the fibrinogen receptor (RGD mimetics) with measured affinities expressed as pIC50 values collected from the ChEMBL database38 and our own studies. All of the selected compounds were tested using similar protocols, as different protocols could result in a wide range of affinity values varying by up to 1 order of magnitude.39 Each compound in this data set can be represented as consisting of three parts: Arg and Asp mimetics connected by a linker moiety (Figure 3). Ligand−protein interactions of these compounds were established in our previous work in docking studies.23 Arg mimetics containing a basic nitrogen atom form a charged H-bond with Asp224 and Ser225 of the fibrinogen receptor. Asp mimetics contain a carboxylic acid residue that coordinates with Mg2+ ions inside the protein cavity and substituents that form H-bonds with Arg214 and Asn215. There are several hydrophobic residues in the binding site that may interact with the Arg mimetic and the linker moiety of the ligand.
The linker part of the ligand is exposed and may be solvated or form H-bonds with Asp232 via water molecules.23 Compounds of this data set have no clear scaffold, which could be a problem for MMP analysis. Data Set of Compounds with Acute Oral Toxicity in Rats. The data set for this end point was obtained from the Toxicity Estimation Software Tool (TEST), version 4.1, provided by the U.S. Environmental Protection Agency.40 LD50 values were
converted from mass to molar units and expressed as −log(LD50), with LD50 in mol/kg. After removal of salts, undefined isomeric mixtures, polymers, and mixtures, 7205 compounds remained for modeling in the data set. This data set is the most difficult one for analysis since it includes compounds with different and mainly unknown mechanisms of action. It was chosen to demonstrate the applicability of the developed approach to real complex data sets. Simplex Representation of Molecular Structure (SiRMS). The two-dimensional simplex representation of molecular structure (2D SiRMS) was chosen because of its great flexibility, its convenience in representing chemical structures, and the usually high predictive ability of the obtained models.11,41 It is perfectly suited to QSAR modeling and subsequent structural and physicochemical interpretation. Simplexes are tetraatomic fragments of fixed composition and topology. Simplex descriptors are counts of identical simplexes in a structure. Simplexes can be bound (all atoms in a simplex are connected by bonds) and unbound (one or more atoms are not connected to others in a simplex). The latter feature is important for interpretation because it allows us to encode structures consisting of separate fragments, thus making it possible to calculate the contributions of linkers and scaffolds. Another important feature of SiRMS is labeling of atoms according to their physicochemical properties, such as partial atomic charges (representing electrostatic interactions), lipophilicity (hydrophobic interactions), polarizability (dispersive interactions), H-bond donor/acceptor (H-bonding), etc. In the case of atomic properties represented by real numbers (e.g., partial atomic charges), the whole range of values is divided into a specific number of bins (usually four to seven), and each bin receives its own label that is used as an atom label during the stage of simplex generation (Figure 4). 
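The binning and counting steps described above can be illustrated with a deliberately simplified sketch. It makes several assumptions: equal-width bins over the observed range, letter labels, and four-atom combinations identified only by their label composition. Real SiRMS descriptors also encode the topology of each simplex, which is ignored here; the function names and the charge values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def bin_labels(values, n_bins=4):
    """Map real-valued atomic properties to bin labels 'A', 'B', ... (equal-width bins)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all values are equal
    return [chr(ord("A") + min(int((v - lo) / width), n_bins - 1)) for v in values]


def simplex_counts(labels):
    """Count four-atom combinations by sorted label composition (topology ignored)."""
    return Counter("".join(sorted(quad)) for quad in combinations(labels, 4))


# Hypothetical partial atomic charges for a five-atom structure.
charges = [-0.40, -0.10, 0.05, 0.30, 0.40]
labels = bin_labels(charges)
descriptors = simplex_counts(labels)
```

Because every four-atom combination is counted regardless of connectivity, the same counting works for structures made of disconnected fragments, which is the property the text relies on for linkers and scaffolds.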
In this study, we used simplex descriptors labeled by partial atomic charge, lipophilicity, refractivity, and H-bonding. All of these parameters were calculated using the ChemAxon cxcalc software tool.42 When we carried out structural interpretation,
Figure 4. Simplex representation of molecular structure.
we removed the descriptors of all four groups mentioned above for each selected fragment. For the physicochemical interpretation, we removed one group of descriptors at a time for each fragment and repeated the procedure for each group of descriptors independently.
Modeling. The Python sklearn package was used for modeling.43 SVM with radial basis function (RBF) kernel, RF, and gradient boosting method (GBM) models were built for the classification problem. For regression problems, SVM, RF, GBM, and partial least squares (PLS) techniques were used. Consensus predictions were produced by averaging the predictions of regression models or by choosing the majority-voted class among classification models. The performance of all models was assessed by fivefold cross-validation. The tuning parameters of the models were optimized by a grid search. The following statistical parameters were used for assessment of the predictive performance of regression and classification models based on cross-validation:

\[ Q^2 = 1 - \frac{\sum_i (y_{i,\mathrm{pred}} - y_{i,\mathrm{obs}})^2}{\sum_i (y_{i,\mathrm{pred}} - \bar{y}_{\mathrm{obs}})^2} \]

\[ \mathrm{RMSE} = \sqrt{\frac{\sum_i (y_{i,\mathrm{pred}} - y_{i,\mathrm{obs}})^2}{N - 1}} \]

\[ \mathrm{specificity} = \frac{TN}{TN + FP} \qquad \mathrm{sensitivity} = \frac{TP}{TP + FN} \]

\[ \mathrm{balanced\ accuracy} = \frac{\mathrm{specificity} + \mathrm{sensitivity}}{2} \]

\[ \kappa = \frac{\mathrm{accuracy} - \mathrm{baseline}}{1 - \mathrm{baseline}} \]

\[ \mathrm{accuracy} = \frac{TP + TN}{N} \]

\[ \mathrm{baseline} = \frac{(TN + FP)(TN + FN) + (TP + FN)(TP + FP)}{N^2} \]

where Q2 is the coefficient of determination of cross-validation performance, RMSE is the root-mean-square error, yi,obs and yi,pred are the observed and predicted values, respectively, of the target property of the ith compound, y̅obs is the mean observed value of the target property, N is the number of compounds, and TP, TN, FP, and FN are the numbers of true positive, true negative, false positive, and false negative compounds, respectively.
The structures of compounds were standardized using ChemAxon Standardizer,44 and 2D simplex descriptors were calculated with respect to (i) partial atomic charges to represent electrostatic interactions, (ii) lipophilicity to represent hydrophobic interactions, (iii) atomic polarizabilities to represent dispersive interactions, and (iv) H-bond donor/acceptors using an open-source SiRMS implementation.45 For comparison reasons, MMP analyses were carried out using the method described by Hussain and Rea46 and implemented with RDKit software.47

■ RESULTS AND DISCUSSION
Free−Wilson Data Set Analysis. The performance of fivefold cross-validation for this task was poor, as could be expected because of the small size of the data set (Table 3). The overall contributions of substituents calculated from individual models were in strong agreement with each other (RPearson = 0.81−0.97). Good agreement between contributions calculated from the consensus QSAR model and the contributions reported by Free and Wilson was found (Figure 5).

Table 3. Fivefold Cross-Validation Performance for Regression Models Built in This Study

| model | Free−Wilson data set, Q2 | Free−Wilson data set, RMSE | RGD mimetics data set, Q2 | RGD mimetics data set, RMSE | acute oral toxicity data set, Q2 | acute oral toxicity data set, RMSE |
|---|---|---|---|---|---|---|
| RF | 0.34 | 1.47 | 0.72 | 0.81 | 0.61 | 0.59 |
| GBM | 0.33 | 1.48 | 0.68 | 0.86 | 0.56 | 0.63 |
| SVM | 0.43 | 1.37 | 0.70 | 0.82 | 0.54 | 0.64 |
| PLS | 0.26 | 1.56 | 0.67 | 0.88 | 0.44 | 0.71 |
| consensus | 0.38 | 1.43 | 0.73 | 0.79 | 0.60 | 0.60 |

BBB Data Set Analysis. All of the models built had reasonable predictive performance (Table 4). Fragments that represent rings and common functional groups were chosen for interpretation of QSAR models. The overall fragment contributions calculated from different models were generally in good agreement with each other (RPearson = 0.78−0.90). A clear trend of the structure−property relationship is observed in Figure 6. The 1,3-thiazolyl moiety and carbamoyl, nitro, and carboxyl groups have negative effects on the BBB permeability, and cyclic and acyclic amide groups may also reduce the BBB permeability, whereas only the CF3 group has a clear positive influence on the BBB permeability. These findings are in good agreement with previously reported studies indicating that a large number of H-bond donors/acceptors, large polar surface
hydrogen bonds is the most important factor in the low permeability of compounds containing thiazole, nitro, and carboxylic groups because of the strong interaction with the water medium and the necessity of desolvation before passage through a membrane. At the same time, the CF3 group is unlikely to form H-bonds, and this is preferable for BBB permeability. The carbamoyl group has a large negative contribution of electrostatic factors, which means it may have an unfavorable distribution of partial atomic charges. These findings are in good agreement with earlier established rules and accumulated knowledge. A number of studies have indicated a negative effect of a large number of H-bond donors/acceptors, which should be less than 3. At the same time, the topological polar surface area should be less than 80 Å2.49−53
MMP analysis of this data set returned a few reasonable results, but all were statistically insignificant, probably because of the small number of observed transformations (Table 5), which in turn reflects the small size and structural diversity of the data set. Replacement of hydrogen, ester, or carbonyl groups with alkyl chains or increases in the size of alkyl chains enhances compound permeability. This corresponds to experimental observations and can be interpreted as indicating that an increase in lipophilicity and a decrease in the number of H-bond donors/acceptors may enhance BBB permeability.48 The selected transformations did not take into account the molecular context. If this were done, too few matched molecular pairs would remain. This shows one of the limitations of the MMP approach: it cannot be used to analyze relatively small, diverse data sets. On the other hand, it is difficult to collect a large amount of data for some end points measured under identical or very similar experimental conditions to avoid a bias introduced by different assays.
Analysis of the RGD Mimetics (Fibrinogen Receptor Antagonists) Data Set.
RF, GBM, SVM, and PLS models were built for the compounds in this data set and exhibited satisfactory and comparable predictive performance (Table 3). As mentioned above, the compounds in this data set can be virtually split into three parts that interact with corresponding amino acid residues in the binding site of the fibrinogen receptor. Therefore, analysis of contributions was performed separately for these three groups of fragments (Figure 10). The concordance between fragment contributions calculated across different models was high (RPearson = 0.89−0.98). For this reason, analysis of the interpretation results was done for the consensus model only (Figure 11). The distributions of fragment contributions calculated from all of the individual models are given in Figure S1 in the Supporting Information. The two-sided Wilcoxon rank test was applied to test the statistical significance of the contributions. However, the calculated contributions are affected by the accuracy and predictive performance of the models. For this reason, it is feasible to compare contributions relative to a modeling error (RMSE). Contributions that are within 1 unit of RMSE can be considered insignificant, and their analysis should be done with care. Clear trends of structure−affinity relationships were observed for each group of fragments: Arg mimetics, linkers, and Asp mimetics (Figure 11). Cyclic secondary amines as Arg mimetics increase the affinity for the fibrinogen receptor more than pyridyl, amidino, and guanidino groups. Unexpectedly, there were five outliers in the L3 linker group. More thoughtful analysis revealed that all five cases correspond to the most
Figure 5. Average contributions of substituents for the Free−Wilson data set calculated using the models obtained in this work and the original paper.22 Numbers in parentheses are the numbers of compounds having the specified substituent(s).
Table 4. Fivefold Cross-Validation Performance for the BBB Data Set
model       balanced accuracy   sensitivity   specificity   κ
RF          0.76                0.81          0.71          0.52
GBM         0.77                0.84          0.69          0.54
SVM         0.75                0.79          0.70          0.49
consensus   0.76                0.83          0.69          0.52
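The four statistics reported in Table 4 can all be derived from a binary confusion matrix. The following minimal sketch shows one way to compute them; the function name and the example counts are hypothetical (the class distribution of the BBB data set is not given in this section), and the counts were chosen only because they reproduce the rounded values of the RF row.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (balanced accuracy, sensitivity, specificity, Cohen's kappa)."""
    n = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)          # fraction of positives recovered
    specificity = tn / (tn + fp)          # fraction of negatives recovered
    balanced_accuracy = (sensitivity + specificity) / 2
    observed = (tp + tn) / n              # observed agreement
    # Agreement expected by chance from the marginal totals
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return balanced_accuracy, sensitivity, specificity, kappa


# Hypothetical counts (NOT taken from the paper): 100 permeable and
# 100 non-permeable compounds.
ba, sens, spec, kappa = classification_metrics(tp=81, fn=19, tn=71, fp=29)
print(round(ba, 2), round(sens, 2), round(spec, 2), round(kappa, 2))
```

Note that the balanced accuracy in every row of Table 4 is the mean of the sensitivity and specificity in that row, as the sketch makes explicit.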
area, and the presence of carboxylic acid groups may decrease the BBB permeability because of multiple factors: better water solubility, high plasma protein binding, or P-gp recognition.48 Analysis of the distribution of fragment contributions may indicate their context dependence or may help us find outliers or specific key fragments. For example, thiazolyl ring, carboxylic acid, and different amide groups have negative contributions with substantial variance, which implies a significant effect of the molecular surroundings on the contributions of these fragments. Many contribution values of hydroxyl groups are near zero except for several large negative values. Close analysis revealed that these cases refer to benzodiazepine compounds containing a hydroxyl group (Figure 7). This means that removal of the hydroxyl group from these compounds could significantly increase their BBB permeability, unlike other compounds of the data set. Indeed, the found relationship can be confirmed by comparison between the BBB permeabilities of the mentioned compounds and the corresponding nonhydroxylated parent compounds (Figure 7). Such fragments can be considered as “activity triggers” or “emerging patterns”, and researchers may consider removing or replacing them to significantly change the end-point value of the compound. Another example of the context dependence of calculated contributions is shown in Figure 8, where the ester group has significantly different contribution values in different surroundings. Different models also have different distributions of fragment contributions. For example, the variance of contributions of the aromatic hydroxyl group is higher in the case of the GBM model (standard deviation (SD) = 0.27) than for RF (SD = 0.12) or SVM (SD = 0.10). Therefore, it is reasonable to focus on the interpretation of consensus models to decrease noise and bias introduced by individual models. 
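The comparison of per-model spread with a consensus can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the consensus contribution is assumed here to be the arithmetic mean of the individual model contributions, and all numbers are invented for a single hypothetical fragment occurring in five compounds.

```python
from statistics import mean, stdev


def consensus_contributions(per_model):
    """Average each occurrence's contribution across models (assumed rule)."""
    return [mean(values) for values in zip(*per_model.values())]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical contributions of one fragment in five compounds
per_model = {
    "RF": [-0.10, -0.20, 0.00, -0.30, -0.15],
    "GBM": [-0.40, -0.10, 0.20, -0.50, -0.05],
    "SVM": [-0.12, -0.18, -0.05, -0.25, -0.10],
}
consensus = consensus_contributions(per_model)

for name, values in per_model.items():
    print(name, "SD =", round(stdev(values), 2))
print("consensus SD =", round(stdev(consensus), 2))
print("R(RF, SVM) =", round(pearson(per_model["RF"], per_model["SVM"]), 2))
```

In this toy example the noisiest model dominates the spread of its own contributions, while averaging damps it, which is the motivation given above for interpreting the consensus model.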
Physicochemical interpretation of the consensus QSAR model (Figure 9) revealed that the possible formation of
DOI: 10.1021/acs.jcim.6b00371 J. Chem. Inf. Model. 2016, 56, 1455−1469
Article
Journal of Chemical Information and Modeling
Figure 6. Distribution of fragment contributions calculated using the RF, GBM, SVM, and consensus models. Only fragments occurring in at least 10 compounds are shown. Numbers in brackets: M is the number of compounds containing a fragment, and N is the number of fragments across the whole data set (some compounds have several identical fragments, and their contributions were estimated separately). Asterisks refer to statistical significance calculated by the two-sided Wilcoxon rank test (p value): ***, p < 0.001; **, p < 0.01; *, p < 0.05.
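The significance marks in the caption above can be illustrated with a small, self-contained sketch. Several assumptions are made here: the test is taken to be the one-sample Wilcoxon signed-rank test of the fragment contributions against zero, the p value is computed exactly by enumerating all sign assignments (feasible only for small samples), and the contribution values are invented. A statistics library would normally be used instead of this hand-written version.

```python
from itertools import product


def midranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_signed_rank(contribs):
    """Exact two-sided p value for H0: contributions are symmetric about 0."""
    nonzero = [c for c in contribs if c != 0]       # zeros carry no sign
    ranks = midranks([abs(c) for c in nonzero])
    w_plus = sum(r for r, c in zip(ranks, nonzero) if c > 0)
    # Under H0 every rank is positive or negative with probability 1/2
    totals = [sum(r for r, s in zip(ranks, signs) if s)
              for signs in product((0, 1), repeat=len(ranks))]
    p_ge = sum(t >= w_plus for t in totals) / len(totals)
    p_le = sum(t <= w_plus for t in totals) / len(totals)
    return min(1.0, 2 * min(p_ge, p_le))


def stars(p):
    """Map a p value to the asterisk code used in the figure captions."""
    if p < 0.001:
        return "***"
    if p < 0.01:
        return "**"
    if p < 0.05:
        return "*"
    return ""


# Hypothetical contributions of one fragment in six compounds
p_value = wilcoxon_signed_rank([-0.5, -0.4, -0.3, -0.2, -0.1, -0.6])
print(p_value, stars(p_value))
```

With only six occurrences, even contributions that all share the same sign cannot reach the two-asterisk level, which illustrates why rarely occurring fragments seldom reach strong significance.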
Figure 8. Consensus contributions of ester groups to BBB permeability, demonstrating the context dependence of the fragments’ contributions.
active compounds in the data set. This points to some restrictions due to QSAR modeling. Models that cannot extrapolate (e.g., RF and GBM) always return predicted values within the range of observed values of the training set compounds. This causes the contributions of almost any fragment contained in the most active compounds to be positive. The same is true for the contributions of fragments in the least active compounds. Thus, special attention should be paid to such compounds and their analysis. Asp mimetics were the most diverse part of the RGD peptidomimetics. Fragment D8, which occurred in many compounds, has a very large range of contribution values due to the substantial influence of molecular context. At the same time, fragment D6, which is also present in different molecular surroundings, has a smaller range of contributions. In general, the variance of contribution values of Arg mimetics is substantially smaller than those of linkers and Asp mimetics.
Figure 7. Consensus contributions of hydroxyl groups to BBB permeability in benzodiazepine derivatives.
Figure 9. Median fragment contributions of different physicochemical factors estimated from the consensus model. Only fragments occurring in at least 10 compounds are shown. Definitions of M and N were given in the Figure 6 caption.
Table 5. Contributions of the Most Frequent MMPs for the BBB Data Set
matched molecular pair                            number of positive changes (p valueᵃ)   number of no changes   number of negative changes   overall number of MMPs
[*:1]O≫[*:1][H]                                   5 (0.49)                                7                      1                            13
[*:1]C≫[*:1]CCC                                   4 (0.68)                                9                      0                            13
[*:1][H]≫[*:1]CCC                                 5 (0.88)                                6                      0                            11
[*:1]C[*:2]≫[*:1]CCC([*:2])C                      3 (0.62)                                6                      0                            9
[*:1]C≫[*:1]CCCC                                  4 (0.26)                                4                      0                            8
[*:1]OC([*:2])O≫[*:1]CCCC[*:2]                    3 (0.53)                                5                      0                            8
[*:1][H]≫[*:1]CCCC                                3 (0.83)                                4                      0                            7
[*:1]OC([*:2])O≫[*:1]CCCCC[*:2]                   3 (0.43)                                4                      0                            7
[*:1][H]≫[*:1]C(C)C                               3 (0.90)                                3                      0                            6
[*:1]C≫[*:1]CC(C)C                                3 (0.32)                                3                      0                            6
[*:1]OC([*:2])O≫[*:1]CCCCCC[*:2]                  3 (0.32)                                3                      0                            6
[*:1]C≫[*:1]CCCCC                                 2 (0.65)                                4                      0                            6
[*:1][H]≫[*:1]CCCCC                               3 (0.21)                                2                      0                            5
[*:1]CO≫[*:1]CCC                                  3 (0.21)                                2                      0                            5
[*:1]OC([*:2])O≫[*:1]CCC[*:2]                     3 (0.21)                                2                      0                            5
[*:1]CC≫[*:1]CCC(C)C                              2 (0.96)                                3                      0                            5
[*:1]CC1([*:2])C(O)NC(O)NC1O≫[*:1]CCCCCC[*:2]     2 (0.96)                                3                      0                            5
[*:1]OC(C)O≫[*:1]CCCCCC                           2 (0.96)                                3                      0                            5
ᵃ p values from a one-sided binomial test.
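A one-sided binomial test of the kind named in the table footnote can be sketched in a few lines. The null probability is not stated in this section; the value of 1/3 used below (three equally likely outcomes: positive change, no change, negative change) is an assumption. It reproduces several of the tabulated p values, for example 0.21 for 3 positive changes out of 5 pairs and 0.68 for 4 out of 13, but not all of them, so it should be read as an illustration rather than as the authors' exact procedure.

```python
from math import comb


def binom_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): the one-sided p value."""
    return sum(comb(n, i) * p0 ** i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))


# Assumed null probability of a positive change (not given in the text)
P0 = 1 / 3

for positives, total in [(3, 5), (4, 13), (3, 6)]:
    print(positives, total, round(binom_upper_tail(positives, total, P0), 2))
```

With so few pairs per transformation, even a majority of positive changes cannot produce a small p value, which is consistent with the statement that all transformations found for this data set were statistically insignificant.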
This indicates that the nature of the Arg mimetic may be more important for binding to the fibrinogen receptor than those of the linker and Asp mimetic, whose contributions are highly context-dependent. This information along with established trends in structure−property relationships may be used for drug design per se or as a guideline for the researcher. It is important to note that the interpretation results depend on the available data sets, and this should always be taken into account in any analysis. For example, it is well-known from pharmacophore modeling, docking, and X-ray data of ligand−protein complexes that fibrinogen receptor antagonists should contain positively and negatively charged groups at a distance of 15−20 Å.23,36,37,54 These groups are essential for binding to the receptor. However, according to the interpretation results, the contributions of some Arg groups are near zero and not statistically significant (Figure 11). Thus, one might conclude that these groups are not very important. However, such results are easily explained by a biased data set that does not contain true nonbinders. Compounds with guanidine (R1) and pyrimidine (R2) groups have the lowest affinity values in the data set (10−100 μM), and therefore, their contributions are very low. However, that does not mean that these groups are unimportant for receptor recognition. The global and local physicochemical interpretation revealed the large contribution of the electrostatic term (Figure 12), suggesting that this is the main driving force of the ligand−receptor interaction. This assumption can be supported by the following considerations: (1) ligands have at least one positively charged group and one negatively charged group, which is essential for ligand−receptor recognition (Figure 13); (2) there are commonly one or two charged H-bonds in the ligand−receptor complexes according to earlier molecular docking studies (Figure 13); (3) the desolvation effect of ligands, which depends on the distribution of partial atomic charges, can also play an important role in ligand binding and cannot be estimated directly. The less significant effects of H-bonding
Figure 10. Most frequently occurring fragments in compounds of the RGD peptidomimetics data set.
Figure 11. Distribution of fragment contributions of RGD mimetics calculated from the consensus QSAR model. The definitions of M and asterisks were given in the Figure 6 caption.
Figure 12. Median contributions of physicochemical terms in ligand−protein interactions of the RGD mimetics data set compounds calculated from the consensus model consisting of RF, GBM, SVM, and PLS models. The definition of M was given in the Figure 6 caption.
relative to the electrostatic term may be explained by the charged nature of the H-bonds formed between the ligands and Asp224 and Arg214 of the fibrinogen receptor. There are few hydrophobic residues in the binding pocket, and correspondingly, relatively small contributions of hydrophobic effects of fragments are observed in the consensus QSAR model. The
ability of the interpretation approach to correctly rank them relative to other structural motifs (Table 6). The toxicophores mentioned ranked at the top among all of the considered fragments (Figure 15). The influence of the molecular context of some toxicophores with common structures such as the carbamate group was analyzed. The contribution of O(methylaminocarbonyl)oxime to toxicity was found to be significantly greater than the contribution of the parent carbamate group. On the other hand, the cyclic carbamate fragment was virtually nontoxic. Second, we analyzed the contributions of other highly ranked fragments from the list of common functional groups and ring systems in order to find new potential toxicophores. Some, like nitrosamine and aziridine, are known mutagens, as shown by Kazius et al.55 and in our recent publication.14 Others, like piperazine and piperidine, which are frequently used in medicinal chemistry, are not well-known as toxicophores. Analysis of the molecular context of these groups revealed the 4-phenylpiperazine and 4-phenylpiperidine moieties as probable toxicophores because they have significantly higher contribution values relative to the contributions of corresponding parent fragments. Halogens per se make relatively small contributions to acute toxicity and may be considered weak toxicophores. Fragments with negative contributions, such as carboxylic or sulfonic acid groups, can be considered as detoxicophores. We have not provided detailed analysis of this data set here. This was merely used as an illustration of the applicability of the proposed interpretation approach to real problems. We took a cursory look at the contributions of different physicochemical factors for this data set (Figure S4) because the compounds have different molecular targets and mechanisms of action that may substantially affect the interpretation results. 
Thus, only general observations can be made about the large contributions of the dispersive term to the toxic effect of phosphorus-containing compounds, the electrostatic term to the toxicity of compounds comprising 4-hydroxycoumarin and hexachloronorbornene residues, and the H-bonding and dispersive terms for compounds bearing the 2-trifluorobenzimidazole moiety. MMP analysis of this data set returned 494 statistically significant pairs according to the one-sided Wilcoxon rank test. This revealed the toxicity of phosphorus-containing residues relative to hydrogen, hydroxyl, and other groups, nitrosamino moieties relative to ester groups, alkene relative to the dimethyl ether fragment, acrylic acid residues relative to alkyl chains,
Figure 13. Interaction map of a selected ligand with the fibrinogen receptor and calculated contributions of physicochemical terms for separate fragments from the consensus model (ELS, electrostatic; HYD, hydrophobic; HB, hydrogen bonding; DSP, dispersive).
contributions of dispersive interactions are smallest, as these forces are usually very weak and do not substantially influence affinity values. MMP analysis was performed only for pairs that occurred at least five times across the data set and had one-sided Wilcoxon rank test p values less than 0.05 in order to obtain more stable and significant results (Figure 14). The top four pairs from Figure 14 demonstrated a preference of longer chains for high-affinity ligands. Introduction of a phenyl group, removal of a nitro group, and replacement of para-substituted phenyl with meta-substituted phenyl increase the affinity. The MMP approach applied to the RGD mimetics data set simulates the situation in which all pairs are analyzed independently of their binding modes. Thus, the results of analysis out of molecular context appear to have little value, as they cannot be directly used for drug design or to explain an observed structure−activity relationship.
Analysis of Acute Oral Toxicity in Rats. The statistical characteristics of the individual RF, GBM, SVM, and PLS models and their consensus predictions are given in Table 3. Despite the poor predictive performance of the PLS model, its inclusion in the consensus model does not change the predictive ability of the latter or the interpretation results. The fragment contributions calculated from individual models were in good agreement (RPearson = 0.69−0.94). Therefore, further analysis was focused only on a discussion of results of the consensus model. First, contributions of known toxicophores with established mechanisms of action were calculated in order to confirm the
Figure 14. Results of matched molecular pairs analysis for the data set of antagonists of the fibrinogen receptor. Only pairs that occurred five or more times and had one-sided Wilcoxon rank test p values less than 0.05 are shown. The numbers of corresponding pairs are given in parentheses.
Table 6. List of Fragments with Known or Presumed Mechanism of Toxic Actionᵃ
ᵃ SMARTS patterns for all of the analyzed fragments are given in Table S1 in the Supporting Information. ᵇ The acute toxicity of thiourea varies with species, strain, age, and iodine content of the diet.68
thiol and halogens relative to phenyl residues, and many others. A number of “duplicates” (e.g., MMPs that match the same parts of molecules but have different alkyl chain lengths) were identified. For this reason, only a portion of the transformations found are shown in Figure 16. Both approaches, MMP and SPCI of QSAR models, revealed several identical or similar fragments such as phosphorus-containing moieties. The MMP analysis revealed no toxicity of carbamate residues or fragments containing piperazine because the corresponding MMPs occurred very rarely across the data set. At the same time, SPCI of QSAR models missed a large number of potential toxicophores related to alkene-containing fragments because of supervised selection of fragments for interpretation. To avoid this kind of bias, molecules can be cut into fragments automatically by applying certain rules similar to those used in the MMP approach.
Knowledge-Mining Tool. The SPCI approach has been implemented in an open-source tool, also called SPCI, for knowledge mining of chemical data sets.69 The overall procedure is straightforward: sdf file → automatic model building and validation → calculation of desired fragment contributions → visualization of contributions. There are two interpretation modes: structural interpretation only and structural and physicochemical interpretation. Different simplex labeling schemes are used for them. In the first case, vertices in simplexes are labeled by the element, whereas in the second case they are labeled according to the value of partial atomic charge, lipophilicity, H-bonding ability, and refraction to represent electrostatic, hydrophobic, H-bonding, and dispersive factors, respectively. Therefore, the overall contribution calculated in the second case may be slightly different from that calculated in the structural interpretation mode alone because different descriptor labeling schemes are used. Modeling and validation are performed automatically without the preliminary variable selection step. Model parameters are tuned by a grid search. Fivefold cross-validation is performed only once (one repetition) with the predefined seed to make it reproducible. Statistics and parameters used for optimal model building can be viewed in a separate window.
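The combination of a grid search with a single, seeded fivefold cross-validation can be illustrated with a short standard-library sketch. It is not the code of the SPCI tool: the function names, the seed value, the toy one-parameter model, and the data are all hypothetical and serve only to show why a fixed seed makes the selected parameters reproducible.

```python
import random
from statistics import mean


def kfold_indices(n_samples, n_folds=5, seed=42):
    """Shuffle indices with a fixed seed and deal them into n_folds folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def fit_slope(x, y, alpha):
    """Toy ridge-like model y = w * x (stand-in for a real learner)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + alpha)


def cv_mse(x, y, alpha, folds):
    """Mean squared error averaged over the held-out folds."""
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in range(len(x)) if i not in held_out]
        w = fit_slope([x[i] for i in train], [y[i] for i in train], alpha)
        errors.append(mean((y[i] - w * x[i]) ** 2 for i in test_idx))
    return mean(errors)


def grid_search(x, y, alphas, folds):
    """Return the grid value with the lowest cross-validated error."""
    return min(alphas, key=lambda a: cv_mse(x, y, a, folds))


# Hypothetical noise-free data: y = 2 * x
x = [float(i) for i in range(1, 21)]
y = [2.0 * v for v in x]
folds = kfold_indices(len(x), n_folds=5, seed=42)
best_alpha = grid_search(x, y, alphas=[0.0, 1.0, 10.0], folds=folds)
print(best_alpha)
```

Because the fold assignment depends only on the seed, repeating the run yields the same folds and therefore the same selected parameter, which is the property the predefined seed is meant to guarantee.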
Figure 15. Contributions of different molecular fragments to toxicity calculated on the basis of the consensus QSAR model. The definitions of M, N, and asterisks are given in the Figure 6 caption. A full list of contributions of all fragments is given in Figure S3.
Figure 16. Selected statistically significant (p < 0.05 from one-sided Wilcoxon rank test) MMPs from the toxicity data set.
The exploratory analysis described in this paper can be rapidly and easily executed for any data set on the basis of several predefined fragmentation schemes, including (1) common functional groups and small rings, (2) all ring systems available in the modeling data set, (3) Murcko scaffolds detected in the modeling data set, and (4) two automatic fragmentation schemes that use SMARTS to define bonds to cleave during fragmentation. In the last fragmentation scheme, only fragments with at most three attachment points are created in order to avoid combinatorial explosion. User-defined
fragments in SMARTS/SMILES format (a tab-separated list of SMARTS/SMILES and their names) may also be used for calculation of fragment contributions. The underlying Python scripts offer many command-line options that allow extensive customization and tuning of the whole system. However, to simplify the usage and program interface, many of those parameters were set to reasonable default values. In the latest version (0.1.5), the predictor module was added, which returns predictions for new data sets and estimates the applicability domain on the basis of the fragment control
data sets where data are collected from different sources. Both MMP analysis and SPCI of QSAR models can be used for knowledge mining of chemical data sets, but SPCI can be considered a good alternative to MMP analysis since (i) the SPCI approach provides structure−activity relationship trends that can be considered as “infinite” molecular series and (ii) the SPCI approach is more suitable for analysis of small or diverse data sets.
approach. Only basic visualization of structure−activity relationship trends was implemented in the SPCI software. More advanced and flexible visualization is provided by the rspci R package (https://github.com/DrrDom/rspci). More details of the developed software tools are available in the SPCI manual and on the author page (http://qsar4u.com/pages/sirms_qsar.php). Three GitHub repositories were created: (i) the standalone SPCI software tool with graphical user interface (https://github.com/DrrDom/spci); (ii) a Python tool to carry out fragmentation of the data set compounds for further descriptor calculation with external software and prediction by QSAR models and for calculation of fragment contributions (https://github.com/DrrDom/spciext); and (iii) an R package for customized visualization of fragment contributions (https://github.com/DrrDom/rspci). Alternatively, the described technique of structural interpretation of QSAR models has been successfully evaluated against a variety of learner algorithms and descriptor types (e.g., ECFP6 fingerprints) built using the QSAR Workbench software with code developed by GlaxoSmithKline-nice.
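The user-defined fragment list mentioned above is described as a tab-separated list of SMARTS/SMILES patterns and their names. A minimal reader for such a file might look as follows. The column order (pattern first, name second), the absence of a header line, the function name, and the three example patterns are assumptions made for illustration; the SPCI manual should be consulted for the format actually expected by the tool.

```python
import io


def read_fragments(handle):
    """Read 'pattern<TAB>name' lines into a list of (pattern, name) tuples."""
    fragments = []
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue                      # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"line {line_number}: expected pattern<TAB>name")
        fragments.append((fields[0].strip(), fields[1].strip()))
    return fragments


# Hypothetical file content; patterns and names are illustrative only
example = io.StringIO(
    "C(=O)[OH]\tcarboxylic acid\n"
    "[N+](=O)[O-]\tnitro\n"
    "C(F)(F)F\ttrifluoromethyl\n"
)
fragments = read_fragments(example)
for pattern, name in fragments:
    print(name, pattern)
```

Validating the file before modeling is inexpensive and avoids discovering a malformed line only after descriptors and models have been computed.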
■
ASSOCIATED CONTENT
Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00371.
Overall contributions of fragments from the data set of affinity for fibrinogen receptor calculated from RF, GBM, SVM, PLS, and consensus models (Figure S1); average physicochemical contributions of fragments from the data set of affinity for fibrinogen receptor calculated from RF, GBM, SVM, PLS, and consensus models (Figure S2); overall fragment contributions calculated from the consensus model of the toxicity data set (Figure S3); contributions of physicochemical factors calculated from the consensus model of the toxicity data set (Figure S4); SMARTS patterns of all analyzed fragments for the toxicity data set (Table S1); tuning parameters used to build the best individual QSAR models (Table S2) (PDF)
The four data sets (XLSX)
■
CONCLUSION
In this paper, we have extended the approach of interpretation of QSAR models to provide context-dependent contributions of structural motifs and physicochemical factors that can be effectively used to reveal structure−property relationships in a chemically meaningful way. The information can be used directly for fragment-based drug design or to establish structure−property relationship trends and uncover possible mechanism(s) of drug action. We believe that interpretation of QSAR models should become an essential step in QSAR analysis as a knowledge-based validation procedure. The developed SPCI approach may be applied in analysis of classification and regression tasks. The applicability of the approach developed has been successfully tested on the classical Free−Wilson data set and three other data sets (blood−brain barrier permeability, affinity of RGD mimetics for fibrinogen receptor, and in vivo acute toxicity). We have shown that individual models built using different statistical methods provide similar interpretation results, but to avoid bias and noise introduced by individual models, the use of consensus models appears to be preferable. The interpretation results in all cases were relevant to existing structure−property knowledge in corresponding domains; in other words, all of the models were knowledge-based-validated. However, it should be noted that the interpretation results depend on the data used for modeling, and these may substantially bias them. The importance of taking into account binding poses of compounds was demonstrated for the data set of RGD peptidomimetics with binding modes previously discovered by molecular docking. Analysis is difficult if not impossible if one discards this information.
In this study, we used only simplex descriptors as the most suitable and flexible molecular representation for structural and physicochemical interpretation, but other descriptors can also be used for structural interpretation as reported in recent articles, including Dragon and MOE descriptors14,70 or ECFP.71 The importance of proper estimation of the significance of SPCI results and MMPs should also be noted. The use of standard statistical tests is insufficient because the results are affected by modeling and experimental error. However, it is very difficult to estimate experimental errors accurately in large
■
AUTHOR INFORMATION
Corresponding Author
*E-mail: [email protected].
Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS
The authors thank Chris Luscombe of GlaxoSmithKline for discussion and collaboration. T.K. thanks the French Embassy in Ukraine for a Ph.D. fellowship. P.P. thanks the French Embassy in Ukraine for a short-term scholarship. The research was partially supported by Russian Science Foundation Grant 14-43-00024.
■
REFERENCES
(1) Cherkasov, A.; Muratov, E. N.; Fourches, D.; Varnek, A.; Baskin, I. I.; Cronin, M.; Dearden, J.; Gramatica, P.; Martin, Y. C.; Todeschini, R.; et al. QSAR Modeling: Where Have You Been? Where Are You Going To? J. Med. Chem. 2014, 57, 4977−5010.
(2) Gasteiger, J. Chemoinformatics: Achievements and Challenges, a Personal View. Molecules 2016, 21, 151.
(3) Kuz’min, V. E.; Polishchuk, P. G.; Artemenko, A. G.; Andronati, S. A. Interpretation of QSAR models based on Random Forest method. Mol. Inf. 2011, 30, 593−603.
(4) Carlsson, L.; Helgee, E. A.; Boyer, S. Interpretation of Nonlinear QSAR Models Applied to Ames Mutagenicity Data. J. Chem. Inf. Model. 2009, 49, 2551−2558.
(5) Rosenbaum, L.; Hinselmann, G.; Jahn, A.; Zell, A. Interpreting linear support vector machine models with heat map molecule coloring. J. Cheminf. 2011, 3, 11.
(6) Baskin, I. I.; Ait, A. O.; Halberstam, N. M.; Palyulin, V. A.; Zefirov, N. S. An approach to the interpretation of backpropagation neural network models in QSAR studies. SAR QSAR Environ. Res. 2002, 13, 35−41.
(7) Guha, R.; Jurs, P. C. Interpreting Computational Neural Network QSAR Models: A Measure of Descriptor Importance. J. Chem. Inf. Model. 2005, 45, 800−806.
(8) Chen, H.; Carlsson, L.; Eriksson, M.; Varkonyi, P.; Norinder, U.; Nilsson, I. Beyond the Scope of Free−Wilson Analysis: Building Interpretable QSAR Models with Machine Learning Algorithms. J. Chem. Inf. Model. 2013, 53, 1324−1336.
(9) Marcou, G.; Horvath, D.; Solov’ev, V.; Arrault, A.; Vayer, P.; Varnek, A. Interpretability of SAR/QSAR Models of any Complexity by Atomic Contributions. Mol. Inf. 2012, 31, 639−642.
(10) Stålring, J.; Almeida, P. R.; Carlsson, L.; Helgee Ahlberg, E.; Hasselgren, C.; Boyer, S. Localized Heuristic Inverse Quantitative Structure Activity Relationship with Bulk Descriptors Using Numerical Gradients. J. Chem. Inf. Model. 2013, 53, 2001−2017.
(11) Kuz’min, V. E.; Artemenko, A. G.; Polischuk, P. G.; Muratov, E. N.; Hromov, A. I.; Liahovskiy, A. V.; Andronati, S. A.; Makan, S. Y. Hierarchic System of QSAR Models (1D-4D) on the Base of Simplex Representation of Molecular Structure. J. Mol. Model. 2005, 11, 457−467.
(12) Hasegawa, K.; Migita, K.; Funatsu, K. Atom Coloring for Chemical Interpretation and De Novo Design for Molecular Design. In Knowledge-Oriented Applications in Data Mining; Funatsu, K., Ed.; InTech: Rijeka, Croatia, 2011.
(13) An, Y.; Sherman, W.; Dixon, S. L. Kernel-Based Partial Least Squares: Application to Fingerprint-Based QSAR with Model Visualization. J. Chem. Inf. Model. 2013, 53, 2312−2321.
(14) Polishchuk, P. G.; Kuz’min, V. E.; Artemenko, A. G.; Muratov, E. N. Universal Approach for Structural Interpretation of QSAR/QSPR Models. Mol. Inf. 2013, 32, 843−853.
(15) Webb, S.; Hanser, T.; Howlin, B.; Krause, P.; Vessey, J. Feature combination networks for the interpretation of statistical machine learning models: application to Ames mutagenicity. J. Cheminf. 2014, 6, 8.
(16) Sushko, Y.; Novotarskyi, S.; Korner, R.; Vogt, J.; Abdelaziz, A.; Tetko, I. Prediction-driven matched molecular pairs to interpret QSARs and aid the molecular optimization process. J. Cheminf. 2014, 6, 48.
(17) Riniker, S.; Landrum, G. Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods. J. Cheminf. 2013, 5, 43.
(18) Leach, A. G.; Jones, H. D.; Cosgrove, D. A.; Kenny, P. W.; Ruston, L.; MacFaul, P.; Wood, J. M.; Colclough, N.; Law, B. Matched Molecular Pairs as a Guide in the Optimization of Pharmaceutical Properties; a Study of Aqueous Solubility, Plasma Protein Binding and Oral Exposure. J. Med. Chem. 2006, 49, 6672−6682.
(19) Wassermann, A. M.; Haebel, P.; Weskamp, N.; Bajorath, J. SAR Matrices: Automated Extraction of Information-Rich SAR Tables from Large Compound Data Sets. J. Chem. Inf. Model. 2012, 52, 1769−1776.
(20) Kramer, C.; Fuchs, J. E.; Whitebread, S.; Gedeck, P.; Liedl, K. R. Matched Molecular Pair Analysis: Significance and the Impact of Experimental Uncertainty. J. Med. Chem. 2014, 57, 3786−3802.
(21) O’Boyle, N. M.; Boström, J.; Sayle, R. A.; Gill, A. Using Matched Molecular Series as a Predictive Tool To Optimize Biological Activity. J. Med. Chem. 2014, 57, 2704−2713.
(22) Free, S. M.; Wilson, J. W. A Mathematical Contribution to Structure-Activity Studies. J. Med. Chem. 1964, 7, 395−399.
(23) Polishchuk, P. G.; Samoylenko, G. V.; Khristova, T. M.; Krysko, O. L.; Kabanova, T. A.; Kabanov, V. M.; Kornylov, A. Y.; Klimchuk, O.; Langer, T.; Andronati, S. A.; et al. Design, Virtual Screening, and Synthesis of Antagonists of αIIbβ3 as Antiplatelet Agents. J. Med. Chem. 2015, 58, 7681−7694.
(24) Garg, P.; Verma, J. In Silico Prediction of Blood Brain Barrier Permeability: An Artificial Neural Network Model. J. Chem. Inf. Model. 2006, 46, 289−297.
(25) Wichmann, K.; Diedenhofen, M.; Klamt, A. Prediction of Blood-Brain Partitioning and Human Serum Albumin Binding Based on COSMO-RS σ-Moments. J. Chem. Inf. Model. 2007, 47, 228−233.
(26) Gerebtzoff, G.; Seelig, A. In Silico Prediction of Blood−Brain Barrier Permeation Using the Calculated Molecular Cross-Sectional Area as Main Parameter. J. Chem. Inf. Model. 2006, 46, 2638−2650.
(27) Deconinck, E.; Zhang, M. H.; Coomans, D.; Vander Heyden, Y. Classification Tree Models for the Prediction of Blood−Brain Barrier Passage of Drugs. J. Chem. Inf. Model. 2006, 46, 1410−1419.
(28) Li, H.; Yap, C. W.; Ung, C. Y.; Xue, Y.; Cao, Z. W.; Chen, Y. Z. Effect of Selection of Molecular Descriptors on the Prediction of Blood−Brain Barrier Penetrating and Nonpenetrating Agents by Statistical Learning Methods. J. Chem. Inf. Model. 2005, 45, 1376−1384.
(29) Fu, X. C.; Song, Z. F.; Fu, C. Y.; Liang, W. Q. A simple predictive model for blood-brain barrier penetration. Pharmazie 2005, 60, 354−358.
(30) Zhang, L.; Zhu, H.; Oprea, T.; Golbraikh, A.; Tropsha, A. QSAR Modeling of the Blood−Brain Barrier Permeability for Diverse Organic Compounds. Pharm. Res. 2008, 25, 1902−1914.
(31) Katritzky, A. R.; Kuanar, M.; Slavov, S.; Dobchev, D. A.; Fara, D. C.; Karelson, M.; Acree, W. E., Jr; Solov’ev, V. P.; Varnek, A. Correlation of blood−brain penetration using structural descriptors. Bioorg. Med. Chem. 2006, 14, 4888−4917.
(32) Muehlbacher, M.; Spitzer, G.; Liedl, K.; Kornhuber, J. Qualitative prediction of blood−brain barrier permeability on a large and refined dataset. J. Comput.-Aided Mol. Des. 2011, 25, 1095−1106.
(33) Gartner, T. K.; Bennett, J. S. The tetrapeptide analogue of the cell attachment site of fibronectin inhibits platelet aggregation and fibrinogen binding to activated platelets. J. Biol. Chem. 1985, 260, 11891−11894.
(34) Andrieux, A.; Hudry-Clergeon, G.; Ryckewaert, J. J.; Chapel, A.; Ginsberg, M. H.; Plow, E. F.; Marguerie, G. Amino acid sequences in fibrinogen mediating its interaction with its platelet receptor, GPIIbIIIa. J. Biol. Chem. 1989, 264, 9258−9265.
(35) Scarborough, R. M.; Naughton, M. A.; Teng, W.; Rose, J. W.; Phillips, D. R.; Nannizzi, L.; Arfsten, A.; Campbell, A. M.; Charo, I. F. Design of potent and specific integrin antagonists. Peptide antagonists with high specificity for glycoprotein IIb-IIIa. J. Biol. Chem. 1993, 268, 1066−1073.
(36) Hartman, G. D.; Egbertson, M. S.; Halczenko, W.; Laswell, W. L.; Duggan, M. E.; Smith, R. L.; Naylor, A. M.; Manno, P. D.; Lynch, R. J. Non-peptide fibrinogen receptor antagonists. 1. Discovery and design of exosite inhibitors. J. Med. Chem. 1992, 35, 4640−4642.
(37) Egbertson, M. S.; Chang, C. T. C.; Duggan, M. E.; Gould, R. J.; Halczenko, W.; Hartman, G. D.; Laswell, W. L.; Lynch, J. J.; Lynch, R. J. Non-Peptide Fibrinogen Receptor Antagonists. 2. Optimization of a Tyrosine Template as a Mimic for Arg-Gly-Asp. J. Med. Chem. 1994, 37, 2537−2551.
(38) Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40, D1100−D1107.
(39) Mehrotra, M. M.; Heath, J. A.; Smyth, M. S.; Pandey, A.; Rose, J. W.; Seroogy, J. M.; Volkots, D. L.; Nannizzi-Alaimo, L.; Park, G. L.; Lambing, J. L.; et al. Discovery of Novel 2,8-Diazaspiro[4.5]decanes as Orally Active Glycoprotein IIb-IIIa Antagonists†. J. Med. Chem. 2004, 47, 2037−2061.
(40) http://www2.epa.gov/chemical-research/toxicity-estimation-software-tool-test.
(41) Kuz’min, V. E.; Artemenko, A. G.; Muratov, E. N. Hierarchical QSAR technology based on the Simplex representation of molecular structure. J. Comput.-Aided Mol. Des. 2008, 22, 403−421.
(42) cxcalc, version 5.4; ChemAxon: Budapest, Hungary.
(43) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825−2830.
(44) Standardizer, version 5.4; ChemAxon: Budapest, Hungary.
(45) Polishchuk, P. G. SiRMS: Simplex Representation of Molecular Structure, version 0.1; 2013−2015.
DOI: 10.1021/acs.jcim.6b00371 J. Chem. Inf. Model. 2016, 56, 1455−1469