Selective Fusion of Heterogeneous Classifiers for ... - ACS Publications

Feb 23, 2017 - fused to develop a heterogeneous ensemble model for each transporter. 2.1. Data Collection. The abbreviations used for the 13 transport...
2 downloads 3 Views 4MB Size
Article pubs.acs.org/jcim

Selective Fusion of Heterogeneous Classifiers for Predicting Substrates of Membrane Transporters Naeem Shaikh, Mahesh Sharma, and Prabha Garg* Department of Pharmacoinformatics, National Institute of Pharmaceutical Education and Research (NIPER), S. A. S. Nagar, Mohali, Punjab-160062, India S Supporting Information *

ABSTRACT: Membrane transporters play a crucial role in determining fate of administered drugs in a biological system. Early identification of plausible transporters for a drug molecule can provide insights into its therapeutic, pharmacokinetic, and toxicological profiles. In the present study, predictive models for classifying small molecules into substrates and nonsubstrates of various pharmaceutically important membrane transporters were developed using quantitative structure−activity relationship (QSAR) and proteochemometric (PCM) approaches. For this purpose, 4575 substrate interactions for these transporters were collected from the Metabolism and Transport Database (Metrabase) and the literature. The transporters selected for this study include (i) six efflux transporters, viz., breast cancer resistance protein (BCRP/ABCG2), P-glycoprotein (P-gp/MDR1), and multidrug resistance proteins (MRP1, MRP2, MRP3, and MRP4), and (ii) seven influx transporters, viz., organic cation transporter (OCT1/SO22A1), peptide transporter (PEPT1/SO15A1), apical sodium-bile acid transporter (ASBT/NTCP2), and organic anion transporting peptides (OATP1A2/SO1A2, OATP1B/SO1B1, OATP1B3/SO1B3, and OATP2B1/SO2B1). Various types of descriptors and machine learning methods (classifiers) were evaluated for the development of robust predictive models. Additionally, ensemble models were developed by bagging of homogeneous classifiers and selective fusion of heterogeneous classifiers. It was observed that the latter approach improves the accuracy of substrate/nonsubstrate prediction for transporters (average correct classification rate of more than 0.80 for external validation). Moreover, structural fragments important in determining the substrate specificity across the various transporters were identified. To demonstrate these fragments on the query molecule, contour maps were generated. The prediction efficacy of the developed models was illustrated by a good correlation between the reported logBB value of a molecule and its predicted substrate propensity for blood−brain barrier transporters. Conclusively, this comprehensive modeling analysis can be efficiently employed for the prediction of membrane transporters of a drug, thereby providing insights into its pharmacological profile. and drug−drug interactions.3,4 Thus, early-phase prediction and identification of potential transporters for new and existing drug molecules is of vital importance for efficient drug discovery and development. The transporters expressed in the intestine, blood−brain barrier (BBB), blood−placenta barrier, and metabolic organs (e.g., kidney and liver) are of special importance in the drug discovery pipeline. The use of experimental methods for profiling chemical substances against a panel of transporters is limited because of their time-consuming and costly nature. Molecular modeling

1. INTRODUCTION Transporters are the transmembrane proteins expressed at various locations in the human body to regulate the membrane passage of endobiotic and xenobiotic substances. Transporters are mainly classified into two superfamilies, viz., ATP binding cassette (ABC) transporters and solute carrier (SLC) transporters. On the basis of the direction of substrate flux, the transporters are categorized as efflux or influx transporters. Their profound impact on the absorption distribution, metabolism, and elimination (ADME) of administered drug molecules makes them imperative in pharmacological profiling of a drug.1,2 They are also recognized as important determinants of tissue-specific drug targeting, drug resistance, © 2017 American Chemical Society

Received: August 27, 2016 Published: February 23, 2017 594

DOI: 10.1021/acs.jcim.6b00508 J. Chem. Inf. Model. 2017, 57, 594−607

Article

Journal of Chemical Information and Modeling

Figure 1. Workflow of the methodology adopted in the present study.

ABCG2) and MRP1 have also been performed.15−18 However, other drug transporters are relatively less explored for the development of predictive models. Furthermore, most of the reported studies have been performed in the view of identifying plausible inhibitors.1 To date, very limited exploration of multiple-transporter modeling has been made. As one example, systematic QSAR modeling of multiple intestinal transporters was performed by Sedykh et al.19 Moreover, PCM modeling has not been employed for membrane transporters. In this work, QSAR- and PCM-based predictive models were developed for pharmaceutically important membrane transporters to classify small molecules as substrates and nonsubstrates. For this purpose, different types of descriptors and classifiers were evaluated. The ensemble models were developed by employing the selective fusion of heterogeneous classifiers. The important structural fragments in the substrates and nonsubstrates of different transporters were analyzed. A code was written to depict these fragments on the query molecules using contour maps. Finally, the suitability of the developed model was demonstrated by the correlation between the reported logBB values of various molecules and their predicted substrate propensity for various BBB efflux transporters.

approaches for the prediction of plausible transporters involved in membrane passage of a drug provide an appealing alternative solution for large-scale screening. The applicability of these techniques is facilitated by the increasing amount of available drug−transporter interaction data. Several ligand-based and structure-based approaches have been employed for modeling of drug−transporter interactions.5,6 The structure-based approaches require three-dimensional structures of macromolecules to directly evaluate the molecular recognition of drug molecules with transporters. Therefore, these approaches are limited by the availability of crystal structures for the membrane transporters. The ligand-based approaches, on the other hand, require only ligand information for the development of a predictive model. The most commonly used ligand-based approach is the quantitative structure−activity relationship (QSAR) method, in which the information encoded in the structure is correlated with the activity of a molecule by employing machine learning approaches.7 However, traditional QSAR can consider only a single transporter at a time. An advancement in the QSAR technique enabling simultaneous consideration of multiple transporters is proteochemometric (PCM) modeling. The PCM modeling approach concurrently takes into account both the ligand chemical features and the sequence-based descriptors of transporter proteins.8 Another distinct advantage of the combined model for multiple transporters is that it can identify, and therefore predict, promiscuous binding of substrates to multiple proteins, which is a commonly known phenomenon in biological systems. The PCM modeling approach has been successfully employed for various classes of proteins such as kinases9 and G-proteincoupled receptors10 as well as for the development of proteome-wide models.11 The literature is enriched with molecular modeling studies for substrate/nonsubstrate prediction of various transporters.5,6,12 Extensive predictive modeling has been reported in the literature for P-glycoprotein (P-gp/MDR1).13,14 Recently, similar studies on breast cancer resistance protein (BCRP/

2. MATERIALS AND METHODS The selection of transporters for this study was based on the guidelines of the International Transporter Consortium1 and the availability of reported molecules in public domain. As a result, 13 transporter proteins that are known to have a role in drug absorption and disposition, therapeutic efficacy, and safety were selected. Substrate interaction data for these transporters was retrieved from various resources. Several models for substrate propensity prediction were developed by combining various descriptors and machine learning methods. Both a QSAR model for each membrane transporter and combined PCM models for all of the transporters were developed and compared (Figure 1). Finally, the top-performing models were 595

DOI: 10.1021/acs.jcim.6b00508 J. Chem. Inf. Model. 2017, 57, 594−607

Article

Journal of Chemical Information and Modeling Table 1. List of Selected Membrane Transporters and Their Abbreviations Uniprot code

name

abbreviation

Q9UNQ0 P08183 P33527 Q92887 O15438 O15439 Q12908 P46059 O15245 P46721

ATP-binding cassette subfamily G member 2 (breast cancer resistance protein) multidrug resistance protein 1 (P-glycoprotein 1) multidrug resistance-associated protein 1 multidrug resistance-associated protein 2 multidrug resistance-associated protein 3 multidrug resistance-associated protein 4 sodium/taurocholate cotransporting polypeptide, ileal (apical sodium-dependent bile acid transporter) solute carrier family 15 member 1 (peptide transporter 1) solute carrier family 22 member 1 (organic cation transporter 1) solute carrier organic anion transporter family member 1A2

ABCG2 MDR1 MRP1 MRP2 MRP3 MRP4 NTCP2 S15A1 S22A1 SO1A2

Q9Y6L6

solute carrier organic anion transporter family member 1B1

SO1B1

Q9NPD5

solute carrier organic anion transporter family member 1B3

SO1B3

O94956

solute carrier organic anion transporter family member 2B1

SO2B1

other abbreviations ABCG2, BCRP ABCB1, PGY1 ABCC1 ABCC2 ABCC3 ABCC4 ASBT, SLC10A2 PEPT1, SLC15A1 OCT1, SLC22A1 OATP1A2, SLCO1A2, SLC21A3 OATP1B1, SLCO1B1, SLC21A6 OATP1B3, SLCO1B3, SLC21A8 OATP2B1, SLCO2B1, SLC21A9

Figure 2. Transporter interaction data set after balancing. Selected nonsubstrates are decoys selected from the data set of Hou et al.31

considered for model development. This resulted in a data set of 4164 interaction records (2336 substrate and 1828 nonsubstrate) for the 13 membrane transporters. For efficient classification model development, the data set should have a balanced representation of each class. The numbers of nonsubstrates for many transporters were very small. Therefore, a decoy set of nonsubstrates was picked from the passive diffusion data set of Hou et al.31 This comprised passively transported drugs, which therefore could be rationally considered as nonsubstrates for the membrane transporters.19 For selecting the molecules from this data set as reliable nonsubstrates for a transporter, a systematic approach was utilized to remove any false negatives (unknown substrates). First, a combined PCM model was developed with initial unbalanced data, a k nearest neighbors (kNN) classifier (k = 3), and a Morgan fingerprint (functional invariant, radius = 2). More details of Morgan fingerprints (RDKit) are discussed later. Thereafter, the passive-diffusion data set was predicted for

fused to develop a heterogeneous ensemble model for each transporter. 2.1. Data Collection. The abbreviations used for the 13 transporters considered in this study are listed in Table 1. The information about small-molecule substrates for various membrane transporters was retrieved from Metrabase,20 which contains integrated data from ChEMBL, various transporter databases, and the primary literature. This data set was further enriched by addition of data for ABCG2, MDR1 and MRP1 from the literature.15,21−30 All of the molecules were cleaned and salts were removed in RDKit toolkit 2015.09.2, a python toolkit (http://www.rdkit.org/). Compounds possessing rare elements and organometallics were removed. Duplicate entries, identified on the basis of canonical SMILES, were removed. All contradictory entries (i.e., small molecules reported as substrates and nonsubstrates in different studies) were also removed. Only those transporters for which more than 15 substrates as well as nonsubstrates were reported were 596

DOI: 10.1021/acs.jcim.6b00508 J. Chem. Inf. Model. 2017, 57, 594−607

Article

Journal of Chemical Information and Modeling

forest (RF),35 support vector machine (SVM),36 stochastic ̈ Bayes gradient descent (SGD),37 logistic regression (LR), naive (NB),38 and kNN classifiers, as implemented in scikit-learn 0.17, were used for model development. In addition, homogeneous and heterogeneous ensembles of the mentioned classifiers were also developed. The number of trees in RF was kept at 100. For the SVM classifier, the radial basis function kernel was used, and values of the penalty parameter C and kernel coefficient γ for each data set were optimized by a systematic grid search. SGD and LR were used as linear classifiers with L2 regularization. The “Log” loss function was used in the SGD classifier. Different numbers of neighbors (1, 3, and 5) were used for kNN classifiers. The “distance” weight function was employed in kNN classifiers to enhance the influence of closer neighbors. The remaining parameters were kept at the default values. Homogeneous ensembles of these classifiers were developed by the bagging method.39 In this method, random subsets were drawn from the original data set and subjected to the base classifier. The results from these base classifiers were then combined to give the final prediction. A bagging meta-classifier was developed for each selected classifier except for RF (which itself is an ensemble of decision trees). Half of the total samples and features from the initial training data set were randomly drawn to train each base classifier. For SVM, the number of base classifiers was kept at 10, compared with 50 for the rest of the classifiers, to maximize the accuracy with minimum cost of prediction. For heterogeneous ensembles, only the top selected classifiers for each transporter were fused by the weighted averaging method.40 This method obtains the combined output by averaging the outputs of individual learners with different weights assigned. Higher weights were assigned to those classifiers that had the maximum correct classification rate (CCR; balanced accuracy) and minimum difference between recall and specificity. The number of top classifiers to be used for each transporter for the heterogeneous ensemble was optimized by a grid search. Table 3 shows the classifiers and their abbreviations used in this study.

transporters with fewer nonsubstrates. Compounds predicted as nonsubstrates for a particular transporter were randomly coupled with that transporter to balance the data set (see Figure 2). Although this approach might lead to enrichment of molecules similar to known nonsubstrates, the chance of including false negatives (i.e., unknown substrates) as nonsubstrates is reduced. Furthermore, on the basis of molecular weight distribution analysis (mean ± standard deviation = 432.22 ± 197.28 g/mol), 36 outliers (with molecular weights > 1196 g/mol) were removed from the data set. Finally, 4575 drug−transporter interactions (2293 substrate and 2282 nonsubstrate) were selected (Table S1). For each transporter data set, local diversity (pairwise similarities only between kNN at k = 5 and 10) was estimated using Tanimoto coefficients with Morgan fingerprints (functional invariant, radius = 2) (Table S2). 2.2. Descriptors. Sequence-based descriptors for transporter proteins were calculated with the help of the PyDPI toolkit.32 These descriptors include amino acid composition, dipeptide composition, and tripeptide composition; different autocorrelation features; composition, transition, and distribution; conjoint triad descriptors; sequence-order features; and pseudo-amino acid compositions. A total of 10 079 descriptors for each transporter protein were calculated. These descriptors were then scaled between 0 and 1 by implementing the MinMaxScaler method of scikit-learn 0.17.33 All descriptors showing variance less than 0.05 were removed, leaving 5092 descriptors. Table S3 shows frequently selected protein descriptors in PCM models. For small molecules, varieties of descriptors were calculated. Molecular descriptors (physicochemical, topological, etc.) and molecular fingerprints (including MACCS keys and variants of Morgan fingerprints) were calculated using RDKit (Table 2). MACCS keys are 166 Table 2. List of Small-Molecule Descriptors Used in This Study sr. no. 1 2 3 4 5 6 7 8

description

type

number

MACCS key Morgan fingerprint (radius = 2) Morgan fingerprint (radius = 3) Morgan fingerprint (radius = 4) Morgan fingerprint (functional invariant, radius = 2) Morgan fingerprint (functional invariant, radius = 3) Morgan fingerprint (functional invariant, radius = 4) molecular descriptors

Mck Mr2 Mr3 Mr4 Mf2

167 2048 2048 2048 2048

Mf3

2048

Mf4

2048

Xr

Table 3. List of Classifiers and Their Abbreviations Used in the Study

196

structural keys based on SMARTS patterns covering most of the important chemical fragment.34 Morgan fingerprints and their functional invariants are the circular fingerprint equivalents of the well-known extended connectivity fingerprint (ECFP) and functional-class fingerprint (FCFP), respectively. The number beside “ECFP” and “FCFP” denotes the diameter considered for calculation of the fingerprint, whereas in the case of Morgan fingerprints the radius is considered. Therefore, ECFP4 can be considered equivalent to Mr2. Finally, nine sets of compound descriptors were used for model development (Table 2). 2.3. Classifiers. Various types of classifier were used in combination with the discussed set of descriptors. Random

sr. no.

classifier

normal classifier

bagging meta-classifier

1 2 3 4 5 6 7 8

random forest support vector machine stochastic gradient descent logistic regression ̈ Bayes naive k nearest neighbors (k = 1) k nearest neighbors (k = 3) k nearest neighbors (k = 5)

clfrf clfsv clfsg clflr clfnb clfk1 clfk3 clfk5

− clfsvb clfsgb clflrb clfnbb clfk1b clfk3b clfk5b

2.4. Model Development. Basically two types of models were developed in this study, viz., QSAR models and PCM models. QSAR models are the traditional machine learning models in which substrates and nonsubstrates for a single transporter were used to develop a predictive model for that particular transporter. Small-molecule descriptors were used to train the model, assigning the labels 1 and 0 for substrate and nonsubstrate, respectively. Thereby, 13 different models were developed for a single combination of classifier and descriptor. 597

DOI: 10.1021/acs.jcim.6b00508 J. Chem. Inf. Model. 2017, 57, 594−607

Article

Journal of Chemical Information and Modeling

obvious criterion for the selection was the availability of substrate and nonsubstrate information. A total of 3411 unique molecules were collected for 13 transporters from Metrabase and other literature. Imbalance in the data was rectified by rationally selecting decoy nonsubstrates, resulting in a final data set comprising 4575 data points for six efflux and seven influx transporters. The efflux transporters included MDR1, ABCG2, and MRPs (MRP1−4), whereas the influx transporters were OATPs (SO1A1, SO1B1, SO1B3, and SO2B1), S15A1 (PEPT1), S22A1 (OCT1), and NTCP2 (Table 1). The majority of the data have been reported for efflux transporters, specifically MDR1 (46%) and ABCG2 (14%) (Figure 3).

In PCM models, the training data set was represented by concatenating descriptors for both small molecules and the transporters. As a result, for each combination of classifier and descriptor, a single model was developed for all 13 transporters. 2.5. Model Validation. The models were extensively validated using an unseen external set and fivefold crossvalidation. The data sets for some of the transporters were very small, which imposed problems in external data set creation: it not only reduced the sample size but also increased the chances of having external data out of the applicability domain of the training set because of reduced chemotypes. Therefore, a rational methodology was adopted to create the external data set from the original set without altering the coverage and applicability domain of the training set. For this purpose, the training data set was clustered by the affinity propagation method on the basis of Mf2.41 The affinity propagation method was used to identify the samples that are most representative of the total data set. These exemplars were then used as the external data set, as they were well within the chemotypic domain of the training set. Moreover, loss of chemical information to the external set was also reduced. This external set was never used in the feature scaling and feature selection step, so it should remain unseen for robust validation. As a result, on an average 16.5 ± 3.4% of the original data was selected as the external validation set for different transporters (Table S4). Besides external validation, the training data set was further validated by internal fivefold cross-validation. The data set was randomly divided into five equal portions. Each set was used for validation of the model developed using the remaining four subsets. The predictive performance of models was evaluated on the basis of following statistical parameters: recall (sensitivity) =

Figure 3. Percentages of data points (substrate as well as nonsubstrate) reported for the individual transporters in the collected data set. Percentages were calculated with respect to the whole data set excluding the decoy nonsubstrates. Analyses of the overlap are available in Figures S1 and S2.

TP TP + FN

Besides their localization on the cell membrane, i.e., apical or basolateral (Figure 4), other significant factors governing the net flux of substrates are level of expression and relative affinity for these transporters. The numbers of common substrates and nonsubstrates among the selected transporters were identified (Figures S1 and S2). On average one molecule is a substrate for 1.26 transporters. ABC transporters share large numbers of common substrates, with maximum overlapping substrate specificity within various MRPs. In SLC proteins, OATPs share most of the substrates. Among them, major overlap is shown by SO1B1 and SO1B3, which could be attributed to their sequence identity. NTCP2, S15A1, and S22A1 do not share significant numbers of substrates with either fellow SLC or ABC transporters, as they selectively transport specific classes of molecules. Also, there are many substrates common to efflux as well as influx transporters. Notably, MRP2, ABCG2, and MDR1 share a considerable number of substrates with the OATPs, includings steroids, statins, and other molecules reported to be excreted in the bile. Thus, the synergistic effect produced by localization of these efflux and influx transporters guides the net biliary excretion (Figure 4). It is important to note that the data set distribution and overlap analysis was limited by the fact that some transporters were explored more than others for in vitro substrate activities and not every molecule was tested against each of the transporters. 3.2. Model Development. All possible combinations of the eight sets of descriptors and 15 sets of classifiers were employed for the development of PCM and QSAR models (Figure 1). A total of 120 unified PCM models were developed

TN specificity = TN + FP CCR =

recall + specificity 2

precision =

TP TP + FP

accuracy =

TP + TN TP + TN + FP + FN

f 1 score = 2 ×

recall × precision recall + precision

MCC = TP × TN − FP × FN (TP + FP)(TP + FN)(FP + TN)(TN + FN)

where TP is the number of true positive instances, TN is the number of true negative instances, FP is the number of false positive instances, FN is the number of false negative instances, and MCC is the Matthews correlation coefficient.

3. RESULTS AND DISCUSSION 3.1. Data Set Characterization. The membrane transporters selected for this study are of pharmaceutical importance as they are primarily involved in absorption, distribution, and elimination of the administered drug molecules. The other 598

DOI: 10.1021/acs.jcim.6b00508 J. Chem. Inf. Model. 2017, 57, 594−607

Article

Journal of Chemical Information and Modeling

Figure 4. Localization of selected transporters at intestinal epithelia, brain capillary endothelia, kidney proximal tubules, and hepatocytes. Arrows represents the direction of transport. It should be noted that although many transporters are present at these locations, only those transporters used in this study are shown. Localization of transporters is sometimes contentious. The diagrammatic representation is based on the literature.1,4,42

for all transporters. However, the prediction accuracies for each transporter in a combined model were calculated separately (Tables S5 and S6). Similarly, 120 QSAR models were developed for each of the 13 transporters (for a total of 13 × 120 = 1560 models; Tables S7 and S8). Pertaining to the predictive ability, the best combinations of classifiers and descriptors were identified for each transporter. Tables 4 and 5

Table 5. Top-Performing Model (Based on CCR Values) for Each Transporter Using QSAR Modeling

Table 4. Top-Performing Model (Based on CCR Values) for Each Transporter Using PCM Modeling transporter

classifier

descriptor

cross-validation

external validation

ABCG2 MDR1 MRP1 MRP2 MRP3 MRP4 NTCP2 S15A1 S22A1 SO1A2 SO1B1 SO1B3 SO2B1 average

clfk1b clfrf clfk3 clfrf clfk3b clfsv clfrf clfrf clfrf clfk3 clfk3 clfk3b clfk5

Xr Xr Mr3 Mr2 Xr Xr Mck Mck Mr2 Mf3 Xr Mf2 Xr

0.77 0.78 0.86 0.83 0.88 0.88 0.93 0.83 0.84 0.78 0.85 0.82 0.64 0.82

0.88 0.84 0.88 0.86 0.83 0.77 0.70 0.81 0.84 0.81 0.94 0.76 0.82 0.83

transporter

classifier

descriptor

cross-validation

external validation

ABCG2 MDR1 MRP1 MRP2 MRP3 MRP4 NTCP2 S15A1 S22A1 SO1A2 SO1B1 SO1B3 SO2B1 average

clfrf clfk3 clfsgb clfnbb clfsv clfnbb clfnb clfk3 clflr clflr clfnbb clfsgb clfnb

Mr4 Xr Mr4 Mr4 Mf4 Mf4 Mr4 Mf2 Mr2 Mf4 Mr3 Mr4 Mr4

0.76 0.77 0.89 0.85 0.90 0.95 0.95 0.85 0.85 0.90 0.92 0.88 0.81 0.87

0.85 0.86 0.86 0.81 0.74 0.81 0.75 0.85 0.87 0.79 0.81 0.76 0.69 0.80

most of the time. In the case of Morgan fingerprints, lowerradius variants performed better in PCM models. However, higher-radius variants showed better prediction for QSAR models. Also, a homogeneous ensemble of weak linear classifiers (clfsgb and clfnbb) was selected as the top performer in the case of QSAR models. 3.3. Selective Fusion of Heterogeneous Classifiers. The top PCM and QSAR models for each transporter were fused to generate heterogeneous ensemble models. As a result, two final models were developed for each transporter, namely, the PCM-based heterogeneous ensemble (PHE) model and the QSAR-based heterogeneous ensemble (QHE) model. Table 6 and Figure 6 show the results for the PHE and QHE models (see Tables S9−S12 for more details). The cross-validation and external validation results were increased considerably for the

show the results for the best-performing combinations of descriptors and classifiers for the PCM and QSAR models, respectively. In the case of cross-validation, QSAR models outclassed PCM models, whereas the latter showed better performance in external validation (Figure 5). In the top PCM models, molecular descriptor and RF classifier were selected 599

DOI: 10.1021/acs.jcim.6b00508 J. Chem. Inf. Model. 2017, 57, 594−607

Article

Journal of Chemical Information and Modeling

Figure 5. Fivefold cross-validation and external validation results for the best PCM and QSAR models.

PHE and QHE models compared with the individual models. Though the external validation results for the two modeling techniques were comparable, the cross-validation results for the QHE models were better than those for the PHE models. However, in the case of external validation, the ratio of sensitivity to specificity for the PHE models is more balanced compared with that for the QHE models. In general, all of the models had good prediction parameters, with CCR > 0.75 (except for SO2B1, as a result of the smaller data set). In the case of SO2B1, the QHE model outperformed the PHE model. Because of the very low chemical diversity in substrates of NTCP2, its cross-validation results were exceptionally higher for both the PHE and QHE models. Different descriptors and classifiers were selected for the heterogeneous ensembles, as shown in Tables S13 and S14. Significant differences were observed in the types of classifiers and descriptors selected for the QHE and PHE models (Figures S3 and S4). Molecular descriptors and Morgan fingerprint (radius = 2) were selected for most of the transporters in the PHE approach. On the other hand, Morgan fingerprints (radius = 4) were frequently selected in QHE.

Table 6. CCR Values for the Heterogeneous Ensemble Models PHE model

QHE model

transporter

crossvalidation

external validation

crossvalidation

external validation

ABCG2 MDR1 MRP1 MRP2 MRP3 MRP4 NTCP2 S15A1 S22A1 SO1A2 SO1B1 SO1B3 SO2B1 average

0.80 0.80 0.88 0.87 0.89 0.89 0.93 0.83 0.86 0.81 0.85 0.83 0.66 0.84

0.89 0.89 0.95 0.87 0.83 0.90 0.75 0.81 0.89 0.88 0.94 0.76 0.89 0.86

0.81 0.81 0.92 0.88 0.94 0.97 0.95 0.85 0.87 0.91 0.92 0.90 0.86 0.89

0.89 0.87 0.88 0.89 0.89 0.85 0.75 0.85 0.91 0.95 0.88 0.81 0.82 0.87

Figure 6. Fivefold cross-validation and external validation results for the PHE and QHE models. 600

DOI: 10.1021/acs.jcim.6b00508 J. Chem. Inf. Model. 2017, 57, 594−607

Article

Journal of Chemical Information and Modeling Morgan fingerprints with lower radius were mainly found in the PHE approach, whereas Morgan fingerprints with higher radius were predominantly selected in the QHE approach. The MACCS keys was more frequently selected in PHE models than in QHE models. In the case of classifiers, RF was selected for most of the transporters in the PHE approach, followed by SVM, whereas SVM was selected mainly for the QHE approach (Figure S4). Classifiers based on LR, SGD, and NB were mostly selected in the QHE models but rarely in the PHE models. Apart from clfk1b, bagging ensembles were rarely selected in the PHE models, whereas bagging of LR and SGD yielded significant increases in the prediction accuracies of the QHE models. 3.4. Separate PCM Models for ABC and SLC Transporters. In order to analyze the performance of PCM modeling separately in the ABC and SLC transporters, PCM and PHE models were developed for ABC and SLC transporters separately. Table 7 and Figure 7 show the results

for the best-performing PCM and PHE models for each transporter. Figure S5 compares all types of models discussed in this study. Overall, there was slight increment in the performance of PCM modeling compared with the single PCM modeling, especially in the case of SLC PCM models. A similar trend was observed in the heterogeneous ensemble models with further improvement. The decreased performance of the SLC transporters in the global PCM model can be attributed to the lesser numbers of molecules reported and lesser numbers of shared substrates and nonsubstrates. However, the performance for SO2B1 in cross-validation was decreased (recall 0.56 vs 0.59 for PCM and 0.56 vs 0.66 for PHE). 3.5. Fragment Frequency Analysis. As Morgan fingerprints (radius = 4) are the most comprehensive in storing structural information and were predominantly selected in QHE, they were utilized for fragment frequency analysis. First, Morgan fingerprint (radius = 4) bits selected for development of QSAR models were identified. Thereafter, the frequencies of these bits in substrates and nonsubstrates were calculated. Only those bits having substantial frequency differences in substrates and nonsubstrates were considered for further analysis. Table S15 shows the frequency differences for the bit fragments in substrates and nonsubstrates across the selected transporters. The corresponding structural fragments for the important bits were identified and are highlighted in the representative structure. Positive values show a higher percentage of a bit fragment in substrates, whereas negative values represent a higher percentage in nonsubstrates. The fragments that are predominantly present in substrates suggest that these fragments might be involved in substrate recognition for the affinity to particular transporters. Similarly, fragments with negative values could be responsible for making the molecule a nonsubstrate for that transporter. This analysis correlates well with the established structure−activity relationships. For example, hydrophobic and cationic fragments were identified as important fragments for MDR1 substrates. Likewise, the presence of nitrogen as the hydrogen-bond donor, aromatic rings with multiple substitutions, and substituted fused rings was favored in ABCG2 substrates. Furthermore, fragments imparting cationic and hydrophilic features in the molecules were identified for S22A1 substrates. Such analysis may play a

Table 7. CCR Values for Top-Performing PCM and PHE Models While Considering ABC and SLC Transporters Separately PCM models

PHE models

transporter

crossvalidation

external validation

crossvalidation

external validation

ABCG2 MDR1 MRP1 MRP2 MRP3 MRP4 NTCP2 S15A1 S22A1 SO1A2 SO1B1 SO1B3 SO2B1 average

0.76 0.78 0.86 0.84 0.88 0.88 0.94 0.86 0.85 0.82 0.85 0.86 0.63 0.83

0.88 0.88 0.88 0.87 0.74 0.77 0.90 0.79 0.82 0.81 0.88 0.76 0.89 0.83

0.80 0.80 0.89 0.87 0.89 0.89 0.94 0.87 0.86 0.83 0.85 0.87 0.63 0.85

0.91 0.89 0.93 0.87 0.85 0.85 0.90 0.82 0.89 0.95 0.97 0.86 0.89 0.89

Figure 7. Fivefold cross-validation and external validation results for PHE and QHE models while considering ABC and SLC transporters separately. 601

DOI: 10.1021/acs.jcim.6b00508 J. Chem. Inf. Model. 2017, 57, 594−607

Article

Journal of Chemical Information and Modeling Table 8. Frequencies of Structural Fragments across Transportersa

a

FP id. is the Morgan fingerprint (radius = 4) identifier; for more details, refer to Table S13.

vital role in optimizing the molecular structure so that it has the desired transporter specificity profile. Another distinct advantage of such analysis is the comparative evaluation of important fragments for a variety of transporters. Many fragments were found to be important in multiple transporters. These fragments can also justify the overlapping substrate specificities of transporters. Table 8 shows the fragments found to be important for more than three transporters. The carbonyl carbon was found to be frequent in substrates of MRPs and nonsubstrates of SO2B1. Similar observations can be made for other fragments, such as lactone,

carboxylic hydroxyl, amide, etc. Table 9 shows the influence of the presence of halogen on the substrate specificity of membrane transporters. Halogen substitutions were not favored for MDR1 and SO22A1, whereas fluorine and 4fluorophenyl were frequent in the ABCG2 and SO2B1 substrates. Furthermore, to visualize frequent fragments in the query molecule, a code was written to generate contour maps using RDKit.43 The contour maps were generated by assigning weights to the fragments that correspond to their frequencies in substrates and nonsubstrates. Green color represents frequent 602

DOI: 10.1021/acs.jcim.6b00508 J. Chem. Inf. Model. 2017, 57, 594−607

Article

Journal of Chemical Information and Modeling Table 9. Presence of Halogens in Transporter Substrates

Figure 8. Fragment frequency maps of atorvastatin for its transporters. Green color represents fragments frequent in substrates, and red color represents fragments frequent in nonsubstrates. The intensity of the color is proportional to the frequency of the fragment.

play an important role in determining the substrate specificity of atorvastatin for MRP2 and SO1B3. Thus, these maps can be used to illustrate the important fragments responsible for the substrate specificity.

fragments in substrates, whereas red color represents frequent fragments in nonsubstrates. As an example, the fragment frequency maps of atorvastatin for its transporters are shown in Figure 8. Seemingly, unsubstituted phenyl groups appear to 603

DOI: 10.1021/acs.jcim.6b00508 J. Chem. Inf. Model. 2017, 57, 594−607

Article

Journal of Chemical Information and Modeling

Table 10. Predicted Probabilities for BBB Transporters for Passively Diffusible Molecules; Bar Length Corresponds to the Probability of Being a Substrate for the Transporter

a

CID: PubChem Compound Identifier.

3.6. Application of the Developed Models in BBB Permeability Analysis. To demonstrate the utility of the built models, membrane transporters were predicted for molecules with known BBB permeability values. For this purpose, a data set of 582 molecules was collected from the literature.17,44−46 Molecules possessing physiochemical properties facilitating easy diffusion across membrane (i.e., molecular weight