Informing the Human Plasma Protein Binding of Environmental


Article

Informing the human plasma protein binding of environmental chemicals by machine learning in the pharmaceutical space: Applicability domain and limits of predictability. Brandall L. Ingle, Brandon C. Veber, John W. Nichols, and Rogelio Tornero-Velez J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.6b00291 • Publication Date (Web): 29 Sep 2016 Downloaded from http://pubs.acs.org on September 30, 2016



Journal of Chemical Information and Modeling

Informing the Human Plasma Protein Binding of Environmental Chemicals by Machine Learning in the Pharmaceutical Space: Applicability Domain and Limits of Predictability

Brandall L. Ingle,† Brandon C. Veber,‡,§ John W. Nichols,‡ and Rogelio Tornero-Velez†,*

† U.S. Environmental Protection Agency, Office of Research and Development, National Exposure Research Laboratory, Research Triangle Park, NC 27709

‡ U.S. Environmental Protection Agency, Office of Research and Development, National Health Exposure Effects Research Laboratory, Duluth, MN 55804

§ Oak Ridge Institutes for Science and Education, Oak Ridge, TN 37830


ABSTRACT

The free fraction of a xenobiotic in plasma (Fub) is an important determinant of chemical absorption, distribution, metabolism, elimination, and toxicity, yet experimental plasma protein binding data are scarce for environmentally relevant chemicals. The present work explores the merit of using available pharmaceutical data to predict Fub for environmentally relevant chemicals via machine learning techniques. Quantitative structure-activity relationship (QSAR) models were constructed with k nearest neighbors (kNN), support vector machines (SVM), and random forest (RF) machine learning algorithms from a training set of 1045 pharmaceuticals. The models were then evaluated with independent test sets of pharmaceuticals (200 compounds) and environmentally relevant ToxCast chemicals (406 total, in two groups of 238 and 168 compounds). The selection of a minimal feature set of 10-15 2D molecular descriptors allowed for both informative feature interpretation and practical applicability domain (AD) assessment via a bounded box of descriptor ranges and principal component analysis. The diverse pharmaceutical and environmental chemical sets exhibit similarities in terms of chemical space (82-99% overlap), as well as comparable bias and variance in constructed learning curves. All the models exhibit significant predictability, with mean absolute errors (MAE) in the range of 0.10-0.18 Fub. The models performed best for highly bound chemicals (MAE 0.07-0.12), neutrals (MAE 0.11-0.14), and acids (MAE 0.14-0.17). A consensus model had the highest accuracy across both pharmaceuticals (MAE 0.151-0.155) and environmentally relevant chemicals (MAE 0.110-0.131). The inclusion of the majority of the ToxCast test sets within the AD of the consensus model, coupled with high prediction accuracy for these chemicals, indicates that the model provides a QSAR for Fub that is broadly applicable to both pharmaceuticals and environmentally relevant chemicals.


1. INTRODUCTION

The degree to which a chemical binds to plasma proteins influences its absorption, distribution, metabolism, elimination, and toxicity (ADMET). While bound to plasma proteins, xenobiotics are generally unable to cross cellular membranes, interact with molecular targets, or undergo biotransformation.¹,² Since binding to plasma proteins may strongly impact chemical kinetics and effects in biological systems, the fraction of chemical unbound by plasma proteins (Fub) serves as an important chemical-specific parameter in many in silico toxicology models.¹,³

Plasma protein binding occurs predominantly through interactions with human serum albumin (HSA), α1-acid glycoprotein (AAG), and lipoproteins.³⁻⁶ The diversity of binding sites within these proteins allows for the binding and transport of a wide array of xenobiotics.³⁻⁶ While hydrophobic compounds generally bind non-specifically to all proteins, acids and bases preferentially bind to sites on HSA and AAG, respectively.³⁻⁵

Although the bioavailability, and by extension the toxicity, of a chemical is highly dependent upon plasma protein binding, this parameter has been measured for only a small fraction of compounds of environmental concern. Quantitative structure-activity relationship (QSAR) models provide a means of predicting plasma protein binding in silico based on known values for similar compounds.³,⁶⁻¹⁰ Previously, Votano et al. (2006) used an artificial neural network to optimize an Fub model constructed with data for 808 pharmaceuticals.⁸ When evaluated against an independent test set of 200 pharmaceuticals, this model yielded a mean absolute error (MAE) of 0.141 and a root mean square error (RMSE) of 0.186.⁸ Zhu et al. (2013) performed a similar study using a pharmaceutical training set of 1242 chemicals and found that a binding model constructed using support vector machines (SVM) was preferable to models constructed with k nearest neighbors (kNN) and random forest (RF) algorithms. For an independent pharmaceutical test set of 173 compounds, the optimal SVM model had an MAE of 0.119 and an RMSE of 0.182.¹¹ The Zhu et al. (2013) study was unique in applying its Fub models to 238 environmentally relevant ToxCast chemicals; the ideal SVM model had an MAE of 0.148 and an RMSE of 0.226 for ToxCast chemicals, with less successful models having lower prediction accuracy (MAE 0.15-0.17; RMSE 0.23-0.25).¹¹

The scarcity of models evaluated for environmentally relevant chemicals, coupled with the relatively poor predictions for these chemicals in the Zhu et al. (2013) study, suggests that a new QSAR model is needed for reliable estimation of plasma protein binding for the diverse array of environmentally relevant chemicals, including pesticides, herbicides, and industrial chemicals.

The present study explores the merit of using available pharmaceutical data to construct a QSAR for prediction of Fub for environmentally relevant compounds, as previous studies have shown that these classes of chemicals have some overlap in chemical space.¹² Independent models were created with kNN, SVM, and RF algorithms and then assessed to select the ideal algorithm. The optimal balance between accuracy and a simple descriptor set was sought, with the goal of creating a QSAR model for Fub that can be applied to both pharmaceuticals and a wide array of environmentally relevant chemicals.

2. MATERIALS AND METHODS

2.1. Dataset Preparation

A pharmaceutical dataset of 1245 chemicals was derived from experimental plasma protein binding data collected and curated by Obach et al. (2008) and Zhu et al. (2013).¹⁰,¹¹ Plasma protein binding data for 406 environmental contaminants in the ToxCast dataset were collected from two publications by Wetmore et al. (2012, 2015).¹³,¹⁴ When a chemical appeared in multiple datasets, the average reported value was taken, except in 21 cases where the variation in reported values was extremely high (> 0.30 Fub). These latter compounds were eliminated from further consideration. When a chemical occurred in both a pharmaceutical and a ToxCast dataset, it was classified as a pharmaceutical. The resulting datasets encompass a wide range of structural, electronic, and physicochemical properties (chemical space), as well as diverse Fub values (Figure 1).

Figure 1. Histogram of the fraction unbound by plasma protein for the entire pharmaceutical (dark blue) and environmental (light green) sets.

When no structures were included with the experimental data (chemicals exclusive to the Obach and Wetmore datasets), generic SMILES strings from the ChemSpider database were used for the 2D structure of chemicals.¹⁵,¹⁶ Small salts and solvent molecules were removed. Charges were neutralized on the larger organic molecules, with the exception of quaternary amines. In the case of mixtures, a representative chemical structure was retained. For example, emamectin benzoate (CAS 155569-91-8) refers to a mixture of two homologous compounds with 48-49 carbons that differ by a single methyl group; as the mixture typically contains 90% of the methylated structure, the homolog with the methyl group was selected as the representative structure.¹⁷ A single compound within the Obach dataset was removed due to the lack of a reliable structure. The compounds were then characterized with a set of 192 2D descriptors calculated using the MOE software.¹⁸

Previous studies found that QSAR models developed using association constants (ln Ka) provided better predictions of plasma protein binding than those based on Fub; therefore, experimental Fub values were converted into pseudo-equilibrium constants (ln Ka) for model construction, and the resulting predictions were then converted back to Fub for assessment of model accuracy.¹¹

\[ \ln K_a = 0.3 \times \ln\left(\frac{1 - F_{ub}}{F_{ub}}\right) \]
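The conversion between Fub and the pseudo-equilibrium constant, and its inverse, can be sketched in a few lines of Python. This is a minimal illustration of the transform ln Ka = 0.3 × ln((1 − Fub)/Fub) rather than the authors' code; the function names and the `eps` clipping of Fub values at exactly 0 or 1 (where the logarithm is undefined) are assumptions introduced here.

```python
import math

SCALE = 0.3  # scaling constant in the ln Ka transform


def fub_to_ln_ka(fub, eps=1e-4):
    """Convert fraction unbound to the pseudo-equilibrium constant ln Ka.

    `eps` clipping is an assumption of this sketch; it keeps the
    logarithm finite when Fub is reported as exactly 0 or 1.
    """
    f = min(max(fub, eps), 1.0 - eps)
    return SCALE * math.log((1.0 - f) / f)


def ln_ka_to_fub(ln_ka):
    """Invert the transform: Fub = 1 / (1 + exp(ln Ka / 0.3))."""
    return 1.0 / (1.0 + math.exp(ln_ka / SCALE))


# Round trip for a highly bound chemical (Fub = 0.05).
ln_ka = fub_to_ln_ka(0.05)
recovered = ln_ka_to_fub(ln_ka)
```

Because the transform stretches the scale near Fub = 0 and Fub = 1, a small error in ln Ka for a highly bound chemical maps back to a small error in Fub.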

The relative acidity and basicity of the compounds were used to divide the datasets into biologically relevant clusters. The ADMET Predictor software was used to calculate acid dissociation constants (pKa) and corresponding ionization microstates at pH = 7.4 for each compound.¹⁹⁻²¹ Chemicals with > 10% anionic species were classified as “acids,” while those with > 10% cationic species were deemed “bases.” Any compound with < 10% ionic species was considered “neutral.” Finally, compounds with > 10% of the neutral state due to a zwitterion (positive and negative on the same species) and chemicals with independent sites predicted to have > 10% anionic and > 10% cationic species were branded “zwitterions,” due to the similar combination of charged and neutral states. This classification scheme captures the tendency of a compound to exhibit significant protonation/deprotonation at the physiological pH of human plasma.

2.2. Development of QSAR Models


Pharmaceuticals were split into a training set of 1045 compounds and a randomly selected independent test set of 200. The ToxCast chemicals from each publication were kept in independent test sets (238 in ToxCast I, 168 in ToxCast II) in order to assess predictability across three unique chemical test sets. Throughout the development of the Fub models, relative prediction errors in the drug and ToxCast I test sets guided selection of the optimal models, which were then applied to the ToxCast II chemicals as a completely blind test set.

Within the multidimensional space outlined by chemical descriptors, the kNN algorithm averaged the Fub values of the k closest neighbors from the training set to provide a prediction for test set chemicals.²²,²³ This algorithm's utility lies in its simplicity, which makes it a common benchmark for more advanced machine learning methods. The only tuning parameter was k, the number of nearest neighbors considered during evaluation. Using 3-fold cross-validation, the optimal number of neighbors was determined to be 10.

Since SVMs have had considerable success in a number of different classification and regression problems, the algorithm was included as an example of a standard machine learning method. To control model complexity, the optimization problem only considers training samples that lie outside the epsilon tube (ε), a tunable parameter that controls the acceptable amount of training error.²³ The ε-tube in multidimensional feature space was constructed with a radial basis function kernel.²⁴ To balance model complexity and training error, a grid search with 3-fold cross-validation was used to select ε and the complexity cost (C), with optimal values of ε = 0.3162 and C = 50.

Finally, the RF algorithm generated a large collection of decision trees in which each individual tree was built using a randomly selected subset of the training set and features.²⁵ Output values were determined by averaging the results from all of the trees. The number of trees served as an important tunable parameter, as more trees generally yield better model performance at the cost of a longer computational runtime. A 3-fold cross-validation scheme led to the selection of 500 trees.

Features of the models (descriptors) were ranked using the internal feature rankings (node purity) of the RF regressor to ensure that reduced feature sets contained only the most valuable information.²⁶ In order to reduce overfitting, a randomly selected half of the training samples was used to create an RF model and the top ranked features were recorded. Due to the inherent stochasticity of the RF algorithm, this process was repeated to generate ten unique RF models, and only the most frequently recurring top-ranked features were retained for the optimally reduced, robust feature set.

A consensus model consisting of the average Fub prediction across the kNN, SVM and RF models was also considered. All models were constructed with the Scikit-learn module in Python.²⁷ Models are available at https://bitbucket.org/bveber/chem.

The suitability of the pharmaceutical training set size was assessed with learning curves for each model.²⁸ Bayesian information criterion (BIC) curves were used to balance the complexity and accuracy of models and to select the ideal number of descriptors for each model by gauging the impact of descriptor set size (randomly selected descriptors) on model performance.²⁹ The small set of descriptors for each model was analyzed to determine whether test compounds fell within the applicability domain (AD) of the pharmaceutical training set. Both a bounding box approach delineated by the descriptor ranges (univariate) and principal component (PC) analysis (multivariate) were performed.³⁰,³¹
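To make the kNN and consensus steps concrete, the sketch below implements them with the Python standard library only. It is an illustration under stated assumptions, not the authors' Scikit-learn code: the function names are hypothetical, Euclidean distance on already-scaled descriptors is assumed, and the toy data are invented for the example. The default k = 10 follows the value reported above.

```python
import math


def knn_predict(train_X, train_y, query, k=10):
    """Average the target values of the k training rows nearest to `query`.

    Assumes Euclidean distance on descriptors that were scaled beforehand.
    """
    ranked = sorted(
        (math.dist(row, query), y) for row, y in zip(train_X, train_y)
    )
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


def consensus(predictions):
    """Unweighted mean of the per-model predictions for one chemical."""
    return sum(predictions) / len(predictions)


# Toy data: four training chemicals described by two descriptors each.
train_X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_y = [0.1, 0.2, 0.3, 0.9]

knn_value = knn_predict(train_X, train_y, [0.0, 0.0], k=3)
# Hypothetical SVM and RF outputs stand in for the other two models.
consensus_value = consensus([knn_value, 0.25, 0.15])
```

Averaging the three models is the simplest possible ensemble; it tends to damp a large error made by any single model, which is the benefit the consensus approach is meant to provide.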


As a supplement to a purely feature-based evaluation of AD, reliability estimates were calculated with the local cross-validation (LCV) error method for each independent model, and a modified version of the 3D reliability estimate proposed by Sheridan (2012) was applied to the consensus model.³²,³³ The LCV method weighs the cross-validation error of neighboring training set molecules by the distance to the test chemical of interest.³² Scans of 2-50 potential neighbors using the kd-tree algorithm revealed a relatively minor dependence of reliability estimates on the number of neighbors used. Optimal performance (MAE 0.096) was achieved using 30 neighbors for each model (Figure S1, in Supplementary Materials).³⁴ For the consensus model, a 3D reliability estimate was created by splitting the training set into a 3D array of 27 bins based on 1) the similarity to the 5 nearest neighbors (average distance with the kd-tree algorithm), 2) the standard deviation in cross-validation prediction error across kNN, SVM and RF predictions, and 3) the predicted Fub value. The root mean square error (RMSE) of training set chemicals within each bin was calculated and compared to that of test set chemicals within those same bins.

3. RESULTS

3.1. Model construction and applicability domain

As shown in Figure 2, learning curves for each model confirm that a training set of 1045 pharmaceuticals is sufficiently large to predict Fub. The learning curves of the pharmaceutical test set and the ToxCast I chemicals show similar RMSE with respect to training set size, suggesting that ~200 training chemicals are sufficient for reasonable predictions in both sets. In each model, the RMSE for the test sets converges around 0.2, implying a potential cap on the predictability of the models.

Complementary BIC curves were used to identify the ideal number of descriptors for the kNN, SVM and RF models (Figure 2). The BIC curve for the kNN model shows that 15 descriptors are optimal. In contrast, the BIC curve for the SVM model declines sharply from 0-10 descriptors and then gradually from 10-30 descriptors. Since the BIC changes very little with the addition of more than 10 descriptors, the SVM model with 10 descriptors was pursued to minimize complexity. For the RF model, a clear minimum in BIC occurs at 10-20 descriptors, so 10 descriptors were selected as ideal. Based on the learning and BIC curves, the construction of kNN, SVM and RF models with 1045 compounds in the training set and only 10-15 descriptors is justified.
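The trade-off captured by a BIC curve can be illustrated with a short sketch. The text does not state which BIC expression was used, so the common Gaussian-error form BIC = n ln(RSS/n) + p ln(n) is assumed here, with the number of descriptors p standing in for model complexity; the function name and the toy residuals are hypothetical.

```python
import math


def bic_gaussian(residuals, n_descriptors):
    """BIC under a Gaussian error model: n*ln(RSS/n) + p*ln(n).

    Using the descriptor count as the complexity term p is an
    assumption of this sketch.
    """
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return n * math.log(rss / n) + n_descriptors * math.log(n)


# With identical residuals, adding descriptors only adds penalty,
# so the smaller descriptor set has the lower (better) BIC.
residuals = [0.2, -0.1, 0.15, -0.25, 0.05] * 20
bic_10 = bic_gaussian(residuals, 10)
bic_30 = bic_gaussian(residuals, 30)
```

Under this form, extra descriptors lower the BIC only while the drop in residual error outweighs the ln(n) penalty per descriptor, which matches the flattening described for the SVM curve beyond 10 descriptors.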


Figure 2. Learning curves for (a) kNN, (b) SVM and (c) RF algorithms for the training set (blue), drug test set (green) and ToxCast I compounds (red). Bayesian information criterion (BIC) curves for (d) kNN, (e) SVM and (f) RF. Means are shown as solid lines; shaded areas extending to the dotted lines represent one standard deviation.

While the three models each contain a unique combination of descriptors that cover a variety of chemical properties, metrics of hydrophobicity and polarity are highly ranked in each model, as illustrated in Table S1 (Supplementary Materials). Within the kNN model, 3 of the top 5 descriptors relate to hydrophobicity and aqueous solubility: SlogP, logS, and logP(o/w). Additional descriptors in this model account for molecular shape, surface area, polarity, and chemical charge. The 10 descriptors in the SVM model highlight the importance of a different set of chemical properties. Descriptors for hydrophobicity, solubility, atom count and bonds dominate the model in both number and relative importance, while additional descriptors relate to polar surface area. The three top ranked descriptors in the RF model are metrics of hydrophobicity and aqueous solubility: logS, logP(o/w), and SlogP. Both the 4th and 10th ranked descriptors reflect the positive polar surface area (PEOE_VSA_FPPOS and PEOE_VSA_PPOS). The remaining descriptors account for the shape, molar refractivity, and partial charges of compounds.

When utilizing the bounded box approach based on the training set descriptor ranges, only a small fraction of test set chemicals in each Fub model have at least one descriptor value outside the range of that descriptor in the training set, thus making these chemicals AD outliers (Table S2). Within the kNN model, 2%, 6% and 18% of pharmaceutical, ToxCast I and ToxCast II test set chemicals, respectively, fall outside the bounded box. In contrast, less than 1% of all test set chemicals fall outside of the training set ranges of the SVM model. Finally, 1% of pharmaceuticals, 4% of ToxCast I, and 16% of ToxCast II chemicals fall outside the bounded box AD of the RF model. The relatively higher number of ToxCast II chemicals outside the AD stems from differences between the molecular shape of these chemicals and that of the pharmaceutical training set, as enumerated by distance and adjacency matrix descriptors. Across all models, the full test set of both pharmaceuticals and ToxCast chemicals falls entirely within the range of the 4-7 top ranked descriptors, indicating that the most important physicochemical properties in the Fub models are well covered by the training set.

In a multivariate PC approach to AD assessment in the Fub models, most test set chemicals fall within the AD as defined by the full range of training set PCs (Figure 3 and Table S3). In the kNN model, 1% of test set pharmaceuticals, 4% of ToxCast I and 13% of ToxCast II chemicals fall outside of the 15 training set PC ranges. The SVM model has the greatest overlap between the 10 PCs of the training set and test sets, with less than 2% of test set chemicals outside the AD. In the RF model, the majority of test set chemicals are within the 10 PC ranges, with only 1% of test set pharmaceuticals, 3% of ToxCast I and 8% of ToxCast II chemicals out of the AD. In the consensus model, 4% of pharmaceuticals, 10% of ToxCast I and 18% of ToxCast II chemicals fall outside the full 21 PC ranges, but only 3% of pharmaceuticals, 5% of ToxCast I and 15% of ToxCast II chemicals fall outside the range of the top 10 PCs. Across both AD assessment metrics and all three models, pharmaceuticals show the most similarity to the training set, followed by ToxCast I and then ToxCast II chemicals.
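The bounded box check behind the percentages above reduces to a per-descriptor range test. The following standard-library sketch shows one way to express it; the function names and toy values are hypothetical and not taken from the authors' code. The multivariate variant would apply the same range test to principal component scores instead of raw descriptors.

```python
def descriptor_ranges(train_X):
    """Per-descriptor (min, max) over the training set rows."""
    columns = list(zip(*train_X))
    return [(min(col), max(col)) for col in columns]


def outside_bounding_box(ranges, x):
    """Indices of descriptors whose value lies outside the training range."""
    return [
        i for i, ((lo, hi), value) in enumerate(zip(ranges, x))
        if value < lo or value > hi
    ]


def fraction_outside(ranges, test_X):
    """Fraction of test chemicals with at least one out-of-range descriptor."""
    flagged = sum(1 for x in test_X if outside_bounding_box(ranges, x))
    return flagged / len(test_X)


# Toy example: two descriptors, three training chemicals.
ranges = descriptor_ranges([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])
test_X = [[1.0, 15.0], [3.0, 15.0], [1.0, 5.0], [2.0, 30.0]]
share = fraction_outside(ranges, test_X)
```

Returning the offending descriptor indices, rather than a bare flag, makes it possible to see which properties push a chemical out of the domain, as in the observation about shape descriptors for ToxCast II.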


Figure 3. Plots of the first three principal components in the consensus model for the pharmaceutical training set (purple), pharmaceutical test set (green), ToxCast I set (yellow) and ToxCast II set (red).

3.2. Model performance

Models are compared on the basis of MAE, RMSE, and coefficient of determination (R²) for the cross-validated training set values as well as the three independent test sets. Applying the R² metric to these non-linear models with data weighted towards highly bound compounds is not ideal and leads to low values for these and many other Fub models.⁶⁻¹⁰,³⁵ However, the R² metric was included as a complement to the more informative prediction error and error variance metrics of MAE and RMSE, respectively.

The kNN, SVM and RF models, as well as a consensus model built from the average Fub prediction from the three independent models, all show reasonable Fub predictions, with especially good performance for the environmental chemicals of the ToxCast datasets (Table 1). Across 606 chemicals in all test sets, the kNN model predicts Fub fairly well, with an MAE of 0.145, an RMSE of 0.228, and an R² of 0.45. Of all the models explored, the SVM model exhibits the worst performance for the entire test set (MAE = 0.156; RMSE = 0.228; R² = 0.47). Of the three independent models, the RF model is the most predictive (MAE = 0.131; RMSE = 0.218; R² = 0.51). Overall, the consensus model (average of kNN, SVM and RF) proves to be the optimal Fub model, with the full test set predictions exhibiting a low MAE (0.133), the lowest RMSE (0.206) and the highest R² (0.56). The consensus model outperforms the kNN, SVM and RF models for each test set and for a 5-fold cross-validation of the training set in terms of RMSE and R², which indicates that the consensus approach provides a benefit in safeguarding against large prediction errors, relative to independent models.

The accuracy of Fub predictions generated by the consensus model for the pharmaceutical test set approaches that of the pharmaceutical training set (MAE = 0.155, 0.151; RMSE = 0.225, 0.208; R² = 0.62, 0.55, respectively). Interestingly, Fub predictions for both ToxCast I and ToxCast II test sets are better than those for the pharmaceutical test set. The smaller (168) ToxCast II dataset exhibits the best predictions (MAE = 0.110; RMSE = 0.181; R² = 0.56), followed closely by the larger (238) ToxCast I dataset (MAE = 0.131; RMSE = 0.206; R² = 0.39). Thus, the consensus model not only exhibits the highest accuracy of all the constructed QSARs, it also performs best for the environmentally relevant chemicals of interest.
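For reference, the three metrics used to compare the models can be computed as in the sketch below. The function names and toy values are hypothetical, and R² is assumed to be the coefficient of determination 1 − SS_res/SS_tot; the text does not say whether a squared correlation coefficient was used instead.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error; large misses weigh more than in MAE."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed form)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy experimental and predicted Fub values.
y_true = [0.1, 0.5, 0.9]
y_pred = [0.2, 0.5, 0.7]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), r2(y_true, y_pred))
```

Because RMSE squares each error before averaging, a model can have a competitive MAE and still a noticeably higher RMSE when it makes a few large misses, which is why the two are reported side by side.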

Table 1. Performance metrics for Fub prediction.

Dataset          Metric   kNN     SVM     RF      Consensus
Drug Training*   MAE      0.116   0.172   0.146   0.151
                 RMSE     0.233   0.237   0.214   0.208
                 R²       0.48    0.44    0.52    0.55
Drug Test        MAE      0.164   0.177   0.157   0.155
                 RMSE     0.242   0.251   0.231   0.225
                 R²       0.52    0.50    0.59    0.62
ToxCast I        MAE      0.146   0.140   0.130   0.131
                 RMSE     0.228   0.209   0.226   0.206
                 R²       0.25    0.39    0.25    0.39
ToxCast II       MAE      0.122   0.155   0.103   0.110
                 RMSE     0.211   0.224   0.187   0.181
                 R²       0.41    0.39    0.53    0.56
Complete Test    MAE      0.145   0.156   0.131   0.133
                 RMSE     0.228   0.228   0.218   0.206
                 R²       0.45    0.47    0.51    0.56

*Training set predictions from 5-fold cross-validation. Mean absolute error (MAE) and root mean square error (RMSE) values for k nearest neighbors (kNN), support vector machines (SVM), random forest (RF), and consensus models. The complete test set encompasses all chemicals in the pharmaceutical and environmental (ToxCast) test sets.

The pharmaceutical and environmentally relevant chemicals exhibit different plasma protein binding profiles. While the vast majority of ToxCast I and II chemicals are highly bound (Fub < 0.15), the pharmaceutical dataset contains nearly equal portions of highly (Fub < 0.15) and moderately (0.15 < Fub < 0.85) bound compounds. Across all test sets and models, prediction errors increase with the experimental Fub and thus vary inversely with the degree of binding (Table S4). The worst predictions are generated for weakly bound chemicals (Fub > 0.85) (MAE = 0.294-0.340; RMSE = 0.407-0.443). The models perform optimally for highly bound chemicals, with errors less than half the size of those for the weakly bound set (MAE = 0.071-0.119; RMSE = 0.128-0.178). Thus, all models perform well for highly bound chemicals, such as those that dominate the environmentally relevant ToxCast datasets. Indeed, 70% of consensus predictions for highly bound pharmaceuticals and ToxCast chemicals are within 0.10 of experimental values. Among moderately bound chemicals, 45% of pharmaceuticals and 47% of ToxCast chemicals are predicted within 0.10 by the consensus model, with 82% of pharmaceuticals and 80% of ToxCast chemicals predicted within 0.30. For weakly bound chemicals, only 25% of pharmaceuticals and 18% of ToxCast chemicals have Fub predicted by the consensus model within 0.10 of experimental values. Clearly, the consensus Fub model performs well for highly bound compounds, the class most relevant to the environmental test sets, but performs less well for weakly bound compounds.

Pharmaceuticals and environmentally relevant chemicals exhibit significantly different ionization profiles (Figure 4, Table 2). Both the pharmaceutical training and test sets contain nearly equal proportions of acids and bases (33-36%), approximately one quarter neutrals (22-27%), and a small fraction of zwitterions (5-7%). In contrast, the ToxCast datasets are composed largely of neutrals (76-77%). Acids (15-20%) dominate the charged ToxCast chemicals, with relatively small numbers of bases (3-6%) and zwitterions (1-2%) present. Despite having a percentage of neutral compounds similar to that of the ToxCast I dataset, the ToxCast II set contains a higher percentage of bases and zwitterions. The composition of ionization states in each dataset is especially relevant, as the performance of the QSAR models varies greatly across ionization state.
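The ionization classes tallied above follow the > 10% thresholds described in Section 2.1. A minimal sketch of that decision rule is given below; it assumes the microstate fractions at pH 7.4 have already been computed by external software, and the function names, argument names and example fractions are hypothetical.

```python
from collections import Counter


def ionization_class(frac_anionic, frac_cationic,
                     frac_zwitterionic=0.0, threshold=0.10):
    """Assign acid / base / neutral / zwitterion from microstate fractions."""
    both_charged = frac_anionic > threshold and frac_cationic > threshold
    if frac_zwitterionic > threshold or both_charged:
        return "zwitterion"
    if frac_anionic > threshold:
        return "acid"
    if frac_cationic > threshold:
        return "base"
    return "neutral"


def composition(classes):
    """Fraction of a dataset falling in each ionization class."""
    counts = Counter(classes)
    return {name: count / len(classes) for name, count in counts.items()}


# Invented microstate fractions for four chemicals.
labels = [
    ionization_class(0.95, 0.00),
    ionization_class(0.02, 0.60),
    ionization_class(0.05, 0.05),
    ionization_class(0.30, 0.40),
]
profile = composition(labels)
```

The zwitterion test is evaluated first so that a compound with both anionic and cationic populations above the threshold is not labeled as an acid simply because that check would otherwise come earlier.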

Figure 4. The (a) composition of pharmaceutical and ToxCast datasets and (b) prediction quality with the Random Forest model are shown across ionization categories: acids (red), bases (blue), neutrals (grey), and zwitterions (purple).

Table 2. Performance of Fub models with respect to ionization shows optimal performance in neutrals and acids (> 10% anions).

                        Training*       Drug Test Set   ToxCast I       ToxCast II      Complete Test
Ionization  Method      MAE    RMSE     MAE    RMSE     MAE    RMSE     MAE    RMSE     MAE    RMSE
Acid        kNN         0.141  0.207    0.149  0.254    0.171  0.259    0.097  0.146    0.147  0.240
Acid        SVM         0.160  0.222    0.170  0.257    0.200  0.268    0.135  0.178    0.174  0.249
Acid        RF          0.127  0.194    0.147  0.241    0.156  0.257    0.078  0.128    0.138  0.231
Acid        Consensus   0.132  0.185    0.137  0.234    0.167  0.251    0.098  0.123    0.140  0.225
Base        kNN         0.183  0.245    0.172  0.215    0.186  0.226    0.199  0.300    0.176  0.227
Base        SVM         0.193  0.259    0.177  0.237    0.149  0.194    0.191  0.250    0.176  0.235
Base        RF          0.168  0.242    0.164  0.217    0.261  0.356    0.199  0.301    0.176  0.243
Base        Consensus   0.171  0.234    0.165  0.208    0.192  0.236    0.188  0.274    0.170  0.219
Neutral     kNN         0.178  0.254    0.156  0.245    0.137  0.220    0.117  0.207    0.132  0.218
Neutral     SVM         0.146  0.214    0.141  0.212    0.124  0.192    0.152  0.223    0.136  0.206
Neutral     RF          0.140  0.202    0.136  0.198    0.118  0.210    0.100  0.185    0.113  0.200
Neutral     Consensus   0.145  0.201    0.138  0.201    0.119  0.191    0.103  0.177    0.116  0.187
Zwitterion  kNN         0.159  0.206    0.227  0.299    0.044  0.044    0.250  0.358    0.222  0.305
Zwitterion  SVM         0.250  0.302    0.317  0.376    0.057  0.057    0.288  0.384    0.297  0.368
Zwitterion  RF          0.158  0.216    0.239  0.327    0.046  0.046    0.136  0.203    0.207  0.296
Zwitterion  Consensus   0.179  0.218    0.242  0.312    0.049  0.049    0.224  0.292    0.228  0.300

*Training set predictions from 5-fold cross validation. Mean absolute error (MAE) and root mean square error (RMSE) values for k nearest neighbors (kNN), support vector machine (SVM), random forest (RF), and consensus models. The complete test set encompasses all chemicals in the pharmaceutical and environmental (ToxCast) test sets.

All of the machine learning algorithms provide more accurate predictions for acids and neutrals than for bases and zwitterions (Table 2). Using the MAE and RMSE of the full (606) test set as metrics, the kNN, SVM, and RF models generate good predictions for neutrals (MAE = 0.132, 0.136, 0.113; RMSE = 0.218, 0.206, 0.200, respectively). Predicted binding values for acids obtained using the kNN model are substantially better than those generated for bases (MAE = 0.147, 0.176; RMSE = 0.240, 0.227, respectively). In contrast, the performance of the SVM model for acids and bases was similar (MAE = 0.174, 0.176; RMSE = 0.259, 0.235). The RF predictions for acids are markedly better than those for bases (MAE = 0.138, 0.176; RMSE =
0.231, 0.243, respectively). For all of these models, the least reliable predictions are obtained for zwitterions (MAE = 0.207-0.297; RMSE = 0.296-0.368). Similar trends occur across all training and independent test sets for the kNN, SVM, and RF models.

Binding predictions obtained using the consensus model are generally better than those generated by the individual models, yet the trend in predictive power for each ionization state (neutrals > acids > bases > zwitterions) remains essentially the same. Within the full test set of 606 pharmaceutical and ToxCast chemicals, the consensus model makes excellent predictions for neutrals and acids (MAE = 0.116, 0.140; RMSE = 0.187, 0.225, respectively), but the base and zwitterion predictions are not as reliable (MAE = 0.170, 0.228; RMSE = 0.219, 0.300, respectively). Overall, the consensus Fub model outperforms the kNN and SVM models across ionization groups and offers comparable performance to the RF model.

The consensus Fub predictions for the ToxCast test sets reflect these trends across ionization states. Neutral ToxCast I and II chemicals show similar accuracy (MAE = 0.119, 0.103; RMSE = 0.191, 0.177, respectively), which is significantly higher than that of the pharmaceutical datasets (MAE = 0.145, 0.138; RMSE = 0.201, 0.201, respectively). While the consensus model yields reasonable predictions for ToxCast acids as a whole, predictions for ToxCast II are much more accurate than for ToxCast I (MAE = 0.098, 0.176; RMSE = 0.123, 0.251, respectively). The consensus Fub predictions for the 18 ToxCast I and II bases have fairly low accuracy (MAE = 0.192, 0.188; RMSE = 0.236, 0.274, respectively), while the predictions for the 5 zwitterions are mixed (MAE = 0.049, 0.224; RMSE = 0.049, 0.292, respectively). Due to the relatively small number of bases and zwitterions in the ToxCast datasets, it is difficult to know whether these trends can be generalized to other environmentally relevant chemicals. Therefore, the larger test
set (including pharmaceuticals) may be a better performance metric for cation and zwitterion predictions.

The AD as determined by the ranges of descriptor values within the training set does not separate those chemicals with poor predictions from those with reliable predictions. The MAE of test set chemicals outside the bounded box of descriptor ranges is well below that of the entire test set in each model (0.097, 0.011, and 0.080 for the kNN, SVM, and RF models, respectively). Across all models, 85% or more of the chemicals outside the bounded box AD have predictions within 0.10 Fub of the experimental value. Indeed, while the consensus model predicts eight chemicals in total with extreme error (> 0.7 Fub) in the entire test set, none of these chemicals fall outside the bounded box of the consensus model.

Those chemicals outside the ADs delineated by the full PCs of each model have higher Fub prediction errors than those identified by the bounded box of descriptor ranges (MAE of 0.117, 0.092, and 0.188 for the kNN, SVM, and RF models, respectively) (Table 3 and Table S3). While the MAE of chemicals outside the AD of the kNN and SVM models is still below that of the entire test set, the chemicals outside the AD of the RF model have a higher error than the test set as a whole. For those test set chemicals outside the AD as defined by PCs, the majority (68-90%) are predicted within 0.10 Fub. Two chemicals outside the ranges of the PCs of the kNN model have extreme prediction errors > 0.70 Fub (1,3-diphenylguanidine and tannic acid). Interestingly, the AD of the RF model isolates four chemicals with prediction errors > 0.70 Fub (daptomycin, difenzoquat methyl sulfate, 1,3-diphenylguanidine, and tannic acid), which comprise nearly a third of all high error predictions in the model. The AD delineated by PCs isolates more high error compounds than the bounded box of descriptor ranges across all models.
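The bounded box AD discussed above reduces to a per-descriptor range check against the training set. A minimal sketch follows, assuming descriptors are stored as rows of numbers; the function names, descriptor values, and errors are hypothetical and are not taken from the study.

```python
def descriptor_ranges(training_rows):
    """Return the (min, max) of every descriptor column in the training set."""
    return [(min(column), max(column)) for column in zip(*training_rows)]


def outside_bounded_box(row, ranges):
    """True when at least one descriptor lies outside its training-set range."""
    return any(v < low or v > high for v, (low, high) in zip(row, ranges))


def mean_absolute_error(errors):
    """Mean of absolute errors, or None when the list is empty."""
    return sum(errors) / len(errors) if errors else None


# Hypothetical two-descriptor data (for example logP and molecular weight).
training = [[1.2, 250.0], [3.5, 410.0], [-0.4, 180.0]]
test = [[2.0, 300.0], [5.1, 390.0], [0.0, 120.0]]
abs_errors = [0.04, 0.09, 0.30]  # |predicted Fub - experimental Fub|

ranges = descriptor_ranges(training)
flags = [outside_bounded_box(row, ranges) for row in test]
outlier_mae = mean_absolute_error([e for e, out in zip(abs_errors, flags) if out])
print(flags, outlier_mae)
```

The PC-based AD applies the same kind of range test to principal component scores of the test chemicals rather than to the raw descriptors; the projection step is omitted here.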

Table 3. Analysis of principal components (PCs) for each model. The variance within the training set covered by the components, as well as the mean absolute error (MAE) and root mean square error (RMSE) of all test set chemicals outside the training set PC ranges are provided.

Model      Number PCs  Variance Covered  Number Outliers  Fub Error > 0.7  MAE of Outliers  RMSE of Outliers
kNN        15          100%              35               2                0.117            0.236
kNN        10          96%               23               1                0.076            0.198
kNN        6           83%               12               0                0.028            0.033
kNN        3           62%               12               0                0.028            0.033
SVM        10          100%              10               0                0.092            0.207
SVM        6           94%               7                0                0.104            0.243
SVM        3           69%               1                0                0.036            0.036
RF         10          100%              22               4                0.188            0.356
RF         6           93%               14               1                0.070            0.205
RF         3           72%               3                0                0.008            0.008
Consensus  21          100%              60               1                0.117            0.220
Consensus  15          98%               49               1                0.108            0.217
Consensus  10          89%               41               1                0.089            0.205
Consensus  6           74%               33               0                0.031            0.038
Consensus  3           55%               33               0                0.031            0.038

Reliability estimates from LCV errors show that training set prediction errors do not always reflect the test set errors, but can serve as a general guideline for predictive capability (Table 4). Despite a similar reliance on neighbors in both the LCV and the kNN model, the reliability estimates for the kNN model exhibit the worst performance (MAE 0.133, RMSE 0.178). The
most accurate reliability estimates are seen in the SVM model (MAE 0.097, RMSE 0.148), while the RF reliability estimates fall between the kNN and SVM estimates (MAE 0.115, RMSE 0.166). The LCV errors tend to underestimate prediction errors. While large prediction errors occurred in each model, the reliability estimates were always less than 0.35, indicating a maximal error prediction that is far lower than that exhibited by the kNN, SVM and RF models. Although the LCV tends to underestimate the errors in Fub predictions overall, the LCV errors are within 0.10 for 43%, 70%, and 63% of kNN, SVM and RF predictions, respectively. The LCV error estimates perform best for the SVM model, but do not provide a clear way to differentiate the chemical space with the best Fub predictions.

Table 4. The accuracy of local cross validation (LCV) with 30 neighbors for the full test set.*

Model  MAE    RMSE   < RE  ± 0.1 RE
kNN    0.133  0.178  70%   43%
SVM    0.097  0.148  66%   70%
RF     0.115  0.166  73%   63%

*Includes mean absolute error (MAE) and root mean square error (RMSE) of the reliability estimate relative to the actual error in the entire test set (606), as well as the percentage of chemicals with errors less than the reliability estimate (< RE) and the percentage of chemicals with errors within 10% of the reliability estimate (± 0.1 RE).
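The four columns of Table 4 can be computed from paired lists of actual absolute errors and reliability estimates. The sketch below assumes one reading of the footnote: MAE and RMSE are taken over the difference between the estimate and the actual error, "< RE" counts errors below the estimate, and "± 0.1 RE" counts errors within 0.1 of it. The function name and the values are hypothetical.

```python
import math


def evaluate_reliability(actual_errors, estimates, band=0.1):
    """Compare reliability estimates (RE) with actual absolute errors."""
    diffs = [re - err for err, re in zip(actual_errors, estimates)]
    n = len(diffs)
    return {
        # Accuracy of the estimate itself.
        "mae": sum(abs(d) for d in diffs) / n,
        "rmse": math.sqrt(sum(d * d for d in diffs) / n),
        # Fraction of chemicals whose actual error is below the estimate.
        "below_re": sum(err < re for err, re in zip(actual_errors, estimates)) / n,
        # Fraction of chemicals whose actual error is within `band` of the estimate.
        "within_band": sum(abs(d) <= band for d in diffs) / n,
    }


# Hypothetical values for illustration only (not data from the study).
actual_errors = [0.05, 0.20, 0.80, 0.10]
estimates = [0.10, 0.15, 0.30, 0.12]
print(evaluate_reliability(actual_errors, estimates))
```

In this toy example the third chemical has a large actual error but a modest estimate, which is the pattern the text describes for extreme errors that the LCV estimate does not flag.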

The 3D reliability estimate for the consensus model offers more insight into prediction quality than LCV. An uneven distribution of both training set and test set chemicals across the 27 bins results in mixed quality of error predictions across bins (Table S5). For example, Bin 3 (neighbor distance > 2, standard deviation in predictions < 0.04, Fub prediction < 0.13) has an RMSE of 0.099 Fub based on 81 training set chemicals; the 198 test set chemicals in this bin have an RMSE of 0.103 Fub, with 93% of the test set chemicals having an error less than the RMSE of the bin. In contrast, it is difficult to assign an accurate reliability estimate for low occupancy bins
such as Bin 7 (neighbor distance < 0.14, standard deviation in predictions > 0.104, Fub prediction < 0.13), which has an RMSE based on only 2 training set chemicals and no test set chemicals to evaluate the metric. When applying the 3D reliability estimates, one should consider both the training set bin occupation and performance relative to the test sets (Table S5). In the 3D reliability estimates, there is a general trend of low bin RMSE associated with high similarity to neighbors, low predicted Fub values and low standard deviations (Table S5). Of the 17 bins that contain test chemicals, 11 have test RMSE within 0.05 Fub of the reliability estimate for the bin. For individual predictions, the RMSE of the bin serves as a conservative error estimate for the vast majority of chemicals (84%). While the 3D reliability estimate method offers a promising way to estimate prediction error for the consensus Fub model, the method could greatly benefit from larger datasets.
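The binning behind the 3D reliability estimate can be sketched as a lookup table keyed on three quantities: neighbor distance, standard deviation across model predictions, and the predicted Fub. The code below is a rough illustration under stated assumptions, not the authors' implementation: the second cut point on the distance and prediction axes and all record values are hypothetical, and training errors are assumed to come from cross-validated predictions.

```python
import math
from bisect import bisect_right

# Two cut points per axis give three intervals, hence 3 x 3 x 3 = 27 bins.
# Some values echo thresholds quoted in the text; the rest are hypothetical.
CUTS = {
    "distance": (0.14, 2.0),
    "stdev": (0.04, 0.104),
    "prediction": (0.13, 0.50),
}


def bin_key(distance, stdev, prediction):
    """Map the three quantities to a bin index triple, each index in 0..2."""
    return (
        bisect_right(CUTS["distance"], distance),
        bisect_right(CUTS["stdev"], stdev),
        bisect_right(CUTS["prediction"], prediction),
    )


def fit_bin_rmse(training_records):
    """Build {bin: (RMSE, occupancy)} from (distance, stdev, predicted, observed)."""
    squared = {}
    for distance, stdev, predicted, observed in training_records:
        key = bin_key(distance, stdev, predicted)
        squared.setdefault(key, []).append((predicted - observed) ** 2)
    return {k: (math.sqrt(sum(v) / len(v)), len(v)) for k, v in squared.items()}


def reliability_estimate(table, distance, stdev, prediction):
    """Return (bin RMSE, training occupancy), or None for an unoccupied bin."""
    return table.get(bin_key(distance, stdev, prediction))


# Hypothetical training records for illustration only.
training = [
    (0.10, 0.02, 0.05, 0.08),
    (0.12, 0.03, 0.10, 0.06),
    (1.00, 0.20, 0.60, 0.90),
]
table = fit_bin_rmse(training)
print(reliability_estimate(table, 0.11, 0.01, 0.07))
```

Returning the occupancy alongside the RMSE keeps the caution from the text visible to the caller: an estimate drawn from a bin with only a few training chemicals deserves less trust.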

4. DISCUSSION & CONCLUSIONS

In order to provide reliable Fub predictions for a wide array of chemicals, QSAR models were constructed with pharmaceutical data and evaluated for use with environmentally relevant chemicals. Three machine learning algorithms were employed with a large pharmaceutical training set, and the subsequent models were assessed for differences in AD and performance across independent pharmaceutical and environmental chemical testing sets. The kNN, SVM and RF machine learning algorithms represent distinctive approaches to Fub prediction. Nevertheless, similarities among the learning curves for all three models showed that the training set of 1045 pharmaceuticals was sufficient to cover the chemical space for both pharmaceuticals and environmentally relevant (ToxCast) chemicals.

While the RF model outperforms the kNN and SVM models, the consensus Fub model yields the most reliable means of predicting Fub with the lowest RMSE for both pharmaceutical and environmental test sets. The consensus model has lower prediction errors than the RF model and still maintains sufficient parsimony in the descriptor set to yield a generalizable Fub model. Since the majority of environmentally relevant chemicals are highly bound neutrals and acids, the excellent predictions of the consensus model for these compounds are especially encouraging. Because errors in the Fub prediction of highly bound chemicals can have a larger impact on the pharmaco/toxicokinetics of a chemical than errors for weakly bound chemicals, owing to the larger relative error, the remarkable performance of the consensus Fub model in this region also bodes well for the application of these models in a risk assessment setting.

Both the construction and performance of the Fub QSAR models reflect the biochemical context for plasma protein binding. Metrics of hydrophobicity, aqueous solubility, and polarity appear as highly informative descriptors in every model. These physicochemical properties are strong drivers of chemical partitioning from the aqueous phase of plasma into protein binding sites.1-3 Thus, the inclusion of descriptors such as logP(o/w) in the QSAR models suggests the machine learning algorithms are capturing the key elements of plasma protein binding. The divergence of predictive power associated with ionization indicates that the QSAR models may be biased toward predicting binding by HSA, the most prevalent plasma protein. As shown in Table 2 and Figure 4, all models show superior prediction for neutral and acidic chemicals, which bind preferentially to HSA.4 In contrast, the relatively poor predictions for basic and zwitterionic chemicals may indicate that the models are neglecting important contributions from AAG, which tends to bind bases.5 Both the descriptors and the performance relative to ionization suggest the constructed QSAR models are driven by the most general properties of plasma
protein binding, which would be expected to apply to neutral environmental chemicals more than the highly substituted and frequently ionized pharmaceuticals.

The univariate bounded box of descriptor ranges and multivariate PC analysis serve as informative AD assessments for the Fub models. Although fewer chemicals are identified as outside the AD in the PC method, the isolated chemicals tend to have a higher Fub prediction error (MAE 0.092-0.188) than those outside the bounded box AD (MAE 0.011-0.097). The Fub prediction errors for chemicals outside the AD are generally lower than the test sets as a whole, indicating successful extrapolation across the explored chemicals. Additionally, across all models and AD assessments, the vast majority of test set chemicals are identified as within the AD. For the consensus Fub model, bounded box and PC approaches to AD assessment classify 8% and 10% of chemicals as outside the AD, respectively. A trend of decreasing similarity to the pharmaceutical training set from pharmaceutical test set to ToxCast I test set to ToxCast II test set is exhibited by each model and AD assessment. While the AD defined by multivariate PCs offers an improved method for delineating the training set chemical space and identifying compounds with high Fub prediction error over the bounded box method, neither of these AD assessments is fully able to separate well predicted chemicals from poorly predicted chemicals.

Although the consensus model performs well for environmental chemicals as a whole, it is difficult to determine the precise regions of chemical space that yield optimal predictions. Inclusion of a chemical within the AD of the model does not ensure accurate Fub predictions. While reliability estimates for individual models based on LCV are largely within 10% of the true error, chemicals with extreme Fub prediction errors are still not correctly identified. The 3D reliability estimate for the consensus model provides a reasonable and conservative estimate for Fub prediction error through an assessment of chemical similarity, Fub prediction value and
standard deviation across models. Associations between ionization state and prediction quality indicate that the model accurately predicts binding values for acids and neutrals, but should be used with caution in the case of bases and zwitterions. The mixed results described for the consensus model serve as a reminder that QSAR models for Fub prediction function best as an early tool for understanding plasma protein binding, and risk assessors should consider the strengths and weaknesses of the model, rather than blindly following any specific AD metric.

There are ongoing challenges in Fub prediction for environmental chemicals, including the potential for wide variation in measured binding values for individual compounds. Within the pharmaceutical datasets, 21 chemicals were eliminated from the model due to conflicting experimental reports with Fub differences of greater than 0.30. The raw data for the ToxCast datasets show surprising variability in the 2-3 measurements per chemical determined by the same lab, with an average experimental uncertainty of 0.041 ± 0.104 across all 406 ToxCast chemicals.12,13 A few of the chemicals with high prediction error also had significant experimental variability, which may explain why the model did not match the average value well. For example, the RF model predicts the Fub of diethylhexyl phthalate and cyromazine as 0.051 and 0.154, respectively. Although the experimentally reported Fub values are much higher (0.933 and 0.935), independent trials differed greatly (ranges of 0.495 and 0.525). In cases such as these, the extreme prediction errors may reflect experimental uncertainty and not a flawed model.

Another challenge in the construction of QSARs for Fub lies in the diversity of chemical binding sites available in plasma. Specific interactions of chemicals with the multiple binding sites on HSA can be difficult to model. The additional binding sites on AAG as well as nonspecific interactions with HSA, AAG, and lipoproteins add further complications. While it is
possible to describe the general nature of plasma protein binding in a 2D QSAR, it is unreasonable to expect to capture the specific and non-specific interactions between xenobiotics and every possible binding site with a small descriptor set. Over the past decade, a variety of Fub models have been constructed with machine learning methods and various training sets, yet even the best Fub models have MAEs of 0.12 or more, which implies there may be a limit in the predictability of 2D Fub QSARs.7-9,11 Moreover, most of the QSAR models for Fub prediction have been developed for pharmaceuticals. Thus, existing 2D QSAR models for Fub exhibit better performance in pharmaceuticals (MAE 0.12-0.18 Fub) than in environmentally relevant chemicals (MAE 0.15-0.17 Fub).7-11 The reported R2 values for these QSARs range from 0.49-0.70 for pharmaceuticals.7-11 Due to the previously mentioned shortcomings of the metric, Zhu et al. (2013) did not report R2 values for environmentally relevant chemicals.11 However, for the sake of comparison, the Fub prediction values supplied in the supplemental material of the Zhu et al. (2013) study were used to calculate the R2 for ToxCast chemicals, which ranged from 0.20-0.29 across the algorithms.11 The models developed in the present study fall well within the accuracy range of previous Fub models for pharmaceuticals, and offer significant improvements in Fub predictions for environmental chemicals, which are comparable to those generated by some of the best pharmaceutical 2D QSAR models.
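For readers who want to repeat this kind of comparison from a table of observed and predicted Fub values, the sketch below computes R2 as one minus the ratio of the residual sum of squares to the total sum of squares. This choice of definition is an assumption: the text does not state which form of R2 was used, and other forms (for example the squared correlation coefficient) can give different values. The function name and data are hypothetical.

```python
def r2_determination(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only (not data from any cited study).
observed = [0.1, 0.2, 0.3, 0.4]
predicted = [0.2, 0.2, 0.3, 0.3]
print(round(r2_determination(observed, predicted), 3))
```

Under this definition R2 depends on the spread of the observed values, so a dataset dominated by highly bound chemicals can return a low R2 even when absolute errors are small, which is one reason MAE and RMSE are the primary metrics here.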
Despite these shortcomings, Fub QSAR models continue to play a critical role in the prioritization of chemicals in both pharmacology and toxicology due to the strong influence of plasma protein binding on ADMET properties.1-3 Of the thousands of chemicals currently in wide commercial use, a small minority have undergone rigorous toxicity testing.36,37 Risk assessors must therefore rely on computational models (in silico prediction) to bridge the gap between in vitro assays and in vivo effects research as a means of gauging the potential impact of
such chemicals on human health.38,39 One such approach, termed in vitro-in vivo extrapolation (IVIVE), employs pharmacokinetic (PK) and physiologically based pharmacokinetic (PBPK) models to calculate the chemical concentration in an organism required to trigger a specific mechanism of action delineated by in vitro assays.40,41 When coupled with exposure estimates, IVIVE models can be invaluable tools for the high throughput screening of environmental chemicals with unknown in vivo toxicity for the prioritization of chemicals for further testing.42-44 An Fub model that provides accurate predictions in diverse chemical space is a critical input for these toxicology models.

The presented work shows that plasma protein binding data for pharmaceuticals can be used to develop a high quality Fub QSAR model for environmentally relevant chemicals. Focusing on a simple model (10-15 2D descriptors) allows for the construction of a global Fub model that captures the general drivers of plasma protein binding and outperforms previous Fub models in the chemical domain outside of pharmaceuticals. Not only are the Fub predictions for environmental chemicals reliable (MAE 0.103-0.131), but these ToxCast chemicals tend to occupy a subset of the pharmaceutical descriptor space, establishing a basis for the prediction of biophysical properties of environmental chemicals from a larger pharmaceutical dataset. Ultimately, the novel approach to the prediction of plasma protein binding in environmentally relevant chemicals presented here yields a generic Fub model that can be applied to a diverse array of chemicals.

ASSOCIATED CONTENT

Supporting Information. Supplementary Figure S1 and Tables S1-S6 include the datasets used for constructing and testing the QSAR models, as well as detailed information on the descriptors used for each model, chemicals outside of the applicability domain, and performance metrics relative to fraction unbound. This material is available free of charge via the Internet at http://pubs.acs.org.

AUTHOR INFORMATION

Corresponding Author
*Address correspondence to Rogelio Tornero-Velez at 109 T.W. Alexander Drive, Mail Code E205-01, Research Triangle Park, NC, 27709; Email: [email protected]

Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Funding Sources
The U.S. Environmental Protection Agency through its Office of Research and Development funded and managed the research described here. Brandon Veber was funded by the Oak Ridge Institute for Science and Education Research Participation Program at the U.S. Environmental Protection Agency.

Notes
Although the presented research has been subject to the U.S. Environmental Protection Agency’s administrative review and approved for publication, the presented work is that of the authors and does not necessarily represent Agency policy. All of the models presented here are available to use free of charge at https://bitbucket.org/bveber/chem.

ACKNOWLEDGMENT

Much appreciation goes out to Yu-Mei Tan and Michael “Rocky” Goldsmith for discussions on the cross-over between pharmaceuticals and environmental chemicals and the need for better plasma protein binding models. The authors also would like to thank Marina Evans, Kamel Mansouri and Stephen Graham for completing the Environmental Protection Agency internal review of this manuscript.

ABBREVIATIONS

AAG, α1-acid glycoprotein; AD, applicability domain; BIC, Bayesian information criterion; Fub, fraction of xenobiotic unbound by plasma; HSA, human serum albumin; IVIVE, in vitro-in vivo extrapolation; kNN, k nearest neighbors; LCV, local cross-validation; lnKa, association constant for plasma proteins; MAE, mean absolute error; PBPK, physiologically based pharmacokinetic; QSAR, quantitative structure-activity relationship; R2, coefficient of determination; RF, random forest; RMSE, root mean square error; SVM, support vector machine.

REFERENCES

1. Benet, L.Z.; Kroetz, D.L.; Sheiner, L.B. Pharmacokinetics. The Dynamics of Drug Absorption, Distribution and Elimination. In Goodman and Gilman’s the Pharmacological Basis of Therapeutics; McGraw-Hill: New York, 1996; 9th ed., pp 3-27.
2. Rowley, M.; Kulagowski, J.J.; Watt, A.P.; Rathbone, D.; Stevenson, G.I.; Carling, R.W.; Baker, R.; Marshall, G.R.; Kemp, J.A.; Foster, A.C.; et al. Effect of Plasma Protein Binding on In Vivo Activity and Brain Penetration of Glycine/NMDA Receptor Antagonists. J. Med. Chem. 1997, 40, 4053-4068.
3. Lambrinidis, G.; Vallianatou, T.; Tsantili-Kakoulidou, A. In Vitro, In Silico and Integrated Strategies for the Estimation of Plasma Protein Binding. A Review. Adv. Drug Deliv. Rev. 2015, 86, 27-45.
4. Zhivkova, Z.D. Studies on Drug-Human Serum Albumin Binding: The Current State of the Matter. Curr. Pharm. Des. 2015, 21, 1817-1830.
5. Kopecký, V., Jr.; Ettrich, R.; Hofbauerová, K.; Baumruk, V. Structure of Human α1-Acid Glycoprotein and Its High-Affinity Binding Site. Biochem. Biophys. Res. Commun. 2003, 300, 41-46.
6. Kuchinskiene, Z.; Carlson, L.A. Composition, Concentration, and Size of Low Density Lipoproteins and of Subfractions of Very Low Density Lipoproteins from Serum of Normal Men and Women. J. Lipid Res. 1982, 23, 762-769.
7. Kratochwil, N.A.; Huber, W.; Müller, F.; Kansy, M.; Gerber, P.R. Predicting Plasma Protein Binding: A New Approach. Biochem. Pharmacol. 2002, 64, 1355-1374.
8. Votano, J.R.; Parham, M.; Hall, M.; Hall, L.H.; Kier, L.B.; Obach, S.; Tropsha, A. QSAR Modeling of Human Serum Protein Binding with Several Modeling Techniques Utilizing Structure-Information Representation. J. Med. Chem. 2006, 49, 1769-1781.
9. Ghafourian, T.; Amin, Z. QSAR Models for the Prediction of Plasma Protein Binding. Bioimpacts 2013, 3, 21-27.
10. Obach, R.S.; Lombardo, F.; Waters, N.J. Trend Analysis of a Database of Intravenous Pharmacokinetic Parameters in Humans for 670 Drug Compounds. Drug Metab. Dispos. 2008, 36, 1385-1405.
11. Zhu, X.; Sedykh, A.; Zhu, H.; Liu, S.; Tropsha, A. The Use of Pseudo-Equilibrium Constant Affords Improved QSAR Models of Human Plasma Protein Binding. Pharm. Res. 2013, 30, 1790-1798.
12. Yin, Y.; Chang, D.T.; Grulke, C.M.; Tan, Y.-M.; Goldsmith, M.-R.; Tornero-Velez, R. Essential Set of Molecular Descriptors for ADME Prediction in Drug and Environmental Chemical Space. Research 2014, 1:996.
13. Wetmore, B.A.; Wambaugh, J.F.; Ferguson, S.S.; Sochaski, M.A.; Rotroff, D.M.; Freeman, K.; Clewell, H.J.; Dix, D.J.; Anderson, M.E.; Houck, K.A.; et al. Integration of Dosimetry, Exposure, and High-Throughput Screening Data in Chemical Toxicity Assessment. Toxicol. Sci. 2012, 125, 157-174.
14. Wetmore, B.A.; Wambaugh, J.F.; Allen, B.; Ferguson, S.S.; Sochaski, M.A.; Setzer, R.W.; Houck, K.A.; Strope, C.L.; Cantwell, K.; Judson, R.S.; et al. Incorporating High-Throughput
Exposure Predictions wDosimetry-Adjusted In Vitro Bioactivity to Inform Chemical Toxicity Testing. Toxicol. Sci. 2015, 148, 121-136. 15. Weininger, D. SMILES, A Chemical Language and Information System: Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31-36. 16. Pence, H.E.; Williams, A. ChemSpider: An Online Chemical Information Resource. J. Chem. Educ. 2010, 87, 1123-1124. 17. Yen, T-H.; Lin, J.-L. Acute Poisoning with Emamectin Benzoate. J. Toxicol. Clin. Toxicol. 2004, 42, 657-661. 18. Molecular Operating Environment (MOE), 2013.08 (2015) Chemical Computing Group Inc., Montreal, QC, Canada. 19. ADMET Predictor, 7.0 (2014) Simulations Plus, Inc., Lancaster, CA, USA. 20. Fraczkiewicz, R.; Lobell, M.; Göller, A.H.; Krenz, U.; Schoennis, R.; Clark, R.D.; Hillisch, A. Best of Both Worlds: Combining Pharma Data and State of the Art Modeling Technology to Improve In Silico pKa Prediction, J. Chem. Inf. Model. 2015, 55, 389-397. 21. Liao, C.; Nicklaus, M.C. Comparison of Nine Programs Predicting pKa Values of Pharmaceutical Substances, J. Chem. Inf. Model 2009, 49, 2801-2812. 22. Zheng, W.; Tropsha, A. Novel Variable Selection Quantitative Structure-Property Relationship Approach Based on the K-Nearest-Neighbor Principle. J. Chem. Inf. Compt. Sci. 2000, 40, 185-194.


23. Sharaf, M.A.; Illman, D.L.; Kowalski, M. Chemometrics; John Wiley and Sons: New York, 1986.
24. Vapnik, V.N. The Nature of Statistical Learning Theory; Springer-Verlag: New York, 1995.
25. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5-32.
26. Strobl, C.; Boulesteix, A.-L.; Zeileis, A.; Hothorn, T. Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution. BMC Bioinformatics 2007, 8:25.
27. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825-2830.
28. Yelle, L.E. The Learning Curve: Historical Review and Comprehensive Survey. Decision Sci. 1979, 10, 302-328.
29. Schwarz, G. Estimating the Dimension of a Model. Ann. Stat. 1978, 6, 461-464.
30. Jaworska, J.; Nikolova-Jeliazkova, N.; Aldenberg, T. QSAR Applicability Domain Estimation by Projection of the Training Set in Descriptor Space: A Review. Altern. Lab Anim. 2006, 33, 446-469.
31. Seber, G.A.F. Multivariate Observations; John Wiley and Sons: New York, 1984.
32. Toplak, M.; Močnik, R.; Matija, P.; Bosnić, Z.; Carlsson, L.; Hasselgren, C.; Demšar, J.; Boyer, S.; Zupan, B.; Stålring, J. Assessment of Machine Learning Reliability Methods for Quantifying the Applicability Domain of QSAR Regression Models. J. Chem. Inf. Model. 2014, 54, 431-441.


33. Sheridan, R.P. Three Useful Dimensions for Domain Applicability in QSAR Models Using Random Forest. J. Chem. Inf. Model. 2012, 52, 814-823.
34. Friedman, J.H.; Baskett, F.; Shustek, L.J. An Algorithm for Finding Nearest Neighbors. IEEE Trans. Comput. 1975, 10, 1000-1006.
35. Alexander, D.L.J.; Tropsha, A.; Winkler, D.A. Beware of R2: Simple, Unambiguous Assessment of the Prediction Accuracy of QSAR and QSPR Models. J. Chem. Inf. Model. 2015, 55, 1316-1322.
36. Egeghy, P.P.; Judson, R.; Gangwal, S.; Mosher, S.; Smith, D.; Vail, J.; Hubal, E.A.C. The Exposure Data Landscape for Manufactured Chemicals. Sci. Total Environ. 2012, 414, 159-166.
37. Muir, D.C.K.; Howard, P.H. Are There Other Persistent Organic Pollutants? A Challenge for Environmental Chemists. Environ. Sci. Technol. 2003, 40, 7157-7166.
38. Knaak, J.B.; Dary, C.C.; Zhang, X.; Gerlach, R.W.; Tornero-Velez, R.; Chang, D.T.; Goldsmith, R.; Blancato, J.N. Parameters for Pyrethroid Insecticide QSAR and PBPK/PD Models for Human Risk Assessment. Rev. Environ. Contam. T. 2012, 219, 1-114.
39. Phillips, M.B.; Leonard, J.A.; Grulke, C.M.; Chang, D.T.; Edwards, S.W.; Brooks, R.; Goldsmith, M.R.; El-Masri, H.; Tan, Y.M. A Workflow to Investigate Exposure and Pharmacokinetic Influences on High-Throughput In Vitro Chemical Screening Based on Adverse Outcome Pathways. Environ. Health Perspect. 2016, 124, 53-60.


40. Kavlock, R.J.; Ankley, G.; Blancato, J.; Breen, M.; Conolly, R.; Dix, D.; Houck, K.; Hubal, E.; Judson, R.; Rabinowitz, J.; et al. Computational Toxicology: A State of the Science Mini Review. Toxicol. Sci. 2008, 103, 14-27.
41. Godin, S.J.; DeVito, M.J.; Hughes, M.F.; Ross, D.G.; Scollon, E.J.; Starr, J.M.; Setzer, R.W.; Conolly, R.B.; Tornero-Velez, R. Physiologically Based Pharmacokinetic Modeling of Deltamethrin: Development of a Rat-Human Diffusion-Limited Model. Toxicol. Sci. 2012, 115, 330-343.
42. Krewski, D.; Westphal, M.; Andersen, M.E.; Paoli, G.M.; Chiu, W.A.; Al-Zoughool, M.; Croteau, Burgoon, L.D.; Cote, I. A Framework for the Next Generation of Risk Science. Environ. Health Perspect. 2014, 122, 796-805.
43. Tan, Y.-M.; Sobus, J.; Chang, D.; Tornero-Velez, R.; Goldsmith, M.; Pleil, J.; Dary, C. Reconstructing Human Exposures Using Biomarkers and Other “Clues.” J. Toxicol. Environ. Health B 2012, 15, 22-38.
44. Isaacs, K.K.; Glen, W.G.; Egeghy, P.; Goldsmith, M.-R.; Smith, L.; Vallero, D.; Brooks, R.; Grulke, C.M.; Özkaynak, H. SHEDS-HT: An Integrated Probabilistic Exposure Model for Prioritizing Exposures to Chemicals with Near-Field and Dietary Sources. Environ. Sci. Technol. 2014, 48, 12750-12759.


For Table of Contents use only

Informing the human plasma protein binding of environmental chemicals by machine learning in the pharmaceutical space: Applicability domain and limits of predictability.

Brandall L. Ingle, Brandon C. Veber, John W. Nichols, Rogelio Tornero-Velez


Figure 1. Histogram of the fraction unbound by plasma protein for the entire pharmaceutical (dark blue) and environmental (light green) sets.


Figure 2. Learning curves for (a) kNN, (b) SVM, and (c) RF algorithms for the training set (blue), drug test set (green), and ToxCast I compounds (red). Bayesian information criterion (BIC) curves for (d) kNN, (e) SVM, and (f) RF. Means are shown as solid lines; the shaded areas out to the dotted lines represent one standard deviation.


Figure 3. Plots of the first three principal components in the consensus model for the pharmaceutical training set (purple), pharmaceutical test set (green), ToxCast I set (yellow), and ToxCast II set (red).


Figure 4. (a) Composition of the pharmaceutical and ToxCast datasets and (b) prediction quality of the Random Forest model across ionization categories: acids (red), bases (blue), neutrals (grey), and zwitterions (purple).


Table of Contents Graphic
