Article Cite This: Chem. Res. Toxicol. 2019, 32, 1178−1192
pubs.acs.org/crt
Enhancing Acute Oral Toxicity Predictions by using Consensus Modeling and Algebraic Form-Based 0D-to-2D Molecular Encodes Ceś ar R. García-Jacas,*,† Yovani Marrero-Ponce,‡,§ Fernando Corteś -Guzmań ,∥ Jose ́ Suaŕ ez-Lezcano,⊥ Felix O. Martinez-Rios,# Luis A. García-Gonzaĺ ez,∇ Mario Pupo-Meriño,∇ and Karina Martinez-Mayorga∥
Downloaded via CARLETON UNIV on August 21, 2019 at 14:11:51 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.
†
Departamento de Ciencias de la Computación, Centro de Investigación Científica y de Educación Superior de Ensenada, Ensenada, Baja California, México ‡ Universidad San Francisco de Quito, Grupo de Medicina Molecular y Traslacional, Colegio de Ciencias de la Salud, Escuela de Medicina, Edificio de Especialidades Médicas, Quito, Pichincha, Ecuador § Grupo de Investigación Ambiental, Programas Ambientales, Facultad de Ingenierías, Fundacion Universitaria Tecnologico Comfenalco−Cartagena, Cr44 DN 30 A, 91, Cartagena, Bolívar, Colombia ∥ Instituto de Química, Universidad Nacional Autónoma de México, Ciudad de México, México ⊥ Pontificia Universidad Católica del Ecuador Sede Esmeraldas, Esmeraldas, Ecuador # Facultad de Ingeniería, Universidad Panamericana, Ciudad de México, México ∇ Grupo de Investigación de Bioinformática, Universidad de las Ciencias Informáticas, La Habana, Cuba S Supporting Information *
ABSTRACT: Quantitative structure−activity relationships (QSAR) are introduced to predict acute oral toxicity (AOT), by using the QuBiLS-MAS (acronym for quadratic, bilinear and N-Linear maps based on graph-theoretic electronic-density matrices and atomic weightings) framework for the molecular encoding. Three training sets were employed to build the models: EPA training set (5931 compounds), EPA-full training set (7413 compounds), and Zhu training set (10 152 compounds). Additionally, the EPA test set (1482 compounds) was used for the validation of the QSAR models built on the EPA training set, while the ProTox (425 compounds) and T3DB (284 compounds) external sets were employed for the assessment of all the models. The k-nearest neighbor, multilayer perceptron, random forest, and support vector machine procedures were employed to build several base (individual) models. The base models with REPA−training ≥ 0.75 (R = correlation coefficient) and MAEEPA−training ≤ 0.5 (MAE = mean absolute error) were retained to build consensus models. As a result, two consensus models based on the minimum operator and denoted as M19 and M22, as well as a consensus model based on the weighted average operator and denoted as M24, were selected as the best ones for each training set considered. According to the applicability domain (AD) analysis performed, model M19 (built on the EPA training set) has MAEtest−AD = 0.4044, MAEProTox−AD = 0.4067 and MAET3DB−AD = 0.2586 on the EPA test set, ProTox external set, and T3DB external set, respectively; whereas model M22 (built on the EPA-full set) and model M24 (built on the Zhu set) present MAEProTox−AD = 0.3992 and MAET3DB−AD = 0.2286, and MAEProTox−AD = 0.3773 and MAET3DB−AD = 0.2471 on the two external sets accounted for, respectively. These outcomes were compared and statistically validated with respect to 14 QSAR methods (e.g., admetSAR, ProTox-II) from the literature. As a result, model M22 presents the best overall performance. In addition, a retrospective study on 261 withdrawn drugs due to their toxic/side effects was performed, to assess the usefulness of prospectively using the QSAR models proposed in the labeling continued...
Received: January 14, 2019 Published: May 8, 2019 © 2019 American Chemical Society
1178
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology
of chemicals. A comparison with regard to the methods from the literature was also made. As a result, model M22 has the best ability of labeling a compound as toxic according to the globally harmonized system of classification and labeling of chemicals. Therefore, it can be concluded that the models proposed, especially model M22, constitute prominent tools for studying AOT, at providing the best results among all the methods examined. A freely available software was also developed to be used in virtual screening tasks (http://tomocomd.com/apps/ptoxra).
1. INTRODUCTION Chemical toxicity is the ability of a substance to cause hazardous biological damages in vulnerable sites (e.g., in organs as the liver−hepatoxicity) or cells (cytotoxicity)1 in the human body, or in other living systems such as plants,2 animals,3 or ecosystems.4 Acute oral toxicity (AOT) constitutes one of the most important toxicological end points to be assessed. LD50, indicator used to measure AOT, is the dose of a chemical that kills 50% of the animals treated, either immediately or in short time, after administration of a single dose or multiple doses within 24 h.5 In vivo assays of animal tests are required to obtain an exact measure of AOT. However, these are complex, expensive, and time-consuming, especially when a large number of chemicals are studied. In addition, because of ethical reasons and animal rights, LD50 experiments are quite controversial. Thus, quantitative structure−activity relationship (QSAR) models constitute one of the alternatives to be used to predict AOT. Since the last quarter of the past century, several QSAR models have been introduced to estimate this end point. In this sense, Enslein et al.6,7 built models based on the multiple linear regression (MLR) technique using noncongeneric chemicals, but they lack predictive power. To tackle this drawback, Eldred et al.8 and Guo et al.9 developed models using congeneric compounds. Not surprisingly these have limited application ranges. Posteriorly, Zhu et al.10,11 proposed models based on kNN, random forest, hierarchical clustering, and other methods, as well as consensus models, showing the latter better performances than the formers. Nonetheless, the predictive ability achieved by the consensus models is from low to moderate (see Table 3 in ref 10). In general, these results highlight not only the complexity of the biological mechanisms involved in AOT, but also the diversity of modes of action and the lack of chemical space coverage. It is also important to remark that LD50 measurements, as any similarly complex in vivo end point, are expected to contain rather high quantity of noise, which reflects biological diversity as well as testing conditions. More recently, with the purpose of obtaining better results in the AOT prediction, the QSAR models for the admetSAR12,13 and ProTox-II14,15 web tools were developed by using training sets comprised of 10 207 and 38 515 compounds, respectively, to consider a wider chemical space in the learning process. Also, other modeling strategies have been applied, for instance: (1) Lagunin et al.16 presented models based on the combination of QNA (Quantitative Neighborhoods of Atoms) descriptors, PASS (Prediction of Activity Spectra for Substances) predictions, and self-consistent regression; (2) Lu et al.17 proposed local lazy learning (LLL)-based methods, which use linear regression, arithmetic mean, or weighted mean to determine the predictions from k-nearest neighbors to a chemical structure given; and (3) Xu et al.18 introduced models based on a deep learning architecture using improved molecular graph encoding convolutional neural networks.
As it can be seen, all these models were built with classic (e.g., k-NN) and modern (e.g., deep learning) learning methods. The majority of them were created using similarity-based strategies10−17 (e.g., FDA-method), under the assumption of that structurally similar compounds may have similar activities/ toxicities. In addition, as a common characteristic, these QSAR models perform the molecular encoding using “traditional” molecular descriptors (MDs) (e.g., Dragon MDs) and/or fingerprints (e.g., molecular access system (MACCS) keys). In general, all these approaches may seem indicative that all the efforts have been made to build good models to predict AOT. However, the QSAR models mentioned above present poor predictive power when used in external sets (see Section 3.3). In our opinion, this behavior could be due to the MDs used. Indeed, as it has been analyzed elsewhere,19−21 it is more important to use the set of “right” predictors than the choice of the “right” learning algorithm. The traditional MDs (and fingerprints) included in the QSAR models reported to predict AOT have as a characteristic that they are mostly chemically interpretable. However, an easily interpretable descriptor is not necessarily a good descriptor to predict a determined end point. In fact, a property of the MDs is that they present the ability of highlighting molecular features that are not highlighted by a chemical interpretation. Consequently, the objective is often to determine the “best” predictive model, while the interpretation is an add-on. The latter is stated in Principle 5 (Chapter 6) of the Organisation for Economic Co-operation and Development (OECD) Guidance Document on the Validation of QSAR Models:22 “a (Q)SAR should be associated with a mechanistic interpretation, if possible”. Therefore, the use of more recent definitions of MDs, whose interpretation could be more difficult, may contribute to develop better QSAR models to predict AOT. Moreover, according to the complexity theory through the No Free Launch Theorem,23 it can be said that no single QSAR model to predict AOT achieves superior results to all the others when its performance is averaged over all possible chemicals. The reasoning for this opinion is that one MDs family (e.g., DRAGON MDs) may not suffice to encode all the relevant structural information for distinct compounds. In other words, the significance of the MDs depends on the nature of the chemical structures under analysis. Thus, following the same line of thought, it is necessary to use alternative approaches to encode new and orthogonal chemical information to build more robust QSAR models. In this sense, several MDs both topological (2D) and geometrical (3D) have recently been presented,24−29 being the QuBiLS-MAS (acronym for Quadratic, Bilinear and N-Linear mapS based on graph-theoretic electronic-density Matrices and Atomic weightingS) 2DMDs29,30 one of these novel definitions. The QuBiLS-MAS 2D-MDs are the current definition of the former ToMoCoMD-CARDD 2D-MDs,31−36 at presenting novel generalizations and extensions.29 These MDs have been applied in a wide variety of applications.37−45 Indeed, the significant predictive power demonstrated by this ligand-based method suggests that these 2D-MDs encode relevant chemical 1179
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology Scheme 1. General Workflow to Compute the QuBiLS-MAS Molecular Descriptorsa
a (1) Computation of the molecular vectors according to the properties selected. (2) Computation of the nonstochastic matrices for k = 1. (3) To consider atom-types or local-fragments (optional). (4) To apply molecular cutoffs (optional). (5) Computation of the simple-stochastic, doublestochastic, and mutual probability matrices. (6) To perform the division in atom-level (or bond-level) matrices. (7) Computation of the atom-level (or bond-level) indices (descriptors) using the property vectors calculated in the step (1). (8) To apply one or several aggregation operators on the atom-level (or bond-level) descriptors vector.
of 292 drugs withdrawn because of their toxic/side effects was manually obtained from the “Toxicities” section in the WITHDRAWN database.47 In this database, it is considered as a withdrawn drug if it was recalled from market in at least one country. The WITHDRAWN data set was used to perform a simulation of prospective virtual screening with the best QSAR models built. In this way, according to the AOT values to be predicted, the relevance of employing the novel predictive models in the labeling of compounds was analyzed (see Sections 2.6 and 3.5). All these data sets were obtained as simplified molecular input line entries (SMILES) or structure data format (SDF) and then converted to 2D representations without H-depleted by using the OpenBabel software.48 Filters to remove inorganic and organometallic compounds, salts, compound mixtures, and repeated compounds with respect to the Zhu training set were applied. After that, the EPA training/test sets and EPA-full training set kept the same number of structures, whereas the Zhu training set, ProTox external set, T3DB external set, and WITHDRAWN set contain 10 152, 425, 284, and 267 compounds, respectively. The LD50 values for the chemicals considered were expressed as moles per kilogram (mol/kg) and then these were stated as log[1/(mol/kg)] according to standard QSAR practices. Supporting Information 1 shows the SDF files of the chemical sets explained above. 2.2. QuBiLS-MAS Topological Molecular Descriptors. The atom- and bond-based QuBiLS-MAS topological molecular descriptors (2D-MDs) were recently presented as a novel framework to characterize compounds.29,30 This framework is the current definition of the former ToMoCoMD-CARDD 2D-MDs,31−36 at including novel matrix approaches (e.g., double-stochastic), novel molecular cutoffs to take into account particular interatomic relations (see eq 7 in ref 29), novel local fragments to consider chemical regions of interest, and diversity of operators to generalize the calculation from local vertex invariants (LOVIs) or local edge invariants (LOEIs), among other features.29 These 2D-MDs have been used in several applications, such
information on the molecules. Therefore, all in all, this manuscript introduces a novel topological approach based on the QuBiLS-MAS 2D-MDs to predict AOT. This manuscript is organized as follows. Section 2 describes the chemical data sets and the modeling methodology used. Section 3 analyzes the results obtained and compares them with respect to several tools established in the literature. A statistical assessment of the predictions obtained is also presented, as well as the outcomes achieved in a retrospective virtual screening. Finally, Section 4 establishes the findings and conclusions.
2. MATERIAL AND METHODS 2.1. Chemical Data Sets. The chemical data set contained in the Toxicity Estimation Software Tool (TEST), provided by the U.S. Environmental Protection Agency (EPA), was used to build the QSAR models.11 This data set is comprised of 5931 training compounds (hereafter denoted as EPA training set) and 1482 test compounds (hereafter denoted as EPA test set) annotated with oral rat LD50 values. Two other training sets comprised of 7413 and 10 207 compounds, respectively, were also accounted for. The first one, denoted as EPA-full training set, was created by joining the EPA training/test sets aforementioned, whereas the second one, widely used in the literature12−18 and denoted as Zhu training set, was collected from Zhu et al.10 With these three training sets, comparability of results among the QSAR models stated in the literature to those to be developed is guaranteed. It is important to highlight that the EPA training set ⊆ EPA-full training set ⊆ Zhu training set (⊆ means subset). Moreover, two sets comprised of 500 and 428 chemical structures annotated with oral rat LD50 values were downloaded from the ProToxII Web site15 and from the Toxin and Toxin Target Database (T3DB also well-known as Toxic Exposome Database),46 respectively, to be considered as external validation sets. Additionally, a data set composed 1180
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology Scheme 2. Workflow Corresponding to the Feature Selection and Modeling Process
as antimalarials,37 trichomonacidals,38−40 paramphistomicides,41 tyrosinase inhibitors,42,43 and others.44,45 The QuBiLS-MAS 2D-MDs calculation is based on the bilinear, quadratic, and linear algebraic forms. Thus, the original approach to compute the atom-based [abk(x,y)] and bond-based [ebk(x,y)] bilinear QuBiLS-MAS MDs is a bilinear mapping as shown below: n k a b (x ̅ , y ̅ ) =
m eb
(x ̅ , y ̅ ) =
n a,k b(q , f ) 3 a = b(q , f ) (x ̅ , y ̅ ) =
n
∑ ∑ mijkx iy j = [X ]T 4k[Y ] ∑∑ i=1 j=1
i=1 j=1
(3) m
m
eijkx iy j
T
n
∑ ∑ mija , kx iy j = [X ]T 4a , k[Y ]
∀ a = 1, 2, ..., n
(1)
i=1 j=1 k
summation operator is applied on the LOVIs (or LOEIs) calculated, then the results exactly coincide with the original approach (eqs 1 and 2). Scheme 1 depicts a flowchart regarding the QuBiLS-MAS 2D-MDs calculation.
b(q , f ) 3 e
k
= [X ] , [Y ]
= b(q , f )e , k (x ̅ , y ̅ ) =
(2) ∀ e = 1, 2, ..., m
where n (or m) is the total number of atoms (or bonds), k (±1, ±2, ..., ±15) is the power of the matrix approaches, and mkij (or ekij) denotes the coefficients of the kth atom-based electronic-density matrix 4k see ref 33 (or bond-based matrix ,k see ref 35), and xi and yi are the components of the x and y atom-based (or bond-based) property vectors, respectively. Note that, if the x and y vectors encode the same property (x = y), then eqs 1 and 2 define atom-based [aqk(x,x)] and bond-based [eqk(x,x)] quadratic QuBiLS-MAS 2D-MDs (quadratic mapping), respectively, and if all the entries of the x vector are equal to 1 and the y vector encodes a property, then eqs 1 and 2 define atom-based [a f k(x,x)] and bond-based [e f k(x,x)] linear QuBiLS-MAS 2D-MDs (linear mapping), respectively. The main contribution of the QuBiLS-MAS framework (formerly ToMoCoMD-CARDD) is the generalization of the vector-matrixvector procedure described in eqs 1 and 2. The 4k and ,k matrices are divided to atom-level (4a , k ) and bond-level (, e , k ), respectively, by using the procedure to determine local-fragment matrices but considering as fragment each atom “a” (see eq 6 in ref 31) or bond “e” (see eq 7 in ref 36) of the molecule. In this way, bilinear (quadratic, linear) QuBiLS-MAS 2D-MDs for each atom [LOVI − b(q,f)3 a ] or bond [LOEI − b(q,f)3 e ] can be computed (see eqs 3 and 4). So, different aggregation operators (e.g., central tendency statistics) can be used on the LOVIs (or LOEIs) calculated to obtain diversity of global molecular characterizations, just as confirmed elsewhere.49,50 This generalization is unique in the 2D-MDs computation. We remark that, if the
m
∑ ∑ eije , kx iy j = [X ]T ,e , k[Y ] i=1 j=1
(4)
2.3. Feature Selection and Modeling Methodology. First, 335 configurations of the bilinear, quadratic, and linear QuBiLS-MAS 2DMDs were computed on the EPA training set. These configurations were created according to the outcomes obtained in several cheminformatics studies (see Section 4 in ref 29), and they are available in the software of the same name of the descriptors.30 Since numerous MDs are calculated with this program, then the following workflow for data reduction was performed on each descriptor matrix obtained: (1) the MDs with Not-a-Number (NaN) values and/or with values represented as power of 10 were dropped, and (2) a supervised feature selection with regard to the LD50 values corresponding to the EPA training compounds was performed using an in-house program. This program retains the best 2D-MDs according to the Correlation Subset51 and Relief-F52 methods available in the Weka software (v3.8).53 This supervised selection was repeated until obtaining an only matrix of until 5000 QuBiLS-MAS 2D-MDs. Once the previous workflow concluded, the two selection methods aforementioned and a Shannon Entropy (SE)-based unsupervised method54 were applied on the MDs matrix obtained. Thus, three subsets with the most suitable MDs were obtained, so that (1) with the Correlation Subset method51 a subset of features highly correlated with the activity and lowly correlated among them was retained; (2) with the Relief-F method52 the features that best distinguish among instances that are near each other were retained; and (3) with the SE-based method54 the features that best yield distinct descriptions for 1181
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology Table 1. Statistical Parameters of the Best Individual Models Built on the EPA Training Set (5931 structures)a ID
size
modelb
Rtraining (test)
MAEtraining (test)
RMSEtraining (test)
M01 M02 M03 M04 M05 M06 M07 M08 M09 M10 M11 M12
166 159 228 78 95 103 37 78 228 166 166 228
(ReF-CFS)_RF (ReF-CFSB)_RF (ReF-CFS-SE)_RF CFS_RF ReF_RF SE_SVM SE_Wrapper(RF) CFS_SVM (ReF-CFS-SE)_SVM (ReF-CFS)_SVM (ReF-CFS)_kNN(6) (ReF-CFS-SE)_kNN(6) average performance
0.7790 (0.7952) 0.7770 (0.7991) 0.7753 (0.7942) 0.7689 (0.7865) 0.7681 (0.7869) 0.7594 (0.7807) 0.7552 (0.7687) 0.7543 (0.7671) 0.7536 (0.7773) 0.7521 (0.7818) 0.7510 (0.7789) 0.7500 (0.7724) 0.7620 (0.7824)
0.4382 (0.4355) 0.4411 (0.4329) 0.4420 (0.4416) 0.4493 (0.4478) 0.4473 (0.4400) 0.4455 (0.4401) 0.4602 (0.4620) 0.4582 (0.4529) 0.4568 (0.4479) 0.4568 (0.4427) 0.4617 (0.4475) 0.4580 (0.4532) 0.4513 (0.4453)
0.5986 (0.5899) 0.6013 (0.5865) 0.6034 (0.5927) 0.6096 (0.6010) 0.6109 (0.5997) 0.6173 (0.6033) 0.6254 (0.6223) 0.6246 (0.6203) 0.6259 (0.6097) 0.6255 (0.6032) 0.6347 (0.6054) 0.6305 (0.6140) 0.6173 (0.6040)
a
The performance achieved by these models on the EPA test set (1482 structures) is reported between parentheses. bThe name of the models is denoted as “base name”_“learning method”, where “base name” is denoted by the feature selection methods used: ReF (Relief-F), CFS (correlation subset forward), CFSB (correlation subset bidirectional), and SE (Shannon entropy-based method). method to determine the structures that fall within the AD of a model (e.g., Liew et al.69 only used an approach based on range, and Aranda et al.70 only used the leverage method). However, there are several AD methods that characterize the distinct interpolation space stated by the MDs used.67 So, inspired by the idea of consensus-based decision, five approaches (i.e., city-block, Euclidean, Mahalanobis, range, and densitysee refs 71 and 72 for a detailed description) were considered in a recent study to determine the reliability of the predictions.73 More specifically, if the prediction for a molecule lies outside the bounds in at least two AD methods, then it is considered unreliable. This approach will be the one used in this work. These AD methods are available in the Ambit Discovery software,74 and they were used with their respective default configurations. 2.5. Statistical Analysis of the External Predictive Ability. A statistical analysis of the predictive ability achieved on the ProTox and T3DB external sets was performed, with the purpose of knowing if the results obtained by the best QSAR models proposed are significantly different than the ones achieved by the QSAR methods from the literature (e.g., admetSAR,12 ProTox,14 DL-AOT18). To this end, the predictions within the AD of the best models built and QSAR methods from the literature were accounted for. Then, the absolute error module was computed to be used in this study. An analysis to examine the data normality was made, by use of the Kolmogorov−Smirnov test corrected by Lilliefors, and the Shapiro-Wilk test.75 If no normality was indicated (pvalue < 0.05), then nonparametric tests were applied. Thus, the Friedman test76 was used to determine global differences, and if these exist, then the posthoc Holm method77 was used for multiple comparisons. In this report, a significance level of α = 0.05 was considered. 2.6. Retrospective Virtual Screening in a Withdrawn Drug Data Set. With this retrospective screening, a study to know the ability of the best models proposed to generate toxicological warnings was made. In this way, the usefulness of prospectively using the QSAR models proposed in the early prediction of toxicity or in the labeling of chemicals was assessed. An analysis with regard to the QSAR methods most commonly employed in the literature for predicting AOT (i.e., TEST consensus,10 admetSAR,12 ProTox,14 and DL-AOT18) was also performed. To this end, the WITHDRAWN data set, composed of 267 drugs recalled from the market in at least one country due to their toxic/ side effects, was used (see Section 2.1). The predictions obtained on the WITHDRAWN set were converted to milligrams per kilogram. Then, these predictions were labeled (classified) according to the globally harmonized system of classification and labeling of chemicals (GHS):78 Class I (fatal if swallowed − LD50 ≤ 5), Class II (fatal if swallowed −5 < LD50 ≤ 50), Class III (toxic if swallowed −50 < LD50 ≤ 300), Class IV (harmful if swallowed −300 < LD50 ≤ 2000), Class V (may be harmful if swallowed −2000 < LD50 ≤ 5000), and Class VI (nontoxic − LD50 >
structurally different molecules were retained. The SE-based selection was performed with the IMMAN software,55 following a discretization scheme equal to 5931 bins (size of the EPA training set). These three subsets were also shuffled among themselves (ensemble feature selection56), obtaining four additional subsets. A wrapper-type selection57 was also applied on all the previous subsets, by using the Best-First method as search strategy, the Root Mean Square Error (RMSE) as evaluation measure, and the k-Nearest Neighbor (k-NN),58 Multilayer Perceptron (MLP),59 Random Forest (RF),60 and Support Vector Machine (SVM)61 learning methods. Subsequently, several individual regression models (QSAR) were built using the learning methods aforementioned, which are available in the Weka software (v3.8).53 In detail, the k-NN method was configured with k-values (number of neighbors) between 3 and 6 (determined by cross validation), and the ratio 1/distance was used as distance weighting scheme. Moreover, the Puk function62 was used as kernel in the SVM. The remaining learning methods were utilized with the respective default configurations. These configurations were also employed in the wrapper-type selection explained above. It is important to highlight that, on the subsets of MDs obtained from the wrapper-type selection, only the learning method used in the wrapper was employed to build the model. The performance of the models was measured using the correlation coefficient (R), mean absolute error (MAE), and Root Mean Square Error (RMSE) statistical parameters. The training assessment was performed through 10-fold cross-validation. The base (individual) models with REPA−training ≥ 0.75 and MAEEPA−training ≤ 0.5 were retained to build ensemble and consensus models. The ensemble models were built using the bagging,63 additive regression,64 and stacking65 procedures, while the consensus models were developed using the maximum (Max), minimum (Min), average (AVE), and weighted average (WA) aggregation functions. The weightings for the WA function were calculated according to eq 9 described in ref 49, with α = 1. The ensemble models with a training performance better than the one achieved by the best individual model, as well as all the consensus models built, were assessed on the EPA test set. In this way, the best models based on the EPA training set were selected according to their training and test results. Finally, the same QuBiLS-MAS 2D-MDs and learning method used in the best models built on the EPA training set were employed to build the models on the EPA-full and Zhu training sets, respectively. All these models were assessed on the external sets considered to measure their predictive abilities. Lastly, a comparison with respect to several QSAR methodologies established in the literature (e.g., admetSAR,12 ProTox14) was performed. Scheme 2 depicts the workflow described above. 2.4. Applicability Domain. An applicability domain (AD) analysis was performed to know the behavior of the models to perform reliable predictions.66−68 This analysis is often performed using only one 1182
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology
Table 2. Statistical Parameters Achieved by the Best Ensemble and Consensus Models Built for the EPA Training Set (5931 structures) and the EPA Test Set (1482 structures), Respectively model ID
size
ensemble method
base models
Rtraining
MAEtraining
RMSEtraining
Rtest
MAEtest
RMSEtest
M13 M14 M15 M16 M17 M18 M19
166 228 159 166 200 200 200
additive regression additive regression additive regression stacking (linear regression) consensus (average) consensus (weighted average) consensus (minimum)
M01 M03 M02 M01, M10, M11 M13, M15 M13, M15 M13, M15
0.7885 0.7864 0.7841 0.7801
0.4224 0.4256 0.4266 0.4312
0.5843 0.5875 0.5898 0.5933
0.8059 0.8092 0.8093 0.8024 0.8103 0.8102 0.8095
0.4160 0.4179 0.4157 0.4267 0.4136 0.4138 0.4106
0.5725 0.5694 0.5690 0.5755 0.5676 0.5677 0.5693
Table 3. Statistical Parameters Achieved by the Consensus Models from M17 to M25a ProTox external set (425 compounds)
T3DB external set (284 compounds)
ID
size
model
MAEProTox
RMSEProTox
MAET3DB
RMSET3DB
M17 M18 M19 M20 M21 M22 M23 M24 M25
200 200 200 200 200 200 200 200 200
consensus (average) consensus (weighted average) consensus (minimum) consensus (average) consensus (weighted average) consensus (minimum) consensus (average) consensus (weighted average) consensus (minimum)
0.4106 0.4108 0.4073 0.3975 0.3976 0.3971 0.3807 0.3805 0.3848
0.5901 0.5902 0.5890 0.5784 0.5785 0.5781 0.5663 0.5656 0.5756
0.3004 0.3006 0.2968 0.3019 0.3029 0.2796 0.3016 0.3049 0.3057
0.5157 0.5155 0.5221 0.5098 0.5097 0.5161 0.5091 0.5095 0.5272
a
The models from M17 to M19, from M20 to M22, and from M23 to M25 were built on the EPA training set (5931 structures), EPA-full training set (7413 structures), and Zhu training set (10 152 structures), respectively.
5000). A labeling between Class I and Class V is considered as toxicological warning.
being better than the ones obtained by the best base model (see Table 1). It indicates that these models may have a better behavior on the test set. In this sense, it can be noted that all the ensemble and consensus models achieve 0.8024 ≤ Rtest ≤ 0.8103 and 0.4106 ≤ MAEtest ≤ 0.4267. In detail, consensus models M17 (Rtest = 0.8103 and MAEtest = 0.4136), M18 (Rtest = 0.8102 and MAEtest = 0.4138), and M19 (Rtest = 0.8095 and MAEtest = 0.4106) are those with the best performance. Model M19 presents the lowest MAEtest. Moreover, as it was explained above, regression models to predict AOT were also built on both the EPA-full (7413 structures) and Zhu (10 152 structures) training sets, to consider a wider chemical space in the learning process and, in addition, to guarantee comparability of results with regard to several models reported in the literature.12−17 Specifically, the QSAR models developed are based on the same QuBiLS-MAS 2D-MDs and learning method used to build models M13 and M15, which make up the consensus models from M17 to M19, which are the ones with the best performances so far. Therefore, only consensus models based on the AVE, WA, and Min functions will be analyzed for the two training sets mentioned above. It is expected that the new consensus models yield better predictions on external sets than the consensus models created on the EPA training set (5931 structures). In this sense, Table 3 shows the results obtained by the consensus models on the ProTox and T3DB external sets. On one hand, if the performance of the models built on the EPA training set (from M17 to M19), EPA-full training set (from M20 to M22), and Zhu training set (from M23 to M25) is analyzed, it can be seen that, for the first two training sets, the best result is achieved by the Min function-based consensus models M19 (MAEProTox = 0.4073, MAET3DB = 0.2968) and M22 (MAEProTox = 0.3971, MAET3DB = 0.2796), respectively, whereas for the latter training set mentioned, the AVE function-
3. RESULTS AND DISCUSSION 3.1. Performance of the Best Individual, Ensemble, and Consensus Models. From the data matrices built, 84 individual (or base) models (42 k-NN, 14 MLP, 14 RF, and 14 SVM) were developed and examined. Only 12 of these regression models achieved the cutoff of REPA−training ≥ 0.75 and MAEEPA−training ≤ 0.5 in a 10-fold cross validation. Therefore, they were included in the pool of base models for the ensemble and consensus models building. Table 1 shows the performance of these models on the EPA training/test sets. As it can be observed, the number of QuBiLS-MAS 2D-MDs included in the models range from 37 to 228, for a ratio of cases-to-independent variable of 160:1 and 26:1 in the smallest and largest models, respectively. It can also be observed that all the models present a good behavior on the test set, at presenting values of 0.7671 ≤ Rtest ≤ 0.7991 and 0.4329 ≤ MAEtest ≤ 0.4620. The best performances belong to models M02 (Rtest = 0.7991, MAEtest = 0.4329), M01 (Rtest = 0.7952, MAEtest = 0.4355), and M03 (Rtest = 0.7942 and MAEtest = 0.4416), being the only ones with Rtest ≥ 0.79. Lastly, it can also be observed that the RF and SVM learning methods allow to achieve the best correlations, except in models M11 and M12, which are based on the k-NN (k = 6) method. Posteriorly, from the best individual (or base) models (see Table 1), ensemble models were built using the bagging,63 additive regression,64 and stacking65 procedures, as well as consensus models based on the Max, Min, AVE, and WA functions. Table 2 shows the statistics of the best ensemble and consensus models built for the EPA training/test sets. As it can be observed, the ensemble models present 0.7801 ≤ REPA−training ≤ 0.7885 and 0.4224 ≤ MAEEPA−training ≤ 0.4312, both statistics 1183
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology
Table 4. Performance of the Best Consensus Models Built on the EPA Training Set (from M17 to M19), EPA-Full Training Set (from M20 to M22), and Zhu Training Set (from M23 to M25), Respectively, According to Their Applicability Domains EPA test set (1482 compounds)
ProTox external set (425 compounds)
T3DB external set (284 compounds)
ID
% coverage
MAEtest‑AD
RMSEtest‑AD
% coverage
MAEProTox‑AD
RMSEProTox‑AD
% coverage
MAET3DB‑AD
RMSET3DB‑AD
M17 M18 M19 M20 M21 M22 M23 M24 M25
98.25 98.25 98.25
0.4084 0.4087 0.4044
0.5611 0.5613 0.5613
94.59 94.59 94.59 97.41 97.41 97.41 94.12 94.12 94.12
0.4096 0.4098 0.4067 0.3992 0.3993 0.3992 0.3774 0.3773 0.3811
0.5925 0.5926 0.5916 0.5813 0.5814 0.5809 0.5688 0.5681 0.5779
95.42 95.42 95.42 95.42 95.42 95.42 93.66 93.66 93.66
0.2632 0.2635 0.2586 0.2547 0.2559 0.2286 0.2434 0.2471 0.2444
0.4402 0.4403 0.4405 0.4023 0.4030 0.3957 0.3859 0.3871 0.3979
Figure 1. Comparative graphic of the performance achieved by consensus model M19 with respect to seven methods reported in the literature on the EPA test set (1482 compounds), according to the AD of the GUSAR wNN method, TEST consensus method, and model M19.
3.2. Applicability Domain of the Best Models Built. An AD analysis (see Section 2.4) for each consensus model built was done, with the purpose of determining how trustworthy are the predictions performed by them. Table 4 shows the outcomes obtained in this study for the EPA test set, ProTox external set, and T3DB external set. If the coverage statistical parameter (i.e., ratio of compounds that is expected to fall within the AD) is examined, it can be observed that, for the EPA test set, the models present a coverage equal to 98.25%, while for the ProTox and T3DB external sets, the models present a coverage between 93.58% and 96.89% and between 93.66% and 95.42% of the
based model M23 (MAEProTox = 0.3807, MAET3DB = 0.3016) and the WA function-based model M24 (MAEProTox = 0.3805, MAET3DB = 0.3049) are the best ones. On the other hand, it can be observed that the performance of the models on the ProTox set improves as long as the number of training compounds increases. A similar behavior can be noted in the T3DB external set. However, in this external set, the models built on the Zhu set present a slight decrease in its performance as compared to the ones built on the EPA-full training set. See Supporting Information 2 for the values predicted. 1184
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology
Figure 2. Comparative graphic of the performance achieved by consensus models M19, M22, and M24 with respect to four methods reported in the literature on the ProTox external set (425 compounds), according to the applicability domain (AD) of the GUSAR wNN method, GUSAR nNN method, and model M24.
models built and an analysis of the outcomes achieved by them have been performed. However, it is important to perform a comparison with respect to other procedures established in the literature. To this end, from the results previously described, it can be stated that consensus models M19, M22, and M24 (see Supporting Information 3 to choose between models M23 and M24) are the best ones for each training set accounted for. Thus, they will be used in the comparison. Several freely accessible or commercial QSAR/similarity-based tools are also used for this study. These tools include the TEST software (version 4.2.1), 10,11 GUSAR software, 16 admetSAR web server (v1.0),12,13 DL-AOT web server,18 and ProTox-II web server.14,15 Hereafter, the combinatorial use of the QNA (Quantitative Neighborhoods of Atoms) and MNA (Multilevel Neighborhoods of Atoms) MDs is referred to as GUSAR consensus method (see Supporting Information 4 for the training results with this approach). Study Case 1: Performance on the EPA Test Set. Figure 1 shows a comparative graphic of the performance (MAE) achieved by seven QSAR methods reported in the literature with regard to the performance achieved by consensus model M19 on the EPA test set (1482 compounds). All these methods were built on the EPA training set (5931 compounds). Therefore, comparability of results is guaranteed. In detail, the methods from the literature to be compared are the GUSAR
predictions performed, respectively. Therefore, in a general sense, it can be concluded that all the consensus models built perform highly reliable predictions according to the corresponding AD. Nonetheless, it is important to remark that the models built on the EPA-full training set (from M20 to M22) are those with the best coverage, followed by the models built on the EPA training set (from M17 to M19) and Zhu training set (from M23 to M25), respectively. Moreover, if the statistics of the predictions performed are analyzed (see Table 4), it can be observed that all the consensus models present comparable-to-superior results with respect to the ones obtained when the AD is not accounted for (see Tables 2 and 3). Note that the best results were achieved on the T3DB external set, where the models present 0.2286 ≤ MAET3DB−AD ≤ 0.2635 (without AD: 0.2796 ≤ MAET3DB ≤ 0.3057) and 0.3859 ≤ RMSET3DB−AD ≤ 0.4405 (without AD: 0.5091 ≤ RMSET3DB ≤ 0.5272). Lastly, it can be noted that models M19 (MAEtest−AD = 0.4044, MAEProTox−AD = 0.4067, MAET3DB−AD = 0.2586), M22 (MAEProTox−AD = 0.3992, MAET3DB−AD = 0.2286), M23 (MAEProTox−AD = 0.3774, MAET3DB−AD = 0.2434), and M24 (MAEProTox−AD = 0.3773, MAET3DB−AD = 0.2471), just as in the previous analysis without AD, are the ones with the best performances. 3.3. Comparison with Respect to Other QSAR Models Reported in the Literature. So far, a description of the best 1185
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology
Figure 3. Comparative graphic of the performance achieved by consensus models M19, M22, and M24 with respect to six methods reported in the literature on the T3DB external set (284 compounds), according to the applicability domain (AD) of the GUSAR wNN method, GUSAR nNN method, TEST consensus method, and model M24.
shows the statistical parameters obtained according to all the ADs. Study Case 2: Performance on the ProTox External Set. Figure 2 depicts the performance achieved by the GUSAR wNN,16 GUSAR nNN, 16 DL-AOT, 18 and admetSAR12 methods, as well as the one obtained by consensus models M19, M22, and M24, according to their ADs on the ProTox external set (425 compounds) (for more details see Supporting Information 6). The TEST software10 fails during the prediction (error in the MDs computation) on the ProTox set, and thus, none of its methods were taken into account in the current comparison. ProTox method14,15 was not considered in this study, because the external set used presents a similar distribution to the training set (38 515 compounds) of this method. Note that, excepting consensus models M19 and M22, all the remaining QSAR methods were built on the Zhu training set (10 152 compounds). As a result, consensus model M24 is the one with the best performance for all the ADs analyzed. This model is the only one in always achieving MAEProTox < 0.38, whereas the other QSAR methods yield MAEProTox > 0.39. It can also be observed that consensus models M19 (0.3965 ≤ MAEProTox ≤ 0.4054) and M22 (0.3907 ≤ MAEProTox ≤ 0.3967), as well as the admetSAR (0.3905 ≤ MAEProTox ≤ 0.3999) and GUSAR wNN (0.3915 ≤ MAEProTox ≤ 0.4029) methods, present comparable perform-
consensus method considering nearest neighbors (GUSAR wNN),16 the GUSAR consensus method without considering nearest neighbors (GUSAR nNN),16 the GUSAR method using only QNA MDs,16 the GUSAR method using only MNA MDs,16 the TEST FDA method (TEST FDA),10 the TEST nearest neighbors method (TEST NN),10 and the TEST consensus method (TEST consensus).10 The comparisons were performed according to the AD of the GUSAR wNN method, the TEST consensus method, and model M19. TEST hierarchical clustering method10 was not included in this comparison due to its poor performance. As it can be observed in Figure 1, consensus model M19, built with the QuBiLS-MAS 2D-MDs, presents the best performance irrespectively of the AD considered, being the only model in always achieving MAEtest < 0.41. It can also be noted that the AD of model M19 is the second best (coverage = 98.25%), which is only slightly overcome by the AD of the TEST consensus method (coverage = 98.38%). Moreover, in a general sense, the GUSAR wNN method achieves the second-best results, only surpassed by the TEST consensus method if the AD of this latter is analyzed. Finally, it can be observed that the GUSAR nNN and TEST NN methods present similar performances and that the TEST FDA, GUSAR MNA, and GUSAR QNA methods are the ones with the worst outcomes. Supporting Information 5 1186
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology
Figure 4. Graphics corresponding to the GHS labeling obtained in the WITHDRAWN set by (A) consensus models M19, M22, and M24 proposed in this work and (B) consensus model M22 (the best one in external prediction) and the TEST consensus, DL-AOT, admetSAR, and ProTox methods established in the literature.
methods, which are those most commonly used in the literature, obtain the poorest behaviors, at presenting MAET3DB > 0.6. All in all, it can be concluded that the consensus models (M19, M22, and M24) proposed constitute prominent alternatives to be used in the study of AOT, because they achieve results superior to the ones obtained by all the QSAR methods analyzed from the literature. In this way, the hypothesis assumed in the present work is confirmed, which poses that the use of current def initions of MDs may contribute to develop better QSAR models to predict AOT. It is important to remark that the models built always presented the best performance regardless of the chemical data set used to assess them, to difference of the QSAR methods from the literature, which showed different performances in the sets considered. For instance: DL-AOT method was the worst one in the ProTox external set, whereas it was second-best one in the T3DB external set. A statistical assessment to confirm the results obtained is presented below. 3.4. Statistical Analysis of the External Predictive Ability Achieved by the Best Models Built. This analysis was made to determine if the outcomes obtained by the best consensus models built (i.e., M19, M22, and M24) are statistically different than the ones achieved by the QSAR methods from the literature. To this end (see Section 2.5), the absolute error module was computed for the 360 (of the 425) and 250 (of the 284) predictions within the AD of all the QSAR methods (models) in the ProTox and T3DB external sets, respectively (see Supporting Information 8). For the analysis in the ProTox set, the admetSAR12 and DL-AOT18 methods, as well as the GUSAR software-based methods16 (i.e., GUSAR wNN, GUSAR nNN, GUSAR MNA, GUSAR QNA) were considered, whereas for the T3DB set, in addition of the QSAR methods aforementioned, the ProTox14,15 method and the TEST software-based methods10,11 (i.e., TEST HC, TEST FDA, TEST NN, and TEST consensus) were also accounted for. According to the normality test made, the null hypothesis was rejected for all the methods under study in both chemical data sets (Supporting Information 9). Therefore, nonparametric tests were applied. In this sense, global differences were determined according to the results obtained from the Friedman test
ances, even when models M19 and M22 were built on smaller training sets. Note that the GUSAR nNN (0.4164 ≤ MAEProTox ≤ 0.4269) and DL-AOT (0.4602 ≤ MAEProTox ≤ 0.4654) methods are those with the poorest outcomes. The DL-AOT method is based on a convolutional neural network, which is a more complex procedure than the one used by the models proposed here; even then, models M24, M22, and M19 achieve better results for the external set under study. Study Case 3: Performance on the T3DB External Set. Finally, Figure 3 represents the performance obtained by the GUSAR wNN,16 GUSAR nNN,16 TEST consensus,10 DLAOT,18 admetSAR,12 and ProTox14,15 methods, as well as the one achieved by the best QSAR models proposed in this report (i.e., consensus models M19, M22, and M24), according to their ADs on the T3DB external set (284 compounds). Other QSAR models were also analyzed (e.g., TEST NN, GUSAR MNA), but as they yielded poor outcomes, these were not graphically represented. Supporting Information 7 contains the results corresponding to the 14 methods examined. We remark that the ProTox method was built on 38515 training compounds, that the TEST software-based methods and model M19 were created on the EPA training set (5931 compounds), that model M22 was developed on the EPA-full training set (7413 compounds), and that the other QSAR methods were built on the Zhu training set (10 152 compounds). As it can be noted, consensus model M22 (0.2165 ≤ MAEProTox ≤ 0.2634), followed by the DL-AOT method (0.2374 ≤ MAEProTox ≤ 0.2752) and consensus models M19 (0.2368 ≤ MAEProTox ≤ 0.2798) and M24 (0.2451 ≤ MAEProTox ≤ 0.2870), are the ones with the best outcomes for all the ADs analyzed. Specifically, consensus model M22 is the only one in obtaining MAET3DB < 0.23 (MAET3DB = 0.2165 and MAET3DB = 0.2226 within the AD of the TEST consensus method and model M24, respectively), whereas the remaining procedures achieve results superior to the threshold aforementioned. Moreover, it can also be seen that the GUSAR nNN (0.2754 ≤ MAEProTox ≤ 0.3113) and GUSAR wNN (0.3080 ≤ MAEProTox ≤ 0.3544) methods have comparable performances. Curiously, the ProTox, admetSAR, and TEST consensus 1187
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology
Table 5. Pearson Correlation-Based Similarity Coefficients among the Models Proposed and QSAR Methods Most Commonly Used in the Literature to Predict AOTa TEST consensus DL-AOT admetSAR ProTox M19 M22 M24
TEST consensus
DL-AOT
admetSAR
ProTox
M19
M22
M24
1.000 0.496 0.433 0.351 0.447 0.465 0.456
1.000 0.658 0.557 0.531 0.599 0.666
1.000 0.612 0.648 0.723 0.848
1.000 0.506 0.548 0.648
1.000 0.890 0.735
1.000 0.814
1.000
a
The similarities were calculated from the toxicity classes obtained by each model in the WITHDRAWN dataset according to the GHS labelling.
(Supporting Information 10). According to this test, model M24 has the rank of 1 (best-performing model) in the ProTox set, followed by models GUSAR wNN, M19, M22, admetSAR, DLAOT, GUSAR nNN, GUSAR MNA, and GUSAR QNA, in that order; whereas in the T3DB set, model M22 is the one with the best performance, followed by models M19, DL-AOT, M24, GUSAR MNA, GUSAR nNN, TEST FDA, GUSAR wNN, GUSAR QNA, TEST NN, TEST consensus, ProTox, admetSAR, and TEST HC, in that order. These outcomes confirm the descriptive assessment performed in the previous section, where it was indicated that the QSAR models based on the QuBiLS-MAS 2D-MDs obtain the best overall performances. Finally, it can be observed that, according to the results obtained from the Holm posthoc procedure (Supporting Information 10), model M24 is statistically superior to the DL-AOT, GUSAR nNN, GUSAR MNA, and GUSAR QNA methods in the ProTox set. In this data set, model M24 is not statistically better than the GUSAR wNN and admetSAR methods, neither with respect to models M19 and M22. Moreover, in the T3DB set, model M22 presents a statistically superior performance with regard to all the QSAR methods from the literature (Supporting Information 10), excepting the DLAOT method, as well as models M19 and M24. Therefore, it can be statistically stated that the QuBiLS-MAS 2D-MDs allowed to build the best QSAR models to predict AOT. Lastly, it can also be statistically concluded that model M22 is the best of all by a short difference, followed by models M24 and M19 (Supporting Information 11), respectively. 3.5. Results of the Retrospective Virtual Screening in a Withdrawn Drug Data Set. This study was performed to assess the usefulness of prospectively using the best QSAR models created in the labeling of chemicals. To this end, the WIHTDRAWN data set was used, because it contains drugs recalled from the market in at least one country, due to their toxic/side effects. Figure 4A shows the labeling yielded by models M19, M22, and M24 according to the GHS standard on the set aforementioned (see Supporting Information 12). On one hand, it is expected that the majority of these withdrawn drugs is not labeled as highly toxic (Class I−IIfatal if swallowed), because they were marketable drugs, and thus, they passed several preclinical toxicological assays. Thereby, it can be analyzed that the models considered only label four drugs as highly toxic: indomethacin (gastrointestinal), amphetamine (cardiovascular), aprobarbital (respiratory), and bithionol (dermatological). This labeling is in correspondence with the toxic/side effects evidenced by these drugs. However, it is also expected that only a few of these drugs are in a Class VI (nontoxic) labeling, because, as it was aforementioned, they present toxic/side effects that caused
them to be recalled from the market. In this sense, it can be noted that only 7 (0.026%), 9 (0.033%), and 17 (0.063%) drugs were labeled in this class (e.g., gentamicin) by models M19, M22, and M24, respectively. Lastly, it can be observed that the majority of these compounds were labeled between Class III (e.g., oxyphenisatin) and Class V (e.g., practolol), mainly Class IV (e.g., alosetron). That is, the majority of these withdrawn drugs were identified as toxic or harmful if swallowed. In general, models M19, M22, and M24 yielded toxicological warnings between Class I and Class IV, for 226 (84.64%), 221 (82.77%), and 201 (75.28%) drugs, and between Class I and Class V for 260 (97.38%), 258 (96.63%), and 250 (93.63%) drugs, respectively. Moreover, Figure 4B shows the labeling obtained by model M22 and the TEST consensus,10 DL-AOT,18 admetSAR,12 and ProTox14 methods. Model M22 was selected, because it presents the best external predictive power (see Section 3.4). As it can be noted, similar results were obtained by the methods from the literature, at labeling the majority of the withdrawn drugs between Class III and Class V. In summary, model M22, TEST consensus method, DL-AOT method, admetSAR method, and ProTox method generate toxicological warnings between Class I and Class IV for 221 (82.77%), 207 (77.53%), 197 (73.78%), 192 (71.91%), and 200 (74.91%) drugs and between Class I and Class V for 258 (96.63%), 250 (93.63%), 253 (94.76%), 248 (92.88), and 250 (93.63%) drugs, respectively. As it can be analyzed, for both warning groups, model M22 is the one with the highest ability of labeling a compound as toxic (or harmful). Note that model M19 also presents better ability than the methods from the literature to generate toxicological warnings, while model M24 has a comparable behavior. To conclude, Table 5 shows the Pearson correlation-based similarity coefficients among the consensus models proposed and the QSAR methods from the literature considered in this study. The similarity coefficients were determined from the toxicity classes obtained according to the GHS labeling. As it can be observed, models M19, M22, and M24 majorly present moderate-to-low similarity coefficients (C< (minssssC) in PC7.
4. CONCLUSIONS Several QSAR models have been developed to predict AOT, by using the QuBiLS-MAS framework for the molecular encoding. 1189
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology ORCID
Three training sets comprised of 5931 (EPA training set), 7413 (EPA-full training set), and 10 152 (Zhu training set) compounds were used for model building (EPA training set ⊆ EPA-full set ⊆ Zhu set). The models created on the EPA training set were validated on the EPA test set (1482 compounds) as well as on the ProTox (425 compounds) and T3DB (284 compounds) external sets, whereas the models built on the other two training sets were only assessed on the external sets mentioned above. The models developed ranging from 37 to 228 QuBiLS-MAS 2D-MDs, and they present good predictive ability and high coverage percent according to the AD analysis performed. Specifically, two Min operator-based consensus models denoted as M19 and M22, as well as a WA operator-based consensus model denoted as M24, were selected as the best ones for each training set accounted for. On one hand, model M19, created on the EPA training set, presents MAEtest−AD = 0.4044, MAEProTox−AD = 0.4067, and MAET3DB−AD = 0.2586 on the 1456, 402, and 271 compounds included within the AD for the EPA test set, ProTox external set, and T3DB external set, respectively. On the other hand, model M22, built on the EPA-full training set, has MAEProTox−AD = 0.3992 and MAET3DB−AD = 0.2286 on the 414 and 271 compounds inside its AD for the ProTox and T3DB external sets, respectively. In addition, for these two external sets, model M24, built on the Zhu training set, presents MAEProTox−AD = 0.3773 and MAET3DB−AD = 0.2471 on the 400 and 266 compounds within its AD, respectively. These outcomes were compared and statistically validated with respect to the ones obtained by 14 QSAR methods from the literature, which include the TEST software-based methods,10 the GUSAR software-based methods,16 and the admetSAR,12 ProTox,14 and DL-AOT18 methods. As a result, model M22 presents the best performance. It is important to remark that model M19 is the one with the best performance on the EPA test set among all the QSAR methods built on the EPA training set (e.g., TEST software-based methods). Moreover, a retrospective study on 261 withdrawn drugs due to their toxic/side effects was performed, to assess the ability of the models proposed in generating toxicological warnings. A comparison with regard to the methods from the literature was also made. As a result, model M22 has the highest ability of labeling a compound as toxic (or harmful) according to the GHS scheme.78 All in all, it can be concluded that the models proposed here, especially model M22, constitute prominent tools to predict AOT, at providing the best results among all the methods examined. To this end, a software was built to be used in virtual screening tasks (http://tomocomd.com/apps/ptoxra).
■
César R. García-Jacas: 0000-0002-3962-7658 Yovani Marrero-Ponce: 0000-0003-2721-1142 Notes
The authors declare no competing financial interest.
■
ACKNOWLEDGMENTS C.R.G.J. acknowledges to the program “Cátedras CONACYT” from “Consejo Nacional de Ciencia y Tecnologiá (CONACYT)”, México, by the support to the endowed chair 501/ ́ 2018 at “Centro de Investigación Cientifica y de Educación Superior de Ensenada”. Y.M.P. thanks to the program “Profesor Invitado” for a postdoctoral fellowship to work at Valencia Univ. in 2018−2019. Y.M.P. also acknowledges the support from USFQ “Chancellor Grant No. 2017-2018 (Project ID11192)
■
ASSOCIATED CONTENT
S Supporting Information *
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.chemrestox.9b00011.
■
REFERENCES
(1) Boyerinas, B., Jochems, C., Fantini, M., Heery, C. R., Gulley, J. L., Tsang, K. Y., and Schlom, J. (2015) Antibody-Dependent Cellular Cytotoxicity Activity of a Novel Anti−PD-L1 Antibody Avelumab (MSB0010718C) on Human Tumor Cells. Cancer Immunol. Res. 3, 1148−1157. (2) Adrees, M., Ali, S., Rizwan, M., Zia-Ur-Rehman, M., Ibrahim, M., Abbas, F., Farid, M., Qayyum, M. F., and Irshad, M. K. (2015) Mechanisms of silicon-mediated alleviation of heavy metal toxicity in plants: A review. Ecotoxicol. Environ. Saf. 119, 186−197. (3) Mebs, D. (2001) Toxicity in animals. Trends in evolution? Toxicon 39, 87−96. (4) Saha, N. C., Bhunia, F., and Kaviraj, A. (1999) Toxicity of Phenol to Fish and Aquatic Ecosystems. Bull. Environ. Contam. Toxicol. 63, 195−202. (5) Walum, E. (1998) Acute oral toxicity. Environ. Health Perspect. 106, 497−503. (6) Enslein, K. (1978) A toxicity estimation model. J. Environ. Pathol. Toxicol. 2, 115−121. (7) Enslein, K., Lander, T. R., Tomb, M. E., and Craig, P. N. (1989) A Predictive Model for Estimating Rat Oral Ld50 Values. Toxicol. Ind. Health 5, 265−265. (8) Eldred, D. V., and Jurs, P. C. (1999) Prediction of Acute Mammalian Toxicity of Organophosphorus Pesticide Compounds from Molecular Structure. SAR QSAR Environ. Res. 10, 75−99. (9) Guo, J.-X., Wu, J. J. Q., Wright, J. B., and Lushington, G. H. (2006) Mechanistic Insight into Acetylcholinesterase Inhibition and Acute Toxicity of Organophosphorus Compounds:: A Molecular Modeling Study. Chem. Res. Toxicol. 19, 209−216. (10) Zhu, H., Martin, T. M., Ye, L., Sedykh, A., Young, D. M., and Tropsha, A. (2009) Quantitative Structure-Activity Relationship Modeling of Rat Acute Toxicity by Oral Exposure. Chem. Res. Toxicol. 22, 1913−1921. (11) TEST software. https://www.epa.gov/chemical-research/ toxicity-estimation-software-tool-test (April 2, 2019). (12) Cheng, F., Li, W., Zhou, Y., Shen, J., Wu, Z., Liu, G., Lee, P. W., and Tang, Y. (2012) admetSAR: A Comprehensive Source and Free Tool for Assessment of Chemical ADMET Properties. J. Chem. Inf. Model. 52, 3099−3105. (13) admetSAR software. http://lmmd.ecust.edu.cn/admetsar1/ home/ (April 2, 2019). (14) Drwal, M. N., Banerjee, P., Dunkel, M., Wettig, M. R., and Preissner, R. (2014) ProTox: a web server for the in silico prediction of rodent oral toxicity. Nucleic Acids Res. 42, W53−W58. (15) ProTox-II software. http://tox.charite.de/protox_II/ (April 2, 2019). (16) Lagunin, A., et al. (2011) QSAR Modelling of Rat Acute Toxicity on the Basis of PASS Prediction. Mol. Inf. 30, 241−250. (17) Lu, J., Peng, J., Wang, J., Shen, Q., Bi, Y., Gong, L., Zheng, M., Luo, X., Zhu, W., Jiang, H., and Chen, K. (2014) Estimation of acute oral toxicity in rat using local lazy learning. J. Cheminf. 6, 26.
Molecular structures in training sets, predictions, tabulated RMSE and MAE data, models built on training sets, tabulated correlation coefficients, tests of normality, Friedman and NPar tests, PCA results (ZIP)
AUTHOR INFORMATION
Corresponding Author
*E-mail:
[email protected],
[email protected],
[email protected]. 1190
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology (18) Xu, Y., Pei, J., and Lai, L. (2017) Deep Learning Based Regression and Multiclass Models for Acute Oral Toxicity Prediction with Automatic Chemical Feature Extraction. J. Chem. Inf. Model. 57, 2672−2685. (19) Hand, D. J. (2006) Classifier Technology and the Illusion of Progress. Stat. Sci. 21, 1−14. (20) Young, S. S., et al. (2012) Chemical Descriptors Are More Important Than Learning Algorithms for Modelling. Mol. Inf. 31, 707− 710. (21) Skoraczyński, G., Dittwald, P., Miasojedow, B., Szymkuć, S., Gajewska, E. P., Grzybowski, B. A., and Gambin, A. (2017) Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient? Sci. Rep. 7, 3582. (22) Validation of (Q)SAR Models. http://www.oecd.org/ chemicalsafety/risk-assessment/validationofqsarmodels.htm (April 2, 2019). (23) Whitley, D., and Watson, J. Complexity Theory and the No Free Lunch Theorem. In Search Methodologies; Burke, E., and Kendall, G., Eds.; Springer US: Boston, MA, 2005; pp 317−339. (24) Barigye, S. J., Marrero-Ponce, Y., López, Y. M., Santiago, O. M., Torrens, F., Domenech, R. G., and Galvez, J. (2013) Event-based criteria in GT-STAF information indices: theory, exploratory diversity analysis and QSPR applications. SAR QSAR Environ. Res. 24, 3−34. (25) Martínez-Santiago, O., Millán-Cabrera, R., Marrero-Ponce, Y., Barigye, S. J., Martínez-López, Y., Torrens, F., and Pérez-Giménez, F. (2014) Discrete Derivatives for Atom-Pairs as a Novel GraphTheoretical Invariant for Generating New Molecular Descriptors: Orthogonality, Interpretation and QSARs/QSPRs on Benchmark Databases. Mol. Inf. 33, 343−368. (26) Cubillán, N., Marrero-Ponce, Y., Ariza-Rico, H., Barigye, S. J., García-Jacas, C. R., Valdes-Martini, J. R., and Alvarado, Y. J. (2015) Novel global and local 3D atom-based linear descriptors of the Minkowski distance matrix: theory, diversity−variability analysis and QSPR applications. J. Math. Chem. 53, 2028−2064. (27) Marrero-Ponce, Y., García-Jacas, C. R., Barigye, S. J., ValdésMartiní, J. R., Rivera-Borroto, O. M., Pino-Urias, R. W., Cubillán, N., Alvarado, Y. J., et al. (2015) Optimum Search Strategies or Novel 3D Molecular Descriptors: is there a Stalemate? Curr. Bioinf. 10, 533−564. (28) García-Jacas, C. R., Marrero-Ponce, Y., Barigye, S. J., ValdésMartiní, J. R., Rivera-Borroto, O. M., and Olivero-Verbel, J. (2014) NLinear Algebraic Maps to Codify Chemical Structures: is a suitable generalization to the atom-pairs approaches? Curr. Drug Metab. 15, 441−469. (29) Valdés-Martiní, J. R., Marrero-Ponce, Y., García-Jacas, C. R., Martinez-Mayorga, K., Barigye, S. J., Vaz d'Almeida, Y. S., Pham-The, H., Pérez-Giménez, F., and Morell, C. A. (2017) QuBiLS-MAS, open source multi-platform software for atom- and bond-based topological (2D) and chiral (2.5D) algebraic molecular descriptors computations. J. Cheminf. 9, 1−26. (30) QuBiLS-MAS software. http://tomocomd.com/qubils-mas (April 2, 2019). (31) Ponce, Y. (2003) Total and Local Quadratic Indices of the Molecular Pseudograph’s Atom Adjacency Matrix: Applications to the Prediction of Physical Properties of Organic Compounds. Molecules 8, 687−726. (32) Marrero-Ponce, Y. (2004) Total and local (atom and atom type) molecular quadratic indices: significance interpretation, comparison to other molecular descriptors, and QSPR/QSAR applications. Bioorg. Med. Chem. 12, 6351−6369. (33) Castillo-Garit, J. A., Martinez-Santiago, O., Marrero-Ponce, Y., Casañola-Martín, G. M., and Torrens, F. (2008) Atom-based nonstochastic and stochastic bilinear indices: Application to QSPR/QSAR studies of organic compounds. Chem. Phys. Lett. 464, 107−112. (34) Marrero-Ponce, Y., Torrens, F., García-Domenech, R., OrtegaBroche, S. E., and Zaldivar, V. R. (2008) Novel 2D TOMOCOMDCARDD molecular descriptors: atom-based stochastic and nonstochastic bilinear indices and their QSPR applications. J. Math. Chem. 44, 650−673.
(35) Marrero-Ponce, Y., Martínez-Albelo, E. R., Casañola-Martín, G. M., Castillo-Garit, J. A., Echevería-Díaz, Y., Zaldivar, V. R., Tygat, J., Borges, J. E. R., García-Domenech, R., Torrens, F., and Pérez-Giménez, F. (2010) Bond-based linear indices of the non-stochastic and stochastic edge-adjacency matrix. 1. Theory and modeling of ChemPhys properties of organic molecules. Mol. Diversity 14, 731−753. (36) Marrero-Ponce, Y., Torrens, F., Alvarado, Y. J., and Rotondo, R. (2006) Bond-based global and local (bond, group and bond-type) quadratic indices and their applications to computer-aided molecular design. 1. QSPR studies of diverse sets of organic chemicals. J. Comput.Aided Mol. Des. 20, 685−701. (37) Marrero-Ponce, Y., Iyarreta-Veitía, M., Montero-Torres, A., Romero-Zaldivar, C., Brandt, C. A., Á vila, P. E., Kirchgatter, K., and Machado, Y. (2005) Ligand-Based Virtual Screening and in Silico Design of New Antimalarial Compounds Using Nonstochastic and Stochastic Total and Atom-Type Quadratic Maps. J. Chem. Inf. Model. 45, 1082−1100. (38) Meneses-Marcel, A., Marrero-Ponce, Y., Machado-Tugores, Y., Montero-Torres, A., Pereira, D. M., Escario, J. A., Nogal-Ruiz, J. J., Ochoa, C., Arán, V. J., Martínez-Fernández, A. R., and García Sánchez, R. N. (2005) A linear discrimination analysis based virtual screening of trichomonacidal lead-like compounds: Outcomes of in silico studies supported by experimental results. Bioorg. Med. Chem. Lett. 15, 3838− 3843. (39) Marrero-Ponce, Y., Meneses-Marcel, A., Castillo-Garit, J. A., Machado-Tugores, Y., Escario, J. A., Barrio, A. G., Pereira, D. M., NogalRuiz, J. J., Arán, V. J., Martínez-Fernández, A. R., Torrens, F., Rotondo, R., Ibarra-Velarde, F., and Alvarado, Y. J. (2006) Predicting antitrichomonal activity: A computational screening using atombased bilinear indices and experimental proofs. Bioorg. Med. Chem. 14, 6502−6524. (40) Meneses-Marcel, A., Marrero-Ponce, Y., Ibáñez-Escribano, A., Gómez-Barrio, A., Escario, J. A., Barigye, S. J., Terán, E., García-Jacas, C. R., Machado-Tugores, Y., Nogal-Ruiz, J. J., and Arán-Redó, V. J. (2018) Drug repositioning for novel antitrichomonas from known antiprotozoan drugs using hierarchical screening. Future Med. Chem. 10, 863− 878. (41) Marrero-Ponce, Y., Huesca-Guillén, A., and Ibarra-Velarde, F. (2005) Quadratic indices of the ’molecular pseudograph’s atom adjacency matrix’ and their stochastic forms: a novel approach for virtual screening and in silico discovery of new lead paramphistomicide drugs-like compounds. J. Mol. Struct.: THEOCHEM 717, 67−79. (42) Casañola-Martín, G. M., Marrero-Ponce, Y., Khan, M. T. H., Ather, A., Sultan, S., Torrens, F., and Rotondo, R. (2007) TOMOCOMD-CARDD descriptors-based virtual screening of tyrosinase inhibitors: Evaluation of different classification model combinations using bond-based linear indices. Bioorg. Med. Chem. 15, 1483− 1503. (43) Casañola-Martín, G. M., Khan, M. T. H., Marrero-Ponce, Y., Ather, A., Sultankhodzhaev, M. N., and Torrens, F. (2006) New tyrosinase inhibitors selected by atomic linear indices-based classification models. Bioorg. Med. Chem. Lett. 16, 324−330. (44) Castillo-Garit, J. A., et al. (2008) Estimation of ADME properties in drug discovery: Predicting Caco 2 cell permeability using atom based stochastic and non stochastic linear indices. J. Pharm. Sci. 97, 1946− 1976. (45) Medina Marrero, R., Marrero-Ponce, Y., Barigye, S. J., Echeverría Díaz, Y., Acevedo-Barrios, R., Casañola-Martín, G. M., García Bernal, M., Torrens, F., and Pérez-Giménez, F. (2015) QuBiLs-MAS method in early drug discovery and rational drug identification of antifungal agents. SAR QSAR Environ. Res. 26, 943−958. (46) Toxin and Toxin Target Database (T3DB). http://www.t3db. ca/ (April 2, 2019). (47) WITHDRAWN: A Resource for Withdrawn and Discontinued Drugs. http://cheminfo.charite.de/withdrawn/index.html (April 2, 2019). (48) O’Boyle, N. M., Banck, M., James, C. A., Morley, C., Vandermeersch, T., and Hutchison, G. R. (2011) Open Babel: An open chemical toolbox. J. Cheminf. 3, 1−14. 1191
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192
Article
Chemical Research in Toxicology (49) García-Jacas, C. R., Cabrera-Leyva, L., Marrero-Ponce, Y., Suárez-Lezcano, J., Cortés-Guzmán, F., and García-González, L. A. (2018) GOWAWA Aggregation Operator-based Global Molecular Characterizations: Weighting Atom/bond Contributions (LOVIs/ LOEIs) According to their Influence in the Molecular Encoding. Mol. Inf. 37, 1−14. (50) García-Jacas, C. R., Cabrera-Leyva, L., Marrero-Ponce, Y., Suárez-Lezcano, J., Cortés-Guzmán, F., Pupo-Meriño, M., and VivasReyes, R. (2018) Choquet integral-based fuzzy molecular characterizations: when global definitions are computed from the dependency among atom/bond contributions (LOVIs/LOEIs). J. Cheminf. 10, 1− 17. (51) Hall, M. A. Correlation-based Feature Selection for Machine Learning; University of Waikato: Hamilton, NewZealand, 1999. (52) Robnik-Š ikonja, M., and Kononenko, I. (2003) Theoretical and Empirical Analysis of ReliefF and RReliefF. Mach. Learn. 53, 23−69. (53) WEKA software. http://www.cs.waikato.ac.nz/ml/weka/ (April 2, 2019). (54) Godden, J. W., Stahura, F. L., and Bajorath, J. (2000) Variability of Molecular Descriptors in Compound Databases Revealed by Shannon Entropy Calculations. J. Chem. Inf. Comput. Sci. 40, 796−800. (55) Urias, R. W. P., Barigye, S. J., Marrero-Ponce, Y., García-Jacas, C. R., Valdes-Martiní, J. R., and Perez-Gimenez, F. (2015) IMMAN: free software for information theory-based chemometric analysis. Mol. Diversity 19, 305−319. (56) Kuncheva, L. I. Ensemble Feature Selection. In Combining Pattern Classifiers: Methods and Algorithms; John Wiley & Sons: NJ, 2014. (57) Kohavi, R., and John, G. H. (1997) Wrappers for feature subset selection. Artif. Intell. 97, 273−324. (58) Altman, N. S. (1992) An Introduction to Kernel and NearestNeighbor Nonparametric Regression. Am. Stat. 46, 175−185. (59) Murtagh, F. (1991) Multilayer perceptrons for classification and regression. Neurocomputing. 2, 183−197. (60) Breiman, L. (2001) Random Forests. Mach. Learn. 45, 5−32. (61) Shevade, S. K., Keerthi, S. S., Bhattacharyya, C., and Murthy, K. R. K. (2000) Improvements to the SMO algorithm for SVM regression. IEEE Trans. Neural Netw. Learn. Syst. 11, 1188−1193. (62) Ü stün, B., Melssen, W. J., and Buydens, L. M. C. (2006) Facilitating the application of Support Vector Regression by using a universal Pearson VII function based kernel. Chemom. Intell. Lab. Syst. 81, 29−40. (63) Breiman, L. (1996) Bagging predictors. Mach. Learn. 24, 123− 140. (64) Friedman, J. H. (2002) Stochastic gradient boosting. Comput. Stat. Data. Anal. 38, 367−378. (65) Wolpert, D. H. (1992) Stacked generalization. Neural Netw. 5, 241−259. (66) Dragos, H., Gilles, M., and Alexandre, V. (2009) Predicting the Predictability: A Unified Approach to the Applicability Domain Problem of QSAR Models. J. Chem. Inf. Model. 49, 1762−1776. (67) Sahigara, F., Mansouri, K., Ballabio, D., Mauri, A., Consonni, V., and Todeschini, R. (2012) Comparison of Different Approaches to Define the Applicability Domain of QSAR Models. Molecules 17, 4791. (68) Roy, K., Kar, S., and Ambure, P. (2015) On a simple approach for determining applicability domain of QSAR models. Chemom. Intell. Lab. Syst. 145, 22−29. (69) Liew, C. Y., Lim, Y. C., and Yap, C. W. (2011) Mixed learning algorithms and features ensemble in hepatotoxicity prediction. J. Comput.-Aided Mol. Des. 25, 855−871. (70) Aranda, J., Garro Martinez, J., Castro, E., and Duchowicz, P. (2016) Conformation-Independent QSPR Approach for the Soil Sorption Coefficient of Heterogeneous Compounds. Int. J. Mol. Sci. 17, 1247. (71) Jaworska, J., et al. (2005) QSAR applicabilty domain estimation by projection of the training set descriptor space: a review. ATLA, Altern. Lab. Anim. 33, 445−459. (72) Netzeva, T. I., Worth, A., Aldenberg, T., Benigni, R., Cronin, M. T., Gramatica, P., Jaworska, J. S., Kahn, S., Klopman, G., Marchant, C.
A., Myatt, G., Nikolova-Jeliazkova, N., Patlewicz, G. Y., Perkins, R., Roberts, D., Schultz, T., Stanton, D. W., van de Sandt, J. J., Tong, W., Veith, G., and Yang, C. (2005) Current status of methods for defining the applicability domain of (quantitative) structure-activity relationships. The report and recommendations of ECVAM Workshop 52. ATLA, Altern. Lab. Anim. 33, 155−73. (73) García-Jacas, C. R., Martinez-Mayorga, K., Marrero-Ponce, Y., and Medina-Franco, J. L. (2017) Conformation-dependent QSAR approach for the prediction of inhibitory activity of bromodomain modulators. SAR QSAR Environ. Res. 28, 41−58. (74) Jaworska, J., and Nikolova-Jeliazkova, N. Ambit Discovery v0.0.4. http://ambit.sourceforge.net/download_ambitdiscovery.html (April 2, 2019). (75) Yap, B. W., and Sim, C. H. (2011) Comparisons of various types of normality tests. J. Stat. Comput. Sim. 81, 2141−2155. (76) Friedman, M. (1940) A Comparison of Alternative Tests of Significance for the Problem of $m$ Rankings. Ann. Math. Stat. 11, 86− 92. (77) Holm, S. (1979) A Simple Sequentially Rejective Multiple Test Procedure. Scand. J. Stat. 6, 65−70. (78) Winder, C., Azzi, R., and Wagner, D. (2005) The development of the globally harmonized system (GHS) of classification and labelling of hazardous chemicals. J. Hazard. Mater. 125, 29−44. (79) Randić, M. (1991) Generalized molecular descriptors. J. Math. Chem. 7, 155−168.
1192
DOI: 10.1021/acs.chemrestox.9b00011 Chem. Res. Toxicol. 2019, 32, 1178−1192