J. Chem. Inf. Model., Just Accepted Manuscript. DOI: 10.1021/acs.jcim.8b00551. Publication Date (Web): February 26, 2019.


In silico Prediction of Endocrine Disrupting Chemicals Using Single-label and Multi-label Models

Lixia Sun, Hongbin Yang, Yingchun Cai, Weihua Li, Guixia Liu, Yun Tang*

Shanghai Key Laboratory of New Drug Design, School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China

*Corresponding author, E-mail: [email protected]


ABSTRACT: Endocrine disruption (ED) has become a serious public health issue and also poses a significant threat to the ecosystem. Due to the complex mechanisms of ED, traditional in silico models focusing on only one mechanism are insufficient for the detection of endocrine disrupting chemicals (EDCs), let alone for offering an overview of the possible mechanisms of action of a known EDC. To remove these limitations, in this study both single-label and multi-label models were constructed across six ED targets, namely AR (androgen receptor), ER (estrogen receptor alpha), TR (thyroid receptor), GR (glucocorticoid receptor), PPARg (peroxisome proliferator-activated receptor gamma) and aromatase. Two machine learning methods were used to build the single-label models, with multiple random under-sampling combined with voting classification to overcome the challenge of data imbalance. Four methods were explored to construct the multi-label models, which can predict the interactions of one EDC against multiple targets simultaneously. The single-label models of all six targets achieved reasonable performance, with balanced accuracy (BA) values from 0.742 to 0.816. The top single-label models were then joined to predict the multi-label test set, with BA values from 0.586 to 0.711. The multi-label models offered a significant boost over the single-label baselines, with BA values on the multi-label test set from 0.659 to 0.832. We therefore conclude that single-label models can be employed for the identification of potential EDCs, while multi-label ones are preferable for the prediction of the possible mechanisms of known EDCs.

Keywords: Endocrine disrupting chemicals; Single-label model; Multi-label model


1. INTRODUCTION

Endocrine disruption (ED) has become a serious public health issue1,2 and also poses a significant threat to the ecosystem.3 Endocrine disrupting chemicals (EDCs) are natural or synthetic compounds that disrupt the normal functions of endogenous hormones in humans or other organisms.4 Exposure to EDCs can lead to detrimental health effects, such as developmental and reproductive deficiencies, cancer, obesity, diabetes, immune disorders, cardiovascular complications and cognitive impairment.5,6 EDCs also have adverse effects on wildlife, causing impairment of normal capacity and population decline.3

The mechanisms of ED are complex,7,8 as EDCs can alter the synthesis, transport, action, metabolism or excretion of hormones. A focal part of ED research is hormone receptor binding,9 which mimics or blocks the normal signaling process. The androgen receptor (AR) and estrogen receptor alpha (ER) are the most commonly studied targets because of their effects on reproduction. However, evidence gathered over the past decades shows that the toxic effects of EDCs are much broader and that their receptors go far beyond the sex hormone receptors, including a variety of other nuclear receptors (NRs) such as the thyroid receptor (TR) and glucocorticoid receptor (GR). Hence, examining how compounds interact with these NRs can provide fundamental information about their ED potential.

During the past century, over 80,000 chemicals have been synthesized for widespread use.10 The toxicological evaluation of the majority of them is insufficient, and a subset of them may be toxic because of ED. EDCs have been found in many sources, including pharmaceuticals, cosmetics, personal care products, pesticides, flame retardants, plasticizers, and so on.7,8 Thus, the detection of chemicals with potential ED characteristics is crucial and urgent. Although high-throughput screening (HTS) is available for the rapid testing of huge numbers of compounds across many ED endpoints,11,12 it is still unfeasible to conduct a thorough experimental evaluation for every chemical. Recently, in silico methods13,14 have been utilized to prioritize chemicals for bioassays, with the advantages of lower cost, higher efficiency, reduced animal use and less resource consumption. A number of quantitative structure-activity relationship (QSAR) models have been built to predict the ED activities of pharmaceuticals and environmental chemicals.10,15-18 However, most previous studies centered only on the prediction of AR- or ER-mediated mechanisms, whereas chemical ED potential can arise through other mechanisms. Therefore, to detect whether a chemical is an EDC, multiple mechanisms of action should be considered.19 Traditional QSAR models predict only a single endpoint at a time (single-label models), which is insufficient for a comprehensive view of the ED characteristics of chemicals. Nevertheless, to the best of our knowledge, there are no in silico models yet for the prediction of EDCs with multiple mechanisms of action.

For that purpose, in addition to single-label models, herein we developed multi-label models to identify EDCs that act simultaneously on multiple ED targets. Six targets, namely AR, ER, TR, GR, PPARg (peroxisome proliferator-activated receptor gamma) and aromatase, were selected as the potential targets of EDCs. Unlike single-label classification (SLC), multi-label classification (MLC) encompasses several prediction tasks, and each instance can be assigned multiple binary labels at the same time instead of only one output value.20,21 Besides its early applications in multimedia22,23 such as images, music, video and news, MLC has now been extended to many other fields, such as gene function prediction,24 drug target prediction,25,26 protein subcellular location prediction27 and drug metabolism prediction.28 Therefore, MLC might be an appropriate method to incorporate the complexity of ED and help to evaluate ED activity at an early stage.

2. MATERIALS AND METHODS

2.1 Data collection and preparation

Tox21 data29 for all six targets were retrieved from the PubChem BioAssay database.30 For each target, all the bioassays of agonist and antagonist screens were collected. To avoid the interference of false positives, only the 'Summary' assays were retained. Detailed information is provided in Table S1.

Each of the original datasets was filtered as follows: (1) compounds without outcome annotations or whose readout was 'inconclusive' were removed; (2) inorganic counter-ions of organic salts and the smaller components of mixtures were stripped, and inorganic and organometallic compounds were removed; (3) charged compounds were neutralized; (4) tautomers were converted to a uniform representation; (5) stereochemical information was removed, since it cannot be distinguished by the fingerprints we used and may lead to duplication; (6) canonical SMILES codes were used to find duplicates and to analyze their binary activity labels. If all the labels were consistent, only one record was kept. If inconsistencies occurred and the majority was positive, the structure was labeled as positive and only one unique entry was retained. Otherwise, the entries were considered ambiguous and all were removed. Then, for targets with several agonist or antagonist datasets, all the data were merged into a modulator dataset for the corresponding target, where a compound is considered positive if it is positive in any of the individual datasets.
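A minimal sketch of curation steps (2)-(6) is given below, assuming RDKit's rdMolStandardize utilities; the paper does not name the exact standardization tool, so the specific calls are illustrative rather than a record of the authors' pipeline.

```python
# Hedged sketch of curation steps (2)-(6); assumes RDKit with rdMolStandardize.
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)    # (2) keep the largest fragment
    mol = rdMolStandardize.Uncharger().uncharge(mol)               # (3) neutralize charges
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol)  # (4) canonical tautomer
    Chem.RemoveStereochemistry(mol)                                # (5) drop stereochemistry
    return Chem.MolToSmiles(mol)                                   # (6) canonical SMILES as duplicate key

def deduplicate(records):
    """records: iterable of (smiles, 0/1 activity). Returns {canonical_smiles: label}."""
    groups = defaultdict(list)
    for smi, label in records:
        key = standardize(smi)
        if key is not None:
            groups[key].append(label)
    kept = {}
    for key, labels in groups.items():
        pos, neg = labels.count(1), labels.count(0)
        if pos == 0 or neg == 0:
            kept[key] = 1 if pos else 0   # fully consistent annotations: keep one record
        elif pos > neg:
            kept[key] = 1                 # conflicting but majority positive: keep as positive
        # remaining conflicts are treated as ambiguous and dropped
    return kept
```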


2.2 Molecular representation

In this study, eight molecular fingerprints were explored, each representing a molecule as a bit vector that describes the presence of specified substructures or patterns. Four fingerprints, namely the CDK Fingerprint (FP), MACCS Fingerprint (MACCS), PubChem Fingerprint (PubFP) and Klekota-Roth Fingerprint (KRFP), were calculated using PaDEL-Descriptor.31 The other four, i.e. the RDKit Fingerprint (RDK), Topological Torsion Fingerprint (Topo), Morgan Fingerprint (Morg) and Atom Pairs Fingerprint (AP), were calculated with RDKit.32
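As a hedged illustration, the four RDKit-based fingerprints can be generated as below; the bit lengths and Morgan radius shown are common defaults, not necessarily the settings used in this study, and the PaDEL fingerprints are computed separately with the PaDEL-Descriptor program.

```python
# Sketch of the four RDKit fingerprints (RDK, Topo, Morg, AP); nBits and radius
# are illustrative defaults, not the values tuned in the paper.
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors

def rdkit_fingerprints(smiles, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    return {
        "RDK":  Chem.RDKFingerprint(mol),  # path-based RDKit fingerprint
        "Topo": rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(mol, nBits=n_bits),
        "Morg": AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits),  # radius-2 (ECFP4-like)
        "AP":   rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(mol, nBits=n_bits),
    }
```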


2.3 Model construction

In this study, both single-label and multi-label classification were explored; the differences between them are illustrated in Figure 1A-C. Specifically, SLC includes binary (Figure 1A) and multiclass (Figure 1B) classification, and in both cases each instance belongs to only one class. In MLC (Figure 1C), by contrast, each instance can belong to several classes at the same time.

For single-label model development, two machine learning algorithms, Support Vector Machine (SVM)33 and Random Forest (RF),34 were used. The combination of multiple random under-sampling with a voting classifier was utilized to address the dataset imbalance issue. As described in Figure 1D, an imbalanced dataset was under-sampled with multiple random seeds to obtain multiple balanced counterparts; each of them was then used to train a classifier, and the majority vote of these classifiers determined the final prediction.
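A minimal sketch of this under-sampling plus voting strategy with scikit-learn is shown below; the SVM hyper-parameters and the number of component models are placeholders rather than the tuned values reported later.

```python
# Sketch of multiple random under-sampling + majority-vote classification
# (Figure 1D); hyper-parameters are placeholders, not the tuned values.
import numpy as np
from sklearn.svm import SVC

def train_voting_ensemble(X, y, n_models=10, seed=0):
    rng = np.random.RandomState(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]   # positives are the minority class
    models = []
    for _ in range(n_models):
        # balanced subset: all positives plus an equal-sized random draw of negatives
        idx = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
        models.append(SVC(kernel="rbf", C=1.0).fit(X[idx], y[idx]))
    return models

def voting_predict(models, X):
    # the final label is the majority vote over the component classifiers
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```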

For MLC model building, four methods were used: Binary Relevance (BR),35 Classifier Chains (CC),36 Label Powerset (LP)23 and Multi-label Random Forest (ML-RF).37 ML-RF is derived from the existing RF algorithm by adapting it to handle multi-label data. For the BR and CC methods, the multi-label dataset is transformed into a set of binary datasets, one per label; a binary model is then trained for each label and all the outputs are combined into the final label-set. The most notable difference between the two methods is that the former trains each model independently, while the latter fits the models sequentially to allow for possible label dependencies, with each successive model receiving the outputs of the previous models as additional features. The LP method transforms the multi-label dataset into a multiclass dataset by treating each unique label-set as a category, after which any multiclass classification method can be used to train a model. RF was utilized as the base classifier. All the code for constructing these models is supplied as a zip file in the Supporting Information.
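A minimal sketch of the LP transformation with an RF base classifier is given below; the forest size is illustrative, and an equivalent ready-made implementation is available, for example, as LabelPowerset(RandomForestClassifier()) in the scikit-multilearn library.

```python
# Sketch of the Label Powerset (LP) transformation with a Random Forest base
# classifier; n_estimators is illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class LabelPowersetRF:
    def fit(self, X, Y):
        # Y is an (n_samples, n_labels) binary matrix; every distinct label-set
        # observed in training becomes one multiclass category
        self.labelsets_, y_class = np.unique(Y, axis=0, return_inverse=True)
        self.clf_ = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y_class)
        return self

    def predict(self, X):
        # map the predicted categories back to their original label-sets
        return self.labelsets_[self.clf_.predict(X)]
```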

2.4 Model validation

The hyper-parameters of all models were optimized according to their 5-fold cross validation (CV) results. The models were further validated on an external test set. To avoid chance prediction, a Y-randomization test38,39 was also applied to the best single-label model per target as well as to the optimal multi-label model, where each model was trained on a training set with randomized activities but with the same parameters as its corresponding original model. This procedure was repeated three times for each model.

The predictive ability of the single-label models was measured with five evaluation metrics: accuracy (ACC), sensitivity (SE), specificity (SP), balanced accuracy (BA) and AUC, defined as follows:

$ACC = \frac{TP + TN}{TP + TN + FP + FN}$  (eq. 1)

$SE = \frac{TP}{TP + FN}$  (eq. 2)

$SP = \frac{TN}{TN + FP}$  (eq. 3)

$BA = \frac{SE + SP}{2}$  (eq. 4)

where TP is the count of true positives, TN the count of true negatives, FP the count of false positives and FN the count of false negatives.

In contrast to the single-value output of a single-label model, the output of a multi-label model is a vector of labels. Hence, the performance of the multi-label models was also evaluated with several dedicated metrics, namely subset accuracy (SubACC), Hamming loss (HL), the Jaccard similarity coefficient (JSC) and micro AUC, defined as follows:21,40

$SubACC = \frac{1}{n}\sum_{i=1}^{n} \llbracket Y_i = Z_i \rrbracket$  (eq. 5)

$HL = \frac{1}{nk}\sum_{i=1}^{n} |Y_i \,\Delta\, Z_i|$  (eq. 6)

$JSC = \frac{1}{n}\sum_{i=1}^{n} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}$  (eq. 7)

where $Y_i$ is the true label-set of the $i$th instance, $Z_i$ the predicted one, $n$ is the number of instances in the dataset, $k$ is the number of labels, $\llbracket Y_i = Z_i \rrbracket$ equals 1 if $Y_i$ equals $Z_i$, $|Y_i \,\Delta\, Z_i|$ is the size of the symmetric difference between $Y_i$ and $Z_i$, and $|Y_i \cap Z_i|$ and $|Y_i \cup Z_i|$ are the numbers of active labels in the intersection and union of $Y_i$ and $Z_i$, respectively.
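As a hedged illustration, eqs. 2-7 can be computed as below for binary 0/1 numpy arrays; scikit-learn's accuracy_score, hamming_loss and jaccard_score(average="samples") closely correspond to eqs. 5-7.

```python
# Sketch of the evaluation metrics in eqs. 2-7 for binary 0/1 integer arrays.
import numpy as np

def balanced_accuracy(y_true, y_pred):
    se = np.mean(y_pred[y_true == 1] == 1)          # sensitivity, eq. 2
    sp = np.mean(y_pred[y_true == 0] == 0)          # specificity, eq. 3
    return (se + sp) / 2.0                          # balanced accuracy, eq. 4

def multilabel_metrics(Y_true, Y_pred):
    """Y_true, Y_pred: (n_samples, n_labels) binary integer matrices."""
    sub_acc = np.mean(np.all(Y_true == Y_pred, axis=1))   # subset accuracy, eq. 5
    hl = np.mean(Y_true != Y_pred)                        # Hamming loss, eq. 6
    inter = np.sum(Y_true & Y_pred, axis=1)
    union = np.sum(Y_true | Y_pred, axis=1)
    jsc = np.mean(np.where(union == 0, 1.0, inter / np.maximum(union, 1)))  # Jaccard, eq. 7
    return sub_acc, hl, jsc
```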

2.5 Definition of applicability domain

The applicability domain of the models was analyzed using a distance-based method,41,42 in which the distances between a query molecule and its 3 nearest neighbors in the training set are compared with a predefined threshold $D_T$:

$D_T = \gamma + Z\sigma$  (eq. 8)

where $\gamma$ is the average Tanimoto distance between pairs of molecules in the training set, $\sigma$ is the standard deviation of these distances, and $Z$ equals 0.5 by default. Only when all three distances are less than $D_T$ is the query molecule considered to be within the domain; otherwise it is outside the domain.
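A minimal sketch of this check is shown below, assuming RDKit bit-vector fingerprints and Z = 0.5 as in the text; the all-pairs distance loop is the simplest possible implementation, not an optimized one.

```python
# Sketch of the distance-based applicability domain check (eq. 8).
import numpy as np
from rdkit import DataStructs

def domain_threshold(train_fps, z=0.5):
    # average and standard deviation of pair-wise Tanimoto distances in the training set
    dists = [1.0 - DataStructs.TanimotoSimilarity(train_fps[i], train_fps[j])
             for i in range(len(train_fps)) for j in range(i + 1, len(train_fps))]
    return np.mean(dists) + z * np.std(dists)        # DT = gamma + Z * sigma

def in_domain(query_fp, train_fps, dt, k=3):
    # inside the domain only if all k nearest training neighbours are closer than DT
    dists = sorted(1.0 - DataStructs.TanimotoSimilarity(query_fp, fp) for fp in train_fps)
    return all(d < dt for d in dists[:k])
```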

3. RESULTS

3.1 Data collection and analysis

Originally, a library of approximately 10K chemicals screened against the six targets was collected (Table S1). After data curation, around 7K compounds remained. To gain a view of the activity profile of the remaining compounds,


a matrix representing the activity value of each compound-target pair is given in Figure 2A, where red denotes positives, blue negatives and white missing values. Not all molecules have activity information for all six endpoints (white regions), and the dataset of each target is extremely imbalanced, with a large number of negatives (blue regions) but only a small fraction of positives (red regions). There are 3000+ compounds with known activity for all the endpoints, which are in principle suitable for multi-label modeling, but as presented in Figure 2B, nearly 90% of them have only negative effects across all the targets. With such an overwhelming majority of completely negative samples, this multi-label dataset (MLD) is not preferable for modeling. Hence, to obtain a relatively balanced one, only compounds positive for at least one of the targets were used for multi-label modeling, which yielded an MLD of 367 compounds. Then, 294 (80%) of them were used for multi-label training and internal validation, and the remaining 73 molecules (20%) were used for external validation. The detailed composition of these two MLDs is shown in Table 1. For the multi-label training set, all targets except AR show noteworthy imbalance, with the ratio between the counts of majority-class and minority-class samples varying from 1.43 to 3.45 (Table 1).

Furthermore, to make the best use of all available data points, we curated a single-label dataset (SLD) for each target by subtracting the molecules in the aforementioned multi-label test set from the whole available dataset. All the SLDs were then randomly separated into a training set and a test set in an 80% : 20% ratio, the


constitution of which is also shown in Table 1. The imbalance of the single-label training sets is even worse, with imbalance ratios from 3.26 to 13.83 (Table 1). As a result, three datasets per target, a single-label training set, a single-label test set and a multi-label test set, were used for single-label model construction and evaluation. All the data used in the study are supplied in SI-I.

Apart from the analysis of activity features, the structural space of all datasets was also explored. According to the Tanimoto similarity indexes and their average values shown in Figure S1, all the datasets cover a diversity of structures; models trained on these diverse training sets should therefore have a wide applicability domain and also be thoroughly validated.

3.2 Strategy to treat the data imbalance issue

When developing the single-label models, we used the strategy of multiple random under-sampling combined with voting classification to overcome data imbalance. In order to make the most of the majority-class instances while limiting computational expense, the number of individual models in each voting classifier was determined by the degree of imbalance of the corresponding training set (approximately twice the imbalance ratio), as listed in Table S2. Figure 3 compares the cross validation metrics of the AR voting classifiers and their constituent individual models built from the combinations of the two algorithms and eight fingerprints. The results of the individual models, shown as boxplots (Figure 3), display obvious fluctuations among models resulting from different randomly sampled datasets, which demonstrates the instability of the traditional


single under-sampling method. Notably, especially for the AUC metric (Figure 3B), most of the stars representing the voting classifiers sit on top of the corresponding boxplots, demonstrating that the voting classifiers are usually superior to their component individual models. For the other five targets, a similar trend was observed (Figure S2), which further verified that our voting classifiers could make use of more data and achieve more stable and accurate performance.

3.3 Construction of single-label models

As mentioned above, the voting classifiers showed superiority over their individual component models, so our final single-label models were constructed using the voting strategy. In this study, the BA metric, which gives the same weight to SE and SP, was used for model optimization and selection, as it guarantees proper performance for both positives and negatives when the dataset is imbalanced. The optimal voting model per target was selected according to the BA metric of cross validation (Figure S3). As listed in Table 2, the best models for AR, ER and PPARg were all constructed with the SVC method and PubFP fingerprint (SVC_PubFP), and the best ones for GR, Aromatase and TR were SVC_FP, RF_PubFP and RF_MACCS, respectively. With regard to their prediction statistics for the corresponding single-label test sets (Table 2), the ER model performed the best, with ACC, AUC, SE and SP values of 0.813, 0.883, 0.819 and 0.812, while the TR model showed the worst performance (ACC = 0.752, AUC = 0.801, SE = 0.732 and SP = 0.753). We assume that the abundant ER training data of 1037 positive and 3377 negative samples contributed to its superiority. That is to say, each individual model in the final voting


classifier was trained on a balanced and diverse training set consisting of 1037 positive and 1037 negative samples. On the other hand, with a total of 286 positive and 3954 negative TR training samples, the failure of individual models trained on an insufficient training set of 572 instances may ultimately explain the unsatisfactory voting classifier. The trend that a bigger training set for each individual model contributes to a better overall voting classifier is illustrated more clearly in Figure S4.

Subsequently, we combined the optimal voting classifiers per target for the prediction of the multi-label test set. As shown in Figure 4, we observed an intriguing phenomenon: the single-label models showed high prediction reliability for positive samples (SE: 0.75-0.913) but failed for negative ones (SP: 0.378-0.516), indicating that the negatives in the multi-label test set were prone to being mis-classified as actives. Note that all the voting classifiers are made up of individual models built on balanced training sets and that they achieved reasonable accuracy for both positives and negatives in the single-label test sets (Table 2). The failure to predict negatives in the multi-label test set may therefore not be due to the unreliability of our models, and is worth further investigation.

3.4 Construction of multi-label models and comparison with single-label ones

In exploratory experiments of multi-label modeling, we found that RF was a preferable base classifier, which was then used in conjunction with the BR, CC and LP approaches. Then 32 multi-label models were developed from the 8 fingerprints (KRFP, AP, MACCS, Morg, FP, PubFP, RDK and Topo) combined with the


4 multi-label methods (BR_RF, CC_RF, LP_RF and ML-RF). Based on the SubACC values of cross validation shown in Figure S5, the top five multi-label models (LP_AP, LP_Morg, LP_FP, LP_Topo and LP_MACCS) were selected for further evaluation, all of which were constructed with the LP method. A consensus model that votes over the predictions of the top five multi-label models was also built to obtain a stronger classifier.
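How the consensus was formed is described only as "voting"; a per-label majority vote over the five predicted label matrices, as sketched below, is one straightforward reading of that step.

```python
# Hedged sketch of the consensus multi-label model: a per-label majority vote
# over the predictions of the five component models (the exact voting scheme is
# not spelled out in the text, so this is an assumed interpretation).
import numpy as np

def consensus_predict(models, X):
    """models: fitted multi-label classifiers whose predict(X) returns an
    (n_samples, n_labels) binary matrix."""
    stacked = np.stack([m.predict(X) for m in models])    # (n_models, n_samples, n_labels)
    return (stacked.mean(axis=0) >= 0.5).astype(int)      # a label is active if most models agree
```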


To decide whether a multi-label model is preferable over a single-label one, we joined the corresponding optimal single-label models per target. This joint model was viewed as the baseline against which our multi-label models were compared. Figure 5 presents the overall prediction results for the multi-label test set obtained from the top five multi-label models and their consensus, compared against those from the combined top single-label models. Overall, the six multi-label models showed no significant differences, with the consensus model outperforming the others slightly. However, the multi-label models offered a remarkable performance boost over the single-label ones (p-values ranging from 3.12e-11 to 2.6e-5 across the four multi-label evaluation metrics), especially in their ability to make fully correct predictions (p-value = 3.12e-11 for SubACC). When comparing the detailed results for each target from the six multi-label models and the combined single-label ones (Figure S6), we found that, in contrast to the single-label models, all multi-label counterparts succeeded in predicting most negative samples (high SP values) but failed in the prediction of positive ones (low SE values) across most targets, with ER as an exception. This might result from the fact that there are many more positives than negatives for ER in the multi-label training set, and the reverse for the other five targets (Table 1).

To alleviate the problem caused by data imbalance, the current multi-label training set was further processed to obtain a more balanced one. To simultaneously reduce the active samples of ER and the inactive ones of the remaining targets without damaging the diversity of the activity space, just 3 of the 38 molecules that are positive only for ER were retained. The constitution of the newly balanced multi-label training set is shown in Table S3, where a relatively balanced training dataset was obtained for most targets. The newly balanced training set was then used to re-train the aforementioned top five multi-label models and to rebuild their consensus. The test set results of the best newly trained multi-label model (LP_AP), the consensus multi-label model and the combined single-label models are summarized in Table 3. Both the consensus multi-label model and LP_AP outperformed the single-label models in terms of AUC and ACC values. In particular, for AR, ER, GR and Aromatase, the multi-label models showed a more balanced performance for positive and negative samples, compared with the failure of the single-label models to predict negatives (SP: 0.378-0.516). The consensus multi-label model achieved reasonable performance for both positives (SE: 0.778-0.837) and negatives (SP: 0.676-0.88) of the AR, ER, GR and Aromatase targets. Although the prediction of positive samples for the TR and PPARg targets was unsatisfactory, their overall performance in terms of AUC values was still preferable.

Finally, in order to ensure that our models have authentic predictivity instead of


chance prediction, the best voting model per target and the optimal multi-label model (LP_AP) were further inspected with the Y-randomization test. Comparing the results in Table 2 and Table S4, or Table 3 and Table S5, the original voting models show a remarkable advantage over their counterparts built on randomized responses. Likewise, the multi-label model is also significantly superior to its Y-randomized version, as shown by the results in Table 3 and Table S6. We therefore consider that all the models achieved meaningful predictivity. The applicability domains of these winning models were also defined, and all the test molecules fell within the domain of the corresponding models, demonstrating that the models were applied within a reasonable chemical space.

4. DISCUSSION

In this study, we developed in silico models for the ED risk assessment of small molecules. Compared with previous studies, which mainly focused on sex hormone receptors such as AR or ER, our models involve six ED-related targets. As the absence of activity at one or two targets cannot be taken as evidence of no endocrine disrupting potential, a comprehensive range of modes of action should be considered. Given the complex mechanisms of endocrine disruption, traditional QSAR models considering only one mode of action are insufficient for the detection of chemicals with ED potential. As a result, building multi-label models that predict across multiple targets may be a good choice. As described in Figure 2, many compounds have active labels on multiple ED-related targets simultaneously, confirming that


it is necessary to construct models that can assess chemical ED potential across multiple mechanisms. Meanwhile, the multi-label dataset we collected here has some degree of multi-labelness, which makes it suitable for multi-label modeling.

Apparently, the ED datasets have a high degree of imbalance with a majority of negatives, which reflects the real-world scenario but makes learning more difficult; it is still challenging for classification methods to learn from imbalanced data. The random under-sampling technique,43 which randomly selects a subset of the majority-class instances, is commonly used to produce a balanced dataset from the original imbalanced one. However, a single round of random under-sampling suffers from high randomness in its results and from the loss of information associated with the more frequent class. Hence, multiple random under-sampling in conjunction with voting classification was utilized for single-label model building. This strategy turns several weak classifiers into a strong one, and the resulting voting classifiers were clearly superior to their component individual models (Figure 3 and Figure S2). The voting classifiers also showed great robustness, as the choice of fingerprint and algorithm had a nearly negligible influence on their performance (Figure S3). Therefore, we think our strategy of combining multiple random under-sampling with voting classification may be a good way to model imbalanced datasets.

Although all the single-label models showed good performance on their corresponding single-label test sets (Table 2), they made many false positive predictions across different targets in the multi-label test set (Figure 4). We noticed that


these wrongly predicted molecules, which are inactive toward the target in question, can interact with at least one of the other five targets. Moreover, the ED-associated targets we studied share a high degree of homology. Consequently, we presumed that negative molecules in the multi-label test set, which can also be viewed as known EDCs, were inherently inclined to be predicted as active by single-label models of these similar targets. To verify this hypothesis, taking the AR target as an example, we compared the false positive rate (FPR) for compounds with ED activity to that of molecules presumably having no interaction with any of the six targets. As shown in Figure 6A, there are 966 negatives in the AR single-label test set, of which 167 were mistakenly predicted as positives and 90 were positive for at least one of the remaining five targets. Out of these 90 negatives, 40 were wrongly predicted as active modulators of AR, indicating that the FPR for known EDCs in the single-label test set is 0.489. Likewise, the FPR for known EDCs in the multi-label test set is 0.533 (Figure 6B). In contrast, the FPR for non-EDCs in the single-label test set is only 0.14 (Figure 6A). This remarkable difference demonstrates that traditional single-label models cannot make reliable predictions for compounds with ED activity.

In the exploration of multi-label models, the LP method performed the best (Figure S5) and showed a significant boost over the single-label models in the prediction of the multi-label test set (Table 3 and Figure 5). The outperformance of LP over the other multi-label methods implies that there is dependency information among the six targets, and the LP method, which treats a full label-set as a class identifier, can implicitly incorporate the hidden relationships among labels into the training process. In


contrast, the BR method trains an independent model for each target without considering their correlations. Although the CC method attempts to learn these correlations by chaining the models to one another, the ordering imposed on the six targets may not be meaningful. Furthermore, as expected, the performance for negatives in the multi-label test set was improved by the multi-label models, even though they were trained on a small amount of data. Thus, we can conclude that multi-label models facilitate the capture of correlations and distinctions across different targets, and can offer more reliable and comprehensive mechanism prediction for already known endocrine disruptors.

Further comparisons were made with previous studies. As far as we know, the winning team44 of the Tox21 Data Challenge45 built models for the AR, ER, PPARg and Aromatase targets using the Tox21 data. As shown in Table S7, compared with the AUC values of its best model, a multi-task deep neural network (MT-DNN), across various targets on the leaderboard set (0.3459-0.7921) and the final test set (0.778-0.856), both our single-label models on the single-label test sets (0.818-0.888) and our consensus multi-label model on the multi-label test set (0.795-0.828) show a slight superiority. Furthermore, the Collaborative Modeling Projects coordinated by the National Center for Computational Toxicology have constructed consensus models to identify AR and ER modulators, where the final two AR consensus models46 achieved SP values of 0.71-0.92 and SE values of 0.78-0.87, and the final ER consensus model47 achieved SP values of 0.91-0.94 and SE values of 0.3-0.87 across different test sets. However, their work focused only on the AR or ER targets, which may be insufficient for


endocrine screening. Although a fair comparison is limited by the lack of an identical test set, we can still conclude that both our single-label and multi-label models achieve competitive performance and can be useful tools for ED risk assessment.

5. CONCLUSIONS

This work aimed to supply models that can be used to screen compounds with potential ED characteristics and to prioritize experimental studies in case further risks occur. For that purpose, six key targets, AR, ER, GR, Aromatase, TR and PPARg, were considered, and both single-label and multi-label models were constructed. The single-label models of all six targets achieved reasonable performance for both positive and negative samples. However, their predictions for compounds with known ED potential are unreliable, and the implementation of multi-label models succeeded in overcoming this defect. Overall, we anticipate that our single-label models can be used to identify potential EDCs, while the multi-label models might be preferable for predicting the mechanisms of action of known EDCs and thus guide experimental tests. These models will be a good tool for endocrine screening and helpful for human and environmental health.

Supporting Information

All the data used in the study (SI-I); code for model construction (code.zip);


structural diversity analysis of our datasets (Figure S1); comparison of cross validation results among the voting classifiers and their corresponding individual models (Figure S2); the cross-validation results for voting classifiers (Figure S3); trend analysis between model performance and the count of training molecules (Figure S4); the cross-validation results for multi-label models (Figure S5); comparison of the test results for each target between the six multi-label models and the combined single-label models (Figure S6); bioassay records from PubChem used in our study (Table S1); the number of individual models composing the voting classifier for each target (Table S2); constitution of the multi-label training set after the balancing process (Table S3); Y-randomization performance of voting models for the single-label test set (Table S4); Y-randomization performance of voting models for the multi-label test set (Table S5); Y-randomization performance of multi-label models (Table S6); performance comparison with public models (Table S7). This information is available free of charge via the Internet at http://pubs.acs.org.

Acknowledgements

This work was supported by the National Key Research and Development Program (Grant 2016YFA0502304) and the National Natural Science Foundation of China (Grant 81872800).


REFERENCES

1. Attina, T. M.; Hauser, R.; Sathyanarayana, S.; Hunt, P. A.; Bourguignon, J.-P.; Myers, J. P.; DiGangi, J.; Zoeller, R. T.; Trasande, L., Exposure to Endocrine-Disrupting Chemicals in the USA: A Population-Based Disease Burden and Cost Analysis. Lancet Diabetes Endocrinol. 2016, 4, 996-1003. 2. Trasande, L.; Zoeller, R. T.; Hass, U.; Kortenkamp, A.; Grandjean, P.; Myers, J. P.; DiGangi, J.; Bellanger, M.; Hauser, R.; Legler, J.; Skakkebaek, N. E.; Heindel, J. J., Estimating Burden and Disease Costs of Exposure to Endocrine-Disrupting Chemicals in the European Union. J. Clin. Endocrinol. Metab. 2015, 100, 1245-1255. 3. Skakkebaek, N. E.; Rajpert-De, M. E.; Buck Louis, G. M.; Toppari, J.; Andersson, A. M.; Eisenberg, M. L.; Jensen, T. K.; Jørgensen, N.; Swan, S. H.; Sapra, K. J., Male Reproductive Disorders and Fertility Trends: Influences of Environment and Genetic Susceptibility. Physiol. Rev. 2016, 96, 55-97. 4. Zoeller, R. T.; Brown, T. R.; Doan, L. L.; Gore, A. C.; Skakkebaek, N. E.; Soto, A. M.; Woodruff, T. J.; Saal, F. S. V., Endocrine-Disrupting Chemicals and Public Health Protection: A Statement of Principles from the Endocrine Society. Endocrinology 2012, 153, 4097-4110. 5. De Coster, S.; van Larebeke, N., Endocrine-Disrupting Chemicals: Associated Disorders and Mechanisms of Action. J. Environ. Public Health 2012, 2012, 52. 6. Fucic, A.; Gamulin, M.; Ferencic, Z.; Katic, J.; Krayer von Krauss, M.; Bartonova, A.; Merlo, D. F., Environmental Exposure to Xenoestrogens and Oestrogen Related Cancers: Reproductive System, Breast, Lung, Kidney, Pancreas, and Brain. Environ. Health 2012, 11, S8. 7. De, C. S.; Van, L. N., Endocrine-Disrupting Chemicals: Associated Disorders and Mechanisms of Action. J. Environ. Public Health 2012, 2012, 713696. 8. Maqbool, F.; Mostafalou, S.; Bahadar, H.; Abdollahi, M., Review of Endocrine Disorders Associated with Environmental Toxicants and Possible Involved Mechanisms. Life Sci. 2016, 145, 265. 9. Kolsek, K.; Mavri, J.; Sollner Dolenc, M.; Gobec, S.; Turk, S., Endocrine Disruptome--an Open Source Prediction Tool for Assessing Endocrine Disruption Potential through Nuclear Receptor Binding. J. Chem. Inf. Model. 2014, 54, 1254-67. 10. McRobb, F. M.; Kufareva, I.; Abagyan, R., In Silico Identification and Pharmacological Evaluation of Novel Endocrine Disrupting Chemicals That Act Via the Ligand-Binding Domain of the Estrogen Receptor Alpha. Toxicol. Sci. 2014, 141, 188-97. 11. Dix, D. J.; Houck, K. A.; Martin, M. T.; Richard, A. M.; Setzer, R. W.; Kavlock, R. J., The Toxcast Program for Prioritizing Toxicity Testing of Environmental Chemicals. Toxicol. Sci. 2007, 95, 5-12. 12. Desantis, K.; Reed, A.; Rahhal, R.; Reinking, J., Use of Differential Scanning Fluorimetry as a High-Throughput Assay to Identify Nuclear Receptor Ligands. Nucl. Recept. Signaling 2012, 10, e002. 13. Knudsen, T. B.; Keller, D. A.; Sander, M.; Carney, E. W.; Doerrer, N. G.; Eaton, D. L.; Fitzpatrick, S. C.; Hastings, K. L.; Mendrick, D. L.; Tice, R. R.; Watkins, P. B.;


Whelan, M., Futuretox Ii: In Vitro Data and in Silico Models for Predictive Toxicology. Toxicol. Sci. 2015, 143, 256-67. 14. Vuorinen, A.; Odermatt, A.; Schuster, D., Reprint of "in Silico Methods in the Discovery of Endocrine Disrupting Chemicals". J. Steroid Biochem. Mol. Biol. 2015, 153, 93-101. 15. Bhhatarai, B.; Wilson, D. M.; Price, P. S.; Marty, S.; Parks, A. K.; Carney, E., Evaluation of Oasis Qsar Models Using Toxcast in Vitro Estrogen and Androgen Receptor Binding Data and Application in an Integrated Endocrine Screening Approach. Environ. Health Perspect. 2016, 124, 1453-61. 16. Fang, H.; Tong, W.; Branham, W. S.; Moland, C. L.; Dial, S. L.; Hong, H.; Xie, Q.; Perkins, R.; Owens, W.; Sheehan, D. M., Study of 202 Natural, Synthetic, and Environmental Chemicals for Binding to the Androgen Receptor. Chem. Res. Toxicol. 2003, 16, 1338-58. 17. Mansouri, K.; Abdelaziz, A.; Rybacka, A.; Roncaglioni, A.; Tropsha, A.; Varnek, A.; Zakharov, A.; Worth, A.; Richard, A. M.; Grulke, C. M., Cerapp: Collaborative Estrogen Receptor Activity Prediction Project. Environ. Health Perspect. 2016, 124, 1023-1033. 18. Hao, M.; Bryant, S. H.; Wang, Y., Cheminformatics Analysis of the Ar Agonist and Antagonist Datasets in Pubchem. J. Cheminf. 2016, 8, 1-13. 19. Sohoni, P.; Sumpter, J. P., Several Environmental Oestrogens Are Also Anti-Androgens. J. Endocrinol. 1998, 158, 327-339. 20. Raies, A. B.; Bajic, V. B., In Silico Toxicology: Comprehensive Benchmarking of Multi-Label Classification Methods Applied to Chemical Toxicity Data. Wiley Interdiscip. Rev.: Comput. Mol. Sci. 2017. 21. Gibaja, E.; Ventura, S., Multilabel Learning: A Review of the State of the Art and Ongoing Research. WIREs Data Mining Knowl. Discov. 2014, 4, 411-444. 22. Chen, X. J.; Zhan, Y. Z.; Ke, J.; Chen, X. B., Complex Video Event Detection Via Pairwise Fusion of Trajectory and Multi-Label Hypergraphs. Multimed. Tools Appl. 2016, 75, 15079-15100. 23. Boutell, M. R.; Luo, J.; Shen, X.; Brown, C. M., Learning Multi-Label Scene Classification. Pattern Recognit. 2004, 37, 1757-1771. 24. Schietgat, L.; Vens, C.; Struyf, J.; Blockeel, H.; Kocev, D.; Džeroski, S., Predicting Gene Function Using Hierarchical Multi-Label Decision Tree Ensembles. BMC Bioinf. 2010, 11, 2. 25. Montanari, F.; Zdrazil, B.; Digles, D.; Ecker, G. F., Selectivity Profiling of Bcrp Versus P-Gp Inhibition: From Automated Collection of Polypharmacology Data to Multi-Label Learning. J. Cheminf. 2016, 8, 7. 26. Afzal, A. M.; Mussa, H. Y.; Turner, R. E.; Bender, A.; Glen, R. C., A Multi-Label Approach to Target Prediction Taking Ligand Promiscuity into Account. J. Cheminf. 2015, 7, 24. 27. Wan, S.; Duan, Y.; Zou, Q., Hpslpred: An Ensemble Multi-Label Classifier for Human Protein Subcellular Location Prediction with Imbalanced Source. Proteomics 2017, 17. 28. Michielan, L.; Terfloth, L.; Gasteiger, J.; Moro, S., Comparison of Multilabel and


Single-Label Classification Applied to the Prediction of the Isoform Specificity of Cytochrome P450 Substrates. J. Chem. Inf. Model. 2009, 49, 2588-2605. 29. Kavlock, R. J.; Austin, C. P.; Tice, R. R., Toxicity Testing in the 21st Century: Implications for Human Health Risk Assessment. Risk Anal. 2009, 29, 485. 30. Wang, Y.; Bryant, S. H.; Cheng, T.; Wang, J.; Gindulyte, A.; Shoemaker, B. A.; Thiessen, P. A.; He, S.; Zhang, J., Pubchem Bioassay: 2017 Update. Nucleic Acids Res. 2016, 45, D955-D963. 31. Yap, C. W., Padel-Descriptor: An Open Source Software to Calculate Molecular Descriptors and Fingerprints. J. Comput. Chem. 2011, 32, 1466-74. 32. Landrum, G. Rdkit: Open-Source Cheminformatics. http://www.rdkit.org (accessed Feb 8, 2018). 33. Ma, C. Y.; Yang, S. Y.; Zhang, H.; Xiang, M. L.; Huang, Q.; Wei, Y. Q., Prediction Models of Human Plasma Protein Binding Rate and Oral Bioavailability Derived by Using Ga-Cg-Svm Method. J. Pharm. Biomed. Anal. 2008, 47, 677-682. 34. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P., Random Forest:  A Classification and Regression Tool for Compound Classification and Qsar Modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947-1958. 35. Godbole, S.; Sarawagi, S., Discriminative Methods for Multi-Labeled Classification. Lect. Notes Comput. Sc, 2004, 3056, 22-30. 36. Almuallim, J., Classifier Chains for Multi-Label Classification with Incomplete Labels. Mach. Learn. 2013, 85, 333-359. 37. Zhao, X.; Kim, T. K.; Luo, W. Unified Face Analysis by Iterative Multi-Output Random Forests. In Computer Vision and Pattern Recognition, 2014; 2014; pp 1765-1772. 38. Melagraki, G.; Afantitis, A., A Risk Assessment Tool for the Virtual Screening of Metal Oxide Nanoparticles through Enalos Insiliconano Platform. Curr. Top. Med. Chem. 2015, 15, 1827-1836. 39. Zhang, S.; Golbraikh, A.; Oloff, S.; Kohn, H.; Tropsha, A., A Novel Automated Lazy Learning Qsar (All-Qsar) Approach:  Method Development, Applications, and Virtual Screening of Chemical Databases Using Validated All-Qsar Models. J. Chem. Inf. Model. 2006, 46, 1984-1995. 40. Herrera, F.; Charte, F.; Rivera, A. J.; Jesus, M. J. D., Multilabel Classification. Springer International Publishing: 2016. 41. Cheng, F.; Ikenaga, Y.; Zhou, Y.; Yu, Y.; Li, W.; Shen, J.; Du, Z.; Chen, L.; Xu, C.; Liu, G.; Lee, P. W.; Tang, Y., In Silico Assessment of Chemical Biodegradability. J. Chem. Inf. Model. 2012, 52, 655-669. 42. Sun, L.; Yang, H.; Li, J.; Wang, T.; Li, W.; Liu, G.; Tang, Y., In Silico Prediction of Compounds Binding to Human Plasma Proteins by Qsar Models. Chemmedchem 2017, 13. 43. Liu, X. Y.; Wu, J.; Zhou, Z. H., Exploratory Undersampling for Class-Imbalance Learning. IEEE T. Syst. Man Cy. B 2009, 39, 539-550. 44. Mayr, A.; Klambauer, G.; Unterthiner, T.; Hochreiter, S., Deeptox: Toxicity Prediction Using Deep Learning. Front. Environ. Sci. 2016, 3. 45. Capuzzi, S. J.; Politi, R.; Isayev, O.; Farag, S.; Tropsha, A., Qsar Modeling of


Tox21 Challenge Stress Response and Nuclear Receptor Signaling Toxicity Assays. Front. Environ. Sci. 2016, 4. 46. Grisoni, F.; Consonni, V.; Ballabio, D., Machine Learning Consensus to Predict the Binding to the Androgen Receptor within the Compara Project. J. Chem. Inf. Model. 2019. 47. Mansouri, K.; Abdelaziz, A.; Rybacka, A.; Roncaglioni, A.; Tropsha, A.; Varnek, A.; Zakharov, A.; Worth, A.; Richard, A. M.; Grulke, C. M.; Trisciuzzi, D.; Fourches, D.; Horvath, D.; Benfenati, E.; Muratov, E.; Wedebye, E. B.; Grisoni, F.; Mangiatordi, G. F.; Incisivo, G. M.; Hong, H.; Ng, H. W.; Tetko, I. V.; Balabin, I.; Kancherla, J.; Shen, J.; Burton, J.; Nicklaus, M.; Cassotti, M.; Nikolov, N. G.; Nicolotti, O.; Andersson, P. L.; Zang, Q.; Politi, R.; Beger, R. D.; Todeschini, R.; Huang, R.; Farag, S.; Rosenberg, S. A.; Slavov, S.; Hu, X.; Judson, R. S., Cerapp: Collaborative Estrogen Receptor Activity Prediction Project. Environ. Health Perspect. 2016, 124, 1023-33.


Table 1. Constitutions of the datasets used in this study

Dataset    Index     AR     ER     GR     Aromatase  TR     PPARg
MLD_train  Negative  144    116    203    173        215    228
           Positive  150    178    91     121        79     66
           Total     294    294    294    294        294    294
           IBR       1.04   1.53   2.23   1.43       2.72   3.45
MLD_test   Negative  30     31     50     37         57     51
           Positive  43     42     23     36         16     22
           Total     73     73     73     73         73     73
           IBR       1.43   1.35   2.17   1.03       3.56   2.32
SLD_train  Negative  3864   3377   4323   4336       3954   3864
           Positive  813    1037   462    529        286    443
           Total     4677   4414   4785   4865       4240   4307
           IBR       4.75   3.26   9.36   8.20       13.83  8.72
SLD_test   Negative  966    844    1081   1084       988    965
           Positive  203    259    116    132        71     111
           Total     1169   1103   1197   1216       1059   1076
           IBR       4.76   3.26   9.32   8.21       13.92  8.69

* IBR (imbalance ratio) is the ratio of the count of majority-class samples to that of minority-class samples; MLD_train: multi-label training set; MLD_test: multi-label test set; SLD_train: single-label training set; SLD_test: single-label test set.


Table 2. The optimal voting models per target and their prediction metrics for the corresponding single-label test sets

Target       AR     ER     GR     Aromatase  TR     PPARg
Method       SVC    SVC    SVC    RF         RF     SVC
Fingerprint  PubFP  PubFP  FP     PubFP      MACCS  PubFP
ACC          0.811  0.813  0.764  0.722      0.752  0.774
AUC          0.886  0.883  0.87   0.888      0.801  0.818
SE           0.773  0.819  0.767  0.886      0.732  0.721
SP           0.819  0.812  0.763  0.702      0.753  0.78
(SE+SP)/2    0.796  0.816  0.765  0.794      0.742  0.751

SVC: support vector machine, RF: random forest, PubFP: PubChem Fingerprint, FP: CDK Fingerprint, ACC: accuracy, SE: sensitivity, SP: specificity.


Table 3. Test performance for each target of the top multi-label model (LP_AP), the consensus multi-label model (multi-label_consensus) and the combined single-label models (single-label_combine)

Target     Model                  ACC    AUC    SE     SP
AR         LP_AP                  0.767  0.785  0.767  0.767
           multi-label_consensus  0.822  0.828  0.837  0.8
           single-label_combine   0.699  0.822  0.884  0.433
ER         LP_AP                  0.74   0.785  0.738  0.742
           multi-label_consensus  0.753  0.812  0.81   0.677
           single-label_combine   0.74   0.764  0.905  0.516
GR         LP_AP                  0.822  0.862  0.739  0.86
           multi-label_consensus  0.849  0.841  0.783  0.88
           single-label_combine   0.562  0.837  0.913  0.4
Aromatase  LP_AP                  0.753  0.78   0.75   0.757
           multi-label_consensus  0.726  0.809  0.778  0.676
           single-label_combine   0.589  0.716  0.806  0.378
TR         LP_AP                  0.822  0.674  0.438  0.93
           multi-label_consensus  0.836  0.765  0.5    0.93
           single-label_combine   0.493  0.698  0.75   0.421
PPARg      LP_AP                  0.767  0.777  0.545  0.863
           multi-label_consensus  0.74   0.795  0.455  0.863
           single-label_combine   0.616  0.714  0.864  0.51

LP_AP: model built with the Label Powerset method and Atom Pairs fingerprint, ACC: accuracy, SE: sensitivity, SP: specificity.


Figure Legends

Figure 1. (A)-(C) Illustration of binary, multiclass and multi-label classification, where x1-xm denote features, y the label for a single-label model and y1-yk the label-set for multi-label classification. (D) The scheme of voting classification combined with multiple random under-sampling.

Figure 2. Dataset description. (A) Activity profiles of all the compounds against all targets. Rows represent compounds, columns represent targets, and cells represent the activity value of each compound-target pair; red indicates an active effect, blue an inactive effect, and white the absence of activity data. (B) The distribution of positive and negative labels for compounds commonly tested against all six targets.

Figure 3. Comparison of the cross validation results between the voting classifiers and their component individual models for AR, where the x-axis represents the eight fingerprints, red and green boxplots represent RF and SVM models respectively, and red and green stars denote the corresponding voting classifiers. (A) Accuracy, (B) AUC, (C) specificity, (D) sensitivity.

Figure 4. The prediction metrics for the multi-label test set by the optimal voting classifier per target.

Figure 5. Comparison of the performance of the top five multi-label models and their consensus (multilabel_consensus) on the multi-label test set against that of the joint optimal single-label models per target (singlelabel_combine), where LP represents the Label Powerset method, AP the Atom Pairs Fingerprint, Morg the Morgan Fingerprint, FP the CDK Fingerprint, Topo the Topological Torsion Fingerprint, LP_AP the model constructed with the LP method and AP fingerprint, SubACC subset accuracy, HL Hamming loss, and JSC the Jaccard similarity coefficient.

Figure 6. (A) Analysis and comparison of the false positive rate (FPR) for AR negative samples with and without known endocrine disruption (ED) activity in the single-label test set, where FPR_non-EDC represents the FPR of non-endocrine disrupting chemicals and FPR_EDC the FPR of endocrine disrupting chemicals. (B) Analysis of the FPR for AR negative samples in the multi-label test set, which have known ED activity.
