
Pharmaceutical Modeling

Computational Prediction of a New ADMET Endpoint for Small Molecule: Anti-Commensal Effect on Human Gut Microbiota
Suqing Zheng, Wenping Chang, Wenxin Liu, Guang Liang, Yong Xu, and Fu Lin
J. Chem. Inf. Model., Just Accepted Manuscript • Publication Date (Web): 23 Oct 2018


Journal of Chemical Information and Modeling

Computational Prediction of a New ADMET Endpoint for Small Molecule: Anti-Commensal Effect on Human Gut Microbiota

Suqing Zheng,a,b* Wenping Chang,a Wenxin Liu,a Guang Liang,a,b Yong Xu,c and Fu Lin a*

a: School of Pharmaceutical Sciences, Wenzhou Medical University, Wenzhou, Zhejiang, P. R. China, 325035
b: Chemical Biology Research Center, Wenzhou Medical University, Wenzhou, Zhejiang, P. R. China, 325035
c: Center of Chemical Biology, Guangzhou Institutes of Biomedicine and Health, Chinese Academy of Sciences, Guangzhou, Guangdong, P. R. China, 510530

*: To whom all correspondence should be addressed. E-mail: [email protected]; [email protected].

1 ACS Paragon Plus Environment


ABSTRACT

The human gut microbiota (HGM), which are evolutionarily commensal in the human gastrointestinal tract, are crucial to our health. However, HGM can be broadly shaped by many factors, such as the intake of drugs. About one-quarter of existing human drugs, which are designed to target human cells rather than HGM, can notably alter the composition of HGM. Therefore, the anti-commensal effect of human drugs should be avoided as far as possible in the drug discovery and development process. Nevertheless, the anti-commensal effect of small molecules is a new ADMET endpoint that has not previously been predicted by computational methods. In this work, we present the first machine-learning-based consensus classification model, which achieves an accuracy of 0.811±0.012, precision of 0.759±0.032, specificity of 0.901±0.019, sensitivity of 0.628±0.036, F1-score of 0.687±0.023, and AUC of 0.814±0.030 on the test set. Furthermore, we develop an easy-to-use program, “e-Commensal”, for automatic prediction. Using this program, virtual screening of the food-constituent database FooDB indicates that 5888 of 23202 food-relevant compounds are predicted to have an anti-commensal effect on HGM. Several top-ranked anti-commensal compounds in our prediction are further scrutinized and confirmed by experiments reported in the existing literature.


To the best of our knowledge, this is the first classification model and stand-alone software for predicting whether a compound is commensal or anti-commensal with respect to HGM.

INTRODUCTION

The human gut microbiota (HGM) inhabiting the human gastrointestinal (HGI) tract are composed of viruses, eukarya, bacteria, and archaea, with about 4000 strains.[1] Over millions of years of co-evolution, HGM have gradually and adaptively developed an interactive and beneficial commensal relationship with the host. More specifically, HGM colonize the host abundantly to obtain nutrition, and in return they greatly reinforce gut integrity and complement many aspects of human physiology, such as nutrient absorption, metabolism, and immune function.[1-14] The disruption of this microbial community, known as dysbiosis, has been strongly implicated in various diseases such as obesity, inflammatory bowel disease, and colorectal cancer.[1, 7, 13, 15-22] Hence, viable HGM play a crucial role in maintaining gut homeostasis and preventing the invasion of pathogens, and are therefore pivotal to human health.[2, 3, 5, 12, 18]

Despite the indispensable role of HGM, their function and composition can be broadly shaped by various factors such as host genetics, diet, and the intake of antibiotics.[23-26] Surprisingly, the recent work of Maier et al. shows that human drugs, which are designed to target human cells rather than the microbiota, have a profound effect on the human gut microbiota. More concretely, the in vitro screening of about 1,000 human drugs against 40 typical gut bacterial strains reveals that around one-fourth of the human drugs can suppress the growth of at least one of the 40 strains.[27] Their work underscores that great importance should be attached to the anti-commensal effect of human drugs, which is a new ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) endpoint that was not well considered before. Hence, more attention should be paid to this effect in the drug discovery and development process.

Because the experimental evaluation of the anti-commensal or commensal effect of small molecules is costly and tedious, it would be very helpful if this effect could be predicted by in silico methods. To the best of our knowledge, no prior work on the computational prediction of this new endpoint has been reported. Thus, we present here the first consensus classification model, based on four machine-learning methods and the extended-connectivity fingerprint (ECFP),[28] to predict whether a compound of interest has an anti-commensal effect. Furthermore, we develop a convenient software program, “e-Commensal” (Figure 1), for automatic prediction. Finally, as an application, e-Commensal is used to estimate how many food-related compounds in the comprehensive food-constituent database FooDB (http://foodb.ca/) have an anti-commensal effect. This question has not been systematically explored from the computational perspective, although it is well known that diet can affect the function and composition of HGM.[23-26]


Figure 1. The main features of the e-Commensal program for the automatic prediction of anti-commensal and commensal compounds.


METHODS

Data curation and preprocessing

The compounds with an anti-commensal or commensal effect on the human gut microbiota are collected from the single work of Maier et al.,[27] which provides consistent and comprehensive data obtained with the same experimental protocol. A compound is defined as anti-commensal if it inhibits at least one bacterial strain of the human gut, and as commensal if it does not suppress any of the 40 typical bacterial strains in their experimental assays. The data curation process is summarized as follows. First, the SMILES strings of the compounds are manually retrieved from the PubChem database according to the chemical names,[29] since only the chemical names of the compounds tested by Maier et al.[27] are reported in their supporting information. Second, the resulting SMILES strings are automatically converted to Tripos Mol2 files by the Open Babel v2.4 program.[30] Third, two additional criteria are applied for further filtering. (1) For a disconnected structure such as a salt, only the organic fragment is kept. If both the anion and the cation fragments are organic, they are treated as two separate molecules. (2) Only compounds composed of common elements are considered. Finally, the whole dataset, comprising 391 anti-commensal compounds and 790 commensal compounds, is publicly available in the e-Commensal program, with which users can browse the 3D structures in Tripos Mol2 format alongside their corresponding labels (Anti-commensal or Commensal) and conveniently export the entire dataset for their own applications. Moreover, all the compounds, with their SMILES strings and corresponding labels, are listed in Table S1.

The protocol for model training, evaluation, interpretation, deployment, and application

In this work, four machine-learning methods, namely K-nearest neighbors (KNN),[31] support vector machine (SVM),[32] random forest (RF),[33] and gradient boosting machine (GBM),[34] are used to construct the consensus classification model. This protocol was developed in our previous work on bitterant prediction;[35] hence it is detailed in the supporting information (Text S1-S9) and only briefly described in the following step-by-step procedure:

(1) Multiple partitioning of the data into the cross-validation dataset (Dataset-CV) for five-fold cross-validation and the test dataset (Dataset-Test) for model evaluation;
(2) Generation of the ECFP-based molecular descriptors, which are extensively used in QSAR and ligand-based virtual screening studies;[36-43]
(3) Feature selection based on the feature importance derived from the random forest method;
(4) Model training with all combinations of four machine-learning methods, four ECFPs, four feature-number options, and nineteen data-splitting schemes, giving 1216 models;
(5) Model evaluation by accuracy, precision, specificity, sensitivity, F1-score, and AUC on the test set;
(6) Y-randomization test of our models to inspect the model robustness;
(7) Construction of the consensus model based on the nineteen classification models;
(8) Definition of the applicability domain of our model, so that predictions can be made with confidence;
(9) Implementation of a convenient graphical program, “e-Commensal”, for users to automatically predict anti-commensal compounds with our model;
(10) Virtual screening of the food-constituent database FooDB with the e-Commensal program.
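The size of the model grid in step (4) can be checked with a short enumeration. The sketch below is only an illustration: the four methods, four fingerprints, and four feature-number options are the ones named in this paper, while the function name, the tuple layout, and the numbering of the splits from 1 to 19 are assumptions made here and do not reflect how e-Commensal or Table S3 actually order the models.

```python
from itertools import product

# Options named in the paper; the list order is arbitrary.
METHODS = ["KNN", "SVM", "GBM", "RF"]
FINGERPRINTS = ["1024bit-ECFP4", "2048bit-ECFP4", "1024bit-ECFP6", "2048bit-ECFP6"]
FEATURE_COUNTS = ["full", 512, 256, 128]
N_SPLITS = 19  # nineteen random data-splitting schemes


def enumerate_configurations():
    """Return one (method, fingerprint, n_features, split_id) tuple per model.

    Hypothetical helper: split IDs 1..19 are an assumed labelling.
    """
    return list(product(METHODS, FINGERPRINTS, FEATURE_COUNTS,
                        range(1, N_SPLITS + 1)))


configs = enumerate_configurations()
print(len(configs))  # 4 * 4 * 4 * 19 = 1216
```

Each tuple would correspond to one trained model, which is why the full factorial design yields 1216 models.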

RESULTS AND DISCUSSION

Prediction of anti-commensal or commensal compounds by the consensus classification model

In this work, four machine-learning methods (KNN, SVM, GBM, and RF) are used to build the consensus anti-commensal/commensal classification model, following the protocol detailed in Text S1-S9. After systematic model training with the parameters in Table S2 and subsequent testing, 1216 classification models (M0001-M1216 in Table S3) are obtained from all combinations of four ECFPs (1024bit-ECFP4, 2048bit-ECFP4, 1024bit-ECFP6, and 2048bit-ECFP6), four feature-number options (full, 512, 256, and 128 features), four machine-learning methods (KNN, SVM, GBM, and RF), and nineteen random data-splitting schemes (4 × 4 × 4 × 19 = 1216). Figure 2 shows the scatter plot of ∆F1-score vs. F1-score for all the models, where ∆F1-score is used to monitor potential over-fitting or under-fitting and F1-score is used to evaluate model performance. Figure 2 demonstrates that the ∆F1-score of most models is lower than 0.1, which indicates that the majority of the models perform similarly on the test set and in cross-validation. Hence, most of the models do not suffer from apparent under-fitting or over-fitting from this point of view.

Figure 2. Scatter plot of ∆F1-score vs. F1-score for the 1216 classification models. ∆F1-score, defined as |F1-score (test set) - F1-score (cross-validation)|, is used to monitor potential over-fitting or under-fitting, and F1-score is used to measure model performance.
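The ∆F1-score check can be written in a few lines. The following is a minimal sketch under stated assumptions: the function names and the example scores are hypothetical, and using 0.1 as a hard cut-off is an assumption made here (the text only observes that most models fall below 0.1).

```python
def delta_f1(f1_test, f1_cv):
    """Absolute gap between the test-set and cross-validation F1-scores."""
    return abs(f1_test - f1_cv)


def flag_models(scores, tolerance=0.1):
    """Return the IDs of models whose gap reaches the tolerance.

    `tolerance=0.1` is an assumed cut-off, not a rule stated in the paper.
    """
    return [model_id
            for model_id, (f1_test, f1_cv) in scores.items()
            if delta_f1(f1_test, f1_cv) >= tolerance]


# Hypothetical (test set, cross-validation) F1-scores for three models.
scores = {
    "model_a": (0.69, 0.68),
    "model_b": (0.55, 0.72),
    "model_c": (0.70, 0.66),
}
print(flag_models(scores))  # ['model_b']
```

A large gap in either direction is flagged, since the absolute value does not distinguish a model that is better in cross-validation from one that is better on the test set.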

To further ensure the robustness of the 1216 classification models, a Y-randomization test[44] is conducted by randomly shuffling the experimental labels (anti-commensal or commensal) in the dataset for cross-validation (Text S6 and Table S4), and all the outcomes are listed in Table S5. For ease of description, the scatter plot of F1-score (test set) vs. AUC (test set) for all the models is shown in Figure S6, which unambiguously shows that the model performance after Y-randomization decreases remarkably compared with that of the models before Y-randomization. Therefore, our models without Y-randomization are robust and were not obtained by chance.

However, it is neither practical nor efficient to deploy all 1216 classification models at the same time for the prediction of anti-commensal or commensal compounds. Thus, one consensus model (CM) in Table 1 is constructed purely from the 19 best models (Table S6), that is, the model with the highest F1-score in each data-splitting scheme among the 1216 classification models (Table S3). As shown in Table 1, our consensus classification model gives an accuracy of 0.811±0.012, precision of 0.759±0.032, specificity of 0.901±0.019, sensitivity of 0.628±0.036, F1-score of 0.687±0.023, and AUC of 0.814±0.030 on the test set, averaged over the nineteen data-splitting schemes. To help users automatically predict anti-commensal or commensal compounds, this consensus model has been integrated into the e-Commensal program.
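Two of the ideas above, label shuffling for Y-randomization and the combination of several models into one consensus prediction, can be sketched briefly. This is an illustration only: how the 19 models are actually combined is described in the supporting information rather than here, so averaging the predicted probabilities is an assumption, as are all function names; the default threshold of 0.5 follows the default mentioned later in this paper.

```python
import random
from statistics import mean


def y_randomize(labels, seed=0):
    """Return a shuffled copy of the labels; the descriptors stay untouched."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def consensus_probability(model_probabilities):
    """Assumed combination rule: mean anti-commensal probability."""
    return mean(model_probabilities)


def consensus_label(model_probabilities, threshold=0.5):
    """Label a compound from the averaged probability (assumed rule)."""
    p = consensus_probability(model_probabilities)
    return "Anti-commensal" if p >= threshold else "Commensal"


labels = ["Anti-commensal"] * 3 + ["Commensal"] * 5
print(y_randomize(labels, seed=1))
print(consensus_label([0.9, 0.7, 0.4]))
```

In a Y-randomization test, the models would be retrained on the shuffled labels and re-evaluated on the untouched test set; a marked drop in performance then suggests that the original models learned a real structure-activity relationship.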


Table 1. The performance of the consensus model (CM) for anti-commensal/commensal compound classification

Metric                    CM
Accuracy (test set)       0.811 (0.012)
Precision (test set)      0.759 (0.032)
Specificity (test set)    0.901 (0.019)
Sensitivity (test set)    0.628 (0.036)
F1-score (test set)       0.687 (0.023)
AUC (test set)            0.814 (0.030)
F1-score (CV)             0.681 (0.037)
∆F1-score                 0.028 (0.021)

Notes: (1) the number in each parenthesis is the standard deviation, derived from the nineteen random data-splitting schemes; (2) ∆F1-score, defined as |F1-score (test set) - F1-score (cross-validation)|, is adopted to inspect possible over-fitting or under-fitting; (3) “CV” and “CM” are short for “cross-validation” and “consensus model”, respectively.
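For readers less familiar with the metrics in Table 1, the sketch below computes them from a confusion matrix and summarizes a metric over several splits. The counts and per-split values are hypothetical and are not taken from the paper; the function names are illustrative, and the use of the sample standard deviation is an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def classification_metrics(tp, fp, tn, fn):
    """Standard metrics, with anti-commensal as the positive class.

    Assumes every denominator is non-zero.
    """
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


def summarize(per_split_values):
    """Mean and sample standard deviation over data-splitting schemes."""
    return mean(per_split_values), stdev(per_split_values)


# Hypothetical confusion-matrix counts for one test set.
metrics = classification_metrics(tp=50, fp=15, tn=140, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Hypothetical F1-scores from three splits.
print(summarize([0.70, 0.68, 0.66]))
```

Because commensal compounds outnumber anti-commensal ones in the dataset (790 vs. 391), specificity and sensitivity are worth reading separately rather than relying on accuracy alone.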


Interpretation of anti-commensal/commensal classification model

Our classification model can be interpreted on the basis of the feature importance (FI), which quantifies how much each ECFP fingerprint bit set to “1” contributes to the anti-commensal/commensal classification. The e-Commensal program provides a function to view, synchronously, the structural feature in the context of the 3D structure and the corresponding feature importance for each “1” bit of the ECFP fingerprint. For this purpose, the model derived from the full feature set is adopted, because feature selection discards some ECFP bits and would thereby prevent us from visualizing all the “1” bits. The feature importance derived from the RF model with the full features of 2048bit-ECFP6 is therefore integrated into e-Commensal for the interactive visualization of each “1” fingerprint bit, its associated 3D structural feature, and its feature importance.

To explore the important features contributing to the anti-commensal/commensal classification, the structural features with the five largest feature importances are shown in Figures S7-S11. These five most important bits are bits 510, 240, 1039, 1663, and 566, with FIs of 0.006026, 0.005857, 0.005676, 0.005293, and 0.004601, respectively. Figures S7-S11 illustrate that the structural feature for a fingerprint bit is highlighted in the 3D viewer window, while its associated feature importance is displayed in the window titled “FI”. This synchronized view of structural feature, feature importance, and fingerprint bit therefore helps determine which structural feature of a given molecule is the most important.
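Ranking fingerprint bits by feature importance is straightforward once the importance vector is available. The sketch below is illustrative only: the five (bit, FI) pairs are the values reported above, but every other entry of the 2048-element vector is set to zero as a placeholder, the function names are hypothetical, and treating the reported bit number as a zero-based list index is an assumption.

```python
def top_k_bits(feature_importance, k=5):
    """Return (bit, importance) pairs sorted by decreasing importance."""
    ranked = sorted(enumerate(feature_importance),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def rank_on_bits(on_bits, feature_importance, k=5):
    """Rank only the bits set to 1 in one molecule's fingerprint."""
    pairs = [(bit, feature_importance[bit]) for bit in sorted(on_bits)]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:k]


# Placeholder vector: only the five reported bits carry their real FI.
fi = [0.0] * 2048
reported = {510: 0.006026, 240: 0.005857, 1039: 0.005676,
            1663: 0.005293, 566: 0.004601}
for bit, value in reported.items():
    fi[bit] = value

print(top_k_bits(fi))
print(rank_on_bits({240, 566, 7}, fi, k=2))  # hypothetical molecule
```

The second function mirrors the per-molecule view described above: among the bits that are on in a given compound, the one with the largest FI points to its most influential substructure.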

Defining the applicability domain of model

To follow the guidelines of the Organization for Economic Co-operation and Development (OECD), the applicability domain of the model should be defined appropriately. In the present work, the applicability domain of our model is defined on the basis of the average similarity of a given compound, which is described in detail in Text S7 of the supporting information. The histograms of the average similarity in Figure 3 suggest that an average similarity of 0.1 can be adopted as the threshold that defines the applicability domain of our model. If the average similarity of a given compound is larger than this threshold (0.1), the compound is within the applicability domain of our model, and the prediction for this compound can be regarded as a reliable interpolation. Otherwise, the prediction may not be regarded as a confident inference. To automatically inspect whether a compound of interest is located inside the applicability domain of our model, we have developed a handy function in the e-Commensal program.


Figure 3. Histograms of the average similarity used to define the applicability domain of our classification models. The average-similarity threshold of 0.1 (blue dotted line) is implemented in the e-Commensal program to automatically inspect whether a given compound is located within the applicability domain of our classification model.
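A similarity-based applicability-domain check might look like the sketch below. The exact definition of the average similarity is given in Text S7 and is not reproduced here, so this example assumes it is the mean Tanimoto similarity between the query fingerprint and every training-set fingerprint; that assumption, the function names, and the toy fingerprints (sets of on-bit indices) are all illustrative.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def average_similarity(query_bits, training_set):
    """Assumed definition: mean Tanimoto similarity to the training set."""
    return sum(tanimoto(query_bits, t) for t in training_set) / len(training_set)


def in_applicability_domain(query_bits, training_set, threshold=0.1):
    """True if the average similarity is larger than the threshold (0.1)."""
    return average_similarity(query_bits, training_set) > threshold


# Toy fingerprints, not real ECFP bits.
training = [{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8}]
print(in_applicability_domain({3, 4, 9}, training))
print(in_applicability_domain({100, 101}, training))
```

A query that shares no bits with any training compound has an average similarity of zero and falls outside the domain, so its prediction would be treated as an extrapolation.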

Exploration of the anti-commensal/commensal compounds in FooDB

As is well known, food is one of the main drivers of the function and composition of HGM.[23-26] We are therefore curious about how many food constituents in the comprehensive FooDB (http://foodb.ca/) may have an anti-commensal effect on the human gut microbiota. A subset of FooDB comprising 23202 compounds is obtained after filtering with criteria (Text S9) such as the applicability domain of our model, and is subjected to virtual screening with the e-Commensal program. According to the screening result, 5888 food compounds are predicted to be anti-commensal by our consensus classification model, while the remaining 17314 compounds are predicted to be commensal. This is consistent with the common-sense expectation that the majority of food-related compounds probably have commensal effects on HGM, because HGM have gradually adapted to our food intake from the evolutionary perspective. Nevertheless, the 5888 predicted anti-commensal compounds merit more systematic experimental evaluation on HGM by microbiologists.

To further improve the reliability of the prediction and narrow down the candidate compounds for future experiments, the prediction-probability threshold for the anti-commensal class is increased from the default 0.5 to 0.8, which results in 771 predicted anti-commensal food-related compounds. For illustration, the five anti-commensal compounds with the highest prediction probabilities are Cefdinir (FooDB ID: FDB009416), Amoxycillin (FooDB ID: FDB002369), Tylosin (FooDB ID: FDB012374), Levofloxacin (FooDB ID: FDB022746), and Ceftiofur (FooDB ID: FDB011593) (Table 2). Further scrutiny of these five compounds in the literature indicates that they are all antibiotics, which are strongly relevant to the anti-commensal effect on HGM. It should be noted that Cefdinir, Tylosin, and Levofloxacin are in fact confirmed to have an anti-commensal effect on HGM in existing experimental work.[27] Therefore, our consensus model may offer promising guidance for experiments. For the convenience of experimental microbiologists, the FooDB IDs of the 771 predicted anti-commensal compounds and their corresponding predicted probabilities are summarized in Table S7.

Table 2. Five predicted anti-commensal compounds with the highest prediction probabilities

FooDB ID     Name
FDB009416    Cefdinir
FDB002369    Amoxycillin
FDB012374    Tylosin
FDB022746    Levofloxacin
FDB011593    Ceftiofur

(The Structure column of the original table contains structure drawings.)
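Raising the probability threshold from 0.5 to 0.8 amounts to a simple filter over the predicted probabilities. The sketch below uses placeholder compound IDs and made-up probabilities, not FooDB entries or values from Table S7; the function name is hypothetical, and counting a probability exactly equal to the threshold as a hit is an assumption.

```python
def select_anti_commensal(predictions, threshold=0.5):
    """Keep compounds whose anti-commensal probability reaches the threshold.

    `predictions` maps a compound ID to its predicted probability.
    Hits are returned with the most confident prediction first.
    """
    hits = [(cid, p) for cid, p in predictions.items() if p >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Placeholder IDs and probabilities for illustration only.
predictions = {"CPD_A": 0.95, "CPD_B": 0.82, "CPD_C": 0.61, "CPD_D": 0.30}

print(len(select_anti_commensal(predictions)))          # default 0.5
print(select_anti_commensal(predictions, threshold=0.8))
```

A stricter threshold trades recall for confidence, which suits the stated goal of shortlisting candidates for experimental follow-up.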

Introduction of e-Commensal program

To automate all the processes mentioned above, the e-Commensal program is developed on the basis of our previous e-Bitter software.[35] e-Commensal uses our machine-learning-based consensus classification model to automatically predict whether a compound is anti-commensal or commensal. Moreover, the program has been extensively debugged and tested on English versions of the Windows operating system, such as Windows 7, 8, and 10, and can be easily used by experimental microbiologists and medicinal chemists without any professional knowledge of machine-learning methods. The whole package, containing the program, manual, and tutorial, is publicly accessible through a shared Dropbox folder (https://www.dropbox.com/sh/7twh05ohxwyjz3e/AACEN84wV1u2kaWc7vjpSX4Ua?dl=0) and Baidu Cloud (https://pan.baidu.com/s/16bftdI7EPkmfd3d-7l9caA, with the four-letter extraction code “okzl”).


The program has several main features: (1) visualization of the curated datasets for the classification of anti-commensal/commensal compounds; (2) querying of the datasets by similarity search based on a structure specified by the user; (3) generation of the ECFP-based molecular descriptors, which is natively implemented in the program; (4) prediction of anti-commensal or commensal compounds with the consensus classification model, which is derived from diverse machine-learning methods using the scikit-learn Python library (v0.18.2) in WinPython (v3.5.4.0); (5) database screening for the identification of potential anti-commensal or commensal compounds; (6) visualization of ECFP bits in the context of the 3D structure, together with their corresponding feature importance for the classification; (7) inspection of the applicability domain of our classification model for the compound to be predicted. Only the key functions are highlighted in Figure 1, while the full capabilities are elaborated in the manual.

CONCLUSIONS

In this work, we present the first machine-learning-based program, “e-Commensal”, for predicting whether a compound has a commensal or anti-commensal effect on the human gut microbiota, an endpoint that had not previously been addressed by computational methods. Several key features of this program are highlighted. First, e-Commensal integrates the consensus classification model based on four machine-learning methods (KNN, SVM, GBM, and RF), which gives an accuracy, precision, specificity, sensitivity, F1-score, and AUC of 0.811±0.012, 0.759±0.032, 0.901±0.019, 0.628±0.036, 0.687±0.023, and 0.814±0.030, respectively, on the test set; these values are averaged over nineteen data-splitting schemes to reduce the performance bias that results from random data partitioning. Second, the program provides a function to interactively visualize each “1” bit of the ECFP fingerprint together with its corresponding structural feature and feature importance, which allows our anti-commensal/commensal classification model to be interpreted intuitively according to the feature importance. Third, e-Commensal automatically inspects whether a given molecule is located within the applicability domain of our classification model, so as to indicate whether the prediction is a confident inference. Using e-Commensal, virtual screening of FooDB predicts 17314 food-related compounds to be commensal and 5888 to be anti-commensal, which is consistent with the common-sense expectation that the majority of food-related compounds have commensal effects on the human gut microbiota. To the best of our knowledge, we have developed the first consensus classification model and the first graphical software to automatically predict commensal or anti-commensal compounds. We anticipate that this program can help experimental microbiologists investigate more anti-commensal compounds affecting HGM, and help experimental medicinal chemists discover or design drugs with little effect on HGM.


ACKNOWLEDGMENT

The authors acknowledge the support of the startup funding of Wenzhou Medical University, the National Natural Science Foundation of China (Grant No. 21502144), and the Zhejiang Provincial Natural Science Foundation of China (Grant No. LY17B030007). The authors also thank the reviewers for their suggestions.

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website. Text S1-S9, Figure S1-S10 and Table S1-S7 are given in the supporting information.

REFERENCES

1. Lin, L.; Zhang, J. Role of intestinal microbiota and metabolites on gut homeostasis and human diseases. BMC Immunol. 2017, 18, 2-26.
2. Walsh, C. J.; Guinane, C. M.; O’Toole, P. W.; Cotter, P. D. Beneficial modulation of the gut microbiota. FEBS Lett. 2014, 588, 4120-4130.
3. Flint, H. J.; Scott, K. P.; Louis, P.; Duncan, S. H. The role of the gut microbiota in nutrition and health. Nat. Rev. Gastroenterol. Hepatol. 2012, 9, 577-589.
4. Walter, J.; Ley, R. The Human Gut Microbiome: Ecology and Recent Evolutionary Changes. Annu. Rev. Microbiol. 2011, 65, 411-429.
5. Thursby, E.; Juge, N. Introduction to the human gut microbiota. Biochem. J. 2017, 474, 1823-1836.
6. Cani, P. D. Human gut microbiome: hopes, threats and promises. Gut 2018, 67, 1716-1725.
7. Shanahan, F. The gut microbiota-a clinical perspective on lessons learned. Nat. Rev. Gastroenterol. Hepatol. 2012, 9, 609.


8. Schmidt, T. S. B.; Raes, J.; Bork, P. The Human Gut Microbiome: From Association to Modulation. Cell 2018, 172, 1198-1215.
9. Yadav, M.; Verma, M. K.; Chauhan, N. S. A review of metabolic potential of human gut microbiome in human nutrition. Arch. Microbiol. 2018, 200, 203-217.
10. Nicholson, J. K.; Holmes, E.; Kinross, J.; Burcelin, R.; Gibson, G.; Jia, W.; Pettersson, S. Host-Gut Microbiota Metabolic Interactions. Science 2012, 336, 1262-1267.
11. Noh, K.; Kang, Y. R.; Nepal, M. R.; Shakya, R.; Kang, M. J.; Kang, W.; Lee, S.; Jeong, H. G.; Jeong, T. C. Impact of gut microbiota on drug metabolism: an update for safe and effective use of drugs. Arch. Pharmacal. Res. 2017, 40, 1345-1355.
12. Kim, B.-S.; Jeon, Y.-S.; Chun, J. Current Status and Future Promise of the Human Microbiome. Pediatr. Gastroenterol. Hepatol. Nutr. 2013, 16, 71-79.
13. Wilkinson, E. M.; Ilhan, Z. E.; Herbst-Kralovetz, M. M. Microbiota-drug interactions: Impact on metabolism and efficacy of therapeutics. Maturitas 2018, 112, 53-63.
14. Tramontano, M.; Andrejev, S.; Pruteanu, M.; Klünemann, M.; Kuhn, M.; Galardini, M.; Jouhten, P.; Zelezniak, A.; Zeller, G.; Bork, P.; Typas, A.; Patil, K. R. Nutritional preferences of human gut bacteria reveal their metabolic idiosyncrasies. Nat. Microbiol. 2018, 3, 514-522.
15. Meng, C.; Bai, C.; Brown, T. D.; Hood, L. E.; Tian, Q. Human Gut Microbiota and Gastrointestinal Cancer. Genomics Proteomics Bioinformatics 2018, 16, 33-49.
16. Vázquez-Baeza, Y.; Callewaert, C.; Debelius, J.; Hyde, E.; Marotz, C.; Morton, J. T.; Swafford, A.; Vrbanac, A.; Dorrestein, P. C.; Knight, R. Impacts of the Human Gut Microbiome on Therapeutics. Annu. Rev. Pharmacol. Toxicol. 2018, 58, 253-270.
17. Selber-Hnatiw, S.; Rukundo, B.; Ahmadi, M.; Akoubi, H.; Al-Bizri, H.; Aliu, A. F.; Ambeaghen, T. U.; Avetisyan, L.; Bahar, I.; Baird, A.; Begum, F.; Ben Soussan, H.; Blondeau-Éthier, V.; Bordaries, R.; Bramwell, H.; Briggs, A.; Bui, R.; Carnevale, M.; Chancharoen, M.; Chevassus, T.; Choi, J. H.; Coulombe, K.; Couvrette, F.; D'Abreau, S.; Davies, M.; Desbiens, M.-P.; Di Maulo, T.; Di Paolo, S.-A.; Do Ponte, S.; dos Santos Ribeiro, P.; Dubuc-Kanary, L.-A.; Duncan, P. K.; Dupuis, F.; El-Nounou, S.; Eyangos, C. N.; Ferguson, N. K.; Flores-Chinchilla, N. R.; Fotakis, T.; Gado Oumarou H D, M.; Georgiev, M.; Ghiassy, S.; Glibetic, N.; Grégoire Bouchard, J.; Hassan, T.; Huseen, I.; Ibuna Quilatan, M.-F.; Iozzo, T.; Islam, S.; Jaunky, D. B.; Jeyasegaram, A.; Johnston, M.-A.; Kahler, M. R.; Kaler, K.; Kamani, C.; Karimian Rad, H.; Konidis, E.; Konieczny, F.; Kurianowicz, S.; Lamothe, P.; Legros, K.; Leroux, S.; Li, J.; Lozano Rodriguez, M. E.; Luponio-Yoffe, S.; Maalouf, Y.; Mantha, J.; McCormick, M.; Mondragon, P.; Narayana, T.; Neretin, E.; Nguyen, T. T. T.; Niu, I.; Nkemazem, R. B.; O'Donovan, M.; Oueis, M.; Paquette, S.; Patel, N.; Pecsi, E.; Peters, J.; Pettorelli, A.; Poirier, C.; Pompa, V. R.; Rajen, H.; Ralph, R.-O.; Rosales-Vasquez, J.; Rubinshtein, D.; Sakr, S.; Sebai, M. S.; Serravalle, L.; Sidibe, F.; Sinnathurai, A.; Soho, D.; Sundarakrishnan, A.; Svistkova, V.; Ugbeye, T. E.; Vasconcelos, M. S.; Vincelli, M.; Voitovich, O.; Vrabel, P.; Wang, L.; Wasfi, M.; Zha, C. Y.; Gamberi, C. Human Gut Microbiota: Toward an Ecology of Disease. Front. Microbiol. 2017, 8, 1265.
18. Lynch, S. V.; Pedersen, O. The Human Intestinal Microbiome in Health and Disease. N. Engl. J. Med. 2016, 375, 2369-2379.
19. Cho, J. A.; Chinnapen, D. J. F. Targeting friend and foe: Emerging therapeutics in the age of gut microbiome and disease. J. Microbiol. 2018, 56, 183-188.
20. Clemente, Jose C.; Ursell, Luke K.; Parfrey, Laura W.; Knight, R. The Impact of the Gut Microbiota on Human Health: An Integrative View. Cell 2012, 148, 1258-1270.
21. Cénit, M. C.; Matzaraki, V.; Tigchelaar, E. F.; Zhernakova, A. Rapidly expanding knowledge on the role of the gut microbiome in health and disease. Biochim. Biophys. Acta, Mol. Basis Dis. 2014, 1842, 1981-1992.
22. Sun, J.; Chang, E. B. Exploring gut microbes in human health and disease: Pushing the envelope. Genes Diseases 2014, 1, 132-139.
23. Videhult, F. K.; West, C. E. Nutrition, gut microbiota and child health outcomes. Curr. Opin. Clin. Nutr. Metab. Care 2016, 19, 208-213.
24. Scott, K. P.; Gratz, S. W.; Sheridan, P. O.; Flint, H. J.; Duncan, S. H. The influence of diet on the gut microbiota. Pharmacol. Res. 2013, 69, 52-60.
25. Derrien, M.; Veiga, P. Rethinking Diet to Aid Human-Microbe Symbiosis. Trends Microbiol. 2017, 25, 100-112.
26. Lloyd-Price, J.; Abu-Ali, G.; Huttenhower, C. The healthy human microbiome. Genome Med. 2016, 8, 51-61.
27. Maier, L.; Pruteanu, M.; Kuhn, M.; Zeller, G.; Telzerow, A.; Anderson, E. E.; Brochado, A. R.; Fernandez, K. C.; Dose, H.; Mori, H.; Patil, K. R.; Bork, P.; Typas, A. Extensive impact of non-antibiotic drugs on human gut bacteria. Nature 2018, 555, 623-628.
28. Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742-754.
29. Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B. A.; Wang, J.; Yu, B.; Zhang, J.; Bryant, S. H. PubChem Substance and Compound databases. Nucleic Acids Res. 2016, 44, D1202-D1213.
30. O'Boyle, N. M.; Banck, M.; James, C. A.; Morley, C.; Vandermeersch, T.; Hutchison, G. R. Open Babel: An Open Chemical Toolbox. J. Cheminform. 2011, 3, 33-46.
31. Cover, T.; Hart, P. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21-27.
32. Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc.: 1995; p 188.
33. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5-32.
34. Friedman, J. H. Stochastic Gradient Boosting. Comput. Stat. Data Anal. 2002, 38, 367-378.
35. Zheng, S.; Jiang, M.; Zhao, C.; Zhu, R.; Hu, Z.; Xu, Y.; Lin, F. e-Bitter: Bitterant Prediction by the Consensus Voting From the Machine-Learning Methods. Front. Chem. 2018, 6.
36. Ekins, S.; Williams, A. J.; Xu, J. J. A Predictive Ligand-Based Bayesian Model for Human Drug-Induced Liver Injury. Drug Metab. Dispos. 2010, 38, 2302-2308.
37. Chen, L.; Li, Y.; Zhao, Q.; Peng, H.; Hou, T. ADME Evaluation in Drug Discovery. 10. Predictions of P-Glycoprotein Inhibitors Using Recursive Partitioning and Naive Bayesian Classification Techniques. Mol. Pharm. 2011, 8, 889-900.
38. Hu, G.; Kuang, G.; Xiao, W.; Li, W.; Liu, G.; Tang, Y. Performance Evaluation of 2D Fingerprint and 3D Shape Similarity Methods in Virtual Screening. J. Chem. Inf. Model. 2012, 52, 1103-1113.
39. Braga, R. C.; Alves, V. M.; Silva, M. F. B.; Muratov, E.; Fourches, D.; Lião, L. M.; Tropsha, A.; Andrade, C. H. Pred-hERG: A Novel web-Accessible Computational Tool for Predicting Cardiac Toxicity. Mol. Inform. 2015, 34, 698-701.
40. Braga, R. C.; Alves, V. M.; Muratov, E. N.; Strickland, J.; Kleinstreuer, N.; Tropsha, A.; Andrade, C. H. Pred-Skin: A Fast and Reliable Web Application to Assess Skin Sensitization Effect of Chemicals. J. Chem. Inf. Model. 2017, 57, 1013-1017.
41. Rodríguez-Pérez, R.; Vogt, M.; Bajorath, J. Support Vector Machine Classification and Regression Prioritize Different Structural Features for Binary Compound Activity and Potency Value Prediction. ACS Omega 2017, 2, 6371-6379.
42. Varsou, D.-D.; Melagraki, G.; Sarimveis, H.; Afantitis, A. MouseTox: An Online Toxicity Assessment Tool for Small Molecules Through Enalos Cloud Platform. Food Chem. Toxicol. 2017, 110, 83-93.
43. Wang, N.-N.; Huang, C.; Dong, J.; Yao, Z.-J.; Zhu, M.-F.; Deng, Z.-K.; Lv, B.; Lu, A.-P.; Chen, A. F.; Cao, D.-S. Predicting human intestinal absorption with modified random forest approach: a comprehensive evaluation of molecular representation, unbalanced data, and applicability domain issues. RSC Adv. 2017, 7, 19007-19018.
44. Rücker, C.; Rücker, G.; Meringer, M. Y-Randomization and Its Variants in QSPR/QSAR. J. Chem. Inf. Model. 2007, 47, 2345-2357.


TOC Graphic
