Perturbation-Theory Machine Learning modelling of immunotoxicity for

Jul 22, 2019 - ChEMBL biological activities prediction for 1-5-Bromofur-2-il-2-bromo-2-nitroethene (G1) is a difficult task for cytokine immunotoxicit...
0 downloads 0 Views 1009KB Size
Subscriber access provided by GUILFORD COLLEGE

Article

Perturbation-Theory Machine Learning modelling of immunotoxicity for drugs targeting inflammatory cytokines and study of the anti-microbial G1 using cytometric bead arrays Esvieta Tenorio-Borroto, Nilo Castañedo, Xerardo García-Mera, Kenneth Rivadeneira, Juan Carlos Vazquez Chagoyan, Alberto Barbabosa Pliego, Cristian R. Munteanu, and Humbert González-Díaz Chem. Res. Toxicol., Just Accepted Manuscript • DOI: 10.1021/acs.chemrestox.9b00154 • Publication Date (Web): 22 Jul 2019 Downloaded from pubs.acs.org on July 25, 2019

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

Perturbation-Theory Machine Learning modelling of immunotoxicity for drugs targeting inflammatory cytokines and study of the anti-microbial G1 using cytometric bead arrays Esvieta Tenorio-Borroto1,2, Nilo Castañedo3, Xerardo García-Mera1, Kenneth Rivadeneira4, Juan Carlos Vázquez Chagoyán2, Alberto Barbabosa Pliego2, Cristian R. Munteanu*,4,5, Humbert González-Díaz6,7 Department of Organic Chemistry, Faculty of Pharmacy, University of Santiago de Compostela, 15782, Santiago de Compostela, Spain 2 Center for Research and Advanced Studies in Animal Health, Faculty of Veterinary Medicines and Animal Husbandry, Autonomous University of Mexico State (UAEM), 50200, Toluca, México 3 Chemical Bioactive Center (CBQ), Central University of Las Villas (UCLV), 50100, Santa Clara, Cuba 4 RNASA-IMEDIR, Computer Science Faculty, University of A Coruna (UDC), 15071, A Coruña, Spain 5 Biomedical Research Institute of A Coruña (INIBIC), University Hospital Complex of A Coruña (CHUAC), 15006, A Coruña, Spain 6 Department of Organic Chemistry II, University of the Basque Country UPV/EHU, 48940, Leioa, Spain 7 IKERBASQUE, Basque Foundation for Science, 48011, Bilbao, Spain 1

For TOC only

ABSTRACT ChEMBL biological activities prediction for 1-5-Bromofur-2-il-2-bromo-2-nitroethene (G1) is a difficult task for cytokine immunotoxicity. The current study is presenting experimental results for G1 interaction with mouse Th1/Th2 and proinflammatory cytokines using cytometry bead array (CBA). In the in vitro test of CBA, the results show no significant differences between the mean values of the Th1/Th2 cytokines for the samples treated with G1 with respect to the negative control, but there are moderate differences for cytokine values between different periods (24/48 hours). The experiments show no significant differences between the mean values of the proinflammatory cytokines for the samples treated with G1, regarding the negative control, except for the values of tumor necrosis factor (TNF) and Interleukin (IL6) between the group treated with G1 and the negative control at 48 hours. Differences occur for these cytokines 1 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 30

in the periods (24/48 hours). The study confirmed that the antimicrobial G1 did not alter the Th1/Th2 cytokines concentration in vitro in different periods, but it can alter TNF and IL6. G1 promotes free radicals production and activates damage processes in macrophages culture. In order to predict all ChEMBL activities for drugs in other experimental conditions, a ChEMBL dataset was constructed using 25 biological activity, 1366 assays, 2 assay types, 4 assay organisms, 2 organism, and 12 cytokine targets. Molecular descriptors calculated with Rcpi and 15 Machine Learning methods were used to find the best model able to predict if a drug could be active or not against a specific cytokine, in specific experimental conditions. The best model is based on 120 selected molecular descriptors and a Deep Neural Network with Area Under the curve of the Receiver Operating Characteristic of 0.904 and accuracy of 0.832. This model predicted 1,384 G1 biological activities against cytokines in all ChEMBL dataset experimental condition (see https://github.com/kennethriva/Machine-Learning-for-drugs-cytokines). 1. Introduction Cytokines are soluble, low molecular weight polypeptides and glycopeptides produced by a broad range of cell types of haematopoietic and non-haematopoietic origin that have suppressive or enhancive effects on cellular proliferation, differentiation, activation, and motility. Cytokines regulate a complex network of cellular interactions, which involve lymphoid cells, antigen presenting cells and haematopoietic cells mediating the immune response towards the grafted organ 1. Because cytokines can serve as potential biomarkers for health and disease and changes in cytokine levels can serve as readout systems for therapeutic interventions, cytokine research has been and is the focus of many researchers. Of note, cytokines and cytokine antagonists have also exhibited therapeutic potential in a number of chronic and acute diseases 2. Activated T-helper (Th) cells are another major source of the cytokines involved in the regulation of a cascade of immune responses. Cytokines produced by T-helper cells are generally classified as two subtypes: Th1 and Th2 according to their cytokines pattern. Th1 cells secrete predominantly interferon gamma (IFN-γ) and interleukin-2 (IL-2), which are known to promote cell-mediated immunity and to enhance various effector cell functions, including natural killer (NK) cell activity. Th2 cells are involved in humoral responses and secrete interleukins IL-4, IL5, IL-6, IL-9, IL-10 and IL-13. Th1 cells differentiation is regulated by the transcription factors signal transducer and activator of transcription 4 (STAT-4) and T-bet which are different and antagonistic to those controlling the Th2 cells development: STAT-6 and DNA ‘GATA’ binding protein 3 (GATA-3). Particularly IFN-γ promotes the differentiation of naïve CD4+ T cells to the Th1 subset and inhibits the proliferation of Th2 cells and IL-2 is responsible for the clonal expansion of T cells after antigen recognition, and the presence of IL-4 inhibits the development of Th1 cells 3. Recently, researchers have found that T-cell phenotypes are more complex than this simple dichotomous classification and a predominant Th1-type cytokine pattern has been observed in the pathogenesis of autoimmune disorders such as Hashimoto's thyroiditis and rheumatoid arthritis (RA) 4, 5. In infectious diseases, the Th1-type cytokine pattern has been 2 ACS Paragon Plus Environment

Page 3 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

associated with resistance to infection. On the other hand, Th2-type cytokine pattern has been associated with infection by gastrointestinal parasites. A second class of cytokines are the proinflammatory family. These cytokines possess the property of inducing other inflammatory cytokines and leading to reciprocal cytokine induction. Localized reciprocal cytokine induction can cause chronic inflammation and escalate disease severity. Two of the pro-inflammatory cytokines that play a critical role in RA are tumor necrosis factor alpha (TNF-α) and interleukin-1 (IL-1), which up-regulate each other in a positive feedback loop to cause most of the pathologic symptoms 6. The roles of proinflammatory cytokines such as TNF-α and interleukin 1 beta (IL-1β) have been well validated in preclinical animal models and in clinical patients. These cytokines are the major inducers of proliferation of pannus-resident fibroblasts that produce collagenase and other proteolytic enzymes, which in turn cause cartilage degradation. Other cytokines of relevance in different diseases include IFN-γ and granulocyte-macrophage colony stimulating factor (GM-CSF), both of which can activate macrophages 7. On the other hand, the down-regulation of inflammation can occur by the production of IL-10 and transforming growth factor-beta (TGF- β), but the in vivo role of these cytokines in several diseases is not fully clear. Macrophages are essential cells for the innate immune response. Inflammation involves, in part, migration of leukocytes across the vasculature with concurrent exudation of fluids as the vascular permeability increases 8. TNF-α, and IL-6 are two proinflammatory cytokines that form part of a complex defense network that protects the host against inflammatory agents, microbial invasion and injury. However, overproduction or aberrant regulation of these cytokines may harm the host, by inducing tissue injury or alteration of the immune system.The analysis and quantification of cytokines in biological fluids has become a widely used procedure in research and clinical laboratories, and it is clearly important in furthering our understanding of many immunological functions. Through these strategies more than 100 cytokines have been identified and have been shown to interact in a complex network in order to fulfil their biological functions. The biological effects they induce include stimulation or inhibition of cell proliferation, cytotoxicity/apoptosis, antiviral activity, differentiation, and up-regulation of expression of intracellular, secreted, and surface membrane proteins. Flow cytometry has been advanced significantly by the development of both fluorescent dye synthesis and high-speed analytical technologies in recent years. However, with the demand for multiplexing capability, shorter analysis time, smaller sample volume and higher sensitivity, other techniques are being explored to perform immunoassays. Regarding this trend, current multiplexed technologies such as bead-based flow cytometry and bead-based suspension array technologies allow multiple analyses in a single sample to be assessed simultaneously, which has been widely used in multiplexed assays in complex system recently 9. Compared with conventional analytical technologies, bead-based multiplexed technology possesses unique properties such as detection speed, excellent sensitivity, multiple analysis capability, splendid specificity, flexibility and low cost, which offer it the potential to become an excellent multiplexed technology. Since these technologies are capable of yielding a large amount of information which can be retrieved in 3 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 4 of 30

base date multiplex, numerous groups have dedicated themselves to collect and sort data from a vast variety of assays. These experiments allowed the creation of large databases available online for public research. The ChEMBL database mainly consists of medicinal chemistry bioassay data, integrated from a wide variety of sources for chemical biology and drug-discovery research problems 10. Currently, the database contains 15.2 million bioactivity measurements for more than 1 million compounds and 12000 protein targets11. ChEMBL contains over 20,000 outcomes for drug assays related to cytokines Th1/Th2, with different degrees of curation. As a consequence, the search of computational models to predict the possible results for new drugs in all these assays have become a goal of the major importance to reduce experimentation costs. The methods TOPS-MODE was introduced by Estrada and implemented in software MODESLAB with TOPS-MODE name 12. The software have been applied to high-throughput multi-target QSAR 13 and to complex datasets with multiplexing assay conditions (mj), similar with ChEMBL 14,15. High-throughput multi-target QSAR has been widely used to predict biology activity (immunotoxicity) from chemical structure and corresponding physicochemical properties. On the other hand, there are not high-throughput mt-QSAR models of multiplexing assay endpoints for drug effects over cytokines Th1/Th2 using TOPS-MODE or other technique. The three objectives of the present work are the following: (1) Experimentally characterization of the interactions between 1-5-Bromofur-2-il-2-bromo2-nitroethene (G1) with cytokines associated with Th1/Th2 and proinflammatory response by using CBA-based immunoassays on mouse (cytometry bead assays); (2) Development of a theoretical multiplexing-QSAR (mt-QSAR) model able to predict the immunotoxic effect of drugs over humoral and cellular response by using a large set of mj experimental conditions. For a large ChEMBL dataset of drugs - cytokines, the descriptors of drugs and protein targets will be calculated using Rcpi. Next, we shall fit and validate a new high-throughput mt-QSAR model by using different Machine Learning techniques from python libraries. (3) Extension of experimental G1-cytokine interactions: prediction of other immunotoxic endpoints multiplexing assay for G1, not determined experimentally in this work by using the best mt-QSAR model. 2. Materials and methods 2.1 Biology assays

4 ACS Paragon Plus Environment

Page 5 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

2.1.1. Reagents and antibody 1-5-Bromofur-2-il-2-bromo-2-nitroethene (G1); CAS number 35950-55-1, was kindly supplied from the Chemical Bioactives Centre, Sample purity was 99.93%. G1 was dissolved in dimethylsulfoxide (DMSO). DMSO was purchased in turn from Sigma–Aldrich Co. (DF, México). Flow cytometry was performed using a FACcalibur cytometer (Becton Dickinson, México). Thereafter, FACS data were analyzed with FCAPArray.v3.0 software. All CBA Mouse Th1/Th2 Cytokine Kit (Catalog No. 551287) and CBA Mouse Inflammatory kits (Catalog No. 552364) were purchased from BD (BD Biosciences, México). 2.1.2. Animals Female Balb/C mice weighing 18–20 g were purchased from the UNAM-Harlan laboratories (DF, México). All animals (n=6) were allowed to acclimate to our laboratory facilities for at least 7 days before their inclusion in an experiment. They were housed in standard laboratory conditions (22.3 °C; relative humidity 50–55%; 12h light/dark cycle) and given ad libitum access to food and water. This work agreed with Ethical Principles in Animal Research adopted by México (NOM, 1999). The study was performed following approval by the Veterinary Ethics Committee of the Faculty de Medicine Veterinarian of Universidad Autónoma del Estado de México. 2.1.3. In vitro culture of thymic lymphocytes and macrophages After sacrifice, thyme and peritoneal fluid were aseptically removed and cells were suspended at a concentration of 1×106/mL in RPMI 1640 (BD. México) supplemented with 10% fetal calf serum, 1% and cultured in flat-bottomed 96-well plates. Cells were cultured in triplicates (300µL/well) and incubated at 37 °C and 5% CO2. Cell suspensions were washed twice by centrifugation. Cell viability (over 95%) was determined using trypan blue exclusion. Lymphocytes and macrophages numbers were adjusted to 1 × 106 cell/mL and plated 100 μL/ well in 96-well flat-bottomed tissue culture plates (UNIPARTS, Toluca, Mexico). Cells were incubated in RPMI 1640 complete medium containing 10% FBS, and 1 mg/mL of LPS. This cell was incubated for 24 and 48 h at 37 °C under 5% CO2 in a humidified chamber. Non-adherent cells were removed by gently washing with PBS and fresh RPMI 1640 with 10% of DMSO complete medium was replaced and incubated at 80ºC for 72 hours. 2.1.4. Analysis of mouse cytokines Th1/Th2 by flow cytometry A CBA-based immunoassay designed to measure six cytokines commonly associated with Th1/Th2 cellular response was developed. This kit combines beads with the ability to measure the levels of IL-2, IL-4, IL-5, IL-10, TNF, and IFN-γ. The second CBA-based immunoassay was 5 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 6 of 30

designed to measure six cytokines commonly associated with proinflammatory response. This second kit combines beads with the ability to measure the levels of monocyte chemoattractant protein-1 (MCP-1), IL-6, IL-10, TNF, interleukin-12 as a disulfide-linked heterodimeric 70-kDa cytokine (IL-12p70) and IFN-γ. A major concern in the utilization of this type of complex assay is the potential variability that may be observed. Analysis of cytokines was conducted using a mouse cytometric bead array kit (CBA; BD Biosciences, San Diego, CA) and was analyzed on a FACSCalibur flow cytometer. Standard curves were determined for each cytokine from a range of 0–5000 pg/ml. The lower limit of detection for the CBA, according to the manufacturer is 2.5–52.7 pg/ml, depending on the analyte. The following cytokines from the thymus lymphocyte were measured: interleukin-4 (IL-4), interleukin-5 (IL-5), interleukin-2 (IL-2), interleukin-10 (IL-10), Interferon-γ (IFN-γ), and tumor necrosis factor-α (TNF-α). The following macrophage cytokines were measured: interleukin-6 (IL-6), interleukin-10 (IL-10), Interferon-γ (IFN-γ), and tumor necrosis factor-α (TNF-α). For each sample, plasma and the standard mixture of cytokines, 50 µl of sample or standard of cytokine were added to the mixture of 50 µl of each Ab-bead reagent and Ab-PE detector. The mixture (150 µl) was incubated in darkness at room temperature for 160 min and then washed before data acquisition with flow cytometry. 2.1.5. Statistical Analysis of experimental assays All statistical analyses were performed using the STATISTICA package program version 6.0. Mean values and standard deviations were calculated from the experimental data obtained, and an analysis of variance (ANOVA) was applied to a 5% level of significance, using compound concentration and Mean Weigh as main factors. Pairwise comparisons were carried out using the Tukey test, at the same level of significance. 2.2 Computational methods 2.2.1 ChEMBL dataset preprocessing ChEMBL 16 is a general data set composed of >20,000 multiplexing assay endpoints which were downloaded from the public database ChEMBL 10,17. This dataset includes 12,906 parts of drug cytokine into multiplexing assay conditions (mj). Every drug evaluated in different mj conditions were assigned to 1 out of 2 possible activity classes: active (C = 1) or non-active compounds (C = 0). One compound may lead to 1 or more statistical cases because it may give different outcomes (statistical cases) for alternative biological assays carried out in diverse sets of multiplex conditions. One compound has C = 1 if the Z-score of this compound Zij > Zcutoff. The Zij scores depend of the value observed (Vij) for the ith-drug assayed in conditions (mj). It also depends on the average (AVGj(Vij)) and Standard deviation (SDj) of the values measured Vij. 6 ACS Paragon Plus Environment

Page 7 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

The parameter nj is the number of cases measured in a given sub-set of conditions mj. Thus, the final dataset of the study contains unbalanced classes (class 0:class 1 = 3:1). The formulas used for Zij scores are shown in equations 1-3. 𝑍𝑖𝑗 = 𝑉𝑖𝑗 ― 𝐴𝑉𝐺𝑗(𝑉𝑖𝑗) 𝑆𝐷𝑗(𝑉𝑖𝑗) (1)

𝑛

𝐴𝑉𝐺𝑗(𝑉𝑖𝑗) =

∑1𝑗𝑉𝑖𝑗 𝑛𝑗

(2)

𝑛

𝑆𝐷𝑗(𝑉𝑖𝑗) =

∑1𝑗(𝑉𝑖𝑗 ― 𝐴𝑉𝐺𝑗(𝑉𝑖𝑗))2 𝑛𝑗 ― 1

(3)

The dataset is based on six type of ChEMBL experimental conditions: -

-

STANDARD_TYPE_UNITSj = 25 type of biological activity + units: Inhibition (%), IC50 (nM), EC50 (nM), Kd (nM), Ki (nM), Ka (10'6/M/s), Activity (%), Selectivity (), Residual Activity (%), Ratio IC50 (), Activity (uM), Ratio (), Vmax (), Km (nM), ED50 (mg.kg-1), k_off (s-1), Tm (degrees C), T1/2 (hr), k_on (M-1.s-1), Delta Tm (degrees C), GI50 (nM), K (/min), IC50 (ug.mL-1), Time (min), and FC (); ASSAY_CHEMBLID = 1366 unique assays; ASSAY_TYPE = 2 values: 'F', 'B'); ASSAY_ORGANISM = 4 values: 'Homo sapiens', 'Escherichia coli', 'Unknown', 'Mus musculus'); ORGANISM = 2 values: 'Homo sapiens', 'Mus musculus'; TARGET_CHEMBLID = 12 target cytokines: CHEMBL2094115, CHEMBL260, CHEMBL1974, CHEMBL6186, CHEMBL4897, CHEMBL1649052, CHEMBL1887, CHEMBL3120039, CHEMBL1275217, CHEMBL3596084, CHEMBL2034796, and CHEMBL2176788.

A set of unique 7,202 ligands/drugs has been used. The set of the six experimental conditions generate 1,384 multiplexing assay condition (mj).

7 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 30

2.2.2 Theoretical modelling steps Perturbation-Theory Machine Learning (PTML) modeling technique is useful to seek predictive models for complex datasets with multiple Big Data features 18, 19. We can predict scoring function values f(vij)calc for the ith compound in the jth preclinical assay with multiple conditions of assay cj = (c0, c1, c2, …cn). PT operators are used as input. The operators more commonly used are Moving Average (MA) operators similar those used by Box & Jenkins in ARIMA models 20,21. It is possible to develop linear PTML models in order to predict the biological activity and/or classify compounds as active or inactive in terms of biological activity. The MA operators (MAij) are experiment-centered features (transformed features) obtained as the difference between the original feature (molecular descriptor for drugs and protein targets, Vij, ith-drug assayed in conditions mj) and the average value of the original feature (Vij) for specific experimental condition mj (see Eq. 4). 𝑀𝐴𝑖𝑗 = 𝐴𝑣𝑔𝑗(𝑉𝑖𝑗) ― 𝑉𝑖𝑗 (4) The model uses as original features drug and protein target descriptors calculated in R. Rcpi package 22 was used to calculate 143 molecular descriptors for 7,202 unique drugs and 110 molecular descriptors for 12 cytokine proteins (see Supp. Material, ListOfMolecularDescriptors.xlsx). After calculating experiment-centered descriptor (MA) and removing duplicates, the pool dataset contains 1,518 features. MA features are normalized using the standard transformer in scikit-learn 23 from python. The holdout cross-validation method was used with a stratified split of 75% - 25% for training and test subsets. A big number of features (1,518 MAs) could produce overfitting and consume intensive CPU/GPU. Therefore, a feature selection methods based on Random Forest was used to select the best 120 MAs. Only the best 10 features for each experimental condition for proteins and drugs (best 10 x 6 conditions x 2 = 120 features) will be used for the Machine Learning calculations. Additionally, principal component analysis (PCA) reduction was tested (106 components for 99% variability). Because our data is not balanced, class weights have been used for classes. The steps for the modelling section are based on a series of python scripts presented as Jupiter notebooks (see the study GitHub repository 24) : 1) The original ChEMBL dataset, drug and protein descriptors are merged into a raw dataset (ds_raw): 1-Merge-Datasets.ipynb. 2) Feature engineering to create new features as difference of the original descriptors and the average of the descriptors in specific experimental conditions (ds_MA.csv). Thus, the descriptors will include experimental condition information by using the experimentalcentering of the features (2-Experimental-centered-dataset.ipynb). 3) Standardize the dataset features and create training and test subsets: 3-CreateDatasets.ipynb. 4) Build models with 14 Machine Learning methods with default parameters as baseline results: 4-Machine-Learning-baseline-generator.ipynb. The following methods from sklearn are included: KNeighborsClassifier - Nearest Neighbors 25, LinearSVC - Linear 8 ACS Paragon Plus Environment

Page 9 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

Support vector machine (SVM) 26, LinearDiscriminantAnalysis - Linear Discriminant Analysis 27, LogisticRegression - Logistic regression 28, SVC - Support vector machine (SVM) with linear and Radial Basis Functions (RBF) 29, AdaBoostClassifier - AdaBoost 30, GaussianNB - Gaussian Naive Bayes 31, MLPClassifier - Multi-Layer Perceptron (MLP) / Neural Networks 32, DecisionTreeClassifier - Decision Trees 33, RandomForestClassifier - Random Forest 34, GradientBoostingClassifier - Gradient Boosting 35, BaggingClassifier - ensemble meta-estimator Bagging 36, and XGBClassifier - XGBoost 37. 5) Search grid for the best methods - find the best model hyperparameters using 5-fold cross-validation (several python scripts). 6) Deep Learning 38 models using keras 39. The best DL model was constructed with 5-FCDNNs.ipynb. The best model was used to predict the class (active vs non-active compounds) for the experimental drug G1 in all combination of the experimental conditions from our dataset (1,384 predictions). This step was possible using the G1 descriptors with the target descriptors and the experimental conditions from the dataset (see Figure 1). The models are characterized by different metrics such as Area Under the Receiver operating Characteristics (AUROC / AUC) 40, total accuracy (ACC), recall, precision, and F1-score 41.

Figure 1. Computational Method flow. 2.2.3 Machine Learning approach In this study, a supervised approach will be used in order to classify new observations based on a set of labeled examples. In order to find a high-throughput multiplexing-QSAR model, we 9 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 30

compared several Machine Learning algorithms from python scikit-learn library 23. In our case, we are dealing with a classification problem, more precisely with a binary classification problem in which learning functions classifies the data item into one of two predefined classes: active (C = 1) or non-active compounds (C = 0). 2.2.3.1 Algorithms 15 ML linear and non-linear algorithms were used in order to seek the mt-QSAR model. Each of these algorithms has its own set of parameters that need to be tuned to find the best model. The k-nearest neighbor algorithm (kNN) 25 is one of the first and most used classifiers in almost any initial procedure in a ML tasks. Since the main goal is to classify based on a majority vote of its k neighbors, kNN has a much more flexible decision boundary compared to logistic regression 42. It will likely outperform logistic regression when the true decision boundary is highly irregular and far from linear. The main drawbacks is the performance observed in high dimensionality datasets. Gaussian Naive Bayes (GNB) is an algorithm which is included in a family of algorithms based on the Bayes theorem, which describes the probability of an event, based on prior knowledge of conditions, be related of conditions to the event 31. Linear discriminant analysis (LDA) 27 represents a generalization of Fisher's linear discriminant. It is searching for a linear combination of features that is able to separates two or more classes. Thus, it is related to ANOVA, regression analysis, PCA and factor analysis. Support vector machines (SVM) has become one of widely used algorithms for supervised classification tasks since it offers high accuracy compared to other classifiers. The main goal is to construct a hyperplane in multidimensional space to separate different classes. Thus, the maximum marginal hyperplane that best divides the classes can be achieved 43,44. Hyperplane distribution is mediated by support vectors 26. SVM handles the non-linear relationships by using the kernel trick where different types of nonlinear kernel functions such as Gaussian radial basis (RBF) or Laplace radial basis (LAP) can be represented 29. Decision Tree (DT) is a non-parametric supervised learning method built around a series of decisions rules inferred from the data features. These decision rules create a “tree” structure where parameters such as tree depth and node number need to be set in order to split the tree based on the decision taken. The paths from root to leaf represent classification rules 33. Random forest (RF) is an ensemble method by aggregating several decision trees. By using all these learners, a reduction of the variance is achieved compared to the one obtained from a single learner 34. Each tree is generated based on a bootstrap sample drawn randomly from the original dataset using a classification or regression tree (CART) method and the Decrease Gini Impurity (DGI) as the splitting criterion 34. Both bagging and random variable selection for tree building is introduced to create the “forest”. The final step is to average over a large ensemble of low-bias, high-variance but low correlation trees. Algorithms such as Adaboost 30 and Gradient Boosting 35 can be also labeled as ensemble methods. The main goal of this sort of methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability or 10 ACS Paragon Plus Environment

Page 11 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

robustness over a single estimator. There are two families of ensemble methods, averaging and boosting methods, the former is based on the predictions average over the estimators whereas in the latter base estimators are built sequentially trying to reduce the bias of the combined estimator. The final goal is to build a powerful ensemble from a combination of weak models. In both Adaboost and Gradient Boosting, training samples are reweighted to point out to the misclassified instances made by previous trees. Once all trees are grown, predictions are obtained by the weighted majority vote of the trees 45. Bagging meta-estimators can be seen as a class of ensemble algorithms, in which several instances of a black-box estimators on random subsets of the original training set are built with the final purpose of aggregating their individual predictions to form a final global prediction. The goal is to reduce the variance over base estimators by adding randomization into the construction workflow and making an ensemble out of it eventually 36. Since its ability to reduce overfitting, usually perform better with strong and complex models. By contrast, boosting methods usually achieve better performances by working with weak models. XGBoost is another ensemble algorithm in which weak classifiers are added to correct errors made by themselves. The main goal is to build a strong enough classifier by reducing error rate made by weak learners. Recently XGBoost classifier has caught community attention due to its performances in some popular competitions as the ones occurred through the Kaggle competition projects 37. Each of the learners is a classification or regression tree (CART) as defined by Friedman in 46. Logistic regression (LR) 28 is a simple linear model, which is able to estimate probability of a binary response based on a number of risk factors. The strategy to avoid overfitting is the use of a regularization term, typically known as L1 and L2 penalty terms 47. A main drawback is the limited flexibility of LR while dealing with non-linear separable problems 48. Multilayer perceptron (MLP) 32 is one of the artificial neural networks (ANNs) approaches usually used as a first simple approximation in artificial networks. MLP algorithm used in this study is known as multi-layer feed-forward neural network (MFFN) in which data and calculations flow in a single direction, from the input data to the outputs by using interconnection of perceptrons. This type of ANNs has been widely used in classification and prediction problems as in cancer research 49. Deep Learning 38 is a subfield of Machine Learning. The current work is using the fullyconnected deep neural networks (FC-DNN) that represents an extensions of the classical neural networks: there are more than one hidden layer, more neurons in each layer and better activation functions in the neurons. 2.2.3.2 Grid Search of model hyperparameters After the baseline results using the default/one specific set of parameters for 14 classifiers, a grid search step has been executed to improve the first results. In addition, DL topologies (FC-DNN)

11 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 30

have been tested. All calculations used class_weight due to the imbalanced dataset and seed = 0 for reproducibility. The following parameters have been tested: ● LogisticRegression: "C":np.logspace(-3,3,7), "penalty":["l1","l2"]; ● SVC: 'C':[1,10],'gamma':['scale',0.1,0.001,0.0001],'kernel':['linear','rbf']; ● RandomForestClassifier: "n_estimators":[100,500,1000], "max_depth":[3, 5, 7, 9, 12, 20, None],max_features":[1,3,10,'auto'],"min_samples_split":[2,3,4,5,10,15,20,40],"min_sam ples_leaf":[1, 2, 4, 6, 8, 10],"bootstrap":[True, False],"criterion":["gini", "entropy"]; "n_estimators":[1000],"max_depth":[12,None],"max_features":[10,20,'auto'],"min_sampl es_split":[40],"min_samples_leaf":[6,8,10,20],"bootstrap":[True],"criterion": ["entropy"]; "n_estimators":[1000],"max_depth":[None],"max_features":[10],"min_samples_split":[4 0],"min_samples_leaf":[20],"bootstrap":[True],"criterion":["entropy"]; ● FC-DNN: neuronH1L = [10,20,40,60,80,120,240], topologyL=['n-n:3','n-n:4','n:2', 'n-n:2n:3','n-n:2-n:3-n:4','n:2','n:3','n:4'], activationL = ['relu'], dropL= [0.1,0.5,0.7], optimizerL = ['Adam'], initL=['glorot_normal'], batchsL= [1024], epochL=[100]; topologyL=['n-n*2n*3-n*2-n'], activationL=['relu','tanh','selu','elu','linear'], dropL= [0.4,0.5,0.6], optimizerL=['Adam'], initL=['glorot_normal'], batchsL=[1024], epochL=[100]. 3. Results and Discussion 3.1 Experimental results 3.1.1. Cytokines Th1/Th2 assays Quantifying of cytokines Th1/Th2 is crucial for understanding, compound immunotoxicity, humoral and cellular response to cytokines and other biological questions. Cytokines were seen important markers to reveal the nature of the immune disturbances that are inflicted by antimicrobial immunotoxicity in mice. In summary, the measurement of cytokines has become an important tool in immunotoxicology. In our study, we examined the cytokines Th1 /Th2 in lymphocytes using flow cytometry. A CBA-based immunoassay designed to measure six cytokines commonly associated with Th1/Th2 cellular response was developed. This kit combines beads with the ability to measure the levels of IL-2, IL-4, IL-5, TNF, and IFN-. The results presented in Figure 2 represents standard curves with broad dynamic ranges for each of the five cytokines derived from the mouse Th1/Th2 cytokine kit. A major concern in the utilization of this type of complex assay is the potential variability that may be observed. The determination coefficient of standard curves is R2> 99 (with 5 logistics parameters). In addition, the cytokines analyses with flow cytometry were performed; in order to follow the pictogram (pg/mL) of cytokines produced in the lymphocytes populations treated with G1 at 10mg/mL after 24 and 48 hours (Table 1 and Table 2).

12 ACS Paragon Plus Environment

Page 13 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

Figure 2. Graphic Dot Plot represent standard curves with broad dynamic ranges for each of the faith cytokines derived from the mouse Th1/Th2 cytokine kit (A:TNF, B:INFγ, C: IL5, D: IL4, E:IL2) with coefficient of determination (R2 > 0. 99). 13 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 30

Table 1. Concentrations of cytokines (Th1-Th2) in thymus lymphocytes cultivated with LPS 24 hours. G

TNFα

nT 1 2 3 4 5 6

INFγ

IL-5

IL-4

IL-2

IL-10

7.65 7.64 7.73 7.35 7.55 7.11

7.23 7.30 7.17 6.73 7.24 7.33

8.23 7.45 8.17 7.73 6.24 7.16

Concentrations (pg/mL) G1 Cicl LPS Cicl-LPS G1-LPS NC

13.02 12.35 9.79 7.79 17.26 16.51

10.46 10.50 10.43 10.26 11.02 10.49

9.60 9.42 9.79 7.69 10.11 8.56

nT= Number Tube ; G= Group ; NC= Negative Control; LPS=Lipopolysaccharide; Cicl= Ciclosporin;

Table 2. Concentrations of cytokines (Th1-Th2) in thymus lymphocytes cultivated with LPS 48 hours. G

TNFα

INFγ

nT 1 2 3 4 5 6

IL-5

IL-4

IL-2

IL-10

7.56 7.35 7.33 7.02 6.21 7.78

7.40 8.35 7.53 7.25 7.21 7.35

Concentrations (pg/mL) G1 Cicl LPS Cicl-LPS G1-LPS CN

26.98 14.94 13.99 15.73 22.69 17.78

11.17 11.12 11.1 9.63 11.91 9.53

10.36 9.79 9.55 8.46 10.23 10.18

8.18 7.85 8.06 7.69 8.20 7.69

nT= Number Tube ; G= Group ; NC= Negative Control; LPS=Lipopolysaccharide; Cicl= Ciclosporin;

14 ACS Paragon Plus Environment

Page 15 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

Figure 3. Histogram represent the behavior of cytokines in lymphocytes exposed to 10 mg/Kg of G1 for 24 and 48 hours and activated with LPS derived from the mouse Th1-Th2 cytokine kit with coefficient of determination (R2 > 0.99) in 5 logistic parameters (A:Curve; B:Negative Control; C:G1+LPS; D:G1 for 24 hours; E: Negative Control; F: G1+LPS; G:G1 for 48 hours). Table 1 show the cytokines concentrations treated with G1 for 24 hours. The concentration of INFγ was 10.46 pg/mL. Result do not show significant differences for p 0.99) in 5 logistic parameters (A:Curve; B:Negative Control; C:G1+LPS; D:G1 24 hours; E: Negative Control; F: G1+LPS; G:G1). In general, the results show no significant differences (p ≤ 0.05) between the mean values of the proinflammatory cytokines for the samples treated with G1 in concentrations (10 μg / mL) with respect to the negative control (NC) but if there are significant differences for Values of TNF and IL6 between the group treated with G1 and the Negative Control at 48 hours. In addition, there are differences in these same cytokines in the periods (24 and 48 hours). These results could lead to the G1 can promote the production of free radicals and activate processes of damage in the culture of macrophages.

3.2 Computational results An initial dataset consists in 1,518 features for 12,906 triples of drug - target - experimental ontology. After standardized the raw dataset, 3 datasets are generated for classification models that are able to predict drug activity in specific experimental conditions: standardized dataset (Std.ds), Std.ds with selected features (Std-FS.ds), and PCA of Std.ds (Std-PCA.ds). A big number of features could create overfitting and make difficult the interpretation of a classification model (Std.ds). This was the reason to use feature selection (Std-FS.ds) and PCA 21 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 30

transform (Std-PCA.ds). The selection was based on the feature importance/score given by a RF classifier. Std-FS.ds contains 120 features (10 best features for each of the 6 experimental conditions, for drugs and proteins) for 12,906 examples. The feature importance ranges between 73 and 83. The PCA dataset explains 99% of the initial features variance with only 126 components (Std-PCA.ds). In the first step, a baseline script built classifiers for 3 datasets (standardized data, FS/PCA of features) using 14 ML methods (see Figure 6 and Figure 7). The classifiers were obtained using the holdout method for fast calculation and baseline comparison (see Supp. Material “ML_baseline_generator.xlsx”). The final model will be characterized by a cross-validation process. The majority of the classifier are able to obtain a test AUROC > 0.70 and each dataset can generate at least one classifier with AUROC > 0.80, without searching for the best model hyperparameters.

Figure 6. AUROC of baseline models - Classifier view.

Figure 7. AUROC of baseline classifiers - Dataset view.

22 ACS Paragon Plus Environment

Page 23 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

Table 5. Dataset description and the best baseline classifiers - SVC (linear kernels). Standardized Dataset

Feature selection

PCA

No. of features

ACC

AUROC

Std.ds

-

-

1518

0.835

0.865

Std-FS.ds

+

-

120

0.827

0.859

Std-PCA.ds

-

+

126

0.823

0.831

For all three datasets, SVC with linear kernels obtained the best performance (Table 5). It can be observed that the reduction of the feature number from 1,518 to 120 (Std-FS.ds) is decreasing the model performance with only 0.006 (0.7%). The PCA reduction to 126 components is not able to keep the same difference and lower the performance with 0.034 (3.9%). Thus, the best model in this baseline screening was obtained with Std.ds with AUROC = 0.865 but Std-FS.ds with only 8% of the initial features is able to provide an AUROC of 0.859. Only Gaussian NB classifier is not able to build a model with AUROC > 0.70. It is interesting that, generally, the linear methods are able to build better models than the tree-based methods using default/unique set of parameters. Details about the accuracies, F1-score and AUROC for the baseline models (ML with default/simple parameters 24) are presented in Table 6 by ordering the results by the test AUROC for Std-FS.ds. Table 6. ML baseline results for test subset ordered by AUROC for feature selected (FS) dataset. ML classifier

ACC

F1-score

AUROC

SVC (linear)

0.827

0.736594544

0.859

LinearSVC

0.823

0.725168756

0.845

SVC (rbf)

0.822

0.719804401

0.838

LogisticRegression

0.824

0.718811881

0.835

LinearDiscriminantAnalysis

0.836

0.718600954

0.825

GradientBoostingClassifier

0.840

0.715545755

0.817

AdaBoostClassifier

0.833

0.694915254

0.799

MLPClassifier

0.814

0.64743967

0.762

RandomForestClassifier

0.807

0.642570281

0.760

BaggingClassifier

0.808

0.626128838

0.746

XGBClassifier

0.800

0.591888466

0.720

KNeighborsClassifier

0.793

0.581977472

0.715

DecisionTreeClassifier

0.779

0.573309396

0.711

GaussianNB

0.708

0.530815109

0.683

23 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 30

In the second step, grid search methods (5-fold CV) are applied to find better models using the best classifiers of Std-FS. We choose this dataset because it has less features and the features are the initial standardized values, not PCA components (combinations of the initial features). In addition to the current ML classifiers, 609 FC-DNN topologies have been tested. The best models are characterized in Table 7. All the models have F1-score > 0.70 (see online GitHub). Table 7. Best grid search models for ML/DL models. Classifier

Best parameters

ACC

AUROC

SVC (linear)

'gamma': 'scale', 'kernel': 'linear', 'C': 1

0.827

0.859

LogisticRegression

'penalty': 'l2', 'C': 100.0

0.829

0.859

RandomForestClassifier

'max_depth': None, 'criterion': 'entropy', 'bootstrap': True, 'min_samples_split': 40, 'min_samples_leaf': 20, 'n_estimators': 1000, 'max_features': 10

0.840

0.875

FC-DNN

topology:n-n*2-n*3-n*2-n', n:120, activation:'relu', optimizer:'Adam', kernel_initializer:'glorot_normal', batch_size:1024, epochs:120, BatchNormalization(), droput=0.6

0.832

0.904

The best models obtained with the grid search have AUROC of 0.859 for LogisticRegression and SVC (linear), 0.875 for RandomForestClassifier and 0.904 for FC-DNN. Therefore, the best model obtained with FC-DNN was saved as file and it has the structure presented in Table 8: 5 hidden layers with 248,161 parameters to train. Table 8. FC-DNN description. Layer (type) Output Shape Param # dense_43 (Dense) (None, 120) 14520 batch_normalization_36 (Batch Normalization) (None, 120) 480 dropout_36 (Dropout) (None, 120) 0 dense_44 (Dense) (None, 240) 29040 batch_normalization_37 (Batch Normalization) (None, 240) 960 dropout_37 (Dropout) (None, 240) 0 dense_45 (Dense) (None, 360) 86760 batch_normalization_38 (Batch Normalization) (None, 360) 1440 dropout_38 (Dropout) (None, 360) 0 dense_46 (Dense) (None, 240) 86640 batch_normalization_39 (Batch Normalization) (None, 240) 960 dropout_39 (Dropout) (None, 240) 0 dense_47 (Dense) (None, 120) 28920 batch_normalization_40 (Batch Normalization) (None, 120) 480 dropout_40 (Dropout) (None, 120 0 dense_48 (Dense) (None, 1) 121 Note: Total params: 250,321; Trainable params: 248,161; Non-trainable params: 2,160.

FC-DNN was used to calculate 1384 prediction for G1 in all the experimental conditions presented into the dataset (see Supp. material, G1_predictionsExperCondsOrd.csv). A set of 490 24 ACS Paragon Plus Environment

Page 25 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

multiplexing conditions are predicting active G1 compounds for 6 cytokine proteins (P36888 / CHEMBL1974), Q9QZM3 / CHEMBL2176788, Q16539 / CHEMBL260, O35235 / CHEMBL3596084, Q9H4B4 / CHEMBL4897, Q96BR1 / CHEMBL6186), 6 STANDARD_TYPE_UNITSj (Activity (%), IC50 (nM), Inhibition (%), Ki (nM), Residual Activity (%), FC ()), 490 different ASSAY_CHEMBLIDs, both types of ASSAY_TYPE, 3 ASSAY_ORGANISMs (Homo sapiens, Unknown, Mus musculus), and both ORGANISMs (Homo sapiens, Mus musculus). All the inputs, outputs files, results, python/R scripts and jupyter notebooks are available into an open GitHub repository: https://github.com/kennethriva/Machine-Learning-for-drugs-cytokines. 4. Conclusions The current study aim is to describe the G1 - cytokines interactions using experiments and QSAR-like computational models. This subject represents a very difficult task and it is very important for the prediction of immunotoxicity. Thus, experimental results for G1 interaction with mouse Th1/Th2 and proinflammatory cytokines using cytometry bead array technique were presented. The experiments used only few experimental condition compared with the extended ChEMBL dataset. In general, the results show no significant differences (p ≤ 0.05) between the mean values of the Th1 / Th2 cytokines for the samples treated with G1 in concentrations (10 μg / mL) with respect to the negative control (NC) and there are moderate differences for cytokine values between different periods (24 and 48 hours) in the same parameters. Our results indicate that lymphocyte stimulation is required to utilize cytokine production for the prediction of immunotoxicity. In addition, we propose, confirming with this study, that the antimicrobial G1 did not alter the concentration of Th1 / Th2 cytokines in vitro in different periods. On the other hand, the results show no significant differences (p ≤ 0.05) between the mean values of the proinflammatory cytokines for the samples treated with G1 in concentrations (10 μg / mL) with respect to the negative control (NC) except for the values of TNF and IL6 between the group treated with G1 and the negative control at 48 hours. In addition, there are differences in these same cytokines in the periods (24 and 48 hours). These results could lead to the G1 can promote the production of free radicals and activate processes of damage in the culture of macrophages. In summary, the cytometry bead array technique provides a quick and simple method to simultaneously analyze the smallest amounts of cytokines in vitro. In addition, a QSAR model using drug and protein descriptors was proposed to predict biological activities of G1 in multiple ChEMBL experimental conditions. Thus, a ChEMBL dataset was constructed based on 25 type of biological activity, 1366 unique assays, 2 assay types, 4 assay organisms, 2 source organism, and 12 cytokines. Each drug and protein was transformed into a series of molecular descriptors. The experimental-centered descriptors were the input for 15 Machine Learning methods such as Nearest Neighbors, Linear Discriminant Analysis, Linear Support vector machine, Logistic regression, Support Vector Machine with Linear and Radial 25 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 30

Basis Functions, AdaBoost, Gaussian Naive Bayes, Multi-Layer Perceptron, Decision Trees, Random Forest, Gradient Boosting, ensemble meta-estimator Bagging, XGBoost and FullyConnected Deep Neural Networks (FC-DNN). FC-DNN provided the best model using only 120 selected drug and protein molecular descriptors, with 5 hidden layers (120 inputs:120-240-360-240-120:1 output). The performance of the model was demonstrated by an AUROC = 0.904 and accuracy = 0.832 (test subset). FCDNN was used to predict 1384 biological activities of G1 in multiple experimental condition from the ChEMBL dataset. A number of 490 predictions pointed to active G1 compounds against cytokines in specific multiplexing assay conditions. An open GitHub repository contains all details of this study 24. Acknowledgment G.D.H. acknowledges research grants from Ministry of Economy and Competitiveness, MINECO, Spain (FEDER CTQ2016-74881-P) and Basque government (IT1045-16). This work was supported by the Collaborative Project in Genomic Data Integration (CICLOGEN)'' PI17/01826 funded by the Carlos III Health Institute from the Spanish National plan for Scientific and Technical Research and Innovation 2013-2016 and the European Regional Development Funds (FEDER) - “A way to build Europe”. This project was also supported by the General Directorate of Culture, Education and University Management of Xunta de Galicia ED431D 2017/16 and “Drug Discovery Galician Network” Ref. ED431G/01 and the “Galician Network for Colorectal Cancer Research” (Ref. ED431D 2017/23), and finally by the Spanish Ministry of Economy and Competitiveness for its support through the funding of the unique installation BIOCAI (UNLC08-1E-002, UNLC13-13-3503) and the European Regional Development Funds (FEDER) by the European Union. Additional support was offered by the Consolidation and Structuring of Competitive Research Units - Competitive Reference Groups (ED431C 2018/49), funded by the Ministry of Education, University and Vocational Training of the Xunta de Galicia endowed with EU FEDER funds. CR Munteanu acknowledges the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. References (1) Eberl, G.; Colonna, M.; Di Santo, J. P.; McKenzie, A. N. J. Innate Lymphoid Cells. Innate Lymphoid Cells: A New Paradigm in Immunology. Science 2015, 348 (6237), aaa6566. (2) Savioli, B.; Abdulahad, W. H.; Brouwer, E.; Kallenberg, C. G. M.; de Souza, A. W. S. Are Cytokines and Chemokines Suitable Biomarkers for Takayasu Arteritis? Autoimmun. Rev. 2017, 16 (10), 1071–1078. (3) Zhang, Y.; Zhang, Y.; Gu, W.; He, L.; Sun, B. Th1/Th2 Cell’s Function in Immune System. Adv. Exp. Med. Biol. 2014, 841, 45–65. (4) Kuchroo, V. K.; Nicholson, L. B. Cytokines in Autoimmune Disease. In Cytokines and Autoimmune Diseases; pp 389–406. (5) Gil, J. Multiple System Atrophy/Dysautonomia Complicated by Viral Infections, 26 ACS Paragon Plus Environment

Page 27 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

(6)

(7) (8)

(9) (10) (11) (12) (13) (14)

(15)

(16)

(17) (18)

(19)

Hashimoto’s and Other Autoimmune Disorders: Functional Neurology Approach and Applications. Front. Neurol. 2016, 7. https://doi.org/10.3389/conf.fneur.2016.59.00079. Lubberts, E.; van den Berg, W. B. Cytokines in the Pathogenesis of Rheumatoid Arthritis and Collagen-Induced Arthritis. In Cytokines and Chemokines in Autoimmune Disease; Santamaria, P., Ed.; Advances in Experimental Medicine and Biology; Springer US: Boston, MA, 2003; Vol. 520, pp 194–202. Salajegheh, A. Granulocyte-Macrophage and Granulocyte Colony Stimulating Factor (GMCSF and G-CSF). In Angiogenesis in Health, Disease and Malignancy; 2016; pp 127–132. Bijuklic, K.; Jennings, P.; Kountchev, J.; Hasslacher, J.; Aydin, S.; Sturn, D.; Pfaller, W.; Patsch, J. R.; Joannidis, M. Migration of Leukocytes across an Endothelium-Epithelium Bilayer as a Model of Renal Interstitial Inflammation. American Journal of Physiology-Cell Physiology 2007, 293 (1), C486–C492. Stewart, A.; Banerji, U. Utilizing the Luminex Magnetic Bead-Based Suspension Array for Rapid Multiplexed Phosphoprotein Quantification. In Methods in Molecular Biology; 2017; pp 119–131. Gaulton, A.; Hersey, A.; Nowotka, M.; Bento, A. P.; Chambers, J.; Mendez, D.; Mutowo, P.; Atkinson, F.; Bellis, L. J.; Cibrián-Uhalte, E.; et al. The ChEMBL Database in 2017. Nucleic Acids Res. 2017, 45 (D1), D945–D954. Bender, A. Databases: Compound Bioactivities Go Public. Nat. Chem. Biol. 2010, 6 (5), 309. Estrada, E.; Molina, E.; Uriarte, E. Quantitative Structure-Toxicity Relationships Using TOPS-MODE. 2. Neurotoxicity of a Non-Congeneric Series of Solvents. SAR QSAR Environ. Res. 2001, 12 (5), 445–459. Marzaro, G.; Chilin, A.; Guiotto, A.; Uriarte, E.; Brun, P.; Castagliuolo, I.; Tonus, F.; González-Díaz, H. Using the TOPS-MODE Approach to Fit Multi-Target QSAR Models for Tyrosine Kinases Inhibitors. Eur. J. Med. Chem. 2011, 46 (6), 2185–2192. Prado-Prado, F.; García-Mera, X.; Escobar, M.; Sobarzo-Sánchez, E.; Yañez, M.; RieraFernandez, P.; González-Díaz, H. 2D MI-DRAGON: A New Predictor for Protein-Ligands Interactions and Theoretic-Experimental Studies of US FDA Drug-Target Network, Oxoisoaporphine Inhibitors for MAO-A and Human Parasite Proteins. Eur. J. Med. Chem. 2011, 46 (12), 5838–5851. Riera-Fernández, P.; Martín-Romalde, R.; Prado-Prado, F. J.; Escobar, M.; Munteanu, C. R.; Concu, R.; Duardo-Sanchez, A.; González-Díaz, H. From QSAR Models of Drugs to Complex Networks: State-of-Art Review and Introduction of New Markov-Spectral Moments Indices. Curr. Top. Med. Chem. 2012, 12 (8), 927–960. Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; et al. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40 (Database issue), D1100–D1107. Heikamp, K.; Bajorath, J. Large-Scale Similarity Search Profiling of ChEMBL Compound Data Sets. J. Chem. Inf. Model. 2011, 51 (8), 1831–1839. Simón-Vidal, L.; García-Calvo, O.; Oteo, U.; Arrasate, S.; Lete, E.; Sotomayor, N.; González-Díaz, H. Perturbation-Theory and Machine Learning (PTML) Model for HighThroughput Screening of Parham Reactions: Experimental and Theoretical Studies. J. Chem. Inf. Model. 2018, 58 (7), 1384–1396. Gonzalez-Diaz, H. PTML: Perturbation-Theory Machine Learning Notes. In Proceedings of 27 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(20) (21) (22) (23) (24) (25) (26) (27) (28) (29) (30) (31) (32) (33) (34) (35) (36) (37) (38) (39) (40) (41)

Page 28 of 30

MOL2NET 2018, International Conference on Multidisciplinary Sciences, 4th edition; 2018. https://doi.org/10.3390/mol2net-04-05463. Pankratz, A. Forecasting with Univariate Box - Jenkins Models: Concepts and Cases; John Wiley & Sons, 2009. Jenkins, G. M. Autoregressive-Integrated Moving Average (ARIMA) Models. In Encyclopedia of Statistical Sciences; 2004. Cao, D.-S.; Xiao, N.; Xu, Q.-S.; Chen, A. F. Rcpi: R/Bioconductor Package to Generate Various Descriptors of Proteins, Compounds and Their Interactions. Bioinformatics 2015, 31 (2), 279–281. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12 (Oct), 2825–2830. kennethriva. kennethriva/Machine-Learning-for-drugs-cytokines https://github.com/kennethriva/Machine-Learning-for-drugs-cytokines (accessed Feb 8, 2019). Cover, T.; Hart, P. Nearest Neighbor Pattern Classification. IEEE Trans. Inf. Theory 1967, 13 (1), 21–27. Burges, C. J. C.; Others. Simplified Support Vector Decision Rules. In ICML; Citeseer, 1996; Vol. 96, pp 71–77. Cristianini, N. Fisher Discriminant Analysis (Linear Discriminant Analysis). In Dictionary of Bioinformatics and Computational Biology; 2004. Peduzzi, P.; Concato, J.; Kemper, E.; Holford, T. R.; Feinstein, A. R. A Simulation Study of the Number of Events per Variable in Logistic Regression Analysis. J. Clin. Epidemiol. 1996, 49 (12), 1373–1379. Patle, A.; Chouhan, D. S. SVM Kernel Functions for Classification. In 2013 International Conference on Advances in Technology and Engineering (ICATE); 2013; pp 1–9. Hastie, T.; Rosset, S.; Zhu, J.; Zou, H. Multi-Class AdaBoost. Stat. Interface 2009, 2 (3), 349–360. Friedman, N.; Geiger, D.; Goldszmidt, M. Bayesian Network Classifiers. Mach. Learn. 1997, 29 (2), 131–163. White, B. W.; Rosenblatt, F. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Am. J. Psychol. 1963, 76 (4), 705. Swain, P. H.; Hauska, H. The Decision Tree Classifier: Design and Potential. IEEE Transactions on Geoscience Electronics 1977, 15 (3), 142–147. Breiman, L. Random Forests. Mach. Learn. 2001, 45 (1), 5–32. Friedman, J. H. Stochastic Gradient Boosting. Comput. Stat. Data Anal. 2002, 38 (4), 367– 378. Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24 (2), 123–140. Chen, T.; Guestrin, C. Xgboost: A Scalable Tree Boosting System. Proceedings of the 22nd acm sigkdd international 2016. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521 (7553), 436–444. keras-team. keras-team/keras https://github.com/keras-team/keras (accessed Feb 8, 2019). Bradley, A. P. The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms. Pattern Recognit. 1997, 30 (7), 1145–1159. Goutte, C.; Gaussier, E. A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. In Lecture Notes in Computer Science; 2005; pp 345–359. 28 ACS Paragon Plus Environment

Page 29 of 30 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Chemical Research in Toxicology

(42) Murphy, K. P. Machine Learning: A Probabilistic Perspective. 2012. Cité en 2014, 117. (43) Vapnik, V. N. Estimation of Dependences Based on Empirical Data; Springer-Verlag, 1982. (44) Vapnik, V. N. The Nature of Statistical Learning Theory; 1995. (45) Freund, Y.; Schapire, R. E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. System Sci. 1997, 55 (1), 119–139. (46) Friedman, J. H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29 (5), 1189–1232. (47) Ng, A. Y. Feature Selection, L 1 vs. L 2 Regularization, and Rotational Invariance. In Proceedings of the twenty-first international conference on Machine learning; ACM, 2004; p 78. (48) Hosmer, D. W., Jr.; Lemeshow, S.; Cook, E. D. Applied Logistic Regression, Second Edition: Book and Solutions Manual Set; Wiley, 2001. (49) Thakur, A.; Mishra, V.; Jain, S. K. Feed Forward Artificial Neural Network: Tool for Early Detection of Ovarian Cancer. Sci. Pharm. 2011, 79 (3), 493–505. (50) Fabbretti, A.; Brandi, L.; Petrelli, D.; Pon, C. L.; Castañedo, N. R.; Medina, R.; Gualerzi, C. O. The Antibiotic Furvina® Targets the P-Site of 30S Ribosomal Subunits and Inhibits Translation Initiation Displaying Start Codon Bias. Nucleic Acids Res. 2012, 40 (20), 10366–10374. (51) Liptrott, N. J.; Penny, M.; Bray, P. G.; Sathish, J.; Khoo, S. H.; Back, D. J.; Owen, A. The Impact of Cytokines on the Expression of Drug Transporters, Cytochrome P450 Enzymes and Chemokine Receptors in Human PBMC. Br. J. Pharmacol. 2009, 156 (3), 497–508. (52) Renton, K. Cytochrome P450 Regulation and Drug Biotransformation During Inflammation and Infection. Curr. Drug Metab. 2004, 5 (3), 235–243. (53) Osna, N. A.; Clemens, D. L.; Donohue, T. M., Jr. Interferon Gamma Enhances Proteasome Activity in Recombinant Hep G2 Cells That Express Cytochrome P4502E1: Modulation by Ethanol. Biochem. Pharmacol. 2003, 66 (5), 697–710. (54) Li, X.-Y.; Zhang, C.; Wang, H.; Ji, Y.-L.; Wang, S.-F.; Zhao, L.; Chen, X.; Xu, D.-X. Tumor Necrosis Factor Alpha Partially Contributes to Lipopolysaccharide-Induced Downregulation of CYP3A in Fetal Liver: Its Repression by a Low Dose LPS Pretreatment. Toxicol. Lett. 2008, 179 (2), 71–77. (55) Wang, R.-F. Innate Immune Regulation and Cancer Immunotherapy; Springer Science & Business Media, 2012. (56) Miller, C. M. D.; Smith, N. C.; Ikin, R. J.; Boulter, N. R.; Dalton, J. P.; Donnelly, S. Immunological Interactions between 2 Common Pathogens, Th1-Inducing Protozoan Toxoplasma Gondii and the Th2-Inducing Helminth Fasciola Hepatica. PLoS One 2009, 4 (5), e5692. (57) Lee, B.-H.; Cheng, Y.-H.; Wu, S.-C. Graptopetalum Paraguayense Ameliorates Airway Inflammation and Allergy in Ovalbumin- (OVA-) Sensitized BALB/C Mice by Inhibiting Th2 Signal. Evid. Based. Complement. Alternat. Med. 2013, 2013, 237096. (58) Mitchell, R. E.; Hassan, M.; Burton, B. R.; Britton, G.; Hill, E. V.; Verhagen, J.; Wraith, D. C. IL-4 Enhances IL-10 Production in Th1 Cells: Implications for Th1 and Th2 Regulation. Sci. Rep. 2017, 7 (1), 11315. (59) Okoye, I. S.; Wilson, M. S. CD4+ T Helper 2 Cells--Microbial Triggers, Differentiation Requirements and Effector Functions. Immunology 2011, 134 (4), 368–377. (60) Ansel, K. M.; Djuretic, I.; Tanasa, B.; Rao, A. Regulation of Th2 Differentiation and Il4 29 ACS Paragon Plus Environment

Chemical Research in Toxicology 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

(61) (62) (63) (64) (65) (66) (67) (68) (69) (70) (71) (72)

Page 30 of 30

Locus Accessibility. Annu. Rev. Immunol. 2006, 24, 607–656. Diehl, S.; Rincón, M. The Two Faces of IL-6 on Th1/Th2 Differentiation. Mol. Immunol. 2002, 39 (9), 531–536. Obremski, K. The Effect of in Vivo Exposure to Zearalenone on Cytokine Secretion by Th1 and Th2 Lymphocytes in Porcine Peyer’s Patches after in Vitro Stimulation with LPS. Pol. J. Vet. Sci. 2014, 17 (4), 625–632. Perdomo-Celis, F.; Salgado, D. M.; Narváez, C. F. Levels of Circulating Tumor Necrosis Factor-α in Children with Symptomatic Dengue Evaluated by ELISA and Bead-Based Assays. Viral Immunol. 2017, 30 (1), 45–53. Molgora, M.; Supino, D.; Garlanda, C. Regulation of Immunity and Disease by the IL-1 Receptor Family Members IL-1R2 and IL-1R8. Immunopharmacology and Inflammation. 2018, pp 225–246. https://doi.org/10.1007/978-3-319-77658-3_10. Mittal, M.; Siddiqui, M. R.; Tran, K.; Reddy, S. P.; Malik, A. B. Reactive Oxygen Species in Inflammation and Tissue Injury. Antioxid. Redox Signal. 2014, 20 (7), 1126–1167. Kalliolias, G. D.; Ivashkiv, L. B. TNF Biology, Pathogenic Mechanisms and Emerging Therapeutic Strategies. Nat. Rev. Rheumatol. 2016, 12 (1), 49–62. Xi, L.; Wang, C.; Chen, P.; Yang, Q.; Hu, R.; Zhang, H.; Weng, Q.; Xu, M. Expressions of IL-6, TNF-α and NF-κB in the Skin of Chinese Brown Frog (Rana Dybowskii). European Journal of Histochemistry. 2017. https://doi.org/10.4081/ejh.2017.2834. Chen, L.; Deng, H.; Cui, H.; Fang, J.; Zuo, Z.; Deng, J.; Li, Y.; Wang, X.; Zhao, L. Inflammatory Responses and Inflammation-Associated Diseases in Organs. Oncotarget 2018, 9 (6), 7204–7218. Sultani, M.; Stringer, A. M.; Bowen, J. M.; Gibson, R. J. Anti-Inflammatory Cytokines: Important Immunoregulatory Factors Contributing to Chemotherapy-Induced Gastrointestinal Mucositis. Chemother. Res. Pract. 2012, 2012, 490804. Siebert, J. C.; Inokuma, M.; Waid, D. M.; Pennock, N. D.; Vaitaitis, G. M.; Disis, M. L.; Dunne, J. F.; Wagner, D. H., Jr; Maecker, H. T. An Analytical Workflow for Investigating Cytokine Profiles. Cytometry A 2008, 73 (4), 289–298. Galbiati, V.; Mitjans, M.; Corsini, E. Present and Future Ofin Vitroimmunotoxicology in Drug Development. Journal of Immunotoxicology. 2010, pp 255–267. https://doi.org/10.3109/1547691x.2010.509848. Kopitar-Jerala, N. The Role of Interferons in Inflammation and Inflammasome Activation. Front. Immunol. 2017, 8, 873.

30 ACS Paragon Plus Environment