Computational Discovery and Experimental ... - ACS Publications


Aug 18, 2016 - that recognize pathogen associated molecular patterns. They play a critical ... the pathogen associated microbial patterns (PAMPs).1 TL...
0 downloads 0 Views 962KB Size


Subscriber access provided by Northern Illinois University

Article

Computational Discovery and Experimental Confirmation of TLR9 Receptor Antagonist Leads Maria Zatsepin, Angela Mattes, Steffen Rupp, Doris Finkelmeier, ARIJIT BASU, Anke Burger-Kentischer, and Amiram Goldblum J. Chem. Inf. Model., Just Accepted Manuscript • DOI: 10.1021/acs.jcim.6b00070 • Publication Date (Web): 18 Aug 2016 Downloaded from http://pubs.acs.org on August 21, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

Journal of Chemical Information and Modeling is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.

Page 1 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Computational Discovery and Experimental Confirmation of TLR9 Receptor Antagonist Leads

Maria Zatsepin1, Angela Mattes2, Steffen Rupp2, Doris Finkelmeier2, Arijit Basu1, Anke Burger-Kentischer2*and Amiram Goldblum1* 1

Molecular Modeling and Drug Design, Institute for Drug Research, The Hebrew University of

Jerusalem, Israel 91120 and

2

Fraunhofer IGB Institute for Interfacial Engineering and

Biotechnology, Nobelstr. 12, Stuttgart, Germany 70569. * These authors contributed equally to this research

Abstract Toll-like receptors (TLR) are receptors of innate immunity that recognize Pathogen Associated Molecular Patterns. They play a critical role in many pathological states, in acute and chronic inflammatory processes. TLR9 is a promising target for drug discovery, since it has been implicated in several pathologies, including defense against viral infections and psoriasis. Immune-modulators are promising molecules for therapeutic intervention in these indications. TLR9 is located in the endosome and activated by dsDNA with CpG motives encountered in microbial DNA. Here we report on a combined approach to discover new TLR9 antagonists by computational chemistry and cell based assays. We used our in-house Iterative Stochastic Elimination (ISE) algorithm to create models that distinguish between TLR9 antagonists ("actives") and other molecules ("inactives"), based on molecular physico-chemical properties. Subsequent screening and scoring of a dataset of 1.8 million commercially available molecules led to the purchasing of top scored molecules, which were tested in a new cell based system based on human PRRs stably expressed in NIH3T3 fibroblasts. As described previously, this cell line shows a very low endogenous PRR-activity and contains a reporter gene which is selectively activated by the

1 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 2 of 32

integrated human PRR enabling rapid screening of potential ligands. IC50 values of each of these top scored molecules were determined. Out of 60 molecules tested, 56 showed antagonistic activity. We discovered 21 new highly potential antagonists with IC50 values lower than 10 µM, with five of them having IC50 values under 1µM.

Introduction Toll-like receptors (TLRs) are part of the Pattern Recognition Receptors (PRRs) by which our innate immune system senses the invasion of pathogens. These receptors recognize specific molecular patterns that are present in microbes and viruses – the pathogen associated microbial patterns (PAMPs). 1 TLR9 is a promising target for drug discovery, since it has been implicated in several pathologies, including defense against viral infections but also in development of sepsis, atopic dermatitis and psoriasis. Human TLR9 is a type I transmembrane protein and is characterized by an extracellular N-terminal ectodomain (ECD) consisting of heavily glycosylated leucine-rich repeats (LRR) folded into tandem copies and a cytoplasmic TIR domain for signal transmission. The signaling complex consists of an m-shaped TLR9 dimer.2 The TLR9 extracellular domain binds unmethylated CpG, short viral or bacterial DNA motifs which are its agonists. TLR9 activation leads to NF-κB activation and induces positive regulation of cytokine secretion, initiating immune response. The protein is relocalized from endoplasmic reticulum to endosome and lysosome upon stimulation with these agonists.

3

In order to

modulate immune response in a defined way, specific modulators of the individual PRRs are necessary. Identification of modulators directly targeting individual PRRs is problematic since most naturally occurring cells either express many different PRRs or show almost no expression, resulting in undefined or low responses. In order to screen for modulators specifically addressing individual PRRs we developed a system based on NIH3T3 cells, 4 stably expressing the human TLRs individually or in combinations of two to three receptors at a high level, including TLR9. In addition these cell-lines contain an NF-κB driven reporter gene, enabling the detailed study of individual PRR-ligand interaction signaled through NF-κB. We combined this screening method with our awarded

5

complex combinatorial classification algorithm, now called “Iterative

Stochastic Elimination” algorithm (ISE)

6

in order to create models that distinguish between 2

ACS Paragon Plus Environment

Page 3 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

TLR9 antagonists ("actives") and other molecules ("inactives"), based on molecular physicochemical properties.

As indicated above, TLRs not only sense microbial or viral invasion but also can be activated by endogenous molecules as well as by low molecular weight synthetic compounds. Given the role of the innate immune machinery to provoke inflammation, TLRs signaling has been suggested to be involved in the development of many acute and chronic (auto-) inflammatory processes of allergic and dermatologic diseases. 7 Antagonists therefore are thought to have a therapeutic role in suppressing overactive immune responses, as it occurs in chronic inflammatory and autoimmune disease. Agonists of the TLRs, on the other hand, would be immune system enhancers and have been proposed to be useful in the treatment of cancer and infectious diseases. 7,8

TLR9 is expressed within different skin cells types

7

and has been reported to be

associated with skin infections like Herpes simplex/Varicella zoster, Candida spp, 9 Verruca and Molluscum as well as diseases with auto-immune character including Atopic dermatitis, Psoriasis, Lichen Planus, Lupus Erythematosus, and even as having an exacerbation effect on basal cell carcinoma. 7 TLR9 antagonists are claimed to be useful in treating diseases that are characterized by an unwanted and excessive immune response. Diseases of this type include the autoimmune diseases mentioned above, transplant rejection, and sepsis. Several TLR9 agonists and antagonists are already under preclinical developments. 8,10

In this paper, we report the computational discovery of a large set of novel and effective TLR9 antagonists and the confirmation of their binding and antagonistic activity in cell based in vitro experiments. Ligand based modeling by ISE led to exceptional results, as 56 out of 60 molecules sent for experimental validation were confirmed as antagonists. The 60 were picked in two iterations, an initial and an optimized one of 30 molecules each, by screening 1.8 million commercially available molecules through ISE models and computational solubility evaluations. In the experimental in vitro tests, we discovered 21 new highly potential antagonists with IC50 values lower than 10 µM, with five of them having IC50 values under 1µM. The top antagonists are shown in Figure 1. 3 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Figure 1. The five top TLR9 antagonists discovered. None of them are mentioned in the literature or in patents

The use of computational approaches for drug discovery and design entered the mainstream of medicinal chemistry as a result of the developments in computer hardware and of the availability of crystal structures of protein targets.11 Experimental results for protein structures, in particular of protein-ligand complexes, are very useful for computational studies that can accelerate discovery.

12

Such structures allow several "structure based" approaches (such as docking and

pharmacophore) to be performed in order to discover or design ligands for binding to these proteins. A few Toll-Like Receptor structures have been elucidated and are used for predictions and modeling in silico.

13

However the crystal structure of TLR9 was not known during this

research and therefore the "ligand based" strategy has been applied throughout. Among other "ligand based" methods, Quantitative Structure Property Relations (QSPR) and Classification methods are well known and documented.

14

QSPR focuses on searching

relations and correlations between computed properties of molecules and their known biological activities. It is possible in some cases to obtain linear relationships between ligand properties and their known experimental actions, and use the equations (that were formed by employing a small number of molecules) as the basis for predicting the quantitative effects of many others. In classification algorithms in this field, the focus is on discovering the crucial properties that differentiate between active and inactive molecules,

15

or between highly active and weakly

active molecules. In general – that is differentiation between two classes of molecules. It is usually assumed that classification methods enable to make only binary decisions. However, some classification algorithms produce several classifications in each problem, and those may serve to refine the binary decision by having a "score" (index) for the chance of a molecule to be in one or another class.

4 ACS Paragon Plus Environment

Page 4 of 32

Page 5 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

We developed in the last decade a generic algorithm for classification that produces a large set of results, which enables a quantification of the predictions and is mostly used to solve highly complex combinatorial problems. The algorithm, called "Iterative Stochastic Elimination" (ISE) 16 was the main tool used in our current search for novel and effective TLR9 antagonists. A review of the ISE method and its application for many different problems has been published recently, 6 and we present here only the application of ISE to the discovery of TLR9 antagonists.

Methods In Silico: Source and Preparation of the Model Building Dataset For building an appropriate ISE model with a high potential for discovery, a substantial amount of known and diverse active ligands of TLR9 were retrieved from ChEMBL database, scientific literature,18 from PubChem Bio Assays

19

17

from

and from several patents.20-22 These

molecules were considered to be the class of "actives". The main idea was to find specific small (molecular mass was limited to MW 1300) inhibitor molecules that are known to bind to the receptor and have measured IC50 values. Standardization of the obtained structures was performed using the WASH application of the MOE software. 23 The removal of minor molecular components (such as anions in salts) was performed by calculating the number of heavy atoms in all components of a single structure, and removing the minor structures. The protonation state was adjusted to near-neutral aqueous environment. We added to the training (learning) set ten thousand molecules chosen randomly from the Enamine database

9

to act as the "inactive" class. Molecules were picked by limiting some of

their properties based on the idea of an "applicability domain" so that these properties will not differ very much from those of the "actives", in order not to skew the models. These properties were MW, calculated logP, number of hydrogen bond donors and the number of hydrogen bond acceptors. The average values of the active molecules plus/minus two standard deviations were the range of these 4 properties that limited the choice of the randomly picked molecules. The

5 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

assumption is that, statistically, most of those random chosen molecules are inactive although they have not been tested.

To exclude similar active molecules which might bias the results, a "Tanimoto" similarity filtering was performed, allowing a maximal similarity of 0.7, lower than the "traditional" 0.85, which is the lower limit for considering molecules to be highly similar. Of each pair of molecules that are highly similar, the one with the higher IC50 (lower activity) was kept. However, the results of modeling with the full set of actives were compared (vide infra) to the results of the "Tanimoto filtered" set.

Descriptors for both of the active sets (“filtered” and “full”) and for the random molecules were calculated using MOE.

23

Only 2D descriptors were included. Highly correlated

descriptors (in these specific sets) as well as descriptors with low standard deviations were eliminated. 6 A five-fold random partition

14

to training and test sets was used throughout. Results are

considered for only the test sets, each appearing once along the study and thus the final results are for all the actives, considered as a test set.

Evaluation of the Models A model consists of a large set of filters and each filter has an associated score for its ability to distinguish between the actives and inactives in the training set. Each molecule of the test set is screened through the model to obtain a cumulative score which reflects its ability to pass all, most, or part of the filters. A molecule gets a positive score if it passes a filter successfully, or a negative score if it does not. All filters have only positive "weights" from 0 to 1, and all scoring ends up with a number that is normalized by the total number of filters. This score ("Bioactivity Index") is related to the probability of a given molecule to be active. The index ranges between 1 (highest probability to be active) and -1 (highest probability to be inactive).

Solubility In silico Due to its important role in the determination of drugs’ ADMET properties, aqueous solubility is central for in vitro screening assays and therefore for the decisions of purchasing molecules. 6 ACS Paragon Plus Environment

Page 6 of 32

Page 7 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Measuring aqueous solubility, especially the intrinsic solubility, is a low-throughput experiment, therefore there is a great need for computerized evaluation of the solubility of commercially available molecular libraries and of molecules that are virtually synthesized. That is the reason for the existence of many such algorithms, none of which supply a definite and validated evaluation to a very wide range of structures, which result from ISE screening. Therefore, for solubility evaluation we use five different solubility models, each being calculated by a different algorithm. With 5 methods it was possible to look for a "consensus". The VCCLAB 24 program calculates aqueous solubility (ALOGpS) based on 1291 known soluble molecules and provides aqueous solubility predictions using Artificial Neural Networks. The LMMD

25

software developed solubility calculations,

26

using two aqueous solubility

models: ASMS (aqueous solubility based on molecular surface) and ASMS-LOGP (aqueous solubility based on molecular surface using ClogP as a descriptor). The well-known SciFinder 15 software (of the American Chemical Society) calculates Molar Intrinsic Solubility via Advanced Chemistry Development (ACD/Labs) Software V11.02. Additional solubility values were retrieved with MOE program 23 and Enamine software. 9 The best candidates were tested for solubility by "voting", i.e., looking for the majority of calculated solubilities from the above algorithms as well as their averages and standard deviations. Structure based attempts TLR9 structures from horse and mouse sources were published only after the termination of this work, including complexes with DNA fragments as agonists and antagonists.27 We studied those structures and attempted to test them for docking at the request of a reviewer. However, those turned not to be useful for any conclusions due to the huge size of the DNA antagonists and their large interaction surface with the proteins. Mutations with respect to human TLR9 were another source of complication which prevented the usefulness of docking (see supporting information paragraphs under “Docking of ligands to horse TLR9 receptor”).

7 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 8 of 32

In vitro: Screening of the Selected Compounds for Activity and Biocompatibility Using a Cell-Based TLR9-Reporter Assay A general setup for a cell based test-system that allows for the identification of modulators of pattern recognition receptors (PRRs) via a HTS-compatible assay was developed previously (DE 10 2006 031 483, EP 2 041 172, 4) and adapted to TLR9 monitoring in this work. It was used in this study to verify the in silico results generated. For this assay human TLR9 was stably transfected and expressed in NIH3T3 fibroblasts as described. 4 This cell line shows a very low endogenous PRR-activity and was equipped with a reporter gene which is activated by the human TLR9 stably expressed in the cell line, as described previously.

4,24

We used this cell-

based reporter system to validate experimentally the potential of the TLR9 antagonists of the 60 top candidates identified. Furthermore, their biocompatibility was tested in parallel phenotypically. The reporter plasmid-only control cell line was used as a control to measure the background of alkaline phosphatase (SEAP) expression of the reporter cell line and ensure that the effect observed in the experimental reaction was caused by the compound applied. To demonstrate the specific induction and blocking of the TLR9 reporter cell line, experiments using a known non inducing control ligand TLR9 (ODN2216c, Invivogen) and a known TLR9 antagonist (ODN TTAGGG, A151, Invivogen) were carried out. For the screening experiments the reporter cell line expressing TLR9 and the control cell line were plated in 96-well cell culture plates at a density of 0,3x105 cells in DMEM media supplemented with 10% FCS (Invitrogen), 1% L-glutamine (200mM Invitrogen) and 1% Penicillin-Streptomycin (PenStrep: 100x, Invitrogen). The cells were incubated for 16h to grow adherent monolayers at 37°C in a 5% saturated CO2 humidified atmosphere. After overnight incubation, the media was removed and replaced by DMEM medium supplemented with 1% PenStrep, 1%L-glutamine and 0.5% FCS. For analyzing TLR9-modulators the cells were stimulated with the ligand ODN2216 (Invivogen), specific to TLR9 at a defined concentration, activating the reporter cell line.

25

In parallel the potential antagonists, including the known

TLR9 antagonist (ODN TTAGGG, A151, Invivogen), were added in a dose response manner. Using this approach we are able to monitor blocking of the activation induced by ODN2216. The cells were again incubated in a humidified atmosphere for 16h to allow expression of SEAP. Binding of the antagonist leads to a block of the TLR9 signaling, resulting in no or in a reduced 8 ACS Paragon Plus Environment

Page 9 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

expression of the reporter gene (SEAP). SEAP was measured by transferring 50 µl supernatant from the wells into a 96-well PS microplate flat bottom (Greiner BIO-ONE). The enzyme substrate reaction was initiated by adding 50µl of p-Nitrophenyl Phosphate (pNPP). Enzyme activity was detected photometrical, using an UV-vis reader at 405 nm. Two to three independent experiments were performed in triplicates and the IC50 was calculated from the dose response curves. The IC50 of the data set was determined as the point of intersection where the concentration equals 50 % absorption of the positive control. The values were presented as mean ± Standard error of the mean.

Cytotoxicity assays: Throughout the experimental phase described above the cell monolayer is examined microscopically to rule out toxic effect of the compounds. In addition, we use Fluorescein diacetate (FDA Fluorescein-diacetate (FDA, F-7378 SIGMA, MW 416.4) as a widely used quenched fluorescent dye to count viable cells and analyze their vitality. To asses viability of the cells after completion of the TLR-reporter assay the following assay was performed. After the transfer of the supernatant and the final TLR-assay measurement, the wells of the MP containing the cells were washed with phosphate-buffered saline (PBS, 200 µl) and then filled with 200 µl PBS containing 10 µg/ml FDA. After 45-min incubation at room temperature (RT), fluorescence of fluorescein (RFU [relative fluorescence units]) was measured (485 nm excitation and 538 nm emission wavelength). Data are presented in the supporting information as Table S1.

Results Data Preparation In a first step 239 active ligands which block Homo sapiens TLR9 receptor with IC50 values in 17

Additional 59

20-22

Eventually, the

the range of 20nM to 49782nM were retrieved from ChEMBL database. inhibitors with values in the range of 3.4nM- 250nM were found in patents.

total amount of 289 known inhibitors (excluding compounds without specified IC50 values) were “washed” of counterions and prepared as the “actives” database.

9 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 10 of 32

For the “inactive” class of the training set, molecules from the Enamine database 28 were picked randomly after filtering the database according to the "Applicability Domain" of the actives, which is the average value of specific parameters plus/minus two standard deviations from those average values. In order to avoid bias due to molecules that are similar (Tanimoto coefficient = 1) we used only the active molecules that have Tanimoto values of 0.7 and less vis-a-vis each other, so that they are substantially different. This filtration resulted in having only 151 molecules including just 7 molecules with IC50 below 1500 nM. The decision was then to build two ISE models with two different initial sets: the first set - the Tanimoto “filtered” set with 151 active molecules, and the second "full" set with all the 289 actives without any similarity filtering, which includes 54 molecules with IC50 values below 400 nM. Two dimensional descriptors for the 151 active molecules of the “filtered” set and for the 289 active molecules of the “full” set were calculated. In order to avoid bias and difficulties in interpretations, highly correlated descriptors were eliminated based on the Correlation Matrix of descriptors' values. Subsequently, 118 descriptors remained (out of an initial set of ~180 2D descriptors) in the "filtered" set and 108 descriptors remained in the "full" set of active molecules. These remaining descriptors were subsequently calculated for the ten thousand random "inactives" taken from the Enamine database.

ISE Models: Generation and Optimization We generated two ISE models: i) model made from the entire set of active molecules and ii) model made from molecules having Tanimoto cut off < 0.7. A five-fold partition was employed in generating both ISE models. As a first level of validation, filters generated for the learning sets (which include all the 9999 inactives and the 289 actives ("full set") or 151 actives ("Tanimoto filtered set") of the five-fold modeling process were combined and Indexes by the ISE model were given to each molecule in each of the models. Results for one fold (out of 5) of the "Tanimoto filtered set" are shown in Table 1 and results for the full set (1 fold) are shown in Table 2. In both these tables, each ISE index ("border" columns) represents a separation between those molecules that are "negatives" (i.e., assumed to be inactives) with lower indexes than the "border" number, while those that are with higher indexes than that "border" number are assumed to be actives. This leads also to the clear 10 ACS Paragon Plus Environment

Page 11 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

assignment of "true'" and "false" actives and inactives. For example, in Table 1, the column at an ISE index of 0.3 shows that 9756 (out of the total 9999 "inactives") are found with indexes that are lower than 0.3 (Thus, at this separation, they will be assumed to be TN, true negatives). The rest, 243 inactives, have higher indexes than 0.3 and are thus wrongly assumed to be actives, and assigned as false positives (FP). For the actives, there are 117 found at that level with lower indexes than 0.3, thus being false negatives (FN) while 34 are above that index values, thus true positives (TP). From those 4 numbers it is possible to construct the Matthews Correlation Coefficient (MCC) value 29 which gives an idea of the quality of that particular separation line. “Enrichment” is not easily evaluated in these models due to two possible interpretations of that term. In the tables, we present “enrichment” limited to the molecules that we used in our learning sets alone. The "enrichment" at each separation line in the tables depends on TP/(TP +FP) (true positives out of the total positives) divided by the chance to find actives, which is the ratio of the actives to the total training set. Thus at the 0.3 index line of Table 1, the enrichment is (34/34+243)/ (151/10149) = 8.2. It is increased at the separation line of index 0.5 to 10.1 and to 10.9 at the top line with an index = 0.6. A wider aspect of enrichment is elaborated upon in the discussion. Another way to represent the quality of the classification of the models is by the MCC values within the tables. MCC values are a standard measure of quality in machine learning, measuring the quality of binary (two class) classification models.

11 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 12 of 32

Table 1. Numbers of molecules in the Tanimoto filtered model (151 actives), one fold*

TN FN TP FP Enrich. MCC

0 9500 92 59 499 7.1 0.181

0.1 9595 102 49 404 7.3 0.1668

0.2 9683 108 43 316 8.0 0.1659

0.3 9756 117 34 243 8.2 0.1499

0.4 9841 123 28 158 10.1 0.1531

0.5 9903 134 17 96 10.1 0.1188

0.6 9968 145 6 31 10.9 0.0736

0.7 9999 151 0 0 -

*Numbers on the top line indicate values of ISE indexes of 0.1, 0.2 etc. and each column assumes that the ISE index ("border") separates between actives (with higher indexes) and inactives (with lower indexes)

Table 2. Numbers of molecules in the "Full" model (289 actives), one fold*

TN FN TP FP Enrich. MCC

0 9831 217 72 168 10.7 0.255

0.1 9856 226 63 143 10.9 0.2403

0.2 9901 237 52 98 12.3 0.2345

0.3 9927 242 47 72 14 0.2402

0.4 9947 253 35 52 14.3 0.2096

0.5 9959 277 12 40 8.4 0.0885

0.6 9999 289 0 0 -

*Numbers on the top line indicate values of ISE indexes of 0.1, 0.2 etc. and each column assumes that the ISE index ("border") separates between actives (with higher indexes) and inactives (with lower indexes)

For the "full" set, with no filtering based on diversity, the results are a bit better in terms of enrichment. At the separation level 0.4, the enrichment is ~14. It is lower in the next index level of 0.5, going down to ~8. If all the filters, about 2500 from the 5-fold runs are combined, and the full learning set is screened through those filters, we get an interesting result (Figure 2 and Table 3), taking the "separating line" at an index of 0.4.

12 ACS Paragon Plus Environment

Page 13 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 2. ISE indexes for the full training set of 10,289 molecules. Indexes of all have been assigned in 5 iterations. Known actives are presented as rectangles and inactives are shown by circles. It is clear that the numbers of false positives (circles with higher ISE indexes) becomes smaller at higher indexes. The Y-axis simply enumerates the full set from 1 to 10289 with no significant order. Table 3. MCC scores for the “full” set for all the filters combined*

TN FN TP FP Enrich MCC

0 9335 210 79 65 19.5 0.375

0.1 9473 225 64 27 25 0.386

0.2 9987 236 53 13 28.6 0.377

0.3 9996 252 37 4 32 0.335

0.4 10000 260 29 0 35.6 0.313

0.5 10000 289 0 0 -

*Numbers on the top line indicate values of ISE indexes of 0.1, 0.2 etc. and each column assumes that the ISE index ("border") separates between actives (with higher indexes) and inactives (with lower indexes)

The differentiation between TP and FP is substantial. There are 29 true positives, and no false positives above the index of 0.4, when the five sets of filters from the 5-folds are combined. The MCC values for each threshold of the “full” set (Table 2) are higher than those of the “filtered” set (Table 1). Also, enrichment values (above 35 at the index value of 0.4 at Table 3) are larger than any of those in the one-fold models above. Filters of the “Tanimoto” (all 5-folds)

13 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 14 of 32

set gave false predictions for 26 out of the 41 best rated molecules (not shown). Therefore we decided to proceed with the filters of the “full” model for screening molecules.

Screening of External Databases The Enamine database containing about 1.8 million molecules was virtually screened using the ISE "full" model. The scan finds a large amount of 1789 molecules with activity indexes above 0.2 as seen in Figure 3. By using the "applicability domain" of known TLR9 inhibitors, the numbers were reduced, as shown in Figure 4.

Figure 3. Numbers of molecules found by screening through the "full" model filters, above each ISE index value. These molecules were further filtered using the applicability domain of the TLR9 inhibitors. The results are shown in Figure 4.

14 ACS Paragon Plus Environment

Page 15 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Figure 4. ISE indexing results after applying the applicability domain filtering

Molecules above an index of 0.4 are especially interesting, as we have not found any false positives above that index in our model. However, another major issue is the solubility of these molecules. For the 68 molecules with ISE indexes above 0.4, aqueous solubility values were calculated using five different software, as described under METHODS. A standard deviation for the different solubility values for each candidate was calculated. Thirty molecules with good calculated solubility (for at least 3 out of 5 values) and having low standard deviations were selected for in vitro testing (Supporting information Table S2).

Experimental validation of TLR9 modulation by the selected molecules In a first step it was shown that the TLR9 reporter cell line is specifically inducible. This is illustrated in Figure 5A. The TLR9 cell line was activated with the specific ligand ODN2216 (a defined oligo nucleotide) using different concentrations, showing a strong, dose dependent response. A control ligand ODN2216c (Invivogen) with a slightly different sequence did not elicit any significant reaction of the TLR9 reporter cell line, indicating it´s specificity. The control cell line containing only the reporter gene without TLR9, showed no signal after adding ODN2216. Furthermore, we could show in Figure 5B that the reporter cell line TLR9 can be also blocked specifically. For this purpose a known TLR9 antagonist (A151, Invivogen, ODN

15 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

TTAGGG) was used. The TLR9-blocking ligand is able to neutralize the stimulatory effect of the ODN2216 (Figure 5B) in the reporter cell line, showing its suitability for validating TLR9 agonists and antagonists.

Figure 5. Incubation of the TLR9 reporter cell line with the specific ligand ODN2216 and with the control ligand ODN2216c (Invivogen) in different concentrations (5A). The reporter cell line TLR9 showed activation by using ODN2216, but could not be activated with the control ligand. With the control SEAP cell line no ODN2216 signal could be measured. The reporter cell line TLR9 can be blocked specifically using the TLR9 ODN TTAGGG antagonist (A151, Invivogen, 5B).

Experimental characterization of the potential TLR9 antagonists The inhibition of TLR9 by the potential antagonists was performed in the presence of the nonmethylated CpG ODN 2216 agonist. The ODN ligand has the sequence 5’ggGGGACGA:TCGTCgggggg-3’. Those represented by capital letters are phosphodiesters, and those in lower case are phosphorothioate bonds. The Palindrome is underlined. To this activated system the selected compounds were added within a range of concentrations starting with 0.025 µM up to 14 mM. TLR9-signalling cascade activity was measured as described above. The in vitro testing of the first set of 30 purchased candidates for TLR9 inhibition resulted in 26 of these 30 candidates showing TLR9 inhibition activity, 9 molecules with IC50 values in the range of 7.27-30.34 µM (Supporting information Table S3, structures and indexes are shown in supporting information Table S4), 3 of those having IC50 values under 10µM. Due to the high values of IC50, further improvement of those candidates is required if they should be used as potential therapeutics. The distribution of IC50 values for all 30 screened compounds is shown in Figure 6a. 16 ACS Paragon Plus Environment

Page 16 of 32

Page 17 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

a)

b)

Figure 6. a) IC50 distribution for the best 30 compounds, initial discovery through ISE model b) IC50 distribution for the compounds identified through the optimized models

ISE Optimization Models To improve our initial results, we constructed two additional ISE models. That was achieved by two different constructions of the learning sets (Figure 6b shows the IC50 distribution of compounds of the optimization models).

17 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Model 1: May be titled a "Best –Rest" model. The 53 inhibitors with top IC50 values of 3250nM were used as the set of actives, accompanied by 10,000 randomly picked molecules that were used as inactives. Filters with 5 descriptor ranges each were picked out of 114 different descriptors (Supporting information Table S5) and 5 final filters remained. Those filters were used to distinguish between the highly active ligands ("Best") and random molecules ("Rest"). Due to the small number of filters, the possible values of ISE indexes are limited (Figure 7): each model could produce only six values for the molecular indexes, which are: -0.99, -0.59, -0.19, 0.19, 0.59 and 0.99. The models were tested on all the initial active and random molecules. The model gave only four FP above the ISE index of 0.0, and no FN below 0.0. MCC above the 0.5 index was 0.99.

Figure 7. ISE Indexing of the "Best-Rest" Model (rectangles – actives; circles – inactives)

Model 2: The "Best" 53 highly active ligands are the class of actives, while the molecules with weak activity (236 molecules with IC50 above 3 µM) are the other class. There is no need to assume inactives, as all activities are known in this case of a "Best-Worst" model. Again, 114 descriptors were used for model construction which terminated with only 5 final filters. The model gave 3 FPs above 0, no FN below 0 and MCC for the 0.5 index of 0.98.

18 ACS Paragon Plus Environment

Page 18 of 32

Page 19 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

The 883 molecules which scored over 0.2 in the previous ISE model and were filtered by applicability domain were scanned by both the "Best-Rest" and (subsequently) by the "BestWorst" models. Solubility was evaluated as before. Eventually, 30 molecules of these two optimized models were selected for purchase and sent to in vitro tests (Supporting information Table S6. Structures are depicted in supporting information Table S7). Inhibition was tested in the presence of 5µM of the ODN 2216 agonist. As examples for measuring the activity of the antagonists, results for the molecules T6683896, T5570245, T5581953 and T5669070 are shown in Figure 8 A-D below. These molecules are shown in Figure 9 as B4, B3, B2 and B1, respectively, and their IC50 values are presented in the caption of Figure 9.

Figure 8. The TLR9 reporter cell line is shown in comparison to the control SEAP cell line without TLRs. The OD at 405 nm is presented for 5µM ODN2216 with indicated concentrations of four different antagonists in comparison to positive (5

µM ODN2216) and negative (without induction of the reporter cell line) controls. There is dose dependent reduction of the signal: increased antagonist concentration leads to a reduced ODN2216 signal. IC50 results were measured in presence of ODN 2216 Agonist (A) Antagonist T6683896 IC50 = 0.89 µM (B) Antagonist T5570245, IC50 = 0.8 µM. (C) Antagonist T5581953 IC50 = 0.2 µM. (D) Antagonist T5669070 IC50 = 0.05 µM. Two to three independent experiments were performed in triplicates and the IC50 was calculated from the dose response curves shown. The IC50 of the data set was determined as the point of intersection where the concentration equals 50 % absorption of the positive control. The values were presented as mean ± standard error of the mean.

19 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 20 of 32

All 30 compounds tested in this optimized iteration showed inhibitory activity, the highest IC50 was determined as 274 µM and the average IC50 value is 37.6µM (without the 4 highest values the average is 11µM). 18 out of the 30 compounds showed stronger inhibitory capacities (below 10µM). The IC50 values of 11 best rated molecules by both (Best-Rest and Best-Worst) of the optimization models are in the range of 0.053-148.87µM; with an average IC50 value of 18.7µM (without the highest value the average becomes 5.7µM). The IC50 values of 7 best rated molecules by the "Best-Rest" optimization model are in a range of 0.962-274.54 µM; with an average IC50 value of 51.8µM (without the highest value the average becomes 14.7µM). The IC50 values of 12 best rated compounds by "Best-Worst" optimization models are in a range of 0.896-212.12µM; with average IC50 value of 46.7µM (without the 2 highest values the average becomes 13.9 µM). The top antagonists from the two screening iterations are presented in Figure 9 (9A - 3 out of 30 from the initial model, 9B - 18 out of the 30 from the optimized models). Some of them may be clustered: Molecules A1 and A3 have in common a quinoline, 2-[2(dimethoxyphenyl)ethenyl]- moiety and A2 also shares the quinoline scaffold. Molecules B1, B5 and B13 have in common the 1-pyrrolidinyl-ethyl-piperazine moiety; molecules B2, B6, B10, B11, B12, B17 and B18 have in common a 4-ethylamino-imidazole moiety; molecules B4, B8 and B9 have in common a 4-(phenyl-methylamino)-piperainze moiety. All others have no common moiety. Results for the inhibitory concentrations are shown in supporting information tables S3 and S4 for the initial 30 molecules, and in tables S6 and S7 for the 30 optimized molecules.

20 ACS Paragon Plus Environment

Page 21 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Screening round-1 O

O N

O N

O

N

O N

O

A3 IC50 = 8.91

A2 IC50 = 7.77

A1 IC50 = 7.27

Screening round-2 N N O

N

N

N

N N

N N

Cl

N

NH

O O N H

O

NH

N

N

O

N

B1

N

O B4

B3

B2 O

N N N

O N

Cl

N

O

N

N

N O

N

N

N

HN

N

O

NH

O

O

N N B6

B5

B8

B7

N N

N

H N

N

N N

N N

O

N

O O NH

O B9

NH

O

B10

N

NH

O

B11 B12

N

N

O N

N N

N

H N

O

N

N

O O

O

N

Cl

H N

N

N

N N

N O B15

N B13

B16

B14

N N

N N

O

O N NH

O

N NH

O N

B17 B18

21 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 22 of 32

Figure 9. The set of molecules with lowest IC50 values (under 10 micromolar ). 9A) three compounds from the initial 40 molecules 9B) 18 molecules from the optimized screening of 30 molecules. IC50 Values of 9A antagonists are given in the figure. For 9B, IC50 values in micromolar units are B1 0.0534, B2 0.201, B3 0.815, B4 0.896, B5 0.962, B6 1.44, B7 1.6. B8 1.62, B9 1.78, B10 1.87, B11 3.14, B12 3.16, B13 3.82, B14 6.32, B15 6.43, B16 6.78, B17 7.51, B18 7.56

Discussion and Conclusions The goal of this research was to discover novel antagonists of TLR9 that qualify as candidates for treating over-reactive immune responses to viral and other infections. Main tools for such discovery are our ISE algorithm applied to virtual molecular discovery, and our proprietary in vitro TLR9 reporter assay. Our results present an outstanding success of this combination for discovery, and of ISE’s ability to detect novel, diverse and highly active molecules. Herein, we discuss a few of the issues that are at the basis of the computational modeling by ISE. Descriptors ISE applied to models for molecular discovery may be based on many types of molecular representations – binary fingerprints, structural features, physico-chemical properties of several dimensions, quantum mechanical indices of electronic structures and more. In this study, we have chosen to use 2D descriptors known to be especially useful for virtual screening due to the fact that 2D descriptors may be easily calculated from 2D chemical structures. The 2D descriptors lack conformational and topological parameters, which are likely to be verified or tested by other methods, such as docking.

Five-folds modeling ISE models are the result of a 5-fold construction of partial models of the learning set, by using 4/5 of the learning set for training, and 1/5 as test set. That is performed in 5 iterations so that each 20% of the full (learning) set serves once as test set and 4 times as training set. Four of the sets are picked randomly, each containing 1/5 of the active molecules and ~2000 inactives. Filters of each of the 5 models serve to determine indexes of ~60 + 2000 molecules (in case of ~300 actives) in each test set, so that at the end of the 5-folds – all molecules have been indexed just once, as shown in Figure 2. The final model combines filters produced by each of the 522 ACS Paragon Plus Environment

Page 23 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

folds, but limited to MCC values that are not less than 20% than the MCC value of the top filter of all five folds. Filters are combined from the different folds on condition that the qualities of the models are closely related, as reflected by MCC values of the top filters of each. Bias of similar filters is avoided by clustering, that is, testing whether filters with comparable MCC values are different by less than 2% in identifying true positives, a case which leads to the elimination of one of the two filters compared, the one with lower MCC. Enrichment Factor issues Enrichment factors are reported in tables 1, 2 and 3. An enrichment factor of 10 means that it is 10 times more probable to discover active molecules by evaluating them with the model than by picking them randomly. But even an enrichment of 25 has limited usefulness due to the following: Column "0.1" in Table 3 lists 64 "true positives" with indexes above 0.1 (TP, out of 289 known actives) and 27 false positives (FP) with such indexes (out of 10000 inactives), and the enrichment is ~25 = 64/(64+27) / (289/10289), the denominator being the chance to randomly pick an active out of the total. This is a small enrichment due to its practical aspects. The values of TP/(TP+FP) in each column may suggest how many molecules should be sent for in vitro testing in order to discover new candidates. The "discovery expectation" using these values, suggests that for that 0.1 column in Table 3 it is expected that out of 91 molecules, about 64 should be found to be true positives, i.e., about 2/1 out of each 3 molecules (64/27) sent for testing (due to an index higher than 0.1) could be a new candidate. That is however highly overestimated in favor of discovery, because of the low number of inactives in the learning set. That number should be closer to the expected rate of discovery in High Throughput Screenings (HTS), which is ~1 hit among 1000 tested molecules as a “rule of thumb”. 30 But such dilution in virtual modeling, of 1000 inactives for each active of the learning set, would require extremely longer computation times. It is therefore useful to "adjust" the numbers after model construction. The current dilution by 10,000 in the “full set” of Table 3, for ~300 actives, should increase about 30-fold in order to "fit" HTS expectations, thus the number of FP with indexes above 0.1 (Table 3) should be assumed to be 30 times the current number of FP. Thus, 27X30 = 810, and so the "practical" enrichment is only (64/874) / (289/10289) = 2.6 and the chance for discovery

23 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 24 of 32

is ~1 out of 13. So to secure discovery, at least 40 molecules should be purchased for the chance of finding 3 hits. Enrichments for practical purposes are therefore much lower than those computed from the learning sets, due to the relatively small numbers of inactives compared to a much larger "chemical space" in chemical libraries. The reason for those small numbers is to avoid very long computations. The above discussion is valid for classification modeling in which actives are to be discovered among randoms ("decoys"), with "dilution" at the HTS levels of 1:1000. That is the case for two of the three models used in our research – the "full" actives vs. randoms, and the "Best-Rest" model. For the "Best-Rest" model we have 53 actives for which we should have picked ~50,000 inactives to reflect HTS conditions in silico. In Table 3 of the "full" set, increasing the index is accompanied by a decrease in both TP and FP, but the proportion of TP/FP increases. Using the larger TP/FP values from the model increases the chance for discovery if screened molecules are picked only above the index for the top TP/FP . The enrichments for the Best-Worst model have a different character, because there is no "dilution" with inactives, but a division of the actives to two classes, high (“best”) actives and low (“worst”) actives. The condition for that modeling is that the two groups, the “best” and the “worst” would be of closely related sizes (~50 vs ~250 in our model). In that case, the enrichment factors do not have to be “diluted” and may be used, together with TP/FP, to decide upon the index from which to acquire molecules.

Solubility Calculations Many purchased or synthesized drug candidates do not dissolve easily in biological fluids, a fact which hampers in vitro or in vivo tests. Aqueous solubility, which reflects the solubility in biological fluids, is thus important to evaluate prior to acquisition. Measuring solubility is a lowthroughput process and therefore not feasible for thousands of compounds. On the other hand, none of the current algorithmic predictions of solubility is reliable enough to be applied to any molecule. Due to that absence of highly accurate and reliable solubility models, one of the possible solutions is to use a “voting” approach, by which a few different in silico solubility models are applied to predict any molecule. The results of the different algorithms are then 24 ACS Paragon Plus Environment

Page 25 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

compared and if there is self-consistency among them, i.e., the standard deviations are not too large, those predictions may be used to reflect the expected solubilities. “Voting” on the basis of 5 different algorithms was helpful in our decision making for final purchase of molecules. In case of molecules which are expected to be charged at biological pH, or that are prepared as salts, the charged state is a major factor promoting solubility. Virtual Screening and Initial Candidates Virtual screening was performed with a large set of commercially available molecules from the Enamine database, about 1.8 million. Only 1789 molecules (~.1% of the total) passed our ISE initial (“full”) model above an index value of 0.2, and 229 of them passed above an index of 0.4. After filtering by applicability and solubility, 30 out of 229 candidates were chosen for in vitro tests and out of those 26 molecules were characterized as active antagonists, 9 out of those having relatively low IC50 µM values (under 16 µM) and 3 are under 10µM. Those results imply the high distinguishing features of the chosen ISE model, while the range of activity values retrieved (µM-mM) ideally satisfies the initial aim, which is distinguishing between inactive molecules and molecules with activity for the TLR9 receptor (hit molecules). Improving the models In the second round of modeling, we found a very small number of filters (only 5) for the models. This may be a result of the small number of molecules used as the set of actives (only 54). With that small number, it is not surprising that only few specific feature combinations are needed to distinguish these 53 molecules from the others, 10,000 inactives picked randomly. However, comparing with our first (actives vs. randoms ~300 vs.10,000) model, the current model retrieved highly active compounds (from both Best-Rest and Best-Worst), and many of them were found to have low IC50 values for TLR9 antagonism. Moreover, all of the highly rated compounds by those models were found to be active in the experiments. From the 883 molecules that conformed to the applicability domain and were indexed above 0.2 by the initial ISE, 30 final candidates for in vitro tests were selected. These candidates were predicted as actives by one of the optimization models or by both of them. For comparison and evaluation of the optimization models separately, the molecules were selected in a way that 11 of them showed highest results in both of the models, 7 showed high results only in the “Best-Rest” model and 12 showed high results only in the “Best-Worst” model.

25 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 26 of 32

The TLR9 reporter cell line could be shown to be highly specific for the known TLR9 agonist ODN2216, and antagonist A151. Therefore, it was considered as a well-defined tool to characterize the molecules identified biochemically. The in vitro tests showed very good results for all the molecules predicted by both of the optimization models separately, as well as for the combination of the models (Supporting information Table S6). The average activity value for the 11 compounds that were rated as best by both of the models is 18.7 µM, while excluding the worst compound (with IC50 = 148.87µM) reduces the average to 5.7 µM. The average activity value for all 18 compounds of the "Best Rest" model is 33.42 µM, and for all the 23 best rated compounds of the "Best Worst" model it is 33.31µM. Excluding the four worst molecules in the experiment with IC50 values > 100 µM , the average activity values are 9.66 µM and 9.82 µM, respectively. Five compounds under 1µM were retrieved, each of the models retrieved four of them, so that three out of these five highly active compounds were recognized by both models. Those results imply that both of the models distinguish correctly and very specifically between the dominant features of highly active compounds and between irrelevant features of foreign molecules, while each model separately focuses on different sets of features. Due to this fact, one model can filter out active molecules which were selected by the other model. We conclude that for the discovery of highly active molecules, using both models is helpful, and the “Best-Rest’ and “Best-Worst” models form the basis of better predictions than the “Actives-Randoms” initial model. Diversity of Antagonists' scaffolds A major issue in drug discovery is the novelty of molecular scaffolds. This may be partially reflected by computing similarity between the predicted set and the learning set, as well as within the predicted set.

31

Tanimoto similarity (T values) between molecules is the current standard measure that reflects the diversity and the novelty of a method of discovery. The Tanimoto similarity within the newly discovered TLR9 antagonists finds that 10 out of 60 compounds (17%) have a similarity lower than T = 0.5 with all the others and may be considered totally different scaffolds. Of the rest, 45 molecules (75%) have a maximum similarity lower than T=0.85 thus not highly similar to others. The average similarity of these 45 molecules is very low, T=0.28.

26 ACS Paragon Plus Environment

Page 27 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Did we find molecules that are not “copies” or are very similar to those that we learned from? The similarity of the 60 discovered compounds, from both rounds, to the known 289 TLR9 antagonists that were used for model construction was examined in more detail: Comparison by Tanimoto shows that 43 out of 60 (71%) have a maximum similarity to the CHEMBL compounds used for learning - below T=0.5 and 56 out of those 60 (93%) have a maximum similarity below T=0.7. It can thus be deduced that most of the newly discovered compounds are very different than the known CHEMBL compounds (Supporting information Table S8). Moreover, we find that the top 18 compounds (with best IC50 values, from the optimized set, supporting information Table S7) have no previous literature references. None of them were ever mentioned in articles or patents. Therefore, the discovery set is novel by all criteria, and is highly different than previously known compounds. The Tanimoto similarity within the ChEMBL TLR9 antagonists used as learning set is however different: Out of 230 compounds only 61 (26%) have a Tanimoto similarity lower the T=0.5 and may be considered different scaffolds, while 104 (45%) have Tanimoto index lower than 0.85. Thus, despite having learned from a set of TLR9 antagonists with molecules which share some molecular features, we ended with a discovery set that shares much less. These results thus confirm and validate the ability of ISE models to discover compounds with new scaffolds and to discover active molecules among molecules for which no activity was ever indicated previously.

Using ISE in Drug Discovery projects There are benefits and limitations of using ISE for discovering novel molecular activity. There are four basic situations confronting drug discovery projects for a specific target protein: 1) known ligands and known target structure 2) only ligands are known 3) only the target structure is known and 4) no ligands and no target structure available. There are of course a few levels of “knowledge” (many structures, complexes of ligands with protein, only few ligands, etc.). Assuming the “ideal” conditions, with many known ligands, and a few protein structures including complexes with agonists and antagonists, modeling with ISE is straightforward and may be followed by docking, pharmacophore and other techniques. However, discovery with ISE is possible with a sufficient number of ligands known, at least a few dozens. Diversity is not a necessary condition, except for excluding similar molecules. We use only 2D descriptors of 27 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 28 of 32

physico-chemical properties computed by MOE and this may be the basis of the "scaffold hopping" which characterizes most of the results of screenings through the models, because there is no structural similarity requirements. However, some of the resulting structures may be too large or too small for fitting into the binding sites of the targets. Also, as shown in this project, solubility is not a consideration and should be dealt with separately. One of the main advantages of the ISE process is the validation of descriptors that do not contribute to success, performed in iterations. The reduced number of combinations reaches a point from which all of them are computed and evaluated, and these evaluations are based on a simple knowledge function, MCC, which ends with the production of multiple filters that allow subsequent scoring of unknowns. We highly recommend the use of ISE in any case of many known ligands in which new scaffolds are of interest. Of the four situations presented above, those with protein + ligands (1) and those with only ligands in sufficient numbers (2) are the best for employing ISE.

Conclusions 1)

In the current project, we were able to discover 56 antagonists out of 60 tested molecules

picked by our modeling out of 1.8 million commercially available molecules, an unprecedented discovery rate 2)

For this particular target of TLR9, a ligand based classification was able to discover

potent hits, even in the absence of any knowledge of the target structure 3)

Among the various classification models, the "Best-Rest" and "Best-Worst" models

proved to supply the best hits 4)

Compounds discovered by ISE have new and diverse scaffolds in comparison to the

learning set and to each other. The discovered molecules are more diverse than the learning set. 5)

Most of the discovered molecules have not been mentioned in the scientific literature,

while others have not been associated with innate immune activity 6)

Predicted solubility evaluations by "voting" proved useful for the current work. No

molecules failed in the assay due to solubility

28 ACS Paragon Plus Environment

Page 29 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

Supporting Information All the tables of the supporting information section are mentioned in the manuscript. A short chapter describing the problematics of docking experiments may be found at the end of supporting information. Table S1: Vitality of the NIH3T3 cells after the performance of the assay Table S2: 68 Top Scored Candidates and Their Calculated Solubilities Table S3: 30 initial screened candidates and their IC50 Values Table S4: Structures, IC50 values and ISE indexes of the initial 30 candidates Table S5: Calculated Descriptors for the Sets Table S6: 30 Top optimized candidates: indexes, solubilities and IC50 Values Table S7: 30 Top optimized candidates: structures and IC50 Values Table S8: Tanimoto Similarity Values for the Discovered 60 Compounds Table S9: TLR9 active site residues within 5 Å of iDNA bound structure pdb id (3WPD). Table S10: Virtual screening results for the docking based validation experiment Short Chapter: Docking of ligands to horse TLR9 receptor including Figure S1: Structure of horse TLR9 in complex with a DNA antagonists, and the results of docking a single ligand in to that structure Supporting information is available free of charge via the Internet at http://pubs.acs.org. Corresponding Author Information: Amiram Goldblum +972-544-653292 [email protected]

Acknowledgment Financial support by a combined grant of Fraunhofer Institute IGB Stuttgart and by the Hebrew University of Jerusalem is greatefully acknowledged.

29 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Page 30 of 32

Abbreviations Used: DMEM - Dulbecco′s Modified Eagle′s Medium. ECD - extracellular N-terminal ectodomain. FCS - fetal calf serum. FN –false negatives. FP - false positives ISE- Iterative Stochastic Elimination. LRR - leucine-rich repeats. MCC – Matthews correlation coefficient. ODN – oligodeoxynucleotide. PAMP - pathogen associated microbial pattern. pNPP – para- Nitrophenol Phosphate. PPR - Pattern Recognition Receptor SEAP - secreted alkaline phosphatase. TIR domain - Toll Interleukin Receptor domain. TN- true negatives. TP –true positives.

References 1. Beutler, B., Innate Immunity: An Overview. Mol. Immunol. 2004, 40, 845-859. 2. Botos, I.; Segal, D. M.; Davies, D. R., The Structural Biology of Toll-like Receptors. Structure 2011, 19, 447-459. 3. Akira, S.; Takeda, K., Toll-like Receptor Signalling. Nat. Rev. Immunol. 2004, 4, 499-511. 4. Burger-Kentischer, A.; Abele, I. S.; Finkelmeier, D.; Wiesmüller, K.-H.; Rupp, S., A New Cellbased Innate Immune Receptor Assay for the Examination of Receptor Activity, Ligand Specificity, Signalling Pathways and the Detection of Pyrogens. J. Immunol. Methods 2010, 358, 93-103. 5. Americn Chemical Society Computers in Chemistry Division Symposium on Emerging Computational Technologies August 2000, Washington D.C. http://oldwww.acscomp.org/Awards/emerging2000.html. 6. Stern, N.; Goldblum, A., Iterative Stochastic Elimination for Solving Complex Combinatorial Problems in Drug Discovery. Isr. J. Chem. 2014, 54, 1338-1357. 7. Hari, A.; Flach, T. L.; Shi, Y.; Mydlarski, P. R., Toll-like Receptors: Role in Dermatological Disease. Mediat. Inflam. 2010, 2010. 8. Plitas, G.; Burt, B. M.; Nguyen, H. M.; Bamboat, Z. M.; DeMatteo, R. P., Toll-like Receptor 9 Inhibition Reduces Mortality in Polymicrobial Sepsis. J. Exp. Med. 2008, 205, 1277-1283. 9. Luisa, G. M.; Murciano, C.; Yanez, A.; Gozalbo, D., Role of Toll-like receptors in Systemic Candida albicans Infections. Front. Biosci. 2015, 21, 278-302. 10. Kanzler, H.; Barrat, F. J.; Hessel, E. M.; Coffman, R. L., Therapeutic Targeting of Innate Immunity with Toll-like Receptor Agonists and Antagonists. Nat. Med. 2007, 13, 552-559. 11. Berman, H. M.; Westbrook, J.; Feng, Z.; Gilliland, G.; Bhat, T. N.; Weissig, H.; Shindyalov, I. N.; Bourne, P. E., The Protein Data Bank. Nucleic Acids Res. 2000, 28, 235-242. 12. Wlodawer, A.; Vondrasek, J., Inhibitors of HIV-1 Protease: A Major Success of Structureassisted Drug Design 1. Annu. Rev. Biophys. Biomol. Struct. 1998, 27, 249-284. 30 ACS Paragon Plus Environment

Page 31 of 32

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Journal of Chemical Information and Modeling

13. Zhou, W.; Li, Y.; Pan, X.; Gao, Y.; Li, B.; Qiu, Z.; Liang, L.; Zhou, H.; Yue, J., Toll-like Receptor 9 Interaction with CpG ODN-An In Silico Analysis Approach. Theor. Biol. Med. Mod. 2013, 10, 18. 14. Stahura, F. L.; Bajorath, J., New Methodologies for Ligand-based Virtual Screening. Curr. Pharm. Des. 2005, 11, 1189-1202. 15. Baldi, P.; Brunak, S.; Chauvin, Y.; Andersen, C. A.; Nielsen, H., Assessing the Accuracy of Prediction Algorithms for Classification: An Overview. Bioinform. 2000, 16, 412-424. 16. Glick, M.; Rayan, A.; Goldblum, A., A Stochastic Algorithm for Global Optimization and for Best Populations: A Test Case of Side Chains in Proteins. Proceedings of the National Academy of Sciences 2002, 99, 703-708. 17. Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B., ChEMBL: A Large-scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012, 40, D1100-D1107. 18. Czarniecki, M., Small Molecule Modulators of Toll-like Receptors. J. Med. Chem. 2008, 51, 6621-6626. 19. Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Wang, J.; Zhou, Z.; Han, L.; Karapetyan, K.; Dracheva, S.; Shoemaker, B. A., PubChem's BioAssay Database. Nucleic Acids Res. 2012, 40, D400D412. 20. Zheng, W.; Spyvee, M.; Gusovsky, F.; Ishizaka, S. T. Benzoxazole derivatives as immune tolllike receptor stimulators and their preparation, pharmaceutical compositions and use in the treatment of immunological disorders. WO 2010/036905, 2010. 21. Lipford, G.; Forsbach, A.; Zepp, C. Small molecule toll-like receptor (TLR) antagonists. WO 2005/007672, 2005. 22. Lippford, G., B.; Zepp, C., M. ; Nguyen, T., B. Benzoxazole derivatives as immune toll-like receptor stimulators and their preparation, pharmaceutical compositions and use in the treatment of immunological disorders. WO 2008/030455, 2008. 23. Molecular Operating Environment (MOE), 2013.08, Chemical Computing Group Inc.: 1010 Sherbooke St. West, Suite #910, Montreal, QC, Canada, H3A 2R7, 2013. 24. Nothelfer, K.; Arena, E. T.; Pinaud, L.; Neunlist, M.; Mozeleski, B.; Belotserkovsky, I.; Parsot, C.; Dinadayala, P.; Burger-Kentischer, A.; Raqib, R., B Lymphocytes Undergo TLR2-Dependent Apoptosis upon Shigella Infection. J. Exp. Med. 2014, 211, 1215-1229. 25. Krieg, A. M.; Yi, A.-K.; Matson, S.; Waldschmidt, T. J.; Bishop, G. A.; Teasdale, R.; Koretzky, G. A.; Klinman, D. M., CpG Motifs in Bacterial DNA Trigger Direct B-cell Activation. Nature 1995, 374, 546-549. 26. Wang, J.; Krudy, G.; Hou, T.; Zhang, W.; Holland, G.; Xu, X., Development of Reliable Aqueous Solubility Models and Their Application in Druglike Analysis. J. Chem. Inf. Model. 2007, 47, 1395-1404. 27. Ohto, U.; Shibata, T.; Tanji, H.; Ishida, H.; Krayukhina, E.; Uchiyama, S.; Miyake, K.; Shimizu, T., Structural Basis of CpG and Inhibitory DNA Recognition by Toll-like Receptor 9. Nature 2015, 520, 702-705. 28. Enamine Ltd. Screening compound Provider At http://www.enamine.net/: Kiev, Ukrane, 2015. 29. Matthews, B. W., Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure 1975, 405, 442-451. 30. Posner, B. A.; Xi, H.; Mills, J. E. J., Enhanced HTS Hit Selection via a Local Hit Rate Analysis. J. Chem. Inf. Mod. 2009, 49, 2202-2210. 31. Willett, P., Similarity-based Virtual Screening Using 2D Fingerprints. Drug Discov. Today 2006, 11, 1046-1053.

31 ACS Paragon Plus Environment

Journal of Chemical Information and Modeling

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60

Table of Contents Graphic

32 ACS Paragon Plus Environment

Page 32 of 32