Application Note pubs.acs.org/jcim

Cite This: J. Chem. Inf. Model. XXXX, XXX, XXX−XXX

Biased Complement Diversity Selection for Effective Exploration of Chemical Space in Hit-Finding Campaigns

Johanna M. Jansen,* Gianfranco De Pascale, Susan Fong, Mika Lindvall, Heinz E. Moser, Keith Pfister, Bob Warne,† and Charles Wartchow

Novartis Institutes for BioMedical Research, 5300 Chiron Way, Emeryville, California 94608, United States


Supporting Information

ABSTRACT: The success of hit-finding campaigns relies on many factors, including the quality and diversity of the set of compounds that is selected for screening. This paper presents a generalized workflow that guides compound selections from large compound archives with opportunities to bias the selections with available knowledge in order to improve hit quality while still effectively sampling the accessible chemical space. An optional flag in the workflow supports an explicit complement design function where diversity selections complement a given core set of compounds. Results from three project applications as well as a literature case study exemplify the effectiveness of the approach, which is available as a KNIME workflow named Biased Complement Diversity (BCD).



INTRODUCTION

The search for quality hits to enter medicinal chemistry optimization for a drug discovery project often starts with a hit-finding campaign in which a large collection of compounds is assessed in a biochemical or cellular assay. An iterative hit-finding approach starts with screening of a subset of accessible compounds, followed by expansion around the hits. This type of approach is required when an assay has limited throughput and cannot accommodate all accessible compounds. Even when assays can accommodate a large collection, an iterative approach can be very effective with respect to finding quality starting points quickly without the time and resource investment of screening all accessible samples.1,2 Selection of the subset that initiates an iterative screening campaign is a critical step in the success of the campaign, second only to the quality and diversity of the physically available samples from which to select that subset. Various approaches have been published for selection of screening sets that focus on covering the diversity of the source set using measures of chemical and biological diversity.3−5

While drug discovery by definition implies an exploration of the unknown, the choices we make during the discovery process are influenced by what we already know (about a target, a phenotypic outcome, or a desired compound profile). This knowledge constitutes a bias that we can leverage to increase our chances of success when selecting our first set of

compounds to screen. Previous reports in the literature describe approaches to incorporate knowledge-driven attributes in the selection of screening collections6 or the design of compound libraries.7,8 For purposes of our screening set designs, we wanted to create a workflow that explicitly complemented exploration of one or more specific hypotheses with coverage of chemical space to leave us open to serendipitous and novel discoveries. The concept of complementing knowledge-based bias with diversity in integrated designs has been described in relation to combinatorial library design.9 The workflow described in this paper translates this concept to screening set design and expands it to be applicable to any collection of molecules. In the context of designing screening sets, we are guided by the mantra “every well counts”. The compounds that are selected to be part of the screening set should not have any known attributes that make them unfit as starting points for optimization; such compounds should be excluded from consideration upfront. For the compounds that are considered, we choose to pick representatives that have attributes associated with higher chances of success. Success is the identification of quality starting points, where “quality” is

Special Issue: Women in Computational Chemistry. Received: January 11, 2019


DOI: 10.1021/acs.jcim.9b00048 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX


Journal of Chemical Information and Modeling

specific annotations. Exclusions are used when there are attributes that are known to be incompatible with the concept of a quality hit; compounds with such attributes should be excluded from the workflow at the start. Remaining compounds are assigned to up to four separate classes, labeled A through D. The highest-quality class is A, followed by B, C, and D. Attributes used for the exclusion and classification determination across the case studies include calculated physical chemical properties, matches to undesirable substructures, and evidence from prior screening history that a compound is a frequent hitter. Projects that want to use a complement design option around a core set should designate all of the compounds in the core set as class-A. The workflow will then select compounds (from classes B, C, and D) to complement the chemical space that is already covered by the class-A compounds in the core set. On the basis of the fully annotated data set, the workflow loops over every cluster. All singletons are passed into the final selection. If a cluster contains only class-D compounds, a single best representative is selected. For clusters that have more than one compound with a class A, B, or C label, biased selection occurs across three tiers in the third part of the workflow. If a complement design is requested, all of the compounds in class A get selected automatically (they are the core set), and the design is filled out with the remaining classes. A key component in the tiered selection process is the RDKit14 Diversity Picker node in KNIME. This node picks diverse rows from an input table on the basis of Tanimoto distance between fingerprints (Morgan fingerprints, radius 2, 2048 bit length) using the MaxMin algorithm.15 The node has a complement option, which is used between the different tiers to balance a focus on quality with a complementation in chemical diversity. 
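To make the MaxMin idea and the complement option concrete, the following is a minimal pure-Python sketch. It is not the RDKit Diversity Picker node: fingerprints are represented as plain sets of on-bit indices, and the function names (`tanimoto_distance`, `maxmin_pick`), the tie-breaking rule, and the toy data are illustrative assumptions.

```python
def tanimoto_distance(fp_a, fp_b):
    """Tanimoto distance (1 - similarity) between two sets of on-bit indices."""
    union = len(fp_a | fp_b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(fp_a & fp_b) / union


def maxmin_pick(fingerprints, n_picks, first_picks=()):
    """Greedy MaxMin: repeatedly pick the item farthest from everything selected.

    first_picks holds indices that are already selected (the set to
    complement); they seed the distances but are not returned.
    """
    if n_picks <= 0:
        return []
    seeds = list(first_picks)
    seed_set = set(seeds)
    candidates = [i for i in range(len(fingerprints)) if i not in seed_set]
    picked = []
    if not seeds and candidates:
        # No core set: start from the first candidate (an arbitrary choice).
        first = candidates.pop(0)
        picked.append(first)
        seeds = [first]
    # Distance from each candidate to its nearest already-selected item.
    min_dist = {
        i: min(tanimoto_distance(fingerprints[i], fingerprints[s]) for s in seeds)
        for i in candidates
    }
    while len(picked) < n_picks and min_dist:
        # Largest minimum distance wins; ties go to the lowest index.
        best = max(min_dist, key=lambda i: (min_dist[i], -i))
        picked.append(best)
        del min_dist[best]
        for i in min_dist:
            d = tanimoto_distance(fingerprints[i], fingerprints[best])
            if d < min_dist[i]:
                min_dist[i] = d
    return picked


# Toy fingerprints: index 0 plays the role of an already-selected core compound.
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}, {20, 21}]
complement_picks = maxmin_pick(fps, 2, first_picks=[0])
print(complement_picks)  # [2, 4]
```

In this toy example the close analogue of the seeded compound (index 1) is never chosen, and once index 2 is picked its near neighbor (index 3) is skipped as well, which is the behavior a complement selection is meant to provide.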
This means that even without a core set, the workflow uses the notion of a complement design.
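The per-cluster rules described above can be sketched roughly as follows. This is a simplified illustration, not the KNIME workflow itself. It makes several assumptions: a numeric `rank` field (lower is better) stands in for the notion of a best representative, the three tiers are mapped to classes A, B, and C, one pick is made per tier, and class-D members of mixed clusters are skipped. The actual workflow exposes user variables for sampling density and for selection from the least desirable class that are not modeled here.

```python
def tanimoto_distance(a, b):
    """Tanimoto distance between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def pick_farthest(candidates, selected, n):
    """Pick up to n candidates, each maximizing its minimum distance
    to everything selected so far (the complement idea between tiers)."""
    chosen = []
    pool = list(candidates)
    while pool and len(chosen) < n:
        reference = selected + chosen
        if reference:
            best = max(
                pool,
                key=lambda m: min(tanimoto_distance(m["fp"], r["fp"]) for r in reference),
            )
        else:
            best = pool[0]  # nothing to complement yet: take the first candidate
        chosen.append(best)
        pool.remove(best)
    return chosen


def select_from_cluster(members, picks_per_tier=1, complement=False):
    """Apply the cluster rules to one cluster (a list of compound dicts)."""
    if len(members) == 1:
        return list(members)  # singletons always pass
    if all(m["cls"] == "D" for m in members):
        # Only class-D compounds: keep a single best representative.
        return [min(members, key=lambda m: m["rank"])]
    # With an explicit complement design, every class-A (core) compound is kept.
    selected = [m for m in members if m["cls"] == "A"] if complement else []
    tiers = ("B", "C") if complement else ("A", "B", "C")
    for tier in tiers:
        tier_members = [m for m in members if m["cls"] == tier]
        selected += pick_farthest(tier_members, selected, picks_per_tier)
    return selected


# Hypothetical cluster with two core compounds and two class-B compounds.
cluster = [
    {"id": "A1", "cls": "A", "rank": 1, "fp": {1, 2, 3}},
    {"id": "A2", "cls": "A", "rank": 2, "fp": {1, 2, 6}},
    {"id": "B1", "cls": "B", "rank": 3, "fp": {1, 2, 4}},
    {"id": "B2", "cls": "B", "rank": 4, "fp": {7, 8, 9}},
]
with_core = [m["id"] for m in select_from_cluster(cluster, complement=True)]
without_core = [m["id"] for m in select_from_cluster(cluster, complement=False)]
```

With `complement=True` both class-A compounds are retained and the class-B pick is the one farthest from them; with `complement=False` class A is sampled like any other tier, and the later tiers still complement what was picked before them.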

unique for each project. Quality criteria can include attributes like physical chemical property profile (e.g., low lipophilicity, high ligand efficiency), lack of activity in counterassays (e.g., cytotoxicity assays), confirmation in orthogonal assays, and novelty of chemotype. It has been our observation that the hit rate for the classes of compounds with attributes associated with high quality is often lower than the overall hit rate. By biasing our designs toward picking compounds with these quality attributes, we recognize that we may obtain fewer total hits, but we expect more quality hits. The concept of doing an integrated design in which the diversity set complements the focused set(s) within the same design run prevents sampling of the same chemical space twice, which would be “a waste of wells”. Over the course of many hit-finding campaigns, we have created a KNIME workflow10 for screening set design that balances sampling of chemical diversity with bias of knowledge-driven attributes for quality. This includes an option to fix certain portions of the design to explore specific hypotheses and then add a diversity set around that fixed core to be complementary in chemical space. The workflow, called Biased Complement Diversity or BCD, is available in the Supporting Information. The focus of the workflow and this paper is on the initial screening set selection. Projects will typically engage in an iterative process using various computational hit-list expansion approaches, where data from the first screening round is used to predict which other compounds might also be active.1,2,11−13 Following the description of the computational workflow, case studies from our internal drug discovery efforts as well as a literature case study will exemplify how BCD provides a balanced screening set selection that leverages available knowledge to sample diversity in a manner that enriches for quality in the confirmed hits.





WORKFLOW DESIGN

The basic principle of our diversity selection uses clustering to describe chemical space and then applies the bias by picking the “highest quality” compounds per cluster. Every cluster is sampled and every singleton is included, which means that chemical space, as defined by the clustering, is fully sampled. Users can decide on their preferred method of clustering and their preferred fingerprints. An explicit complement design option is available for instances where a fixed set of compounds, a “core set”, has already been selected to test certain hypotheses and the team wants to sample diversity around this core to ensure that they explore complementary chemical space. There are five variables for the user to define: a variable indicating whether an explicit complement design is desired, two variables that govern the sampling density for the clusters, and two variables that determine selection from the least desirable region of space. The detailed workflow documentation is included in the Supporting Information.

The three major steps in the workflow include creation and annotation of a compound library, quality assessment of every cluster of compounds on the basis of the annotations, and biased selection of representative compounds from each cluster (Figure S1 in the Supporting Information). Creation of the compound library starts with a list of unique chemical structures from which the subset should be selected. While all internal case studies used physically available samples as input, the workflow is equally applicable to collections of commercially available compounds or virtual compound collections. Quality attributes and exclusions are project-

CASE STUDIES

Introduction. Three case studies from our internal drug discovery efforts and a malaria case study from a literature data set that can be used to explore the workflow are described below. The bacteriology case study illustrates the use of the BCD workflow where a quality bias is applied to the selection but there is no core set; the distinctive quality criterion is a decrease in cytotoxicity in the hit list. Target 1 shows an example where there is a knowledge-based core set informed by a computational target model. The BCD workflow selected a complement diversity set around the core set. The distinctive quality criterion for target 1 is the identification of novel chemical matter using a known mechanism. Target 2 illustrates the creation of a core set from multiple subsets to leverage knowledge in different forms, and its distinctive quality criterion is the identification of chemical matter that validates the proposed mechanism using orthogonal biophysical assays. The bacteriology and malaria hit-finding efforts are driven by cellular assays; the two target case studies are driven by biochemical assays. The exact cutoffs, properties, and settings for the exclusions, classification, and workflow variables are different for each case study and are included in the Supporting Information. This represents the fact that the needs, knowledge, and quality attributes are distinct for every project. In addition, our internal annotations related to quality attributes are dynamic. For example, substructure annotations get updated and reviewed regularly, prompted by new literature reports16,17 as


Table 1. Results from the Diversity (BCD) Screening Sets, Compared with Core Sets Where Applicable

case study            source set size  screening set size(s)      confirmed hit-rate (%)  no. of confirmed hits  no. of BM scaffolds in confirmed hits (shared with other subset)a
bacteriology          832k             BCD: 76k                   0.1                     40                     37
target 1              1157k            BCD: 53k                   0.2                     119                    108 (4)
                                       docking: 51k               0.6                     293                    217 (4)
target 2              736k             BCD: 19k                   0.4                     81                     81 (1)
                                       prior hits: 506            2.1                     10                     10
                                       privileged scaffolds: 15k  0.5                     68                     64 (1)
                                       preplated diversity: 18k   0.4                     72                     72
malaria (literature)  306k             BCD: 50k                   0.1                     31                     30
                                       similarity: 476            1.1                     5                      2

a Each number in parentheses is the number of BM scaffolds from the subset that is shared with the BM scaffolds from another subset.

are cytotoxic (see the Supporting Information for details). For the purposes of our screening set, we therefore chose to bias the design to compounds with clogD ≤ 3. Details of the exclusions and classification for the bacteriology design are given in the Supporting Information. The most pertinent detail in light of the above analysis is that all compounds with a clogD > 3 are assigned to class D. The selected screening set was assessed for bacterial growth inhibition across a panel of bacterial strains using an initial single-concentration screen. Primary hits against at least one wild-type GN strain were followed up in dose−response assays and resulted in 40 confirmed hits at an EC50 cutoff of 20 μM against at least one wild-type GN strain that are not cytotoxic against mammalian cells. It is worth noting that 65 compounds were confirmed at the EC50 level, but 25 of those did not pass the cytotoxicity counterscreen. In other words, the designed screening set resulted in a 62% proportion of validated hits without cytotoxicity, which is a significant improvement over the historical proportion of 31% and supports the use of the clogD-biased design to improve the quality of antibacterial hits. Target 1 Case Study. Hit-finding activities for target 1 were driven by a biochemical assay designed to discover compounds that displaced a probe molecule from a known binding pocket. The objective of the project was to find starting points for chemistry optimization that were different from known chemical matter binding in that binding pocket. A 51k knowledge-based subset of the available sample archive was generated via high-throughput docking to the most buried binding pocket shared by known ligands, followed by filtering using 3D pharmacophore features that were common to the bound ligands. 
The team wanted to complement this targeted hypothesis with a diversity set biased by the features described in the Supporting Information to allow for the discovery of unexpected binding modes and novel mechanisms. Hits from a single-concentration screen were triaged to focus on the most promising hits for submission to a dose−response follow-up assay using calculated properties, substructure flags, and visual inspection. Compounds that showed dose-dependent inhibition in the follow-up assay were further triaged using potency and ligand efficiency, shape of the dose−response curve (to focus on compounds within the mechanistic scope of the assay), analytical QC of the sample, and visual inspection of the chemical structure. The last of these included an assessment to remove compounds in which a key pharmacophore element was shared with known inhibitors and the rest of the molecule was not sufficiently novel. Further characterization of the confirmed hits, including crystallography and computational medicinal chemistry assess-

well as ongoing experience. Our general approach for incorporating matches to undesirable substructures is to recognize two levels of severity, where one functions as a warning (“flag”) and the other is expected to be incompatible with the general concept of quality hit (“out”).

Table 1 gives an overview of the screening sets in the various case studies as well as measures of impact of the screening set design. The source set is the input for the selection workflow, and the screening sets are the output. Hit rates are reported at the level of confirmed hits as a ratio versus the number of compounds screened for each screening subset. The criteria to determine confirmed hits are unique to each project and include dose−response confirmation of the single concentration screening result, counterscreens (cytotoxicity, assay interference), and potency cutoffs (IC50, ligand efficiency). Visual inspection of compound structures and dose−response curves also typically removes compounds during the hit-confirmation work. The last two columns in the table report a chemical diversity measure by providing a comparison of the number of confirmed hits in each subset with the number of unique Bemis−Murcko (BM) scaffolds18 in that subset of confirmed hits.

Bacteriology Case Study. The bacteriology case study describes the generation of a focused screening set with increased chances of finding validated hits that inhibit the growth of Gram-negative (GN) pathogens while not showing cytotoxicity in mammalian cells. The historic proportion of validated hits without cytotoxicity was only 31% (see the Supporting Information). The primary measure of success of the screening set design therefore became the increase of that proportion, and cytotoxicity was the key quality determinant in this study. The screening set was designed to drive hit finding across a panel of bacterial strains using inhibition of bacterial growth as a readout. 
The parameters for the screening set design were informed by an analysis of the relationship between lipophilicity, as represented by calculated logD (clogD), and risk of cytotoxicity for inhibitors of GN pathogens. This analysis was driven by two observations: First, compounds with high lipophilicity have a higher chance of being promiscuous inhibitors with an associated risk of being generally cytotoxic.19−21 Second, the physical chemical property space for antibacterial drugs targeting GN pathogens is shifted to higher polarity compared with that for traditional reference drugs.22 An analysis of our in-house data indicated that compounds with logD > 3 are unlikely to specifically inhibit the growth of GN pathogens: 87% of compounds inhibiting growth of GN pathogens with EC50 < 20 μM and clogD > 3


observed was for the prior hits from the phenotypic assay, which is the subset with the most pertinent knowledge about target 2; the other three subsets have similar hit rates. The BM scaffold analysis for the four subsets indicates chemical diversity between the sets and confirms that the complementation is effective. Further orthogonal validation using biophysical assays (SPR, protein-based NMR) was used to determine ultimate quality hits, which included two compounds from the preplated diversity set, five compounds from the BCD set, four compounds from the prior hits set, and three compounds from the privileged scaffolds set. At the level of quality hits, there is a larger enrichment of the BCD set compared with the preplated diversity set, which may reflect the better curation of the former. This observation informs our strategy for creating new preplated diversity sets using modified approaches to capture quality attributes for general hit-finding purposes. The hit-finding strategy for target 2 also met its objective of identification of chemical matter that validates the proposed mechanism using orthogonal biophysical assays. Malaria Literature Example. In order to make the BCD workflow more accessible, we include the application of the workflow to a public domain data set representing 306k compounds screened against Plasmodium falciparum strain 3D7 at St. Jude Children’s Research Hospital.23 This data set was also used in a challenge by the Teach−Discover−Treat (TDT) initiative24 and can be downloaded from the TDT Web site.25 As an example, the design was set up to select a 50k subset from the 306k library. In order to simulate a core set for the complement option, the four compounds mentioned in the original reference as control compounds for specific MOAs were used to query the 306k library for compounds that were similar to these controls using Tanimoto similarity for the whole molecule as well as for their BM scaffolds. 
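The similarity query used to simulate a core set can be illustrated with a short pure-Python sketch. The similarity threshold of 0.7, the rule that a match on either the whole-molecule fingerprint or the scaffold fingerprint is sufficient, the set-of-on-bits fingerprint representation, and all names and data below are assumptions made for illustration. The settings actually used are documented in the Supporting Information.

```python
def tanimoto_similarity(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def simulate_core_set(controls, library, threshold=0.7):
    """Return library IDs similar to any control compound.

    A compound qualifies if either its whole-molecule fingerprint or its
    scaffold fingerprint reaches the threshold against a control.
    """
    core = []
    for compound_id, entry in library.items():
        for control in controls:
            mol_sim = tanimoto_similarity(entry["mol_fp"], control["mol_fp"])
            scaffold_sim = tanimoto_similarity(entry["scaffold_fp"], control["scaffold_fp"])
            if mol_sim >= threshold or scaffold_sim >= threshold:
                core.append(compound_id)
                break  # one matching control is enough
    return core


# Hypothetical control compound and three-compound library.
controls = [{"mol_fp": {1, 2, 3, 4}, "scaffold_fp": {1, 2}}]
library = {
    "L1": {"mol_fp": {1, 2, 3, 4, 5}, "scaffold_fp": {1, 2, 5}},        # similar molecule
    "L2": {"mol_fp": {1, 2, 9, 10, 11, 12}, "scaffold_fp": {1, 2}},     # same scaffold only
    "L3": {"mol_fp": {20, 21}, "scaffold_fp": {20}},                    # unrelated
}
core_ids = simulate_core_set(controls, library)
print(core_ids)  # ['L1', 'L2']
```

The compounds returned by such a query would then be labeled class A before running the complement selection.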
Classification for classes B, C, and D was done with calculated physical chemical properties and substructure flags using RDKit KNIME nodes as detailed in the Supporting Information. Information for calculating confirmed hit rates for this example was taken from the Supporting Information for the original paper, which describes a selected set of 172 confirmed and cross-validated hits.23 Identifiers from the 172 compounds were matched against the compounds proposed in the screening set to assess how many of the proposed compounds belong to that set. As with all of the other case studies, the core set informed by knowledge of known ligands (using a similarity search) shows a higher hit rate than the BCD set, but both components contribute to confirmed hits.

BCD Impact. A visual demonstration of the impact of the biasing of the compound selection across the case studies is shown in Figure 1. The pie charts show the distributions of compounds by quality class in the source set, the selected screening set (diversity sets plus complement sets), and the confirmed hit list. In the bacteriology case study, class A comprises the highest-class compounds; in the other three case studies, class A comprises the core sets. In all four cases, class-A compounds get enriched (by design) in the screening set. This enrichment carries through in the confirmed hits. For the bacteriology case study, this means that the biasing was successful in producing quality confirmed hits; for the other case studies, this means that the knowledge in the core sets was relevant for the enrichment in the specific project. The three case studies with core set(s) show enrichment of class-B compounds in going from the source to the screening set,

ment, resulted in seven quality hits from the docking set and three quality hits from the BCD set, all belonging to distinct scaffolds. The team fast-tracked an initial chemotype from the docking set on the basis of (1) the binding mode hypothesis that was inherent in the knowledge that drove the docking exercise and (2) the structure−activity relationship (SAR) that was built-in from the presence of analogues in the docking set and from the intentional sampling of lower-potency but more ligand-efficient hits during the hit-list triaging. The latter compounds provided rapid insight into minimal pharmacophore features and drove an effective medicinal chemistry optimization strategy. Scaffolds from the hit-finding effort, including representatives from both the docking set and the BCD set, are ready to be picked up should a backup to the fast-tracked chemotype be desired. Not all of the quality hits have been fully characterized with respect to mechanism; those that have been characterized all bind in the same site.

As expected, the knowledge-based docking set outperforms the BCD set with respect to hit rate. In addition, all of the most potent and ligand-efficient hits of the desired mechanism were derived from the knowledge-based set. The presence of compounds from both subsets in the confirmed hit list gave the team confidence that knowledge was leveraged appropriately without giving up the opportunity for serendipitous discoveries. The BM scaffold analysis in Table 1 confirms that the BCD method is effective in sampling a complementary chemical space and that that chemical space yields valid hits for this target. The screening strategy as a whole met the objective of identification of novel chemical matter with a known mechanism.

Target 2 Case Study. The objective of the target 2 hit-finding efforts was to identify chemical matter that disrupts a protein−protein interaction. 
There was no known chemical matter with this mechanism of action (MOA) for target 2, and the primary assay was a displacement assay using a peptide of the protein partner to target 2. Quality assessment of the hits was driven by orthogonal validation in biophysical assays (SPR, NMR). The screening set design for this project used three fixed sets as a core and employed the BCD workflow to do a complement diversity selection to arrive at the desired screening set size. The components of the core included about 500 compounds that were validated hits in a related phenotypic assay that could have read out compounds with the desired MOA. On the basis of the MOA, a set of 15k compounds with privileged scaffolds was selected, which included a subset of peptide mimetics. The third part of the core was an 18k preplated diversity set that has been assayed in many projects and therefore has good annotation with respect to biological activity. The concerns about this diversity set were related to a skewed distribution of physical chemical properties (MW was restricted to the 150−350 range; clogP was restricted to the 1−3 range) and to the presence of a number of compounds with substructural motifs suspected of nonspecific interactions. The complement design criteria were therefore focused on quality from the perspective of physical chemical properties and removal of substructures suspected of nonspecific interactions (see the Supporting Information) to increase the chances of validation in orthogonal assays. After a single-concentration primary screen, hits were confirmed in a dose−response format of the biochemical assay coupled with a counterscreen to eliminate compounds that interfere with the assay. All four components of the screening set contribute to the confirmed hit list. The highest enrichment


also report that the total primary hit count in the BCD simulation would have been half that of the total primary hit count of the unbiased diversity selection or the random selection because the BCD design picked fewer compounds from the higher-hit-rate classes. Our drug discovery efforts benefit from this trade-off between hit quality and total number of hits. The use of the complement feature in BCD is aimed at effective exploration of chemical space and ensuring that the same space is not sampled twice. The BM scaffold analysis in Table 1 provides one way of looking at chemical space coverage. In all cases, there is very little redundancy in the BM scaffolds (similar numbers of BM scaffolds compared with the total number of confirmed hits) and very little overlap between the subsets. This indicates that the complementation works as expected. A similar analysis was done with the level-3 scaffolds determined using the Schuffenhauer scaffold-tree algorithm.26 The results were comparable, and are included in the Supporting Information. As a final check, chemical space coverage and complementation were assessed using Optibrium’s StarDrop Chemical Space tool27 for the three case studies with an explicit core set. That analysis (see the Supporting Information) illustrates the different ways that chemical space is complemented using BCD depending on the nature of the core sets (diverse, broadly targeted, and narrowly targeted). Figure S2 shows that the workflow effectively samples chemical space with good complementation to any core sets that are present.
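One simple way to quantify scaffold redundancy is the number of unique BM scaffolds per confirmed hit in each subset, using the counts reported in Table 1. The sketch below is only illustrative arithmetic on those published counts; the dictionary layout and variable names are assumptions. A ratio close to 1 means that almost every confirmed hit carries its own scaffold.

```python
# (confirmed hits, unique BM scaffolds) per screening subset, from Table 1.
table1 = {
    ("bacteriology", "BCD"): (40, 37),
    ("target 1", "BCD"): (119, 108),
    ("target 1", "docking"): (293, 217),
    ("target 2", "BCD"): (81, 81),
    ("target 2", "prior hits"): (10, 10),
    ("target 2", "privileged scaffolds"): (68, 64),
    ("target 2", "preplated diversity"): (72, 72),
    ("malaria", "BCD"): (31, 30),
    ("malaria", "similarity"): (5, 2),
}

# Scaffolds per confirmed hit for every subset.
scaffold_ratio = {
    subset: scaffolds / hits for subset, (hits, scaffolds) in table1.items()
}

for (case_study, subset), ratio in scaffold_ratio.items():
    print(f"{case_study:13s} {subset:22s} {ratio:.2f}")
```

The same counts could be extended with the shared-scaffold numbers in parentheses in Table 1 to express overlap between subsets as a fraction of each subset's scaffolds.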

Figure 1. Class distributions for all four case studies across compound lists; B = bacteriology, T1 = target 1, T2 = target 2, M = malaria.

which is again by design since class B is the highest quality for the diversity component of those workflows. Class-B compounds in those case studies carry through into the confirmed hit list, thereby increasing the quality of that hit list.

The malaria case study allows a comparison with the class distribution in the complete published hit list. The pie chart for that list is included in the figure at the bottom right and shows that class-B compounds are only a small proportion of the complete hit list. This data set confirms our experience that in many projects the hit rate in the most desirable compound class is lower than the hit rate in the least desirable class: the figure shows that the relative proportion of class B in the complete hit list has gone down from its proportion in the source set, whereas the relative proportion of class D has gone up. More specifically, the full library yields only seven hits belonging to class B, and the BCD design selected four of those. The size of the screening set in this case study was 16% of the size of the full library, which means that by screening 16% of the full library, 57% of the hits with a high quality profile would be retrieved. The BCD workflow therefore appears to be effective in enriching not just the knowledge-driven portion of the screening set (class A) but also the hits with a high quality profile (class B).

It is important to emphasize that the BCD design is aimed at increasing the number of quality hits and that total hit rate is not a driver. The observation that the hit rate in the class of compounds with most desirable properties is lower than the overall hit rate was also found to be true in continued internal work focused on finding inhibitors of the growth of GN pathogens. The latter allowed us to do a comparison of a BCD design simulation with simulations of unbiased diversity design and random selection to address the question of whether BCD is more effective in selecting quality hits. 
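The malaria percentages quoted above follow from simple ratios, reproduced in the sketch below. The set sizes are the rounded values given in the text (50k and 306k), so the results are approximate. The enrichment factor in the last line is an added illustrative quantity that is not reported in the paper.

```python
# Rounded set sizes and class-B hit counts for the malaria example.
screening_set_size = 50_000
source_set_size = 306_000
class_b_hits_in_library = 7
class_b_hits_selected = 4

# Fraction of the library that is screened, and fraction of class-B hits recovered.
screened_fraction = screening_set_size / source_set_size
recovered_fraction = class_b_hits_selected / class_b_hits_in_library

# Illustrative enrichment relative to what random sampling would be expected to give.
enrichment = recovered_fraction / screened_fraction

print(f"screened: {screened_fraction:.0%}")    # screened: 16%
print(f"recovered: {recovered_fraction:.0%}")  # recovered: 57%
print(f"enrichment: {enrichment:.1f}x")        # enrichment: 3.5x
```

With only seven class-B hits in the full library, these figures rest on very small counts and should be read as indicative rather than statistically firm.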
For this comparison, we analyzed a high-throughput screening (HTS) data set measuring bacterial growth inhibition of an efflux-deficient Escherichia coli strain that had a 790k compound overlap with the source set from the bacteriology case study presented in Table 1; detailed description and analysis are included in the Supporting Information. The key point from Table S1 is that the BCD selection includes about 3 times more quality primary hits than unbiased diversity selection or random selection. We



CONCLUSION

The BCD workflow supports hit-finding campaigns by selecting high-quality representatives to cover a given chemical space. Project teams formulate quality attributes for compounds they are willing to consider as starting points (inclusion vs exclusion) and attributes that are related to increased chances of success (e.g., lower logD to increase the chance of passing a cytotoxicity counterscreen). The distinctive aspect of BCD is the balancing of quality attributes with chemical diversity as well as the option to combine (multiple) knowledge-driven core sets with a quality-biased diversity complement. Those who want to use the workflow can incorporate their own attributes for quality, their preferred methods for describing diversity (fingerprints, clustering methods), and their own models and data for target or phenotype biasing in core set(s). The case studies show that the BCD workflow provides effective exploration of chemical space by complementing available knowledge with sampling of diversity. The improved quality of the resulting confirmed hits allows project teams to initiate optimization on selected chemical matter while in parallel executing SAR expansion efforts informed by computational models trained on the high-quality confirmed hits.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.9b00048. Workflow documentation; details of the analysis of the relationship between clogD and cytotoxicity for compounds that inhibit the growth of GN bacteria; details of the settings used in the case studies;




Increased Hit Rates and Enhanced Chemical Diversity. J. Chem. Inf. Model. 2015, 55, 956−62. (13) Cortés-Ciriano, I.; Firth, N. C.; Bender, A.; Watson, O. Discovering Highly Potent Molecules from an Initial Set of Inactives Using Iterative Screening. J. Chem. Inf. Model. 2018, 58, 2000−2014. (14) RDKit: Open Source ChemInformatics. https://www.rdkit.org (accessed Feb 8, 2019). (15) Ashton, M.; Barnard, J.; Casset, F.; Charlton, M.; Downs, G.; Gorse, D.; Holliday, J.; Lahana, R.; Willett, P. Identification of Diverse Database Subsets using Property-Based and Fragment-Based Molecular Descriptions. Quant. Struct.-Act. Relat. 2002, 21, 598−604. (16) Capuzzi, S. J.; Muratov, E. N.; Tropsha, A. Phantom PAINS: Problems with the Utility of Alerts for Pan-Assay INterference CompoundS. J. Chem. Inf. Model. 2017, 57, 417−427. (17) Aldrich, C.; Bertozzi, C.; Georg, G. I.; Kiessling, L.; Lindsley, C.; Liotta, D.; Merz, K. M.; Schepartz, A.; Wang, S. The Ecstasy and Agony of Assay Interference Compounds. J. Chem. Inf. Model. 2017, 57, 387−390. (18) Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39, 2887−2893. (19) Leeson, P. D.; Springthorpe, B. The Influence of Drug-Like Concepts on Decision-Making in Medicinal Chemistry. Nat. Rev. Drug Discovery 2007, 6, 881−90. (20) Hughes, J. D.; Blagg, J.; Price, D. A.; Bailey, S.; Decrescenzo, G. A.; Devraj, R. V.; Ellsworth, E.; Fobian, Y. M.; Gibbs, M. E.; Gilles, R. W.; Greene, N.; Huang, E.; Krieger-Burke, T.; Loesel, J.; Wager, T.; Whiteley, L.; Zhang, Y. Physiochemical Drug Properties Associated with in Vivo Toxicological Outcomes. Bioorg. Med. Chem. Lett. 2008, 18, 4872−5. (21) Brown, D. G.; May-Dracka, T. L.; Gagnon, M. M.; Tommasi, R. Trends and Exceptions of Physical Properties on Antibacterial Activity for Gram-Positive and Gram-Negative Pathogens. J. Med. Chem. 2014, 57, 10144−10161. (22) O’Shea, R.; Moser, H. E. 
Physicochemical Properties of Antibacterial Compounds: Implications for Drug Discovery. J. Med. Chem. 2008, 51, 2871−8. (23) Guiguemde, W. A.; Shelat, A. A.; Bouck, D.; Duffy, S.; Crowther, G. J.; Davis, P. H.; Smithson, D. C.; Connelly, M.; Clark, J.; Zhu, F.; Jimenez-Diaz, M. B.; Martinez, M. S.; Wilson, E. B.; Tripathi, A. K.; Gut, J.; Sharlow, E. R.; Bathurst, I.; El Mazouni, F.; Fowble, J. W.; Forquer, I.; McGinley, P. L.; Castro, S.; Angulo-Barturen, I.; Ferrer, S.; Rosenthal, P. J.; Derisi, J. L.; Sullivan, D. J.; Lazo, J. S.; Roos, D. S.; Riscoe, M. K.; Phillips, M. A.; Rathod, P. K.; Van Voorhis, W. C.; Avery, V. M.; Guy, R. K. Chemical Genetics of Plasmodium falciparum. Nature 2010, 465, 311−5. (24) Riniker, S.; Landrum, G. A.; Montanari, F.; Villalba, S. D.; Maier, J.; Jansen, J. M.; Walters, W. P.; Shelat, A. A. Virtual-Screening Workflow Tutorials and Prospective Results from the TeachDiscover-Treat Competition 2014 against Malaria. F1000Research 2017, 6, 1136. (25) TDT 2014 Challenge 1. http://www.tdtproject.org/challenge1---malaria-hts.html (accessed Feb 8, 2019). (26) Schuffenhauer, A.; Ertl, P.; Roggo, S.; Wetzel, S.; Koch, M. A.; Waldmann, H. The Scaffold Tree − Visualization of the Scaffold Universe by Hierarchical Scaffold Classification. J. Chem. Inf. Model. 2007, 47, 47−58. (27) StarDrop v6.5. https://www.optibrium.com/ (accessed Feb 8, 2019).

comparison of BCD, unbiased diversity, and random selections; and a chemical space analysis (PDF) The KNIME workflow, developed in version 3.3.4 on Windows and Linux (ZIP)

AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected]. Telephone: +1-510879-9548. ORCID

Johanna M. Jansen: 0000-0003-3937-6243 Present Address †

B.W.: Oric Pharmaceuticals, 240 E. Grand Avenue, South San Francisco, CA 94080, USA. Notes

The authors declare the following competing financial interest(s): Authors were employed by the Novartis Institutes for BioMedical Research while this work was conducted.



ACKNOWLEDGMENTS The authors thank Heather Hogg for creating and maintaining the bacteriology knowledgebase, Manuel Schwarze for supporting and maintaining our KNIME environment, Eric Martin for championing the concept of complement designs and helpful discussions in its application, and Folkert Reck for helpful discussions on property space for antibiotics.



REFERENCES

(1) Paricharak, S.; IJzerman, A. P.; Bender, A.; Nigsch, F. Analysis of Iterative Screening with Stepwise Compound Selection Based on Novartis In-house HTS Data. ACS Chem. Biol. 2016, 11, 1255−64. (2) Svensson, F.; Norinder, U.; Bender, A. Improving Screening Efficiency through Iterative Screening Using Docking and Conformal Prediction. J. Chem. Inf. Model. 2017, 57, 439−444. (3) Crisman, T. J.; Jenkins, J. L.; Parker, C. N.; Hill, W. A.; Bender, A.; Deng, Z.; Nettles, J. H.; Davies, J. W.; Glick, M. ″Plate Cherry Picking″: A Novel Semi-Sequential Screening Paradigm for Cheaper, Faster, Information-Rich Compound Selection. J. Biomol. Screening 2007, 12, 320−7. (4) Petrone, P. M.; Wassermann, A. M.; Lounkine, E.; Kutchukian, P.; Simms, B.; Jenkins, J.; Selzer, P.; Glick, M. Biodiversity of Small Molecules–A New Perspective in Screening Set Selection. Drug Discovery Today 2013, 18, 674−80. (5) Paricharak, S.; IJzerman, A. P.; Jenkins, J. L.; Bender, A.; Nigsch, F. Data-Driven Derivation of an ″Informer Compound Set″ for Improved Selection of Active Compounds in High-Throughput Screening. J. Chem. Inf. Model. 2016, 56, 1622−30. (6) Gilad, Y.; Nadassy, K.; Senderowitz, H. A Reliable Computational Workflow for the Selection of Optimal Screening Libraries. J. Cheminf. 2015, 7, 61. (7) Gillet, V. J. Designing Combinatorial Libraries Optimized on Multiple Objectives. Methods Mol. Biol. 2004, 275, 335−54. (8) Gillet, V. J. New Directions in Library Design and Analysis. Curr. Opin. Chem. Biol. 2008, 12, 372−378. (9) Martin, E. J.; Critchlow, R. E. Beyond Mere Diversity: Tailoring Combinatorial Libraries for Drug Discovery. J. Comb. Chem. 1999, 1, 32−45. (10) KNIMEOpen for Innovation. https://www.knime.com/ (accessed Feb 8, 2019). (11) Riniker, S.; Wang, Y.; Jenkins, J. L.; Landrum, G. A. Using Information from Historical High-Throughput Screens to Predict Active Compounds. J. Chem. Inf. Model. 2014, 54, 1880−91. (12) Maciejewski, M.; Wassermann, A. 
M.; Glick, M.; Lounkine, E. Experimental Design Strategy: Weak Reinforcement Leads to F

DOI: 10.1021/acs.jcim.9b00048 J. Chem. Inf. Model. XXXX, XXX, XXX−XXX