The Proximal Lilly Collection: Mapping, Exploring and Exploiting

This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.

Article pubs.acs.org/jcim

The Proximal Lilly Collection: Mapping, Exploring and Exploiting Feasible Chemical Space
Christos A. Nicolaou,* Ian A. Watson, Hong Hu, and Jibo Wang
Discovery Chemistry, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, Indiana 46285, United States

ABSTRACT: Venturing into the immensity of the small molecule universe to identify novel chemical structures is a much discussed objective of many methods proposed by the chemoinformatics community. To this end, numerous approaches using techniques from the fields of computational de novo design, virtual screening and reaction informatics, among others, have been proposed. Although in principle this objective is commendable, in practice there are several obstacles to useful exploitation of the chemical space. Prime among them are the sheer number of theoretically feasible compounds and the practical concern regarding the synthesizability of the chemical structures conceived using in silico methods. We present the Proximal Lilly Collection (PLC) initiative implemented at Eli Lilly and Co. with the aims of (i) defining the chemical space of small, drug-like compounds that could be synthesized using in-house resources and (ii) facilitating access to compounds in this large space for the purposes of ongoing drug discovery efforts. The implementation of the PLC relies on coupling access to available synthetic knowledge and resources with chemo/reaction informatics techniques and tools developed for this purpose. We describe in detail the computational framework supporting this initiative and elaborate on the characteristics of the PLC virtual collection of compounds. As an example of the opportunities provided to drug discovery researchers by easy access to a large, realistically feasible virtual collection such as the PLC, we describe a recent application of the technology that led to the discovery of selective kinase inhibitors.



© 2016 American Chemical Society

INTRODUCTION

A popular paradigm of drug discovery involves the exploration of large collections of compounds to identify promising hits against a target of interest. The initial set of compounds may be real, i.e., found in in-house molecular libraries or vendor collections, or virtual, found only in electronic form in virtual libraries. The initial exploration of these compound sets takes the form of experimental or virtual screening,1 with the latter being the only option in the case of virtual libraries. Promising structures are evaluated and, upon confirmation of their activity potential, may serve as a starting point for further research, either via a secondary, more focused exploration of additional sets of compounds to retrieve near neighbors and pharmacophore equivalents or via synthesis of analog compounds. In such a process, the importance of the size, diversity and, naturally, relevance of the initial set of compounds cannot be overlooked. Larger, more diverse sets are generally expected to present greater opportunity for discovering new compounds, although drug-likeness and the chemical structure property profile must always remain a concern. Given the practical limitations of expanding real compound libraries (cost, maintenance, logistics) and the sheer number of theoretically feasible compounds, the drug discovery community has invested in virtual library design and exploitation. In recent years, numerous efforts have been reported in the literature describing methods that enable the preparation of diverse or focused collections of chemical structure designs

through exploration of the chemical space defined by virtual molecules. A number of these methods attempt to map the chemical space and provide tools to investigate it as a whole or to explore certain regions. Reymond et al. attempted to exhaustively enumerate all chemical graphs of a certain size satisfying rudimentary chemical rules.2 Their efforts resulted in GDB-13 and GDB-17 as well as the development of several efficient methods for searching and visualizing chemical space.3,4 Alternative methods place emphasis on the exploration of the chemical space via sophisticated optimization techniques to generate diverse, representative subsets,5 identify compound collections focused on meeting certain criteria,6,7 or retrieve specific compounds matching a user-defined profile.3 An example of methods typically using the latter approach is de novo design, which aims to construct chemical structures meeting one or more pharmaceutically relevant computational objectives from simple atoms, bonds and/or fragments.8,9 A common practical concern hampering virtual compound library design has been the often questionable synthesizability of the proposed chemical structures. To alleviate this problem, several research groups have proposed the use of chemical reaction protocols to guide virtual compound synthesis. Consequently, in recent years, collections of reactions and reaction types have been systematically mined

Received: March 23, 2016. Published: June 10, 2016.

DOI: 10.1021/acs.jcim.6b00173 J. Chem. Inf. Model. 2016, 56, 1253−1266


Journal of Chemical Information and Modeling

space defined using BI chemistry and building blocks to retrieve the ones which can be combined to produce structures similar to the original query in Ftrees descriptor space. The authors report that BI-Claim has been instrumental in identifying active compounds for two ongoing pharmaceutical projects using literature compounds as seeds. The same group also reported the use of BI-Claim for the identification of a new class of GPR119 agonists.18

Researchers from Pfizer reported on the Pfizer Global Virtual Library (PGVL) of synthetically feasible compounds in a series of publications.19 The PGVL makes use of over 1200 combinatorial reactions and can theoretically enumerate 10^14−10^18 virtual compounds. A number of search methods for PGVL were reported in the literature by the same group, and a custom desktop software package, PGVL-Hub, has been developed to facilitate the search for structures of interest as well as details on their proposed synthetic route.20 Among the search methods reported, LEAP1 applies retrosynthetic analysis on the query molecule to identify reactions and reagents in PGVL that can produce it, followed by a search for similar reagents, enumeration of the compounds that can be made and ordinary similarity search to the original query. LEAP2 applies asymmetric similarity search21 of the query compound to a small representative subset of the PGVL consisting of so-called basis product compounds of the PGVL reactions.22 Upon identification of the similar PGVL compounds in the representative subsets, a process similar to that of LEAP1 is used to select reagents, enumerate virtual compounds and report the products most similar using similarity search. It is worth pointing out that while LEAP1 fails to retrieve any similar compounds when retrosynthetic analysis cannot decompose the query compound, LEAP2 almost always retrieves similar structures.
The authors report that LEAP2 may fail to retrieve a query molecule included in the PGVL when a significant size difference exists between it and the compounds in the representative set used. A variation of LEAP2 was also implemented by an AstraZeneca group to enable virtual high-throughput screening of their Virtual Library system, defined using synthetic protocols extracted from their corporate electronic laboratory notebook.23 Also of interest is MoBSS (Monomer-Based Similarity Searching), which exploits the presence of numerous identical substructures in a virtual combinatorial library such as the PGVL to implement a fast algorithm that calculates product atom-pair descriptors from the descriptors of the constituent fragments.24 Further speed-up of the search is provided by a prescreening step that calculates the asymmetric similarity of the cleaved monomers to the query structure and removes from further consideration all monomers substantially dissimilar to the query. Earlier, the same group had presented a search method implementation based on Ftrees-Fragment Spaces.25

The Proximal Lilly Collection. The Proximal Lilly Collection (PLC) is a large, virtual collection of compounds that are readily synthesizable using Eli Lilly robotic synthesis tools, most notably the Automated Synthesis Laboratory (ASL),26 and readily available starting materials. In the current PLC implementation, emphasis has been placed on a small number of robust reaction types commonly performed on the ASL as well as reagent collections with ample inventory in our corporate storage or available reliably from trusted vendors. A custom software system, PLC-link, has been designed and implemented to facilitate the use of PLC by pharmaceutical project teams. PLC-link supports a number of usage scenarios

from the chemical literature, and attempts to classify them into reaction types and organize them have been reported.10−12 Research efforts to exploit such reaction sets to define large virtual libraries of theoretically feasible chemical structures13 or exploit them for de novo design purposes14 and (re)scaffold-hopping15 are ongoing. In this paper, we briefly review past efforts in this field with a focus on virtual libraries of feasible chemical structures. We then introduce the Proximal Lilly Collection (PLC), a large virtual compound library consisting of structures readily synthesizable using Eli Lilly and Co. know-how, technology and starting materials. Emphasis is placed on the design and implementation of PLC-link, the computational backend which enables efficient and practical exploitation of PLC molecules. We also describe a number of usage scenarios by Lilly scientists and present an application example and results produced by the system to demonstrate its capabilities. A discussion on lessons learned, issues to be resolved, and future development directions concludes the paper.

Virtual Collections of Feasible Structures. In an effort to benefit from methodological advances and accumulated organic synthesis knowledge, as well as to capitalize fully on the considerable investments made in in-house synthetic capabilities, several pharmaceutical companies introduced virtual compound libraries of theoretically feasible compounds. In this section, we review a selection of the approaches reported in the literature in recent years. Among the early attempts was AllChem, developed by Cramer et al.13 The system described used approximately 100 reactions and 7000 reagents to generate 5 × 10^6 synthons, i.e., building blocks appropriate for use by the incorporated reaction set. Triplets of synthons could be combined to form chemical structures using the reaction rules, giving rise to a chemical space on the order of (5 × 10^6)^3, i.e., roughly 10^20 structures.
Exploitation of the AllChem space was made through filtering and searching techniques that, for example, enabled the identification of structures similar to a query molecule using topomer search methods developed by the same group.13

BI-Claim, developed by Boehringer Ingelheim, uses in-house combinatorial library generation protocols at its core to provide virtual synthesis capabilities.16 The system uses internally available reagents and in its initial implementation could theoretically enumerate 5 × 10^11 chemical structures using about 1600 scaffolds and 30 000 reagents. Searching is provided via Ftrees-Fragment Spaces (Ftrees-FS), a method designed specifically for similarity searching in combinatorial chemistry spaces, which represents molecular structures using the feature trees (Ftrees) descriptor.17 Ftrees, a special case of the reduced graph molecular descriptor, describe molecules by a tree data structure whose nodes correspond to fragments. Similarity calculation using Ftrees involves calculating the descriptors of two molecules, matching and removing directed edges of the two graphs and computing the pairwise similarity of the remaining subtrees. In Ftrees-FS, where the problem is one of retrieving molecules represented by their Ftrees fragments similar to a query, the process requires matching directed edges of the query to special, precomputed fragment links that describe fragment connectivity features. The results of the edge-link matching process are stored in a matrix subsequently used to identify the best fragments to join in order to retrieve feature trees similar to the query. In BI-Claim, when a query molecule is supplied, its feature tree is first computed; then, an edge-link-based similarity search is run against the fragment


Figure 1. Breakdown of the reactions performed on the Eli Lilly Automated Synthesis Laboratory through 2013 by reaction superclasses. During this period, more than 37 000 reactions were performed on the ASL.

The first set of LARR reactions has been selected based on the frequency of execution and success rate on the ASL. The LARR set used for the purposes of this manuscript consists of a subset of 10 annotated reaction types and is fully described in the Supporting Information (see S2). The annotation work for each reaction has been the outcome of joint efforts by synthetic chemists with experience on the ASL and computational chemists. Work on augmenting LARR is ongoing, and additional reactions and reagent rules are frequently added. Special care is exercised to ensure that all reactions added have been validated on the ASL and are sufficiently annotated to increase synthesizability odds. PLC Reagent Preparation. PLC-link consists of all the software components necessary to map PLC-DS and enumerate large subsets or specific structures based on user supplied criteria. Essentially, this means accessing structure and inventory information from Lilly corporate databases, identifying the reagents appropriate for each LARR reaction and preparing all the data structures necessary to allow efficient access to PLC-DS. In order to achieve the above PLC-link is used to prepare and maintain readily available reagent collections for each LARR reaction. The system is linked to databases of reagents available to Lilly chemists including our main corporate chemical structure database, the reagent databases at synthetic laboratories distributed around the global organization and compound collections from select vendors. A preprocessing step regularly updates the reagent collections from each source based on new additions and current inventory availability, and calculates a list of functional group-based descriptors for each structure. This step involves processing 10s−100s of thousands of reagents for each reaction from various data sources. 
A subsequent step prepares current reaction specific reagent sets, i.e., readily available reagent sets for each LARR reaction meeting the reagent filter rules for each reagent type. Optional pruning of the available reagent sets may also be applied based on e.g. physicochemical properties such as molecular weight or heavy atom count. In a final step, each current reagent set is postprocessed using the reagent transformation rules of each annotated reaction to first generate transformed reagent sets with the labeled reaction center and corresponding files of structural fingerprints and 3D conformers among others. It is

including near neighbor search, focused library design and virtual screening, described in detail below. Moreover, the Idea2Data (I2D) initiative, a comprehensive hypothesis design and evaluation process largely based on the PLC, has been put in place to coordinate stakeholders from design, synthesis, purification and testing and expedite the drug discovery process at Eli Lilly. PLC-link consists of the Lilly Annotated Reaction Repository (LARR) and a flexible virtual synthesis engine (VSE), which together define the PLC Data Space (PLC-DS). PLC-link also contains a collection of search and retrieve utilities that enable exploitation of PLC-DS. Details for each of the above components are provided in the following sections.

Lilly Annotated Reaction Repository. At the core of PLC is the Lilly Annotated Reaction Repository (LARR), which contains reactions validated, i.e., commonly performed with successful results, on Lilly automated synthesis systems with a focus on the ASL. The ASL system and the set of reactions commonly performed on it have been described in ref 26. Figure 1 presents information on the breakdown of the reactions performed on the ASL as of 2013 using reaction superclasses as defined by the NameRxn tool (version v2b80) from NextMove Software.27 To ensure synthesizability of the proposed compounds, LARR relies on the usage of an annotated reaction scheme that captures a wealth of information on each reaction, including the detailed profile of the reagents that may (or may not) be used and reagents that may result in mixtures due to multiple reactive centers. Each LARR entry consists of the core reaction, a detailed reaction description file, a set of reagent filter rules for each reagent type and reagent transformation descriptions. The core reaction is encoded in standard RXN28 or via an internal SMARTS-based schema.
Each reagent filter rule consists of a set of simpler rules describing in detail the functional groups and structural features that either must, or must not, be present for the specific reaction to be successful. The reagent transformation descriptions are simpler one-product reactions, one for each reagent type involved in the reaction. These reactions are used to process each reagent by, for example, removing the leaving groups to produce the structural component that contributes to the final reaction product, with the reaction center appropriately labeled. Figure 2 presents a detailed view of LARR and the annotated reaction scheme used.
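The must/must-not structure of a reagent filter rule can be illustrated with a minimal sketch. Everything named here is hypothetical: the `ReagentFilterRule` class, the functional group labels and the example reagents are invented for illustration, and each reagent is represented simply by a set of precomputed functional group labels rather than by the actual LARR rule format, which is not reproduced in this paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ReagentFilterRule:
    """Hypothetical rule: groups that must / must not be present in a reagent."""
    required: frozenset
    forbidden: frozenset = field(default_factory=frozenset)

    def accepts(self, groups):
        """Return True if all required groups and no forbidden group are present."""
        groups = set(groups)
        return self.required <= groups and not (self.forbidden & groups)


# Assumed rule for the acid component of an amide synthesis (illustrative only).
acid_rule = ReagentFilterRule(
    required=frozenset({"carboxylic_acid"}),
    forbidden=frozenset({"primary_amine", "acyl_halide"}),
)

# Made-up reagents, each profiled by a set of functional group labels.
reagents = {
    "R1": {"carboxylic_acid", "aryl_ring"},
    "R2": {"carboxylic_acid", "primary_amine"},  # competing reactive center
    "R3": {"alcohol"},                           # lacks the required group
}

accepted = sorted(rid for rid, groups in reagents.items() if acid_rule.accepts(groups))
print(accepted)
```

In this toy data set only `R1` passes: `R2` carries a forbidden group that could lead to mixtures, and `R3` lacks the required one.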


Figure 2. Lilly Annotated Reaction Repository consists of a collection of reactions validated on the ASL. In addition to the core reaction, Lilly annotated reactions contain (a) detailed description, (b) filter rules for each reagent type involved in the reaction and (c) reagent type transformation reactions useful in producing the reagent structures following the execution of the reaction.

compounds possible with any of the LARR reactions and their corresponding reagent sets from one or more reagent sources. Alternatively, the VSE can simply generate a sample of structures by selecting specific reagents subject to user imposed restrictions on the reagents or the final products, related to e.g. compound size or chemical structure. Note that this ability provides the foundation for the search utilities described in subsequent sections of this paper. Finally, the VSE can generate diverse samples from the entire space in a random or quasirandom manner to be used for assessing the PLC-DS properties and for virtual screening purposes. In the current production setting, users have access to several PLC-DS subsets through selection of the specific reagent source they prefer to use or a specific reaction subset. An optional final step to the VSE

worth noting that these reagent collections and their transformations are essential for subsequent PLC-link operations such as searching the PLC space or enumerating select subsets satisfying user-specified criteria. Figure 3 graphically represents the PLC reagent collection preparation process.

Virtual Synthesis Engine. A key component of PLC-link is the virtual synthesis engine (VSE), which has been designed to (i) enable the enumeration of PLC chemical structures and (ii) facilitate the querying of the PLC-DS to identify and retrieve designs satisfying user-defined criteria. PLC-DS, the entire PLC virtual compound collection, contains the complete set of feasible compounds that could be enumerated using all LARR reactions and available reagents. In practice, the VSE can be used to enumerate subsets of the full matrix of feasible virtual


Figure 3. PLC reagent preparation step includes (a) accessing all Lilly reagent sources and retrieving structure and available inventory information, (b) profiling each reagent collection with functional group descriptors, (c) generating current reagent sets for each reagent type of each LARR reaction by applying their corresponding reagent filter rules, (d) preparing each current reagent set for use by the PLC engine; includes preparation of postreaction transformation sets representing the reagents as they will be present following the reaction in 2D, 3D and fingerprint format.

Figure 4. Focused library design using the PLC. Users can supply one or more scaffolds to the system that identifies functional groups present, matches them to appropriate reactions and enumerates compound sets using the coupling reagent sets from the current PLC reagent collections.
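The scaffold-based workflow summarized in Figure 4 can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the VSE implementation: the reaction table, the reagent identifiers and the heavy atom counts are hypothetical, molecules are reduced to functional group labels and sizes (taken to be those of the fragments after leaving group removal), and the only user restriction modeled is a maximum product size.

```python
# Hypothetical reaction table: scaffold functional group -> (reaction, partner reagent type).
REACTIONS = {
    "carboxylic_acid": ("amide_synthesis", "amine"),
    "aryl_halide": ("suzuki_coupling", "boronic_acid"),
}

# Hypothetical prepared reagent sets: reagent type -> {reagent id: heavy atom count}.
REAGENT_SETS = {
    "amine": {"A1": 5, "A2": 12, "A3": 20},
    "boronic_acid": {"B1": 7, "B2": 15},
}


def design_focused_library(scaffold_groups, scaffold_heavy_atoms, max_heavy_atoms):
    """Return (reaction, reagent id, product size) for every allowed product."""
    designs = []
    for group in sorted(scaffold_groups):
        if group not in REACTIONS:
            continue  # no applicable reaction for this functional group
        reaction, partner_type = REACTIONS[group]
        for reagent_id, size in sorted(REAGENT_SETS[partner_type].items()):
            product_size = scaffold_heavy_atoms + size
            if product_size <= max_heavy_atoms:  # user-supplied size restriction
                designs.append((reaction, reagent_id, product_size))
    return designs


library = design_focused_library(
    {"carboxylic_acid"}, scaffold_heavy_atoms=14, max_heavy_atoms=30
)
print(library)
```

Each returned tuple keeps the reaction name and the reagent identifier next to the product, mirroring the synthesis details that are reported back to the user alongside each design.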

high-performance computing system and makes heavy use of parallel processing techniques, in addition to careful software design and implementation, to keep performance within reasonable time limits. As an indication, PLC subsets on the order of 10^7 compounds are often generated within minutes to address specific

structure generation process is the characterization of structures using the Lilly medicinal chemistry rules29 for subsequent elimination of those likely to prove problematic in practice. Special consideration has been given to the capacity and performance of the VSE. The system is supported by Lilly’s


Figure 5. ExactSearch algorithm. A number of the steps in the process can be run using parallel processing techniques to reduce required run time.
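One of the structure-based filters used by ExactSearch compares the combined heavy atom count of a reagent combination against a flag array of query atom counts before any product is enumerated. The sketch below illustrates that idea for a two-component reaction. It is an illustration only: the function names and reagent data are hypothetical, reagents are reduced to (heavy atom count, id) pairs, and `atoms_lost` defaults to zero on the assumption that leaving groups have already been removed.

```python
from bisect import bisect_right


def build_count_flags(target_heavy_atom_counts):
    """Flag array: index i is True if some query molecule has i heavy atoms."""
    flags = [False] * (max(target_heavy_atom_counts) + 1)
    for n in target_heavy_atom_counts:
        flags[n] = True
    return flags


def candidate_pairs(reagents_a, reagents_b, flags, atoms_lost=0):
    """Yield reagent id pairs whose combined heavy atom count matches a query.

    reagents_a and reagents_b are lists of (heavy_atom_count, reagent_id)
    tuples, sorted by heavy atom count.
    """
    largest = len(flags) - 1
    b_counts = [count for count, _ in reagents_b]
    for count_a, id_a in reagents_a:
        # Sorted reagents let us skip partners that would make the product too large.
        limit = bisect_right(b_counts, largest + atoms_lost - count_a)
        for count_b, id_b in reagents_b[:limit]:
            total = count_a + count_b - atoms_lost
            if total >= 0 and flags[total]:
                yield id_a, id_b


flags = build_count_flags([20])  # searching for a 20 heavy atom molecule
acids = [(8, "acid-1"), (12, "acid-2"), (15, "acid-3")]
amines = [(5, "amine-1"), (8, "amine-2"), (12, "amine-3")]
pairs = list(candidate_pairs(acids, amines, flags))
print(pairs)
```

Of the nine possible combinations only three survive, and only those would go on to the more expensive element count, ring, bonded pair and substructure checks.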

user requests. Virtual compound sets on the order of 10^9 compounds have also been prepared and processed as dictated by project needs. Usage scenarios for PLC generally fall into three categories: (i) focused library design (scaffold-based) for SAR exploration, (ii) similarity search to structures of interest, and (iii) enumeration of diverse sets for virtual screening purposes. Users interact with the system through a collection of search and retrieve utilities which enable them to effectively query the PLC space. In the following sections, we present these VSE utilities and describe our approach to exploring the PLC space efficiently by operating on LARR reaction and reagent similarity.

Focused Library Design (Scaffold-Based) for SAR Exploration. A common use case for PLC-link is the enumeration of all PLC compounds containing a specific scaffold. This scenario frequently occurs when project teams have identified a scaffold of interest and wish to explore its SAR landscape quickly. To this end, a VSE functionality has been implemented that allows users to import such scaffolds and derive product PLC structures by enumeration on functional groups matching any of the applicable LARR reactions. In a typical use case, the user supplies a scaffold to be used as the “core” of a focused library design. The VSE identifies potential reaction centers on the scaffold that are appropriate for reactions in LARR. Then, for each of the suitable reactions, the scaffold is matched with the appropriate reagent type. The additional reagent sets needed for each reaction are retrieved by the VSE from the current collection of previously prepared reagent sets. The VSE then proceeds to enumerate structures in line with user-supplied restrictions (e.g., size). For example, when a user supplies a scaffold containing a carboxylic acid, the reaction selection step will identify the amide synthesis reaction.
The system designates the scaffold as the set of carboxylic acids necessary for the amide reaction and retrieves the amine set from the available PLC reagents collections. The two reagent sets are then passed on to the virtual synthesis step for enumeration. The resulting amides are filtered according to user input and returned with details on how they can be synthesized including reaction name and reagent identification

information. Figure 4 illustrates the process of focused library design.

Searching the PLC. A common question asked by medicinal chemists is whether a structure hypothesis they have designed exists within the PLC. To answer this question, we have developed the ExactSearch method (see Figure 5), which, given a query chemical structure, progressively searches PLC-DS to determine whether that structure is contained and, in the event of success, reports the synthetic route(s) and reagents that can be used to produce it. The method can meet throughput expectations only by avoiding the explicit generation of large numbers of possible product molecules; strategies for limiting the number of molecules that are explicitly enumerated are therefore important. To this end, the method employs structure-based filters. For example, the atom counts of the molecules to be made are set in an array, and unless the atom count from a combination of reagents (the sum) is a nonzero array member, that combination is eliminated. Specifically, if the program is searching for a molecule with 20 heavy atoms, we set the corresponding index in an array. Then, only reagent combinations for which the sum of reagent heavy atoms (minus atoms lost in the reaction) is set in that array are considered further. This significantly reduces run times because relatively expensive molecule-specific processing is avoided. Reagents are kept sorted by atom count to assist. Similarly, by examining the element counts in the molecules coming in as search targets, we can eliminate reagents and combinations that already have too many atoms of a given type. Rings, aromatic and aliphatic, are treated similarly. In the same spirit, we compute all bonded pair (atom pair with bond type) quantities for each product and each reagent, again filtering out reagent combinations that cannot produce the target molecule.
Note that these methods depend on reagents already having leaving groups removed and knowledge of reaction transformations including the formation of extra rings. Given a set of queries, and reagents stripped of leaving groups, we can perform substructure searches, where each reagent can determine whether or not it could form a given query molecule. Because reagents are independent, this part of


length of 7 of the query molecule(s). Tversky similarity is then used to identify similar structures from any of the transformed (cleaved) available reagent collections whose AFPs have been precomputed as described above. Essentially, this process identifies all transformed reagents participating in any of the LARR reactions that could be used to generate the query molecule. This part of the AsmSearch method is conceptually similar to the prescreening step of the MoBSS approach proposed in ref 24, with notable differences in the fingerprint type and the reagent representation used. Alternatively, at a higher computational cost, substructure search methods could be used to provide the list of reagents wholly included in the query, as used in the exact match search described previously. The resulting reagent lists are then used to enumerate all feasible compounds using the LARR reactions, which, in a final step, are subjected to ordinary similarity search to the query chemical structure to select the near neighbors as per the user-supplied parameters (e.g., near neighbors satisfying a minimum similarity criterion or a fixed number of top near neighbors). The method of choice for this final AsmSearch step uses the AFP fingerprints and Tanimoto distance (1 − Tanimoto similarity). To expedite the process, parallelization has been extensively used for individual steps such as compound enumeration and similarity searching. In a typical search, the workload is distributed to tens to thousands of processors. Consequently, a single-query near neighbor search completes within 30 s on average. The resulting near neighbor list includes the chemical structure, PLC id, reaction identifier and ids of the reagents used. Figure 6 illustrates the AsmSearch process. To assess the ability of AsmSearch, an extensive retrieval rate profiling experiment was performed. A total of 1000 compounds were randomly selected from one of our diverse PLC sets that contains 18 million compounds.
New molecules were generated by multiple steps of randomly inserting, deleting or swapping atoms and by swapping fragments and bond types for each of the 1000 molecules. Duplicates from the resulting molecules were removed to ensure uniqueness and their similarity to their original 1000 PLC compounds was

the computation can be done in parallel, thus taking advantage of modern hardware architectures. For each reagent, we keep a bit vector of those products for which it might be a precursor. Once sets of plausible reagents for a given product are found, we resort to explicit enumeration. The product molecule is formed, and its unique SMILES is compared, via a hash structure, with the unique SMILES of the molecules to be found. This step could also be done in parallel. As implemented, the method is able to perform searches and retrieve identical structures from the PLC (if existing) efficiently on modern multicore computers. ExactSearch has been used extensively during our efforts to understand the chemical space covered by PLC. Table 1 in the Results section later in this paper summarizes experiments comparing known drugs and Lilly and PubChem drug-like compounds to PLC-DS using ExactSearch.

Table 1. Identical Structure Retrieval of Known Drugs and Subsets of Lilly and PubChem Drug-like Compounds to PLC

compound collection          sample size    ExactSearch results    coverage (%)
DrugBank                           6059                    191             3.2
Lilly Collection (subset)     1 000 000                231 622            23.2
PubChem                      43 629 726              8 716 672            20.0

A related use of PLC-link is near neighbor search to identify chemical structures similar to a query molecule. This process, referred to internally as Hit Expansion (HE), is commonly performed at early stages of the drug discovery process when project teams have identified a promising structure and wish to verify chemotype activity or expand the SAR knowledge around it. PLC-link provides several ways for near neighbor search, two of which are described in more detail below. The first approach, referred to as AsmSearch, relies on the asymmetric similarity between query structures and available PLC reagents. The process involves the calculation of linear atom path based fingerprints (AFP) with a default maximum

Figure 6. AsmSearch process for near neighbor search in PLC. Note that the process steps performed are inherently parallel, a feature that PLC-link exploits to reduce drastically the amount of time needed to complete a search.
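The two similarity measures used in AsmSearch, asymmetric (Tversky) similarity for reagent prescreening and Tanimoto distance for the final ranking, can be sketched as follows. This is a minimal illustration under assumptions: fingerprints are modeled as plain Python sets of feature strings rather than AFP bit vectors, and the Tversky weights (alpha = 1, beta = 0) and the 0.9 threshold are illustrative choices that are not stated in the text.

```python
def tversky(a, b, alpha, beta):
    """Tversky index of feature sets a and b."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def tanimoto_distance(a, b):
    """Tanimoto distance, i.e., 1 - Tanimoto similarity."""
    union = len(a | b)
    similarity = len(a & b) / union if union else 1.0
    return 1.0 - similarity


def prescreen(query_fp, reagent_fps, threshold=0.9):
    """Keep transformed reagents whose features are (almost) all found in the query.

    With alpha=1 and beta=0 (assumed weights) the score is the fraction of
    reagent features that also occur in the query.
    """
    return sorted(
        rid for rid, fp in reagent_fps.items()
        if tversky(fp, query_fp, alpha=1.0, beta=0.0) >= threshold
    )


# Made-up feature sets standing in for linear atom path fingerprints.
query = {"C-C", "C-N", "C=O", "N-C-C", "c:c"}
reagent_fps = {
    "frag-1": {"C-N", "C=O"},
    "frag-2": {"C-N", "S=O"},
    "frag-3": {"c:c", "C-C"},
}

kept = prescreen(query, reagent_fps)
print(kept)
print(tanimoto_distance({"a", "b", "c"}, {"b", "c", "d"}))
```

The asymmetric weighting matters because a cleaved reagent is much smaller than the query: a symmetric measure would penalize it for all the query features it lacks, whereas the asymmetric score only asks whether the reagent could be part of the query.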


Figure 7. Near neighbor retrieval rate of the AsmSearch process based on recovery results of PLC compounds known to be similar to the search process seeds. Each histogram bar corresponds to the successful recovery rate for compounds within the corresponding Tanimoto distance bin (x-axis) ranging between 0.0 and 0.15. The blue portion of the histogram corresponds to the recovery rate; the red corresponds to the percentage of similar compounds not recovered in the specific distance bin. The numbers in each bar segment correspond to the number of seeds present in that segment.
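The per-bin recovery rate plotted in Figure 7 amounts to a simple tally. The sketch below shows one way to compute it, assuming each seed is recorded as a pair of its Tanimoto distance to the parent PLC compound and a flag saying whether the parent appeared in the seed's near neighbor list. The bin width of 0.01, the function name and the sample data are assumptions made for illustration; they are not the values from the experiment.

```python
from collections import defaultdict


def recovery_by_bin(seed_results, bin_width=0.01):
    """Return {bin distance: fraction of seeds whose parent compound was recovered}.

    seed_results is an iterable of (tanimoto_distance_to_parent, parent_recovered).
    """
    counts = defaultdict(lambda: [0, 0])  # bin index -> [recovered, total]
    for distance, recovered in seed_results:
        index = int(round(distance / bin_width))
        counts[index][1] += 1
        if recovered:
            counts[index][0] += 1
    return {
        round(index * bin_width, 2): hits / total
        for index, (hits, total) in sorted(counts.items())
    }


# Made-up seed outcomes, not the data behind Figure 7.
sample = [(0.0, True), (0.0, True), (0.07, True), (0.07, False), (0.13, True)]
rates = recovery_by_bin(sample)
print(rates)
```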

computed. All compounds in the resulting data set within 0.15 Tanimoto distance of the starting PLC compounds were then placed in a “seed” data set. The AsmSearch method was then used to search PLC-DS for compounds similar to the structures in this seed data set. Following the conclusion of the search, the near neighbor list of each seed was analyzed. The success measure is the presence of the original PLC compound used to generate the seed in the seed’s near neighbor list. Figure 7 presents the result of this experiment. As shown, all compounds in distance bin 0 (original PLC compounds plus other PLC structures identical in fingerprint space) were successfully identified through the search. The success rate stayed above 97% from distance bin 0 to 0.07. As expected, as the distance increases (seeds are less similar to the original PLC compounds), the success rate decreases. It is worth noting that even at distance bin 0.13 the success rate is close to 90%. The result indicates that the AsmSearch method can be applied to search for similar compounds in the PLC space and is capable of retrieving near neighbors provided that such neighbors exist.

Users can also search PLC using as the query a structure with a PLC id. This is often the case when an initial round of investigation, for example via virtual screening on a PLC subset, has already produced some PLC structures of interest. This method, termed ExpandSearch, relies on knowledge of the exact reagents of the compound to identify near neighbors. Essentially, each query compound is decomposed into its constituent reagents based on prior knowledge, and a search for similar reagents in the current reagent collections is performed. Once the similar reagent lists for each reagent type are compiled, the VSE is used to perform a full enumeration. An ordinary similarity search against the query chemical structure is then performed to select the near neighbors according to user-supplied parameters.
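The ExpandSearch steps just described (look up the query's reagents, shortlist similar reagents per reagent type, enumerate the shortlists in full, then rank products against the query) can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the PLC-link code: fingerprints are plain sets of feature labels, the product fingerprint is taken as the union of its reagents' features in place of the VSE, and the thresholds `reagent_sim` and `max_dist` are hypothetical parameters.

```python
from itertools import product


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def expand_search(query_ids, slots, reagent_sim=0.5, max_dist=0.35):
    """query_ids: the known reagent ID of the query for each reagent type.
    slots: one {reagent_id: feature set} dict per reagent type.
    Returns [(distance, reagent_id_tuple), ...] sorted by distance."""
    query_fps = [slot[rid] for slot, rid in zip(slots, query_ids)]
    query_product = frozenset().union(*query_fps)  # toy product fingerprint

    # Step 1: shortlist reagents similar to each of the query's reagents.
    shortlists = [
        [rid for rid, fp in slot.items() if tanimoto(fp, qfp) >= reagent_sim]
        for slot, qfp in zip(slots, query_fps)
    ]

    # Step 2: full enumeration of the shortlists, then ordinary similarity
    # search of each enumerated product against the query.
    neighbours = []
    for combo in product(*shortlists):
        if combo == tuple(query_ids):
            continue  # skip the query itself
        fp = frozenset().union(*(slot[rid] for slot, rid in zip(slots, combo)))
        dist = 1.0 - tanimoto(fp, query_product)
        if dist <= max_dist:
            neighbours.append((dist, combo))
    return sorted(neighbours)


if __name__ == "__main__":
    acids = {"A1": frozenset("abc"), "A2": frozenset("abd"), "A3": frozenset("xy")}
    amines = {"B1": frozenset("mn"), "B2": frozenset("mno"), "B3": frozenset("z")}
    for dist, ids in expand_search(("A1", "B1"), [acids, amines]):
        print(f"{ids}: distance {dist:.3f}")
```

In this toy run the closest neighbors keep one of the query's own reagents and vary the other, which is consistent with the behavior the text reports for ExpandSearch.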
The method typically retrieves PLC structures that share one of the constituent reagents of the query structure, or a very similar one. It is therefore used when users would like to retrieve highly similar structures or when the goal is to explore the impact of each of the two constituent reagents on, e.g., the measured activity. ExpandSearch is conceptually similar to the LEAP1 technique described in ref 22, although our method implements by default full enumeration of the matching reagents, enabled by the heavy use of parallel processing techniques, whereas LEAP1 uses a default of 20 reagents of each type.

Diverse Library Design. As previously discussed, PLC-DS contains all chemical structures synthesizable using one of the LARR reactions and their corresponding reagent sets. As an indication, the theoretical size of the PLC using the set of the 10 most reliable reactions on the ASL is on the order of 3.5 × 10^11. This number increases dramatically when additional coupling, multicomponent and multistep reactions are included. Simply enumerating sets of this size and applying traditional virtual screening methods to identify structures of interest is prohibitively expensive in computer time. Instead, a methodology for preparing diverse PLC subsets has been implemented using only reagent space information. Note that these “standard” PLC subsets are typically sampled from the PLC-DS formed by the most reliable LARR reactions to ensure higher confidence in synthetic success. In special cases, multistep, multicomponent or nonvalidated coupling reactions can also be used to address specific research needs, with the caveat of increased synthesizability risk. To obtain diverse sets from the PLC, a number of sampling algorithms operating in reagent space have been implemented, including methods similar to the technique described in ref 5. Below, we describe the standard technique we use for diverse PLC subset generation. This technique, referred to as Reagent Space Cherry-pick Random Selection (RSCRS), has been designed to meet the divergent objectives of diversity, capacity and speed when dealing with spaces of the order of PLC-DS.
At the core of RSCRS is a simple virtual synthesis tool named Make-These-Molecules (MTM) that enables sampling from large spaces without full enumeration. Traditional virtual synthesis tools require a reaction rule and the reagent sets, and generate the full matrix of possible products. Thus, generating any subset of virtual structures using a traditional tool necessitates the enumeration of the space followed by the application of a sampling technique. Such an approach is problematic when dealing with spaces of PLC-DS size. In contrast, MTM requires as input a reaction rule, the reagent sets and a list of products-to-form IDs (P2F). Each P2F ID is a list of reagent IDs that can be used to synthesize virtually a single product using the reaction rule. The output of MTM consists of the chemical structures corresponding to each P2F ID, provided that the reagent IDs involved exist in the reagent sets supplied to MTM.

When initiated to generate a diverse PLC subset of size N using a specific reagent source, RSCRS loads all LARR reactions and prepares the appropriate reagent sets. The number of products to be formed using each reaction, N_R, is also computed, either by simply dividing N by the number of reactions in LARR or by weighting according to the frequency of each specific reaction type in PLC. In a second step, and for each reaction in LARR, a quasi-random sampling technique is applied to prepare the P2F ID sets. The technique first sorts all candidate reagent sets by heavy atom count and then selects the appropriate number of reagent IDs randomly from each set. The P2F ID sets are sent to MTM, which is executed independently for each reaction. The products of each MTM run are then combined and postprocessed to, e.g., remove duplicate chemical structures and structures failing the Lilly medicinal chemistry rules. Note that the process is heavily parallelized to decrease the time required to obtain PLC subsets. Figure 8 summarizes the RSCRS process.

Figure 8. Generating diverse PLC subsets using the RSCRS method.

Figure 9. Property distribution of two PLC subsets generated using RSCRS and internal reagents (PLC-Subset-1), external reagents (PLC-Subset-2), a Lilly collection subset (LLY) and a PubChem subset. All subsets have size 1 million and have been filtered using the Lilly medicinal chemistry rules. Additional property distributions are found in the Supporting Information (see S6).


Figure 10. Distance distribution of PLC near neighbors to DrugBank structures using AsmSearch. The search identified the closest near neighbor in PLC within 0.15 Tanimoto distance using the AFP fingerprints. A total of 1134 DrugBank structures (from a total of 6059) had a “hit”, i.e., a neighbor in PLC within the threshold used. Each histogram bar corresponds to the count of hits within the corresponding Tanimoto distance bin (x-axis).
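The zero-distance bin of Figure 10 can contain PLC structures that are not identical to the query, because a fingerprint built from linear atom paths cannot register every small structural difference. The Python sketch below illustrates the effect with a deliberately simplified path fingerprint. It is an assumption-laden toy rather than the AFP implementation: atoms carry only an element label, bond types are ignored, features are stored as a set without counts, and the maximum path length of 7 atoms is a hypothetical value.

```python
def path_fingerprint(atoms, bonds, max_atoms=7):
    """Set of linear atom paths containing up to max_atoms atoms.
    atoms: list of element labels; bonds: list of (i, j) index pairs."""
    adjacency = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        adjacency[i].append(j)
        adjacency[j].append(i)

    features = set()

    def walk(path):
        labels = [atoms[k] for k in path]
        # A path and its reverse are the same feature.
        features.add(min("-".join(labels), "-".join(reversed(labels))))
        if len(path) == max_atoms:
            return
        for nxt in adjacency[path[-1]]:
            if nxt not in path:
                walk(path + [nxt])

    for start in range(len(atoms)):
        walk([start])
    return frozenset(features)


def tanimoto_distance(a, b):
    union = len(a | b)
    return 1.0 - (len(a & b) / union if union else 1.0)


def carbon_chain(n):
    """Unbranched chain of n carbon atoms as (atoms, bonds)."""
    return ["C"] * n, [(i, i + 1) for i in range(n - 1)]


if __name__ == "__main__":
    octane = path_fingerprint(*carbon_chain(8))
    nonane = path_fingerprint(*carbon_chain(9))
    pentane = path_fingerprint(*carbon_chain(5))
    print("octane vs nonane :", tanimoto_distance(octane, nonane))
    print("octane vs pentane:", tanimoto_distance(octane, pentane))
```

Two chains that are both longer than the maximum path length produce the same feature set in this toy model and therefore sit at distance zero, although they are different molecules. An exact comparison such as ExactSearch would not report them as a match.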

Figure 11. Near neighbor distribution of Lilly and Pubchem compounds to structures in PLC.
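The RSCRS procedure summarized in Figure 8 can be sketched in Python to show how a subset is drawn without enumerating the full product matrix. The sketch is a toy under stated assumptions and not the Lilly implementation: reactions are plain functions that join strings, N is allocated uniformly across reactions, and the quasi-random step is read as "sort reagents by heavy atom count, then draw one reagent at random from each consecutive stratum", which is only one possible interpretation of the description. The function names `mtm`, `pick_spread_over_size` and `rscrs` are hypothetical, and the medicinal chemistry rule filter is omitted.

```python
import random


def mtm(reaction, reagent_sets, p2f_ids):
    """Make-These-Molecules, toy version: form only the requested products.
    reagent_sets: one {reagent_id: structure} dict per reagent slot.
    p2f_ids: tuples holding one reagent ID per slot.
    P2F IDs naming a reagent absent from the supplied sets are skipped."""
    products = []
    for p2f in p2f_ids:
        if all(rid in slot for rid, slot in zip(p2f, reagent_sets)):
            reagents = (slot[rid] for rid, slot in zip(p2f, reagent_sets))
            products.append(reaction(*reagents))
    return products


def pick_spread_over_size(heavy_atoms, n, rng):
    """Sort reagent IDs by heavy atom count, then draw one ID at random from
    each of n consecutive strata (assumed reading of the quasi-random step)."""
    ordered = sorted(heavy_atoms, key=heavy_atoms.get)
    n = min(n, len(ordered))
    size = len(ordered)
    return [rng.choice(ordered[k * size // n:(k + 1) * size // n]) for k in range(n)]


def rscrs(reactions, n_total, seed=0):
    """reactions: list of (reaction_function, slots); every slot maps
    reagent_id -> (structure, heavy_atom_count). Returns a set of products."""
    rng = random.Random(seed)
    n_r = n_total // len(reactions)  # uniform allocation of N across reactions
    subset = set()                   # a set drops duplicate structures
    for reaction, slots in reactions:
        picks = []
        for slot in slots:
            sizes = {rid: record[1] for rid, record in slot.items()}
            ids = pick_spread_over_size(sizes, n_r, rng)
            rng.shuffle(ids)         # decouple reagent sizes across slots
            picks.append(ids)
        p2f_ids = list(zip(*picks))  # each tuple is one product-to-form ID
        structures = [{rid: record[0] for rid, record in slot.items()} for slot in slots]
        subset.update(mtm(reaction, structures, p2f_ids))
    return subset


if __name__ == "__main__":
    acids = {f"A{i}": ("C" * i + "C(=O)", i + 2) for i in range(1, 7)}
    amines = {f"B{i}": ("N" + "C" * i, i + 1) for i in range(1, 7)}
    print(sorted(rscrs([(lambda acyl, amine: acyl + amine, [acids, amines])], 4)))
```

Only the products named by the P2F IDs are ever formed, so the cost scales with the requested subset size N rather than with the size of the full space. Each reaction's MTM call is independent in this sketch, so the loop over reactions could be run in parallel.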



RESULTS

Since its implementation, the PLC has been used by numerous internal projects in a number of ways. Projects in search of new hit structures have at times resorted to using the PLC standard subsets to identify novel matter via virtual screening. In this setting, the PLC has acted as a vast third source of available compounds with drug-like characteristics, the first being our internal library and the second vendor-available collections. Figure 9 presents the distribution of the number of heavy atoms and polar surface area for two PLC subsets generated using RSCRS, and subsets from the Lilly collection and PubChem. The Lilly and PubChem subsets were randomly selected from the original source. PLC-Subset-1 was enumerated using only internally available reagents whereas PLC-Subset-2 was prepared using reagents from commercial vendors. As shown, PLC compounds are on average somewhat heavier than those in the Lilly collection and PubChem. At the same time, all four sets have virtually identical polar surface area distribution plots, which indicates that the PLC sets retain key pharmaceutically relevant properties similar to those of real compound collections used in the drug discovery process.

We have also utilized the ExactSearch utility described previously to compare compound sets of interest to the PLC. Table 1 summarizes sample exact structure search results of 3 data sets against the PLC collection. The first set consists of 6059 approved, experimental and investigational drugs from DrugBank 4.3 (ref 30). The second set has 10^6 random structures from the Lilly collection, whereas the third contains the entire set of PubChem compounds (retrieved on August 31, 2015). All compound sets consist of structures satisfying the Lilly medicinal chemistry rules (ref 29) and an upper size limit of 40 heavy atoms. It is worth pointing out that the PLC contains a considerable fraction of compounds, ca. 20%, from both the Lilly collection and PubChem. This observation may offer insights into the diversity of the chemical space synthesized to date and the heavy preference shown by synthetic chemists for a small set of robust reactions (refs 10, 12). Also, note that for several query compounds multiple PLC exact matches were found, corresponding to alternative synthetic routes. The table only contains information on unique query compound matches. The search process for the Lilly and PubChem data sets was distributed on 1000 cores of the Lilly high performance computing system to reduce the overall time required. The entire search for the DrugBank compound set took just over 4 min on a single 24-core processor.

In a related experiment, we have applied AsmSearch to the same DrugBank set of 6059 compounds. The search identified


a near neighbor within 0.15 Tanimoto distance using the AFP fingerprints for 1134 of the DrugBank structures (18.7%). Figure 10 shows the distribution of the closest PLC near neighbor to each of the DrugBank structures up to the threshold used. Note that 254 DrugBank structures had a neighbor at distance zero. This number is greater by 63 compounds than the ExactSearch result shown in Table 1, because AsmSearch identifies near neighbors at distance zero that may not be identical to the query compound. This is due to the inability of AFPs, and fingerprint methods in general, to capture small differences in molecular structure (refs 31, 32). The Supporting Information (Table S5) contains a small sample of DrugBank queries and their nearest PLC neighbor with reagent structure and reaction type information.

To investigate the presence of structurally different compounds in the PLC, we have used 3 sets of 1 million compounds from the PLC, the Lilly collection and PubChem. The PLC subset was generated using the RSCRS technique. The Lilly and PubChem subsets were randomly selected from the original source. All subsets satisfy the Lilly medicinal chemistry rules (ref 29). Chemical structure fingerprints were calculated using the Morgan algorithm (size 3) as implemented in RDKit (ref 33). The distances were obtained via the Tanimoto coefficient. Figure 11 plots the distribution of the nearest Lilly and PubChem compounds to structures in the PLC subset. Note that the Lilly collection contains molecules previously made on the ASL from PLC reactions but, even so, there remains considerable diversity available. The plots clearly show that a large fraction of the PLC is substantially different from what has been previously synthesized.

The ability of virtual screening approaches to process very large chemical structure collections in little time and at minimal cost makes this technology a natural fit to PLC. The use of VS enables the identification of virtual “hits” through rank-ordering of PLC and the focusing of subsequent synthesis and evaluation efforts on a small number of the most promising structures.

More advanced projects have often employed the focused library design functionality of PLC-link to explore the structure−activity relationship (SAR) landscape around a scaffold of interest. This approach, perhaps the most frequently used PLC-link tool, allows the quick enumeration of numerous structures containing the scaffold of interest that can then be further processed computationally via, e.g., virtual screening, rank-ordered by expert chemists and forwarded for synthesis by the ASL.

Internal projects have also been using the PLC-link similarity search utilities. ExactSearch is used when a specific structure design hypothesis needs to be synthesized and tested, since a successful match can readily be forwarded to the ASL for synthesis. More frequently, AsmSearch is used by projects with some initial hit(s) in need of near neighbor exploration or when ExactSearch does not retrieve a match for an expert hypothesis that the team would like to explore. The vastness and diversity of PLC, coupled with the ease of synthesis provided by the ASL, make it an attractive source of ideas for hit discovery. Below, we describe a recent successful use of PLC similarity search.

Discovery of Selective hRIO2 Kinase Inhibitors. In a recent use, the AsmSearch tool was instrumental in the optimization of diphenpyramide, an old anti-inflammatory drug, into a more potent and selective hRIO2 kinase ligand (ref 34). Diphenpyramide is a cyclooxygenase 1 and 2 inhibitor with no previously known kinase activity. The weak affinity of diphenpyramide for the RIO2 kinase (Kd = 1.4 μM) was discovered by serendipity during the profiling of several marketed drugs on a large 456-kinase assay panel. These results also suggested a relative selectivity of diphenpyramide for RIO2 against the other kinases in the panel. Analogues of diphenpyramide were initially identified by a similarity search of the Lilly internal library and vendor collections. However, only three of the compounds retrieved met expert chemist criteria (see Figure 12, compounds 1−3) and, thus, a PLC similarity search was pursued to find more analogues.

Figure 12. Diphenpyramide and the analogs identified and screened.

The search for diphenpyramide analogs was performed on the entire PLC, including all LARR reactions and reagent sources, using the AsmSearch utility. Retrieving near neighbor structures from the PLC, estimated at the time to consist of 3.5 × 10^11 chemical structures, took just over a minute on the Lilly high-performance computing system. The top 80 most similar compounds on AFP fingerprints were provided to the project team for visual inspection. A total of 8 compounds representing 2 interesting chemotypes were selected and forwarded for synthesis on the ASL. The structures of the 8 compounds are shown in Figure 12. Compounds 4−7 and 8−11 correspond to the first and second chemotype investigated, respectively. The entire process from search to selection, synthesis, purification and testing took 5−7 days, not including the time required for reagent ordering and delivery from an external vendor. The binding affinities of the analogues were determined using KdELECT at DiscoverX with a 10-point dose response. Three analogues (compounds 7, 8 and 9) from the PLC recommended subset showed increased binding affinity for RIO2 kinase compared to diphenpyramide (Kd = 470, 520 and 160 nM, respectively). With a binding affinity of 160 nM, analogue 9 showed almost a 10-fold increase in affinity. Analogue 9 was then profiled on the 456-kinase assay panel. This analogue, obtained via a simple amide synthesis reaction (see Figure 13) on our ASL system, is to our knowledge the most selective ligand for the RIO2 kinase. The full activity profiles of the compounds are provided in the original paper. Additional details on the results and follow-up efforts can be found in ref 34.

Figure 13. Amide synthesis reaction performed on the ASL leading to PubChem compound 25348574. This compound, derived from commercially available reagents, is to the best of our knowledge the most selective ligand for the RIO2 kinase identified to date.

CONCLUSIONS

Drug discovery can be thought of as an optimization problem involving a search in chemical space for compounds satisfying numerous, often conflicting objectives (ref 8). The problem is notoriously complex and challenging, as evidenced by the limited number of drugs approved every year despite the considerable amount of resources invested by the drug discovery community. Despite the known difficulty of identifying compounds meeting all (or the majority) of the required objectives, the community firmly believes that structures with the desired profile and therapeutic potential can be designed and synthesized given enough information on the problem at hand; viewed from an optimization perspective, researchers postulate that such structures “exist” in the drug-like chemical space but have yet to be discovered. Consequently, the need for tools to access and explore further into the chemical space, as well as models to assess the potential of virtual structures, becomes imperative.

The PLC aims to contribute to addressing the above need by mapping the chemical space accessible to Eli Lilly scientists, providing easy access to it for all discovery purposes and, thereby, enabling the routine exploitation of a larger chemical space. To this end, the PLC-link computational engine has been designed to bridge the chemical synthesis know-how and advanced synthesis tools at Eli Lilly with the needs of ongoing discovery chemistry projects. Tight coupling of the PLC engine with the Lilly chemical sample management system and databases of select vendors ensures that the reagent sets used in the process are readily accessible for chemical synthesis.

Initial approaches focused on exploiting PLC for virtual screening purposes and the identification of promising hits with novel chemical structures. In this scenario, the process followed is similar to traditional virtual screening workflows supplied with PLC sample sets and supported by an HPC system to enable coping with sets on the order of 10^7−10^9. This approach proved sufficient and successful in identifying several virtual hits of interest. However, an early challenge with potentially wider implications is how to obtain computational models that reliably identify chemical structures of interest that are structurally novel relative to those in the training sets used to prepare the models. In essence, the benefit of having access to a large chemical space that contains compounds structurally diverse from known hits is compromised by the conservative, precedent-based models we have been using, which discriminate against novel, unfamiliar chemical structures. The solution to this problem remains an open research question that our team is actively pursuing.

Additional PLC uses soon emerged to support commonplace drug discovery needs such as the identification of near neighbors to a query compound. Enumerating the entire PLC to perform similarity search in the traditional ways is not practically feasible due to the virtual library size. Moreover, the difficulty of arriving at similarity values for virtual library molecules from calculations on their fragments is already established, as is the need for partial enumeration (ref 17). To address this, we implemented several efficient search methods that rely on (i) preprocessing all reagents to produce the fragments present in the product molecules following enumeration, (ii) identifying the fragments that could potentially be part of a query molecule or of molecules highly similar to it, and (iii) enumerating candidate molecules from the selected fragments and applying similarity searching on a high performance computing system. The PLC search methods have been integrated into Lilly’s in-house molecular design platform (ref 35) and through it are regularly used to identify structure designs that show promise. Similarly, methods for scaffold exploration, i.e., the complete enumeration of PLC compounds containing a given fragment for SAR or selectivity investigations, have also been prepared and made available to Lilly scientists.

Currently, the PLC is a realistic third compound source for discovery at Lilly. The primary collection consists only of virtual structures feasible using reactions validated on the ASL and readily available reagents, to improve synthesizability odds. The current size of the collection is on the order of 10^11 and, as shown previously, includes both known compounds with therapeutic potential and novel, promising structures. Interestingly, and to a certain degree surprisingly, the PLC contains a considerable portion of the PubChem and Lilly collection compounds even though it only uses a limited set of reactions. Despite its size, PLC represents a minute fraction of the drug-like chemical space. Efforts to expand it further are ongoing through increasing the number and diversity of reactions currently encoded and the collections of reagents available. To this end, the performance of new reaction attempts on the ASL and other Lilly robotic systems is closely monitored because our intent is to maintain a high synthesizability success rate. In parallel, efforts to extract knowledge from our corporate reaction database are underway with the goal of identifying candidate reactions for further validation and potential inclusion in LARR.

Coupled with the Idea2Data initiative, our goal has been to turn the PLC from a virtual collection of structures into a realistic internal compound source that can readily be used in production. Idea2Data is an effort to shorten significantly the cycle from hypothesis design to synthesis, purification and testing. This initiative requires the reorganization of multiple steps of the lead discovery process at Eli Lilly, closer cross-department cooperation and new technology development. The availability of PLC enabled the successful completion of a


several Idea2Data cycles to date and thereby proved its practical feasibility. It is the opinion of the authors that PLC, Idea2Data and similar initiatives already surfacing across the drug discovery community (refs 36, 37) will, in the future, become the norm, fueled by advances in automated synthesis technology and the need to exploit larger feasible virtual collections.

ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.6b00173. Table S1 contains a table with PubChem IDs for diphenpyramide and 11 compounds identified through PLC search, synthesized and screened for RIO2 kinase binding affinity. Table S2 contains a descriptive table with the 10 main reactions in LARR. Table S3 lists the reagent annotation files for each of the reactions in Table S2. Table S4 lists the description of the queries used to annotate candidate reagents and identify the ones appropriate for each reaction in Table S2. Table S5 contains a sample of DrugBank queries and their nearest PLC neighbor with reagent structure and reaction type information. Table S6 contains molecular property distributions for PLC subsets, the Lilly collection and PubChem (PDF).

AUTHOR INFORMATION

Corresponding Author

*C. A. Nicolaou. E-mail: [email protected]. Tel.: 317-2778287.

Notes

The authors declare no competing financial interest.

ACKNOWLEDGMENTS

The authors acknowledge the contributions of Drs. S. Sheehan and C. Humblet to the conception of the PLC idea and their strong encouragement during the implementation of the project, and of K. Burton, who has been instrumental in the design and implementation of the Idea2Data initiative. We would also like to thank Drs. T. Varin, T. Masquelin, A. Godfrey and numerous colleagues from the Computational Chemistry and Cheminformatics, Synthesis, Analytical and Purification groups at Eli Lilly and Co. for their work that led to the discovery of selective hRIO2 kinase inhibitors using PLC and for their support during the preparation of the corresponding section of this paper.

ABBREVIATIONS

PLC, Proximal Lilly Collection; I2D, Idea2Data; ASL, Automated Synthesis Laboratory; LARR, Lilly Annotated Reaction Repository; VSE, virtual synthesis engine; PLC-DS, PLC data space; SnAr, nucleophilic aromatic substitution; 2D, two-dimensional; AFP, atom path based fingerprints; RSCRS, Reagent Space Cherry-pick Random Selection; MTM, Make-These-Molecules; P2F, products-to-form; VS, virtual screening; SAR, structure−activity relationship; HPC, high performance computing; PGVL, Pfizer Global Virtual Library; MoBSS, Monomer-Based Similarity Searching

REFERENCES

(1) Ferreira, R. S.; Simeonov, A.; Jadhav, A.; Eidam, O.; Mott, B. T.; Keiser, M. J.; McKerrow, J. H.; Maloney, D. J.; Irwin, J. J.; Shoichet, B. K. Complementarity Between a Docking and a High-Throughput Screen in Discovering New Cruzain Inhibitors. J. Med. Chem. 2010, 53, 4891−4905.
(2) Reymond, J.-L.; van Deursen, R.; Blum, L. C.; Ruddigkeit, L. Chemical Space As a Source for New Drugs. MedChemComm 2010, 1, 30.
(3) van Deursen, R.; Reymond, J. L. Chemical Space Travel. ChemMedChem 2007, 2, 636−640.
(4) Awale, M.; van Deursen, R.; Reymond, J. L. Mqn-Mapplet: Visualization of Chemical Space with Interactive Maps of Drugbank, Chembl, Pubchem, Gdb-11, and Gdb-13. J. Chem. Inf. Model. 2013, 53, 509−518.
(5) Virshup, A. M.; Contreras-Garcia, J.; Wipf, P.; Yang, W.; Beratan, D. N. Stochastic Voyages Into Uncharted Chemical Space Produce a Representative Library of All Possible Drug-Like Compounds. J. Am. Chem. Soc. 2013, 135, 7296−7303.
(6) Nicolaou, C. A.; Kannas, C. C. Molecular Library Design Using Multi-Objective Optimization Methods. In Chemical Library Design; Zhou, J., Ed.; Humana Press: 2011; Vol. 685, pp 53−69.
(7) Gillet, V. J.; Willett, P.; Fleming, P. J.; Green, D. V. Designing Focused Libraries Using MOSELECT. J. Mol. Graphics Modell. 2002, 20, 491−498.
(8) Nicolaou, C. A.; Brown, N. Multi-Objective Optimization Methods in Drug Design. Drug Discovery Today: Technol. 2013, 10, e427−435.
(9) Schneider, G. Future De Novo Drug Design. Mol. Inf. 2014, 33, 397−402.
(10) Hartenfeller, M. E. M.; Meier, P.; Nieto-Oberhuber, C.; Altmann, K. H.; Schneider, G.; Jacoby, E.; Renner, S.; Eberle, M. A Collection of Robust Organic Synthesis Reactions for In Silico Molecule Design. J. Chem. Inf. Model. 2011, 51, 3093−3098.
(11) Patel, H.; Chen, B.; Gillet, V. J.; Bodkin, M. J. Knowledge-Based Approach to De Novo Design Using Reaction Vectors. J. Chem. Inf. Model. 2009, 49, 1163−1184.
(12) Roughley, S. D.; Jordan, A. M. The Medicinal Chemist’s Toolbox: An Analysis of Reactions Used in the Pursuit of Drug Candidates. J. Med. Chem. 2011, 54, 3451−3479.
(13) Cramer, R. D.; Jilek, R.; Campbell, B.; Soltanshahi, F. Allchem: Generating and Searching 10^20 Synthetically Accessible Structures. J. Comput.-Aided Mol. Des. 2007, 21, 341−350.
(14) Hartenfeller, M.; Zettl, H.; Walter, M.; Rupp, M.; Reisen, F.; Proschak, E.; Weggen, S.; Stark, H.; Schneider, G. Dogs: Reaction-Driven De Novo Design of Bioactive Compounds. PLoS Comput. Biol. 2012, 8, e1002380.
(15) Evers, A.; Hessler, G.; Wang, L. H.; Werrel, S.; Monecke, P.; Matter, H. Cross: An Efficient Workflow for Reaction-Driven Rescaffolding and Side-Chain Optimization Using Robust Chemical Reactions and Available Reagents. J. Med. Chem. 2013, 56, 4656−4670.
(16) Lessel, U.; Wellenzohn, B.; Lilienthal, M.; Claussen, H. Searching Fragment Spaces with Feature Trees. J. Chem. Inf. Model. 2009, 49, 270−279.
(17) Rarey, M.; Stahl, M. Similarity Searching in Large Combinatorial Chemistry Spaces. J. Comput.-Aided Mol. Des. 2001, 15, 497−520.
(18) Wellenzohn, B.; Lessel, U.; Beller, A.; Isambert, T.; Hoenke, C.; Nosse, B. Identification of New Potent GPR119 Agonists by Combining Virtual Screening and Combinatorial Chemistry. J. Med. Chem. 2012, 55, 11031−11041.
(19) Peng, Z. Very Large Virtual Compound Spaces: Construction, Storage and Utility in Drug Discovery. Drug Discovery Today: Technol. 2013, 10, e387−e394.
(20) Peng, Z.; Yang, B.; Mattaparti, S.; Shulok, T.; Thacher, T.; Kong, J.; Kostrowicki, J.; Hu, Q.; Na, J.; Zhou, J. Z.; Klatte, D.; Chao, B.; Ito, S.; Clark, J.; Sciammetta, N.; Coner, B.; Waller, C.; Kuki, A. Pgvl Hub: An Integrated Desktop Tool for Medicinal Chemists to Streamline Design and Synthesis of Chemical Libraries and Singleton Compounds. In Chemical Library Design; Zhou, J., Ed.; Humana Press: 2011; Vol. 685, pp 295−320.
(21) Tversky, A. Features of Similarity. Psychol. Rev. 1977, 84, 327−352.
(22) Hu, Q.; Peng, Z.; Kostrowicki, J.; Kuki, A. Leap Into the Pfizer Global Virtual Library (PGVL) Space: Creation of Readily Synthesizable Design Ideas Automatically. In Chemical Library Design; Zhou, J., Ed.; Humana Press: 2011; Vol. 685, pp 253−276.
(23) Vainio, M. J.; Kogej, T.; Raubacher, F. Automated Recycling of Chemistry for Virtual Screening and Library Design. J. Chem. Inf. Model. 2012, 52, 1777−1786.
(24) Yu, N.; Bakken, G. A. Efficient Exploration of Large Combinatorial Chemistry Spaces by Monomer-Based Similarity Searching. J. Chem. Inf. Model. 2009, 49, 745−755.
(25) Boehm, M.; Wu, T.-Y.; Claussen, H.; Lemmen, C. Similarity Searching and Scaffold Hopping in Synthetically Accessible Combinatorial Chemistry Spaces. J. Med. Chem. 2008, 51, 2468−2480.
(26) Godfrey, A. G.; Masquelin, T.; Hemmerle, H. A Remote-Controlled Adaptive Medchem Lab: An Innovative Approach to Enable Drug Discovery in the 21st Century. Drug Discovery Today 2013, 18, 795−802.
(27) Nextmove Software. www.nextmovesoftware.com (accessed February 9, 2016).
(28) Dalby, A.; Nourse, J. G.; Hounshell, W. D.; Gushurst, A. K. I.; Grier, D. L.; Leland, B. A.; Laufer, J. Description of Several Chemical Structure File Formats Used by Computer Programs Developed at Molecular Design Limited. J. Chem. Inf. Model. 1992, 32, 244−255.
(29) Bruns, R. F.; Watson, I. A. Rules for Identifying Potentially Reactive or Promiscuous Compounds. J. Med. Chem. 2012, 55, 9763−9772.
(30) Wishart, D. S.; Knox, C.; Guo, A. C.; Cheng, D.; Shrivastava, S.; Tzur, D.; Gautam, B.; Hassanali, M. Drugbank: A Knowledgebase for Drugs, Drug Actions and Drug Targets. Nucleic Acids Res. 2008, 36, D901−D906.
(31) MacCuish, J.; Nicolaou, C.; MacCuish, N. E. Ties in Proximity and Clustering Compounds. J. Chem. Inf. Comput. Sci. 2001, 41, 134−146.
(32) Flower, D. R. On the Properties of Bit String-Based Measures of Chemical Similarity. J. Chem. Inf. Comput. Sci. 1998, 38, 379−386.
(33) Landrum, G. Rdkit: Open Source Toolkit for Cheminformatics. http://www.rdkit.org/ (accessed November 11, 2015).
(34) Varin, T.; Masquelin, C. A.; Nicolaou, C. A.; Evans, D.; Vieth, M.; Godfrey, A. G. Discovery of the First Selective RIO2 Kinase Inhibitors. Biochim. Biophys. Acta, Proteins Proteomics 2015, 1854, 1630−1636.
(35) Zhang, H.; Wang, J.; Gao, C.; Nicolaou, C.; Humblet, C. MD3: A Computational Application to Support Drug Discovery Process. Natl. Meet. Am. Chem. Soc., Div. Comp. Chem., 2013; COMP 216.
(36) Chevillard, F.; Kolb, P. Scubidoo: A Large yet Screenable and Easily Searchable Database of Computationally Created Chemical Compounds Optimized toward High Likelihood of Synthetic Tractability. J. Chem. Inf. Model. 2015, 55, 1824−1835.
(37) Nicklaus, M. C. Synthetically Accessible Virtual Inventory (SAVI). https://cactus.nci.nih.gov/download/savi_download/ (accessed February 9, 2016).
