What Can We Learn from Bioactivity Data ... - ACS Publications

Oct 25, 2016 - A wealth of chemoinformatics tools, web services, and applications therefore ... can influence the further development of chemical prob...

Review

What can we learn from bioactivity data? Chemoinformatics tools and applications in chemical biology research. Lina Humbeck, and Oliver Koch ACS Chem. Biol., Just Accepted Manuscript • DOI: 10.1021/acschembio.6b00706 • Publication Date (Web): 25 Oct 2016 Downloaded from http://pubs.acs.org on October 29, 2016

Just Accepted “Just Accepted” manuscripts have been peer-reviewed and accepted for publication. They are posted online prior to technical editing, formatting for publication and author proofing. The American Chemical Society provides “Just Accepted” as a free service to the research community to expedite the dissemination of scientific material as soon as possible after acceptance. “Just Accepted” manuscripts appear in full in PDF format accompanied by an HTML abstract. “Just Accepted” manuscripts have been fully peer reviewed, but should not be considered the official version of record. They are accessible to all readers and citable by the Digital Object Identifier (DOI®). “Just Accepted” is an optional service offered to authors. Therefore, the “Just Accepted” Web site may not include all articles that will be published in the journal. After a manuscript is technically edited and formatted, it will be removed from the “Just Accepted” Web site and published as an ASAP article. Note that technical editing may introduce minor changes to the manuscript text and/or graphics which could affect content, and all legal disclaimers and ethical guidelines that apply to the journal pertain. ACS cannot be held responsible for errors or consequences arising from the use of information contained in these “Just Accepted” manuscripts.

ACS Chemical Biology is published by the American Chemical Society. 1155 Sixteenth Street N.W., Washington, DC 20036 Published by American Chemical Society. Copyright © American Chemical Society. However, no copyright claim is made to original U.S. Government works, or works produced by employees of any Commonwealth realm Crown government in the course of their duties.


ACS Chemical Biology

What can we learn from bioactivity data? Chemoinformatics tools and applications in chemical biology research.

Lina Humbeck and Oliver Koch*
Faculty of Chemistry and Chemical Biology, TU Dortmund University, Otto-Hahn-Straße 6, 44227 Dortmund, Germany
*Corresponding Author Email: [email protected], [email protected]

The ever-increasing amount of bioactivity data produced nowadays allows exhaustive data mining and knowledge discovery approaches that are changing chemical biology research. A wealth of chemoinformatics tools, web services and applications therefore exists to support the careful evaluation and analysis of experimental data, so that conclusions can be drawn that influence the further development of chemical probes and potential lead structures. This review focuses on open-source approaches that can be handled by scientists who are not familiar with computational methods and have no expert knowledge in chemoinformatics and modeling. Our aim is to present an easily manageable toolbox to support everyday laboratory work. This includes, among other things, the available bioactivity and related molecule databases as well as tools to handle and analyze in-house data.

Introduction

The era of big data has changed, and is still changing, the way small molecules that modulate protein function are developed.1 Even in academia, the hit-finding and development process accumulates a huge amount of additional data, which requires methods for data handling and data mining. Chemoinformatics-based tools can assist by facilitating the decision-making process and increasing the probability of successfully answering the scientific questions behind an experiment. In addition, publicly available bioactivity databases provide a huge amount of data that should not be ignored. This review focuses on publicly available data and tools to support scientists in chemical biology and medicinal chemistry research by providing an overview of open-source approaches. To begin, we would like to point out possible application scenarios for the tools and methods described subsequently.
The bioactivity databases can be used to search for possible targets of a new hit when a target is not known, or to identify promiscuous molecules that often occur as hits in screening campaigns. In addition, the bioactivity data of similar molecules can be analyzed simultaneously. The next step would be to extend the structure-activity relationship (SAR) by testing similar compounds. The purchasable compound libraries can be used to identify similar molecules for testing or to get ideas about what could be synthesized next. In-house workflows, e.g. generated with KNIME, can easily be implemented to perform these searches if the molecule data should not be exposed to the public domain. Tools like DataWarrior or Scaffold Hunter support the analysis of screening data to identify the most promising hit or to analyze the structure-activity relationship of a molecule series. This is done via clustering and visualization of the molecules and the corresponding bioactivity data.

Bioactivity Databases

ACS Paragon Plus Environment


Bioactivity databases store chemical data in combination with biological data, thereby linking both worlds to provide valuable information for researchers. For each newly identified hit or compound modification, these databases can be used to establish relationships to other compounds with similar structures or similar scaffolds, together with their biological activity. The main idea of these approaches relies on the premise that similar ligands show similar activity.2 Schuffenhauer et al.3 extended this approach and showed that ligand similarity is reflected in the similarity of the respective target proteins. The simplest way to get information about potential targets or off-targets is therefore a direct search within the bioactivity databases described here to check whether the molecule of interest, or a similar one, has already been measured. Nettles et al.4 studied this approach extensively and found that 2D fingerprints yield very good results for the description of molecules if the similarity is high. Thus, the problem on the way to target prediction is not the search itself, but rather the extensive analysis of the results.5

Another important task is the identification of promiscuous hits within high-throughput screening results. Promiscuous hits show false positive results based on reactivity, assay interference or aggregation and therefore often occur within other screening results. Tools and web services to predict such promiscuous hits can be found in the literature,6, 7, 8 but the available bioactivity databases can also support the identification of specific molecules. Important information about the applied assay system, the concentrations used etc. can also be carefully evaluated to verify hits. In this context it has to be mentioned that, due to the huge amount of data and automatic data curation, any bioactivity data point has to be handled with care and should be analyzed by further investigation of the underlying publications. Bajorath,9 for example, showed that the promiscuity of certain molecules and scaffolds changes massively depending on assay quality. Nevertheless, these databases provide a huge amount of valuable information for data-driven decisions, which should be included in the day-to-day workflow.

Bioactivity databases can be differentiated into databases restricted to drugs on the market (current or former), like DrugBank10, 11 and KEGG DRUG12, and those open to all small molecules, like ChEMBL13 and PubChem.14, 15 The above-mentioned warnings apply more to the latter type due to the huge amount of collected data. An overview of the databases is shown in Table 1.

DrugBank (www.drugbank.ca)

DrugBank is a highly curated and supervised database of drugs and related compounds which aims to bring together chemical, biological and clinical data. DrugBank has a rigid form of quality control which ensures that the data are verified by two persons independently. It has four major categories: FDA-approved small molecules, FDA-approved biotech drugs, nutraceutical small molecules, and experimental, illicit, withdrawn or investigational drugs.10 DrugBank contains data about the chemical nature of a drug, encompassing synthesis information, its clinical behavior including ADME-Tox data, the biological nature of its target (the DrugCards) and the Anatomical Therapeutic Chemical (ATC) classes.16 The data can be browsed with a set of search options, for example structure or 2D chemical similarity search (ChemQuery), sequence similarity search of the protein target as a BLAST search (Sequence Search), search for drug or food interactions (Interax Interaction Search) or search for similar analytical data like MS or NMR spectra (e.g., MS Search, 1D NMR search). Additionally, emphasis was placed on metabolism, and therefore drug metabolism reactions as well as drug metabolism pathways can be analyzed. The DrugBank data can also be freely downloaded from the website.
An alternative to DrugBank is DrugCentral (http://drugcentral.org/), which is available as a download and as a web service, contains drugs approved by the EMA, FDA and PMDA, and is maintained by the University of New Mexico.
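The 2D similarity searches mentioned in this section typically compare molecular fingerprints with a similarity coefficient. The following is a minimal sketch, assuming fingerprints are already available as sets of on-bit indices and using the Tanimoto coefficient as the measure; the compound names, bit indices and the cut-off of 0.5 are illustrative assumptions and do not reflect how any particular database implements its search.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(
    query: Set[int], library: Dict[str, Set[int]], cutoff: float = 0.7
) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs at or above the cut-off, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= cutoff]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints (hypothetical bit indices, not derived from real molecules).
library = {
    "compound_A": {1, 2, 3, 4},
    "compound_B": {1, 2, 3, 9},
    "compound_C": {7, 8},
}
query = {1, 2, 3, 4}

hits = similarity_search(query, library, cutoff=0.5)
print(hits)  # [('compound_A', 1.0), ('compound_B', 0.6)]
```

In practice the fingerprints would be produced by a cheminformatics toolkit, and the cut-off would be chosen with the caveat from above in mind: 2D fingerprints work best when the similarity is high.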


ChEMBL (www.ebi.ac.uk/chembl)

ChEMBL provides open access to bioactivity data, mainly from the medicinal chemistry literature. In contrast to DrugBank, it contains data on all sorts of molecules with biological activity. For drug-like molecules, descriptors such as violations of Lipinski's rule of five,17 the rule of three, the weighted Quantitative Estimate of Drug-likeness (QED) or ligand efficiencies are provided. For protein targets, ChEMBL distinguishes between “single protein” (interaction with a defined monomeric protein), “protein family” (interaction is not clearly ascribed to one protein), “protein complex” (defined interaction with a protein complex) and “protein complex group” (interaction with protein complexes of varying subunit constellations). Furthermore, a binding site definition is possible at different levels, e.g. subunits, domains or residues of proteins. As for DrugBank, quality assurance is a vital aspect. Therefore, approaches that help the user verify the data were introduced. These comprise a standardization procedure for activity types and values, the introduction of a pChEMBL value, which is the negative decadic logarithm of the molar activity value, and flags in the “data validity comment” or “potential duplicate” column, such as “non standard unit for type”. The data can be searched, among other ways, by textual searches, structural searches including substructure and similarity search, and sequence (similarity) searches. Additionally, it is possible to search the stored assays, documents or cell lines. ChEMBL is available via a web interface and as a download, e.g. in RDF or MySQL format.

A comparable database was available free of charge for academic users: the WOMBAT database (World of Molecular Bioactivity).18 Unfortunately, the company was closed in 2015 (personal communication). Tiikkainen et al. compared the estimated curation error rates of WOMBAT and ChEMBL. Despite its lower total error rate in chemical structure curation, ChEMBL has more serious errors (incorrect connectivity), whereas most chemical curation errors of the WOMBAT database are due to incorrect stereochemistry.19

BindingDB (bindingdb.org)

BindingDB is a public database of experimentally determined binding affinities for macromolecule-small molecule interactions, e.g. between a protein and a small molecule. It contains more than 1.2 million binding data points for more than 6,400 protein targets and around 550,000 small molecules. Furthermore, affinities for protein-protein, protein-peptide as well as host-guest interactions are provided.20, 21 In contrast to ChEMBL, it contains only data on defined interactions, i.e. data from phenotypic screens or single-concentration measurements are excluded due to the higher risk of misleading or erroneous data. The authors state that a set of journals not covered by other databases (https://www.bindingdb.org/bind/index.jsp) is continually curated, which is supported by a molecule overlap analysis showing that only one third of the molecules from BindingDB are also contained in ChEMBL. Data considered for integration are reviewed carefully, both automatically and, if needed, manually. BindingDB provides a wide range of searches, possible queries, tools and data sets.20 One tool, “Find My Compound’s Target”, aims to predict the target or possible off-targets of a small molecule of interest. The query compound is first compared to other compounds in the database, and the targets of these compounds are selected if the compounds’ similarity is above the chosen cut-off and the affinity is above a certain threshold. A second tool, “Find Compounds For My Target”, tries to find compounds for a specific target. BindingDB also provides a virtual compound screening tool with which the user can screen an external dataset of compounds for similar bioactivity. Access is provided by a download option (SD file, tab-separated values (TSV) or Oracle dump) or by programmatic access (RESTful API, structured URLs, or KNIME).
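The two-step filter behind a similarity-based target lookup (a similarity cut-off followed by an affinity threshold) can be illustrated with a short sketch. This is not BindingDB's implementation: the Tanimoto measure, the record layout, the default thresholds and all compound and target names are assumptions made for illustration. Affinities are converted to a negative decadic logarithm, the same scale as the pChEMBL value described for ChEMBL.

```python
import math
from typing import Dict, List, NamedTuple, Set


class Measurement(NamedTuple):
    compound: str   # key into the fingerprint dictionary
    target: str     # target name
    ki_nm: float    # binding constant in nanomolar


def p_affinity(ki_nm: float) -> float:
    """Negative decadic logarithm of a molar affinity given in nM (1 nM -> 9)."""
    return 9.0 - math.log10(ki_nm)


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def predict_targets(
    query_fp: Set[int],
    fingerprints: Dict[str, Set[int]],
    measurements: List[Measurement],
    sim_cutoff: float = 0.7,       # assumed default, chosen for illustration
    min_p_affinity: float = 6.0,   # assumed default, i.e. 1 µM or better
) -> Dict[str, float]:
    """Map each candidate target to the best similarity that supports it."""
    candidates: Dict[str, float] = {}
    for m in measurements:
        sim = tanimoto(query_fp, fingerprints[m.compound])
        if sim >= sim_cutoff and p_affinity(m.ki_nm) >= min_p_affinity:
            candidates[m.target] = max(sim, candidates.get(m.target, 0.0))
    return candidates


# Toy data (hypothetical compounds, targets and affinities).
fingerprints = {
    "lig1": {1, 2, 3, 4, 5},
    "lig2": {1, 2, 3, 4, 5, 6},
    "lig3": {10, 11},
}
measurements = [
    Measurement("lig1", "kinase_X", 12.0),       # similar and potent
    Measurement("lig2", "kinase_Y", 50000.0),    # similar but too weak
    Measurement("lig3", "protease_Z", 3.0),      # potent but dissimilar
]

predicted = predict_targets({1, 2, 3, 4, 5}, fingerprints, measurements)
print(predicted)  # {'kinase_X': 1.0}
```

Only the target that passes both filters is reported, which mirrors the logic described above: a similar compound alone is not enough if its measured affinity is weak.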


PubChem (pubchem.ncbi.nlm.nih.gov)

PubChem can be roughly divided into a bioactivity part (PubChem BioAssay) and a more chemistry-focused part, which is composed of the PubChem Substance and PubChem Compound databases. The PubChem Substance database contains all structures and their descriptions provided by the depositors, and the PubChem Compound database stores the unique chemical structures derived from these substances.14 The title of the initial publication neatly summarizes the intended application: “Pubchem – a public information system for analyzing bioactivities of small molecules”.22

Figure 1. The PubChem web services. The PubChem BioAssay and Substance Database information can be accessed from the initial project webpage (https://pubchem.ncbi.nlm.nih.gov/). The information is cross-referenced so that the substance data can be accessed from the bioassay data and vice versa.

PubChem BioAssay stores the activity data of small molecules or RNAi and contains curated parts of ChEMBL, for which flags for active or inactive compounds are assigned depending on whether the IC50, EC50 or Ki is above 50 µM or not. The data can be accessed and analyzed via a broad range of web services and tools (Figure 1).15 Besides a name, a SMILES string or a drawn structure can be used to search for an identical molecule, a similar molecule or a substructure. This leads to information about bioassay results or substance descriptions (Figure 1). Variations in assay results can be analyzed using the detailed descriptions of the performed experiments.14 The data are additionally clustered, e.g. according to the protein or gene target, the type of assay (e.g., cell-based, protein-protein interaction), an assay project or more complex kinds of relationships like target similarity or common active compounds. Additionally, PubChem provides links to patents as well as an upload tool called PubChem Upload. Apart from facilitating the upload process, it performs validation checks and has an option to hide the data from the public, e.g. until publication of the corresponding paper or patent. Another tool is PubChem3D, which generates theoretical 3D structures. Open access to all bioassay datasets is provided via FTP or download. In addition, contributing data via a submission tool is possible and appreciated.

Open PHACTS (www.openphacts.org)


Finally, we want to report on Open PHACTS, a project that unifies data from various sources in a Semantic Web context to simplify scientists' lives.23 This open-source project is a broadly conceived public-private partnership funded by the European Union under a European grant, with the mission to organize pharmacological data. This would simplify data mining approaches and hence has the chance to accelerate drug design. The main features are: a mapping process for identifiers, a design geared to answering specific pharmacological questions, tools to analyze the data, and the use of established data sources, e.g. Resource Description Framework (RDF) versions of some of the aforementioned databases like ChEMBL or DrugBank, as well as approved algorithms.24 The basic idea of Open PHACTS is to bridge data and subsequent analysis. The results of an example search for sorafenib are shown in Figure 2. In contrast to the other databases described in this review, Open PHACTS is a union of multiple databases and does not store any data itself but retrieves data from the source databases on the fly. The provenance of the data, e.g. ChemSpider and ChEMBL, can be seen in Figure 2. Hence, Open PHACTS provides a fast overview of the known data regarding a compound, replacing a manual search of the applicable databases. However, it is focused on gaining pharmacological information and corresponding issues. To personalize the Open PHACTS Discovery Platform and create new tools, a free registration is required. This project demonstrates once more the profound need for internationally accepted standards and task forces to curate data, with the ultimate goal of obtaining consistent data worldwide.
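The combination of identifier mapping, on-the-fly retrieval and provenance tracking described above can be made concrete with a small sketch. It does not use or reproduce the Open PHACTS API; the sources, identifiers, field names and values are hypothetical placeholders, and the in-memory dictionaries merely stand in for live queries to remote databases.

```python
from typing import Callable, Dict, List

# Hypothetical identifier mapping: one shared key per compound, mapped to the
# identifier that each source uses for it.
ID_MAP: Dict[str, Dict[str, str]] = {
    "compound_X": {"source_A": "A-001", "source_B": "B-042"},
}

# Hypothetical stand-ins for remote sources; a real system would query them live.
SOURCE_A = {"A-001": {"structure": "record from source A"}}
SOURCE_B = {"B-042": {"activity": "record from source B"}}

SOURCES: Dict[str, Callable[[str], Dict[str, str]]] = {
    "source_A": lambda source_id: SOURCE_A.get(source_id, {}),
    "source_B": lambda source_id: SOURCE_B.get(source_id, {}),
}


def federated_lookup(name: str) -> List[Dict[str, str]]:
    """Collect records for one compound from all sources, keeping provenance."""
    records: List[Dict[str, str]] = []
    for source_name, fetch in SOURCES.items():
        source_id = ID_MAP.get(name, {}).get(source_name)
        if source_id is None:
            continue  # this source has no mapped identifier for the compound
        for field, value in fetch(source_id).items():
            records.append(
                {"field": field, "value": value, "provenance": source_name}
            )
    return records


records = federated_lookup("compound_X")
for record in records:
    print(record)
```

Nothing is stored centrally in this sketch: each lookup resolves the identifiers and then asks every source, and each returned value carries the name of the source it came from.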

Figure 2. Open PHACTS Explorer search results for sorafenib. From the initial result table including provenances (on the left-hand side), a structure search (at the bottom) as well as an overview of pharmacology data (at the top) are reachable. The arrows lead to subsequent results for the structure search and pharmacological overview, respectively.

Further databases

In addition, databases with slightly different focus areas exist. The first is TCM Database@Taiwan,25 a database specialized in substances of traditional Chinese medicine. It is a free database which can be downloaded or searched via a web interface, although only a limited number of options are available to customize the searches. Two unique features are the classification of compounds according to traditional Chinese theories and the possibility to dock the compounds of the


database via the web interface. Next, there are databases focused on structure-based data, mainly concentrating on proteins: the Potential Drug Target Database (PDTD)26 that focuses on targets with known protein structure, BRENDA27 that focuses on enzymes and PDBbind28, 29 that collects bioactivity data for known protein structures. The last type of databases which should be mentioned here are databases storing data about pathways, e.g. Reactome30, WikiPathways31 and KEGG12. KEGG (Kyoto Encyclopedia of Genes and Genomes) provides a collection of databases and tools, e.g. a pathway database, the KEGG Orthology, gene annotation (BlastKOALA and GhostKOALA) and function mapping (BRITE) tools, drug interaction networks including a “drug interaction checker”, a drug database and a gene database.12 Pathway maps concerning metabolism, genetic information processing, environmental information processing, cellular processes, organismal systems, human diseases and drug development are provided.

| | DrugBank | ChEMBL | PubChem BioAssay | BindingDB | TCM |
|---|---|---|---|---|---|
| focus | molecules with drug-like character | bioactivity data | assay data from small molecules and RNAi; gene function and biological pathways | experimental binding affinity data | Chinese medicines |
| # molecule entries | drugs: > 8 K; targets: > 4 K; drug-target interactions: > 15 K | Compounds: > 1.5 M; Target: > 11 K; Activities: > 13.9 M | BioAssays: > 1 M; BioActivities: > 229 M; tested compounds: > 2 M; tested substances: > 3 M; RNAi BioAssays: > 75; Protein targets: > 9 K; Gene targets: > 19 K | binding data: > 1 M; proteins: > 6 K; small molecules: > 500 K | > 32 K substances |
| special features | MS and NMR data | assay annotation | linkage to related bioassays and clustering of datasets | human-computer interaction; knowledge integration; datasets for validation; KNIME node | docking tool; classification according to traditional Chinese theories |
| advantages | free; highly curated | open; large | open access | reliable data | free database of mainly natural products |
| disadvantages | focused on FDA drugs | quality | assignment of inactive or active flag partly confusing and not user-optimized | limited amount of data | specialized in Chinese medicine; slow and inconsistent web interface; no target proteins |
| website | http://www.drugbank.ca/ | https://www.ebi.ac.uk/chembl/ | http://pubchem.ncbi.nlm.nih.gov/ | http://bindingdb.org/ | http://tcm.cmu.edu.tw/ |

Table 1: Overview of bioactivity databases.


Compound collections and virtual libraries

Besides bioactivity databases, further databases exist that can be divided into several groups according to their main features. One group contains collections of purchasable compounds like ZINC32, 33, MolPort (www.molport.com), mcule (https://mcule.com/) or databases directly from the suppliers, e.g. Enamine (http://www.enamine.net/). These databases are often used for the SAR-by-catalogue approach, which means purchasing and testing molecules similar to a known hit to gain initial structure-activity relationships, or to plan derivative synthesis by analyzing the number of possible products. The second group contains databases with molecules of biological interest like ChEBI34, 35 or natural product libraries. Most of the aforementioned databases can be downloaded and searched for, e.g., a specific substance, substances with defined properties, similar structures or molecules sharing a common substructure with another given small molecule. Usually this requires tools, which are described in the analysis and visualization part. The last group contains databases that can be used to analyze molecule structures and to guide further synthesis, like GDB-17 with its artificially created virtual libraries, or to support structure-activity relationship analysis, like the Cambridge Structural Database (CSD), which contains three-dimensional structures of small molecules. An overview of the databases in this section is given in Table 2.

Purchasable Compound Collections

ZINC (ZINC Is Not Commercial, zinc15.docking.org) is a freely accessible database of more than 120 million purchasable compounds, which is continuously maintained and updated.36 Initially, this database was created to provide all purchasable drug-like compounds, but in the new version it was extended with information about drugs, metabolites, natural products and biologically annotated compounds from the literature. This information is derived from databases like ChEMBL. The molecules are preprocessed for virtual screening approaches (e.g., protonation and tautomerization) with the aim of providing biologically relevant forms. Using the web front end, it is possible to download the complete database as well as subsets generated according to user queries, vendors or properties like logP and molecular mass (see Figure 3). The recent version (ZINC15) has much more sophisticated subsets, an increased number of vendors and even some bioactivity databases included.32, 33 Furthermore, the user can execute search queries, and registered users can upload user-specific datasets, directly send emails with selected structures to request a quotation from the respective vendor, or share selected compounds with colleagues to prioritize them.33 The drawback of ZINC is the actual availability of the listed compounds, because only a quarter are available for immediate delivery.

Other compound collections, like MolPort (www.molport.com), mcule (mcule.com) or eMolecules (www.emolecules.com), contain fewer compounds (around 7 million for MolPort and eMolecules and ca. 36 million for mcule, but only 5 million with known stock amount), but there is a higher chance that these compounds can be purchased. In addition, a service is provided to collect the compounds from different vendors for delivery. These databases can be searched using a web interface or, in the case of MolPort and mcule, can also be downloaded and analyzed in-house. PubChem Substance and ChemSpider37 (www.chemspider.com/) also provide information about possible vendors.
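The property-based subsets described for ZINC (compound sets organized by logP and molecular mass) amount to a two-dimensional binning of a compound list. The sketch below shows the idea for a downloaded library analyzed in-house; the bin edges, compound names and property values are assumptions chosen for illustration and are not the boundaries ZINC itself uses.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Assumed bin edges (illustrative only).
MASS_EDGES = [200, 250, 300, 350, 400, 450, 500]   # molecular mass in Da
LOGP_EDGES = [-1, 0, 1, 2, 3, 4, 5]                # calculated logP


def bin_index(value: float, edges: Sequence[float]) -> int:
    """Index of the bin a value falls into; 0 means below the first edge."""
    return bisect_right(edges, value)


def build_subsets(
    compounds: List[Tuple[str, float, float]]
) -> Dict[Tuple[int, int], List[str]]:
    """Group (name, mass, logP) records by (mass bin, logP bin)."""
    subsets: Dict[Tuple[int, int], List[str]] = defaultdict(list)
    for name, mass, logp in compounds:
        key = (bin_index(mass, MASS_EDGES), bin_index(logp, LOGP_EDGES))
        subsets[key].append(name)
    return dict(subsets)


# Toy records (hypothetical names and property values).
compounds = [
    ("cpd_1", 180.2, 1.2),
    ("cpd_2", 310.4, 2.9),
    ("cpd_3", 325.0, 2.1),
]

subsets = build_subsets(compounds)
print(subsets)  # {(0, 3): ['cpd_1'], (3, 4): ['cpd_2', 'cpd_3']}
```

Each key identifies one cell of the mass/logP grid, so a single subset can be selected and processed on its own, analogous to downloading one subset from the web front end.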


Figure 3. The ZINC15 compound collection. Overview of compound sets based on logP and molecular mass. Each subset can be selected and downloaded individually.

Collections of biological interest

ChEBI (Chemical Entities of Biological Interest) focuses on small molecules with biological relevance (https://www.ebi.ac.uk/chebi/) and was initiated and is still maintained by the European Bioinformatics Institute.34 The notion “biologically relevant” means that the molecules are present in living organisms or capable of intervening in their processes.35 Besides its database function, it provides an ontology, which is divided into three sub-ontologies: molecular structure ontology, subatomic particle ontology, and role ontology (which consists of the initial sub-ontologies biological role and application).38 Small molecules are annotated with references, additional databases, e.g. NMRShiftDB or patents, and are separated into different classes like natural products (synonymous with secondary metabolites).35 More than 45,000 annotated chemical entities are part of ChEBI. Each entry is manually annotated before its release. Users are encouraged to create an account and submit chemical entities they find missing through the submission tool.34 PubChem includes all data from ChEBI,35 whereas ChEBI has incorporated data from ChEMBL and PDBeChem.38 Downloads of the complete database or, e.g., only the manually annotated entries are available through ftp://ftp.ebi.ac.uk/pub/databases/chebi/SDF/. The main advantages are that no proprietary data or data sources are used, that entries are fully traceable and referenced to their source, and that the data are completely available through e.g. MySQL dumps.34

Super Natural II is a public resource for natural compounds. Natural products are likely to have bioactivity due to their interaction with multiple proteins, e.g. during biosynthesis or to fulfil their biological function.39 Both an advantage and a drawback is their higher complexity in terms of number of stereocenters, diversity or flexibility, which makes them less tractable by chemical synthesis but provides them with bioactivity through specific 3D shapes. Some practical features are the ability to preselect results, e.g. to obtain only purchasable compounds, a predicted toxicity class according to the GHS (Globally Harmonized System) and a target search tool. Additionally, substructure and similarity searches are possible through the web interface. It would be useful to be able to access the entire database offline as well, e.g. through a download file.


Further Databases for structure analysis or chemical synthesis guidance

GDB-17 is a virtual library of theoretical molecules with up to 17 atoms created by Reymond and coworkers.40 It was created by enumerating chemically feasible combinations of C, N, O and S. This collection highlights the blank areas on the chemical map when compared with the chemical space of synthesized compounds and pinpoints areas of interest for future synthesis strategies. A subset of 50 million compounds as well as a lead-like subset can be obtained from http://www.gdb.unibe.ch. Additionally, it is possible to search the entire database through a web interface. This dataset will become even more important in the near future as computational power grows and enables virtual screening of the entire library. Nevertheless, even sophisticated cheminformatics tools are so far not able to reliably predict synthesizability, which necessitates a plausibility check before starting a wet lab synthesis campaign.41

CSD (Cambridge Structural Database) harbors 3D structures of small molecules and aims to shed light on preferred conformations. Since 1965, structures determined by X-ray or neutron diffraction have been stored, processed and curated by the Cambridge Crystallographic Data Centre (CCDC) and finally deposited in the CSD.42 It is an almost complete collection of 3D crystal structures with more than 800,000 structures. The CSD System is provided for retrieving the desired information from the CSD and analyzing the structures. This system can be used to access the available 3D conformations, not only for analyzing whole molecules but also to retrieve statistics about specific fragments, e.g. the possible torsion angle ranges of a specific substitution pattern. This can help to rationalize structure-activity relationships that depend on specific molecule conformations. Although it is a commercial library, it is included in this review because a huge number of academic institutions can access this database (although many research groups are not aware of this). In 2015, more than 1,200 academic institutions in 80 countries had access, including countries like France or Spain that have a countrywide license (CCDC, personal communication). In addition, a structure service is provided where anyone can gain free access to individual crystal structures linked from structural publications.
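The torsion angle statistics mentioned for the CSD rest on one basic calculation: the dihedral angle defined by four consecutive atoms. As a minimal sketch (plain Python, independent of the CSD System and its tools), the angle can be computed from Cartesian coordinates with the standard atan2 formulation; the coordinates below are idealized toy values, not crystal structure data.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def _sub(a: Sequence[float], b: Sequence[float]) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a: Sequence[float], b: Sequence[float]) -> Vec:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def torsion_angle(p0, p1, p2, p3) -> float:
    """Signed torsion angle (degrees) around the p1-p2 bond for atoms p0..p3."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Idealized geometries: central bond along z, first atom on the x axis.
p0, p1, p2 = (1, 0, 0), (0, 0, 0), (0, 0, 1)
conformers = {
    "eclipsed": (1, 0, 1),
    "perpendicular": (0, 1, 1),
    "anti": (-1, 0, 1),
}

angles: List[float] = [torsion_angle(p0, p1, p2, p3) for p3 in conformers.values()]
print(dict(zip(conformers, angles)))
print("observed range:", min(angles), "to", max(angles))
```

Applied to the same four-atom fragment across many crystal structures, such values yield the distribution from which preferred torsion angle ranges can be read.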

| | ZINC | ChEBI | CSD | PubChem | GDB-17 | Super Natural |
|---|---|---|---|---|---|---|
| focus | providing molecules in biologically relevant forms which are purchasable | small molecules with biological relevance | 3D crystal structure of small molecules | small molecules and RNAi | enumeration of possible small molecules | purchasable natural compounds |
| # molecule entries | > 35 M | > 50 K | > 800 K | > 280 M (compounds > 82 M, substances > 198 M) | > 166 bn | > 325 K |
| special features | prepared subsets for different logP-molecular mass combinations | submission tool; ontology | tools for searches; not a relational database | PubChem 3D | theoretical; lead-like subsets available | target prediction |
| advantages | free; ready-to-dock database | no proprietary data or data sources; entries are traceable to source; completely available data | approximate completeness; high quality | huge collection | diverse new chemical space | diverse with biological impact |
| disadvantages | impossible to have complete purchasability | small set | commercial, but available for a huge amount of academic institutions | | not always synthesizable | complex structures, hence derivatives are probably hard to synthesize; no download |
| website | http://zinc15.docking.org/ | https://www.ebi.ac.uk/chebi/ | http://ccdc.cam.ac.uk/products/csd/ | http://pubchem.ncbi.nlm.nih.gov/ | http://www.gdb.unibe.ch/ | http://bioinf-applied.charite.de/supernatural_new |

Table 2. Overview of compound libraries.

Analysis and Visualization

After data have been gathered, e.g. from the previously mentioned databases, screening experiments or synthesized molecule series, the sheer wealth of data means that meaningful information or patterns can only be detected reliably with powerful, stable and dependable tools. Key features of such tools are an interactive interface that is intuitive for beginners and occasional users (which also implies mature error handling), a broad spectrum of applications, stability, reliability, sustained maintenance of the software (e.g. bug fixing and implementation of recent algorithms), and simple installation, maintenance and use. Basic operations that every cheminformatics tool should offer are import and export functions, property calculation and molecule depiction. Four tools that are free for academic use, namely DataWarrior, MONA 2, Screening Assistant 2 (SA2) and Scaffold Hunter, are described in more detail below. All but MONA 2 are open-source tools. The GUIs of these tools are depicted in Figure 4 and compared in Table 3.

DataWarrior

DataWarrior is a comprehensive chemoinformatics tool that addresses questions arising at different stages of drug discovery. It was developed by the drug discovery department of Actelion Ltd. but is nevertheless open-source software.43 It supports the user through interactive exploration and visualization of chemical space, using cheminformatics algorithms, physicochemical property prediction and multivariate data analysis. The main features of this software are: methods to reduce multidimensional data to two dimensions, such as the well-known Self-Organizing Maps (SOMs)44 and Principal Component Analysis (PCA)44 as well as a novel method called Rubber Band Scaling (2D-RBS); a method to visualize and analyze activity cliffs based on the Structure-Activity Landscape Index (SALI)45; and a scaffold-based analysis. The developers of DataWarrior also emphasize efficient storage through a compact file format that represents chemical structures as canonical strings. To allow rapid interaction with the interface, the imported files are held in memory, which is not a major drawback given advances in hardware and the relatively small memory usage per molecule (1 GB of RAM for one million molecules).43

MONA 2

MONA 2 is characterized by a focus on set operations, like merging sets or presenting only the intersection of sets, which are especially helpful for tracing changes in datasets.46 Besides the general features, MONA 2 provides a clustering method, a SMARTS search, a filtering option and an elaborate alignment procedure. During molecule loading, the most frequently used properties, such as number of heavy atoms, H-bond donors/acceptors, molecular mass or logP estimates, are calculated directly. Many other properties, like ring counts or Topological Polar Surface Area (TPSA), can be calculated in a subsequent step or imported from the provided molecule files. An interesting feature is asynchronous job calculation, which allows analysis to continue while a time-consuming task runs in the background.47 The development of MONA 2 focuses on ease of use: it can be installed without a dedicated database server because it uses an SQLite database that stores all information in a single file. Hilbig et al.46,47 state that it is possible to work comfortably with sets of up to 1 million molecules. In their publication, the authors discuss several possible applications, such as compound filtering based on physicochemical properties or chemical patterns, or analyzing datasets by chemical similarity clustering. Another noteworthy design choice is that the tool asks the user, instead of guessing, which level of isomerism should be treated as identical, e.g. whether a stereocenter is set to an ambiguous state because stereoisomerism is irrelevant for the user's purposes. Nevertheless, one drawback in contrast to all other presented tools is that registration is required.
Another drawback is that MONA 2 does not provide a sample dataset, in contrast to DataWarrior and Screening Assistant 2.

Screening Assistant 2 (SA2)

Screening Assistant 2 is an open-source, Java-based software for visualizing and analyzing chemical data that can handle huge screening libraries of about 15 million compounds. For this, SA2 uses a MySQL database, so a running MySQL server is required. Like all other presented tools, it supports set operations and the creation of new subsets. One possible application of Screening Assistant 2 is to manage the provenance of molecules in a screening library and to create a new subset of compounds to be purchased and biochemically evaluated. As an interesting feature, it automatically flags molecules that contain substructures known as PAINS (Pan Assay INterference Compounds).48,49 Screening Assistant 2 introduces four interesting concepts: providers, libraries, scaffolds and frameworks. The provider concept enables the user to link molecules directly to a source, e.g. a vendor, whereas the library concept represents a subset of a database, e.g. all bioactive compounds. A promising idea suggested by the developers of SA2 is to integrate ontologies into the tool, which would be valuable for maintenance and for gaining uniformity and consistency.

Scaffold Hunter

Scaffold Hunter is a Java-based open-source software that focuses on data visualization to help the user form hypotheses and discover structures, i.e. patterns. Its main concept is to reduce the chemical space by reducing molecules to scaffolds, i.e. their core substructures.50 To this end, molecules are pruned stepwise, starting with the removal of all terminal side chains except exocyclic double bonds.


Afterwards, different levels of scaffolds are generated by removing the least characteristic ring system at each level, as defined by a customizable set of chemistry and medicinal chemistry rules, e.g. that the remaining ring system stays connected.51 During this process, so-called virtual scaffolds, i.e. scaffolds that have no molecule representatives in the data set, are frequently generated; they indicate gaps in synthetic strategies or in the selected data set. The power of analyzing such virtual scaffolds was demonstrated by Wetzel et al. (2009),52 who identified novel pyruvate kinase modulators that exhibited such a virtual scaffold. Thus, Scaffold Hunter, and in particular the subsequent analysis of virtual scaffolds, could be used for scaffold hopping approaches. Another basic concept of Scaffold Hunter is to arrange the data in different ways, called views. The user can analyze the same data in different views, e.g. a scaffold-based one like the Scaffold Tree View or a clustered one like the HeatMap View, or compare distinct data within the same view, e.g. the Cloud View. The current version supports the following views: Scaffold Tree, TreeMap, Table, Dendrogram, HeatMap, Plot and Cloud. Selected molecules are marked globally, i.e. in each view.53 The views are based either on the scaffold tree or on molecular-fingerprint-based similarity (see http://scaffoldhunter.sourceforge.net/ and an upcoming publication for details). Like Screening Assistant 2, Scaffold Hunter relies on a running MySQL server for managing larger compound data sets, but it also supports a file-system-based database (HSQLDB), which is simple to set up and sufficient for occasional usage, small data sets and testing. Additionally, the next release will contain a sample data set for testing.
Scaffold Hunter provides different filtering methods that, combined with the subset generation and set operation features, can narrow the chemical space down to its most favorable parts. Apart from its high flexibility, which is not only due to being open source, Scaffold Hunter has also integrated some state-of-the-art algorithms, like the Sequential Agglomerative Hierarchical Non-overlapping (SAHN) clustering algorithm and its heuristic twin.54
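The first pruning step, stripping terminal side chains until only rings and the linkers between them remain, can be illustrated on a plain molecular graph. The sketch below is a simplified illustration and not Scaffold Hunter's implementation: it ignores bond orders, so the exception for exocyclic double bonds is not modeled, and it does not perform the subsequent rule-based ring removal. The function names and the example molecules are invented.

```python
def prune_to_framework(bonds):
    """Iteratively remove terminal atoms; return the ids of the remaining atoms."""
    neighbors = {}
    for a, b in bonds:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Atoms with at most one neighbor are terminal and can be stripped.
    terminal = [atom for atom, nbrs in neighbors.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in neighbors:
            continue  # already removed earlier
        for nbr in neighbors.pop(atom):
            neighbors[nbr].discard(atom)
            if len(neighbors[nbr]) <= 1:
                terminal.append(nbr)  # removal exposed a new terminal atom
    return set(neighbors)


def ring(start, size=6):
    """Bonds of a simple ring with consecutive atom ids (helper for the examples)."""
    return [(start + i, start + (i + 1) % size) for i in range(size)]


# Ring 0-5 with a two-atom side chain: only the ring survives.
print(sorted(prune_to_framework(ring(0) + [(0, 6), (6, 7)])))  # [0, 1, 2, 3, 4, 5]
```

Two rings joined by a linker atom keep the linker, because it is never terminal, whereas an acyclic molecule is pruned away completely.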

focus
  Scaffold Hunter: scaffolds
  MONA 2: set operations
  DataWarrior: support of all stages of drug development
  Screening Assistant 2: screening libraries

# max. molecules
  MONA 2: 1 M
  Screening Assistant 2: 15 M

special features
  Scaffold Hunter: scaffold tree; virtual scaffolds; cloud view
  MONA 2: variety of subset operations
  DataWarrior: combinatorial and evolutionary library generation; import data from clipboard
  Screening Assistant 2: scaffold and framework concept; simple personalization and integration of extensions possible

advantages
  Scaffold Hunter: sessions can be stored with all current settings
  MONA 2: possibility to select level of similarity
  DataWarrior: a variety of integrated fingerprints
  Screening Assistant 2: compatible with huge data sets

disadvantages
  Scaffold Hunter: needs MySQL for better performance
  MONA 2: needs sudo or installation via root; registration required for download
  DataWarrior: eventually high RAM usage; needs root access for installation
  Screening Assistant 2: needs MySQL server; limited file formats for molecule import

clustering subset supported operating systems

Yes Yes • Linux • Windows • Mac OS X

Yes Yes • Linux • Windows

Yes yes • Linux • Windows • Mac OS


Yes • Windows • Linux • Mac OS


(not MacOSX)

(not MacOSX) JAVA runtime environment version >= 1.5 MySQL server

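system requirements
  CPU: >= 2 GHz; RAM: >= 2 GB, better >= 4 GB; hard drive: >= 40 MB; display resolution: > 1280*1024

programming language
  Scaffold Hunter: Java
  MONA 2: C++
  DataWarrior: Java
  Screening Assistant 2: Java

open-source?
  Scaffold Hunter: yes
  MONA 2: no
  DataWarrior: yes
  Screening Assistant 2: yes

website
  Scaffold Hunter: http://scaffoldhunter.sourceforge.net/
  MONA 2: http://www.zbh.uni-hamburg.de/mona
  DataWarrior: http://www.openmolecules.org/datawarrior/
  Screening Assistant 2: http://sa2.sourceforge.net/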

Table 3. Overview of tools for visualization and analysis of small molecule bioactivity data.

Figure 4. Data analysis tools. Comparison of the GUIs of Scaffold Hunter (A), MONA 2 (B), DataWarrior (C) and SA2 (D).

Scientific workflow systems

Scientific workflow systems are used to handle data in a highly flexible and easy manner, so that they are intuitive for non-specialists and broadly applicable. A process can be created and monitored via a visual, interactive interface. Each task is represented by a component, also called a node, with input and/or output ports through which the initial data or the results are passed. The user connects these components with pipelines, also called edges or connections, to transport data, e.g. the result of one component to the next. Figure 5 and Figure 6 show example workflows created with KNIME and Taverna, respectively. Table 4 summarizes differences and commonalities.
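The node-and-connection idea can be mimicked in a few lines of Python. This is only a conceptual sketch and does not use the KNIME or Taverna APIs: the Node class, the three example tasks and the molecular masses are all invented, and storing each node's result after its first execution is a design choice of this sketch.

```python
class Node:
    """One workflow task: a function plus the upstream nodes feeding its inputs."""

    def __init__(self, name, task, inputs=()):
        self.name = name
        self.task = task
        self.inputs = list(inputs)  # incoming connections (edges)
        self._result = None
        self._done = False

    def run(self):
        """Run upstream nodes first, then this node; reuse a stored result."""
        if not self._done:
            args = [node.run() for node in self.inputs]
            self._result = self.task(*args)
            self._done = True
        return self._result


# Hypothetical three-node pipeline: reader -> mass filter -> counter
reader = Node("reader", lambda: [120.5, 480.2, 610.9, 350.0])  # invented molecular masses
mass_filter = Node("mass filter", lambda masses: [m for m in masses if m <= 500], [reader])
counter = Node("counter", len, [mass_filter])

print(counter.run())  # 3
```

Calling run() on the last node pulls the data through all connected upstream nodes, which is the role the edges play in a graphical workbench.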


KNIME (Konstanz Information Miner)

KNIME is an open-source software that offers a modular framework for data handling and analysis. It is free of charge for academia and non-profit organizations and provides a graphical workbench in which the user connects tasks encapsulated in nodes, such as data I/O, data manipulation, data transformation, mining algorithms and visualization.55 The framework is highly flexible because users can integrate their own nodes and external tools. The WEKA toolkit, the R environment, BIRT and JFreeChart are already integrated. One specific feature of KNIME is that data are only passed from one node to the next once the task at the previous node has completed for all input data. The results are therefore stored permanently at each node, which makes it possible to stop, modify and restart the workflow without re-executing nodes that have already been processed.56 Additionally, KNIME provides a highlighting strategy to trace the data flow and supports loops. The KNIME developers place importance on an intuitive, flexible interface that can handle large amounts of data. Figure 5 shows an example KNIME workflow. Part A of the workflow reads molecules and transforms them into a format that KNIME can process. Part B searches the MolPort database for similar molecules, e.g. to plan SAR studies using purchasable compounds. Part C searches the ChEMBL database for similar molecules, e.g. for target prediction or identification of promiscuous molecules, and also determines whether these molecules are purchasable. Part D then compares both results using a fingerprint-based similarity measure, e.g. MACCS fingerprints and Tanimoto similarity.57 The final files contain information about similar molecules with known activity and their purchasability, as well as purchasable compounds without known activity.
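The fingerprint comparison in part D boils down to the Tanimoto coefficient, i.e. the number of bits set in both fingerprints divided by the number of bits set in at least one of them. A minimal sketch follows, assuming fingerprints are given as sets of on-bit positions; the bit positions and compound names are invented and are not real MACCS keys.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


query = {1, 4, 9, 17, 42}
library = {
    "cmpd_A": {1, 4, 9, 17, 42},  # identical fingerprint
    "cmpd_B": {1, 4, 9, 50},      # shares three of six distinct bits
    "cmpd_C": {2, 3},             # no bits in common
}

# Rank library compounds by decreasing similarity to the query.
ranked = sorted(library, key=lambda name: tanimoto(query, library[name]), reverse=True)
for name in ranked:
    print(name, tanimoto(query, library[name]))
```

A similarity search then simply keeps the compounds whose coefficient exceeds a chosen threshold.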
Taverna

Taverna is an open-source workflow environment that initially focused on combining bioinformatics web services. It provides a standalone workbench for creating new workflows and editing existing ones locally. Furthermore, a Taverna Server exists that can execute established workflows and is available through https://portal.biovel.eu/ without requiring a specific infrastructure or an installation. Compared with KNIME, Taverna is less intuitive. However, a rich repository of ready-to-use workflows is available through http://www.myexperiment.org, and Taverna provides plugins for different purposes, like CDK-Taverna for cheminformatics tasks.58 An example workflow is shown in Figure 6.

focus
  Taverna workbench: bioinformatics

special features
  KNIME: all input data is handled before being passed to a successor node; support for loops
  Taverna workbench: use of web services

advantages
  KNIME: clear structure
  Taverna workbench: no installation needed for Taverna Server; ready-to-use workflows available

disadvantages
  Taverna workbench: less intuitive

clustering
  KNIME: yes
  Taverna workbench: yes

supported operating systems
  KNIME: Linux, Windows, Mac OS X
  Taverna workbench: Linux, Windows, Mac OS X

programming language
  KNIME: Java
  Taverna workbench: Java

open-source?
  KNIME: yes
  Taverna workbench: yes

website
  KNIME: https://www.knime.org
  Taverna workbench: https://taverna.incubator.apache.org/

Table 4. Overview of workflow system tools.

Figure 5. An example KNIME workflow. It reads an SD file containing molecules, e.g. hits from a primary screen, and afterwards searches for similar molecules in ChEMBL or in a library of commercially available compounds (here MolPort). Important nodes are the SD-file reader (1), the ChEMBL Connector (2) and the MolPort Node (3) to analyze these databases, the CSV reader (4) to save molecules as SMILES and the Fingerprint Similarity node (5) to compare molecules.

Figure 6. Example workflow created by Taverna. It filters an imported SD-file according to the rule of five.
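A rule-of-five filter of this kind checks four properties against the thresholds proposed by Lipinski et al.17: molecular mass of at most 500, logP of at most 5, at most 5 H-bond donors and at most 10 H-bond acceptors. The sketch below is not the Taverna workflow itself; it assumes the properties have already been calculated, and the compound names, property values and the strict default of zero allowed violations are illustrative choices.

```python
RULE_OF_FIVE_LIMITS = {
    "molecular_mass": 500.0,  # Da
    "logp": 5.0,
    "h_bond_donors": 5,
    "h_bond_acceptors": 10,
}


def rule_of_five_violations(properties):
    """Return the names of all properties that exceed their limit."""
    return [name for name, limit in RULE_OF_FIVE_LIMITS.items()
            if properties[name] > limit]


def passes_rule_of_five(properties, max_violations=0):
    """True if the number of violations does not exceed max_violations."""
    return len(rule_of_five_violations(properties)) <= max_violations


# Invented example compounds with precomputed properties.
compounds = {
    "example_1": {"molecular_mass": 342.4, "logp": 2.1, "h_bond_donors": 2, "h_bond_acceptors": 5},
    "example_2": {"molecular_mass": 612.8, "logp": 6.3, "h_bond_donors": 1, "h_bond_acceptors": 8},
    "example_3": {"molecular_mass": 520.0, "logp": 4.0, "h_bond_donors": 3, "h_bond_acceptors": 9},
}

kept = [name for name, props in compounds.items() if passes_rule_of_five(props)]
print(kept)  # ['example_1']
```

Setting max_violations=1 gives the more lenient variant that tolerates a single violation, which would also keep example_3.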


Computational target prediction and toxicity prediction

The last section describes how the huge amount of bioactivity data can be used for target or toxicity prediction, either by directly analyzing the data or by data mining approaches. Data mining approaches avoid an exhaustive analysis of all available data by extracting the important information (“knowledge discovery”) with the help of statistical and machine learning methods. The results are then exploited for predictions. One of the latest works in this area impressively demonstrated what these methods are capable of: Schneider and coworkers59 succeeded in predicting protein targets for a complex natural product, which were then experimentally validated. In another work, Schneider and coworkers60 successfully predicted protein targets for drug-like molecules with the help of SOMs. The underlying method can be tested using the SPiDER web server (http://modlab-cadd.ethz.ch/software/spider/). Other target prediction servers based on available bioactivity databases are SuperPred (http://prediction.charite.de/)61, SwissTargetPrediction (www.swisstargetprediction.ch/)62, DINIES (www.genome.jp/tools/dinies/help.html)63 and iDrug, with a focus on GPCRs, ion channels and nuclear receptors (www.jci-bioinfo.cn/iDrug-Target/).62 Another approach is to predict the target based on the similarity of the protein’s ligands to the query molecule. This approach is termed the Similarity Ensemble Approach (SEA) (http://sea.bkslab.org/).64 Additionally, a database named CARLSBAD (Confederated Annotated Research Libraries of Small molecule Biological Activity Data) can be used (http://carlsbad.health.unm.edu/wp/).65 This database focuses on the aggregation of high-quality entries from different databases and on the scaffold concept.

Another application of predictive models is the field of toxicity prediction. The ultimate goal would be to avoid animal or even in vitro experiments by using in silico models. QSAR models are already recommended by the ICH (International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use) for the prediction of mutagenicity.66 Furthermore, the FDA (U.S. Food and Drug Administration) is making efforts to establish computational approaches in the area of toxicity control.67 One such approach is the European Commission funded project OpenTox.68 Its main objective is not only to unify the plethora of databases, models and tools concerning toxicology data, but also to support various specialists like toxicologists, modelers and developers. It provides two basic tools: ToxPredict and ToxCreate. ToxPredict predicts toxicological hazards for a given query. ToxCreate, on the other hand, is used to build a predictive model and is aimed at experts, as knowledge of the algorithms is needed. As a web service, ToxPredict is platform independent. However, this comes with a potential drawback: it rules out the use of OpenTox for sensitive, competitive in-house data. Nevertheless, being a web service is also an advantage, as it spares the user from installing several tools, downloading many datasets and providing a suitable infrastructure for performing the calculations. The developers of OpenTox focused on validation of models, applicability domain prediction, standardization (i.e. establishing ontologies and using REST for data sharing) and extensibility. Another web-based tool is the BioAssayData Associative Promiscuity Pattern Learning Engine (BADAPPLE) (http://pasilla.health.unm.edu/tomcat/badapple/badapple).69 It aims at the prediction of promiscuity based on scaffolds. The developers additionally provide a SMARTS filter (http://pasilla.health.unm.edu/tomcat/biocomp/smartsfilter).

Conclusion/Outlook


This review focuses on publicly accessible databases and open-source tools that can be used to handle and analyze large amounts of in-house data to support the rational identification and development of chemical probes and drugs. The overall scheme of common tools, databases and their connections is shown in Figure 7. The bioactivity databases make it possible to analyze the activity and promiscuity of identified hits, or to predict their targets. Databases of commercially available compounds make it possible to identify similar purchasable compounds for the SAR-by-catalogue approach. Powerful tools help to analyze and understand the structure-activity relationship. Finally, workflow management systems can be used to generate workflows that automatically perform searches to collect the data for analysis. Thanks to ever-increasing usability, researchers without deeper knowledge of chemoinformatics can now easily integrate these approaches into their day-to-day workflow on their desktop computers. With this review we want to encourage the community to actively use the available tools and subsequently report bugs, unintended behaviors or missing features. Active contribution is essential for further development, and implementations of specific algorithms should be shared with the community to support further software development, which will ultimately support everyone's daily work.

Figure 7. An example overall workflow starting with initial hits and leading up to toxicity prediction.

Acknowledgements: L. H. thanks the German Research Foundation (DFG, Priority Programme “Algorithms for Big Data”, SPP 1736) for funding. O. K. thanks the German Federal Ministry for Education and Research (BMBF, Medizinische Chemie in Dortmund, Grant BMBF 1316053) for funding. We want to thank the developers of databases storing chemical and/or biological data as well as the developers of cheminformatics tools.

Keywords

• data mining and knowledge discovery: The identification of patterns in large datasets and the transfer of this information into knowledge.
• chemoinformatics: In the context of this review, the handling and analysis of molecules and their biological activity using computational methods.
• bioactivity databases: These databases store chemical data in combination with biological data, thereby linking both worlds to provide valuable information.
• workflow systems: Software that makes it possible to automate recurring tasks, like processing and analyzing high-throughput screening data, without programming experience.
• molecule fingerprint: Abstract representation of a molecule as a series of numbers that, e.g., makes it possible to determine the similarity of two molecules.
• Tanimoto coefficient: Similarity measure based on molecule fingerprints, where 1 means highly similar (or identical) molecules.

References

1. Lusher, S. J.; McGuire, R.; van Schaik, R. C.; Nicholson, C. D.; Vlieg, J. de. Data-driven medicinal chemistry in the era of big data. Drug Discovery Today 2014, 19 (7), 859–868.
2. Johnson, M. A.; Maggiora, G. M. Concepts and applications of molecular similarity. J. Comput. Chem. 1992, 13 (4), 539–540.
3. Schuffenhauer, A.; Floersheim, P.; Acklin, P.; Jacoby, E. Similarity metrics for ligands reflecting the similarity of the target proteins. J. Chem. Inf. Comput. Sci. 2003, 43 (2), 391–405.
4. Nettles, J. H.; Jenkins, J. L.; Bender, A.; Deng, Z.; Davies, J. W.; Glick, M. Bridging chemical and biological space: "target fishing" using 2D and 3D molecular descriptors. J. Med. Chem. 2006, 49 (23), 6802–6810.
5. Jenkins, J. L.; Bender, A.; Davies, J. W. In silico target fishing: Predicting biological targets from chemical structure. Drug Discovery Today: Technologies 2006, 3 (4), 413–421.
6. Bruns, R. F.; Watson, I. A. Rules for identifying potentially reactive or promiscuous compounds. J. Med. Chem. 2012, 55 (22), 9763–9772.
7. Irwin, J. J.; Duan, D.; Torosyan, H.; Doak, A. K.; Ziebart, K. T.; Sterling, T.; Tumanian, G.; Shoichet, B. K. An Aggregation Advisor for Ligand Discovery. J. Med. Chem. 2015, 58 (17), 7076–7087.
8. Guilloux, V.; Arrault, A.; Colliandre, L.; Bourg, S.; Vayer, P.; Morin-Allory, L. Mining collections of compounds with Screening Assistant 2. J. Cheminform. [Online] 2012, 4 (1), 20. DOI: 10.1186/1758-2946-4-20.
9. Bajorath, J. Analyzing Promiscuity at the Level of Active Compounds and Targets. Mol. Inf. [Online] 2016.


10. Wishart, D. S.; Knox, C.; Guo, A. C.; Shrivastava, S.; Hassanali, M.; Stothard, P.; Chang, Z.; Woolsey, J. DrugBank: A comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006, 34 (90001), D668–D672.
11. Law, V.; Knox, C.; Djoumbou, Y.; Jewison, T.; Guo, A. C.; Liu, Y.; Maciejewski, A.; Arndt, D.; Wilson, M.; Neveu, V.; Tang, A.; Gabriel, G.; Ly, C.; Adamjee, S.; Dame, Z. T.; Han, B.; Zhou, Y.; Wishart, D. S. DrugBank 4.0: Shedding new light on drug metabolism. Nucleic Acids Res. 2013, 42 (D1), D1091–D1097.
12. Kanehisa, M.; Sato, Y.; Kawashima, M.; Furumichi, M.; Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016, 44 (D1), D457–D462.
13. Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L. J.; Chambers, J.; Davies, M.; Krüger, F. A.; Light, Y.; Mak, L.; McGlinchey, S.; Nowotka, M.; Papadatos, G.; Santos, R.; Overington, J. P. The ChEMBL bioactivity database: An update. Nucleic Acids Res. 2013, 42 (D1), D1083–D1090.
14. Kim, S.; Thiessen, P. A.; Bolton, E. E.; Chen, J.; Fu, G.; Gindulyte, A.; Han, L.; He, J.; He, S.; Shoemaker, B. A.; Wang, J.; Yu, B.; Zhang, J.; Bryant, S. H. PubChem Substance and Compound databases. Nucleic Acids Res. 2016, 44 (D1), D1202–13.
15. Wang, Y.; Suzek, T.; Zhang, J.; Wang, J.; He, S.; Cheng, T.; Shoemaker, B. A.; Gindulyte, A.; Bryant, S. H. PubChem BioAssay: 2014 update. Nucleic Acids Res. 2014, 42 (Database issue), D1075–82.
16. WHO Collaborating Centre for Drug Statistics Methodology. http://www.whocc.no/atc/structure_and_principles/ (accessed August 5, 2016).
17. Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Deliv. Rev. 1997, 23 (1-3), 3–25.
18. Olah, M.; Mracec, M.; Ostopovici, L.; Rad, R.; Bora, A.; Hadaruga, N.; Olah, I.; Banda, M.; Simon, Z.; Mracec, M.; Oprea, T. I. WOMBAT: World of Molecular Bioactivity. In Chemoinformatics in Drug Discovery; Oprea, T. I., Ed.; Methods and Principles in Medicinal Chemistry; Wiley-VCH Verlag GmbH & Co. KGaA: Weinheim, FRG, 2005; pp 221–239.
19. Tiikkainen, P.; Bellis, L.; Light, Y.; Franke, L. Estimating error rates in bioactivity databases. J. Chem. Inf. Model. 2013, 53 (10), 2499–2505.
20. Gilson, M. K.; Liu, T.; Baitaluk, M.; Nicola, G.; Hwang, L.; Chong, J. BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 2016, 44 (D1), D1045–D1053.
21. Chen, X.; Liu, M.; Gilson, M. BindingDB: A Web-Accessible Molecular Recognition Database. Comb. Chem. High Throughput Screen. 2001, 4 (8), 719–725.
22. Wang, Y.; Xiao, J.; Suzek, T. O.; Zhang, J.; Wang, J.; Bryant, S. H. PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res. 2009, 37 (Web Server issue), W623–33.
23. Williams, A. J.; Harland, L.; Groth, P.; Pettifer, S.; Chichester, C.; Willighagen, E. L.; Evelo, C. T.; Blomberg, N.; Ecker, G.; Goble, C.; Mons, B. Open PHACTS: Semantic interoperability for drug discovery. Drug Discovery Today 2012, 17 (21-22), 1188–1198.


24. Gray, A. J. G.; Groth, P.; Loizou, A.; Askjaer, S.; Brenninkmeijer, C.; Burger, K.; Chichester, C.; Evelo, C. T.; Goble, C.; Harland, L.; Pettifer, S.; Thompson, M.; Waagmeester, A.; Williams, A. J. Applying Linked Data Approaches to Pharmacology: Architectural Decisions and Implementation. Semantic Web 2014, 5 (2), 101–113.
25. Chen, C. Y.-C.; Hofmann, A. TCM Database@Taiwan: The World's Largest Traditional Chinese Medicine Database for Drug Screening In Silico. PLoS ONE [Online] 2011, 6 (1). DOI: 10.1371/journal.pone.0015939.
26. Gao, Z.; Li, H.; Zhang, H.; Liu, X.; Kang, L.; Luo, X.; Zhu, W.; Chen, K.; Wang, X.; Jiang, H. PDTD: a web-accessible protein database for drug target identification. BMC Bioinform. [Online] 2008, 9, 104. DOI: 10.1186/1471-2105-9-104.
27. Chang, A.; Schomburg, I.; Placzek, S.; Jeske, L.; Ulbrich, M.; Xiao, M.; Sensen, C. W.; Schomburg, D. BRENDA in 2015: Exciting developments in its 25th year of existence. Nucleic Acids Res. 2015, 43 (D1), D439–D446.
28. Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind Database: Collection of Binding Affinities for Protein−Ligand Complexes with Known Three-Dimensional Structures. J. Med. Chem. 2004, 47 (12), 2977–2980.
29. Wang, R.; Fang, X.; Lu, Y.; Yang, C.-Y.; Wang, S. The PDBbind Database: Methodologies and Updates. J. Med. Chem. 2005, 48 (12), 4111–4119.
30. Fabregat, A.; Sidiropoulos, K.; Garapati, P.; Gillespie, M.; Hausmann, K.; Haw, R.; Jassal, B.; Jupe, S.; Korninger, F.; McKay, S.; Matthews, L.; May, B.; Milacic, M.; Rothfels, K.; Shamovsky, V.; Webber, M.; Weiser, J.; Williams, M.; Wu, G.; Stein, L.; Hermjakob, H.; D'Eustachio, P. The Reactome pathway Knowledgebase. Nucleic Acids Res. 2016, 44 (D1), D481–7.
31. Kutmon, M.; Riutta, A.; Nunes, N.; Hanspers, K.; Willighagen, E. L.; Bohler, A.; Melius, J.; Waagmeester, A.; Sinha, S. R.; Miller, R.; Coort, S. L.; Cirillo, E.; Smeets, B.; Evelo, C. T.; Pico, A. R. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Res. 2016, 44 (D1), D488–94.
32. Irwin, J. J.; Shoichet, B. K. ZINC--a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 2005, 45 (1), 177–182.
33. Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G. ZINC: a free tool to discover chemistry for biology. J. Chem. Inf. Model. 2012, 52 (7), 1757–1768.
34. Degtyarenko, K.; Matos, P. de; Ennis, M.; Hastings, J.; Zbinden, M.; McNaught, A.; Alcántara, R.; Darsow, M.; Guedj, M.; Ashburner, M. ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res. 2008, 36 (Database issue), D344–50.
35. Hastings, J.; Matos, P. de; Dekker, A.; Ennis, M.; Harsha, B.; Kale, N.; Muthukrishnan, V.; Owen, G.; Turner, S.; Williams, M.; Steinbeck, C. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 2013, 41 (Database issue), D456–63.
36. Sterling, T.; Irwin, J. J. ZINC 15--Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55 (11), 2324–2337.


37. Editorial: ChemSpider – a tool for Natural Products research. Nat. Prod. Rep. 2015, 32 (8), 1163–1164.
38. Matos, P. de; Alcántara, R.; Dekker, A.; Ennis, M.; Hastings, J.; Haug, K.; Spiteri, I.; Turner, S.; Steinbeck, C. Chemical Entities of Biological Interest: an update. Nucleic Acids Res. 2010, 38 (Database issue), D249–D254.
39. Breinbauer, R.; Vetter, I. R.; Waldmann, H. From Protein Domains to Drug Candidates – Natural Products as Guiding Principles in the Design and Synthesis of Compound Libraries. Angew. Chem. Int. Ed. 2002, 41 (16), 2878.
40. Reymond, J.-L.; Awale, M. Exploring chemical space for drug discovery using the chemical universe database. ACS Chem. Neurosci. [Online] 2012, 3 (9), 649–657. DOI: 10.1021/cn3000422.
41. Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 2012, 52 (11), 2864–2875.
42. Allen, F. H. The Cambridge Structural Database: A quarter of a million crystal structures and rising. Acta Cryst. 2002, 58 (3), 380–388.
43. Sander, T.; Freyss, J.; Korff, M. von; Rufener, C. DataWarrior: An Open-Source Program For Chemistry Aware Data Visualization And Analysis. J. Chem. Inf. Model. 2015, 55 (2), 460–473.
44. Reutlinger, M.; Schneider, G. Nonlinear dimensionality reduction and mapping of compound libraries for drug discovery. J Mol Graph Model. 2012, 34, 108–117.
45. Guha, R.; van Drie, J. H. Structure–activity landscape index: identifying and quantifying activity cliffs. J. Chem. Inf. Model. 2008, 48 (3), 646–658.
46. Hilbig, M.; Urbaczek, S.; Groth, I.; Heuser, S.; Rarey, M. MONA – Interactive manipulation of molecule collections. J Cheminform. [Online] 2013, 5 (1), 38. DOI: 10.1186/1758-2946-5-38.
47. Hilbig, M.; Rarey, M. MONA 2: A Light Cheminformatics Platform for Interactive Compound Library Processing. J. Chem. Inf. Model. 2015, 55 (10), 2071–2078.
48. Baell, J.; Walters, M. A. Chemistry: Chemical con artists foil drug discovery. Nature 2014, 513 (7519), 481–483.
49. Baell, J. B.; Ferrins, L.; Falk, H.; Nikolakopoulos, G. PAINS: Relevance to Tool Compound Discovery and Fragment-Based Screening. Aust. J. Chem. 2013, 66 (12), 1483.
50. Schuffenhauer, A.; Ertl, P.; Roggo, S.; Wetzel, S.; Koch, M. A.; Waldmann, H. The Scaffold Tree – Visualization of the Scaffold Universe by Hierarchical Scaffold Classification. J. Chem. Inf. Model. 2007, 47 (1), 47–58.
51. Klein, K.; Koch, O.; Kriege, N.; Mutzel, P.; Schäfer, T. Visual Analysis of Biological Activity Data with Scaffold Hunter. Mol. Inf. 2013, 32 (11–12), 964–975.
52. Wetzel, S.; Klein, K.; Renner, S.; Rauh, D.; Oprea, T. I.; Mutzel, P.; Waldmann, H. Interactive exploration of chemical space with Scaffold Hunter. Nat. Chem. Biol. 2009, 5 (8), 581–583.
53. Klein, K.; Kriege, N.; Mutzel, P. Scaffold Hunter: Facilitating Drug Discovery by Visual Analysis of Chemical Space. In Computer Vision, Imaging and Computer Graphics. Theory and Application; Csurka, G., Kraus, M., Laramee, R. S., Richard, P., Braz, J., Eds.; Communications in Computer and Information Science; Springer Berlin Heidelberg: Berlin, Heidelberg, 2013; pp 176–192.
54. Kriege, N.; Mutzel, P.; Schäfer, T. Practical SAHN Clustering for Very Large Data Sets and Expensive Distance Metrics. J. Graph Algorithms Appl. 2014, 18 (4), 577–602.
55. Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. KNIME: The Konstanz Information Miner. In Data analysis, machine learning and applications: Albert-Ludwigs-Universität Freiburg, March 7–9, 2007; Preisach, C., Ed.; Studies in classification, data analysis, and knowledge organization 31; Springer: Berlin, Heidelberg, 2008.
56. Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Thiel, K.; Wiswedel, B. KNIME – the Konstanz information miner. SIGKDD Explor. Newsl. 2009, 11 (1), 26.
57. Todeschini, R.; Consonni, V.; Xiang, H.; Holliday, J.; Buscema, M.; Willett, P. Similarity coefficients for binary chemoinformatics data: overview and extended comparison using simulated and real data sets. J. Chem. Inf. Model. 2012, 52 (11), 2884–2901.
58. Wolstencroft, K.; Haines, R.; Fellows, D.; Williams, A.; Withers, D.; Owen, S.; Soiland-Reyes, S.; Dunlop, I.; Nenadic, A.; Fisher, P.; Bhagat, J.; Belhajjame, K.; Bacall, F.; Hardisty, A.; Nieva de la Hidalga, A.; Balcazar Vargas, M. P.; Sufi, S.; Goble, C. The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic Acids Res. 2013, 41 (Web Server issue), W557–W561.
59. Reker, D.; Perna, A. M.; Rodrigues, T.; Schneider, P.; Reutlinger, M.; Monch, B.; Koeberle, A.; Lamers, C.; Gabler, M.; Steinmetz, H.; Muller, R.; Schubert-Zsilavecz, M.; Werz, O.; Schneider, G. Revealing the macromolecular targets of complex natural products. Nat. Chem. 2014, 6 (12), 1072–1078.
60. Reker, D.; Rodrigues, T.; Schneider, P.; Schneider, G. Identifying the macromolecular targets of de novo-designed chemical entities through self-organizing map consensus. Proc. Natl. Acad. Sci. USA 2014, 111 (11), 4067–4072.
61. Nickel, J.; Gohlke, B.-O.; Erehman, J.; Banerjee, P.; Rong, W. W.; Goede, A.; Dunkel, M.; Preissner, R. SuperPred: update on drug classification and target prediction. Nucleic Acids Res. 2014, 42 (Web Server issue), W26–W31.
62. Xiao, X.; Min, J.-L.; Wang, P.; Chou, K.-C. iGPCR-drug: a web server for predicting interaction between GPCRs and drugs in cellular networking. PLoS ONE [Online] 2013, 8 (8), e72234. DOI: 10.1371/journal.pone.0072234.
63. Yamanishi, Y.; Kotera, M.; Moriya, Y.; Sawada, R.; Kanehisa, M.; Goto, S. DINIES: drug-target interaction network inference engine based on supervised analysis. Nucleic Acids Res. 2014, 42 (Web Server issue), W39–W45.
64. Keiser, M. J.; Roth, B. L.; Armbruster, B. N.; Ernsberger, P.; Irwin, J. J.; Shoichet, B. K. Relating protein pharmacology by ligand chemistry. Nat. Biotechnol. 2007, 25 (2), 197–206.
65. Mathias, S. L.; Hines-Kay, J.; Yang, J. J.; Zahoransky-Kohalmi, G.; Bologa, C. G.; Ursu, O.; Oprea, T. I. The CARLSBAD database: a confederated database of chemical bioactivities. Database (Oxford) [Online] 2013, 2013, bat044. DOI: 10.1093/database/bat044.
66. ICH - International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use. M7 Assessment and Control of DNA Reactive (Mutagenic) Impurities in Pharmaceuticals to Limit Potential Carcinogenic Risk. http://www.ich.org/fileadmin/Public_Web_Site/ICH_Products/Guidelines/Multidisciplinary/M7/M7_Step_4.pdf (accessed August 5, 2016).
67. U.S. Food and Drug Administration. 1. Modernize Toxicology to Enhance Product Safety. http://www.fda.gov/ScienceResearch/SpecialTopics/RegulatoryScience/ucm268111.htm (accessed August 5, 2016).
68. Hardy, B.; Douglas, N.; Helma, C.; Rautenberg, M.; Jeliazkova, N.; Jeliazkov, V.; Nikolova, I.; Benigni, R.; Tcheremenskaia, O.; Kramer, S.; Girschick, T.; Buchwald, F.; Wicker, J.; Karwath, A.; Gutlein, M.; Maunz, A.; Sarimveis, H.; Melagraki, G.; Afantitis, A.; Sopasakis, P.; Gallagher, D.; Poroikov, V.; Filimonov, D.; Zakharov, A.; Lagunin, A.; Gloriozova, T.; Novikov, S.; Skvortsova, N.; Druzhilovsky, D.; Chawla, S.; Ghosh, I.; Ray, S.; Patel, H.; Escher, S. Collaborative development of predictive toxicology applications. J Cheminform. [Online] 2010, 2 (1), 7. DOI: 10.1186/1758-2946-2-7.
69. Yang, J. J.; Ursu, O.; Lipinski, C. A.; Sklar, L. A.; Oprea, T. I.; Bologa, C. G. Badapple: promiscuity patterns from noisy evidence. J Cheminform. [Online] 2016, 8, 29. DOI: 10.1186/s13321-016-0137-3.
