How Many Fingers Does a Compound Have? Molecular Similarity

Chapter 15

Downloaded by UNIV OF CALIFORNIA SAN DIEGO on December 12, 2016 | http://pubs.acs.org Publication Date (Web): October 5, 2016 | doi: 10.1021/bk-2016-1222.ch015

How Many Fingers Does a Compound Have? Molecular Similarity beyond Chemical Space Eugen Lounkine* and Miguel L. Camargo*,1 Novartis Institutes for Biomedical Research, 250 Massachusetts Avenue, Cambridge, Massachusetts 02139 *E-mail: [email protected]; [email protected] 1Present address: UCB, 14th floor, One Broadway, Cambridge, Massachusetts 02142

The concepts of molecular fingerprints and molecular similarity have matured and found innumerable applications in academia as well as industry, in particular in drug discovery. Chemical similarity is almost too commonplace for us to notice anymore. Still, this powerful concept – that molecules a) can be represented in terms of their interesting properties and b) are in one way or another similar to each other – has been growing over the past two decades to break out of the confinement of chemical space. Today, we do not just use chemical similarity to find compounds that will behave the same biologically; rather, we build on ever-growing biological profiles to assess bio-similarity directly. In addition, capturing the rich descriptions of compound-induced phenotypes from the literature gives us yet another molecular fingerprint. We define ‘molecular fingerprint’ to represent properties of compounds that include, but also go beyond, chemical descriptors. This brings new challenges and opportunities, such as: How do we define and encode bioactivity and literature profiles in the form of comparable fingerprints? How do we deal with the inherent sparseness of such representations? And, most importantly, how do we use these various ways of defining similarity in concert? Network concepts that have emerged and matured in the social sciences, such as friend-of-a-friend, may be of help – after all, we have been using the concept of “chemical neighborhoods” all along. Here we will present strategies for defining and slicing non-conventional molecular fingerprints as well as the application of network algorithms to build and navigate heterogeneous similarity networks.

© 2016 American Chemical Society

Bienstock et al.; Frontiers in Molecular Design and Chemical Information Science - Herman Skolnik Award Symposium 2015: ... ACS Symposium Series; American Chemical Society: Washington, DC, 2016.


Introduction – Starting with Chemical Similarity

One of the origins of chemical similarity lies in chemical database design: efficient comparison of molecular structures increased the speed with which a particular chemical structure, or analogs thereof, could be retrieved based on a query (1–3). Combined with the classical pharmacologic observation that compounds from congeneric series have similar effects in biological systems (2), this led to the similarity property principle (4): compounds with similar chemical properties tend to have similar (bioactivity) properties. Efficient computational search for similar compounds made it possible to find compounds with useful properties similar to those of a reference compound. This formed the basis of ligand-based virtual screening, whose goal is to find compounds with similar bioactivity using chemical similarity as a proxy. Departing from the initial “search for analogs”, chemical fingerprints have been tailored towards finding active compounds that depart substantially from the space of close reference analogs while retaining the biological activity. Indeed, benchmark sets for virtual screening often contain compounds that are diverse and balanced by design (5). Optimal fingerprints and search parameters were an area of active research in the 1990s and early 2000s (1), resulting in generally accepted guidelines for “vanilla” chemical similarity searching (1–3, 6, 7). Chemical fingerprints subsequently advanced from being a simple proxy for “actual”, or biological, similarity to being powerful tools for elucidating structure-activity relationships and activity cliffs (8). Nevertheless, chemical similarity can, a priori, only cover activity mediated by specific binding events against a target, or perhaps a target family. Furthermore, compounds that produce the same effect in a phenotypic assay, or in the clinic, are not always structurally similar in any meaningful way.
For example, morphine and aspirin are both analgesics, but they act through very different mechanisms. In such cases, a chemical similarity approach cannot recapitulate this high-level, indication-based connection (i.e., an effect on the same phenotype) without additional information. There are several reasons why additional information is needed. First, compounds can achieve similar phenotypic outcomes by targeting different entry points to the same pathway or biological process. Second, small variations in chemical structure can lead to differences in the polypharmacology profiles of compounds, which in turn may result in different phenotypic responses. Third, even when compounds show a high degree of target selectivity, they exhibit pleiotropic effects resulting from downstream changes in gene expression. Such downstream changes can vary between compounds where minor modifications to the chemical structure change pharmacokinetic/pharmacodynamic (PK/PD) properties, impacting the magnitude of downstream effects. Because of these challenges, extending similarity beyond chemical descriptors as a means of grouping compounds is important. Therefore, compound similarity today can also be defined based on bioactivity, phenotypic modulation, or therapeutic knowledge that can be encoded in fingerprints (9). Advances in omics technologies as well as in computer science give the chemoinformatician the ability to use disparate data types. These include, but are not limited to, gene expression, behavior of compounds across multiple screens, and high-level (clinical) phenotypes extracted from the literature. These similarity metrics all have their pros and cons and should be used as a compendium of approaches, individually or in combination, and not as competing technologies. Here we discuss approaches to molecular similarity that have augmented and departed from chemical similarity.
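As a point of reference for the sections that follow, the “vanilla” chemical similarity search described in the Introduction can be sketched in a few lines. The chapter does not name a similarity coefficient; the Tanimoto coefficient used here is an assumption (it is a common choice for binary fingerprints), and the compound names, bit indices, and threshold are purely illustrative.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query_bits, library, threshold=0.5):
    """Return (name, similarity) pairs at or above the threshold, best first."""
    scored = [(name, tanimoto(query_bits, bits)) for name, bits in library.items()]
    kept = [(name, sim) for name, sim in scored if sim >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: each compound is the set of bits that are set.
library = {
    "cpd_A": {1, 4, 7, 9},
    "cpd_B": {1, 4, 7, 12},
    "cpd_C": {2, 3, 20},
}
query = {1, 4, 7, 9, 15}
hits = similarity_search(query, library)
```

Note that every bit absent from a set is a definite statement ("feature not present"), which is exactly the property that biological fingerprints with missing measurements lack.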

Infusing Chemical Space with Bioactivity Information

In structure-activity relationship (SAR) exploration on chemogenomic data, chemical features are routinely juxtaposed with compounds’ biological activities in order to identify chemical features that are characteristic of, or mediate, activity. Often, bioactivity information is projected onto chemical space, and the activity of chemical neighbors is compared. This includes approaches based on compounds with the same core structure (10), matched molecular pairs (11, 12), or fingerprint similarity (13–15). Formalisms have been developed to characterize the activity landscape of any portion of predefined chemical space. Using the landscape metaphor, chemical space is often projected onto a plane, and activity values occupy the third dimension (13), giving rise to “rocky”, “hilly”, and “flat” regions that are informative for SAR exploration. These approaches, along with quantitative SAR (QSAR), seek to explain changes in bioactivity on the basis of more or less subtle changes in chemical structure. Naive Bayes models, as a side product, score features based on how enriched they are in active compounds, and thus can provide insight into individual chemical moieties required for activity (16). Compounds are scored based on the individual scores of their chemical (fingerprint) features. One way to combine chemical and bioactivity space directly is through Bayes affinity fingerprints, where compounds are described by scores from a panel of Naive Bayes target models (17). Thus, a predicted-bioactivity space is defined, which is grounded in chemical space and its contribution to activities at multiple targets. Compounds that are similar in this space can be chemically quite distinct from each other, as long as they retain a collection of (overlapping or non-overlapping) features characteristic of the same target activities.
Another approach is to stay in chemical space, but use Naive Bayes feature weights (18) or frequency scores (19) to emphasize parts of compounds characteristic of a defined activity. Such weighted fingerprints have found application in virtual screening (19, 20), as well as clustering (18). Clusters derived in that way are centered on molecular core fragments characteristic of activity. These cores are activity class-dependent and can be non-overlapping for pairs of assays.
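The feature scoring and Bayes affinity fingerprints described in this section can be illustrated with a minimal sketch. The chapter does not give the exact weighting; the Laplacian-corrected log-enrichment used below is one common variant and is an assumption here, as are all function names and the toy data.

```python
import math
from collections import Counter


def feature_scores(fingerprints, active_flags):
    """Score each feature by its (smoothed) enrichment among active compounds.

    score = log((actives_with_feature + 1) / (compounds_with_feature * base_rate + 1))
    Positive scores mark features enriched in actives, negative ones depleted.
    """
    base_rate = sum(active_flags) / len(fingerprints)
    seen, seen_active = Counter(), Counter()
    for bits, is_active in zip(fingerprints, active_flags):
        for feature in bits:
            seen[feature] += 1
            if is_active:
                seen_active[feature] += 1
    return {
        f: math.log((seen_active[f] + 1) / (seen[f] * base_rate + 1)) for f in seen
    }


def compound_score(bits, scores):
    """A compound's score is the sum of the scores of its features."""
    return sum(scores.get(feature, 0.0) for feature in bits)


def bayes_affinity_fingerprint(bits, target_models):
    """Describe a compound by its scores from a panel of per-target models."""
    return [compound_score(bits, target_models[t]) for t in sorted(target_models)]


# Toy training set for one hypothetical target: features 1..4, two actives.
train_fps = [{1, 2}, {1, 3}, {2, 4}, {3, 4}]
train_active = [True, True, False, False]
scores = feature_scores(train_fps, train_active)

# A two-target panel (the second model is hand-written for illustration only).
panel = {"target_A": scores, "target_B": {4: 0.7, 1: -0.3}}
affinity_fp = bayes_affinity_fingerprint({1, 4}, panel)
```

Two compounds with no features in common can still end up close in the resulting score space, which is the point made above about chemically distinct compounds sharing target activities.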


Biological Fingerprints

Departing from chemical space, increasing profiling efforts and the public availability of bioactivity data make it possible to compare compounds based on their biological profiles. Describing compounds based on their biological activity has a decade-long history (10). Examples include cellular proliferation panels (21), gene expression profiles (22), high-content imaging profiles (23), and high-throughput screening fingerprints (HTSFPs), which are based on many historical assays (9). Because they are aggregated over multiple assays, HTSFPs exemplify the unique challenges and opportunities of biological fingerprints more clearly than other biological fingerprints and will be discussed here. Fingerprint sparsity has different meanings for chemical and biological fingerprints. Chemical fingerprints are calculated from the (fully known) chemical structure. Although typically only a few bits are set (7), i.e., the fingerprint is sparse, every bit that is not set has a definite meaning: the encoded feature is not present in the chemical structure. For many compounds, the same is true for HTSFPs: a typical compound will likely be active only in a subset of assays. However, a second meaning of sparsity makes comparing HTSFPs more complex: compounds typically have not been profiled in all of the assays. Missing data are very different from a bit that is not set or a value of zero signifying inactivity. We simply do not know what activity value we should assign, and have to account for that when we calculate similarity values. An extreme case of HTSFP sparsity is a compound that does not have an HTSFP at all, because it has not been screened in relevant assays. Using the similarity property principle (4), we have introduced bioturbo similarity searching (24), named for its similarities to turbo similarity searching (2, 25) in chemical space.
If a compound itself does not have an HTSFP, we try to find, using chemical similarity, a surrogate compound that allows us to carry out an HTSFP similarity search. This opens up HTSFP search and its advantages (like scaffold hopping) to compounds we do not even have in our collection. In particular, we were able to show that, starting with chemically unattractive active natural products, we can find chemically tractable low-molecular-weight compounds with the same mode of action (24). The basis for HTSFP similarity is a Pearson correlation coefficient between two compounds’ fingerprints across assays. What should be done with data missing from one fingerprint or the other? One possibility would be to assume inactivity and set these values to zero; however, this greatly skews any similarity values, in particular when one compound has significant activity and the other has not been screened in a particular assay. Therefore, we have chosen to compare only the assays that the compounds actually have in common (9). This poses unique challenges: a Pearson correlation coefficient of 0.5 may be substantial when based on 300 assays, but if only three assays are shared between the compounds, this value may not be significant. Analytically, one can calculate a p-value for an observed correlation under the assumption that assays are normally distributed and independent. This is a rather strong assumption for historical assays, which may address similar target families. Therefore, we use empirical correlation coefficient distributions derived from pairwise fingerprint comparisons of a randomized reference set and calculate empirical p-values or frequency scores (26). These scores are used in combination with a cutoff for the number of assays in common and a minimal Pearson correlation coefficient to define when two compounds are similar. A more subtle challenge arising from HTSFP sparsity is the sparsity of the similarity matrix itself. Two compounds may in fact be very similar to each other, but if they do not have enough assays in common, we would never know. However, we may find a third compound that has enough assays in common with both of them. In chemical similarity networks, these three compounds, all similar to each other, will form a clique. For intrinsically sparse HTSFPs, even though all three compounds are de facto similar to each other, a clique is not formed. Thus, in sparse biosimilarity networks the turbo approach of finding neighbors-of-neighbors is more than a relaxation of the similarity threshold: it can be crucial in identifying close biosimilars whose similarity we simply cannot assess directly. The similarity property principle in chemical space has largely been applied to virtual screening or target prediction. The underlying goal is to find compounds with a similar biological activity – an activity that often is defined as activity at a specific target. Biological fingerprints extend this goal. If two compounds behave similarly across a range of different cellular assays, it might be that they act at the same target. However, it could also be that they affect the same pathway by acting at different nodes in the pathway. Since these nodes can be very different in their structure (one could be an upstream GPCR, another a downstream kinase), chemical similarity would be hard pressed to find commonalities. Biological similarity, in particular when based on cellular assays, intrinsically captures modulation of similar intracellular processes (9). Biological fingerprints have expanded our definition of what we can consider similar molecules.
In some ways they are closer to the underlying question we ask of chemical similarity: which compounds will behave in a similar way biologically or phenotypically? This does not mean that they can fully replace chemical similarity – quite the opposite: once new active chemical matter has been identified, chemical similarity approaches are indispensable to understanding SAR. Moreover, in bioturbo similarity, chemical space also links in compounds that do not have an HTSFP. Turning bioturbo similarity on its head, HTSFPs may also be applied to natural product extracts without a defined structure: while they may have biological fingerprints (because they have been screened in past assays), the structure of the constituent compounds is not known. Identifying low-molecular-weight compounds with similar HTSFPs can elucidate the modes of action (MOAs) of the active ingredients. Moreover, just like chemical fingerprints, HTSFPs can be used not only to define similarity between compounds, but also to elucidate features characteristic of compound subsets. For example, Naive Bayes models assign weights to chemical features enriched in active compounds (16). Similarly, HTSFP features have been used to predict compound targets using Naive Bayes and Random Forests (27). The idea behind HTSFPs can also be extended to other data types such as imaging and gene expression. High content screens (HCS) can measure several cellular parameters in a single screen, such as cell size, mitochondrial count, nuclear size, and bespoke measurements that are created for the designated biology of choice (e.g., neuronal outgrowth or number of dendritic spines in neurons). The correlation of compounds across these measurements can, as with HTSFPs, be used to group compounds with similar MOAs and to assign MOAs to orphan compounds. Gene expression has been applied in a similar vein to reposition compounds as well as to assign MOAs (22). HTSFPs assess activity across many different experiments and projects with different targets and biology. In contrast, HCS and gene expression profiles from a single high-content assay can be sliced to map to the context of the biology being interrogated. A select group of genes that represents a desired phenotypic outcome can be used to identify compounds with similar profiles. Here, compounds that share the same gene expression patterns as a positive control are of interest.
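The pairwise HTSFP comparison described earlier in this section (correlate only over shared assays, require a minimum overlap, and judge the result against an empirical reference distribution) can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: fingerprints are modeled as dictionaries in which unscreened assays are simply absent, the cutoff values are hypothetical, and the reference distribution is a toy stand-in for correlations obtained from a randomized reference set.

```python
import math


def shared_pearson(fp_a, fp_b, min_shared=3):
    """Pearson correlation over assays present in BOTH fingerprints.

    Returns (r, n_shared); r is None when too few assays are shared or when
    one profile is constant over the shared assays.
    """
    shared = sorted(set(fp_a) & set(fp_b))
    n = len(shared)
    if n < min_shared:
        return None, n
    xs = [fp_a[k] for k in shared]
    ys = [fp_b[k] for k in shared]
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None, n
    return sxy / math.sqrt(sxx * syy), n


def empirical_p_value(r_observed, reference_rs):
    """Fraction of reference correlations at least as high as the observed one
    (with +1 smoothing so the estimate is never exactly zero)."""
    at_least = sum(1 for r in reference_rs if r >= r_observed)
    return (at_least + 1) / (len(reference_rs) + 1)


def is_similar(fp_a, fp_b, reference_rs, min_shared=3, min_r=0.5, alpha=0.05):
    """Combine the three criteria: assay overlap, minimal r, empirical p-value."""
    r, _n_shared = shared_pearson(fp_a, fp_b, min_shared)
    if r is None or r < min_r:
        return False
    return empirical_p_value(r, reference_rs) <= alpha


# Hypothetical activity values; assays a compound was never run in are absent.
cpd_1 = {"assay1": 0.9, "assay2": 0.1, "assay3": 0.8, "assay4": 0.2}
cpd_2 = {"assay1": 0.8, "assay2": 0.2, "assay3": 0.9, "assay5": 0.7}
cpd_3 = {"assay4": 1.0, "assay5": 0.0}

reference = [i / 100 for i in range(-50, 50)]  # toy reference distribution
r_12, n_12 = shared_pearson(cpd_1, cpd_2)
r_13, n_13 = shared_pearson(cpd_1, cpd_3)
```

Here `cpd_1` and `cpd_3` share only one assay, so no similarity value is reported at all. That is different from reporting a low similarity, and it is the situation in which neighbor-of-neighbor expansion becomes useful.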

Literature Fingerprints

We have seen with HTSFPs how our definition of similarity can be stretched from initially focusing on the same binding pocket (the underlying assumption of the similarity property principle) to a more general similarity grounded in common pathway modulation. Similarity expansion around compounds not only can yield additional bioactive chemical matter, but also can help gain knowledge about the reference compounds. For example, a chemical analog of a hit may be in the clinic, or have a common name that can be used to search for relevant literature. This in turn can elucidate the cellular processes involved in the observed phenotype, as well as potential utility or caveats in the clinic. Using the scientific literature, and similar types of data such as adverse drug reactions, adds a dimension that cannot be captured in high-throughput assays. For example, because genes may participate in different processes across different tissue types, experimental use of compounds may reveal interesting observations (e.g., changes to levels of inflammation or onset of anaphylaxis) that are associated with them only in the literature and are not evident from cell-based and biochemical screens. Taking this approach a step further, we can encode compounds using concepts from the relevant literature. Text mining and natural language processing focus on extracting sentiments about compounds, and improving the sensitivity and specificity of statement extraction is an area of active research (28). Application of such methods to literature corpora is often limited to (freely available) abstracts, which do not always mention the chemical matter discussed in the full text of the paper. If a publication focuses on a novel phenotype, its abstract may not mention all the tool compounds used.
At the same time, both in publicly available chemogenomics databases like ChEMBL DB (29) and in commercially available databases such as Thomson Reuters Metabase (30), a majority of compound annotations are accompanied by a literature reference, often in the form of a MEDLINE PubMed ID. In addition, MeSH terms that describe the papers are available from MEDLINE. Combining these two sources can yield a large number (millions from ChEMBL alone) of compounds that are annotated with the MeSH terms of the papers they occur in. These terms can then be filtered (just like any other binary descriptors) for information content. For example, “Animal” or “Disease” are non-informative features, because they co-occur with (almost) all compounds. The ensuing literature fingerprints encode not so much a particular in vitro bioactivity of the compound as what the compound is “about”. There are unique caveats to fingerprints generated in this way. Compounds from congeneric series discussed in any one paper will form the largest and tightest clusters. Some other compounds are mentioned in many publications; these include amino acids, solvents, and media ingredients. Due to complexity effects that are well understood for binary chemical fingerprints (31), these compounds will tend to have higher similarity scores in binary similarity searches.
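The construction of literature fingerprints described above (compound → papers → MeSH terms, followed by filtering of uninformative terms) can be sketched as below. The data structures, the document identifiers, and the frequency cutoff are hypothetical; in practice the two mappings would be derived from a bioactivity database and from MEDLINE, respectively.

```python
from collections import Counter


def literature_fingerprints(compound_to_papers, paper_to_terms, max_fraction=0.9):
    """Annotate each compound with the terms of the papers it occurs in, then
    drop terms that occur with more than `max_fraction` of all compounds."""
    raw = {}
    for compound, papers in compound_to_papers.items():
        terms = set()
        for paper in papers:
            terms |= paper_to_terms.get(paper, set())
        raw[compound] = terms

    # Count in how many compound fingerprints each term occurs.
    term_counts = Counter(term for terms in raw.values() for term in terms)
    n_compounds = len(raw)
    informative = {
        term for term, count in term_counts.items()
        if count / n_compounds <= max_fraction
    }
    return {compound: terms & informative for compound, terms in raw.items()}


# Toy mappings; paper identifiers 1-4 are placeholders, not real PubMed IDs.
compound_to_papers = {
    "aspirin": [1, 2],
    "ibuprofen": [2, 3],
    "morphine": [4],
    "glycine": [1, 3, 4],
}
paper_to_terms = {
    1: {"Animal", "Inflammation"},
    2: {"Animal", "Anti-Inflammatory Agents"},
    3: {"Animal", "Pain"},
    4: {"Animal", "Analgesics, Opioid"},
}
lit_fps = literature_fingerprints(compound_to_papers, paper_to_terms)
```

In this toy example "glycine" stands in for a frequently mentioned compound such as a media ingredient: it collects terms from many papers, which is the complexity effect noted above.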

Figure 1. Starting with seed painkillers (dark nodes), expansion can be carried out using different fingerprints. Here, chemical, biological (HTSFP), and literature fingerprints are combined to form a heterogeneous similarity network. Distinct clusters emerge that are connected by a combination of similarities, rather than by one particular fingerprint type. Using the mathematical structure of a graph, individual compounds can be prioritized; in this example, compounds connected by multiple lines of evidence (like chemical and literature for NSAIDs) have been emphasized (large nodes). It is important to note that while the visualization of the network serves to explain the approach here, it is neither necessary nor sufficient for deriving a prioritized list of compounds. This is done on the mathematical structure (the graph) underlying the visualization.


Despite the noise coming from the compound-paper-term link, and non-informative features and compounds, literature fingerprint similarity provides a unique view on compound relationships. For example, it is rather difficult to find a chemical similarity metric that would connect different NSAIDs, not to mention anti-inflammatory compounds in general (Figure 1). Of course, for this example no complex fingerprinting is necessary – we simply could query the anatomical, therapeutic, and chemical (ATC) classification for anti-inflammatory drugs. But what do we do with hits from an assay we know nothing about a priori? Some of these hits may have literature fingerprints linking them to known drugs and tool compounds used in the literature. Moreover, machine learning techniques such as Naive Bayes models (16) can be applied to any other binary fingerprints to find out what these compounds are “about”. Contextual understanding of the chemical space is a cornerstone of both focused library design before screening and MOA elucidation after screening (21). For example, despite being very informative, phenotypic screens can only handle a small set of compounds and hence require the design of chemical subsets that are hypothesis driven (32). In such cases, libraries that probe the right biology will have a higher probability of success. Finding a chemically diverse, but contextually similar subset of compounds using literature fingerprints can help design such libraries. Similarly, when hits are being selected after the screen has been run, grouping compounds with similar structures is important. However, contextualizing the hits, or different chemical series, in terms of themes, may reveal that although chemically tractable, these compounds may affect unwanted mechanisms in the cell or be associated with adverse events in animal models or in the clinic.

Signatures – Molecular Fingerprints without Molecules

Chemical similarity searching always starts with a reference compound, either a compound that exists in the physical world (has been synthesized) or a virtual compound drawn from scratch. What is the equivalent for biological, phenotypic, or literature fingerprints? At first it seems like we are at a loss, because we cannot calculate the fingerprint from something that is readily available, such as the chemical structure. However, we still can define a reference signature based on our knowledge of individual assays or readouts. Queries could, for example, be “I want inhibitors of assays that involve Kinase XY”; “I want compounds that down-regulate five genes, and upregulate six other genes”; “Give me compounds that increase the nucleus size, but do not deform it”. The metadata for each “bit” of these assays often is (or should be (21)) much richer than for the typical chemical fingerprint. Thus, we can define what we want to see in a compound, even if we have no single tool in the collection that would have this exact effect. Alternatively, active controls could be assay conditions (heat-shocking the compounds) or treatment with biomolecule factors (lipopolysaccharide, LPS) or siRNA (for gene knockdown). Such knowledge-based signatures require rich metadata for individual fingerprint bits that are comparable across assays. For example, in HCSFPs (23), we have to know which readouts relate to nuclear size. Signatures can also be defined in a data-driven way by clustering compounds using their biological fingerprints; cluster centroids or representative compounds can then serve as a reference. We might not know exactly what such a signature “means”, but if we run an enrichment analysis on compound targets or associated MeSH terms (from their literature fingerprints), or identify individual compounds that are well-known tools, we can assign these signatures names like “Apoptosis signature”, “Motility inhibition”, etc. (Figure 2). Combining predefined signatures with fingerprint bit metadata then allows us to slice any given fingerprint in a way that makes it comparable to a signature. Thus, we can identify, across projects, compounds that may belong to the same phenotypic mode-of-action class.
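A knowledge-based signature of the kind described in this section, and the slicing of a fingerprint by readout metadata, might look like the following sketch. The readout names, metadata tags, the encoding of a signature as expected directions (+1 up, -1 down, 0 unchanged), and the threshold on z-score-like values are all assumptions made for illustration.

```python
def slice_by_metadata(fingerprint, readout_metadata, tag):
    """Keep only the readouts whose metadata carries the given tag."""
    return {
        readout: value
        for readout, value in fingerprint.items()
        if tag in readout_metadata.get(readout, set())
    }


def direction(value, threshold=1.0):
    """Map a z-score-like readout to +1 (up), -1 (down), or 0 (unchanged)."""
    if value > threshold:
        return 1
    if value < -threshold:
        return -1
    return 0


def signature_match(fingerprint, signature, threshold=1.0):
    """Fraction of signature readouts on which the compound behaves as expected.

    Only readouts actually measured for the compound are compared; returns
    (fraction, n_compared), with fraction None if nothing could be compared.
    """
    compared = [r for r in signature if r in fingerprint]
    if not compared:
        return None, 0
    agree = sum(direction(fingerprint[r], threshold) == signature[r] for r in compared)
    return agree / len(compared), len(compared)


# "Increase the nucleus size, but do not deform it" as a hypothetical signature.
signature = {"nucleus_area": 1, "nucleus_roundness": 0}
readout_metadata = {
    "nucleus_area": {"nucleus", "size"},
    "nucleus_roundness": {"nucleus", "shape"},
    "cell_count": {"viability"},
}
cpd_1 = {"nucleus_area": 2.5, "nucleus_roundness": 0.2, "cell_count": -0.1}
cpd_2 = {"nucleus_area": 2.1, "nucleus_roundness": -3.0}
cpd_3 = {"cell_count": -2.0}

nuclear_view = slice_by_metadata(cpd_1, readout_metadata, "nucleus")
```

No reference compound is needed: the signature is written down from knowledge of the readouts, and compounds from different projects can be compared to it as long as their readouts carry comparable metadata.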

Figure 2. Within a particular project, a compound fingerprint is derived from specific questions asked of the primary readouts. These primary readouts can come from different experiments, as in the case of HTSFPs, or a single assay with multiple readouts, as in the case of HCSFPs. Signatures can be defined either on the basis of known reference compounds (i.e., active controls), or based on knowledge about how certain readouts should behave in the best case scenario for an active compound. Hits can then be identified by comparing the signature to actual compound fingerprints. When biological fingerprints have been collected for compounds from multiple projects, they can be organized based on readout metadata (e.g., all readouts related to nuclear size), or be limited to individual multiparametric assays. Signatures can be then compared across projects and mapped to specific phenotypes. For example “Cytostatics affect these 10 readouts in this particular way.”


Heterogeneous Similarity Networks

We have discussed biological fingerprints and literature fingerprints that extend the concept of chemical similarity to compounds that affect biological systems in the same way, and to compounds that are grouped together in our common (literature) knowledge space. Each of these approaches has its caveats and unique applications. However, as we have seen for bioturbo similarity searching, the combination of multiple similarity approaches synergistically opens up new avenues to explore related compounds. Each fingerprint will come with its own best practice of parameterization and optimal cut-offs to define what “similar” means. In the area of chemical similarity, voting schemes and other ways of combining similarity metrics have been explored (33–35). For chemical similarities, network approaches have proved useful for navigation of chemical space and SAR exploration, irrespective of how exactly chemical similarity is defined in each case (10, 14, 36). Encoding similarity, or for that matter any relationship, in a graph is by no means a new approach; but precisely because graphs and network relationships are well studied (37, 38), we can use established algorithms to combine multiple similarity approaches in heterogeneous similarity networks. From this point of view, (bio)turbo similarity searching is nothing more than neighbor exploration in a similarity graph: starting with a reference node, we ask what its neighbors are (in chemical, biological, or literature space), and then we ask the same question about any neighbors we find. For example, starting with a few well-known painkillers (Figure 1), after two rounds of expansion we can identify clusters defined by common literature, chemical analogs, or common behavior in historical assays. In this example, clusters delineate diclofenac biosimilars, NSAIDs, coxibs, morphine analogs, and compounds discussed in oncology pain management.
No single similarity metric in isolation could identify all of them. Moreover, using standard metrics (37) such as connectivity, betweenness, clustering behavior, etc., we can weigh individual neighbors. For example, we can identify compounds that are connected to other compounds in more than one space. Then, we can use a standard graph/network flow algorithm to let the score “diffuse” to neighbors, identifying clusters with cross-domain activity relationships. Often, “network” is used synonymously with “visualization of a network”. Network visualization is not only aesthetically pleasing (39, 40), but has also been shown to be instrumental in understanding SAR and navigating chemical space (10, 14). Nevertheless, networks are first and foremost a mathematical construct, which can be visualized, but does not have to be visualized in order to apply graph algorithms and scoring. A lot of work is required to find an informative layout and coloring scheme and to add interactivity such as zooming and panning (14, 40). However, these features are not necessary: clustering and scoring approaches (which do not require visualization) provide a sorted table grouped by clusters, and that often is all that is needed to focus on compounds of interest. In summary, similarity networks, while not new, provide an elegant way to combine different novel similarity domains in one extensible and well-understood framework.
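The view of (bio)turbo similarity searching as neighbor exploration in a graph, and the prioritization of compounds connected in more than one similarity space, can be sketched without any visualization. The edge list below is a toy example (the compounds labeled `cpd_X`, `cpd_Y`, and `cpd_Z` and all edges are hypothetical), and the simple "count the spaces" criterion is only one of the standard graph measures mentioned above.

```python
from collections import defaultdict


def build_network(edges):
    """Build an undirected graph: node -> neighbor -> set of similarity spaces."""
    graph = defaultdict(lambda: defaultdict(set))
    for a, b, space in edges:
        graph[a][b].add(space)
        graph[b][a].add(space)
    return graph


def expand(graph, seeds, rounds=2):
    """Turbo-style expansion: repeatedly add the neighbors of what was found."""
    found = set(seeds)
    frontier = set(seeds)
    for _ in range(rounds):
        frontier = {nb for node in frontier for nb in graph.get(node, {})} - found
        found |= frontier
    return found


def multi_evidence_nodes(graph, nodes):
    """Nodes linked to other selected nodes through more than one space."""
    result = {}
    for node in nodes:
        spaces = set()
        for neighbor, edge_spaces in graph.get(node, {}).items():
            if neighbor in nodes:
                spaces |= edge_spaces
        if len(spaces) > 1:
            result[node] = spaces
    return result


# Each edge means "similar above the cutoff chosen for that fingerprint type".
edges = [
    ("diclofenac", "ibuprofen", "literature"),
    ("ibuprofen", "naproxen", "chemical"),
    ("ibuprofen", "naproxen", "literature"),
    ("diclofenac", "cpd_X", "biological"),
    ("morphine", "codeine", "chemical"),
    ("codeine", "cpd_Y", "biological"),
    ("cpd_Y", "cpd_Z", "biological"),
]
network = build_network(edges)
found = expand(network, seeds={"diclofenac", "morphine"}, rounds=2)
prioritized = multi_evidence_nodes(network, found)
```

The output is just a set of compounds and a dictionary of prioritized ones, i.e., a table rather than a picture, which is usually all that is needed to focus on compounds of interest.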

Conclusions

As we collect more knowledge about the pharmacological, toxicological, clinical, and other activities of compounds, we have the opportunity to define new knowledge spaces that depart significantly from classical chemical space. These different fingerprints of compounds require slightly different approaches to defining and interpreting similarity. However, the concept of neighbors is common to all of them. Hence, network (or graph) approaches can help combine these distinct viewpoints into a common, robust, and easy-to-navigate model. Many standard algorithms have been developed for graphs, and they are applicable to compound networks as much as to social networks. Using multiple views of a compound and combining highly orthogonal information, such as chemical structure, in vitro activity, and clinical phenotypes, enables us to make more informed decisions than is possible with any annotation on its own. So, how many finger(print)s does a compound have? The answer is: as many as are useful to us!

References

1. Willett, P. Similarity searching using 2D structural fingerprints. Methods Mol. Biol. 2011, 672, 133–158.
2. Willett, P. Similarity-based virtual screening using 2D fingerprints. Drug Discovery Today 2006, 11, 1046–1053.
3. Bender, A.; Jenkins, J. L.; Scheiber, J.; Sukuru, S. C.; Glick, M.; Davies, J. W. How similar are similarity searching methods? A principal component analysis of molecular descriptor space. J. Chem. Inf. Model. 2009, 49, 108–119.
4. Maggiora, G.; Vogt, M.; Stumpfe, D.; Bajorath, J. Molecular similarity in medicinal chemistry. J. Med. Chem. 2014, 57, 3186–3204.
5. Rohrer, S. G.; Baumann, K. Maximum unbiased validation (MUV) data sets for virtual screening based on PubChem bioactivity data. J. Chem. Inf. Model. 2009, 49, 169–184.
6. Willett, P. Combination of similarity rankings using data fusion. J. Chem. Inf. Model. 2013, 53, 1–10.
7. Rogers, D.; Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.
8. Stumpfe, D.; Bajorath, J. Activity cliff networks for medicinal chemistry. Drug Dev. Res. 2014, 75, 291–298.
9. Petrone, P. M.; Simms, B.; Nigsch, F.; Lounkine, E.; Kutchukian, P.; Cornett, A.; Deng, Z.; Davies, J. W.; Jenkins, J. L.; Glick, M. Rethinking molecular similarity: comparing compounds on the basis of biological activity. ACS Chem. Biol. 2012, 7, 1399–1409.
10. Wawer, M.; Lounkine, E.; Wassermann, A. M.; Bajorath, J. Data structures and computational tools for the extraction of SAR information from large compound sets. Drug Discovery Today 2010, 15, 630–639.
11. Wassermann, A. M. Structure-activity relationship analysis on the basis of matched molecular pairs. J. Cheminform. 2014, 6 (Suppl. 1), O14.
12. Wassermann, A. M.; Bajorath, J. Large-scale exploration of bioisosteric replacements on the basis of matched molecular pairs. Future Med. Chem. 2011, 3, 425–436.
13. Bajorath, J.; Peltason, L.; Wawer, M.; Guha, R.; Lajiness, M. S.; Van Drie, J. H. Navigating structure-activity landscapes. Drug Discovery Today 2009, 14, 698–705.
14. Lounkine, E.; Wawer, M.; Wassermann, A. M.; Bajorath, J. SARANEA: a freely available program to mine structure-activity and structure-selectivity relationship information in compound data sets. J. Chem. Inf. Model. 2010, 50, 68–78.
15. Wawer, M.; Peltason, L.; Weskamp, N.; Teckentrup, A.; Bajorath, J. Structure-activity relationship anatomy by network-like similarity graphs and local structure-activity relationship indices. J. Med. Chem. 2008, 51, 6075–6084.
16. Lounkine, E.; Kutchukian, P. S.; Glick, M. Chemometric Applications of Naïve Bayesian Models in Drug Discovery. In Chemoinformatics for Drug Discovery; John Wiley & Sons, Inc.: 2013; pp 131–148.
17. Bender, A.; Jenkins, J. L.; Glick, M.; Deng, Z.; Nettles, J. H.; Davies, J. W. “Bayes affinity fingerprints” improve retrieval rates in virtual screening and define orthogonal bioactivity space: when are multitarget drugs a feasible concept? J. Chem. Inf. Model. 2006, 46, 2445–2456.
18. Lounkine, E.; Nigsch, F.; Jenkins, J. L.; Glick, M. Activity-aware clustering of high throughput screening data and elucidation of orthogonal structure-activity relationships. J. Chem. Inf. Model. 2011, 51, 3158–3168.
19. Hu, Y.; Lounkine, E.; Batista, J.; Bajorath, J. RelACCS-FP: a structural minimalist approach to fingerprint design. Chem. Biol. Drug Des. 2008, 72, 341–349.
20. Hu, Y.; Lounkine, E.; Bajorath, J. Improving the search performance of extended connectivity fingerprints through activity-oriented feature filtering and application of a bit-density-dependent similarity function. ChemMedChem 2009, 4, 540–548.
21. Wassermann, A. M.; Lounkine, E.; Davies, J. W.; Glick, M.; Camargo, L. M. The opportunities of mining historical and collective data in drug discovery. Drug Discovery Today 2015, 20, 422–434.
22. Lamb, J.; Crawford, E. D.; Peck, D.; Modell, J. W.; Blat, I. C.; Wrobel, M. J.; Lerner, J.; Brunet, J. P.; Subramanian, A.; Ross, K. N.; Reich, M.; Hieronymus, H.; Wei, G.; Armstrong, S. A.; Haggarty, S. J.; Clemons, P. A.; Wei, R.; Carr, S. A.; Lander, E. S.; Golub, T. R. The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science 2006, 313, 1929–1935.
23. Reisen, F.; Sauty de Chalon, A.; Pfeifer, M.; Zhang, X.; Gabriel, D.; Selzer, P. Linking Phenotypes and Modes of Action Through High-Content Screen Fingerprints. Assay Drug Dev. Technol. 2015, 13, 415–427.
24. Wassermann, A. M.; Lounkine, E.; Glick, M. Bioturbo similarity searching: combining chemical and biological similarity to discover structurally diverse bioactive molecules. J. Chem. Inf. Model. 2013, 53, 692–703.
25. Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Enhancing the effectiveness of similarity-based virtual screening using nearest-neighbor information. J. Med. Chem. 2005, 48, 7049–7054.
26. Wassermann, A. M.; Lounkine, E.; Urban, L.; Whitebread, S.; Chen, S.; Hughes, K.; Guo, H.; Kutlina, E.; Fekete, A.; Klumpp, M.; Glick, M. A screening pattern recognition method finds new and divergent targets for drugs and natural products. ACS Chem. Biol. 2014, 9, 1622–1631.
27. Riniker, S.; Wang, Y.; Jenkins, J. L.; Landrum, G. A. Using information from historical high-throughput screens to predict active compounds. J. Chem. Inf. Model. 2014, 54, 1880–1891.
28. Hirschberg, J.; Manning, C. D. Advances in natural language processing. Science 2015, 349, 261–266.
29. Gaulton, A.; Bellis, L. J.; Bento, A. P.; Chambers, J.; Davies, M.; Hersey, A.; Light, Y.; McGlinchey, S.; Michalovich, D.; Al-Lazikani, B.; Overington, J. P. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 2012, 40 (Database issue), D1100–1107.
30. Bessarabova, M.; Ishkin, A.; JeBailey, L.; Nikolskaya, T.; Nikolsky, Y. Knowledge-based analysis of proteomics data. BMC Bioinformatics 2012, 13 (Suppl. 16), S13.
31. Wang, Y.; Bajorath, J. Balancing the influence of molecular complexity on fingerprint similarity searching. J. Chem. Inf. Model. 2008, 48 (1), 75–84.
32. Wassermann, A. M.; Camargo, L. M.; Auld, D. S. Composition and applications of focus libraries to phenotypic assays. Front. Pharmacol. 2014, 5, 164.
33. Whittle, M.; Gillet, V. J.; Willett, P.; Loesel, J. Analysis of data fusion methods in virtual screening: theoretical model. J. Chem. Inf. Model. 2006, 46, 2193–2205.
34. Yera, E. R.; Cleves, A. E.; Jain, A. N. Prediction of off-target drug effects through data fusion. Pac. Symp. Biocomput. 2014, 160–171.
35. Hert, J.; Keiser, M. J.; Irwin, J. J.; Oprea, T. I.; Shoichet, B. K. Quantifying the relationships among drug classes. J. Chem. Inf. Model. 2008, 48, 755–765.
36. Gupta-Ostermann, D.; Wawer, M.; Wassermann, A. M.; Bajorath, J. Graph mining for SAR transfer series. J. Chem. Inf. Model. 2012, 52, 935–942.
37. Yildirim, M. A.; Goh, K. I.; Cusick, M. E.; Barabasi, A. L.; Vidal, M. Drug-target network. Nat. Biotechnol. 2007, 25, 1119–1126.
38. Zhou, X.; Menche, J.; Barabasi, A. L.; Sharma, A. Human symptoms-disease network. Nat. Commun. 2014, 5, 4212.
39. Ono, K.; Demchak, B.; Ideker, T. Cytoscape tools for the web age: D3.js and Cytoscape.js exporters. F1000Res 2014, 3, 143.
40. Su, G.; Morris, J. H.; Demchak, B.; Bader, G. D. Biological network exploration with cytoscape 3. Curr. Protoc. Bioinformatics 2014, 47, 8.13.1–8.13.24.