Molecular Similarity Approaches in Chemoinformatics: Early History

Chapter 6

Publication Date (Web): October 5, 2016 | doi: 10.1021/bk-2016-1222.ch006

Molecular Similarity Approaches in Chemoinformatics: Early History and Literature Status

Peter Willett*

Information School, University of Sheffield, 211 Portobello, Sheffield S1 4DP, UK

*E-mail: [email protected]

Computed measures of molecular similarity play an important role in many aspects of chemoinformatics, including similarity searching, database clustering and molecular diversity analysis. This paper discusses the initial studies carried out in the Seventies and Eighties that laid the foundations for these present-day applications, and uses publication and citation data to demonstrate the place of molecular similarity in the present-day literature.

Introduction

As Rouvray noted (1), “similarity is one of the most instantly recognizable and universally experienced abstracts known to mankind. It is an abstraction that is at once ubiquitous in scope, interdisciplinary in nature, and seemingly boundless in its ramification”, and it is hence hardly surprising that it has found application in many different subject domains for a multitude of different purposes. Mendeleev’s discovery of the Periodic Table, which was based in part on recognizing the similarities in properties between groups of elements with related atomic weights, is often cited as an early example of the use of similarity concepts in chemistry, but this was just one of a stream of similarity-based applications stretching back over very many years (2).

© 2016 American Chemical Society

Bienstock et al.; Frontiers in Molecular Design and Chemical Information Science - Herman Skolnik Award Symposium 2015: ... ACS Symposium Series; American Chemical Society: Washington, DC, 2016.


Developments in information technology have spurred the introduction of a wide range of computational methods that seek to quantify the resemblances between pairs, or larger groups, of molecules, and such molecular similarity methods now play an important role in computer-aided molecular design. One only has to consider techniques such as ligand-based virtual screening or molecular diversity analysis to realize the importance of such methods in chemoinformatics: there are now many excellent reviews of molecular similarity, and the reader is referred to the listed references for detailed discussions of the topic (3–7). In this article, we describe the early history of molecular similarity, and use the methods of bibliometrics to highlight some of the key advances since it first began to be studied seriously in the Eighties. An important, and arguably the seminal, literature source is the 1990 book Concepts and Applications of Molecular Similarity (8), which was edited by Johnson and Maggiora and which was based in part on presentations that were made at a 1988 meeting of the American Chemical Society in Los Angeles. The chapters of the book demonstrate that, even by then (over a quarter of a century ago), similarity concepts had been applied to property prediction, quantum chemistry, ligand-receptor interactions, computer-aided synthesis design and the modeling of metabolic pathways, and it is easy to consider other applications such as QSAR, pharmacophore mapping and reaction similarity inter alia. This article hence focuses on just three specific applications (similarity-based virtual screening, molecular diversity analysis and database clustering) and on measures of similarity that are based on the types of information (in 1D, 2D or 3D) that can be readily computed from existing databases of chemical structures.
The next section introduces the similar property principle, which provides an empirical rationale for the use of similarity methods, and also a means for their evaluation and comparison. There then follow brief historical accounts of the early development of the three chosen chemoinformatics applications: the focus is on “early” since, as Lajiness has noted (9), “During the early days, before the field of molecular similarity had gained full status as a legitimate area of chemical research, the terms “molecular similarity” and, consequently, “molecular dissimilarity” or “molecular diversity” did not appear in titles, keyword lists, or abstracts” and the emergence of these fields is hence likely to be largely unknown to many modern-day readers of this chapter. The situation now is, of course, markedly different, with all three applications having a very large, constantly growing literature associated with them, and this is reflected in the extensive list of references at the end of the chapter, many of which describe the current state of the art. However, the paper differs from most review articles in having a strong focus on publications that were of importance in establishing the field but that have, in many cases, become less well known as a result of the passage of time. Once the three chosen applications have been described, their current status in the chemical literature is discussed using bibliometric data on publications and citations obtained from searches of the Thomson-Reuters Web of Science Core Collection database that were carried out in April 2015.



A Rationale for, and the Calculation of, Molecular Similarity

The underlying rationale for the use of molecular similarity methods in computer-aided molecular design is the Similar Property Principle (hereafter SPP), which states that structurally similar molecules have similar properties. The existence of such a principle provides an obvious basis for research in areas such as drug discovery, environmental chemistry and pesticide science inter alia, since the identification of a molecule with some desirable chemical, physical or biological property can be used to suggest structurally similar molecules that may also exhibit this property. It must be emphasized that the SPP is simply a rule-of-thumb and that it has, like any such rule, many exceptions: these have been widely recognised (10–12) and have been highlighted by recent work on activity landscapes (as discussed in the Conclusions section). It does, however, provide the basis for a wide range of computational approaches for the calculation of molecular similarity and for the use of such calculations to probe chemical datasets (13). The 1990 book by Johnson and Maggiora (vide supra) is often cited as the source for the SPP. In fact, Johnson and Maggiora seem to have first used it in a 1988 article (14) in which they ascribed it to a 1980 study of graph-theoretic methods for structure-activity correlation, in which Wilkins and Randic argued that it is “generally accepted that molecules of similar structural form may be expected to show similar biological or pharmacophoric patterns” (15).
The same two authors had expressed similar views in a paper published in the previous year when they stated that “Since many molecular properties, and especially chemical or therapeutic activity, bear some relationship to chemical structure, studies of the similarity of structures, rather than properties, should be the first priority” (16), but it is clear that the Principle was already widely understood, even if not expressed in explicit form, much earlier than that. For example, a 1967 article by Armitage et al. that used an approximate maximum common subgraph procedure to determine the similarity between the reactant and product molecules in a chemical reaction noted that “The concept of similarity among sets of chemical structures has far-reaching implications, not only in the analysis of chemical reactions, but in many other areas involving chemical structural information. It involves procedures which chemists use intuitively whenever they survey a set of chemical structures, and attempt to relate structure and activities of various kinds, including reactivities, physical properties, and biological properties” (17).

In like vein, the conclusions section of Adamson and Bush’s 1973 article on chemical clustering (vide infra) stated “in the automatic analysis of the properties of chemical species for the purpose of predicting unknown biological, physical or chemical properties the structural properties as represented by the structure diagram are likely to be correlated with the unknown properties” (18), this harking back to Crum-Brown’s famous 1868 article in which he noted that “It is obvious that there must exist a relation between the chemical constitution and the physiological action of a substance” (19).

Whatever the original source, the SPP is now well established and there is a wealth of experimental evidence supporting its general utility. For example, structurally similar molecules have been shown to tend to bind to similar protein targets (20–22) and predictive power in a QSAR study is related to the degree of structural similarity between the molecules comprising the training-set that is used to develop the model and the molecules in the test-set for which the activities are to be calculated (23, 24).

It is interesting to note in passing that analogous relationships in other fields between similarity (or closeness) and some characteristic of interest have been mentioned in the chemoinformatics literature. Thus Teixeira and Falcao (25) describe a QSAR application of kriging, a data mining technique that derives from Tobler’s first law of geography, viz “Everything is related to everything else, but near things are more related than distant things” (26); Willett (4) has discussed the close relationship that exists between the SPP and van Rijsbergen’s Cluster Hypothesis in information retrieval, which states that “closely associated documents tend to be relevant to the same requests” (27); and, most recently, Zwierzyna et al. (28) in a study of chemical space networks have noted the application of the homophily principle, viz “a contact between similar people occurs at a higher rate than among dissimilar people” (29). It would be surprising if there were not other such similarity-based relationships that will find future application in chemoinformatics.



At the heart of any similarity technique is a procedure for computing the similarity, i.e., the degree of structural resemblance in the present context, between pairs of molecules, and there is an extensive literature associated with such similarity measures (4, 30, 31). In brief, there are three major components to any similarity measure: first, the way that the structures are represented in machine-readable form; then, the weighting scheme that is used to describe the relative importance of different parts of the chosen representation; and finally the similarity coefficient that is used to provide a quantitative value for the extent of the structural relationship between the two resulting weighted structure representations. There are many different types of representation, of weighting scheme and of similarity coefficient, resulting in a huge number of possible similarity measures that could be used in chemoinformatics. There has hence been much interest over the years in comparative studies of the effectiveness of the three individual components (32–46), with these listed references being but a very small fraction of what is now an extremely extensive literature. Many of the comparative studies that have been carried out have assumed the general validity of the Principle and then identified the most effective representation (or weighting scheme or similarity coefficient) as being the one that results in the strongest correlation between structure and bioactivity. The basic approach derives from the pioneering study of chemical clustering by Adamson and Bush that is described in a later section of this chapter (18). These authors assumed that some quantitative property value is available for each molecule in a dataset (in their study this was the pI value for each of the 20 naturally occurring amino acids) and that the observed value for the x-th molecule is denoted by Ox. 
Once the dataset had been clustered using the single linkage clustering method, each molecule x was considered in turn and its predicted property value, Px, was taken to be the arithmetic mean of the observed values for the other molecules in the cluster containing x. The overall effectiveness of the procedure was then taken to be the correlation coefficient between the sets of Ox and Px values. Adamson and Bush used this approach as a way of demonstrating the validity of the single-linkage classifications that they had generated, and subsequently for comparing the effectiveness of several similarity and distance coefficients when used to predict the minimal blocking concentrations of a set of local anaesthetics (47). This approach to the comparison of methods was popularized in a long series of publications by Willett et al. that were summarized in a 1987 book (48), and the approach subsequently formed the basis for, e.g., Brown and Martin’s much-cited comparisons of clustering methods and structural descriptors for compound selection (35, 36). Analogous leave-one-out approaches are available for use with qualitative (active or inactive) property data and for the comparison of methods for similarity searching and for molecular diversity analysis (49), as exemplified by many of the comparative studies that have been cited previously. A further, related evaluation procedure based on the SPP, called neighborhood behavior, has been developed to evaluate the suitability of different structure representations for use in molecular diversity applications, and involves correlating the similarities between pairs of molecules with the absolute differences in their observed bioactivities (50).
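The leave-one-out evaluation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Adamson and Bush's original program: the cluster labels are taken as given (the clustering step itself is not shown), the function names `leave_one_out_predictions` and `clustering_effectiveness` are hypothetical, molecules in singleton clusters are simply skipped because they have no neighbours to predict from, and the property values are invented toy numbers.

```python
import math
from collections import defaultdict


def leave_one_out_predictions(cluster_labels, observed):
    """Predict P_x for each molecule x as the mean of the observed values
    of the *other* molecules in x's cluster (None for singleton clusters)."""
    members = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        members[label].append(idx)
    predicted = []
    for idx, label in enumerate(cluster_labels):
        others = [observed[j] for j in members[label] if j != idx]
        predicted.append(sum(others) / len(others) if others else None)
    return predicted


def pearson(xs, ys):
    """Product-moment correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def clustering_effectiveness(cluster_labels, observed):
    """Correlate observed (O_x) and predicted (P_x) values, skipping
    molecules for which no prediction could be made (an assumption)."""
    predicted = leave_one_out_predictions(cluster_labels, observed)
    pairs = [(o, p) for o, p in zip(observed, predicted) if p is not None]
    return pearson([o for o, _ in pairs], [p for _, p in pairs])


# Toy data (invented): two clusters of three molecules each.
labels = [0, 0, 0, 1, 1, 1]
observed = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
predicted = leave_one_out_predictions(labels, observed)
r = clustering_effectiveness(labels, observed)
print(predicted, r)
```

A correlation close to 1 indicates that cluster membership carries information about the property, which is how the approach can be used to compare alternative representations, coefficients or clustering methods on the same dataset.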



Clustering Chemical Databases

An early, possibly the earliest, example of the application of a clustering method to a chemical database was reported by Harrison in a description of a cluster analysis program that had been developed at (the then) Imperial Chemical Industries (ICI) Pharmaceutical Division (51). Molecules from the ICI database were represented by a 288-member fragment code and a probabilistic similarity coefficient was developed in which the co-occurrence of an infrequently occurring fragment was assumed to be of more importance in defining the membership of a cluster than the co-occurrence of a more frequent fragment. A cluster around a known active molecule was identified if there was a statistically significant number of molecules closer to the chosen active than a threshold value. Experiments were reported with files containing up to 16,000 molecules, with the clusters identified around known actives being inspected for the presence of significant structural features common to members of the cluster. An enhanced version of this approach was subsequently developed at Hoffmann-La Roche (52). The next papers to be discussed here are the aforementioned studies by Adamson and Bush that described an approach to chemical clustering that continues to be used right up to the present day. Harrison had used a fragmentation code to represent the molecules in his study, but Adamson and Bush adopted the small, automatically generated, atom- and bond-centered features that were then starting to be used for the implementation of 2D substructure searching systems (53–56). Their work showed that despite the simplicity of these features they provided a representation of molecular structure that, when combined with simple similarity or distance coefficients, provided measures of similarity that were both effective in operation and efficient in implementation.
As Adamson and Bush noted “The relationship between structure and property which is produced by the classification and SC’s and DC’s indicates that these techniques could usefully be incorporated in information storage and retrieval systems” (47) (where SC and DC denote similarity coefficient and distance coefficient). Adamson and Bush might well have been surprised if they had been told that the majority of chemoinformatics applications of molecular similarity some four decades after their work would still be based on their 2D fingerprint-based measures of similarity (5, 57, 58). The continuing usage is despite the many studies that have been reported over the years of measures based on, e.g., path-length, graph, shape, volume or electrostatic similarity (59–70). This is at least in part because simple, fingerprint-based similarity measures seem to be as effective as the many more sophisticated approaches that are now available, while at the same time being both simple to implement and efficient in operation (45, 71–74).

Adamson and Bush’s 1973 study provided the basis for a series of comparisons that evaluated over 30 different clustering methods (including hierarchic agglomerative and divisive methods, and non-hierarchic nearest neighbor and relocation methods) when implemented for the grouping of chemical structures using fingerprint-based similarity measures (48). The best results were achieved with Ward’s hierarchic agglomerative method (75), and the non-hierarchic, nearest-neighbor method of Jarvis and Patrick (76). The computational facilities available at the time (the mid-Eighties) meant that the former was too time-consuming for large-scale use, and it was thus the Jarvis-Patrick method that became widely adopted in operational chemoinformatics systems for selecting compounds for biological screening (77–79). However, later comparative studies on larger files demonstrated the general superiority of Ward’s method (35, 36) and improvements in both hardware and software meant that it has increasingly been implemented in operational software systems. That said, it is likely that it will be supplanted in its turn as new methods are developed that can handle even the largest chemical databases that are now available (80, 81).

The extensive take-up of the Jarvis-Patrick method for applications in chemoinformatics is clearly demonstrated by a Web of Science search that identified a total of 349 citations to the original 1973 article (76). Considering the ten journals that provided most citations, the largest number came from Journal of Chemical Information and Modeling (49 citations; the reader should note that, both here and elsewhere in this paper, counts include those from previous incarnations of a journal, i.e., Journal of Chemical Documentation and Journal of Chemical Information and Computer Sciences in the present context), with five of the other top-ten journals being Journal of Computer-Aided Molecular Design, Acta Crystallographica B, Journal of Medicinal Chemistry, Molecular Diversity and Journal of Molecular Graphics and Modelling (and six of the subsequent ten journals were also chemical in character).
The situation with regard to Ward’s method is totally different since citations to this 1963 article are spread across journals from a large number of disciplines: of the 5,268 citations it had attracted, the only chemistry-related journal in the top-ten citing journals was Journal of Chemical Information and Modeling in fourth place with 41 citations.

Similarity Searching

Clustering has been discussed first of the three applications considered here since it was, in the shape of the work by Harrison and by Adamson and Bush, the first to be studied. However, arguably of more general importance is similarity searching or, as it is increasingly referred to, similarity-based virtual screening. The basic idea of similarity searching flows directly from the SPP: if a known bioactive molecule (often referred to as a reference structure or target structure) is available then a database can be scanned to identify its nearest neighbors (i.e., the molecules that are most similar to it) since these are assumed to have the greatest probability of exhibiting the same activity. This is clearly a very simple approach to virtual screening and more sophisticated methods (involving, e.g., pharmacophore mapping, machine learning or ligand-protein docking) are often more effective in practice. However, it requires very limited knowledge, viz the identity of a single known active to act as the reference structure, and it can hence be the precursor to the use of more sophisticated screening strategies as more structural information becomes available (58, 82–88).
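The nearest-neighbor scan described above can be illustrated with a short Python sketch. It is offered only as an illustration under stated assumptions: each molecule is represented as a set of fragment labels, similarity is computed with the Tanimoto coefficient (the coefficient that this chapter notes is normally used for fragment-based searching), and the function names `tanimoto` and `similarity_search`, the molecule names and the fragment strings are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient for two fragment sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(reference, database, top_k=None):
    """Rank database molecules in decreasing order of similarity
    to the reference structure; optionally keep only the top_k."""
    scored = [(name, tanimoto(reference, frags)) for name, frags in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy data (invented): fragment sets standing in for 2D fingerprints.
reference = {"C-C", "C=O", "C-N", "c:c"}
database = {
    "mol_a": {"C-C", "C=O", "C-N"},
    "mol_b": {"C-C", "c:c", "C-Cl", "C-Br"},
    "mol_c": {"S=O", "P-O"},
}
ranking = similarity_search(reference, database)
for name, score in ranking:
    print(name, round(score, 3))
```

The molecules at the top of the ranking are the nearest neighbors of the reference structure and, under the SPP, the ones assumed most likely to share its activity.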



The first reports of similarity searching described work carried out in the mid-Eighties at Lederle Laboratories in the USA and at Pfizer in the UK (89–91). Although differing in detail, these studies focused on the use of similarity measures analogous to those suggested by Adamson and Bush and based on the numbers of substructural fragments that were common to two molecules that were being compared (i.e., the reference structure and a database structure). The work at Lederle was conducted as part of a project to develop robust methods for large-scale SAR studies, while that at Pfizer was conducted, initially at least, to prioritize the outputs of 2D substructure searches (though the focus rapidly changed to purely similarity-based searching as a way of providing structural browsing facilities). However, the two groups were at one in recognizing the potential of this new approach to database access as an adjunct to the existing substructure searching systems of the time (both public (92, 93) and in-house (91, 94)): “by providing a quantitative and holistic similarity measure that is not biased by concepts of functional groups and ring systems, the similarity probe can complement substructure-search techniques and can reveal relationships between classes of compounds that might otherwise be missed” (89), and “the ranking mechanism reduces the need for queries that have been finely honed so as to produce an acceptable volume of output, thus making end-user chemical retrieval more feasible than with conventional substructure searching systems” (90). Although not reported at the time, similarity searching had also been independently developed in 1986 by a group at Upjohn (9, 95). In their original system, the similarities were computed using the topological and information theoretic indices described by Basak et al. (96) but these were soon replaced by fragment-based searching facilities similar to those implemented in the systems at Lederle and Pfizer.
The successful use of atom- and bond-centered fragments to compute 2D molecular similarity spurred attempts to use fragments based on atoms and inter-atomic distances or angles to provide analogous measures of 3D molecular similarity (60, 97–100). These measures proved, however, to be notably less successful than those based on 2D fragments, and the most widely used 3D measures at present are probably those based on molecular shape. The basic idea is that molecules will have a high degree of shape similarity if their volumes substantially overlap, with the overlaps being computed rapidly using Gaussian techniques that were pioneered by Good et al. for the calculation of electrostatic similarity (61) and then further developed for the calculation of shape similarity by Grant et al. (64).

The simplicity and the effectiveness of 2D, fragment-based similarity searching meant that it was rapidly taken up, normally with the Tanimoto coefficient being used as the similarity coefficient (30, 101), as a standard facility in both commercial and in-house chemoinformatics systems. With little or no modification the approach continues to be used for database searching to the present day and, as noted above, increasingly as a key component of virtual screening systems. Perhaps the main development in similarity searching since the initial Lederle and Pfizer systems has been the adoption of data fusion, which is the name used to describe a range of methods for combining information that has been obtained from different data sources. The aim here is to produce a fused source that is more informative than are the individual data sources (102, 103). In the chemoinformatics context, these sources are lists of structures that are ranked in decreasing order of the value of a similarity coefficient (or of a scoring function in the case of protein-ligand docking studies, where the approach is normally referred to as consensus scoring (104, 105)). Data fusion was again an approach that was developed independently at about the same time by two different groups. Sheridan et al. at Merck described the fusion of pairs of rankings generated using different types of fingerprint (106, 107) while Ginn et al. at Sheffield described the fusion of 2D, 3D and spectral rankings generated using different types of similarity coefficient (108, 109). Both groups found the use of multiple searches to be effective, with Ginn et al. noting that fusion “will generally result in a level of performance (however this is quantified) that is at least as good (when averaged over a number of searches) as the best individual measure: since the latter often varies from one target structure to another in an unpredictable manner, the use of a fusion rule will generally provide a more consistent level of searching performance than if just a single similarity measure is available” (109).
Similar conclusions have been drawn in many subsequent studies and there is now an extensive literature associated with the use of data fusion for similarity searching (110, 111), with Sheridan suggesting that the combination of results from multiple existing similarity methods might be more useful than the development of new, more complex similarity searching techniques (112). Analogous consensus methods have started to be suggested for chemical clustering (113, 114) and for the analysis of activity landscapes (115, 116).

Molecular Diversity Analysis

Developments in combinatorial chemistry, chemical robotics and high-throughput screening in the early Nineties (117–119) spurred interest in computational techniques that could be used to maximize the structural diversity of the molecules that were to be synthesized and tested in drug-discovery programmes. In particular, methods were developed for selecting diverse sets of molecules from databases (either real or virtual) for biological testing directly or for inclusion in the monomer pools that are input to combinatorial syntheses. Early work on the first three general approaches to these tasks – cluster-based selection, dissimilarity-based selection and partition-based selection – is discussed in a 1997 issue of Perspectives in Drug Discovery and Design given over to diversity analysis (120). The first two of these approaches were based directly on the similarity measures that had been developed previously for clustering and similarity searching and that have been described above; indeed, the pioneering studies at Pfizer and Upjohn that are summarized below had been undertaken several years prior to the widespread adoption of combinatorial approaches to drug discovery.

The cluster-based selection of compounds for biological testing is a very obvious application of the use of cluster analysis methods in chemoinformatics, and one that was first studied on a reasonable scale in work at Pfizer UK (77) in the mid-Eighties. At that time, the company maintained a Structural Representatives File of compounds that were available for testing and that had previously been selected on a careful, manual basis. The new cluster approach, based on the Jarvis-Patrick method, was felt to have multiple advantages: “A complex and time-consuming intellectual operation that involves highly trained staff is replaced by a cheap automatic procedure; an effective clustering procedure should help to ensure that no classes of compounds are overlooked when selecting structures for testing and that the selection is consistent and free of bias; the existence of a classification can help to dictate which compounds are tested next in a program, since the identification of one active compound would suggest that the other members of that compound’s cluster should also be investigated” (77). Given these advantages it is hardly surprising that cluster-based selection procedures were extensively and rapidly adopted, especially when considerations of structural diversity assumed greater importance with the advent of combinatorial strategies for drug discovery.
Small-scale clustering studies were being carried out at Upjohn at much the same time as the Pfizer UK work, and it was also these two companies that reported the first applications of dissimilarity-based approaches to compound selection. The basic task is a simple one: given a database containing N molecules, a dissimilarity-based selection method tries to identify an n-molecule subset of the database (n
