Searching Techniques for Databases of Two- and Three-Dimensional

© Copyright 2005 by the American Chemical Society

Volume 48, Number 13

June 30, 2005

2005 American Chemical Society Award for Computers in Chemical and Pharmaceutical Research Searching Techniques for Databases of Two- and Three-Dimensional Chemical Structures Peter Willett† Krebs Institute for Biomolecular Research and Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, U.K. Received April 8, 2005

† Phone: +44 114 222 2633. Fax: +44 114 278 0300. E-mail: [email protected].

Introduction

It is a great honor to be the recipient of the 2005 American Chemical Society Award for Computers in Chemical and Pharmaceutical Research. Everybody enjoys praise, and praise from one’s scientific peers is especially pleasing, but praise where praise is due: whatever I have managed to achieve since I first became interested in the mid-1970s in what we now refer to as chemoinformatics1,2 has only come about as the result of the contributions of a very large number of collaborators, both internally within the University of Sheffield and externally in the pharmaceutical, agrochemical, and chemical software industries. My career is certainly not one that I had planned for myself when young. I did a first degree in chemistry at Oxford but soon realized that my experimental abilities were far too limited for me to make a successful career as a laboratory chemist. Prior to going up to university I had worked in my local public library and continued to do so on a short-term basis during the university vacations. On completion of my chemistry degree in 1975, I hence decided to take an M.Sc. course in library and information science. There are many departments in the U.K. where one can obtain such a qualification: I ended up in the University of Sheffield, in what was then called the Postgraduate School of Librarianship and Information Science, primarily because they offered a larger scholarship than the other departments to

which I applied for a place! Although I did not realize it at the time, this simple financial decision determined the rest of my career, since during my M.Sc. degree in Sheffield I met two individuals who fired my enthusiasm for what was, for me, a completely new subject of study, specifically the computer processing of chemical structures. These two individuals were Michael Lynch, now retired but then a professor in the department who was responsible for much of the IT-related teaching, and the late George Vleduts, a visiting Research Fellow in Sheffield who had previously worked at VINITI in Moscow and who subsequently took up a senior research position with the Institute for Scientific Information in Philadelphia. Mike had been the Head of Research at Chemical Abstracts Service, where he was involved in the design of the first version of the registry system. On coming to Sheffield, he initiated a long-term research program that established the basic techniques that are used for screening 2D substructure searches and, subsequently, for searching the Markush structures in chemical patents (see, for example, refs 3 and 4). While at VINITI, George had published the very first paper suggesting that computers could be used to index chemical reactions and to assist in the design of complex organic syntheses.5 On coming to Sheffield, George worked with Mike on the former problem, where the basic question to be addressed was the identification of a similarity relationship: that between the sets of reactant and product molecules in a reaction to identify

10.1021/jm0582165 CCC: $30.25 © 2005 American Chemical Society Published on Web 05/28/2005


Journal of Medicinal Chemistry, 2005, Vol. 48, No. 13

Award Address

those parts that were changed (the reaction site or reaction center) and those that were unchanged during the substructural transformations that characterize a chemical reaction. I became interested in this problem initially as a dissertation project for my M.Sc. and then as a Ph.D. study. Drawing upon much previous work at Sheffield (see, for example, refs 6 and 7), I was able to develop two approaches, one (based on the comparison of Wiswesser Line Notation symbol strings) designed for the production of printed indexes of reactions8,9 and the other (based on an approximate graph-matching algorithm) for the computer searching of connection tables.10,11 The precise details of this work are now of only historical interest (although the latter approach was subsequently much developed to form the basis for operational reaction database systems12,13), but it did arouse an interest in the computer processing of chemical molecules that has lasted to the present day. It may seem strange now, but like many people in the mid-1970s, I had gone through my entire school and undergraduate education without ever using anything more sophisticated than a slide rule and (in my final undergraduate year) a pocket calculator. On the M.Sc. program at Sheffield, one had to learn computer programming, first in PLAN, an assembly-level language, and then in COBOL. I was never a naturally gifted programmer in the sense of being able to produce robust and efficient code the first time round, but I found that I got a great deal of satisfaction from coding and, more importantly, that I possessed skills in algorithm design that seemed worth developing and that have served me well in my subsequent career.
This has all been spent in Sheffield (in what is now called the Department of Information Studies), during which time I have worked in many areas of chemoinformatics; however, my initial studies of reaction similarities resulted in an interest in molecular similarity that has been the focus of a large proportion of my work, and it is the area where I am currently concentrating my efforts, specifically in the context of similarity-based approaches to virtual screening. Apart from Mike and George, the other formative influence on my career has been the academic department in which I have worked; as its name suggests, my department is an information department, not a chemistry department. Accordingly, the work of my group has focused rather more on algorithms, data structures, and computational efficiency, etc. and less on the details of chemistry and biology that characterize work in areas such as synthesis design, QSAR, molecular modeling, and ligand docking. This is not to say that we have not worked in such areas; rather, our work tends to consider databases of objects that possess particular computable characteristics rather than considering the detailed chemical and biological natures of those objects. This abstraction process has meant that we have been well-placed to draw on techniques from other fields, most obviously the techniques that have been developed in the field of information retrieval (IR) for searching digital libraries.14-16 Indeed, for many years, I worked in IR as well as in chemoinformatics, studying subjects as diverse as clustering document databases,17,18 morphological analysis,19 parallel text retrieval,20,21 and the searching of historical texts,22 inter alia. IR has traditionally focused on textual data, but its basic concepts are applicable, in principle, to any type of data, subject to constraints arising from the nature of the object representations that are used. In the case of IR and chemoinformatics, the graph characterizing a 2D or a 3D chemical structure bears a much closer relationship to its parent molecule than do the character strings representing the words comprising a textual document, where the use of natural language raises a host of linguistic problems that do not arise in the chemical context. However, there are also many close links between the two types of processing. I have found that it is often the case that algorithms and data structures that are applicable in IR are also applicable in chemoinformatics and vice versa,23 and I will make reference to some of these analogies in the body of this paper, which is structured as follows. The next section summarizes some of the main aspects of my group’s research work over the years, and this is followed by a more detailed review of molecular similarity and its application to the processing of chemical-structure databases. Finally, I describe some of our current work in the area of 2D similarity searching.

Areas of Research

This section summarizes my research group’s contributions in several areas of chemoinformatics with which it has been concerned since I took up a faculty position in Sheffield in 1979: the generation and searching of 3D pharmacophoric patterns; similarity and cluster analysis; molecular diversity analysis; and the use of chemoinformatics techniques in bioinformatics. One of these areas, that of similarity searching, is considered in more detail in later sections of this paper; other areas in which we have worked include QSAR, ligand docking, and the use of parallel database machines (see, for example, refs 24-30).
This paper focuses on specific applications rather than on the algorithmic approaches that were adopted for these applications, but it is perhaps worth describing briefly those three approaches that we have found most productive over the years: graph theory, cluster analysis, and genetic algorithms. Algorithmic Approaches. Graph theory is a branch of mathematics that describes sets of objects (called nodes or vertexes) and the relationships (called edges) between pairs of these objects.31-33 In the chemical context, this provides a very natural representation of a 2D chemical structure diagram, with the nodes and edges of a graph representing the atoms and bonds, respectively, of a molecule. The use of graph-based methods in chemistry was first described by Ray and Kirsch.34 A chemical database can hence be represented by a large number of such graphs, with database searching historically being carried out using two types of search algorithm: structure searching and substructure searching. Structure searching involves a graph isomorphism search in which the graph describing the query molecule is checked for isomorphism (i.e., an exact match) with the graphs describing each of the molecules in a database, this permitting a search for a specific query structure, e.g., to retrieve the biological assay results and the synthetic details associated with that


particular molecule. Substructure searching involves checking the graph describing the query substructure for subgraph isomorphism with the graphs of each of the database molecules, this permitting the retrieval of all of those database molecules that contain a user-defined query substructure, irrespective of the environment in which that substructure occurs. The extension of such ideas from 2D chemical graphs to enable 3D searching formed one of our principal foci of interest in the 1980s. A third type of search is similarity searching. There are many ways in which this can be done, as discussed in detail later in this paper. For the moment, we note merely that one of these ways involves the use of similarity measures based on the graph-theoretic concept of a maximum common subgraph (hereafter MCS) isomorphism algorithm, where an MCS is the largest subgraph common to a pair of graphs. In the chemical context, this equates to the largest substructure common to a pair of molecules, such as the unchanged parts of the reactant and product molecules in a chemical reaction. Cluster analysis is a multivariate statistical technique that has been used throughout the social and physical sciences for identifying groups, or clusters, of similar objects in a multidimensional space.35-37 At the risk of some simplification, it is possible to identify three main stages in the generation of a clustering of a data set, which in the context of this paper we will assume is a database of 2D chemical structures. First, each of the molecules in the database must be characterized by some set of features. Typically there will be several or many such features, and these features may be weighted or standardized in some way. Next, the similarity between each pair of the resulting molecular representations is computed using a similarity coefficient, which yields a quantitative value for the degree of resemblance between the two molecules that are being compared.
This similarity calculation is repeated for all pairs of molecules. Finally, a clustering method is applied to the resulting set of intermolecular similarities to identify the clusters of highly similar molecules that are present in the data set. It will be realized from this brief description that there are many decisions that need to be made before a clustering can be generated, and much of our work over the years has involved the comparison and evaluation of different procedures for representing molecules, for computing similarities, and for clustering the similarities once they have been computed. Finally, a genetic algorithm, or GA, is perhaps the best-known exemplar of that part of computer science that is known as evolutionary computing. This involves the use of algorithmic models of biological evolution to provide efficient, albeit often approximate and nondeterministic, solutions to computational problems that cannot be addressed by more systematic, and generally deterministic, computational approaches.38,39 A GA takes as its starting point a population of possible solutions, called chromosomes, to the problem that is being addressed. These initial solutions are often generated at random. A fitness function is used to evaluate the fitness of each chromosome, where the fitness of a particular chromosome measures the “goodness” of the solution encoded by that chromosome. Chromosomes of higher-than-average fitness are then processed using

mutation and crossover operators analogous to those employed in conventional biological evolution, with the aim of producing a new population of chromosomes in which the average fitness will have increased. When iterated through many generations, the resulting population is expected to contain solutions that provide a good, but not necessarily optimal, solution to the problem that is being addressed. The approximate, and nondeterministic, nature of GAs means that they should not be used when a feasible, conventional algorithm is available. When this is not so, as is the case in the combinatorial optimization problems that characterize many chemoinformatics applications, GAs have been found to provide a highly efficient way of generating effective, high-fitness solutions.40 Finally, in this section, I think it is important to note that an important strand in our work over the years has been the design and implementation of substantial comparative studies in which different techniques for some application are compared on several data sets.41 The comparison of existing approaches may not appear to be a particularly exciting area of science, given the importance that is attached to novelty. However, such comparative studies are of considerable importance because it is only by rigorous evaluation with a range of data sets that one can identify the best algorithms (in terms of the effectiveness of the results obtained, of the efficiency of the computational processing required for that level of effectiveness, and of their robustness when applied to disparate types of data) and hence develop successful operational systems. 
The success of these comparative studies is demonstrated by several approaches that are now widely used in chemoinformatics (including the Bron-Kerbosch clique-detection algorithm, the Jarvis-Patrick and Ward clustering methods, the Tanimoto similarity coefficient, and the Ullmann subgraph isomorphism algorithm, vide infra), the merits of which were first highlighted by work carried out in Sheffield.42-45 Generation and Searching of 3D Pharmacophores. As noted above, 2D substructure searching is effected by means of subgraph isomorphism algorithms. This is a highly effective approach but the combinatorial nature of subgraph isomorphism algorithms makes them highly inefficient and requires an initial screen search (based on a fragment bit-string or fingerprint) that eliminates from further consideration the great bulk of a structure file that cannot possibly satisfy the subgraph match. Such screen-based and graph-based approaches have been used for 2D substructure searching for many years,1,46 but it was not until 1977 that Gund suggested that a graph-based approach could also be applied to the retrieval of 3D chemical structures.47 Here, the nodes and edges of a graph represent the atoms and interatomic distances, respectively, in a 3D molecule (rather than the atoms and bonds in a 2D molecule). In practice, it is common to minimize storage requirements by using just the atomic coordinates and computing the distances from these coordinates as required. The resulting interatomic distance matrix can then be checked for the presence of a query pharmacophore, or pharmacophoric pattern, where a pharmacophore is the geometric arrangement of structural features necessary for a molecule to bind at an active


site. Given that a 3D structure can be represented by a graph, the presence or absence of a pharmacophoric pattern can be confirmed by means of a subgraph isomorphism procedure in which the edges in a database structure and a query substructure are matched if they denote the same interatomic distance (typically within any user-specified tolerance such as ±0.5 Å). Gund’s paper is a very important one in the historical development of chemoinformatics, but several years were to pass before the first published description of a system for 3D substructure searching. This arose from a collaboration between Sheffield and Pfizer, the initial focus of which was the design of a screening system to permit rapid searching of large 3D databases.48 Given that pharmacophores are often expressed in terms of interatomic distances, the screens that we developed consisted of a pair of atoms together with an associated interatomic distance range, an approach that was subsequently widely developed.49-51 Once the initial screen search has been carried out, those molecules that contain the screens associated with the query pharmacophore are passed on for the subgraph isomorphism search.52,53 Studies of several different subgraph isomorphism algorithms42 demonstrated the general efficiency of that described by Ullmann.54 Our original comparison was purely in the context of pharmacophore matching, but we have found that this algorithm performs well for a range of structure-matching applications (see, for example, refs 24, 55, and 56), and it now forms the basis for many operational substructure searching systems.
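As a rough illustration of the distance-based matching described above, the following Python sketch checks whether a rigid 3D structure contains a query pharmacophore. It uses plain backtracking rather than the Ullmann algorithm, and the function name, the feature labels, and the coordinates are all hypothetical; only the idea of matching edges whose interatomic distances agree within a tolerance is taken from the text.

```python
import math


def find_pharmacophore(query_labels, query_dist, mol_labels, mol_coords, tol=0.5):
    """Return one mapping (query point -> molecule atom index) or None.

    query_dist[(j, i)] with j < i is the required distance (in angstroms)
    between query points j and i. Distances in the molecule are computed
    from the stored coordinates on demand.
    """

    def extend(mapping):
        i = len(mapping)              # index of the next query point to place
        if i == len(query_labels):
            return list(mapping)      # every query point has been matched
        for atom, label in enumerate(mol_labels):
            if atom in mapping or label != query_labels[i]:
                continue
            # Every edge to an already-placed point must agree within tol.
            if all(
                abs(math.dist(mol_coords[mapping[j]], mol_coords[atom])
                    - query_dist[(j, i)]) <= tol
                for j in range(i)
            ):
                mapping.append(atom)
                found = extend(mapping)
                if found is not None:
                    return found
                mapping.pop()         # backtrack and try another atom
        return None

    return extend([])


# Hypothetical three-point query: donor, acceptor, ring centre.
query_labels = ["donor", "acceptor", "ring"]
query_dist = {(0, 1): 3.0, (0, 2): 4.0, (1, 2): 5.0}

# Hypothetical database structure with one distractor acceptor (atom 0).
mol_labels = ["acceptor", "donor", "acceptor", "ring"]
mol_coords = [(10.0, 0.0, 0.0), (0.0, 0.0, 0.0), (3.2, 0.0, 0.0), (0.0, 4.1, 0.0)]

hit = find_pharmacophore(query_labels, query_dist, mol_labels, mol_coords)
miss = find_pharmacophore(query_labels, query_dist, mol_labels, mol_coords, tol=0.05)
```

In an operational system this step would only be run on the molecules that survive the initial screen search, since the backtracking cost grows quickly with the size of the structure.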
Our work on 3D substructure searching attracted considerable interest because it appeared at just about the same time as the first structure-generation programs became available, which made it possible to convert an existing 2D structure database to the 3D form at little computational cost.57,58 However, although both effective and efficient in operation, our initial screening and geometric searching algorithms were limited in that they took no account of the flexibility that characterizes many molecules.53,59 Specifically, a molecule was represented by a single low-energy conformation, with the result that a pharmacophore search was likely to miss those (probably many) molecules that could adopt a conformation containing the query pattern but that were encoded by a conformation that did not contain the sought pattern. The problem can be alleviated by storing several, or many, such low-energy conformations for each molecule,59 but this approach to flexible 3D searching means that the search algorithm cannot explore the full conformational space available to a flexible molecule, with the possibility of a loss in recall. The work in Sheffield, much of which was carried out in collaboration with ICI Pharmaceuticals Division and with Tripos Inc., thus focused upon the development of algorithms and data structures that could avoid such retrieval failures. In a rigid 3D molecule, the distance between each pair of atoms is a single fixed value, whereas the distance between a pair of atoms in a flexible molecule will depend on the conformation that is adopted. All of the geometrically feasible conformations that a flexible molecule can adopt may be encoded by storing a distance range for each pair of atoms in a molecule, with the


Figure 1. Example of a clique. The nodes a-d comprise a clique because each of them is connected to each of the others and there is no larger group of nodes with this property.

lower bounds and upper bounds of this range corresponding to the minimum and maximum possible distances for that pair of atoms. Such sets of distance ranges can be generated using bounds-smoothing techniques from distance-geometry.60,61 The screening and graph-searching algorithms that are used for rigid 3D searching operate on graphs where each edge denotes a single value. Minor modifications to these algorithms enable them also to process graphs in which each edge contains both a lower bound and an upper bound. Indeed, one can regard rigid searching as being a limiting case of the more general algorithms that are required for flexible searching. These modifications hence enable the retrieval of all of the molecules in a database that could possibly adopt a conformation that contains a query pharmacophoric pattern.62 There is, however, one major difference between flexible 3D substructure searching (on one hand) and both 2D and rigid 3D substructure searching (on the other): specifically, those molecules that match the query in the subgraph-isomorphism component of a flexible 3D search must then undergo a further, and final, check that uses some form of conformational-searching procedure. A range of methods for this final search component were evaluated,63 with the most effective and most efficient seeming to be the technique known as directed tweak.64 Flexible 3D pharmacophore searching is now well-established and plays an important role in lead-discovery programs for novel bioactive molecules.1,53,59,65 However, it can only be used if it is possible to define the pharmacophore of interest, and we have studied two different approaches to pharmacophore detection, both of which take as their input a set of known bioactive molecules and produce as output that pattern (or patterns) of features in 3D that the molecules have in common. This common pattern may be assumed to represent (or at least to contain) the pharmacophore that is responsible for the observed activity.
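The edge comparison that separates flexible from rigid searching can be illustrated with a small Python sketch. The function name and the numerical values are assumptions made for illustration, and the text does not give the exact matching rule; the sketch simply treats two edges as compatible when their distance ranges overlap, with a rigid edge stored as a range whose two bounds coincide.

```python
def ranges_compatible(query_range, mol_range):
    """True if two (lower, upper) distance ranges in angstroms overlap."""
    q_lo, q_hi = query_range
    m_lo, m_hi = mol_range
    return q_lo <= m_hi and m_lo <= q_hi


# Hypothetical query edge: 5.0 angstroms with a tolerance of 0.5.
query_edge = (4.5, 5.5)

# Flexible molecule: the atom pair can be anywhere from 3.8 to 7.2 apart.
flexible_edge = (3.8, 7.2)

# Rigid molecule: a single stored distance is the limiting case lo == hi.
rigid_edge = (6.1, 6.1)

flexible_ok = ranges_compatible(query_edge, flexible_edge)
rigid_ok = ranges_compatible(query_edge, rigid_edge)
```

An overlap only shows that a matching conformation is not ruled out by the bounds, which is why the final conformational check described above is still needed for flexible hits.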
The first approach takes as input a set of rigid molecules and identifies the common pattern(s) using a graph-theoretic algorithm. The second takes as input a set of flexible molecules and identifies the common pattern(s) using a GA. The first approach arose from our comparison of subgraph-isomorphism algorithms for 3D pharmacophore searching.42 One of these algorithms was based on the graph-theoretic concept of a clique, where a clique is a subgraph of a graph in which every node is connected to every other node and which is not contained in any larger subgraph with this property (see Figure 1). Although the clique approach was found to


Figure 2. Efficient MCS detection. Generation and use of a correspondence graph for the identification of the maximal subgraphs common to two graphs A and B.
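The construction named in the Figure 2 caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the work described: the graph encoding (label dictionaries plus edge dictionaries keyed by frozenset pairs), the function names, and the two toy graphs are all hypothetical, and the clique finder is the basic Bron-Kerbosch recursion without pivoting. Non-bonded pairs are required to correspond as well, so the result is a common induced subgraph that need not be connected.

```python
from itertools import combinations


def correspondence_graph(labels_a, edges_a, labels_b, edges_b):
    """Adjacency sets of the correspondence graph of graphs A and B.

    A node is a pair (a, b) of identically labelled nodes. Two pairs are
    adjacent when they use distinct nodes in both graphs and the relation
    between the A nodes equals the relation between the B nodes (the same
    edge label, or no edge in either graph).
    """
    nodes = [(a, b) for a in labels_a for b in labels_b
             if labels_a[a] == labels_b[b]]
    adj = {n: set() for n in nodes}
    for (a1, b1), (a2, b2) in combinations(nodes, 2):
        if a1 == a2 or b1 == b2:
            continue
        if edges_a.get(frozenset((a1, a2))) == edges_b.get(frozenset((b1, b2))):
            adj[(a1, b1)].add((a2, b2))
            adj[(a2, b2)].add((a1, b1))
    return adj


def bron_kerbosch(adj, r=frozenset(), p=None, x=frozenset()):
    """Yield every maximal clique of the graph as a frozenset of nodes."""
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r                      # r cannot be extended: maximal clique
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}                  # v has been fully explored
        x = x | {v}


# Toy graph A: C-C-O chain. Toy graph B: C-O-N chain.
labels_a = {1: "C", 2: "C", 3: "O"}
edges_a = {frozenset((1, 2)): "single", frozenset((2, 3)): "single"}
labels_b = {"x": "C", "y": "O", "z": "N"}
edges_b = {frozenset(("x", "y")): "single", frozenset(("y", "z")): "single"}

adj = correspondence_graph(labels_a, edges_a, labels_b, edges_b)
cliques = list(bron_kerbosch(adj))
largest = max(cliques, key=len)      # node pairing of the common C-O unit
```

For 3D structures the edge labels would be interatomic distances compared within a tolerance rather than for exact equality. Keeping all of the maximal cliques, and not just the largest, mirrors the behaviour described in the text of reporting every common substructure above some minimum size.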

be less efficient than our preferred candidate (the Ullmann algorithm that has been referred to previously), we realized that it provided a natural way of identifying the largest substructure common to a pair of molecules and hence provided a possible way of identifying a 3D pharmacophore from a set of active molecules. A breadth-first, tree-search algorithm for this purpose had already been described by Crandell and Smith,66 drawing on a previous 2D algorithm developed by Varkony et al.67 The Crandell-Smith algorithm is elegant in concept but can be very time-consuming in operation, and we felt that clique detection might provide a more efficient approach to pharmacophore identification. Given a pair of input graphs, it is possible to compute a further graph, called a correspondence graph, that contains all of the possible pairs of matching nodes from the input graphs. This procedure is detailed in Figure 2. The clique in the correspondence graph then corresponds to the maximum common subgraph (MCS) of the input graphs,68,69 which is precisely what one requires for the identification of the largest substructure common to a pair of molecules, i.e., a pharmacophore. A comparison of several different clique-detection algorithms42 demonstrated that the one first described by Bron and Kerbosch70 was by far the most efficient for matching 3D chemical structures. The only limitation is that the correspondence graph approach is limited to comparing just two graphs; however, by use of one of the ideas in the Crandell-Smith algorithm, it was possible to extend the basic algorithm to find the largest substructure common to any number of input structures. The Bron-Kerbosch algorithm was later adopted and further developed by Martin et al. 
in the widely used DISCO program for pharmacophore identification.71 In fact, the Bron-Kerbosch algorithm finds not just the single largest common substructure but all common substructures containing more than some minimal number of nodes, a characteristic that is useful when multiple hypotheses need to be explored (e.g., for analysis of protein folds72). If just the single largest common substructure is required, then other approaches are preferred.45,73 The second GA-based approach was carried out in collaboration with Wellcome Research Laboratories.74 Given a set of active molecules, our program selected one of them as a base molecule, to which the other molecules were fitted using a GA. A chromosome in this GA encoded two types of information that are necessary to ensure an appropriate overlay of a molecule onto the base molecule: binary strings that encode angles of rotation about the rotatable bonds in all of the molecules; and integer strings that map structural features that might be involved in a pharmacophoric pattern (specifically, hydrogen-bond donor protons, acceptor lone pairs, and ring centers) in the base molecule to corresponding features in each of the other molecules. The


feature-to-feature correspondences encoded in the mappings suggest possible pharmacophoric points, and the torsion-angle information specifies the conformations adopted by the molecules. A least-squares fitting process is used to overlay molecules onto the base molecule in such a way that as many as possible of the structural equivalences suggested by the mapping are formed. The fitness of a decoded chromosome is then a combination of the number and similarity of overlaid features, the volume integral of the overlay, and the van der Waals energy of the molecular conformations, and the genetic operators are used to drive the algorithm to that molecular superimposition that maximizes the value of this fitness function. This approach has been found to provide an effective way of identifying pharmacophore patterns and is embodied in a widely used program for pharmacophore identification called GASP (for Genetic Algorithm Superimposition Program). Similarity and Cluster Analysis. For many years, substructure searching provided the principal means of access to databases of first 2D and then 3D structures. It does, however, have several inherent limitations resulting from the need for a database structure to contain the entire query substructure if it is to be retrieved.44,75 This implies that the searcher must already have a fairly clear idea of the chemotypes that are required, which can clearly be very difficult at the start of a research program when perhaps just a single weak high-throughput screening (HTS) hit or competitor compound is available and when it is thus not possible to specify the particular feature(s) that are responsible for the observed activity (as would be the case if, for example, it had been possible to carry out a pharmacophore analysis). There is also the problem of output size: the specification of a broadly defined query substructure and/or a common ring system could easily retrieve many thousands of molecules. 
Alternatively, an initial query may prove to be too specific to retrieve anything at all. In either case, several different searches may be required before an appropriately sized output is obtained. Finally, a substructure search results in a simple partition of the database into two discrete subsets, i.e., those molecules that do contain the query substructure and those that do not, without any ranking of the retrieved molecules in order of decreasing similarity to the query, i.e., in order of decreasing probability of activity in the context of a virtual screening environment. These characteristics of substructure searching led to the development of the alternative, and complementary, access mechanism known as similarity searching.75,76 A query here generally involves the specification of an entire molecule, commonly known as the target structure or reference structure, rather than the substructure that is required for substructure searching. The target is characterized by one or more structural descriptors, and this set of descriptors is compared with the corresponding sets of descriptors for each of the molecules in the database. These comparisons enable the calculation of a measure of similarity between the target structure and each of the database structures, and the latter are then sorted into order of decreasing similarity with the target. The output from the search is a ranked list in which the structures that are calculated to be the


Figure 3. Similarity coefficients used for the comparison of pairs of fingerprints. The two fingerprints are assumed to contain n bits, with a bits being set in the first fingerprint, b bits in the second, and c of these set bits being in common.

most similar to the target structure, the nearest neighbors, are located at the top of the list. These neighbors form the output of the search and will be those that have the greatest probability of being of interest to the user, given an appropriate measure of intermolecular structural similarity. The first two reports of similarity searching were based on work carried out at Lederle Laboratories77 and by us in collaboration with Pfizer.78 Both groups realized that counts of the numbers of fragment substructures common to a pair of molecules provided a computationally efficient, and surprisingly effective, way of computing the degree of resemblance between a pair of molecules, an approach first suggested by Adamson and Bush.79 Our initial work focused on ranking the output of an initial 2D substructure search, but the availability of a fast inverted-file nearest-neighbor searching algorithm17 meant that we were soon able to dispense with the initial search so that the user needed only to input a target structure of interest to obtain a ranking of the entire database. Specifically, once the target structure had been input, its 2D fingerprint was generated and then compared with the fingerprints of each of the database molecules to identify the bits, and hence fragment substructures, in common. This information was used to compute a similarity measure, called the Tanimoto coefficient (which is detailed in Figure 3 along with several other similarity coefficients that have been used for the comparison of 2D fingerprints). The database was then ranked in decreasing order of the coefficient values so that the nearest neighbors were presented first to the user as the output from the search. We started our work on similarity searching as a purely pragmatic response to the limitations of substructure searching that have been summarized above. However, there is an obvious theoretic rationale for using such an approach. 
This is the similar property principle, which states that molecules that are structurally similar are likely to have similar properties80 (an

Award Address

idea that also underlies research into molecular diversity and chemogenomics under the names of the neighborhood principle81 and structure-activity relationship homology,82,83 respectively). Thus, if a bioactive target structure is searched for, then its nearest neighbors are also likely to possess that activity. These molecules are hence prime candidates for biological testing, compared to other molecules that occur further down the ranking. If this approach is to be used for virtual screening,84 then the similarity measure that is used to compute the structural similarities must be effective (i.e., a similarity measure for which high computed structural similarities do indeed correspond to similar bioactivity characteristics) and efficient (i.e., enable the measure to be calculated sufficiently rapidly for interactive access to large structure databases). The Sheffield group has devoted considerable time during the period under review to the development of similarity measures that exhibit these two, sometimes conflicting, characteristics, as described in later sections of this paper. In retrospect, we were very lucky in our initial choice of similarity measure (i.e., 2D fingerprints and the Tanimoto coefficient) from among the many measures that are available,76,85 since this approach has subsequently proved to be of very general applicability (see, for example, refs 44 and 86-90) and continues to be the method of choice for similarity searching in operational chemoinformatics systems of all sorts. It is perhaps worth noting that the effectiveness of fingerprint-based methods is somewhat surprising in that the fragments encoded in a fingerprint have normally been designed to maximize the screenout in substructure searches rather than to retrieve chemically and biologically related substances. 
Similarity searching involves matching one structure (the target structure) with all of the structures in a database, and it was natural to consider extending our ideas to the clustering of chemical structures, where many of the more common methods for cluster analysis involve matching all of the members of a database with each other. Structure-based approaches to the clustering of chemical structures were first suggested in the late 1960s,91,92 but it was not until the 1980s that we commenced an extended evaluation of the effectiveness of over 30 hierarchic and nonhierarchic clustering methods when used for the grouping of chemical structures using fingerprint-based similarity measures.44 The principal aim of the work was to obtain an overview of the range of structural types present within a data set by selecting one (or some small number) of the molecules from each of the clusters resulting from the application of an appropriate clustering method to that data set. Our evaluation of clustering methods employed a property-prediction approach developed in previous work in Sheffield by Adamson and Bush.79 Using this approach on 10 small data sets for which physical, chemical, or biological property data were available, we found that the best results were obtained with Ward’s hierarchic agglomerative method,93 with the nonhierarchic nearest-neighbor method of Jarvis and Patrick94 performing almost as well. These two methods are summarized in Figures 4 and 5. At the time that these comparative experiments were carried out, computer limitations (in terms of both raw CPU speeds and the algorithms available) meant that Ward’s method could not be


Figure 4. Stored matrix algorithm for Ward’s hierarchic agglomerative clustering method. A hierarchical agglomerative clustering method generates a classification in a bottom-up manner by a series of agglomerations in which small clusters, initially containing individual molecules, are fused together to form progressively larger clusters. In this algorithm, a point is either a single molecule or a cluster of molecules. This procedure is known as the stored matrix algorithm because it involves random access to the intermolecular similarity matrix throughout the entire cluster-generation process. For a Ward classification, a distance coefficient must be used for the computation of the dissimilarity matrix and the least dissimilar pair of points in step 2 is computed using a measure based on the within-group sum of squared distances.

Figure 5. Algorithm for the Jarvis-Patrick clustering method. The Jarvis-Patrick method involves the use of a list of the top K nearest neighbors for each molecule in a data set, i.e., the K molecules that are most similar to it. Once these lists have been produced for each molecule in the data set that is to be processed, two molecules are clustered together if they are nearest neighbors of each other and if they additionally have some minimal number of nearest neighbors, Kmin, in common.
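The rule stated in this caption can be sketched in a few lines of Python. This is a minimal illustration, assuming a precomputed symmetric similarity matrix and a union-find structure for merging; the function name, the matrix input, and the tie-breaking by index order are assumptions of the example, not details of the figure.

```python
def jarvis_patrick(sim, k, k_min):
    """Cluster items 0..N-1 given an N x N similarity matrix `sim`.

    Two items are placed in the same cluster if each is in the other's
    top-k nearest-neighbor list and the two lists share at least k_min items.
    """
    n = len(sim)

    # Top-k nearest neighbors of each item (the item itself is excluded).
    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: sim[i][j], reverse=True)
        neighbors.append(set(order[:k]))

    # Union-find forest used to merge items that satisfy the rule.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbors[i] and i in neighbors[j]
            if mutual and len(neighbors[i] & neighbors[j]) >= k_min:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

Raising `k_min` relative to `k` makes the merging criterion stricter and so tends to produce more, smaller clusters.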

applied to chemical databases of substantial size; it has expected time and space complexities of O(N³) and O(N²) for the clustering of a data set containing N molecules, as against corresponding complexities of O(N²) and O(N) for Jarvis-Patrick. The latter method was thus rapidly and widely adopted as the clustering method of choice in operational chemical database software (see, for example, refs 95 and 96) for applications such as the selection of compounds for random screening and for analyzing the outputs of substructure searches.97 However, the method does have limitations,98,99 and subsequent comparisons86,87,100 have reaffirmed the general superiority of Ward’s method. The availability of improved computer hardware and of the efficient reciprocal nearest-neighbors algorithm101 (see Figure 6), which has expected time and space complexities of O(N²) and O(N), means that this method can now be applied to databases containing some hundreds of thousands of molecules in an acceptable amount of time. I have mentioned previously the relationships that I believe exist between IR and chemoinformatics, and it was this relationship that was the original source for our work on chemical similarity and clustering. Specifically, the limitations of substructure searching that encouraged our initial interest in similarity searching were just those that encouraged researchers in IR to develop the ranking approach (common in modern Web search engines) from the long-established Boolean approach to text searching (common in most early bibliographic retrieval systems).15,16 In just the same way, the

Journal of Medicinal Chemistry, 2005, Vol. 48, No. 13


Figure 6. Reciprocal nearest neighbors algorithm for Ward’s method. A path is traced through the similarity space until a pair of points is reached that are more similar to each other than they are to any other point, i.e., each is the reciprocal nearest neighbor (RNN) of the other. These RNN points are fused to form a single new point, and the search continues until the last unfused point is reached. In the figure, NN(X) denotes the nearest neighbor for the point X, and the final, overall hierarchic classification is created from the list of RNN fusions that has taken place.
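The chain-following idea in this caption can be sketched as follows. The example is a simplified illustration, not the published algorithm: it assumes each molecule is a point with Euclidean coordinates, stores every cluster as a centroid plus a size, and uses the centroid form of Ward's criterion, d(i, j) = n_i n_j / (n_i + n_j) × ||c_i − c_j||². The function names and the identifier scheme for fused points are hypothetical.

```python
def ward_distance(ci, ni, cj, nj):
    """Increase in within-group sum of squares caused by fusing two clusters."""
    sq = sum((x - y) ** 2 for x, y in zip(ci, cj))
    return ni * nj / (ni + nj) * sq


def rnn_ward(points):
    """Return the list of fusions (id_a, id_b, new_id) found by following NN chains."""
    centroid = {i: tuple(p) for i, p in enumerate(points)}
    size = {i: 1 for i in centroid}
    next_id = len(points)
    fusions = []
    chain = []  # current path X, NN(X), NN(NN(X)), ...

    while len(centroid) > 1:
        if not chain:
            chain.append(next(iter(centroid)))  # start from any unfused point
        top = chain[-1]
        prev = chain[-2] if len(chain) > 1 else None

        # Nearest neighbor of the point at the end of the chain;
        # ties are resolved in favor of the previous chain element.
        best, best_d = None, float("inf")
        for j in centroid:
            if j == top:
                continue
            d = ward_distance(centroid[top], size[top], centroid[j], size[j])
            if d < best_d or (d == best_d and j == prev):
                best, best_d = j, d

        if best == prev:
            # top and prev are reciprocal nearest neighbors: fuse them.
            chain.pop()
            chain.pop()
            n_top, n_prev = size.pop(top), size.pop(prev)
            c_top, c_prev = centroid.pop(top), centroid.pop(prev)
            total = n_top + n_prev
            centroid[next_id] = tuple((n_top * x + n_prev * y) / total
                                      for x, y in zip(c_top, c_prev))
            size[next_id] = total
            fusions.append((prev, top, next_id))
            next_id += 1
        else:
            chain.append(best)
    return fusions
```

Only the centroids, sizes, and the chain are stored, which is consistent with the O(N) space requirement quoted in the text; the hierarchy itself is recovered afterwards from the returned list of fusions.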

clustering of document databases to identify clusters that contain large numbers of relevant documents15,102,103 provided a natural model for the clustering of chemical databases to identify groupings of active molecules because the rationale for document clustering, the cluster hypothesis,102 is the IR equivalent of the similar property principle. Specifically, the cluster hypothesis states that documents that are similar tend to be relevant to the same requests: simply replace “document” in the cluster hypothesis by “molecule” and “relevant to the same requests” by “exhibit the same properties” and one has the similar property principle.

Molecular Diversity Analysis. The widespread adoption by the pharmaceutical industry of combinatorial chemistry and HTS from the mid-1990s onward resulted in a massive increase in the numbers of compounds that could be synthesized and tested for biological activity. These technological developments spurred widespread interest in methods for molecular diversity analysis, i.e., for selecting sets of compounds that are as structurally diverse (or heterogeneous, dissimilar, widely spaced, etc.) as possible and that are thus expected to provide the maximum amount of information about biological activity.81,104 The main approaches that have been developed to select diverse subsets of compounds include clustering, dissimilarity-based compound selection (DBCS), partitioning or cell-based approaches, and optimization-based methods.105,106 These methods are routinely used to select compounds for biological testing and also to select subsets of reagents for the synthesis of combinatorial libraries. Our early studies of molecular diversity analysis focused on DBCS. 
The basic DBCS algorithm for selecting a diverse subset from a database was described by Bawden107 and Lajiness,108 using an approach first described by Kennard and Stone.109 This algorithm is simple in concept (see Figure 7) and involves selecting the first compound at random and then repeatedly selecting as the next compound the one that is most dissimilar to those that have already been selected. The selection step involves calculating the dissimilarity of every compound remaining in the database to the compounds already selected in the subset. Examples of such dissimilarities include the MaxSum method, where the dissimilarity is taken to be the sum of the pairwise dissimilarities to all compounds in the subset, and the MaxMin method, where the dissimilarity is taken to be


Figure 7. Basic algorithm for dissimilarity-based compound selection. This algorithm selects a size n Subset from a size N Dataset and permits many variants depending upon the precise implementation of step 3. The centroid algorithm referred to in the text uses the sum of the pairwise dissimilarities between a single compound in the data set and the group of compounds that comprise the subset.
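A minimal Python sketch of the selection loop described in the text is given below. It covers the two scoring rules named there, MaxMin and MaxSum, in their naive form (every candidate is compared with every compound already selected); it does not implement the faster centroid variant. The function signature, the `first` and `seed` parameters, and the tie-breaking by data set order are assumptions of the example.

```python
import random


def select_diverse(dataset, n, dissim, rule="maxmin", first=None, seed=0):
    """Pick n indices from `dataset` by dissimilarity-based compound selection.

    dissim(x, y) returns the dissimilarity of two items.
    rule="maxmin": score a candidate by its dissimilarity to the closest selected item.
    rule="maxsum": score a candidate by the sum of its dissimilarities to all selected items.
    """
    if rule not in ("maxmin", "maxsum"):
        raise ValueError("rule must be 'maxmin' or 'maxsum'")

    remaining = list(range(len(dataset)))
    if first is None:
        # The first compound is chosen at random, as in the basic algorithm.
        first = random.Random(seed).choice(remaining)
    remaining.remove(first)
    subset = [first]

    def score(i):
        ds = [dissim(dataset[i], dataset[j]) for j in subset]
        return min(ds) if rule == "maxmin" else sum(ds)

    while len(subset) < n and remaining:
        best = max(remaining, key=score)  # most dissimilar remaining compound
        remaining.remove(best)
        subset.append(best)
    return subset
```

For fingerprints, `dissim` could be taken as one minus a similarity coefficient; any function returning larger values for less similar pairs will work with this loop.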

the dissimilarity of the most similar compound in the subset to the database compound.110 The Bawden-Lajiness algorithm has an expected time complexity of order O(n²N) for selecting an n-compound subset from an N-compound data set, which makes it impractical for use with large chemical databases. However, we were aware of an efficient algorithm that had been described previously by Voorhees for document clustering using the group-average hierarchical clustering method.111 We realized that her algorithm could be generalized to any situation where one needs to compute sums of similarities rather than sets of individual interobject similarities and where one can use the cosine coefficient to measure the similarity between pairs of vector objects (such as fragment bit-strings). It hence proved possible to adapt Voorhees’ algorithm to give a fast, O(nN), implementation of the selection step of the Bawden-Lajiness algorithm using the MaxSum method,112 and this centroid algorithm provided one of the first practical tools for large-scale DBCS. Although it was later shown that other selection algorithms were superior to the MaxSum method,110,113-115 the centroid approach was subsequently developed for several other applications,116-118 and it provided the starting point for our studies of product-based approaches to the design of combinatorial libraries. When we started this work, the conventional approach to the design of structurally diverse libraries focused on the selection of diverse subsets of reactants. The expectation was that use of diverse reagents in a combinatorial synthesis would necessarily result in a diverse set of product molecules but without any explicit consideration of the nature of these products. An alternative approach, illustrated in Figure 8, involves product-based selection. 
Here, the virtual product library is enumerated from all of the available reactants and then a combinatorial subset is chosen directly from the space of the resulting products. Reactant-based selection is much less computationally demanding than product-based selection, but there is no guarantee that optimized subsets of reactants will lead to optimized products. In particular, we felt that reactant-based selection might result in combinatorial libraries that were less than ideally diverse. We hence developed a GA for product-based compound selection that made use of the MaxSum centroid method and were able to show that product-based approaches do in fact result in more diverse libraries than reactant-based methods,119 a result that was later confirmed in other studies.120 Subsequent work by my colleague Val Gillet, in collaboration with Prof. Peter Fleming in the Department of Automatic Control and Systems Engineering at Sheffield and with colleagues at GlaxoSmithKline, has extended this approach to permit the design of combinatorial libraries that are not just structurally diverse but that also have appropriate physicochemical properties, thus ensuring the druglike nature of the libraries that are suggested for synthesis and testing.121,122

Figure 8. Creation of a combinatorial library of size n1n2 molecules using either reactant-based selection or product-based selection. The reactant-based approach involves the following: creation of r1 by selecting the n1 most diverse reactants from R1; creation of r2 by selecting the n2 most diverse reactants from R2; creation of the n1n2 products in library c by combining each of the n1 reactants in r1 with each of the n2 reactants in r2. The product-based approach involves the following: creating the N1N2 products in library C by combining each of the N1 reactants in R1 with each of the N2 reactants in R2; creating the combinatorial library, denoted by Ld*, by selecting the n1n2 most diverse molecules from the N1N2 products in library C (this selection normally being carried out subject to the combinatorial constraint to maximize synthetic efficiency).

Use of Chemoinformatics Techniques in Bioinformatics. Bioinformatics and chemoinformatics both involve the application of computational techniques to the analysis of large volumes of molecular structure data, typically (but not exclusively) the sequences of biological macromolecules in the case of bioinformatics and the 2D or 3D structures of small molecules in the case of chemoinformatics. Both specialisms play increasingly important roles in modern approaches to the discovery of novel bioactive molecules, but they have developed separately, with relatively little interaction between researchers in the two fields to date. Our efforts in this respect have focused on applying to biological macromolecules some of the algorithmic approaches that have been used successfully by chemoinformatics researchers. The principal focus of the work here, which has all been carried out in collaboration with Prof. Peter Artymiuk and his colleagues in the Department of Molecular Biology and Biotechnology at Sheffield, has been the use of graph theory for the representation and searching of the structures (rather than the sequences) of the proteins in the Protein Data Bank,123 for which we have developed two types of graph: one describing 3D patterns of secondary structure elements (hereafter SSEs) and the other describing 3D patterns of amino acid side chains (Figures 9 and 10). The first graph representation of a protein that we have developed makes use of the fact that the two most common types of SSE, the α-helix and the β-strand, are both approximately linear structures, which can hence be represented by vectors drawn along their major axes. 
The set of vectors corresponding to the SSEs in a protein can then be used to describe that protein’s 3D structure, this structure being represented by a graph in which


Figure 9. Representation of the geometric arrangement of secondary structure elements in a 3D protein. Each entry in the matrix contains the torsion angle between the vectors describing two secondary structure elements and then the intermidpoint distance in Å. The cylinder represents the helix, and the arrows represent the two strands.
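The matrix described in this caption can be illustrated with a short Python sketch that treats each SSE as a line segment and records, for every pair, an angle and an inter-midpoint distance. Several simplifications are assumed: the angle is the plain angle between the two direction vectors rather than a signed torsion angle, the distance of closest approach mentioned in the text is omitted, and the `(kind, start, end)` tuple layout and function names are hypothetical.

```python
from math import acos, degrees, sqrt


def _sub(p, q):
    return tuple(a - b for a, b in zip(p, q))


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _norm(u):
    return sqrt(_dot(u, u))


def _midpoint(p, q):
    return tuple((a + b) / 2 for a, b in zip(p, q))


def sse_edge(sse1, sse2):
    """Return (angle in degrees, inter-midpoint distance) for two SSE vectors.

    Each SSE is (kind, start, end), with start and end as 3D points in angstroms.
    Both vectors are assumed to have nonzero length.
    """
    _, s1, e1 = sse1
    _, s2, e2 = sse2
    v1, v2 = _sub(e1, s1), _sub(e2, s2)
    cos_t = _dot(v1, v2) / (_norm(v1) * _norm(v2))
    angle = degrees(acos(max(-1.0, min(1.0, cos_t))))  # clamp against rounding error
    distance = _norm(_sub(_midpoint(s1, e1), _midpoint(s2, e2)))
    return angle, distance


def sse_graph(sses):
    """Build node labels (SSE types) and edge labels for every pair of SSEs."""
    nodes = [kind for kind, _, _ in sses]
    edges = {(i, j): sse_edge(sses[i], sses[j])
             for i in range(len(sses)) for j in range(i + 1, len(sses))}
    return nodes, edges
```

The resulting `nodes` and `edges` form a fully connected labeled graph, which is the kind of structure a subgraph or maximum common subgraph search would then operate on.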

Figure 10. Representation of side chains by pseudoatoms. Diagram of an aspartate-histidine-serine catalytic triad pattern showing the locations of the pairs of pseudoatoms (white circles) that are used to represent side chains. Arrows represent the vectors between pseudoatoms within a side chain, and dotted lines represent the distances between pseudoatoms used in pattern matching, with heteroatoms shaded dark.

the SSEs correspond to the nodes of the graph and the geometric relationships between pairs of the SSEs correspond to the edges of the graph.55 More precisely, each node in such a graph is denoted by the SSE type (α-helix or β-strand) and each edge in such a graph is a three-part data element that contains the angle between a pair of vectors describing SSEs, the distance of closest approach of the two vectors, and the distance between their midpoints. In the case of amino acid side chains, the graph nodes and edges describe individual amino acids and the interacid geometric relationships.56 Specifically, each node contains two pseudoatoms, whose positions are chosen to emphasize the functional part of the side chain corresponding to that node. The locations of the two pseudoatoms are used to generate a vector, and each such vector corresponds to one of the nodes in a graph. The geometric relationships describing an edge encode information about the distances between the starts, ends, and midpoints of pairs of such vectors. Recent additions to the method include much more


detailed node descriptions (including not just the type of amino acid but also the secondary structural state of the residue, the redox state of cysteine residues, the solvent accessibility of the residue in the biologically relevant multimer, and the distance of the residue from a bound ligand or known site) and the ability to search for queries that also include the main-chain atoms of a residue.124 A protein can hence be represented by a labeled graph that can be searched for structural, rather than sequence, patterns using a subgraph or maximum common subgraph isomorphism algorithm. This has enabled us to discover many previously unrecognized structural resemblances of various sorts (see, for example, refs 72 and 124-128), and analogous approaches have since been developed by several other groups (see, for example, refs 129-131). More recently, we have extended our studies to encompass the representation and searching of carbohydrate and RNA structures and have again been able to demonstrate the much greater search effectiveness that is possible using graph-theoretic methods.132,133 The interested reader is referred to a recent review for a more detailed overview of our work on graph methods for macromolecules.134 In addition to these studies, we have also made use of GA and similarity techniques for handling such data. For example, some years ago we described the use of a GA-based approach for comparing the solvent-accessible surfaces of protein structures,135 using techniques derived from the work on field-based similarity searching that is discussed in the next section of this paper. 
We have recently employed the same basic technique as a component of a program, called GAPDOCK, for protein-protein docking.136 Here, we are seeking protein surfaces that are complementary to each other rather than being similar as in the earlier study by Poirrette et al.;135 however, our results show that the GA approach is equally effective in this different domain, as demonstrated by our involvement in the CAPRI (Critical Assessment of PRotein Interactions) docking competition.137 As another example, an ongoing study with Prof. Chris Hunter and his colleagues in the Department of Chemistry in Sheffield is studying the different sorts of information that can be gleaned from DNA structures, compared to DNA sequences. The computation of the correct structures of proteins from their sequences has been the subject of intense study for many years but is still problematic unless a strongly homologous sequence is already available. The situation with DNA structures is rather different, and Packer and Hunter have shown that it is possible to generate reasonably accurate structures using a simple computational model that focuses on six structural parameters that are associated with each of the nucleotides in a sequence.138 Gardiner et al. have used similarity measures based on these parameters to demonstrate that the structural properties of DNA are much less diverse than the sequences and that DNA sequence space is much larger and more heterogeneous than DNA structure space.139,140

Similarity Searching

Similarity searching has been a source of continuing interest in the group ever since the early work on 2D


fingerprints that has been summarized above. Our interest has been in the identification of similarity measures that are effective in operation, in the sense that they are able to bring together molecules that are indeed felt to be related and that can be implemented using algorithms that are suitable for searching large chemical databases. A similarity measure has three main components: the structural representation that is used to characterize the molecules that are being compared; the weighting scheme that is used to differentiate more important features from less important features; and the similarity coefficient that is used to quantify the degree of similarity between pairs of molecules. Of these, only limited work has been done on weighting. In this section we summarize some of our work on different representations, deferring discussion of similarity coefficients until later. Specifically, we here discuss similarity measures that are based on 2D chemical graphs and on 3D molecular fields. Other representations that we have studied for similarity searching include interatomic distances,141 valence angles, torsion angles,142 computed infrared spectra,143 and reduced graphs.144 Graph-Based Similarity Searching. As noted above, most similarity searching systems are based on the comparison of pairs of molecular fingerprints and on the use of the Tanimoto coefficient; however, if 2D structure diagrams are available, then it seems reasonable to consider measures of similarity that are based directly on the comparison of these diagrams rather than indirectly via the comparison of fingerprints that are derived from them. The obvious way of doing this is by the use of a similarity coefficient based on the identification of the MCS between the graphs representing a pair of structures. 
I have long been interested in the use of MCS methods (as noted in discussions above of reaction indexing, pharmacophore detection, and the comparison of 3D protein structures), but the algorithms used in those studies were not sufficiently rapid to enable them to be used for similarity searching in large chemical databases containing hundreds of thousands of structures.145 The need for enhanced computational efficiency has resulted in the development of a new MCS algorithm called RASCAL (for RApid Similarity CALculation).146,147 It is not appropriate to go into the details of the RASCAL algorithm here, so we note only its two principal components: an upper-bound screening procedure followed by graph matching based on a novel clique-detection procedure. The screening procedure is intended to determine rapidly whether the chemical graphs being compared exceed some specified minimum similarity threshold to avoid unnecessary calls to the more computationally demanding, graph-matching procedure. The procedure is thus analogous to the fingerprint-based screening that forms the first part of chemical substructure searching, in that both seek to minimize the computation that is required. However, whereas a good screening system for 2D substructure searching is able to screen out more than 99% of a file, the efficiency of screening for similarity searching is crucially dependent on the threshold similarity that is chosen: high screenout can be obtained if a high threshold, e.g., 0.90 for the Tanimoto coefficient, is

specified, but then the search may retrieve very few molecules other than close analogues of the target structure. If, conversely, one wishes to obtain an overview of the types of structure that are related to the target structure, then while a threshold of, perhaps, 0.75 may retrieve a reasonable spread of structural types, it will be at considerable computational cost because many fewer molecules will be eliminated from the MCS match. Thus, while upper-bounding is a powerful way of increasing the efficiency of many approaches to similarity searching (see, for example, refs 148-150), care is needed to ensure that the resulting efficiency gains are not at the expense of retrieval effectiveness. In essence, the first part of the screening lists the atom types (specifically the elemental type and the connectivity) in each of the two molecules that are being compared. Then if some particular atom type occurs N1 times in the first molecule and N2 times in the second molecule, there can be at most min{N1,N2} such fragments in common. By considering all of the atom types present in turn, it is possible to calculate an upper bound to the number of atoms and the number of bonds that the two molecules could possibly have in common. This information is then used to calculate an upper bound to the similarity for the two molecules, using a graph-based similarity coefficient such as that described by Johnson151 (which is a graph-based version of the well-known cosine coefficient, shown in Figure 3, that has been used for the calculation of fingerprint-based similarities75). If the computed upper bound is less than the user-defined threshold similarity then there is no need for further processing. 
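The counting argument behind this first screening stage can be sketched in Python. The version below is a simplified illustration under stated assumptions, not the RASCAL procedure itself: atoms are grouped by element only, the degrees within each group are sorted and paired off so that at most min{N1, N2} atoms of that element are counted, and the bond bound is half the sum of the smaller degree in each pair. The similarity is taken to have the cosine-like form (V_c + E_c)² / ((V_1 + E_1)(V_2 + E_2)), which is an assumed reading of the graph-based coefficient mentioned in the text. Molecules are given as `(atoms, bonds)` with hypothetical names throughout.

```python
from collections import defaultdict


def degrees_by_element(atoms, bonds):
    """Map each element symbol to the sorted (descending) degrees of its atoms."""
    degree = [0] * len(atoms)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    groups = defaultdict(list)
    for element, d in zip(atoms, degree):
        groups[element].append(d)
    for ds in groups.values():
        ds.sort(reverse=True)
    return groups


def upper_bound_similarity(mol1, mol2):
    """Upper bound on the graph similarity of two molecules, without any matching."""
    (atoms1, bonds1), (atoms2, bonds2) = mol1, mol2
    g1 = degrees_by_element(atoms1, bonds1)
    g2 = degrees_by_element(atoms2, bonds2)

    atom_bound = 0
    degree_sum = 0
    for element in g1.keys() & g2.keys():
        pairs = list(zip(g1[element], g2[element]))  # zip stops at min(N1, N2)
        atom_bound += len(pairs)
        degree_sum += sum(min(d1, d2) for d1, d2 in pairs)
    bond_bound = degree_sum // 2  # every common bond is counted at both of its ends

    size1 = len(atoms1) + len(bonds1)
    size2 = len(atoms2) + len(bonds2)
    if size1 == 0 or size2 == 0:
        return 0.0
    return (atom_bound + bond_bound) ** 2 / (size1 * size2)


def passes_screen(mol1, mol2, threshold):
    """True if the pair cannot be ruled out and must go on to graph matching."""
    return upper_bound_similarity(mol1, mol2) >= threshold
```

Because only counts are compared, the bound costs little more than a sort per molecule, which is what makes it useful as a filter ahead of the far more expensive matching step.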
This first screening stage considers just local connectivity information (and sums thereof) to calculate the upper bounds, without any attempt being made to consider the extent to which it is possible to map individual atoms in the first molecule to individual atoms in the second molecule. This is done in the second screening stage, which uses the linear assignment algorithm of Carpaneto et al.152 to identify a (possibly nonunique) optimal mapping. This provides a tighter upper bound to the numbers of matching atoms and bonds and hence to the overall intermolecular similarity. The two molecules are then passed on for the detailed clique-detection procedure if, and only if, this second upper bound is not less than the threshold similarity. The clique-detection procedure we have developed contains a number of novel pruning and upper-bounding heuristics, as well as modifications of existing heuristics. These techniques are applicable to any sort of MCS problem. There are also several heuristics that are based on the particular characteristics of chemical graphs and that can still further increase the speed of graph-matching when RASCAL is used for chemical similarity searching. These heuristics are described in detail by Raymond et al.146,147 Here, we focus on one characteristic of the graph-matching stage that plays a significant role in maximizing RASCAL’s efficiency, viz., the use of line graphs. The line graph L(G) of a graph G is a graph whose nodes are the edges in G and whose edges are the nodes in G, i.e., L(G) is the inverse of G (see Figure 11). We have noted previously the use of a correspondence graph for clique-detection, and we have used this approach here; however, the correspondence graph in


Figure 11. Relationship between a chemical graph, G, and its line-graph L(G). Each edge (numbered) in the graph G becomes a node in the line-graph L(G), with the edge descriptors in the latter graph being the corresponding node descriptors in G.
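The construction in this caption can be written out directly. In the hedged Python sketch below, each bond of G becomes a node of L(G), and two such nodes are joined whenever the corresponding bonds share an atom, with the shared atom's element used as the edge descriptor. The input layout (a list of element symbols plus a list of index pairs) and the function name are assumptions of the example, and the graph is assumed to have no repeated bonds.

```python
from itertools import combinations


def line_graph(atoms, bonds):
    """Build the line graph of a chemical graph.

    atoms: list of element symbols, indexed by atom number.
    bonds: list of (i, j) atom-index pairs; bond k becomes node k of the line graph.
    Returns (nodes, edges) where edges maps a pair of bond indices to the
    element of the atom that the two bonds share.
    """
    nodes = list(range(len(bonds)))
    edges = {}
    for p, q in combinations(nodes, 2):
        shared = set(bonds[p]) & set(bonds[q])
        if shared:
            # Two distinct bonds in a simple graph share at most one atom.
            edges[(p, q)] = atoms[next(iter(shared))]
    return nodes, edges
```

A clique search run on a correspondence graph built from two such line graphs matches bonds rather than atoms, which is the property the text goes on to exploit.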

RASCAL is computed from the line graphs of the two molecules that are being compared, rather than the usual graphs. The resulting correspondence graph is far less dense (i.e., contains far fewer nonzero elements) than when simple graphs are used for its construction, which results in very substantial increases in the speed of clique-detection. In addition, use of the line graphs means that the algorithm identifies the clique that maximizes the number of matching bonds rather than the number of matching atoms. This is preferable in a chemical context because this definition of an MCS often yields intuitively more acceptable measures of structural resemblance than the common definition of a chemical MCS as the common subgraph containing the largest number of common atoms.145 Once the MCS has been identified, we have used similarity measures that take account of both the numbers of common atoms and common bonds.153 Experiments with data sets of publicly available druglike compounds and with a range of similarity thresholds demonstrate that RASCAL is able to compute thousands of graph-based similarities a second using standard PC equipment.146 This is still much slower than fingerprint-based searching, but it is certainly rapid enough to permit exact MCS-based similarity searching of very large databases, something that had not previously been possible. The effectiveness of such searches was tested by means of simulated virtual screening experiments, using both the ID Alert and MDL Drug Data Report databases,153 and with comparable fingerprint-based similarity searches that used BCI, Daylight, and UNITY 2D fingerprints. In summary, across the extensive experiments that were carried out, there was little difference between the effectiveness of the graph-based and fingerprint-based searches in terms of the numbers of active molecules that were retrieved in similarity searches based on known bioactive target structures. 
This result is in line with previous studies that have shown that the use of a more detailed representation in a similarity calculation, in this case the full chemical graphs rather than fingerprints derived from them, does not necessarily result in an increase in the effectiveness of retrieval.17,86,154,155 However, an analysis of the identities of the active molecules (rather than just the number of them) shows that there are substantial differences (of about 30%) in the two sets of retrieved actives and thus that the two approaches are complementary in nature. This finding suggests that improvements in retrieval


effectiveness might be achieved by combining the two sorts of ranking using a data fusion method (vide infra), and experiments with the MDDR database show that this does indeed lead to an improvement of about 5% in recall when compared with the better of the two individual types of search.153 Field-Based Similarity Searching. Common patterns of atoms, either in 2D or in 3D, are clearly of importance in determining the similarity relationships that exist between pairs of molecules, but these are by no means the only features that are involved in biological activity. Most obviously, the success of 3D QSAR approaches such as CoMFA and CoMSIA demonstrates the importance of molecular electrostatic, steric, and hydrophobic fields.156 This suggests that it may be worth considering similarity measures that are based on such information, specifically on the extent to which the fields of two molecules overlap each other. This idea, first stated by Carbo et al.,157 has been taken up by several workers, most commonly using the molecular electrostatic potential (MEP) to describe a molecule’s electrostatic field (see, for example, refs 158-162). A molecule is positioned at the center of a 3D grid, and the electrostatic potential is calculated at each point in the grid. The similarity between a pair of molecules is estimated by aligning them in some way so that similar features are superimposed, taking the product of the two molecular potentials at each point and then summing the products over the entire grid. This procedure can be carried out efficiently using the elegant algorithm of Good et al.,163 who showed that the potential distribution could be approximated by a series of Gaussian functions. However, this still requires that the molecules that are being compared have been aligned in some way prior to the calculation of the similarity, and it was this need for an appropriate alignment procedure that spurred our interest in methods for field-based similarity searching. 
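The grid-based calculation described above can be sketched as follows, assuming the two molecules have already been aligned and their potentials sampled on the same grid. The sum of pointwise products is divided by the usual Carbo-index normalization, sqrt(ΣP_A² × ΣP_B²); that normalization, the flattened-list representation of the grid, and the function name are assumptions of the example, and the Gaussian approximation of Good et al. is not implemented.

```python
from math import sqrt


def carbo_similarity(grid_a, grid_b):
    """Carbo-style similarity of two potentials sampled on the same grid.

    grid_a, grid_b: flat sequences of potential values, one per grid point,
    for two molecules that are already aligned. Returns a value in [-1, 1].
    """
    if len(grid_a) != len(grid_b):
        raise ValueError("both molecules must be sampled on the same grid")
    overlap = sum(pa * pb for pa, pb in zip(grid_a, grid_b))
    norm = sqrt(sum(pa * pa for pa in grid_a) * sum(pb * pb for pb in grid_b))
    return overlap / norm if norm else 0.0
```

The value depends entirely on how the molecules were superimposed before sampling, which is why an alignment procedure is needed before a measure of this kind can be used for database searching.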
Our initial studies adopted a graph-theoretic approach to the representation and searching of MEPs.164 The basic idea underlying this approach is that it is possible to summarize the most important parts of an MEP by a much smaller number of points, with the resulting set of points being represented by a field graph, in which the points are the nodes of the graph and the interpoint distances are the edges. The field graphs are generated in two stages. In the first stage, a threshold potential is applied to the 3D grid representing an MEP to identify those grid elements that have the largest magnitudes (either positive or negative). In the second stage, a clustering-like procedure is applied to the resulting subset of the original grid elements to find the connected components that are present. The distance is calculated between the centers of all of the components that have been identified, and the resulting set of intercenter distances then forms the edge set of the field graph. A database is created by applying the graph-generation procedure to all of the constituent structures, and a similarity search is carried out by comparing the field graph representing the target structure with the field graphs of each of the molecules in the database. Each such comparison is done by means of an MCS procedure, based on the Bron-Kerbosch algorithm (vide supra), that specifies an alignment of


Journal of Medicinal Chemistry, 2005, Vol. 48, No. 13


the corresponding MEPs. This alignment enables the calculation of the intermolecular similarity using the fast Gaussian algorithm of Good et al.163 An alternative approach uses a simple GA. The chromosome in this GA encodes a set of translations and rotations that, when applied to the 3D coordinates of one molecule, will align its MEP with the MEP of the other molecule, which is considered to be fixed in space. The fitness function for the GA is the similarity value resulting from a Gaussian similarity calculation, and the GA hence searches the space of possible molecular alignments to find the alignment that maximizes the value of this similarity calculation.165 This is clearly a very simple way of exploring 3D space but one that is very powerful. We have used variants on this idea for pharmacophore mapping74 and protein-surface comparison135 (vide supra), and it also forms an important component of our GOLD program for flexible ligand docking.27 The field-graph approach is much more rapid in operation than the GA approach, but the precise form of the graphs that are obtained is crucially dependent on the parameter values required for the choice of the subset of the grid points; moreover, it proved to be difficult to extend the approach to permit flexible field-based similarity searching. For these reasons, we have chosen to focus on the GA-based approach, which is embodied in a program called FBSS (for Field-Based Similarity Searching) that we have used in several database searching studies.166-168 This work will be illustrated by our studies of ring searching, as discussed below. Lead-discovery programs make extensive use of libraries of compounds consisting of a central ring scaffold that positions substituents so that they can make favorable interactions with residues in a protein’s binding site.
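The GA-based alignment search described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the FBSS implementation: molecules are reduced to lists of point charges `(x, y, z, q)`, the fitness is a simple sum of pairwise Gaussian overlaps standing in for the Gaussian similarity calculation of Good et al., and the selection scheme, crossover, mutation width, and all function names (`transform`, `gaussian_overlap`, `ga_align`) are hypothetical choices made for brevity.

```python
import math
import random


def transform(points, genes):
    """Rotate points about x, y, z (angles rx, ry, rz), then translate by (tx, ty, tz)."""
    tx, ty, tz, rx, ry, rz = genes
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    moved = []
    for x, y, z, q in points:
        y, z = cx * y - sx * z, sx * y + cx * z    # rotation about the x axis
        x, z = cy * x + sy * z, -sy * x + cy * z   # rotation about the y axis
        x, y = cz * x - sz * y, sz * x + cz * y    # rotation about the z axis
        moved.append((x + tx, y + ty, z + tz, q))
    return moved


def gaussian_overlap(fixed, moving, alpha=0.5):
    """Stand-in similarity: charge-weighted Gaussian overlap summed over all pairs."""
    total = 0.0
    for xa, ya, za, qa in fixed:
        for xb, yb, zb, qb in moving:
            d2 = (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
            total += qa * qb * math.exp(-alpha * d2)
    return total


def ga_align(fixed, moving, pop_size=40, generations=60, seed=0):
    """Search translations and rotations of `moving` that maximize the overlap."""
    rng = random.Random(seed)

    def fitness(genes):
        return gaussian_overlap(fixed, transform(moving, genes))

    def random_genes():
        shifts = [rng.uniform(-3.0, 3.0) for _ in range(3)]
        angles = [rng.uniform(-math.pi, math.pi) for _ in range(3)]
        return shifts + angles

    # The identity transform is seeded so the result is never worse than no alignment.
    population = [[0.0] * 6] + [random_genes() for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]   # the fitter half survives unchanged
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            child = [rng.choice(pair) for pair in zip(mum, dad)]   # uniform crossover
            child = [g + rng.gauss(0.0, 0.2) for g in child]       # Gaussian mutation
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return best, fitness(best)


# Toy example: the "moving" molecule is a displaced, rotated copy of the fixed one.
fixed = [(0.0, 0.0, 0.0, 1.0), (1.5, 0.0, 0.0, -1.0), (0.0, 1.2, 0.0, 0.5)]
moving = transform(fixed, [1.0, -0.5, 0.3, 0.4, 0.0, 0.2])
unaligned_score = gaussian_overlap(fixed, moving)
best_genes, aligned_score = ga_align(fixed, moving)
```

Keeping the fitter half of each generation unchanged means the best score cannot decrease from one generation to the next, which makes the toy easy to reason about; a production GA would normally use more careful operators and a properly derived field similarity.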
It would hence be useful to be able to design libraries that are analogous to one that has been published, specifically to design libraries in which the functionality can still be positioned at the required positions in 3D space but in which a different central ring scaffold is employed.169,170 We have applied FBSS to the task, which we refer to as scaffold-searching,168 of identifying ring systems that are similar in shape to a user-defined target scaffold, T, and that can be substituted in the same approximate geometric arrangement as the points of attachment in T. Our initial work with FBSS considered just electrostatic fields, but it is easy to extend the approach to consider other types of molecular field. When FBSS is used for scaffold searching, we use a coefficient of shape similarity based on electron density (as suggested by Good and Richards171) to calculate the similarity between T and each of the rings in a database of ring systems. The ring systems are ranked in descending order of the calculated shape similarities, together with the corresponding alignment. Each such alignment is then checked to see if the points of attachment in T correspond to atoms in a database ring system that could, given suitable chemistry, have functionality attached to them. A distance threshold is used to determine whether a substitutable ring atom is an acceptable match for a point of attachment in T, and the search output is hence those rings that could act as alternatives to the ring

scaffold, ranked in order of decreasing shape similarity. Our experiments suggest that FBSS provides an effective way of identifying alternative ring scaffolds for a combinatorial library design. An operational system that uses some of these ideas has been described by Lewell et al.170

Current Work on 2D Similarity Searching

Virtual screening plays an increasingly important role in modern approaches to the discovery of new agrochemicals and pharmaceuticals,84,172 and much of our current work is in this area. In particular, over the past few years, we have gone back to the simple similarity measures based on 2D fingerprints to see if it is possible to increase further the effectiveness of this very popular approach to virtual screening. Our work has focused on the use of different similarity coefficients and on the use of data fusion to combine the results of different similarity searches.143,173-178 There are very many different types of similarity measure,75,76,85 and this has spurred interest in comparative studies that seek to identify a single, “best” measure, using some quantitative performance criterion. While such comparisons are valuable in identifying robust procedures of proven effectiveness, they are rather limited in that they assume, normally implicitly, that there is some specific type of structural feature (similarity coefficient, weighting scheme, or whatever it is that is being investigated) that is uniquely well suited to describing the type(s) of biological activity that is being sought in a similarity search. This assumption, which underlay much of our early work,44 cannot be expected to be generally valid, given the multifaceted nature of biological activity, and this has led us to consider chemical applications of the technique known as data fusion179 (which is also referred to as consensus scoring in the ligand docking literature180-182).
Data fusion was developed to combine inputs from different sensors, with the expectation that using multiple information sources enables more effective decisions to be made than if just a single sensor was to be employed. The methods are used in a wide range of military, surveillance, medical, and production engineering applications.179 Our interest was aroused by a paper by Belkin et al.,183 in which data fusion was used to combine the results of different searches of a text database, conducted in response to a single query but employing different indexing and searching strategies. A query was processed using different strategies, each of which was used to rank the database in order of decreasing similarity with the query. The ranks for each of the documents were then combined using one of several different fusion rules, the output of the fusion rule was taken as the document’s new similarity score, and the fused lists were finally reranked in descending order of similarity. We felt that such an approach would be equally applicable in the chemical context, as shown in Figure 12, and were thus heartened to see the appearance of a paper by workers at Merck,184 who discussed the use of both topological and geometric descriptors for similarity searching. Our initial work was carried out while evaluating the utility of the EVA descriptor for similarity searching, where we found that combining searches


Figure 12. Data fusion. Combination of similarity rankings using data fusion to create a single output ranking that is expected to offer a better level of retrieval effectiveness than any one ranking on its own.

obtained using EVA and using Unity 2D fingerprints often resulted in rankings that were more effective than the individual search rankings.143 This finding was confirmed in subsequent studies showing that use of an appropriate fusion rule generally resulted in a level of performance (however quantified) that was at least as good as (and often noticeably better than) the best individual measure from among those that were fused. Because the best individual measure often varies from one target structure to another in an unpredictable manner, the use of a fusion rule will thus generally provide a more consistent level of searching performance than will a single measure of chemical similarity.174 These initial studies looked at the combination of similarity measures that were markedly different from each other, involving not just different structural representations but often different similarity coefficients as well. More recently, we have focused on a single representation, a 2D fingerprint, and looked to see whether it would be possible to obtain increases in performance merely by using different similarity coefficients. We have noted above that the Tanimoto coefficient is widely used for measuring the similarity between pairs of fingerprints. The Tanimoto coefficient is, however, but one of a number of similarity and distance coefficients that can be used for the calculation of such similarities; examples of these coefficients are shown in Figure 3. The initial comparative study suggesting that the Tanimoto coefficient might be an effective measure of molecular similarity used a simulated property-prediction approach with a number of small QSAR data sets,185 and we hence felt that it was necessary to conduct comparative experiments on a much larger scale, using a simulated virtual screening methodology. 
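The fusion procedure outlined above (rank the database under each similarity measure, combine the ranks of each molecule with a fusion rule, then rerank) can be sketched as follows. Fingerprints are modeled as Python sets of on-bit positions; the Tanimoto formula is the standard one, while the use of the cosine coefficient as the second measure, the choice of `sum` as the default fusion rule, and the names `rank_positions` and `fuse` are assumptions made for illustration rather than the configuration used in the studies cited here.

```python
import math


def tanimoto(a, b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bit positions."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def cosine(a, b):
    """Cosine coefficient for two fingerprints given as sets of on-bit positions."""
    common = len(a & b)
    denom = math.sqrt(len(a) * len(b))
    return common / denom if denom else 0.0


def rank_positions(scores):
    """Map each molecule id to its rank (1 = most similar); ties are broken by id."""
    ordered = sorted(scores, key=lambda mol: (-scores[mol], mol))
    return {mol: pos for pos, mol in enumerate(ordered, start=1)}


def fuse(target, database, measures, rule=sum):
    """Rank the database under each measure, fuse the ranks, and rerank.

    `rule` combines the ranks of one molecule (for example sum, min, or max).
    A small fused rank means high similarity, so the output is sorted ascending.
    """
    rankings = [
        rank_positions({mol: measure(target, fp) for mol, fp in database.items()})
        for measure in measures
    ]
    fused = {mol: rule(ranking[mol] for ranking in rankings) for mol in database}
    return sorted(database, key=lambda mol: (fused[mol], mol))


# Toy data: four database fingerprints and one target (illustrative only).
target = {1, 2, 3, 4}
database = {
    "A": {1, 2, 3, 4, 5},
    "B": {1, 2},
    "C": {7, 8, 9},
    "D": {1, 2, 3, 4, 5, 6, 7, 8, 9},
}
fused_ranking = fuse(target, database, [tanimoto, cosine])
```

Fusing ranks rather than raw scores avoids having to put coefficients with different ranges on a common scale, which is one reason rank-based rules are a convenient starting point.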
This more recent work compared no fewer than 22 different coefficients175,186 and revealed that many of them were closely related to each other in that several clusters of coefficients were identified that produced essentially the same database rankings when used for virtual screening. We hence focused on a subset of 13 coefficients, including all of those in Figure 3, in our studies of data fusion, which involved simulated virtual screening of the MDL Drug Data Report (MDDR) database for seven bioactivity classes.175 Here, fused similarity searches were carried out as shown in Figure 12, with the fusions being based on all of the $\binom{13}{n}$ possible combinations of n coefficients for n = 1-13. All but one of the seven activity classes showed an increase in performance (in terms of the numbers of actives retrieved in searches for a known active target structure) over single coefficients when fusion was used. This performance increase seemed to peak when about two to four coefficients were combined but, in many cases, still


remained high for nearly all coefficients in combination. However, the best-performing combinations of coefficients were found to vary substantially from one activity class to another, and it was, unfortunately, not possible to identify some optimal combination that could be expected to improve performance over a single coefficient in all cases. It might be possible to identify the correct combination in a particular program and then to use that combination in subsequent similarity searches within that program. However, it must be remembered that similarity searching provides a very crude way of virtual screening, since it is most appropriate at an early stage in a program when just a few active molecules are available; if large amounts of training data are available, then more sophisticated approaches can be used, such as substructural analysis187 or pharmacophore mapping.188 Our inability to identify a robust combination of coefficients that could be expected to improve search performance in all circumstances was disappointing. However, an analysis of the extensive data that had been generated in the study revealed that there was a marked preference for certain coefficients to perform well when searching for active molecules of a particular size (as approximated by the number of bits set in a molecule’s fingerprint). For example, the Russell-Rao and Fossum coefficients appear in many of the best combinations involving larger active molecules while the Forbes and Baroni-Urbani coefficients appear in many of the best combinations involving smaller active molecules. 
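The size dependence noted above can be seen in a small worked example. The formulas below are the commonly quoted definitions of the Tanimoto, Russell-Rao, and Forbes coefficients for binary fingerprints (with c common bits, a and b bits set in the two molecules, and n bits in the fingerprint); they are stated here as assumptions, since Figure 3 is not reproduced in this section, and the fingerprints are toy data.

```python
def tanimoto(a, b, n_bits):
    """c / (a + b - c); n_bits is unused and kept only for a uniform signature."""
    common = len(a & b)
    return common / (len(a) + len(b) - common)


def russell_rao(a, b, n_bits):
    """c / n: common bits as a fraction of the fingerprint length."""
    return len(a & b) / n_bits


def forbes(a, b, n_bits):
    """c * n / (a * b): common bits relative to the product of the two bit counts."""
    return len(a & b) * n_bits / (len(a) * len(b))


N_BITS = 100
target = set(range(20))   # 20 bits set
small = set(range(5))     # 5 bits set, all shared with the target
large = set(range(40))    # 40 bits set, containing every target bit

for name, coeff in [("Tanimoto", tanimoto), ("Russell-Rao", russell_rao), ("Forbes", forbes)]:
    print(name, coeff(target, small, N_BITS), coeff(target, large, N_BITS))
```

In this toy case Russell-Rao scores the large molecule higher simply because it shares more bits, whereas Forbes scores the small molecule higher because its denominator grows with the number of bits set; this is consistent with the preferences reported in the text but is not a substitute for the analysis in the cited studies.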
The effect of molecular size on the performance of the Tanimoto coefficient for similarity-based and dissimilarity-based selection has been noted previously.189-191 Our study demonstrated that this was a general characteristic of similarity coefficients in current use, such as those shown in Figure 3.175 Importantly, we were subsequently able to provide a rationale for this behavior, thus permitting a similarity coefficient to be chosen that will maximize performance given some knowledge of the sorts of molecule that are required in a selection procedure.173 However, when averaged over a number of similarity searches for a number of different activity classes, we were not able to identify any similarity coefficient (or combination of similarity coefficients) that gave results that were consistently comparable with, or superior to, those obtained with the Tanimoto coefficient. In the absence of additional information, this hence remains the coefficient of choice for use with fingerprint-based virtual screening. Most of the studies of similarity searching that have been reported in the literature have considered the use of only a single bioactive reference structure. It is, however, increasingly the case that several structurally diverse reference structures may be available, e.g., published competitor compounds or HTS hits, and following work by Xue et al.192 and by Schuffenhauer et al.,193 we have become interested in similarity-based methods that can be used when multiple reference structures are available. Our initial studies compared several different ways of using the structural information implicit in a set of known actives and identified a novel data fusion technique, which we refer to as group fusion, as the most appropriate approach in such


circumstances. Conventional approaches to data fusion have involved carrying out similarity searches based on the use of several different similarity measures with a single target structure (for example, the different similarity coefficients in the work of Holliday et al. that has been discussed above). The group fusion approach, conversely, uses several different target structures, each represented in the same way, with a single similarity measure (a 2D fingerprint and the Tanimoto coefficient in our experiments).176 Specifically, if S(i,j) is the value of the coefficient for the similarity between some database molecule j and the ith of the set of known reference structures, then the fused similarity score given to j is

$$\max_{i \in \text{actives}} \{ S(i,j) \}$$

and the database is ranked in decreasing order of these maximum values. Simulated virtual screening experiments demonstrate that this approach enables very effective searches to be carried out even if as few as 10 reference structures are available and that this approach is particularly well-suited to virtual screening when the actives are structurally heterogeneous, i.e., encompass a range of different structural types.176,178 Conventional, single-target similarity searching is most useful when the actives are strongly clustered in structural space. Our results suggest that group fusion provides a highly effective and surprisingly simple alternative when this is not the case.

Conclusions

In this paper, I have tried to provide an overview of the major areas of interest of my research group over the past quarter-century, focusing on my fascination with (some might say addiction to) the subject of molecular similarity, in particular similarity based on 2D fingerprints. 2D similarity searching is by now almost 2 decades old, and recent work (both by ourselves and by others) has shown that there are still developments that can be made to the basic approach. Work on further developments, in particular on aspects of data fusion, is currently underway in Sheffield and will be reported shortly. Data fusion is but one example of work that we have carried out in chemoinformatics that has been strongly influenced by research in IR.183 Other examples include our studies of computing similarities using an inverted file17 and of hierarchic clustering methods.18 Currently, we are starting to make use of algorithmic techniques that have been developed in the pattern recognition and machine-learning communities, applying these techniques to the processing of fingerprint-based similarity data.194 Conversely, much of our work in bioinformatics has resulted from initial studies that were carried out in chemoinformatics, as discussed previously in this paper.

Acknowledgment. I must give my heartfelt thanks to the many individuals and organizations that have contributed to the work reported here. It is not possible to name them all; there are over 230 people with whom I have coauthored papers and over 30 organizations that have funded my work. But I would like to make particular acknowledgment of the contributions of my colleagues Peter Artymiuk, Val Gillet, and John Holliday here in Sheffield; of the extended collaborations that I have enjoyed with Pfizer, Syngenta, and Tripos Inc.; and of the support of my wife and daughter even when I should have been spending time with them rather than with chemoinformatics.

References

(1) Leach, A. R.; Gillet, V. J. An Introduction to Chemoinformatics; Kluwer: Dordrecht, The Netherlands, 2003. (2) Gasteiger, J.; Engel, T. Chemoinformatics; Wiley-VCH: Weinheim, Germany, 2003. (3) Adamson, G. W.; Cowell, J.; Lynch, M. F.; McLure, A. H. W.; Town, W. G.; Yapp, A. M. Strategic Considerations in the Design of Screening Systems for Substructure Searches of Chemical Structure Files. J. Chem. Doc. 1973, 13, 153-157. (4) Lynch, M. F.; Holliday, J. D. The Sheffield Generic Structures Project - a Retrospective Review. J. Chem. Inf. Comput. Sci. 1996, 36, 930-936. (5) Vleduts, G. E. Concerning One System of Classification and Codification of Organic Reactions. Inf. Storage Retr. 1963, 1, 117-146. (6) Armitage, J. E.; Lynch, M. F. Automatic Detection of Structural Similarities among Chemical Compounds. J. Chem. Soc. C 1967, 521-528. (7) Clinging, R.; Lynch, M. F. Production of Printed Indexes of Chemical Reactions. I. Analysis of Functional Group Interconversions. J. Chem. Doc. 1973, 13, 98-102. (8) Lynch, M. F.; Willett, P. The Production of Machine-Readable Descriptions of Chemical Reactions Using Wiswesser Line Notations. J. Chem. Inf. Comput. Sci. 1978, 18, 149-154. (9) Bawden, D.; Devon, T. K.; Jackson, F. T.; Wood, S. I.; Lynch, M. F.; et al. A Qualitative Comparison of Wiswesser Line Notation Descriptors of Reactions and the Derwent Chemical Reactions Documentation Service. J. Chem. Inf. Comput. Sci. 1979, 19, 90-93. (10) Lynch, M. F.; Willett, P. The Automatic Detection of Chemical Reaction Sites. J. Chem. Inf. Comput. Sci. 1978, 18, 154-159. (11) Willett, P. The Evaluation of an Automatically Indexed, Machine-Readable Chemical Reactions File. J. Chem. Inf. Comput. Sci. 1980, 20, 93-96. (12) Willett, P., Ed. Modern Approaches to Chemical Reaction Searching; Gower: Aldershot, U.K., 1986. (13) Chen, L.; Nourse, J. G.; Christie, B. D.; Leland, B. A.; Grier, D. L.
Over 20 Years of Reaction Access Systems from MDL: A Novel Reaction Substructure Search Algorithm. J. Chem. Inf. Comput. Sci. 2002, 42, 1296-1310. (14) Baeza-Yates, R.; Ribeiro-Neto, B. Modern Information Retrieval; Addison-Wesley: Harlow, U.K., 1999. (15) Salton, G. Automatic Text Processing; Addison-Wesley: Reading, MA, 1989. (16) Sparck Jones, K., Willett, P., Eds. Readings in Information Retrieval; Morgan Kaufmann: San Francisco, 1997. (17) Willett, P. A Fast Procedure for the Calculation of Similarity Coefficients in Automatic Classification. Inf. Proc. Manage. 1981, 17, 53-60. (18) El-Hamdouchi, A.; Willett, P. Comparison of Hierarchic Agglomerative Clustering Methods for Document Retrieval. Comput. J. 1989, 32, 220-227. (19) Popovic, M.; Willett, P. The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data. J. Am. Soc. Inf. Sci. 1992, 43, 384-390. (20) Carroll, D. M.; Pogue, C. A.; Willett, P. Bibliographic Pattern Matching Using the ICL Distributed Array Processor. J. Am. Soc. Inf. Sci. 1988, 39, 390-399. (21) Cringean, J. K.; England, R.; Willett, P. Network Design for the Implementation of Text Searching Using a Multicomputer. Inf. Proc. Manage. 1991, 27, 265-283. (22) Schinke, R.; Greengrass, M.; Robertson, A. M.; Willett, P. Retrieval of Morphological Variants in Searches of Latin Text Databases. Comput. Human. 1998, 31, 409-432. (23) Willett, P. Textual and Chemical Information Retrieval: Different Applications but Similar Algorithms. Inf. Res. 2000, 5 (at URL http://InformationR.net/ir/5-2/infres52.html). (24) Downs, G. M.; Lynch, M. F.; Willett, P.; Manson, G. A.; Wilson, G. A. Transputer Implementations of Chemical Substructure Searching Algorithms. Tetrahedron Comput. Methodol. 1988, 1, 207-217. (25) Ormerod, A.; Willett, P.; Bawden, D. Comparison of Fragment Weighting Schemes for Substructural Analysis. Quant. Struct.-Act. Relat. 1989, 8, 115-129. (26) Willett, P.; Wilson, T.; Reddaway, S. F.
Atom-by-Atom Searching Using Massive Parallelism. Implementation of the Ullmann Subgraph Isomorphism Algorithm on the Distributed Array Processor. J. Chem. Inf. Comput. Sci. 1991, 31, 225-233.

(27) Jones, G.; Willett, P.; Glen, R. C. Molecular Recognition of Receptor Sites Using a Genetic Algorithm with a Description of Desolvation. J. Mol. Biol. 1995, 245, 43-53. (28) Jones, G.; Willett, P.; Glen, R. C.; Leach, A. R.; Taylor, R. Development and Validation of a Genetic Algorithm for Flexible Docking. J. Mol. Biol. 1997, 267, 727-748. (29) Turner, D. B.; Willett, P.; Ferguson, A. M.; Heritage, T. W. Evaluation of a Novel Infra-Red Range Vibration-Based Descriptor (EVA) for QSAR Studies. I: General Application. J. Comput.-Aided Mol. Des. 1997, 11, 409-422. (30) Hirons, L.; Holliday, J. D.; Jelfs, S. P.; Willett, P.; Gedeck, P. Use of the R-Group Descriptor for Alignment-Free QSAR. QSAR Comb. Sci., in press. (31) Read, R. C.; Corneil, D. G. The Graph Isomorphism Disease. J. Graph Theory 1977, 1, 339-363. (32) Wilson, R. Introduction to Graph Theory, 4th ed.; Longman: Harlow, U.K., 1996. (33) Diestel, R. Graph Theory; Springer-Verlag: New York, 2000. (34) Ray, L. C.; Kirsch, R. A. Finding Chemical Records by Digital Computers. Science 1957, 126, 814-819. (35) Sneath, P. H. A.; Sokal, R. R. Numerical Taxonomy; W. H. Freeman: San Francisco, 1973. (36) Everitt, B. S. Cluster Analysis, 3rd ed.; Edward Arnold: London, 1993. (37) Arabie, P., Hubert, L. J., De Soete, G., Eds. Clustering and Classification; World Scientific: Singapore, 1996. (38) Goldberg, D. E. Genetic Algorithms in Search, Optimization and Machine Learning; Addison-Wesley: Wokingham, U.K., 1989. (39) Back, T., Fogel, D., Michalewicz, Z., Eds. Handbook of Evolutionary Computing; Oxford University Press: New York, 1997. (40) Clark, D. E., Ed. Evolutionary Algorithms in Computer-Aided Molecular Design; Wiley-VCH: Weinheim, Germany, 2000. (41) Willett, P. The Evaluation of Molecular Similarity and Molecular Diversity Methods Using Biological Activity Data. Methods Mol. Biol. 2004, 275, 51-63. (42) Brint, A. T.; Willett, P.
Pharmacophoric Pattern Matching in Files of 3-D Chemical Structures: Comparison of Geometric Searching Algorithms. J. Mol. Graphics 1987, 5, 49-56. (43) Brint, A. T.; Willett, P. Algorithms for the Identification of Three-Dimensional Maximal Common Substructures. J. Chem. Inf. Comput. Sci. 1987, 27, 152-158. (44) Willett, P. Similarity and Clustering in Chemical Information Systems; Research Studies Press: Letchworth, U.K., 1987. (45) Gardiner, E. J.; Artymiuk, P. J.; Willett, P. Clique-Detection Algorithms for Matching Three-Dimensional Molecular Structures. J. Mol. Graphics Modell. 1997, 15, 245-253. (46) Barnard, J. M. Substructure Searching Methods: Old and New. J. Chem. Inf. Comput. Sci. 1993, 33, 532-538. (47) Gund, P. Three-Dimensional Pharmacophoric Pattern Searching. Prog. Mol. Subcell. Biol. 1977, 5, 117-143. (48) Jakes, S. E.; Willett, P. Pharmacophoric Pattern Matching in Files of 3-D Chemical Structures: Selection of Inter-Atomic Distance Screens. J. Mol. Graphics 1986, 4, 12-20. (49) Sheridan, R. P.; Nilakantan, R.; Rusinko, A.; Bauman, N.; Haraki, K. S.; Venkataraghavan, R. 3Dsearch: A System for Three-Dimensional Substructure Searching. J. Chem. Inf. Comput. Sci. 1989, 29, 255-260. (50) Fisanick, W.; Cross, K. P.; Rusinko, A. Similarity Searching on CAS Registry Substances. 1. Global Molecular Property and Generic Atom Triangle Geometric Searching. J. Chem. Inf. Comput. Sci. 1992, 32, 664-674. (51) Cringean, J. K.; Pepperrell, C. A.; Poirrette, A. R.; Willett, P. Selection of Screens for Three-Dimensional Substructure Searching. Tetrahedron Comput. Methodol. 1990, 3, 37-46. (52) Jakes, S. E.; Watts, N. J.; Willett, P.; Bawden, D.; Fisher, J. D. Pharmacophoric Pattern Matching in Files of 3-D Chemical Structures: Evaluation of Search Performance. J. Mol. Graphics 1987, 5, 41-48. (53) Good, A. C.; Mason, J. S. Three-Dimensional Structure Database Searches. Rev. Comput. Chem. 1996, 7, 67-117. (54) Ullmann, J. R. An Algorithm for Subgraph Isomorphism. J.
ACM 1976, 16, 31-42. (55) Mitchell, E. M.; Artymiuk, P. J.; Rice, D. W.; Willett, P. Use of Techniques Derived from Graph Theory To Compare Secondary Structure Motifs in Proteins. J. Mol. Biol. 1990, 212, 151-166. (56) Artymiuk, P. J.; Poirrette, A. R.; Grindley, H. M.; Rice, D. W.; Willett, P. A Graph-Theoretic Approach to the Identification of Three-Dimensional Patterns of Amino Acid Side-Chains in Protein Structures. J. Mol. Biol. 1994, 243, 327-344. (57) Pearlman, R. S. Rapid Generation of High Quality Approximate 3D Molecular Structures. Chem. Des. Autom. News 1987, 2, 1-7. (58) Gasteiger, J.; Rudolph, C.; Sadowski, J. Automatic Generation of 3D Atomic Coordinates for Organic Molecules. Tetrahedron Comput. Methodol. 1990, 3, 537-547. (59) Warr, W. A.; Willett, P. The Principles and Practice of 3D Database Searching. Designing Bioactive Molecules: Three-Dimensional Techniques and Applications; American Chemical Society: Washington, DC, 1997; pp 73-95.


(60) Crippen, G. M.; Havel, T. F. Distance Geometry and Molecular Conformation; Research Studies Press: Letchworth, U.K., 1988. (61) Raymond, J. W.; Willett, P. Similarity Searching in Databases of Flexible 3D Structures Using Smoothed Bounded Distance Matrices. J. Chem. Inf. Comput. Sci. 2003, 43, 908-916. (62) Clark, D. E.; Willett, P.; Kenny, P. W. Pharmacophoric Pattern Matching in Files of Three-Dimensional Chemical Structures: Use of Smoothed Bounded-Distance Matrices for the Representation and Searching of Conformationally-Flexible Molecules. J. Mol. Graphics 1992, 10, 194-204. (63) Clark, D. E.; Jones, G.; Willett, P.; Kenny, P. W.; Glen, R. C. Pharmacophoric Pattern Matching in Files of Three-Dimensional Chemical Structures: Comparison of Conformational-Searching Algorithms for Flexible Searching. J. Chem. Inf. Comput. Sci. 1994, 34, 197-206. (64) Hurst, T. Flexible 3D Searching: The Directed Tweak Technique. J. Chem. Inf. Comput. Sci. 1994, 34, 190-196. (65) Martin, Y. C. 3D Database Searching in Drug Design. J. Med. Chem. 1992, 35, 2145-2154. (66) Crandell, C. W.; Smith, D. H. Computer-Assisted Examination of Compounds for Common Three-Dimensional Substructures. J. Chem. Inf. Comput. Sci. 1983, 23, 186-197. (67) Varkony, T. H.; Shiloach, Y.; Smith, D. H. Computer-Assisted Examination of Chemical Compounds for Structural Similarities. J. Chem. Inf. Comput. Sci. 1979, 19, 104-111. (68) Levi, G. A Note on the Derivation of Maximal Common Subgraphs of Two Directed or Undirected Graphs. Calcolo 1972, 9, 341-352. (69) Barrow, H. G.; Burstall, R. M. Subgraph Isomorphism, Matching Relational Structures and Maximal Cliques. Inf. Proc. Lett. 1976, 4, 83-84. (70) Bron, C.; Kerbosch, J. Algorithm 457. Finding All Cliques of an Undirected Graph. Commun. ACM 1973, 16, 575-577. (71) Martin, Y. C.; Bures, M. G.; Danaher, E. A.; Delazzer, J.; Lico, I.; Pavlik, P. A. 
DISCO: A Fast New Approach to Pharmacophore Mapping and Its Application to Dopaminergic and Benzodiazepine Agonists. J. Comput.-Aided Mol. Des. 1993, 7, 83-102. (72) Grindley, H. M.; Artymiuk, P. J.; Rice, D. W.; Willett, P. Identification of Tertiary Structure Resemblance in Proteins Using a Maximal Common Subgraph Isomorphism Algorithm. J. Mol. Biol. 1993, 707-721. (73) Carraghan, R.; Pardalos, P. M. Exact Algorithm for the Maximum Clique Problem. Oper. Res. Lett. 1990, 9, 375-382. (74) Jones, G.; Willett, P.; Glen, R. C. A Genetic Algorithm for Flexible Molecular Overlay and Pharmacophore Detection. J. Comput.-Aided Mol. Des. 1995, 9, 532-549. (75) Willett, P.; Barnard, J. M.; Downs, G. M. Chemical Similarity Searching. J. Chem. Inf. Comput. Sci. 1998, 38, 983-996. (76) Sheridan, R. P.; Kearsley, S. K. Why Do We Need So Many Chemical Similarity Search Methods? Drug Discovery Today 2002, 7, 903-911. (77) Carhart, R. E.; Smith, D. H.; Venkataraghavan, R. Atom Pairs as Molecular Features in Structure-Activity Studies: Definition and Applications. J. Chem. Inf. Comput. Sci. 1985, 25, 64-73. (78) Willett, P.; Winterman, V.; Bawden, D. Implementation of Nearest Neighbour Searching in an Online Chemical Structure Search System. J. Chem. Inf. Comput. Sci. 1986, 26, 36-41. (79) Adamson, G. W.; Bush, J. A. A Method for the Automatic Classification of Chemical Structures. Inf. Storage Retr. 1973, 9, 561-568. (80) Johnson, M. A., Maggiora, G. M., Eds. Concepts and Applications of Molecular Similarity; John Wiley: New York, 1990. (81) Patterson, D. E.; Cramer, R. D.; Ferguson, A. M.; Clark, R. D.; Weinberger, L. E. Neighbourhood Behaviour: A Useful Concept for Validation of “Molecular Diversity” Descriptors. J. Med. Chem. 1996, 39, 3049-3059. (82) Frye, S. V. Structure-Activity Relationship Homology (Sarah): A Conceptual Framework for Drug Discovery in the Genomic Era. Chem. Biol. 1999, 6, R3-R7. (83) Schuffenhauer, A.; Jacoby, E.
Annotating and Mining the Ligand-Target Chemogenomics Knowledge Space. Biosilico 2004, 2, 190-200. (84) Bohm, H.-J., Schneider, G., Eds. Virtual Screening for Bioactive Molecules; Wiley-VCH: Weinheim, Germany, 2000. (85) Dean, P. M., Ed. Molecular Similarity in Drug Design; Chapman and Hall: Glasgow, U.K., 1994. (86) Brown, R. D.; Martin, Y. C. Use of Structure-Activity Data To Compare Structure-Based Clustering Methods and Descriptors for Use in Compound Selection. J. Chem. Inf. Comput. Sci. 1996, 36, 572-584. (87) Brown, R. D.; Martin, Y. C. The Information Content of 2D and 3D Structural Descriptors Relevant to Ligand-Receptor Binding. J. Chem. Inf. Comput. Sci. 1997, 37, 1-9. (88) Martin, Y. C.; Kofron, J. L.; Traphagen, L. M. Do Structurally Similar Molecules Have Similar Biological Activities? J. Med. Chem. 2002, 45, 4350-4358.


(89) Chen, X.; Reynolds, C. H. Performance of Similarity Measures in 2D Fragment-Based Similarity Searching: Comparison of Structural Descriptors and Similarity Coefficients. J. Chem. Inf. Comput. Sci. 2002, 42, 1407-1414. (90) Sheridan, R. P.; Feuston, B. P.; Maiorov, V. N.; Kearsley, S. K. Similarity to Molecules in the Training Set Is a Good Discriminator for Prediction Accuracy in QSAR. J. Chem. Inf. Comput. Sci. 2004, 44, 1912-1928. (91) Sneath, P. H. A. Relations between Chemical Structure and Biological Activity in Peptides. J. Theor. Biol. 1966, 12, 157-195. (92) Harrison, P. J. A Method of Cluster Analysis and Some Applications. Appl. Stat. 1968, 17, 226-236. (93) Ward, J. H. Hierarchical Grouping To Optimize an Objective Function. J. Am. Stat. Assoc. 1963, 58, 236-244. (94) Jarvis, R. A.; Patrick, E. A. Clustering Using a Similarity Measure Based on Shared Nearest Neighbours. IEEE Trans. Comput. 1973, C22, 1025-1034. (95) Shemetulskis, N. E.; Dunbar, J. B.; Dunbar, B. W.; Moreland, D. W.; Humblet, C. Enhancing the Diversity of a Corporate Database Using Chemical Database Clustering and Analysis. J. Comput.-Aided Mol. Des. 1995, 9, 407-416. (96) Downs, G. M.; Barnard, J. M. Clustering Methods and Their Uses in Computational Chemistry. Rev. Comput. Chem. 2002, 18, 1-40. (97) Willett, P.; Winterman, V.; Bawden, D. Implementation of NonHierarchic Cluster Analysis Methods in Chemical Information Systems: Selection of Compounds for Biological Testing and Clustering of Substructure Search Output. J. Chem. Inf. Comput. Sci. 1986, 26, 109-118. (98) Doman, T. N.; Cibulskis, J. M.; Cibulskis, M. J.; McCray, P. D.; Spangler, D. P. Algorithm5: A Technique for Fuzzy Similarity Clustering of Chemical Inventories. J. Chem. Inf. Comput. Sci. 1996, 36, 1195-1204. (99) Menard, P. R.; Lewis, R. A.; Mason, J. S. Rational Screening Set Design and Compound Selection: Cascaded Clustering. J. Chem. Inf. Comput. Sci. 1998, 38, 497-505. (100) Downs, G. M.; Willett, P.; Fisanick, W.
Similarity Searching and Clustering of Chemical-Structure Databases Using Molecular Property Data. J. Chem. Inf. Comput. Sci. 1994, 34, 1094-1102. (101) Murtagh, F. Multidimensional Clustering Algorithms.; Physica Verlag: Vienna, 1985. (102) Jardine, N.; van Rijsbergen, C. J. The Use of Hierarchic Clustering in Information Retrieval. Inf. Storage Retr. 1971, 7, 217-240. (103) Willett, P. Recent Trends in Hierarchic Document Clustering: A Critical Review. Inf. Proc. Manage. 1988, 24, 577-597. (104) Valler, M. J.; Green, D. Diversity Screening Versus Focussed Screening in Drug Discovery. Drug Discovery Today 2000, 5, 286-293. (105) Dean, P. M., Lewis, R. A., Eds. Molecular Diversity in Drug Design; Kluwer: Amsterdam, 1999. (106) Ghose, A. K., Viswanadhan, V. N., Eds. Combinatorial Library Design and Evaluation: Principles, Software Tools and Applications in Drug Discovery; Marcel Dekker: New York, 2001. (107) Bawden, D. Molecular Dissimilarity in Chemical Information Systems. Chemical Structures 2; Springer-Verlag: Heidelberg, Germany, 1993; pp 383-388. (108) Lajiness, M. S. Molecular Similarity-Based Methods for Selecting Compounds for Screening. Computational Chemical Graph Theory; Nova Science Publishers: New York, 1990; pp 299-316. (109) Kennard, R. W.; Stone, L. A. Computer Aided Design of Experiments. Technometrics 1969, 11, 137-148. (110) Snarey, M.; Terret, N. K.; Willett, P.; Wilton, D. J. Comparison of Algorithms for Dissimilarity-Based Compound Selection. J. Mol. Graphics Modell. 1998, 15, 372-385. (111) Voorhees, E. M. Implementing Agglomerative Hierarchic Clustering Algorithms for Use in Document Retrieval. Inf. Proc. Manage. 1986, 22, 465-476. (112) Holliday, J. D.; Ranade, S. S.; Willett, P. A Fast Algorithm for Selecting Sets of Dissimilar Structures from Large Chemical Databases. Quant. Struct.-Act. Relat. 1995, 14, 501-506. (113) Agrafiotis, D.; Lobanov, V. S. An Efficient Implementation of Distance-Based Diversity Measures Based on K-D Trees. J. 
Chem. Inf. Comput. Sci. 1999, 39, 51-58. (114) Mount, J.; Ruppert, J.; Welch, W.; Jain, A. N. Icepick: A Flexible Surface-Based System for Molecular Diversity. J. Med. Chem. 1999, 42, 60-66. (115) Waldman, M.; Li, H.; Hassan, M. Novel Algorithms for the Optimization of Molecular Diversity of Combinatorial Libraries. J. Mol. Graphics Modell. 2000, 18, 412-426. (116) Pickett, S. D.; Luttman, C.; Guerin, V.; Laoui, A.; James, E. DIVSEL and COMPLIBsStrategies for the Design and Comparison of Combinatorial Libraries Using Pharmacophore Descriptors. J. Chem. Inf. Comput. Sci. 1998, 38, 144-150. (117) Sheridan, R. P. The Centroid Approximation for Mixtures: Calculating Similarity and Deriving Structure-Activity Relationships. J. Chem. Inf. Comput. Sci. 2000, 40, 1456-1469.

(118) Trepalin, S. V.; Gerasimenko, V. A.; Kozyukov, A. V.; Savchuk, N. P.; Ivaschenko, A. A. New Diversity Calculations Used for Compound Selection. J. Chem. Inf. Comput. Sci. 2002, 42, 249-258.
(119) Gillet, V. J.; Willett, P.; Bradshaw, J. The Effectiveness of Reactant Pools for Generating Structurally-Diverse Combinatorial Libraries. J. Chem. Inf. Comput. Sci. 1997, 37, 731-740.
(120) Jamois, E. A.; Hassan, M.; Waldman, M. Evaluation of Reagent-Based and Product-Based Strategies in the Design of Combinatorial Libraries. J. Chem. Inf. Comput. Sci. 2000, 40, 63-70.
(121) Gillet, V. J.; Willett, P.; Bradshaw, J.; Green, D. V. S. Selecting Combinatorial Libraries To Optimise Diversity and Physical Properties. J. Chem. Inf. Comput. Sci. 1999, 39, 169-177.
(122) Gillet, V. J.; Khatib, W.; Willett, P.; Fleming, P. J.; Green, D. V. S. Combinatorial Library Design Using a Multiobjective Genetic Algorithm. J. Chem. Inf. Comput. Sci. 2002, 42, 375-385.
(123) Berman, H. M.; Battistuz, T.; Bhat, T. N.; Blum, W. F.; Bourne, P. E.; et al. The Protein Data Bank. Acta Crystallogr. 2002, D58, 899-907.
(124) Spriggs, R. V.; Artymiuk, P. J.; Willett, P. Searching for Patterns of Amino Acids in 3D Protein Structures. J. Chem. Inf. Comput. Sci. 2003, 43, 412-421.
(125) Artymiuk, P. J.; Rice, D. W.; Mitchell, E. M.; Willett, P. Structural Resemblance between the Families of Bacterial Signal-Transduction Proteins and of G Proteins Revealed by Graph Theoretical Techniques. Protein Eng. 1990, 4, 39-43.
(126) Artymiuk, P. J.; Grindley, H. M.; Park, J. E.; Rice, D. W.; Willett, P. Three-Dimensional Structural Resemblance between Leucine Aminopeptidase and Carboxypeptidase as Revealed by Graph-Theoretical Techniques. FEBS Lett. 1992, 303, 48-52.
(127) Artymiuk, P. J.; Grindley, H. M.; Poirrette, A. R.; Rice, D. W.; Ujah, E. C.; Willett, P. Identification of β-Sheet Motifs, of φ-Loops and of Patterns of Amino-Acid Residues in Three-Dimensional Protein Structures Using a Subgraph-Isomorphism Algorithm. J. Chem. Inf. Comput. Sci. 1994, 34, 54-62.
(128) Artymiuk, P. J.; Poirrette, A. R.; Rice, D. W.; Willett, P. A Polymerase 1 Palm in Adenylyl Cyclase? Nature 1997, 388, 33-34.
(129) Kleywegt, G. J. Recognition of Spatial Motifs in Protein Structures. J. Mol. Biol. 1999, 285, 1887-1897.
(130) Koch, I.; Kaden, F.; Selbig, J. Analysis of Protein Sheet Topologies by Graph Theoretical Methods. Proteins: Struct., Funct., Genet. 1992, 12, 314-323.
(131) Schmitt, S.; Kuhn, D.; Klebe, G. A New Method To Detect Related Function among Proteins Independent of Sequence and Fold Homology. J. Mol. Biol. 2002, 32, 387-406.
(132) Bruno, I. J.; Kemp, N. M.; Artymiuk, P. J.; Willett, P. Representation and Searching of Carbohydrate Structures Using Graph-Theoretic Techniques. Carbohydr. Res. 1997, 304, 61-67.
(133) Harrison, A.-M.; South, D. R.; Willett, P.; Artymiuk, P. J. Representation, Searching and Discovery of Patterns of Bases in Complex RNA Structures. J. Comput.-Aided Mol. Des. 2003, 17, 537-549.
(134) Artymiuk, P. J.; Spriggs, R. V.; Willett, P. Graph Theoretic Methods for the Analysis of Structural Relationships in Biological Macromolecules. J. Am. Soc. Inf. Sci. Technol. 2005, 56, 518-528.
(135) Poirrette, A. R.; Artymiuk, P. J.; Rice, D. W.; Willett, P. Comparison of Protein Surfaces Using a Genetic Algorithm. J. Comput.-Aided Mol. Des. 1997, 11, 557-569.
(136) Gardiner, E. J.; Willett, P.; Artymiuk, P. J. Native Protein Docking Using a Genetic Algorithm. Proteins: Struct., Funct., Genet. 2001, 44, 44-56.
(137) Gardiner, E. J.; Willett, P.; Artymiuk, P. J. GAPDOCK: A Genetic Algorithm Approach to Protein Docking in Capri Round 1. Proteins: Struct., Funct., Genet. 2003, 52, 10-14.
(138) Packer, M. J.; Hunter, C. A. Sequence-Structure Relationships in DNA Oligomers: A Computational Approach. J. Am. Chem. Soc. 2001, 123, 7399-7406.
(139) Gardiner, E. J.; Hunter, C. A.; Packer, M. J.; Willett, P. Sequence-Dependent DNA Structure: A Database of Octamer Structural Parameters. J. Mol. Biol. 2003, 332, 1025-1035.
(140) Gardiner, E. J.; Hunter, C. A.; Lu, X.-J.; Willett, P. A Structural Similarity Analysis of Double-Helical DNA. J. Mol. Biol. 2004, 343, 879-889.
(141) Pepperrell, C. A.; Willett, P. Techniques for the Calculation of Three-Dimensional Structural Similarity Using Inter-Atomic Distances. J. Comput.-Aided Mol. Des. 1991, 5, 455-474.
(142) Bath, P. A.; Poirrette, A. R.; Willett, P.; Allen, F. H. Similarity Searching in Files of Three-Dimensional Chemical Structures: Comparison of Fragment-Based Measures of Shape Similarity. J. Chem. Inf. Comput. Sci. 1994, 34, 141-147.
(143) Ginn, C. M. R.; Turner, D. B.; Willett, P.; Ferguson, A. M.; Heritage, T. W. Similarity Searching in Files of Three-Dimensional Chemical Structures: Evaluation of the EVA Descriptor and Combination of Rankings Using Data Fusion. J. Chem. Inf. Comput. Sci. 1997, 37.

(144) Gillet, V. J.; Willett, P.; Bradshaw, J. Similarity Searching Using Reduced Graphs. J. Chem. Inf. Comput. Sci. 2003, 43, 338-345.
(145) Raymond, J. W.; Willett, P. Maximum Common Subgraph Isomorphism Algorithms for the Matching of Chemical Structures. J. Comput.-Aided Mol. Des. 2002, 16, 521-533.
(146) Raymond, J. W.; Gardiner, E. J.; Willett, P. RASCAL: Calculation of Graph Similarity Using Maximum Common Edge Subgraphs. Comput. J. 2002, 45, 631-644.
(147) Raymond, J. W.; Gardiner, E. J.; Willett, P. Heuristics for Similarity Searching of Chemical Graphs Using a Maximum Common Edge Subgraph Algorithm. J. Chem. Inf. Comput. Sci. 2002, 42, 305-316.
(148) Willett, P. Some Heuristics for Nearest Neighbour Searching in Chemical Structure Files. J. Chem. Inf. Comput. Sci. 1983, 23, 22-25.
(149) Brint, A. T.; Willett, P. Upperbound Procedures for the Identification of Similar Three-Dimensional Chemical Structures. J. Comput.-Aided Mol. Des. 1988, 2, 311-320.
(150) Pepperrell, C. A.; Taylor, R.; Willett, P. Implementation and Use of an Atom-Mapping Procedure for Similarity Searching in Databases of 3-D Chemical Structures. Tetrahedron Comput. Methodol. 1990, 3, 575-593.
(151) Johnson, M. Relating Metrics, Lines and Variables Defined on Graphs to Problems in Medicinal Chemistry. Graph Theory and Its Applications to Algorithms and Computer Science; Wiley: New York, 1985; pp 457-470.
(152) Carpaneto, G.; Martello, S.; Toth, P. Algorithms and Codes for the Assignment Problem. Ann. Oper. Res. 1988, 13, 193-223.
(153) Raymond, J. W.; Willett, P. Effectiveness of Graph-Based and Fingerprint-Based Similarity Measures for Virtual Screening of 2D Chemical Structure Databases. J. Comput.-Aided Mol. Des. 2002, 16, 59-71.
(154) Briem, H.; Kuntz, I. D. Molecular Similarity Based on Dock-Generated Fingerprints. J. Med. Chem. 1996, 39, 3401-3408.
(155) Matter, H. Selecting Optimally Diverse Compounds from Structure Databases: A Validation Study of Two-Dimensional and Three-Dimensional Molecular Descriptors. J. Med. Chem. 1997, 40, 1219-1229.
(156) Kubinyi, H., Folkers, G., Martin, Y. C., Eds. 3D QSAR in Drug Design; Kluwer/ESCOM: Leiden, The Netherlands, 1998.
(157) Carbo, R.; Leyda, L.; Arnau, M. An Electron Density Measure of the Similarity between Two Compounds. Int. J. Quantum Chem. 1980, 17, 1185-1189.
(158) Manaut, M.; Sanz, F.; Jose, J.; Milesi, M. Automatic Search for Maximum Similarity between Molecular Electrostatic Potential Distributions. J. Comput.-Aided Mol. Des. 1991, 5, 371-380.
(159) Burt, C.; Richards, W. G.; Huxley, P. The Application of Molecular Similarity Calculations. J. Comput. Chem. 1990, 11, 1139-1146.
(160) Richard, A. M. Quantification of Molecular Electrostatic Potentials for Structure-Activity Studies. J. Comput. Chem. 1991, 12, 959-969.
(161) Good, A. C.; Peterson, S. J.; Richards, W. G. QSAR’s from Similarity Matrices. Technique Validation and Application in the Comparison of Different Similarity Evaluation Methods. J. Med. Chem. 1993, 36, 2929-2937.
(162) Mestres, J.; Rohrer, D. C.; Maggiora, G. M. A Molecular-Field-Based Similarity Study of Non-Nucleoside HIV-1 Reverse Transcriptase Inhibitors. 2. The Relationship between Alignment Solutions Obtained from Conformationally Rigid and Flexible Matching. J. Comput.-Aided Mol. Des. 2000, 14, 39-51.
(163) Good, A. C.; Hodgkin, E. E.; Richards, W. G. The Utilization of Gaussian Functions for the Rapid Evaluation of Molecular Similarity. J. Chem. Inf. Comput. Sci. 1992, 32, 188-191.
(164) Thorner, D. A.; Willett, P.; Wright, P. M.; Taylor, R. Similarity Searching in Files of Three-Dimensional Chemical Structures: Representation and Searching of Molecular Electrostatic Potentials Using Field-Graphs. J. Comput.-Aided Mol. Des. 1997, 11, 163-174.
(165) Wild, D. J.; Willett, P. Similarity Searching in Files of Three-Dimensional Chemical Structures: Alignment of Molecular Electrostatic Potentials with a Genetic Algorithm. J. Chem. Inf. Comput. Sci. 1996, 36, 159-167.
(166) Schuffenhauer, A.; Gillet, V. J.; Willett, P. Similarity Searching in Files of Three-Dimensional Chemical Structures: Analysis of the BIOSTER Database Using 2D Fingerprints and Molecular Field Descriptors. J. Chem. Inf. Comput. Sci. 2000, 40, 295-307.
(167) Drayton, S. K.; Edwards, K.; Jewell, N. E.; Turner, D. B.; Wild, D. J.; Willett, P.; Wright, P. M.; Simmons, K. Similarity Searching in Files of Three-Dimensional Chemical Structures: Identification of Bioactive Molecules. Internet J. Chem. 1998, http://www.ijc.com/articles/1998v1/37/.
(168) Bohl, M.; Dunbar, J. B.; Gifford, E. M.; Heritage, T.; Wild, D. J.; Willett, P.; Wilton, D. J. Scaffold Searching: Automated Identification of Similar Ring Systems for the Design of Combinatorial Libraries. Quant. Struct.-Act. Relat. 2002, 21, 590-597.

(169) Schneider, G.; Neidhart, W.; Giller, T.; Schmid, G. Scaffold-Hopping by Topological Pharmacophore Search. Angew. Chem., Int. Ed. 1999, 38, 2894-2896.
(170) Lewell, X. Q.; Jones, A. C.; Bruce, C. L.; Harper, G.; Jones, M. M.; McLay, I. M.; Bradshaw, J. Drug Rings Database with Web Interface. A Tool for Identifying Alternative Rings in Lead Discovery Programmes. J. Med. Chem. 2003, 46, 3257-3274.
(171) Good, A. C.; Richards, W. G. Rapid Evaluation of Shape Similarity Using Gaussian Functions. J. Chem. Inf. Comput. Sci. 1993, 33, 112-116.
(172) Klebe, G., Ed. Virtual Screening: An Alternative or Complement to High Throughput Screening; Kluwer: Dordrecht, The Netherlands, 2000.
(173) Holliday, J. D.; Salim, N.; Whittle, M.; Willett, P. Analysis and Display of the Size Dependence of Chemical Similarity Coefficients. J. Chem. Inf. Comput. Sci. 2003, 43, 819-828.
(174) Ginn, C. M. R.; Willett, P.; Bradshaw, J. Combination of Molecular Similarity Measures Using Data Fusion. Perspect. Drug Discovery Des. 2000, 20, 1-16.
(175) Salim, N.; Holliday, J. D.; Willett, P. Combination of Fingerprint-Based Similarity Coefficients Using Data Fusion. J. Chem. Inf. Comput. Sci. 2003, 43, 435-442.
(176) Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Comparison of Fingerprint-Based Methods for Virtual Screening Using Multiple Bioactive Reference Structures. J. Chem. Inf. Comput. Sci. 2004, 44, 1177-1185.
(177) Hert, J.; Willett, P.; Wilton, D. J.; Acklin, P.; Azzaoui, K.; Jacoby, E.; Schuffenhauer, A. Comparison of Topological Descriptors for Similarity-Based Virtual Screening Using Multiple Bioactive Reference Structures. Org. Biomol. Chem. 2004, 2, 3256-3266.
(178) Whittle, M.; Gillet, V. J.; Willett, P.; Alex, A.; Losel, J. Enhancing the Effectiveness of Virtual Screening by Fusing Nearest-Neighbour Lists: A Comparison of Similarity Coefficients. J. Chem. Inf. Comput. Sci. 2004, 44, 1840-1848.
(179) Hall, D. L. Mathematical Techniques in Multisensor Data Fusion; Artech House: Norwood, MA, 1992.
(180) Wang, R.; Wang, S. How Does Consensus Scoring Work for Virtual Library Screening? An Idealized Computer Experiment. J. Chem. Inf. Comput. Sci. 2001, 41, 1422-1426.
(181) Charifson, P. S.; Corkery, J. J.; Murcko, M. A.; Walters, W. P. Consensus Scoring: A Method for Obtaining Improved Hit Rates from Docking Databases of Three-Dimensional Structures into Proteins. J. Med. Chem. 1999, 42, 5100-5109.
(182) Clark, R. D.; Strizhev, A.; Leonard, J. M.; Blake, J. F.; Matthew, J. B. Consensus Scoring for Ligand/Protein Interactions. J. Mol. Graphics Modell. 2002, 20, 281-295.
(183) Belkin, N. J.; Kantor, P.; Fox, E. A.; Shaw, J. B. Combining the Evidence of Multiple Query Representations for Information Retrieval. Inf. Proc. Manage. 1995, 31, 431-448.
(184) Kearsley, S. K.; Sallamack, S.; Fluder, E. M.; Andose, J. D.; Mosley, R. T.; Sheridan, R. P. Chemical Similarity Using Physicochemical Property Descriptors. J. Chem. Inf. Comput. Sci. 1996, 36, 118-127.
(185) Willett, P.; Winterman, V. A Comparison of Some Measures of Inter-Molecular Structural Similarity. Quant. Struct.-Act. Relat. 1986, 5, 18-25.
(186) Holliday, J. D.; Hu, C.-Y.; Willett, P. Grouping of Coefficients for the Calculation of Inter-Molecular Similarity and Dissimilarity Using 2D Fragment Bit-Strings. Comb. Chem. High Throughput Screening 2002, 5, 155-166.
(187) Cramer, R. D.; Redl, G.; Berkoff, C. E. Substructural Analysis. A Novel Approach to the Problem of Drug Design. J. Med. Chem. 1974, 17, 533-538.
(188) Guner, O., Ed. Pharmacophore Perception, Development and Use in Drug Design; International University Line: La Jolla, CA, 2000.
(189) Dixon, S. L.; Koehler, R. T. The Hidden Component of Size in Two-Dimensional Fragment Descriptors: Side Effects on Sampling in Bioactive Libraries. J. Med. Chem. 1999, 42, 2887-2900.
(190) Godden, J. W.; Xue, L.; Bajorath, J. Combinatorial Preferences Affect Molecular Similarity/Diversity Calculations Using Binary Fingerprints and Tanimoto Coefficients. J. Chem. Inf. Comput. Sci. 2000, 40, 163-166.
(191) Fligner, M. A.; Verducci, J. S.; Blower, P. E. A Modification of the Jaccard-Tanimoto Similarity Index for Diverse Selection of Chemical Compounds Using Binary Strings. Technometrics 2002, 44, 110-119.
(192) Xue, L.; Stahura, F. L.; Godden, J. W.; Bajorath, J. Fingerprint Scaling Increases the Probability of Identifying Molecules with Similar Activity in Virtual Screening Calculations. J. Chem. Inf. Comput. Sci. 2001, 41, 746-753.
(193) Schuffenhauer, A.; Floersheim, P.; Acklin, P.; Jacoby, E. Similarity Metrics for Ligands Reflecting the Similarity of the Target Proteins. J. Chem. Inf. Comput. Sci. 2003, 43, 391-405.
(194) Wilton, D.; Willett, P.; Mullier, G.; Lawson, K. Comparison of Ranking Methods for Virtual Screening in Lead-Discovery Programmes. J. Chem. Inf. Comput. Sci. 2003, 43, 469-474.

JM0582165