4358
J. Med. Chem. 2005, 48, 4358-4366
A Robust Clustering Method for Chemical Structures Martin Stahl,*,† Harald Mauser,† Mark Tsui,‡ and Neil R. Taylor*,‡ Pharmaceutical Research, F. Hoffmann-La Roche Ltd., 4070 Basel, Switzerland, and Desert Scientific Software Pty Ltd., 6 Stirling Court, Castle Hill, Sydney, NSW, 2154, Australia Received December 13, 2004
A clustering method based on finding the largest set of disconnected fragments that two chemical compounds have in common is shown to be able to group structures in a way that is ideally suited to medicinal chemistry programs. We describe how markedly improved results can be obtained by using a similarity metric that accounts not just for the size of the shared fragments but also on their relative arrangement in the two parent compounds. The use of a physiochemical atom typing scheme is also shown to provide significant contributions. Results from calculations using a test set consisting of actives from nine different important biological target proteins demonstrate the strengths of our clustering method and the advantages over other approaches that are widely used throughout the pharmaceutical industry. Introduction
Chart 1
Medicinal chemists spend significant amounts of time dealing with large collections of chemical structures, such as hit lists from HTS campaigns, vendor catalogs, or project databases. Creating an overview of such collections typically involves the task of grouping individual compounds into structurally related families or clusters. This process is time-consuming and so automated clustering methods have become popular. Automated clustering requires two types of algorithms, a similarity metric for determining the pairwise comparison of molecular structures, and a clustering algorithm for grouping structures according to the calculated pairwise similarity values. The purpose of this work is to identify a method for detecting similarities between structures that have a similar spatial arrangement of similar functional groups or fragments without necessarily sharing a large common substructure. Finding such relationships is of special importance in processing HTS hit lists, with the ultimate goal to understand relationships between hits across chemical classes and to extract SAR information. Consider the two pairs of molecules illustrated in Chart 1. What we are seeking is a method able to detect the close relationship between the two dopamine D4 antagonists 1a and 1b as well as between the two histamine H3 receptor ligands 2a and 2b. At the same time, the method should classify the two pairs of molecules as being different from one another. Fingerprint-based methods, such as Daylight fingerprints1 and MACCS or MDL keys2 have been used for clustering with considerable success.3,4 These methods are particularly useful for very large data sets because of their speed. However, Daylight fingerprints are highly susceptible to the exchange of elements around the center of a molecule, and therefore the similarity values (Tanimoto index) for 1a vs 1b are just as low as for 1a vs 2a. MACCS keys, on the other hand, are sensitive to * To whom correspondence should be addressed. martin.stahl @roche.com;
[email protected]. † F. Hoffmann-La Roche Ltd. ‡ Desert Scientific Software Pty Ltd.
E-mail:
the presence or absence of specific functional groups, but insensitive to their relative position and orientation. With this method, the pairing of 1b and 2a yields a higher similarity than any of the other five pairings because these two molecules share more of the keyed functional groups. Another popular class of similarity metric is based on Maximum Common Substructure (MCS) calculations. A number of widely used molecular modeling and cheminformatics tools contain modules that can cluster structures based on MCS calculations.5,6 These and other substructure-based methods7,8 can be very effective tools in cases were classes of compounds can be defined by “scaffolds” or ring systems,9,10 but they fail when it comes to detecting the similarity between pairs of compounds in which the same building blocks are connected by different linkers. The problem presented in Chart 1 cannot be addressed by MCS calculations. Chart 2 illustrates this more definitively for another pair of histamine H3 ligands. For this pair, the MCS consists of only the imidazopyridine ring system, highlighted in green. Here we report on the development of a clustering method based on Maximum Overlapping Set (MOS) calculations. Whereas the MCS of two molecules is a single contiguous substructure, the MOS can consist of several substructures. For example, the MOS of 3a and 3b (Chart 2) consists of the substructures highlighted
10.1021/jm040213p CCC: $30.25 © 2005 American Chemical Society Published on Web 06/02/2005
Robust Clustering Method
Journal of Medicinal Chemistry, 2005, Vol. 48, No. 13 4359
Chart 2
Figure 2 lists the complete sequence of steps we use when clustering compound sets. Implementation Details. The workflow in Figure 1 has been coded in a suite of tools called Spinifex.14 Spinifex was mainly implemented using the Python programming language. The maximum clique algorithms were first built and tested in Python, then implemented in C or C++ to obtain maximum performance. Cheminformatics problems were solved with the help of Openeye’s OEchem toolkit,15 and alternatively with its predecessor, OELib, and its python wrapper, created by Desert Scientific Software. A number of maximum clique algorithms have been published in the literature. We have implemented and tested: Rambin16 which is an implementation of the Bron and Kerbosch algorithm 457;17 Dfmax and Nmclique,18 which are improved versions of Dfmax; Pardalos;19 Wood;20 and Rascal.12 The fastest algorithm was found to be Rascal followed by Wood. Precise details of these algorithms are beyond the scope of this paper, and we refer readers to the respective publications. Heuristics have been implemented to prune search space by removing redundant matching units, as described by Raymond et al., and to help solve the problem of triangle/trinode mismatches. A redundant matching unit for two compounds is a pair of edge units (a bond and two atoms) that match only in symmetry-related solutions of the maximum clique problem or in suboptimal solutions. The run times for the algorithms can be dramatically affected by the large number of redundant matching units associated with ring structures, and so Spinifex carefully removes these redundant matching units for all the rings in structures being compared. Ring detection involving molecules with nonsimple rings, e.g. fused rings, required the implementation of an SSSR algorithm. The algorithm developed by Figueras21 was selected. A weaker form of heuristic, one which reduces the number of connectivities in the association graph rather than the number of matching units, has also been implemented (see ref 13). The C-C induced subgraph heuristic described by Raymond et al. is used, as is the weak ring heuristic they described. Turning off all heuristics increases the search space and can dramatically increase search time. A run level cutoff was implemented into each of our maximum clique algorithms in order to impose an upper limit for the number of algorithm cycles for calculations involving particularly tough pairs of molecules. In a typical clustering job, most pairwise similarity calculations finish within a few milliseconds; however, some solutions can take a minute to find, and in the very worst cases, can require hours of CPU time. If the maximum number of allowed algorithm cycles is reached in a calculation, the algorithm terminates and returns the best solution found up to that point. Our testing demonstrated that the calculated similarity obtained when the cutoff comes into effect is either identical with or slightly lower than the optimal solution, the difference corresponding to one or two matching edge units. The run level cutoff can be modified or turned off, as required. We explored (1) the effects of using heuristics to prune search space, (2) the effects of using a run level cutoff to prevent long running times of the maximum clique
in blue and green. Calculation of the MOS alone, however, does not completely solve the clustering problem. Another major challenge is to derive a similarity metric from the MOS subgraphs that reflects as much of the structural information as possible. Earlier applications of MOS calculations to the clustering problem may have been less successful because this aspect was not considered thoroughly enough.11 We report calculations on a set of several hundred test compounds and show how empirical parameters can be used to turn MOS calculations into a chemically intuitive and robust clustering method. The focus of this paper is on the practical aspects of using the MOS method and consequent benefits to medicinal chemistry research. Methods Overview. This work describes the development of a clustering method that is founded on using a graphbased approach to measure chemical similarity. This approach involves finding the solution to the Maximum Common Subgraph Isomorphism (MCSI) problem for pairs of compounds. In a recent paper, Raymond et al. described how the reduction of the MCSI problem to the maximum clique problem can lead to efficient algorithms, and so their methods were adopted.12 These methods involve working with “edge” graphs in which chemical bonds are the vertices and atoms are the edges and the challenge is to solve the maximum isomorphism (that is, the largest one-to-one correspondence) between two edge graphs. This is done by merging two edge graphs into one graph, called the association or correspondence graph, and finding the largest complete subgraph, that is, a set of vertices with edges between all pairs. The largest complete subgraph is also known as the maximum clique and in this work we refer to our algorithms as maximum clique algorithms. A solution obtained using this approach is a set of disconnected fragments (or substructures) called a Maximum Overlapping Set (MOS). Figure 1 illustrates the process of converting two chemical graphs into edge graphs, creating an association graph, and characterizing the maximum clique. There is an important caveatsthe method requires correct handling of triangle/trinode mismatches, in which the cyclopropane substructure matches the isobutane substructure in the edge graphs derived from chemical graphs. Raymond et al. also described how to speed up calculations by including heuristics to prune search space.13 We adopted a similar scheme.
4360
Journal of Medicinal Chemistry, 2005, Vol. 48, No. 13
Stahl et al.
Figure 1. Graphical representation of the main aspects of the maximum clique detection method. Starting with two distinct compounds, such as those shown above, each atom-bond-atom unit is labeled (‘1’ to ‘7’ in the first molecule and ‘a’ to ‘e’ in the second) and each matching pair of units (e.g., ‘1’ and ‘a’) becomes a vertex in the association graph. Edges are then created for the association graph if two matching units share a common atom, for example ‘1a’ and ‘2b’, or if no atoms are shared, for example, ‘1a’ and ‘3c’. Last, the largest set of vertices in the association graph with edges between all pairs is defined as the maximum clique. Note that in this example the maximum clique has two solutions (both consisting of the same set of atom and bonds) with one highlighted in bold and the other highlighted with broken lines.
Figure 2. The sequence of steps used to determine chemical similarity and cluster compound sets using Spinifex.
algorithms, and (3) potential differences between the Wood and Rascal algorithms. As expected, the impact of turning off heuristics was found to be negligible on the quality of results. Also not surprisingly, it had a much greater impact on the timings for the Wood algorithm than for the Rascal algorithm. The effect of increasing the run level on both the timings of the calculations and the quality of results was found to be more detrimental for the Wood algorithm than for the Rascal algorithm. This exemplifies the superiority of the Rascal algorithm. What seems counterintuitive at first glance is that although clustering using the Rascal algorithm is significantly faster than clustering using the Wood algorithm, for the majority of pairwise similarity calculations the Wood algorithm is actually the faster of the two. This is because a single iteration of the Wood algorithm executes faster than a single iteration of the Rascal algorithm. Thus, even though the Wood algorithm has to cycle through more iterations to achieve completion, it can still finish faster than the Rascal algorithm. The nature of the complexity of the computational approach means that the rate-determining steps in a clustering experiment arise from just a small percentage of all pairwise comparisons, typically
those involving pairs of larger compounds containing polycyclic substructures. Although MOS calculations are computationally more demanding than bit string-based approaches, the low cost of CPU cycles provided by clusters of desktop PC machines running Linux makes clustering a more amenable job. Linux farms are fast becoming a hardware standard within medicinal research organizations, and Spinifex has been extended to run on these machines. A set of chemical similarity calculations belongs to the class of problems termed “embarrassingly” parallel, and increases in performance are almost linear with the number of available CPUs. Computations are done using the master/slave paradigm in which a large job is divided into multiple, smaller jobs. We use PVM.22 Generating a similarity matrix for a set of 3000 drugsize compounds takes approximately 4.5 h on 40 CPUs (2.4 GHz Pentium Xeon chips running Redhat 7.3). Atom and Bond Typing Scheme. As noted by other authors,23 graph matching is highly dependent upon graph labeling. The potential for Spinifex to reliably group compounds in the way a medicinal chemist would manually group compounds was significantly enhanced by fine-tuning the atomic labeling. Initially, all atoms were labeled by element type, that is, by atomic number. However, our early calculations strongly suggested that results could be improved by differentiating nitrogen atoms into two groups: those that can be hydrogen bond donors and those that cannot. The same is done for oxygen atoms. A scheme was devised that involved determining Sybyl atom types24 for each atom and assigning atom labels according to the number of protons bound to nitrogen and oxygen atoms. Furthermore, the halide elements F, Cl, Br, and I, were each labeled so that they become equivalent. For all other atom labels, the atomic number was used. Edge weights, that is, the bonds, were labeled according to bond order. Bond weights were set to 1, 2, 3, and 4, corresponding to single, double, triple, and aromatic, respectively. The matching of two chemical graphs requires both vertex (atom) matching and edge (bond) matching. To facilitate the matching of edge units, atom-based hashing functions were used, similar to those described by Ihlenfeldt
Robust Clustering Method
Journal of Medicinal Chemistry, 2005, Vol. 48, No. 13 4361
et al..25 Alternative atom and bond weighting schemes can also be applied. For instance, two chemical graphs can be compared with reduced features with all atoms assigned identical labels, all bonds assigned identical labels, or both. Additionally, any set of user-defined atom types can be experimented with, for example, including features such as cylicity and/or charge. Similarity Metric and Correction Terms. Once the MOS of two chemical structures is determined, it is used to derive a similarity value reflecting the degree of identity of the two structures. The Raymond et al. metric is calculated from the numbers of vertices (NumVert) and edges (NumEdge) in the MOS graph and the graphs of the two structures (G, H) that are compared:
Chart 3. Effects of the Empirical Correction Terms on Similarity Values for Three Structure Pairsa
S)
(NumVert (MOS) + NumEdge (MOS))2 (NumVert(G) + NumEdge(G)) × (NumVert(H) + NumEdge(H))
Initial experiments with this metric indicated that much of the graph information obtained through the MOS calculation is lost. We also did not want to penalize the occurrence of MOS fragments in general, as has been done before.26 To extract a maximum of topological information from the MOS, we start with the Raymond et al. metric as a basis and then define two situations in which empirical penalties can apply and modify this score. 1. Arrangement correction terms: These act to reduce the similarity index if the arrangement of the three largest fragments in two molecules is not the same. Consider a pair of molecules with a MOS consisting of three fragments A, B, C. In one molecule, the fragments are arranged such that B is between A and C, i.e., A-BC, and in the other molecule, C is located in the middle, i.e, A-C-B. From a chemistry perspective, the two molecules are more dissimilar than the MOS implies, and so a fairly high penalty term, typically 0.3, is subtracted from the similarity metric for these two molecules. An example is given in Chart 3 by compounds 4a and 4b. Alternatively, one molecule may contain a linear arrangement of fragments, A-B-C, and the other may contain a cyclic arrangement, A-B-C, with an additional path A-C not through B. The chemical similarity for two such molecules is not so different to the graph similarity, and so a smaller penalty term, typically 0.1, is applied to reduce the similarity metric for these two molecules. There are even cases where it is not desirable to apply a penalty at all, e.g. compounds 5a and 5b in Chart 3. The arrangement penalty only comes into effect if all three largest fragments are larger than four atoms. The magnitude of these correction terms may be truncated to prevent the overall similarity from dropping below 0. A combination of Breadth First Search and Shortest Path algorithms is used to determine when to apply these correction terms. 2. Large substructure term: This acts to raise the similarity index S if the largest fragment of the MOS comprises 70% or more of one of the molecule’s atoms and bonds. Compounds sharing a large common substructure are generally regarded as belonging to the same chemical class. The correction term serves to make this relationship more apparent when two molecules
a Top: Arrangement correction term, center: cyclic vs linear arrangement, bottom: large substructure term (only the largest substructure fragment is shown). The numerical values show how the similarity value is modified upon introduction of these terms.
either differ strongly in size (as 6a and 6b in Chart 3) or carry completely different sets of substituents on the same scaffold. The arrangement penalty is never applied in these cases. The magnitude of this term may need to be truncated to prevent the overall similarity from reaching 1.0 if the two molecules are not identical (an upper limit is defined for nonidentical molecules). Simple atom counting rules are used to determine when to apply this correction term. A wide range of values for the arrangement and large substructure terms were tested using a number of different data sets (sets related to the one listed in Table 1 and sets obtained from in house HTS campaigns). Additionally, a size cutoff associated with the arrangement term was explored. Three different cutoff schemes were investigated: (1) always include the largest three fragments from the MOS, i.e., no size cutoff, (2) only calculate the penalty if each fragment had more than 2 atoms, and (3) only calculate the penalty if each fragment had more than four atoms. The optimal set of parameters for the arrangement term were determined to be 0.3 and 0.1 while using a cutoff of four atoms for the MOS fragments sizes, and a value of 0.1 for the large substructure term. Interestingly, these values are very close to the very first set of values that were tested, ones that were simply guessed based on experience and intuition. It must be noted that a calculated MOS may not be a unique solution; an alternative set of fragments and/or an alternative arrangement of fragments may define a
4362
Journal of Medicinal Chemistry, 2005, Vol. 48, No. 13
Stahl et al.
Table 1. Origin and Number of Compounds Used as the Test Set compound class
abbreviation
references
number of compounds
acetylcholine esterase inhibitors angiotensin II antagonists cyclin dependent kinase 2 inhibitors cyclooxygenase 2 inhibitors dopamine 4 antagonists estrogen receptor ligands histamine receptor 3 ligands matrix metalloproteinase 3 inhibitors thrombin inhibitors
ACE AngII CDK2 COX2 D4 Estr H3 MMP3 Throm
33 33 34 34 Aureus 7TM database35 34 Aureus 7TM database35 34 33,36
50 55 45 79 68 44 23 75 37
maximum clique with an identical number of matching units. Thus, a consequence of the arrangement term is that it may act to reduce the similarity metric in one solution but not apply in an alternative solution. In other words, our correction terms can exaggerate differences in molecules in some cases. This can be observed to occur in some instances of switching the order of the two compounds in a similarity calculation or when the connection tables in a multi-mol file are reordered and a clustering job is repeated. Clustering. Clustering methods that were investigated include Jarvis-Patrick,27 exclusion sphere clustering,28 and the hierarchical agglomerative methods single link, complete link, Ward, and group average.29 The best results were found using the hierarchical agglomerative UPGMA method;30 it generally isolates “true” singletons and produces relatively few coherent clusters not containing outliers as observed with complete linkage. The first step in compound clustering is the calculation of a matrix containing similarity values for all pairs of compounds. The calculation time thus depends quadratically on the number of compounds. The matrix is converted to a hierarchical clustering tree using the UPGMA algorithm. Next, the hierarchical tree is “cut” into clusters at a number of similarity levels, typically 0.4, 0.5, and 0.6. The best results for viewing are consistently found for the value 0.5. A cluster ordering algorithm is employed to sort the clusters. A single-link approach is used, one that assigns sequential cluster labels to nearest neighbor clusters. Nearest neighbor clusters are defined as clusters containing the closest pair of molecules. The objective of using such an algorithm is to make it as easy as possible for chemists to navigate the hierarchical classification tree according to chemical similarity. It should be noted, however, that there is some arbitrariness in cluster ordering, since there are many equivalent depictions of hierarchical clustering trees, which can be interconverted by rotating branches by 180°. Thus there are many possible arrangements of clusters relative to each other. Once cluster IDs have been generated, Spotfire31 is used for browsing and analyzing structures. Compilation of the Test Set. Compounds binding to nine different biological target proteins were compiled from the literature and commercial databases. Compound sources are listed in Table 1. The original set consisted of 653 compounds. These were clustered by the exclusion sphere method28 with a cutoff radius of 0.7. Tanimoto coefficients based on 2048 bit fixed-length Daylight fingperprints were used as the similarity metric in this calculation. Cluster centers and singletons (466 molecules) were selected as the final test set. Alternative Clustering Methods. A similarity matrix based on MCS calculations was calculated using an
MCS algorithm implemented in Python and OEChem.15 The Raymond similarity metric was used as was the atom typing scheme employed in the MOS calculations. A minimum number of 6 atoms was required for an MCS similarity to be computed, otherwise the similarity was set to zero. A similarity matrix based on Daylight Tanimoto coefficients was calculated using 2048 fixedlength Daylight fingerprints. Hierarchical clustering based on both the MCS and the Daylight similarity matrices was performed using the Spinifex implementation of the UPGMA algorithm. Cluster IDs were calculated at a similarity level of 0.4 in both cases. MACCS keys were calculated within MOE32 and exported to Spotfire, where clustering was performed using Tanimoto distances and the UPGMA method. Cluster IDs were calculated at a similarity level of 0.6. The similarity cutoffs for Daylight, MCS, and MACCS clustering were chosen such that the total number of clusters was about 100, as in the MOS-based clustering. This was not achieved, however, for MCS based clustering, due to the high number of singletons at even the lowest similarity levels. Results and Discussion In this work we describe the development and application of a new compound clustering method in which similarity is based on 2D subgraph isomorphism. Chemical similarity is calculated by first determining the MOS for a pair of chemical structures. The MOS is a generalization of the MCS and can be thought of as the largest set of substructures that two compounds share. Following the work of Raymond et al., a maximum clique algorithm is used to compute the MOS. The Rascal algorithm,12 reported to be very fast, was implemented, and we incorporated the same types of heuristics to prune search space that they described.13 Calculating molecular similarity based on a MOS approach and then applying the hierarchical agglomerative clustering method UPGMA30 was found to be a powerful method for correctly separating compounds into target classes by means of fewer clusters. For the calculation of the MOS, there must be rules defining which atoms and bonds can be mapped onto each other for any given pair of structures. We consistently obtained the best results by using chemical elements to distinguish atoms and making a finer distinction for both oxygen and nitrogen atoms, between those potentially acting as hydrogen bond donors, and others not carrying hydrogen atoms. In addition, we allow all halogens to be mapped onto each other. Bonds are distinguished by bond order (single, double, triple, aromatic). The MOS of two chemical structures, and the way it is embedded in the parent chemical structures, provides
Robust Clustering Method
a host of information about the relationship between the two molecular graphs. This information can be used, for example, to align or overlay a pair of structures, and it can be visualized in a graphics package with the similarities highlighted. For clustering, however, matching subgraphs must be mapped to a chemical similarity metric in a manner that reflects how a chemist would assess the degree of identity between the two structures. We use a simple metric based on the number of atoms and bonds in the MOS and the numbers in the two parent structures, and we extend this metric by introducing correction terms. We penalize situations in which the three largest MOS fragments occur in a different arrangement in the two parent structures, and we increase the similarity if the largest MOS fragment comprises the large majority of the atoms and bonds in at least one of the molecules (see Methods section for details). A series of clustering experiments were undertaken in order to clearly establish whether the fine-tuning parameters produce real improvements in the quality of our results and to try to quantify the extent of any improvement. In particular, we wanted to test the overall impact of the atom typing scheme and the overall impact of the correction terms. We compiled a test set of 466 compounds known to bind to nine different targets (Table 1). The data set was constructed such that it did not contain homologous, virtually identical compounds belonging to the same chemical series, which would be trivial to group with any method. The boundary conditions for hierarchical clustering correspond to the two “ends” of the hierarchical tree where all compounds are either all equivalent or all unique. At the top of the tree all compounds are related to one another, that is, there is one “mixed” cluster. Herein we refer to a mixed cluster as one that contains compounds from two or more target classes. At the bottom of the tree all compounds are unique, that is, all of them are singletons. Conceptually, a clustering experiment can be considered as involving isolating compounds different to all others into singletons and grouping together compounds that are closely related. Using the hierarchical agglomerative UPGMA method, the starting point is the bottom of the hierarchical tree, then closest pairs are grouped together, individual or groups of compounds are added to other individual or groups of compounds, until all are grouped together. When a cluster only contains actives from a single target class, we refer to it as a “pure” cluster. The ideal result for our test set would be nine pure clusters, one for each target class. This is not realistically achievable, as there is not a simple one-to-one correspondence between chemical similarity and target class in the test set. However, this is not a major issue because the relative results are important when comparing different approaches. For each experiment, we (1) count the number of singletons (which we want to minimize), (2) count the number of molecules in pure clusters, and the number of such clusters (we want a large number of molecules in a small number of clusters), and (3) count the number of molecules in mixed clusters, and the number of these clusters (we want to minimize the number of molecules in these clusters).
Journal of Medicinal Chemistry, 2005, Vol. 48, No. 13 4363 Table 2. Results of Clustering Calculations Examining the Overall Impact of the Atom Typing Scheme and Correction Terms, and Clustering Results Obtained with the MCS, MACCS, and Daylight Similarity Methods method MOS MOS MOS MOS MCS MACCS Daylight
correction atom pure clusters mixed clusters terms typing singletons (compounds) (compounds) on off on off -
on on off off on -
17 21 29 16 109 43 33
53 (307) 29 (128) 38 (212) 22 (98) 76 (255) 43 (194) 53 (224)
28 (142) 30 (317) 29 (235) 25 (352) 25 (102) 20 (229) 25 (209)
Table 2 shows the results of our experiments designed to examine the overall impact of the atom typing scheme and correction terms. Both enhancements clearly have a favorable and significant impact on clustering as seen in the numbers of molecules in pure clusters versus the number of molecules in mixed clusters. The correction terms play the larger role. Combined, the results are close to additive; 66% of the compounds are grouped in pure clusters. There is a cost in the total number of clusters obtained when incorporating both the atom typing scheme and the correction terms; however, this becomes less of an issue, as the clustering algorithm we use tends to label pure clusters from the same target class with sequential cluster IDs. There is more or less no cost associated with the number of singletons. Identical trends are seen when the hierarchal tree is cut at either a higher value or a lower value of the similarity metric. Table 2 lists the number of pure and mixed clusters as well as the number of singletons for the MOS and MCS method, MACCS keys, and Daylight fingerprints. For each of these methods, the hierarchical classification tree was cut at that value of the similarity metric which produced the best results. MCS clustering generates the most singletons, as should be expected, since the data set contains few compound pairs sharing large contiguous substructures. In addition, the MCS method identifies many pure clusters, each containing only few compounds per cluster. This is something of a trivial result; all of the other methods would be expected to have the same subset of groupings, but within larger clusters. Both MACCS keys and Daylight fingerprints are unable to separate loosely related compounds into distinct target classes as effectively as the MOS approach and generate many mixed clusters. Figure 3 shows a graphical representation of the results listed in Table 2. Each horizontal line represents a compound. Compounds are ordered according to cluster IDs and colored according to target class. In the right-hand part of each plot, mixed clusters are omitted (white), singletons are black, and pure clusters retain their target color. We do not attach great importance to the relative positions and separations of the colored bands in each plot, as there is always a certain degree of arbitrariness associated with graphical representations of hierarchical clustering trees. It is apparent that there are a few compound sets that can be grouped together by most methods. These most coherent groups of clusters are (1) AngII compounds, (2) a subset of the COX-2 inhibitors, and (3) a subset of the MMP3 inhibitors. The MCS method produces the smallest number of molecules in mixed clusters, as would be expected.
4364
Journal of Medicinal Chemistry, 2005, Vol. 48, No. 13
Stahl et al.
Chart 4
Figure 3. Cluster maps for the MOS, MCS, Daylight, and MACCS keys methods. In each of the four panels, compounds are sorted according to Cluster IDs. The left-hand half of each plot is colored by target class only. Light green: D4, dark green: H3, brown: MMP3, red: ACE, Grey: thrombin, orange: CDK2, yellow: COX-2, light blue: estrogen, dark blue: AngII. The right-hand part shows pure clusters in target color, singletons in black, and mixed clusters white.
This method is best suited to identifying instances of very close chemical similarity but less suitable for identifying instances of other levels of similarity (thus resulting in large numbers of singletons). Both Daylight and MACCS results suggest that these methods have difficulties in distinguishing compounds with a more or less peptidic backbone; they mix thrombin, MMP3, and ACE inhibitors. Mixed Daylight and MACCS clusters also arise from combinations of H3 and D4 compounds, as indicated above (Chart 1). Overall, however, we were surprised to see Daylight fingerprints perform approximately as well as MACCS keys, because the test set had deliberately been designed to be diverse based on Daylight Tanimoto coefficients. This criterion generally removes compounds with high substructure similarity from the data set but retains compound pairs where smaller common substructures are connected with different linker fragments. While MACCS keys can cope with such situations, the total neglect of connectivity of the keyed substructures obviously introduces a fuzziness that creates many mixed clusters. Clearly,
both Daylight and MACCS retain their validity in particular for clustering large libraries due to their high speed. As stated before, a clustering method based only on chemical structure similarity would not be expected to group all compounds binding to one target into one cluster. Nevertheless, one should expect that compounds with related topology and equal relative arrangements of pharmacophore features to be grouped together. This is indeed the case for the MOS clustering results shown in Figure 3. Most of the activity classes are split into multiple subsets. It is important to note that these subsets are not necessarily closely related to each other. For example, the D4 compound class is split into two major groups of clusters (light green bands in Figure 3a). The upper one of these contains all antagonists with extended chains and a central amine function, whereas the lower one contains the structurally distinct tricyclic antagonists with an amine side chain. Likewise, MMP3 inhibitors are split up into three groups of clusters: sulfonamide hydroxamic acids (topmost brown band in Figure 3a), long chain biaryls with a carboxylic acid zinc binding motif (middle), and peptidic hydroxamates (lowest brown band). As a final example, thrombin inhibitors based on proline-containing peptides (top gray band) are clearly separated from others with sulfonamide-based scaffolds (lowest gray band). Scaffolds are not necessarily conserved within each of these target subgroups however the pharmacophores are conserved. Chart 4 lists examples of three ACE inhibitors (7a-c) and three COX-2 inhibitors (8a-c) clearly sharing common pharmacophore features. Only the MOS method is able to cluster these together. Conclusion In this paper we have described the development of a MOS-based compound clustering approach and its advantages over other methods. The superiority of a graph-based method that computes the largest set of substructures rather than just the maximum common
Robust Clustering Method
Journal of Medicinal Chemistry, 2005, Vol. 48, No. 13 4365
substructure (MCS) was apparent throughout all of our results. We could show that a more fine-grained distinction between atom types and, most importantly, the introduction of empirical correction terms dramatically improves the performance of MOS-based clustering in grouping compounds according to their respective targets. MOS-based clustering seems to be capable of finding biologically meaningful relationships between chemical structures belonging to different chemical classes. We consider this to be an important step forward in automated clustering. In particular the method is well suited for the analysis of HTS data, where it is essential to find relationships between compounds one cannot detect by classical substructurebased analysis. We consider it particularly important to find structural similarities to compounds otherwise regarded as “singletons”, because the level of confidence in a single hit increases dramatically when a distant structural relationship to other hits can be found. Single hits occur particularly often when focused or diverse subsets of compound collections are screened for time or resource reasons. The success seen from using the clustering methods described in this work led to the development of a web interface enabling medicinal chemists to submit their own clustering calculations. This tool enables clustering results to be merged with results from property calculations and the output to be delivered in a spreadsheet format suitable for importing into other desktop applications such as Excel and SpotFire. Robust chemical similarity software has many practical applications in medicinal chemistry research. In this work we have mainly described the clustering of chemical datasets, with the intention of developing a tool for analyzing HTS hits. Other applications include database searching (saving the n closest hits for a query molecule) and the visualization of small molecule 3D alignments and overlays (based on shared topological features). We continue to fine-tune and extend the methods and algorithms described in this paper. Most importantly, we would like to further improve the performance of our clique detection algorithms. The current method can handle 3000 structures efficiently; however, it becomes cumbersome to process data sets up around 10000 structures and more. It would be highly desirable to be able to cluster these larger datasets with Spinifex in an “overnight” run on a standard Linux cluster. As noted by other groups23 a need remains to account for the statistical relevance of random similarities in multifragment based approaches. For instance, many molecules include carbon-carbon single bonds and benzene rings, and furthermore, two large molecules are highly likely to share some common features even if they would be classified to be really dissimilar by the majority of medicinal chemists. One extension we intend to pursue is a penalty term to deal with the occurrence of many small fragments in the calculated MOS. Also, our empirical correction terms do not capture all topological information obtained through the determination of the MOS. Experience with practical applications will show how the similarity metric can be further improved.
clustering algorithms, and our colleagues in modeling and chemistry at Roche Basel for many fruitful discussions and for testing the new method in many real-life projects.
Acknowledgment. We thank Tanja Schulz-Gasch for help with the batch-mode calculation of MACCS keys, Paul Gerber for an implementation of various
Supporting Information Available: Table listing cluster IDs calculated with four methods for the test set, corresponding to the graphical representations in Figure 3, and SMILES codes for all 466 ligands in the test set. This material is available free of charge via the Internet at http://pubs.acs.org.
References (1) http://www.daylight.com. (2) Durant, J. L.; Leland, B. A.; Henry, D. R.; Nourse, J. G. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 2002, 42, 1273-1280. (3) Brown, R. D.; Martin, Y. C. Use of structure-activity data to compare structure-based clustering methods and descriptors for use in compound selection. J. Chem. Inf. Comput. Sci. 1996, 36, 572-584. (4) McGregor, M. J.; Pallai, P. V. Clustering of large databases of compounds: Using the MDL “keys” as structural descriptors. J. Chem. Inf. Comput. Sci. 1997, 37, 443-448. (5) Shen, J. HAD: An automated database tool for analyzing screening hits in drug discovery. J. Chem. Inf. Comput. Sci. 2003, 43, 1668-1672. (6) http://www.bioreason.com. (7) Xu, J. A new approach to finding natural chemical structure classes. J. Med. Chem. 2002, 45, 5311-5320. (8) Cross, K. P.; Myatt, G.; Yang, C.; Fligner, M. A.; Verducci, J. S. et al. Finding discriminating structural features by reassembling common building blocks. J. Med. Chem. 2003, 46, 4770-4775. (9) Nilakantan, R.; Bauman, N.; Haraki, K. S.; Venkataraghavan, R. A ring-based chemical substructural query system: Use of a novel ring-complexity heuristic. J. Chem. Inf. Comput. Sci. 1990, 30, 65-68. (10) Nilakantan, R.; Bauman, N.; Haraki, K. S. Database diversity assessment: New ideas, concepts, and tools. J. Comput.-Aided Mol. Des. 1997, 11, 447-452. (11) Raymond, J. W.; Blankley, C. J.; Willett, P. Comparison of chemical clustering methods using graph- and fingerprint-based similarity measures. J. Mol. Graphics Mod. 2003, 21, 421-433. (12) Raymond, J. W.; Gardiner, E. J.; Willett, P. RASCAL: Calculation of graph similarity using a maximum common edge subgraph algorithm. Comput. J. 2002, 45. (13) Raymond, J. W.; Gardiner, E. J.; Willett, P. Heuristics for similarity searching of chemical graphs using a maximum common edge subgraph algorithm. J. Chem. Inf. Comput. Sci. 2002, 42, 305-316. (14) http://www.desertsci.com/. (15) http://www.eyesopen.com/. (16) http://www.twisted-helices.com/computing/rambin/rambin.html. (17) Bron, C.; Kerbosch, J. Finding all cliques of an undirected graph. Commun. ACM 1975, 16, 575-577. (18) ftp://dimacs.rutgers.edu/pub/challenge/graph/solvers/. (19) Pardalos, P.; Xue, J. The maximum clique problem. J. Global Optim. 1992, 4, 301-328. (20) Wood, D. An algorithm for finding a maximum clique in a graph. Oper. Res. Lett. 1997, 21, 211-217. (21) Figueras, J. Ring perception using breadth-first search. J. Chem. Inf. Comput. Sci. 1975, 36, 986-991. (22) http://www.csm.ornl.gov/pvm/pvm_home.html. (23) Sheridan, R. P.; Miller, M. D. A method for visualizing recurrent topological substructures in sets of active molecules. J. Chem. Inf. Comput. Sci. 1998, 38, 915-924. (24) http://www.tripos.com/. (25) Ihlenfeldt, W. D.; Gasteiger, J. Hash codes for the identification and classification of molecular structure elements. J. Comput. Chem. 1994, 15, 793-813. (26) Raymond, J. W.; Willett, P. Effectiveness of graph-based and fingerprint-based similarity measures for virtual screening of 2D chemical structure databases. J. Comput.-Aided Mol. Design 2002, 16, 59-71. (27) Jarvis, R. A.; Patrick, E. A. Clustering using a similarity measure based on shared nearest neighbours. IEEE Trans. Comput. 1973, C-22, 1025-1034. (28) Butina, D. Unsupervised data base clustering based on Daylight’s fingerprints and Tanimoto similarity: A fast and automated way to cluster small and large data sets. J. Chem. Inf. Comput. Sci. 1999, 39, 747-750. (29) Downs, G. M.; Barnard, J. M. Clustering methods and their uses in computational chemistry. Reviews in Computational Chemistry; Wiley-VCH: New York, 2002; pp 1-40.
4366
Journal of Medicinal Chemistry, 2005, Vol. 48, No. 13
(30) Sokal, R. R.; Michener, C. D. A statistical method for evaluating systematic relationships. Univ. Kan. Sci. Bull. 1958, 28, 14091438. (31) http://www.spotfire.com. (32) http://www.chemcomp.com/. (33) Stahl, M.; Rarey, M.; Klebe, G. Screening of drug databases. Bioinformatics: From Genomes to Drugs; VCH: Weinheim, 2002; pp 229-264.
Stahl et al. (34) Stahl, M.; Todorov, N. P.; James, T.; Mauser, H.; Boehm, H.-J. et al. A validation study on the practical use of automated de novo design. J. Comput.-Aided Mol. Des. 2002, 16, 459-478. (35) http://www.aureus-pharma.com/. (36) Stahl, M.; Rarey, M. Detailed analysis of scoring functions for virtual screening. J. Med. Chem. 2001, 44, 1035-1042.
JM040213P