Chapter 11

Clustering Compound Data: Asymmetric Clustering of Chemical Datasets

Norah E. MacCuish and John D. MacCuish

Downloaded by COLUMBIA UNIV on August 23, 2012 | http://pubs.acs.org Publication Date: August 9, 2005 | doi: 10.1021/bk-2005-0894.ch011

Mesa Analytics and Computing, LLC, Santa Fe, NM 87501

We investigate asymmetric clustering of compound data as a viable alternative to more commonly used algorithms in this area, such as Ward's, complete link, and leader algorithms. We show that the Tversky measure, more commonly applied to similarity searching in compound databases, can be used in both a hierarchical asymmetric clustering algorithm and an asymmetric variant of a popular leader algorithm as an effective means to cluster 2-dimensional molecular structures for template extraction, without the size bias usually associated with more common clustering measures and methods. We show the results of the combination of these measures and algorithms with several chemical datasets.

© 2005 American Chemical Society

In Chemometrics and Chemoinformatics; Lavine, B.; ACS Symposium Series; American Chemical Society: Washington, DC, 2005.



Introduction

Cluster analysis is the study of methods for finding groups or some form of structure in data (1,2). These methods fall under the general heading of unsupervised learning. Clustering is sometimes called classification, though it is distinct from the methods, also known as classification, employed to discriminate groups that are known to be in the data a priori. Discrimination methods (3) are often used to build classification models (classifiers) for predictive modeling. The latter form of classification falls under the general rubric of supervised learning. All of these methods are within the larger study of pattern recognition (4) and multivariate statistics (5).

Clustering algorithms in turn have a complex taxonomy that is not well defined. The most general types are hierarchical (divisive and agglomerative) and partitional algorithms. A partition can be formed from a hierarchy via a level selection technique (6) such as the Davies-Bouldin (7) or the Kelley (8) heuristics. Common agglomerative hierarchical algorithms are Ward's, complete link, and group average. A popular partitional or relocation method is k-means (9). Exclusion region algorithms such as Taylor-Butina (10,11) are also often used for grouping molecular structures.

The algorithms can also be divided into those that strictly partition the data into disjoint sets and those where cluster membership is not unique, such that the sets are non-disjoint (or overlapping). Membership can also be probabilistic, where elements are assigned a probability of membership for each cluster. Fuzzy clustering (12) is an overlapping method, where membership is assigned as a grade (often between 0 and 1, but not a probability). There are also parametric methods such as mixture models; EM, or Expectation Maximization, is one such algorithm (13). However, these methods, with their assumption of specific distributions, tend to work best in low dimensions, and they are computationally expensive.
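To make the agglomerative family above concrete, the following is a minimal, illustrative sketch of complete-link clustering (a naive toy, not a production implementation; the data, distance function, and fixed cluster count are invented for the example; in practice a level selection heuristic such as Kelley's would choose where to cut the hierarchy):

```python
from itertools import combinations

def complete_link(points, dist, k):
    """Naive agglomerative complete-link clustering down to k clusters.

    Illustrative sketch only (cubic-time merge scans); `dist` is any
    symmetric distance function on the items in `points`.
    """
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        # Complete link: inter-cluster distance = max pairwise distance.
        for (i, a), (j, b) in combinations(enumerate(clusters), 2):
            d = max(dist(x, y) for x in a for y in b)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy one-dimensional "descriptors" and a trivial distance.
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
groups = complete_link(pts, lambda x, y: abs(x - y), k=3)
```

Ward's and group average differ only in the merge criterion computed in the inner loop; the agglomerative skeleton is the same.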
Clustering in chemoinformatics is used for lead selection in HTS data, diversity analysis, lead hopping, compound acquisition decisions, and related activities (14,15), often on large or very large data sets. Numerous clustering techniques have been employed with varying effectiveness in these pursuits (16). Algorithms must make efficient use of computational resources, and with very large data sets it is often crucial that the algorithm can be parallelized. Clustering compound data begins first with molecular descriptors. Such descriptors are manifold: graph-based (17), chemical properties (18), shape descriptors (19). With large data sets the speed with which one can operate on molecular descriptors becomes crucial. Thus, simple binary fingerprints that encode 2D chemical structure, whether feature or path based (20,21), are very common, as they are relatively easy to generate and operate on. Proximity



measures for binary data are then used to compare binary molecular descriptors (22,23). These measures have varying properties: they may or may not be metric, they may be symmetric or asymmetric, and they may or may not be monotonic to one another. Thus, binary descriptors are very common, and many of the clustering techniques revolve around binary proximity measures and algorithms that can utilize them.

All of the various binary data clustering algorithms mentioned above typically use symmetric proximity measures such as the Tanimoto, Euclidean, or Ochiai measures. However, there are algorithms that can use asymmetric measures such as the Tversky measure (24,25,26). For example, there is an asymmetric, agglomerative, hierarchical, strongly connected component algorithm due to Tarjan (27), and we have transformed the Taylor-Butina algorithm mentioned above to work with asymmetric measures. Asymmetry can also be used with other chemical descriptors, such as graph-based descriptors (17) and shape descriptors (28). In addition, clusters from asymmetric algorithms need not be strictly disjoint. For instance, non-disjoint variants of hierarchical algorithms (29) have been designed, and we have created a variant of our asymmetric Taylor-Butina algorithm that produces overlapping clusters.

Special situations arise, however, when using binary measures and clustering algorithms (30,31,32,33). The relative sizes of molecular structures in conjunction with certain measures can create biases (31). In addition, ties in proximity often become a much more serious problem with the use of binary descriptors (32,33). Ties in proximity can affect, either directly or indirectly, decisions within clustering algorithms, such as merging criteria in agglomerative hierarchical algorithms, or partitioning decisions. Algorithms in turn may include fundamental or implementation decisions that result in an ambiguous clustering.
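A rough sketch of the kind of asymmetric exclusion-region (Taylor-Butina-style) procedure described above, in our own illustrative rendering rather than the authors' implementation; the fingerprints, the α/β weights, and the similarity threshold are all assumptions for the demo:

```python
def tversky(fp_a, fp_b, alpha=0.9, beta=0.1):
    """Asymmetric Tversky similarity on fingerprints held as sets of
    'on' bit positions. The alpha/beta weights are demo assumptions."""
    c = len(fp_a & fp_b)   # bits in common
    a = len(fp_a - fp_b)   # bits unique to A
    b = len(fp_b - fp_a)   # bits unique to B
    denom = alpha * a + beta * b + c
    return c / denom if denom else 0.0

def asymmetric_leader(fps, threshold=0.8):
    """Exclusion-region (Taylor-Butina-style) clustering, asymmetric variant.

    The compound with the most neighbors under the *directed* similarity
    leader -> member becomes a cluster centroid; it and its neighbors are
    excluded, and the process repeats on the remainder. Illustrative only.
    """
    remaining = set(range(len(fps)))
    clusters = []
    while remaining:
        nbrs = {i: [j for j in sorted(remaining)
                    if j != i and tversky(fps[i], fps[j]) >= threshold]
                for i in remaining}
        leader = max(sorted(remaining), key=lambda i: len(nbrs[i]))
        cluster = [leader] + nbrs[leader]
        clusters.append(cluster)
        remaining -= set(cluster)
    return clusters

fps = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3}, {8, 9}]
clusters = asymmetric_leader(fps)
```

Note the direction matters: with these weights, compound 2 (the common substructure {1, 2, 3}) sees compounds 0 and 1 as neighbors and becomes the leader, even though the reverse comparisons fall below the threshold. The direction of the measure, not compound size, drives the grouping.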
We show how asymmetric methods are largely equivalent to the common and popular clustering methods in use in chemoinformatics.

Motivation

Anecdotal evidence from the chemical information industry suggests that the Tversky asymmetric measure is used with considerable efficacy in similarity searching, where, given one compound, a database is searched for similar compounds. Via its parameterization of similarity, the measure captures the extent to which a single molecular structure is either super- or sub-similar to others. This gives rise to two possibly different proximities, hence the asymmetry. Similarity



searching uses one-to-many comparisons in one direction, whereas clustering typically uses many-to-many symmetric comparisons. Asymmetric clustering algorithms from other fields have not, to our knowledge, been tried in chemoinformatics. Our research suggests that using asymmetric measures and asymmetric clustering algorithms may yield important new methods that provide insight into template extraction or determining substructures in common. Our original interest in asymmetry was in the hope that it would help avoid the serious shortcomings of ties in proximity when using binary measures and various clustering algorithms. This benefit is marginal at best, but that does not obviate the other benefits of the use of asymmetry. More generally, asymmetric clustering algorithms can be used with non-binary descriptors as well, such as clustering shape descriptors or graph-based descriptors.

Asymmetry

Symmetric measures such as the Tanimoto and Euclidean measures of proximity are simply binary relations such that the proximity between two molecular structures is a single value. With asymmetric measures, however, the proximity may take two values: the proximity between structure A and structure B is not necessarily the same as the proximity between structure B and structure A.

Tversky Measure

The Tversky measure is a parameterization of the Tanimoto measure; its parameters allow one to treat the measure as asymmetric. The Tanimoto and Tversky measures are defined in Equation 1 and Equation 2 for comparisons of molecular structures represented by binary bit strings.

Tanimoto = c / (a + b + c)    (1)

Tversky = c / (αa + βb + c)    (2)

a = unique bits set in molecular structure A
b = unique bits set in molecular structure B
c = common bits set in structures A and B
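As a small worked illustration of Equations 1 and 2 (the bit patterns and α, β values are our own toy choices, not taken from the chapter):

```python
# Fingerprints as sets of "on" bit positions (toy example; real
# fingerprints are typically 1024- or 2048-bit feature/path strings).
A = {1, 2, 3, 5}
B = {0, 1, 5}

c = len(A & B)   # common bits set in A and B      -> 2
a = len(A - B)   # unique bits set in structure A  -> 2
b = len(B - A)   # unique bits set in structure B  -> 1

tanimoto = c / (a + b + c)                   # Equation 1
alpha, beta = 0.2, 0.8                       # Tversky weights (our choice)
tversky_ab = c / (alpha * a + beta * b + c)  # Equation 2, A against B
tversky_ba = c / (alpha * b + beta * a + c)  # roles of A and B swapped
```

With unequal weights, tversky_ab and tversky_ba differ; this pair of directed values is exactly the asymmetry exploited by the algorithms above. Setting α = β = 1 recovers the Tanimoto measure.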




Optional constraint: 0