15
The Similarity of Graphs and Molecules

Steven H. Bertz¹ and William C. Herndon²

¹AT&T Bell Laboratories, Murray Hill, NJ 07974
²University of Texas at El Paso, El Paso, TX 79968-0509

Downloaded by UNIV OF CALIFORNIA SAN DIEGO on January 26, 2016 | http://pubs.acs.org
Publication Date: April 30, 1986 | doi: 10.1021/bk-1986-0306.ch015
A new definition of molecular similarity is presented, based upon the similarity of the corresponding molecular graphs. First, all of the subgraphs of the molecular graph are listed, and then various similarity indices are derived from the numbers of subgraphs. One of these compares favorably with the standard distance measures of sequence comparison. Measurement of similarity provides a new way to measure molecular complexity, as long as the most (or least) complex member of a set of molecules can be identified.

The concept of the similarity of molecules has important ramifications for physical, chemical, and biological systems. Grunwald (1) has recently pointed out the constraints of molecular similarity on linear free energy relations and observed that "Their accuracy depends upon the quality of the molecular similarity." The use of quantitative structure-activity relationships (2-6) is based on the assumption that similar molecules have similar properties. Herein we present a general and rigorous definition of molecular structural similarity.

Previous research in this field has usually been concerned with sequence comparisons of macromolecules, primarily proteins and nucleic acids (7-9). In addition, there have appeared a number of ad hoc definitions of molecular similarity (10-15), many of which are subsumed in the present work. Difficulties associated with attempting to obtain precise numerical indices for qualitative molecular structural concepts have already been extensively discussed in the literature and will not be reviewed here.

Results and Discussion

We begin with the way chemists perceive similarity between two molecules. This process involves, consciously or unconsciously, comparing several types of structural features present in the molecules. For example, considering the five aliphatic alcohols (represented by their H-suppressed molecular graphs) in Figure 1, we note both similarities and differences: they are all four-carbon alcohols; a, b, c and d are acyclic, whereas e has a ring; a and b are primary alcohols, c and e are secondary alcohols and d is a tertiary alcohol; b and c have the same skeleton, but for the labeling of points (atoms), while the other skeletons are distinct; etc.

0097-6156/86/0306-0169$06.00/0 © 1986 American Chemical Society
In Artificial Intelligence Applications in Chemistry; Pierce, Thomas H., et al.; ACS Symposium Series; American Chemical Society: Washington, DC, 1986.
The first step in quantifying the concept of similarity is to list all subgraphs of the given molecular graphs, e.g. a-e, which has been done in the first column of Table I. The subgraphs include the vertices (atoms), all connected subgraphs, and the full molecular graphs themselves, since it can be seen that the molecular graphs for a and c are both subgraphs of e. Next, the number of each subgraph contained in the molecular graphs must be counted. Row 1 lists the number of C atoms, row 2 the number of O atoms, row 3 the number of C-C bonds, row 4 the number of C-O bonds, etc. Gordon and Kennedy (16) defined $N_{ij}$ as the number of subgraphs of graph j isomorphic with graph i, and more colloquially as "the number of distinct ways in which skeleton i can be cut out of skeleton j." The entries in Table I are the number of ways the subgraphs can be cut out of the molecular graphs (the number of subgraphs of the molecular graphs isomorphic with the subgraphs in the first column).

In terms of the numbers of C or O atoms, a-e are equally complex. In terms of C-C bonds (ethane subgraphs) a-d are 3/4 as complex as e; however, in terms of propane subgraphs (row 5) a and c are 1/2 as complex as e.

A simple algorithm that takes account of all the subgraphs involves comparison of two columns at a time, examining them row by row and dividing the smaller of the numbers by the larger. A similarity index (SI) can then be calculated by taking the average of the quotients. Of course, for two identical molecular graphs, SI = 1. Inclusion of the molecular graphs in the list of subgraphs ensures that two different molecules which have the same number of each proper subgraph will not have SI = 1. The values of SI(1) for a-e are summarized in the form of a similarity matrix SM(1) in Figure 2.

A simpler similarity index can be calculated by dividing the sum of the lesser of the two numbers in each row by the sum of the greater. (Only two columns of Table I are considered at a time, of course.)
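The two indices just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' own procedure: the names `TABLE_I`, `si1` and `si2` are hypothetical, the counts are the columns of Table I, and skipping rows in which both counts are zero is an assumption (the text does not say how such rows are treated), adopted here because it reproduces the a/c entries of Figure 2.

```python
# Columns of Table I (rows 1-16, top to bottom) for graphs a-e.
TABLE_I = {
    "a": [4, 1, 3, 1, 2, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    "b": [4, 1, 3, 1, 3, 1, 0, 2, 1, 0, 0, 0, 1, 0, 0, 0],
    "c": [4, 1, 3, 1, 2, 2, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
    "d": [4, 1, 3, 1, 3, 3, 0, 0, 1, 3, 0, 0, 0, 0, 1, 0],
    "e": [4, 1, 4, 1, 4, 2, 4, 2, 0, 1, 1, 2, 0, 2, 0, 1],
}


def si1(m, n):
    """SI(1): average of the row-wise quotients smaller/larger.

    Assumption: rows where both counts are zero are left out of the average.
    """
    quotients = [min(x, y) / max(x, y) for x, y in zip(m, n) if max(x, y) > 0]
    return sum(quotients) / len(quotients)


def si2(m, n):
    """SI(2): sum of the row-wise minima divided by the sum of the maxima."""
    return sum(min(x, y) for x, y in zip(m, n)) / sum(max(x, y) for x, y in zip(m, n))


a, c = TABLE_I["a"], TABLE_I["c"]
print(round(si1(a, c), 3))  # 0.682, the a/c entry of SM(1)
print(round(si2(a, c), 3))  # 0.778, the a/c entry of SM(2)
```

For the pair a/c, eleven rows have at least one nonzero count and their quotients sum to 7.5, giving 7.5/11 = 0.682; the minima sum to 14 and the maxima to 18, giving 14/18 = 0.778.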
The values of SI(2) for a-e are summarized in SM(2), also in Figure 2. According to both SI(1) and SI(2), 1-butanol (a) and 2-butanol (c) are the most similar, whereas t-butanol (d) and cyclobutanol (e) are the least similar pair. In between these extremes there are a significant number of disagreements between these indices. For example, based on SI(1), c and e are more similar than c and d; however, c and d are more similar than c and e based on SI(2). There are seven such pairs (out of 45 possible pairs), and each index has one "degeneracy". By considering standard measures of "distance," SI(2) would appear to be the superior index (vide infra).

The calculations of similarity indices can also be done with labeled subgraphs of a labeled molecular graph. The points can be labeled according to the valency of the corresponding atoms (i.e. whether they are primary, secondary, tertiary, etc.), labeled with stereochemical descriptors, or labeled to reflect isotopic composition, to cite but a few examples. Furthermore, the number of similarity indices can be doubled by relaxing the stricture that only connected subgraphs be considered. We have concentrated on connected subgraphs, as they are more intuitively meaningful to the average chemist; nevertheless, for some applications the inclusion of disconnected subgraphs may be desirable or even necessary.

Similarity and Distance. Two sequences of subgraphs m and n such as those in Table I have the property that there is a built-in one-to-one correspondence between the elements of one sequence ($m_i$) and those of the other ($n_i$). Accordingly, it is straightforward to calculate various well-known (17) measures of the distance d between the sequences, e.g. Euclidean distance $[\sum_i (m_i - n_i)^2]^{1/2}$, "city block" distance
Figure 1. Selected four-carbon alcohols, abstracted as their H-suppressed molecular graphs: a 1-butanol, b isobutanol, c 2-butanol, d t-butanol, e cyclobutanol.
SM(1)     a      b      c      d      e
a       1.000  0.561  0.682  0.417    -
b              1.000  0.472  0.576    -
c                     1.000  0.472    -
d                            1.000    -
e                                   1.000

SM(2)     a      b      c      d      e
a       1.000  0.684  0.778  0.522    -
b              1.000  0.619  0.609    -
c                     1.000  0.609    -
d                            1.000    -
e                                   1.000

Figure 2. Similarity matrices SM(1) and SM(2) for the graphs in Figure 1.
Table I. Subgraph Enumeration for Some Four-carbon Alcohols.

                                     NUMBER IN GRAPH
ROW  SUBGRAPH                        a    b    c    d    e
 1   C atom                          4    4    4    4    4
 2   O atom                          1    1    1    1    1
 3   C-C                             3    3    3    3    4
 4   C-O                             1    1    1    1    1
 5   C-C-C (propane)                 2    3    2    3    4
 6   C-C-O                           1    1    2    3    2
 7   C-C-C-C (butane)                1    0    1    0    4
 8   C-C-C-O                         1    2    1    0    2
 9   C-C(-C)-C (isobutane)           0    1    0    1    0
10   C-C(-O)-C                       0    0    1    3    1
11   cyclobutane ring                0    0    0    0    1
12   a                               1    0    0    0    2
13   b                               0    1    0    0    0
14   c                               0    0    1    0    2
15   d                               0    0    0    1    0
16   e                               0    0    0    0    1
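Some of the entries in Table I can be checked by direct enumeration. The sketch below is an illustration under stated assumptions, not the procedure used by the authors: the atom numbering, the `GRAPHS` dictionary and the helper `count_paths` are all hypothetical, and only unbranched (path) subgraphs are handled. Rows 1-5 are identified in the text; reading rows 6-8 as the C-C-O, C-C-C-C and C-C-C-O paths is an inference from the counts. The branched, cyclic and whole-molecule subgraphs (rows 9-16) would need a general subgraph-isomorphism search.

```python
# H-suppressed graphs of Figure 1: atom labels plus bonds (pairs of atom indices).
# The numbering is arbitrary; atom 4 is the oxygen in every graph.
GRAPHS = {
    "a": ("CCCCO", [(0, 1), (1, 2), (2, 3), (0, 4)]),          # 1-butanol
    "b": ("CCCCO", [(0, 1), (1, 2), (1, 3), (0, 4)]),          # isobutanol
    "c": ("CCCCO", [(0, 1), (1, 2), (2, 3), (1, 4)]),          # 2-butanol
    "d": ("CCCCO", [(0, 1), (0, 2), (0, 3), (0, 4)]),          # t-butanol
    "e": ("CCCCO", [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4)]),  # cyclobutanol
}


def count_paths(atoms, bonds, pattern):
    """Count the ways a path with atom labels `pattern` can be cut out of the graph."""
    neighbours = {i: set() for i in range(len(atoms))}
    for u, v in bonds:
        neighbours[u].add(v)
        neighbours[v].add(u)

    def extend(path):
        # Grow the walk one atom at a time, never revisiting an atom.
        if len(path) == len(pattern):
            return 1
        wanted = pattern[len(path)]
        return sum(
            extend(path + [nxt])
            for nxt in neighbours[path[-1]]
            if nxt not in path and atoms[nxt] == wanted
        )

    directed = sum(
        extend([start]) for start in range(len(atoms)) if atoms[start] == pattern[0]
    )
    # A palindromic pattern such as "CCC" is found once from each end.
    if len(pattern) > 1 and pattern == pattern[::-1]:
        return directed // 2
    return directed


for pattern in ["C", "O", "CC", "CO", "CCC", "CCO", "CCCC", "CCCO"]:
    print(pattern.ljust(5), [count_paths(*GRAPHS[g], pattern) for g in "abcde"])
```

Under these assumptions the printed counts agree with rows 1-8 of Table I; for example, the four-carbon path can be cut out of the ring of cyclobutanol (e) in four ways, one for each ring bond that is left out.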
$\sum_i |m_i - n_i|$, or Hamming distance, which counts the number of positions in which the corresponding elements are unequal. It may be noted that these are measures of dissimilarity; of course, it is easy to draw conclusions about similarity from them (e.g. by taking their inverse).

Table II contains the distances calculated according to each of the definitions discussed above as applied to molecular graphs a-e. The three distance functions parallel each other quite closely: there are only two disagreements between Hamming distance and Euclidean distance, and there are no disagreements between city-block distance and Euclidean distance. There is a two-fold degeneracy within city-block distance and Euclidean distance (the same as SI(1) and SI(2)) and a four-fold one within Hamming distance, which is the crudest measure. Both city-block and Euclidean distance have only a single disagreement with SI(2), but many with SI(1); therefore, it is recommended that SI(2) or one of the distance measures that parallel it be used to index similarity.

Table II.