Topological statistics on a large structural file. Michel Petitjean and Jacques-Emile Dubois. J. Chem. Inf. Comput. Sci. 1990, 30 (3), 332–343.

Topological Statistics on a Large Structural File

MICHEL PETITJEAN and JACQUES-EMILE DUBOIS*
Institut de Topologie et de Dynamique des Systèmes (ITODYS), associé au CNRS, Université de Paris VII, 1 rue Guy de la Brosse, 75005 Paris, France

Received March 7, 1990

Statistics based upon connection tables have been determined for a large structural file. Many distributions have unexpected local maxima and minima. Parity phenomena are observed in the distribution of hydrogen and carbon. An interpretation of even and odd distributions is proposed. Some of the compounds which represent topological extremes are shown.

INTRODUCTION

Many statistics in chemistry can be developed from chemical compounds that are registered, with their associated data, in databases covering, for example, spectroscopy, biological activity, and thermodynamic properties. These parameters are a source for numerous statistical investigations and for research into correlations. Very few analyses, however, have focused on the structural data for fully characterized compounds, and they all depend upon Chemical Abstracts Service (CAS) for their source data (refs 1, 2). When large files are involved, some aspects of our basic knowledge of chemical data depend largely upon these statistics. In this paper, we present original results derived from a CAS file (ref 3) containing 3 424 428 compounds registered through July 1978.

Although statistics on cyclic and heterocyclic systems have been reported (refs 1, 2), some complementary topological information in the structural data is reported here and provides valuable information for the chemist investigating large chemical datasets. Such information is useful for the optimization of algorithms which require a statistical knowledge of topological data, such as atomic excentricities or concentric layers around a focus, which can be used, for example, when applying the Cahn-Ingold-Prelog rules in the computation of configuration. Some unexpected statistical results were obtained, and these cannot be interpreted without the use of graph theory. Most of the statistical variables, therefore, will be derived from graph theory rather than from chemical considerations.

REPRESENTATION OF CHEMICAL COMPOUNDS

The expanded formulas of the compounds in the database are coded by means of a DARC-like (ref 4) colored graph, but the statistical study is carried out without any preconceived idea concerning coding rules. The study is aimed essentially at extracting fundamental topological information from the file. The following colored graph terminology is used in the presentation of the results. In a compound formula:

the graph nodes are the atoms
the graph edges are the chemical bonds

The atoms and bonds are both colored. Each atom assumes one of the 103 colors defined by the Mendeleev Table (the 103 atomic symbols), and each bond takes one of the following values: SI (simple), TA (tautomer), AR (aromatic), DO (double), or TR (triple). In addition, the atoms are labeled with secondary chromatic information, and in this way, the complete description provided for each molecular structure includes

"Unusual" valency: unsigned value between 0 and 99, a value of zero meaning the usual valency.
Charge: signed value between -9 and +9, associated with the delocalization flag (localized charge or not).
Isotope: unsigned value giving the integer mass of the isotope. Zero means natural abundance.
Stereochemistry: may be included in the code, but is not registered in this part of the file.

0095-2338/90/1630-0332$02.50/0 © 1990 American Chemical Society


[Figure 1 graphic. Excentricity of an atom x: greatest distance from x to any atom y; e(O) = 3, e(F) = 3, e(Na) = 0. First component: radius R = Min(e(x)) = 2, diameter D = Max(e(x)) = 3. Second component: radius R = Min(e(x)) = 0, diameter D = Max(e(x)) = 0. Concentric layers around a focus (origin: depth = 0): the greatest depth is also the excentricity of the focus.]

Figure 1. Different graph theory topological variables, exemplified by sodium trifluoroacetate.

Table I. Different Sets of Observations Considered in the File

type of observation                         total observations
chemical compounds                                   3 424 428
atoms                                               77 915 142
components                                           4 019 514
shortest paths between atom pairs                2 225 906 690
concentric layers focused around atoms             898 037 816
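The quantities illustrated in Figure 1 can be reproduced with a breadth-first search. The following Python sketch is an illustration only and is not the program used for the study; the atom labels (C1, C2, F1, ...) and the function names are assumptions introduced here, and hydrogens are left implicit as in the file.

```python
from collections import deque

def bfs_depths(adj, focus):
    """Depth of every atom reachable from `focus` (concentric layers around it)."""
    depth = {focus: 0}
    queue = deque([focus])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in depth:
                depth[y] = depth[x] + 1
                queue.append(y)
    return depth

def topological_summary(adj):
    """One entry per component: its atoms, excentricities, radius R, diameter D."""
    seen, summary = set(), []
    for start in adj:
        if start in seen:
            continue
        members = set(bfs_depths(adj, start))  # atoms at finite distance from start
        seen |= members
        # Excentricity of x = greatest depth reached when x is the focus.
        exc = {x: max(bfs_depths(adj, x).values()) for x in members}
        summary.append({"atoms": members, "excentricity": exc,
                        "R": min(exc.values()), "D": max(exc.values())})
    return summary

# Sodium trifluoroacetate: CF3-COO(-) and an isolated Na(+).
# Each bond is listed once; bond multiplicity plays no role in the distances.
atoms = ["C1", "C2", "F1", "F2", "F3", "O1", "O2", "Na"]
bonds = [("C1", "F1"), ("C1", "F2"), ("C1", "F3"),
         ("C1", "C2"), ("C2", "O1"), ("C2", "O2")]
adj = {a: set() for a in atoms}
for a, b in bonds:
    adj[a].add(b)
    adj[b].add(a)

result = topological_summary(adj)
for comp in result:
    print(sorted(comp["atoms"]), "R =", comp["R"], "D =", comp["D"])
```

Under these assumptions the anion gives R = 2 and D = 3 and the sodium ion gives R = D = 0, in agreement with the values stated in Figure 1.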

A compound such as sodium fluoroacetate is composed of components, in this case the sodium ion and the fluoroacetate ion. The components in the graph of a compound are labeled. Each component carries a ratio, the fractional multiplicity coefficient of the component, which, for the first registered component, is arbitrarily set at 1:1. In addition, hydrogen atoms (except for 7957 hydrogens bearing secondary chromatisms) are implicit and are not recorded in the connection table. Thanks to the graph concept, this topological representation can express most of the expanded formulas correctly. Some variations on this concept (e.g., oriented multigraphs, recording of hydrogens, different sets of chromatisms) are possible and lead to modifications in the set of compounds that can be coded (ref 3). It was not felt necessary to code every compound in the file, but only enough compounds to allow extraction of some of the robust statistical phenomena from the database.

STATISTICAL VARIABLES

The statistical variables can be divided into two categories: (a) Distributions dependent upon the particular coding of

Table II. Number of Compounds with a Given Number of Components

components    compounds      components    compounds
     1        2 863 557           7             79
     2          533 323           8             57
     3           22 471           9              5
     4            4 128          10              5
     5              557          11              3
     6              242          12              1

the graph used on a computer. These are essentially those that depend on the internal numbering of the atoms. (b) Distributions that depend only on the chemical information carried in each graph. The number of carbon atoms, for example, does not depend upon the local coding. It should also be noted that the interpretation of the distributions depends partly on the set of nonrepresentable compounds. Every interpretation will be correct for our subfile of the CAS database, but some differences may be discerned when the full CAS file is considered. Only these latter distributions will be explored. In this way, numerous results of little value to chemists will be avoided. If a set of primary distributions has been defined, it is always possible to build many new synthetic distributions, either univariate or multivariate, affording some complex results. In order to limit the quantity of combinatorial distributions, only some of the more important ones are given here. Among these important distributions, those concerning cycles and heterocycles have been published recently (refs 1, 2) and will not be reported

334 J. Chem. In5 Comput. Sci., Vol. 30, No. 3, 1990 Table 111. Number of ComDounds with a Given Number of H Atoms H compds H compds H compds 474 0 26 667 52 10062 I04 I 9414 3 405 105 22 1 53 2 16229 106 8 050 410 54 I07 2 772 I38 55 20 I77 3 56 IO8 6 086 407 31 957 4 57 109 2 I96 I34 35 985 5 58 4 597 57 808 1IO 6 334 7 59 1744 183 Ill 60 60 I 60 3 809 285 8 91 650 1 I2 1472 9 I I3 61 I42 93 766 3 098 258 114 62 131064 IO I424 115 II 63 120 266 1 I6 64 2 593 I64 729 235 12 1 I6 1268 I I7 1 I4 65 137 056 13 14 1 18 66 2 663 181 755 190 1122 1 I9 98 67 I42 664 I5 2 382 120 195 183715 16 68 69 1168 121 I29 539 17 1 I4 70 2051 I22 I74 683 18 173 71 123 19 I I6053 1 000 1 I6 I24 1918 72 153 599 I43 20 I25 99 849 86 21 I010 73 74 126 1737 I32 270 182 22 127 1054 75 83 084 85 23 76 128 1620 113 151 120 24 77 129 735 60 046 79 25 78 I30 I635 1 IO 94 288 26 79 131 712 51 56014 27 I377 77 274 80 132 113 28 133 61 70 1 29 81 43 855 I34 108 30 82 I241 67 I80 135 89 31 83 610 34 396 1196 136 98 32 84 53 559 137 588 64 33 85 27018 34 I131 1 I7 138 86 44 IO5 139 35 87 506 20 459 52 140 36 88 877 35 731 102 141 37 89 460 15839 53 142 38 90 948 27 129 107 143 44 39 91 420 I3 070 144 112 40 92 68 1 21 244 145 41 93 54 I O 049 458 146 42 94 18383 666 I10 147 43 95 335 8 659 55 148 44 96 559 15761 140 149 45 97 31 1 7 796 67 150 46 98 14 764 90 516 151 47 99 6 334 57 280 152 48 13813 77 IO0 538 153 49 101 5 402 46 191 154 I2375 85 50 102 477 155 49 21 I 51 4666 IO3 Table IV. Number of Compounds with a Given Number of D or T Atoms D or T D compds T compds 0 3 423 017 3 424 316 1 100 515 2 525 5 4 155 3 4 90 0 5 0 28 6 0 23 7 7 1 8 I 3 9 I 0 IO 2 0 12 1 0 17 I 1

here. The aim of this paper is to complement the published results and, hopefully, provide a different insight into the chemical information that is in the file. Some brief comparisons between our statistics on the 1978 file and the dif-

PETITJEAN AND DUBOIS H I56 I57 158 159 160 161 162 163 164 I65 166 167 168 169 I70 171 172 173 I74 175 176 177 178 179 180 181 182 I83 I84 185 186 187 188 189 I90 191 192 193 194 195 196 197 198 199 200 20 1 202 203 204 205 206 207

compds 90 33 92 29 89 33 78 29 67 39 58 18

57 19 49 22 52 18

51 24 37 18 37 19 45 9 43 23 39 17 27 13 41 15 32 18 49 15 31 14 40 12 29 14 29 17 26 11

32 IO 31 15

H 208 209 210 21 1 212 213 214 215 216 217 218 219 220 22 1 222 223 224 225 226 227 228 229 230 23 1 232 233 234 235 236 237 238 239 240 24 1 242 243 244 245 246 247 248 249 250 25 1 252 253 254 255 256 257 258 259

compds 28 16 38 23 42 16 24 7 24 16 24 12 27 21 25 17 38 27 30 22 26 9 29 17 19 16 25 IO 12 15 20 11

22 9 7 9 11

7 IO 4 13 5 8 7 14 6 3 4 2 2 7 2

H 260 262 263 264 265 266 267 268 269 270 272 274 275 278 280 282 283 284 286 288 290 29 1 292 294 296 297 298 299 300 302 304 308 312 318 320 323 326 330 342 346 348 350 352 358 360 368 383 384 418 450

compds 4 5 1

3 3 3 2 5 1

6 3 2 2 2 2 2 1

2 1

3 1

2 1 1 1

2 1 1

2 1

2 1 1 1

2 1 1

2 2 1 1 1 1 1

2 1 1 1 1 1

Table V. Number of Atoms with a Given Connection Degree

connection degree                    connection degree
(no. of neighbors)      atoms        (no. of neighbors)    atoms
        0              403 027               8               629
        1           16 814 229               9               227
        2           37 072 401              10             3 281
        3           21 202 169              11                45
        4            2 357 230              12                89
        5               49 183              13                 5
        6               11 825              14                 4
        7                  198

ferent statistical results derived by CAS from the 1974, 1979, and 1987 files show only minor variations. An exception concerns the number of compounds containing either deuterium or tritium. Such compounds occur less frequently in the 1978 file. The file can be considered as a set of five different types of observation (see Table I). Each of these five types of observation is a possible topological unit, and they will be examined in turn. In order to avoid confusion, some classical


Table VI. Number of Atoms with a Given Excentricity E E atoms E atoms E atoms 22 132 100 285 0 403 027 50 19725 101 178 109 369 1 51 20 1 I5 196 102 468417 2 52 20 433 103 162 1133319 3 53

4 5 6 7 8 9 IO 11

12 13 14

15 16 17 18 -0

Figure 2. Hydrogen distribution, concerning compounds (Table III). The upper curve corresponds to even values, with an absolute maximum for 16 hydrogens (183 715 compounds). Only the range 1-75 is displayed: highest values represent less than 0.9% of the hydrogen-containing compounds.

Figure 3. Carbon distribution, concerning compounds [data from (ref 3)]. The upper curve corresponds to even values, with an absolute maximum for 12 carbons (202 276 compounds). Only the range 1-50 is displayed: highest values represent less than 1% of the carbon-containing compounds.

terms for nonoriented graphs are defined here following Berge (ref 5). These terms are also illustrated in Figure 1.

(1) The topological "distance" between two nodes (atoms) x and y is the minimum number of edges (chemical bonds) between the two nodes. Trifluoroacetic acid, for example, has a distance d(F,O) of 3 between any of the fluorines and any of the oxygens. This distance is a mathematical distance which satisfies the three axioms:

$$d(x,y) \ge 0 \quad \text{and} \quad d(x,y) = 0 \iff x = y$$
$$d(x,y) = d(y,x)$$
$$d(x,y) \le d(x,z) + d(z,y) \quad \text{(triangular inequality)}$$
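As an illustration of definition (1), the sketch below computes the topological distances in trifluoroacetic acid by breadth-first search and checks the three axioms over all triples of atoms. It is a minimal Python example under stated assumptions: the atom labels and the helper name `distances_from` are hypothetical, hydrogens are implicit, and each bond is listed once regardless of its multiplicity.

```python
from collections import deque
from itertools import product

def distances_from(adj, x):
    """Minimum number of bonds from atom x to every atom reachable from x."""
    dist = {x: 0}
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

# Trifluoroacetic acid, CF3-C(=O)OH, as an undirected graph.
bonds = [("C1", "F1"), ("C1", "F2"), ("C1", "F3"),
         ("C1", "C2"), ("C2", "O1"), ("C2", "O2")]
adj = {}
for a, b in bonds:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

d = {x: distances_from(adj, x) for x in adj}

# Set of all fluorine-oxygen distances.
fo_distances = {d[f][o] for f in ("F1", "F2", "F3") for o in ("O1", "O2")}

# Separation, symmetry and triangular inequality, checked on every triple.
axioms_hold = all(
    (d[x][y] == 0) == (x == y)
    and d[x][y] == d[y][x]
    and d[x][y] <= d[x][z] + d[z][y]
    for x, y, z in product(adj, repeat=3)
)
print(fo_distances, axioms_hold)
```

Every fluorine-oxygen pair is found at distance 3, as stated in the text, and the axioms hold for this single connected graph.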

19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49

2964236 5 335 970 7435 105 8412779 8634264 8012310 6 820 316 5 501 553 4350907 3 398 567 2 676495 2 055 809 1 629 633 1 308 694 I 045464 849 553 694321 577 229 491 954 41 5 228 354948 307910 263 759 229 265 201 747 178 890 152 623 136 269 118865 IO3 773 95490 87 563 77 472 69 122 63 I74 54804 51 488 46 829 39861 36231 33 429 29 425 29 232 27 972 24 171 23 825

54 55 56 57 58 59 60 61 62 63 64 65 66 67 68

69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99

17315 I7 120 I6 553 15 I27 15461 14 155 12808 1 1 536 IO 948 9 837 9 540 9 300 8 152 8 138 7698 6 090 6 480 6001 5 985 5912 5 254 4051 3917 3 761 3 236 2 878 2 733 2 685 2 800 2 341 2464 2 I95 2 097 2 083 1930 1716 1473 1318 846 729 707 527 485 397 396 247

104 105 106 107 108 I09 110 111

112 I I3 1 I4

115 1 I6 1 I7 I I8 1 I9 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149

159 142 1 I7 99 138 80 84 49 58 48 59 52 54 49 56 45 48 44 50 40 41 36 34 32 22 22 20 14 14 16 14 12 12 12 IO 8 6 8 6 6 6 12 4 0 0 0

When there is no path between the two nodes, the distance d(x,y) is considered to be infinite. For example, sodium trifluoroacetate has no path between the sodium and the other atoms, so d(Na,C), d(Na,F), and d(Na,O) are all infinite. Bond multiplicities do not affect the values of the distances.

(2) A finite value for the distance d(x,y) defines a relationship between x and y which is reflexive, symmetrical, and transitive. There are thus equivalence classes: the subgraph containing all the nodes of a class is called a component. Sodium trifluoroacetate, as noted, has two components: the trifluoroacetate ion and the sodium ion.

(3) In a given component, the excentricity e(x) of the node x is the maximum value of d(x,y) taken over the set of all y nodes in the component. It is also the maximum number of concentric layers focused on x (see Figure 1). In trifluoroacetate ion, e(F) = e(O) = 3 for the three fluorines and the two oxygens, and e(C) = 2 for the two carbons. Every isolated atom, such as the sodium ion, has zero excentricity: e(Na) = 0.

(4) The radius R of a component is the minimum excentricity e(x), taken from the set of all the x nodes. The node for which e(x) = R is called the centroid of the component,


Table VII. Number of Components with a Given Number of Atoms

atoms

compns

atoms

compns

atoms

compns

1

403027 9 I99 13818 30997 58 200 38069 38 526 58 142 63719 88 998 109012 128 170 143647 I54670 I58922 193884 169810 176002 167672 170321 I59 I30 I56 370 I38869 I 34663 I I6668 108 962 90361 86 570 73404 68408 56 955 53318 42 737 3997I 31 240 30003 22993 22459 17453 17600 I3751 14296 I O 970 1 I894 8 848 9 033 6 826 7 576 5 473 5 943 4 626

52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99

4 860 3 670 4 299 3 174 3 648 2 870 3 284 2514 2 947 2 285 2 341 I977 2088 I592 I917 1433 1530 1192 I280 I054 I326 961 1134 905 1037 I084 976 690 767 698 701 591 725 630 626 469 542 428 475 419 453 437 422 365 405 286 313 263 304 274 279

103 I04 IO5 106 I07

LJJ

2 3 4 5

6 7 8 9 10

II

12 13 14 15 16 17 18

19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41

42 43 44 45 46 47 48 49 50 51

IO0 101

102

108

109 1 IO

Ill

112 1 I3 I I4 115 1 I6 1 I7 1 I8

I I9 120 121 122 I23 124 125 I26 127 I28 I29 130 131 132 I33 I34 135 I36 137 138 139 I40 141 I42 I43 144 145 146 147 148 I49

150 151

152 153

but it is often not unique. Each centroid is a focus minimizing the number of concentric layers around it. The two carbons of trifluoroacetate ion are both centroids, and the radius of this ion is R = 2. The radius of an isolated node, such as the sodium ion, is always R = 0.

(5) The diameter D is the maximum excentricity e(x), taken from the set of all the x nodes. The nodes for which e(x) = D are called extremal nodes. By applying the triangular inequality to the centroid and the extremal nodes, it may be shown that D varies between R and 2R, depending on the component. Each extremal atom is a focus maximizing the number of concentric layers around it. Trifluoroacetate ion, for example, has a diameter of D = 3. The diameter of an isolated atom, like the sodium ion, is always D = R = 0. For acyclic components with an even diameter, the centroid is unique and D = 2R. For acyclic components with an odd diameter, D = 2R - 1, and there are always two centroids x1 and x2 with d(x1,x2) = 1. For most hydrocarbons and their derivatives, chemical nomenclature is closely related to the value of D, which leads to the alkane series (containing D + 1 carbons) describing the

-3ic

269 25I 215 178 212 195 165 183 I66 126 151 I95 177 155 143 148 I40 I07 124 111

111 92 109 92 95 93 70 71 80 85

83 84 65 91 61 57 51 71 51 42 60 72 65 34 52 65 64 50 45 61

atoms

I54 155 156 157 158

159 160 161 I62 163 164 165 I66 167 168 169 170 171 I72 173 174 175 176 177 I78 179 I80 181

182 183 184 185 I86 I87 I88

189 I90 191 192 193 I94 I95 I96 197 198 I99 200 201 202 203 204

compns 41

53 35 58

43 46 41 44 39 33 27 36 36 37 26 36 40 32 31 43 29 32 32 39 34 32 34 23 25 20 19 25 26 32 23 25 28 36 19 17 15

22 27 26 14 27 20 60 26 23 19

~~

atoms

compns

205 206 207 208 209 210 21 1 212 213 214 215 216 217 218 219 220 22I 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 25I 252 253 254 255

19 22 30 65 24 26 25 25 18

16 28 26 27 18

25 20 18

16 12 12 17 15 15

17 19 14 16 12

12 24 19 11

24 25 38 19 25 21 21 19 16 19 5 17 8 26 19 17 17 0 0

component. When D = 0 the compound will be a methane derivative, D = 1 is an ethane derivative, and so on.

DISTRIBUTIONS IN COMPOUNDS

A chemical compound is the natural unit usually considered by the chemist. The following statistics are derived from a database of 3 424 428 compounds (except for the carbon distribution, which is based on the 3 387 025 carbon-containing compounds in the file). The bond distribution in the file is as follows: simple, 56.0%; tautomer, 5.4%; aromatic, 30.9%; double, 7.4%; triple, 0.3%. The bond chromatism, in particular the choice between aromatic and tautomeric, is computed by the coding algorithm, which uses a set of rules that may not be optimal for all complex systems. The bond chromatism depends on this set of rules and thus on the coding algorithm. The charge distribution, which is sometimes difficult to define in delocalized systems, is in the same situation, as is the valency distribution. The concept of "usual valency" is unclear when applied to metallic elements which have numerous oxidation numbers and which can form chelates with different numbers of neighbors.
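The bond percentages quoted above can be recomputed from the mean numbers of bonds per compound listed in the paper's summary statistics (13.435 simple, 1.290 tautomer, 7.417 aromatic, 1.784 double, 0.072 triple, for a mean of 23.998 bonds). The following is a minimal Python sketch of that arithmetic; the variable names are illustrative assumptions.

```python
# Mean number of bonds of each color per compound, as listed in the summary statistics.
mean_bonds = {
    "simple": 13.435,
    "tautomer": 1.290,
    "aromatic": 7.417,
    "double": 1.784,
    "triple": 0.072,
}

# The five colors should add up to the mean total number of bonds (23.998).
mean_total = sum(mean_bonds.values())

# Share of each bond color, in percent, rounded to one decimal as in the text.
shares = {color: round(100 * n / mean_total, 1) for color, n in mean_bonds.items()}

print(round(mean_total, 3))
for color, pct in shares.items():
    print(f"{color}: {pct}%")
```

The shares obtained this way agree with the distribution given in the text, which suggests that the percentages are computed over all bonds in the file rather than averaged compound by compound.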


TOPOLOGICAL STATISTICS Table VIII. Number of Components with a Given Number of C Atoms C compns C comm C 45 3 441 0 483 57 1 90 1 46 3 893 21 838 91 40151 92 2 47 2313 25917 3 48 3 933 93 4 49 I867 61 202 94 5 52961 50 2 683 95 6 51 1678 133 323 96 7 108 259 52 2 072 97 8 139671 53 1263 98 9 152 953 54 2 096 99 198 105 IO 55 1312 IO0 178 233 I1 56 1588 101 12 21 1654 57 I116 102 13 184800 58 I214 103 14 200 48 1 59 747 104 15 I85 439 60 1346 105 16 I89 068 61 680 106 17 I57 393 62 996 107 18 160 877 63 705 108 19 134 878 64 865 109 20 I40 627 65 583 110 21 I I 9 I84 66 743 111 22 108 950 67 375 112 23 83 453 68 659 113 24 69 419 79712 114 25 70 540 55 998 I I5 26 71 31 1 54535 1 I6 27 45319 72 692 1 I7 28 43 128 73 278 118 29 30 869 74 306 1I 9 30 75 277 34 897 120 31 20818 76 438 121 32 22814 77 238 122 33 15114 78 339 I23 34 17 259 79 193 I24 35 10437 80 329 125 36 14417 81 232 126 37 7 848 82 247 127 38 9 229 83 132 128 39 6001 84 270 129 40 85 137 8 568 I30 41 86 179 4 789 131 42 87 132 7317 132 43 3819 88 210 133 44 5271 89 131 134

                                     mean       SD      local maxima
no. of atoms                        22.753    16.626    18
no. of bonds                        23.998    13.776    0;8
simple bonds                        13.435     9.760    16
tautomer bonds                       1.290     2.940    0;2;4
aromatic bonds                       7.417     7.497    0;6;12;18;24;30 ...(6*i)...
double bonds                         1.784     1.866    0
triple bonds                         0.072     0.347    0
heteroatoms                          5.953     4.822    4 (sharp)
"unusual" valencies                  0.390     0.987    0
no. of charges                       0.109     0.530    0
no. of "isotopes"                    0.006     0.107    0
no. of characters in mol formula    10.157     2.797    8;10
hydrogens (mol formula)             20.706    14.013    14;16
carbons                             16.985     9.561    12;14

compns 192 116 120 1I3 108 I I6 170 100 146 119 I26 97 95 56 79 66 83 56 95 61 73 61 48 46 61 45 47 51 43 29 66 38 47 35 34 41 36 26 29 34 46 46 50 23 42

C 135 136 137 138 139 140 141 142 143 144 I45 146 I47 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 I63 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179

compns 31 66 50 27 29 33 37 63 23 36 36 32 28 38 19 43 44 21 22 33 21 17 15 19 24 16 7 9

IO 12 5 4 3 11 9 7 3 2

8 2 4 1 6 3 0

C

comDns

180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 20 1 202 203 204 205 206 207 208 209 210 21 1 212 213 214 215 216 217 218 219 220 22 1 222 223 224

5 2 1 3 0 1

0 1

1 3 3 1

0 0 0 0 2 0 1

1 1

0 0 0 0 0 1 0 1

0 0 0 0 0 0 0 0

0 0 0 1 0 0 0 0

local minima 2;lO 2 1;3;5

9 15

13

The complete distributions for the number of compounds with a given number of components are given in Table II. The distribution of compounds with a given number of hydrogen atoms is given in Table III and Figure 2, and for a given number of deuterium and tritium atoms in Table IV. Data concerning the carbon distribution (the numbers of compounds having specific numbers of carbon atoms) were published previously (ref 3) and are displayed in Figure 3.
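Table II can be cross-checked against the totals of Table I. The Python sketch below is an illustration, not part of the original study; it assumes the reading of Table II in which 2 863 557 compounds have one component, 533 323 have two, and so on down to a single compound with twelve components, and the variable names are hypothetical.

```python
# Table II as read here: number of components -> number of compounds.
table_ii = {
    1: 2_863_557, 2: 533_323, 3: 22_471, 4: 4_128,
    5: 557, 6: 242, 7: 79, 8: 57,
    9: 5, 10: 5, 11: 3, 12: 1,
}

# Totals reported in Table I.
COMPOUNDS_TABLE_I = 3_424_428
COMPONENTS_TABLE_I = 4_019_514
ATOMS_TABLE_I = 77_915_142

total_compounds = sum(table_ii.values())
total_components = sum(k * n for k, n in table_ii.items())

mean_components = total_components / total_compounds   # components per compound
mean_atoms = ATOMS_TABLE_I / COMPOUNDS_TABLE_I          # atoms per compound

print(total_compounds == COMPOUNDS_TABLE_I)
print(total_components == COMPONENTS_TABLE_I)
print(round(mean_components, 4), round(mean_atoms, 3))
```

Both totals match Table I under this reading, and the mean number of atoms per compound agrees with the value of 22.753 listed in the summary statistics.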

Figure 4. Carbon distribution, concerning components (Table VIII). The upper curve corresponds to even values, with an absolute maximum for 12 carbons (211 654 components). Only the range 1-50 is displayed: highest values represent less than 0.9% of the carbon-containing components.

338 J . Chem. In5 Comput. Sci.. Vol. 30, No. 3, 1990 Table IX. Number of Components with a Given Number of Bonds bonds compns bonds compns bonds compns 0 403 027 48 IO 642 96 51 I 1 8 346 42 I 97 9 199 49 13538 2 404 7 950 98 50 51 3 38 I 7 007 99 30673 4 52 6 788 56 357 100 353 5 53 5 537 31 340 333 I01 102 54 5 322 33 565 34 1 6 7 55 4610 47 645 103 334 56 4515 48 743 I04 339 8 57 3 923 105 305 9 65612 58 3 596 106 252 74412 IO 59 3 198 107 258 97 394 I1 60 3 523 274 115798 12 108 61 2 944 229 119796 13 109 62 2 846 238 130 453 14 I IO 63 2813 234 139 455 15 Ill 64 2 669 244 170 998 16 112 2 292 194 I13 65 145 322 I? 188 1 I4 66 2 322 149651 18 67 2 027 200 151 243 19 115 1975 182 I I6 68 150199 20 I933 186 117 69 I45 982 21 I608 I I8 I80 70 139 803 22 1551 71 1 I9 169 133 189 23 1849 72 I29 867 I20 186 24 I265 73 I22 824 121 136 25 74 1157 26 I22 I45 I I3 002 1224 I76 75 27 I23 105 423 I114 I24 I42 76 28 94 946 77 1084 125 1 I8 84 592 29 1129 78 I26 78 780 123 30 79 127 107 31 935 69 307 I I2 128 32 962 64 629 80 129 844 103 33 57 075 81 75 I 51 943 82 98 34 130 131 46 209 83 775 I08 35 36 132 42117 84 1059 130 88 37 I33 34456 85 657 38 134 30 929 86 I06 625 135 27 140 87 96 620 39 136 24315 96 65 1 40 88 137 20089 64 41 89 566 I38 19234 105 42 90 687 I39 72 91 I5 645 43 555 I40 I4 690 82 92 44 604 77 141 13 149 93 466 45 I42 90 94 46 I I951 470 143 57 I O 424 47 95 427 Table X. Number of Components with a Given Number of Cycles cycles compns cycles compns cycles compns 0 4 931 825 21 423 42 1 0 785 193 22 112 43 9 874 538 23 99 44 0 3 0 667 176 24 92 45 4 426 887 25 71 46 0 26 51 47 5 0 184 585 6 80 28 1 4 27 50 48 7 28 33 49 29 200 0 8 1 5 145 0 29 47 50 9 7331 30 52 51 2 IO 5 320 31 12 52 0 I1 2 567 32 17 53 2 I2 2 323 0 33 9 54 13 I022 0 I 55 34 14 634 35 6 56 0 605 17 15 36 14 57 16 413 2 I 58 37 244 17 38 127 59 3 18 39 52 60 205 3 19 1756 0 40 46 61 908 20 0 41 4 62


PARITY PHENOMENA

The distributions of each of the 103 elements were calculated. It was observed that most of them decrease rapidly and

PETITJEAN AND DUBOIS bonds I44 I45 I46 I47 148 149 I50 151

152 153 154 I55 156 I57 158 159 I60 161 I62 I63 I64 165 166 167 I68 I69 170 171 I72 173 174 I75 I76 177 178 179 180 181

I82 183 I84 185 186 I87 188

I89 190 191

compns 62 54 58 59 62 59 64 47 71 56 47 45 82 36 58 44 74 42 48 36 40 27 35 34 52 24 35 31 48 30 48 31 34 31 35 19 46 28 30 I8 36 26 19 44 30 22 22 20

bonds 192 I93 194 195 I96 I97 I98 199 200 20 1 202 203 204 205 206 207 208 209 210 21 I 212 213 214 215 216 217 218 219 220 22 1 222 223 224 225 226 227 228 229 230 23 1 232 233 234 235 236 237 238 239

compns 28 30 36 22 28 18

25 22 31 27 22 27 25 26 35 21 46 15 23 14 22 25 14 28 63 33 22 21 26 12 26 30 18 19 20 16 23 9 26 14 25 21 IO 9 17 16 23 6

bonds 240 24 1 242 243 244 245 246 247 248 249 250 25 1 252 253 254 255 256 257 258 259 260 26 1 262 263 264 265 266 267 268 269 270 27 1 272 273 274 275 276 277 278 279 280 28 1 282 288

compns 16 17 32 21 16 21 38 24 23 15 13 18 18 19 10 21 7 6 11

12 19 IO 6 4 4 7 5 9 4 5 2 2 5 5 3 11

3 6 8 3 3 1

6 1

Table XI. Number of Components with a Given Minimum Nodal Connection Degree (Range 0-5) and a Maximum Nodal Connection Degree (Range 0-14)

max \ min        0            1          2      3      4      5        total
   0          403 027                                                 403 027
   1                        9 199                                       9 199
   2                       38 419     11 248                           49 667
   3                    2 047 980    109 229     38                 2 157 247
   4                    1 354 393     25 338      5     15          1 379 751
   5                        8 258        536     28    124     31       8 977
   6                        7 137        397      1     14      6       7 555
   7                           95         17      0      0      0         112
   8                          391        127      1      0      0         519
   9                          125         31      1      0      0         157
  10                        2 879        258     19      4      0       3 160
  11                           33          7      3      2      0          45
  12                           78         10      1      0      0          89
  13                            5          0      0      0      0           5
  14                            4          0      0      0      0           4
total         403 027   3 468 996    147 198     97    159     37   4 019 514

monotonically. Some exceptions are boron, with local maxima at 3, 6, and 12; oxygen, with a local maximum at 2; and fluorine, with local maxima at 3, 6, and 12; but the major exceptions, as mentioned previously (ref 3), are hydrogen and carbon. The frequency curve can be divided into two parts, one related

Table XII. Number of Components with a Given Radius

 R     compns       R     compns       R     compns
 0    403 027      25       745       50        31
 1     99 542      26       523       51         9
 2    101 434      27       447       52         9
 3    403 971      28       513       53         2
 4    671 409      29       315       54        11
 5    737 589      30       362       55        10
 6    604 704      31       263       56         4
 7    377 290      32       277       57         1
 8    227 990      33       201       58         5
 9    144 267      34       230       59         3
10     82 197      35       161       60         4
11     48 431      36       115       61         1
12     31 285      37       187       62         0
13     21 507      38       205       63         6
14     14 520      39       130       64         2
15     10 997      40       136       65         2
16      8 265      41        57       66         0
17      5 821      42        75       67         3
18      5 799      43        59       68         3
19      3 887      44        78       69         1
20      3 826      45        63       70         1
21      2 314      46       157       71         1
22      1 788      47        51       72         0
23      1 211      48        37       73         3
24        896      49        48       74         0

Table XIII. Number of Comwnents with a Given Diameter D compns D compns D compns 403027 50 329 100 0 21 51 269 1 9 483 101 1 2 92319 102 6 52 245 3 48 924 103 53 246 5 4 92432 104 54 173 1 168 808 28 1 105 55 5 2 265915 6 198 106 56 0 7 280 176 157 107 57 0 346 557 145 108 58 11 8 9 359 507 190 109 59 4 373 907 165 60 IO 1 IO 4 316658 160 61 II 111 0 276 997 62 12 92 112 0 204 392 13 104 113 0 63 169 564 14 174 64 1 I4 1 15 128014 102 65 1 I5 0 16 97 66 98 343 0 116 17 67 138 117 0 81 188 18 89 68 118 60 101 1 19 69 46 499 1 I9 1 60 20 70 96 35 240 120 1 21 71 37 26 878 121 0 21 044 22 72 68 122 1 23 73 108 I7 062 123 0 74 24 77 124 12 733 0 75 25 I52 12627 1 125 76 26 49 8 658 126 5 77 21 127 8 146 82 2 28 78 46 128 5 895 0 29 79 73 I29 6113 1 30 49 80 130 4515 1 81 31 22 131 4 591 0 32 82 32 132 3 357 0 33 3 087 83 28 133 0 84 34 45 134 2 375 1 35 2916 85 28 135 1 86 36 28 136 2 604 0 87 37 137 2 142 32 0 88 38 40 138 I642 I 89 39 2071 1 20 I39 90 40 36 140 1611 0 91 41 1212 126 141 0 92 42 1052 26 142 1 93 43 I003 11 143 0 94 44 144 745 37 0 45 637 95 145 0 14 96 46 I46 572 20 2 47 97 444 147 0 20 98 48 397 148 0 26 49 99 149 9 406 0

Table XIV. Number of Components with a Given Number of Centroids ~~

centroids I 2 3 4 5 6 7 8 9 10 II 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

compns 1 910 024 I 370 650 373 873 210908 51 341 19556 5 203 4 204 2 473 2 063 1195 1466 653 1020 55 1 692 257 512 I44 412 I59 272 98 321 76 178 74 140 48 156 21 I43 26 50 17 104 6 31 7 51 2 42 3 36

centroids 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 19 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100

11

18 3 34 8 19

compns 2 12 2

IO 1

12 2 4 0

compns

101 102 103 104 105 106 107 108 109

0 1 0 2 0

11

I IO

0 0 5 8 2 2 0 4 2 4 0 13 0

111

1

0 1 1

2 0 6 8 1 1

4 1 1

0 4 2 3 0 5 0 1

0 3 0 3 0 2

~

centroids

112 113 1 I4 1 I5 1 I6 1 I7 118 1 I9 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150

1

0 2 0 1 0 1 1 0

0 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0

to even values and the other to odd values (see Figures 2 and 3). The two frequency subcurves have similar shapes, with relative weights $W_e + W_o = 100\%$:

Hydrogens: $W_e - W_o = 17.57\%$ (26 667 compounds with no H are not included)
Carbons: $W_e - W_o = 6.04\%$ (37 403 compounds with no C are not included)

A possible explanation of this bias toward even values can be found in graph theory. The number of odd-connected nodes in a graph is always even (ref 5). If the connection degrees of the nodes are taken to be the valencies of the elements (a double bond is represented with two edges, so very few carbons will not be tetraconnected), then the set of odd-connected atoms contains a large number (70 908 564) of hydrogen atoms, most of which are monoconnected. At lower frequencies, 10 568 323 nitrogen atoms, most of them triconnected, and 1 110 863 chlorine atoms, most of them monoconnected, are also found. If monovalent hydrogen atoms were the only odd-connected atoms, every compound would contain an even number of hydrogens; because monovalent hydrogen is merely the most abundant odd-connected element, only 58.8% of the compounds contain an even number of hydrogens.
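The graph-theoretic argument can be checked on small examples. The following is a minimal sketch, not part of the original study: the two molecule encodings and the helper name `odd_degree_count` are illustrative assumptions. Each bond is entered once per unit of multiplicity, so that the degree of a node equals the valency of the atom.

```python
from collections import Counter

def odd_degree_count(edges):
    """Return the number of nodes with odd degree in a multigraph.

    `edges` is a list of (u, v) pairs; a double bond appears twice.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return sum(1 for d in degree.values() if d % 2 == 1)

# Ethylene, C2H4: the C=C double bond is entered as two edges,
# so both carbons are tetraconnected and only the hydrogens are odd.
ethylene = [("C1", "C2"), ("C1", "C2"),
            ("C1", "H1"), ("C1", "H2"), ("C2", "H3"), ("C2", "H4")]

# Methylamine, CH3NH2: the triconnected nitrogen is odd, which is
# why the hydrogen count (5) can be odd here.
methylamine = [("C", "N"), ("C", "H1"), ("C", "H2"), ("C", "H3"),
               ("N", "H4"), ("N", "H5")]

for name, mol in [("ethylene", ethylene), ("methylamine", methylamine)]:
    print(name, odd_degree_count(mol))  # 4 and 6: always an even number
```

In the second example the odd hydrogen count is compensated by the single odd-connected nitrogen, which is the mechanism invoked in the text to explain why the share of even-hydrogen compounds falls short of 100%.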

PETITJEAN AND DUBOIS

340 J. Chem. Inf. Comput. Sci., Vol. 30, No. 3, 1990

Table XV. Number of Components with a Given Number of Extremal Atoms

ext atoms 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

compns ext atoms compns ext atoms compns 0 2 101 403 027 51 I 8 I02 I 204 847 52 0 I03 2 53 1035 109 I 104 809 145 6 54 105 1 0 266 572 55 0 106 11 167416 56 0 107 0 57 65 846 I 108 3 58 33 472 0 109 59 0 13 352 8 1 IO I 60 5 784 61 0 Ill 0 1328 112 I 62 0 6 145 1 I3 0 63 3 604 64 I I4 0 6 1642 115 2 0 65 702 1 1 I6 0 66 645 67 0 117 0 I70 3 118 1 I003 68 0 I I9 0 69 I IO 4 70 120 1 34 I 0 121 71 0 27 1 9 122 72 0 144 I23 0 0 73 43 124 74 0 0 322 I25 41 0 0 75 I26 99 I 0 76 I27 71 77 1 0 0 I28 3 78 493 129 0 79 0 35 I30 I30 3 0 80 131 12 81 8 0 82 0 132 0 Ill 83 1 0 I33 24 84 134 2 1 36 I 85 135 0 16 136 1 I 86 96 87 I37 0 0 5 4 88 138 0 20 89 139 I 0 11 90 140 1 0 28 91 141 0 0 2 142 0 0 92 42 0 143 93 0 0 144 94 I 0 28 145 0 0 95 11 I46 2 0 96 12 147 97 I 0 0 148 0 0 98 28 149 99 0 0 8 0 6 100 0 I50

A similar explanation may hold for the carbon distribution. If the connection degrees of the nodes are taken to be the numbers of neighbors of the elements (double bonds are represented here with only one edge, so carbons may be either even- or odd-connected), then the set of odd-connected atoms contains a large number of triconnected carbons: there are 57 528 231 carbons in the file, 57% of which are sp2 (ref 6). This is supported by the high number of benzenoid compounds; the benzene ring is by far the most common ring system in the database, about 20 times more common than the next most abundant ring (pyridine) (ref 2). Monoconnected or pentaconnected carbons are very rare. There is also a large set of monoconnected hydrogens and, at a much lower frequency, odd-connected oxygens (5 820 786 oxygens in the database, some of them, such as those of alcohols and other hydroxy compounds, being even-connected). There are also odd-connected nitrogens and so on. If sp2 (triconnected) carbons and monovalent hydrogens were the only odd-connected elements, one would observe 100% of compounds with an even (carbon plus hydrogen) count and thus 100% of even-carbon-containing compounds because, as shown previously, the number of hydrogens would also be even. Since sp2 (triconnected) carbons and monovalent hydrogens are the most

Figure 5. Some topologically extremal compounds in the file. Polyethers RN = 1319-63-7 and RN = 42892-32-0: diameter D = 146 and radius R = 73. Polyimine RN = 65000-54-6: R = 73, with 140 centroids. When focused on the methylene group, the deepest concentric layer of the silane RN = 30859-25-7 has 78 atoms.

abundant odd-connected elements, one observes only 53.0% of even-carbon-containing compounds. The parity phenomenon is also observed in the atom distribution, i.e., the number of compounds with a given number of atoms, excluding hydrogen. The difference $W_e - W_o = 2.0\%$ is weak and may be related to the parity of the carbon distribution.
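The even and odd weights used throughout this discussion can be computed directly from a frequency distribution. The sketch below is an illustration under stated assumptions: the function name `parity_weights` and the toy histogram are hypothetical, and entries with a count of zero atoms are excluded by default, as is done in the text for compounds with no H or no C.

```python
def parity_weights(histogram, exclude_zero=True):
    """Return (W_even, W_odd) in percent for a {value: frequency} histogram."""
    even = odd = 0
    for value, frequency in histogram.items():
        if exclude_zero and value == 0:
            continue  # e.g. compounds containing none of the element
        if value % 2 == 0:
            even += frequency
        else:
            odd += frequency
    total = even + odd
    return 100.0 * even / total, 100.0 * odd / total

# Hypothetical toy distribution: number of atoms -> number of compounds.
toy = {0: 5, 1: 10, 2: 30, 3: 20, 4: 40}
w_even, w_odd = parity_weights(toy)
print(w_even, w_odd, w_even - w_odd)  # 70.0 30.0 40.0
```

With this definition the two weights always sum to 100%, so a reported difference such as 6.04% corresponds to an even share of about 53.0%.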

DISTRIBUTIONS OF ATOMS

The second type of observation to be investigated is the set of all atoms. It may be recalled that hydrogens (except for the 7957 special hydrogen atoms mentioned previously) were not recorded and that multiple bonds were recorded as single edges. The following statistics refer to the 77 915 142 atoms:

connection degree (no. of neighbors) "unusual" valency algebraic charge isotope excentricity

mean

SD

local maxima

local minima

2.109

0.793

2;10

7

0.094 0.172 0.016 10.526

0.994 2.526 1.506 6.829

5

26

4-1

14 0;8

1


Table XVI. Number of Paths with a Given Length

length  paths          length  paths      length  paths
0       77 915 142     49      932 612    98      5 532
1       164 357 002    50      904 376    99      4 542
2       229 244 876    51      822 726    100     4 428
3       237 692 512    52      764 290    101     3 692
4       221 359 272    53      726 886    102     3 426
5       202 636 784    54      666 188    103     2 866
6       179 934 712    55      611 690    104     2 758
7       152 706 078    56      583 028    105     2 358
8       127 099 050    57      525 988    106     2 184
9       102 337 698    58      482 148    107     1 840
10      82 488 910     59      452 832    108     1 866
11      66 382 512     60      406 912    109     1 548
12      53 557 820     61      371 236    110     1 560
13      43 501 662     62      353 212    111     1 332
14      36 087 700     63      317 674    112     1 304
15      29 671 486     64      292 086    113     1 146
16      25 029 242     65      276 096    114     1 138
17      21 479 812     66      247 612    115     1 008
18      18 251 854     67      221 624    116     972
19      15 719 910     68      208 694    117     872
20      13 859 090     69      187 798    118     840
21      12 031 596     70      176 378    119     726
22      10 584 384     71      161 236    120     694
23      9 451 346      72      144 684    121     614
24      8 251 666      73      128 382    122     578
25      7 330 208      74      119 352    123     516
26      6 692 854      75      104 168    124     456
27      5 926 080      76      95 674     125     392
28      5 320 062      77      87 854     126     348
29      4 847 706      78      78 998     127     306
30      4 314 768      79      71 760     128     258
31      3 890 326      80      68 252     129     230
32      3 634 336      81      60 444     130     204
33      3 274 204      82      55 406     131     180
34      3 002 064      83      50 674     132     166
35      2 787 416      84      45 004     133     152
36      2 498 736      85      39 322     134     134
37      2 278 066      86      35 996     135     120
38      2 168 938      87      29 642     136     110
39      1 956 388      88      25 780     137     94
40      1 786 716      89      20 948     138     84
41      1 675 338      90      17 510     139     74
42      1 521 776      91      14 002     140     80
43      1 405 354      92      12 386     141     60
44      1 362 388      93      10 258     142     64
45      1 243 010      94      9 138      143     56
46      1 160 560      95      7 750      144     64
47      1 099 932      96      7 022      145     24
48      1 003 772      97      5 848      146     4

It should be noted that 0.48% of the atoms bear a charge. Of these charges, 57.7% are positive and localized (55% are +1); 4.9% are also positive, but delocalized; 34.7% are negative and localized (34.5% are -1); while 2.7% are negative and delocalized. The complete distributions are given for the number of atoms with a given connection degree in Table V, and with a given excentricity in Table VI. It should also be noted that the values of the excentricities do not depend on bond multiplicity, but would be increased by one unit for most organic compounds if hydrogens were recorded. The excentricity of an atom x varies from R (centroid excentricity) to D (extremal atom excentricity) and is also the maximal number of concentric layers focused around x.
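The relations between excentricity, radius R, diameter D, centroids, and extremal atoms can be illustrated with a breadth-first search on a hydrogen-suppressed graph. This is a minimal sketch under assumptions: the adjacency-list encoding and the names `bfs_distances`, `topological_summary`, `path_graph`, and `ring_graph` are hypothetical and are not taken from the authors' software.

```python
from collections import deque

def bfs_distances(adj, source):
    """Topological distances from `source` to every atom of its component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def topological_summary(adj):
    """Return excentricities, radius, diameter, centroids, extremal atoms."""
    ecc = {x: max(bfs_distances(adj, x).values()) for x in adj}
    radius = min(ecc.values())
    diameter = max(ecc.values())
    centroids = [x for x, e in ecc.items() if e == radius]
    extremal = [x for x, e in ecc.items() if e == diameter]
    return ecc, radius, diameter, centroids, extremal

def path_graph(n):
    """Unbranched chain of n atoms (n >= 1)."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def ring_graph(n):
    """Monocyclic ring of n atoms (n >= 3)."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

# Five-atom chain: R = 2, D = 4, one centroid, two extremal atoms.
print(topological_summary(path_graph(5))[1:])
# Six-membered ring: R = D = 3, every atom is both centroid and extremal.
print(topological_summary(ring_graph(6))[1:])
```

The ring case shows why a large monocycle has as many centroids as extremal atoms: all of its atoms share the same excentricity.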

DISTRIBUTIONS IN COMPONENTS

The component is the most useful topological unit defined in a large set of graphs because there is no connection between two atoms which belong to different components. The following statistics for components are derived from a database of 4 019 514 compounds.

Table XVII. Number of Layers with a Given Depth

depth  layers        depth  layers     depth  layers
0      77 915 142    49     376 311    98     3 302
1      77 512 115    50     352 486    99     2 906
2      77 402 746    51     330 354    100    2 659
3      76 934 329    52     310 629    101    2 374
4      75 801 010    53     290 514    102    2 196
5      72 836 774    54     270 081    103    2 000
6      67 500 804    55     252 766    104    1 838
7      60 065 699    56     235 646    105    1 679
8      51 652 920    57     219 093    106    1 537
9      43 018 656    58     203 966    107    1 420
10     35 006 346    59     188 505    108    1 321
11     28 186 030    60     174 350    109    1 183
12     22 684 477    61     161 542    110    1 103
13     18 333 570    62     150 006    111    1 019
14     14 935 003    63     139 058    112    970
15     12 258 508    64     129 221    113    912
16     10 202 699    65     119 681    114    864
17     8 573 066     66     110 381    115    805
18     7 264 372     67     102 229    116    753
19     6 218 908     68     94 091     117    699
20     5 369 355     69     86 393     118    650
21     4 675 034     70     80 303     119    594
22     4 097 805     71     73 823     120    549
23     3 605 851     72     67 822     121    501
24     3 190 623     73     61 837     122    457
25     2 835 675     74     55 925     123    407
26     2 527 765     75     50 671     124    367
27     2 264 006     76     46 620     125    326
28     2 034 741     77     42 703     126    290
29     1 832 994     78     38 942     127    256
30     1 654 104     79     35 706     128    224
31     1 501 481     80     32 828     129    202
32     1 365 212     81     30 095     130    180
33     1 246 347     82     27 410     131    160
34     1 142 574     83     24 610     132    146
35     1 047 084     84     22 269     133    132
36     959 521       85     19 805     134    116
37     882 049       86     17 610     135    102
38     812 927       87     15 513     136    90
39     749 753       88     13 430     137    78
40     694 949       89     11 500     138    66
41     643 461       90     9 784      139    56
42     596 632       91     8 311      140    48
43     556 771       92     6 993      141    42
44     520 540       93     6 147      142    34
45     487 111       94     5 418      143    28
46     457 686       95     4 711      144    22
47     428 454       96     4 184      145    16
48     400 482       97     3 699      146    4

                              mean      SD
atoms (without H)             19.384    13.343
hydrogens (mol formula)       17.640    14.408
carbons                       14.312    10.402
bonds                         20.445    14.670
cycles (see also ref 2)       2.061     1.899
min connection degree         0.937     0.365
max connection degree         3.042     1.157
radius of components          5.090     3.193
diameter of components        9.634     6.258
centroids                     1.825     1.410
extremal nodes                3.144     1.743

local maxima

local minima

1;5;16 0;1;3;14;16 10;12;14 0;4;16 1;3 1;4 3;10

2;6;17 2;5;15 11;13 5 2 3 7

0;5

1 1;3

0;2;10 1

2

Complete distributions are given for atoms in Table VII, for carbons in Table VIII, for bonds in Table IX, for cycles in Table X, for minimum and maximum connection degrees (bivariate) in Table XI, for radii of components in Table XII, for their diameters in Table XIII, for multiplicities of centroids in Table XIV, and for multiplicities of extremal atoms in Table XV.

The distribution of hydrogen atoms in components shows the same parity phenomenon as it did in compounds, except for components containing two hydrogen atoms, where a local

Table XVIII. Number of Layers with a Given Number of Atoms

atoms  layers         atoms  layers
1      274 232 111    25     588
2      265 064 075    26     1 340
3      175 572 167    27     818
4      96 901 365     28     2 452
5      45 178 432     29     118
6      23 444 257     30     435
7      9 423 141      31     219
8      4 358 166      32     459
9      1 866 522      33     80
10     910 094        34     12
11     397 662        35     38
12     312 082        36     117
13     102 025        37     13
14     101 255        38     4
15     56 960         39     295
16     31 880         40     26
17     16 824         41     210
18     28 884         42     281
19     7 251          45     8
20     7 253          54     222
21     9 610          56     6
22     2 177          60     2
23     670            78     1
24     5 209

minimum is observed, and components with three hydrogen atoms, where a local maximum is noted. This may be a consequence of the existence of numerous components such as HCl or OH-, which have only one hydrogen, or those such as NH3 or H3PO4, which have three hydrogens. If components with 0, 1, 2, or 3 hydrogens are removed from the analysis, the relative difference between even and odd distributions is $W_e - W_o = 17.78\%$, which is very close to the 17.57% observed for compounds.

The distribution of carbon in components, shown in Figure 4, shows the same parity phenomenon as that in compounds. In components, the relative difference between even and odd distributions is $W_e - W_o = 7.99\%$ (483 571 components with no carbon were not included). This difference is larger than the 6.04% observed for the distribution in compounds. A similar explanation is thought to apply, but the phenomenon may be less pronounced in multicomponent compounds. If one considers an infinite population of components and makes the following assumptions:

(a) an even component has a constant appearance probability p, and an odd component has a constant appearance probability q = 1 - p;
(b) k-multicomponents are built at random from k independent components;

then each k-multicomponent has a probability p(k) of being even and a probability q(k) = 1 - p(k) of being odd, with p(1) = p, q(1) = q, and

$$p(k) - q(k) = (p - q)^k$$

Since $0 < (p - q) < 1$, then $k > 1 \Rightarrow [p(k) - q(k)] < (p - q)$. This means that multicomponent compounds will have a parity difference that is lower than that for monocomponent compounds. Neglecting the set of compounds with components that are devoid of carbon and using p - q = 7.99% in conjunction with the data in Table II, these assumptions lead to a bias $W_e - W_o = 6.78\%$, which corresponds to 232 229.8 compounds. The true percentage is 6.04%, corresponding to 204 655 compounds, and so the observed parity bias is smaller than the calculated one.

This residual discrepancy is due either to the presence in the Table II data of compounds with carbon-free components or to the inadequacy of the assumptions that have been made. The distributions of each of the other elements in components have also been computed and were found to be close to the corresponding distributions in compounds.
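The damping of the parity bias in multicomponent compounds follows from assumptions (a) and (b) alone and can be checked numerically. The sketch below is illustrative: the function name `parity_probabilities` is hypothetical, only the component value p - q = 7.99% is taken from the text, and the weighting by the Table II data that leads to 6.78% is not reproduced because that table is not part of this section.

```python
def parity_probabilities(p, k):
    """Return (p_k, q_k): probabilities that the total count over k
    independent components is even or odd, when each component is
    even with probability p and odd with probability 1 - p."""
    q = 1.0 - p
    p_k, q_k = 1.0, 0.0  # zero components: the empty sum is even
    for _ in range(k):
        # The total is even after adding an even component to an even
        # total, or an odd component to an odd total.
        p_k, q_k = p_k * p + q_k * q, p_k * q + q_k * p
    return p_k, q_k

diff = 0.0799            # p - q for components, as reported in the text
p = (1.0 + diff) / 2.0   # follows from p + q = 1

for k in range(1, 5):
    p_k, q_k = parity_probabilities(p, k)
    # The second and third columns agree: p(k) - q(k) = (p - q)**k.
    print(k, 100.0 * (p_k - q_k), 100.0 * diff ** k)
```

For k = 2 the bias already falls below 1%, so under these assumptions the overall bias of a mixed population is dominated by its monocomponent compounds.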


The atom distribution, i.e., the number of components with a given number of atoms, excluding hydrogens, has a weak bias $W_e - W_o = 2.8\%$ (the 403 027 monoatomic components being excluded), and this may be explained in the same way as the atom distribution in compounds. The number of components with one atom (0 bonds, 0 cycles, R = D = 0, only 1 extremal node) is about 10% of the total number of components. Some 47% of these monoatomic components are either Cl- or HCl. The extreme value of the diameter D = 146 (Table XIII) is given by two polyethers (RN = 1319-63-7 and RN = 42892-32-0) whose structures are shown in Figure 5. The extreme value of the radius R = 73 (Table XII) is also due to these two compounds, together with the polyimine (RN = 65000-54-6), which is also shown in Figure 5. The extreme value of the number of centroids (Table XIV) is due to this same polyimine, which has 140 centroids. The extreme value of the number of extremal atoms (Table XV) arises from the macrocyclic compound (CH2)136 (RN = 63217-83-4), which has 136 extremal atoms and the same number of centroids.

DISTRIBUTIONS OF PATHS AND CONCENTRIC LAYERS

Although many processing algorithms that analyze the chemical environment of a focus need to be statistically optimized, there is a paucity of data pertaining to environments. The following statistics refer to the 2 225 906 690 shortest paths from each atom (x) to every atom (y) of the same component, x itself included. There are n*n paths in a component with n atoms, and the length of a specific path is the distance d(x,y).

               mean    SD      local maxima   local minima
path lengths   7.489   7.799   3

The complete distribution is given in Table XVI. The number of paths with zero length is the same as the number of atoms in the whole file. For every other length, the number of paths is even because both d(x,y) and d(y,x) are counted. The following statistics concern the 898 037 816 concentric layers. Each atom x in each component can be viewed as the origin of a set of concentric layers, each atom y in a layer being at a given distance d(x,y) from the atom at the origin.

                                          mean    SD      local maxima   local minima
distances from origin (depth of layers)   7.286   7.956   1
atoms in layer                            2.479   1.526   1
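The way paths and concentric layers are counted can be sketched for a single small component. The code below is an illustration under assumptions, not the authors' program: the adjacency-list encoding and the names `distances_from` and `path_and_layer_counts` are hypothetical. Every ordered pair (x, y) contributes one path, and every distinct distance from an origin x contributes one layer.

```python
from collections import Counter, deque

def distances_from(adj, source):
    """Breadth-first distances d(source, y) within one component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_and_layer_counts(adj):
    """Count paths by length, layers by depth, and layers by size."""
    paths_by_length = Counter()
    layers_by_depth = Counter()
    layers_by_size = Counter()
    for x in adj:
        dist = distances_from(adj, x)
        paths_by_length.update(dist.values())  # one path per ordered pair
        layer_sizes = Counter(dist.values())   # depth -> atoms in the layer
        for depth, size in layer_sizes.items():
            layers_by_depth[depth] += 1
            layers_by_size[size] += 1
    return paths_by_length, layers_by_depth, layers_by_size

# Three-atom chain (propane skeleton, hydrogens suppressed).
propane = {0: [1], 1: [0, 2], 2: [1]}
paths, depths, sizes = path_and_layer_counts(propane)
print(dict(paths))   # {0: 3, 1: 4, 2: 2}: n*n = 9 paths in total
print(dict(depths))  # {0: 3, 1: 3, 2: 2}: 8 layers in total
print(dict(sizes))   # {1: 7, 2: 1}
```

In this toy case the counts of nonzero-length paths are even, the zero-length count equals the number of atoms, and summing (atoms in layer) over all layers recovers the number of paths, which mirrors the relations stated for the whole file.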

The two complete distributions are given in Tables XVII and XVIII. Again, the zero distance corresponds to the number of atoms in the whole file. The number of concentric layers is about half the number of paths because all the distances from an origin to all the atoms that constitute a layer are counted as one layer depth. The number of layers with increasing numbers of atoms does not decrease monotonically after 17 atoms. The extreme value of 78 atoms is due to the disilylmethane (RN = 30859-25-7) shown in Figure 5.

MULTIVARIATE DISTRIBUTIONS

Many bivariate or trivariate statistics are not reported here because the appropriate graphical display is too large to be conveniently represented. Among the bivariate distributions in components, those for atoms-bonds, atoms-cycles, bonds-cycles, radius-diameter, and centroids-extremal nodes were computed. The first three of the bivariate distributions are the three projections of the atoms-bonds-cycles distribution, which lies in a plane defined by

atoms - bonds + cycles = no. of components of the graph = 1

None of these distributions is readily printed. To do so graphically would require a diagram 70 x 70 cm, and an alphanumeric description would require hundreds of value-value-frequency 3-tuples. The bivariate distribution of the number of layers with a given number of atoms at a given depth has also been computed and presents the same display problems. All of these bivariate distributions show different local maxima and minima, which define clusters of chemical structures and largely nonexistent chemical entities. It is hoped that multivariate exploration will reveal much more than the univariate or bivariate distributions. For example, the atom-bond-cycle-radius-diameter-centroid-extremal node distribution would be expected to show many local extrema, which could be interpreted as families of chemical structures in a statistical classification of the components. The problem is to build an algorithm in a multidimensional space from which the number of groups and their shapes can be computed. This problem is far from a solution (ref 7).
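The plane relation atoms - bonds + cycles = 1 can be verified on individual skeletons and, because it is linear, on the mean values reported for components. This is a minimal sketch; the function name `cycle_count` is an assumption, and the two skeletons are ordinary textbook examples rather than entries singled out in the study.

```python
def cycle_count(n_atoms, n_bonds, n_components=1):
    """Number of independent cycles of a graph (cyclomatic number)."""
    return n_bonds - n_atoms + n_components

print(cycle_count(6, 6))    # benzene skeleton: 1 cycle
print(cycle_count(10, 11))  # naphthalene skeleton: 2 cycles

# Mean values per component as reported in the statistics above.
mean_atoms, mean_bonds, mean_cycles = 19.384, 20.445, 2.061
print(round(mean_atoms - mean_bonds + mean_cycles, 3))  # 1.0
```

The fact that the reported means satisfy the relation to three decimals is a useful consistency check on the component statistics.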

CONCLUSION

Interpretations have been offered for some of the distributions that have been determined experimentally in this paper, but many of them show unexpected local maxima or minima, or exhibit unusual parity phenomena. It seems clear that the composition of the database cannot be understood simply. Some explanations of the parity phenomena have been advanced, but a full interpretation of the various local extrema remains rather difficult. The formulation of many algorithms dealing with large data sets can be optimized by means of statistical considerations, for example, substructure searching strategies or the organization of computerized chemical libraries. Searching and organizing substructure data spaces could be achieved advantageously by considering the statistical weight of the data and by avoiding the use of models with poorly defined working spaces. Such working spaces are usually defined pragmatically or by trial and error for specific applications. The predictive space of many quantitative structure-activity relationship endeavors, for example, is usually vague outside the original training set.

The statistics reported here also represent a first step toward a better understanding of the composition of the complete corpus of organic chemistry. They show the salient features that are encountered in the derivation of an organic chemical classification based upon statistical and topological methods, rather than upon nomenclatural concepts which stem from the chemist's perception of small sets of data (e.g., the concept of functional groups). Such a classification could lead to new tools for computer-managed structure elucidation and computer-aided property correlation, because the topological approach to molecular structure is more closely bound to chemical and spectroscopic properties than are classical nomenclature systems.

REFERENCES AND NOTES

(1) Stobaugh, R. E. J. Chem. Inf. Comput. Sci. 1980, 20, 76.
(2) Stobaugh, R. E. J. Chem. Inf. Comput. Sci. 1988, 28, 180.
(3) Petitjean, M.; Dubois, J. E. Collect. Czech. Chem. Commun. 1990, 55, 1404.
(4) Dubois, J. E.; Panaye, A.; Attias, R. J. Chem. Inf. Comput. Sci. 1987, 27, 74.
(5) Berge, C. Graphes et Hypergraphes; Dunod Université: Paris, 1973. ISBN 2-04-009755-4.
(6) Petitjean, M. Unpublished results.
(7) Everitt, B. S.; Hand, D. J. Finite Mixture Distributions; Monographs on Applied Probability and Statistics; Chapman & Hall: London, 1981. ISBN 0-412-22420-8.