An Additive Definition of Molecular Complexity

Feb 9, 2016 - Department of Chemistry and Konstanz Research School Chemical Biology, University of Konstanz, Universitätsstrasse 10, 78457. Konstanz ...
Article pubs.acs.org/jcim

Thomas Böttcher*

Department of Chemistry and Konstanz Research School Chemical Biology, University of Konstanz, Universitätsstrasse 10, 78457 Konstanz, Germany

Supporting Information

ABSTRACT: A framework for molecular complexity is established that is based on information theory and consistent with chemical knowledge. The resulting complexity index Cm is derived from abstracting the information content of a molecule by the degrees of freedom in the microenvironments on a per-atom basis, allowing the molecular complexity to be calculated in a simple and additive way. This index allows the complexity of any molecule to be universally assessed and is sensitive to stereochemistry, heteroatoms, and symmetry. The performance of this complexity index is evaluated and compared against the current state of the art. Its additive character also gives consistent values for very large molecules and supports direct comparisons of chemical reactions. Finally, this approach may provide a useful tool for medicinal chemistry in drug design and lead selection, as demonstrated by correlating molecular complexities of antibiotics with compound-specific parameters.



Received: December 7, 2015. Published: February 9, 2016. © 2016 American Chemical Society. DOI: 10.1021/acs.jcim.5b00723. J. Chem. Inf. Model. 2016, 56, 462−470.

INTRODUCTION

Complexity is an important yet controversial concept in organic chemistry with implications ranging from drug development and synthesis planning to the basic understanding of life.1,2 There is growing interest in measuring and correlating complexity on the molecular scale, and the term complexity is commonly used to describe chemical structures of natural products and synthetic organic compounds. In many cases, however, it remains unclear what the complexity of a structure actually means, and the use of the term is based on subjective perception rather than on objective criteria. Peer-based assessment, in which experienced chemists rank the complexities of compounds, has been used to assess the synthetic complexity, accessibility, and drug-likeness of different chemical compounds.3−5 However, the individual chemists’ perception of the complexity of a given chemical structure was subject to extensive variation, demonstrating that these intuition-based methods, although useful to assess extrinsic complexity, are not suitable to measure the intrinsic structural complexity of compounds.4,6,7 Here extrinsic complexity is dependent on external conditions such as the ease of synthesis, while intrinsic complexity is determined only by the chemical structure.

Initial mathematical approaches investigating molecular topology were introduced by Randić8 and Bonchev.9 The first general index of molecular complexity was developed by the pioneering work of Bertz, who applied a combination of graph theory and information theory to determine molecular topology.10 Here a molecule is represented by a hydrogen-suppressed skeletal molecular graph composed of pairs of adjacent lines corresponding to the number of connections, η.10 After symmetry corrections, the term C(η) for skeletal complexity is combined with a term C(ε) for elemental diversity in a molecule to give Bertz’s index C(η,ε). A simplified form of Bertz’s formula that can be applied more easily for manual and computational calculations was given by Hendrickson, Huang, and Toczko.11 Bertz’s complexity index C(η,ε) is still perhaps the most popular measure of molecular complexity today. It has resulted in numerous interesting applications11−13 and is listed for every chemical structure in the public repository PubChem, a widely used chemical substance and structure database.14 Recently, molecular complexity was used to correlate odorants with the number of olfactory notes of a compound.15 Molecular complexity also has important implications for organic synthesis planning, in silico drug design, and pharmaceutical development including QSPR and QSAR approaches.16−19

Thus, a large number of different approaches have been dedicated to the development of mathematical concepts and indices to quantify molecular complexity. Most of these indices are based on graph theory, whereby chemical structures are transformed into skeletal molecular graphs and different techniques are applied to count their subgraphs. The most important of these are subgraph-based indices developed by Bonchev and later also by Bertz, such as the total number of subgraphs (NT) and the number of different kinds of subgraphs (NS).20−22 Bonchev also introduced the concept of overall connectivity TC and TC1 as topological complexity indices based on connected subgraphs.23,24 Other measures of complexity include the total walk count (twc) proposed by Rücker.25,26 A plethora of other topological indices have been proposed, including for example the approaches of Balaban, Bonchev, Lu, Randić, and Ren.9,27−31 However, graph-theory-based indices mainly address skeletal complexity and have been criticized for facing several limitations.16,28 As will be shown in the present work, graph theory may overall not be an ideal measure of molecular complexity, although some recent approaches trying to account for symmetry have been reported.32

A second type of index that can be distinguished from those based on graph-theoretical approaches are substructure-based indices.7 The latter, such as those of Whitlock33 and Barone and Chanon,34 have been developed on an entirely empirical basis to assess molecular complexity by counting selected features such as the numbers of heteroatoms, rings, chiral centers, and double bonds and multiplying them by arbitrary yet empirically optimized weighting factors. A similar metric includes the electronegativities of the atoms and the numbers of rings and chiral centers.35 In order to assess complexity in the context of ligand recognition, Hann developed a model combining connectivity and substructure descriptors.7,19

This vast abundance of different approaches and indices for molecular complexity reflects the fact that no individual index has yet been completely satisfactory and each has its own limitations.21,34 Frequently criticized shortcomings are the failure to address chirality in graph-theoretical approaches and the lack of sensitivity to skeletal structure, branching, and symmetry in other indices.16,21,28 Inspired by the systematic approach of Bertz’s work, I here propose a framework for molecular complexity relying on both mathematical rigor and chemically consistent internal logic. The intrinsic complexity of a molecule is hereby calculated using the principles of information theory by sampling the information content in the chemical microenvironment of every individual atom of a molecule. This approach results in an entirely additive index that avoids the biases resulting from graph-theoretical approaches and includes considerations of symmetry and stereochemistry.
In the following, the performance of the newly established index will be compared with those of other indices in order to demonstrate its universal applicability and conceptual power. It will be shown that the new index with its additive character allows simple manual calculations and provides an elegant solution for molecular complexity.

to an existing sequence of binary code does not change the information content of the former sequence, although the overall information content increases. Thus, molecular complexity should also be additive, which can be accomplished only by assessing the complexity on a per-atom basis. Accordingly, every atom in a molecule contributes to the total complexity of the molecule by its information content, which is a function of site-specific descriptors of its chemical microenvironment. Translating the principles of information theory into a metric for molecular complexity leads to discrete variables on a per-atom basis that reflect the degrees of freedom or number of possible states at every atom position in a molecule. These variables can be deduced from basic chemical considerations and are selected in the form of a necessary and sufficient set of parameters to fully describe the microenvironment and bonding situation at every atom in a molecule. In general, the information content increases with the number of possible states, while the trivial solution of a variable does not contribute information. For example, hydrogen is the most abundant element in the universe, and any free valence of other elements is by default saturated by hydrogen. Consequently, an atom position in a molecule occupied by hydrogen is the trivial solution and is not counted. One variable is needed to identify the nature of the element by its valence shell, and four variables are required as descriptors of the bonding environment: the number of bonds, the number of chemically different bonds, the element diversity, and the stereochemistry. The number of valence electrons Vi of a neutral element at position i is used as a basic descriptor for an element that takes into account the number of theoretical possibilities for bond connections an atom can have. Applying log2(Vi) increases the contributions to complexity for elements from left to right in the periodic table.
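The two ideas in this paragraph can be illustrated with a minimal Python sketch. It treats a position with N equally likely states as carrying log2(N) bits and applies the same function to the valence-electron term log2(Vi). The function name `bits` and the small `VALENCE` table are illustrative assumptions and are not part of the original work; the valence-electron counts themselves are the standard values for the neutral atoms.

```python
import math

def bits(n_states: int) -> float:
    """Information content (in bits) of one position with n equally likely states."""
    return math.log2(n_states)

# Valence electrons of the neutral atoms (illustrative subset, assumed table name).
VALENCE = {"H": 1, "B": 3, "C": 4, "N": 5, "O": 6, "F": 7}

# Additivity: two independent binary positions carry as much information
# as a single position with four equally likely states.
two_binary_positions = bits(2) + bits(2)
one_four_state_position = bits(4)

# log2(Vi) per element: hydrogen is the trivial case (log2(1) = 0),
# and the value rises from left to right within a period.
element_bits = {element: bits(v) for element, v in VALENCE.items()}
```

In this sketch hydrogen contributes nothing, which matches the statement that hydrogen positions are not counted.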
The variable for bond connectivity is given by the total number of bonds bi to atoms with Vibi > 1 (non-hydrogen atoms) at the ith atom of a molecule. As hydrogen typically has the possibility of contributing only one electron to a bond (vide infra), its information content calculated as log2(Vibi) becomes zero, which has the practical side effect that hydrogen atoms can be omitted from complexity calculations, as has been a general practice in other approaches.8,10 While the parameter bi characterizes the general bonding situation at position i, it does not take into account whether these neighbors are chemically equivalent. Thus, the parameter di is introduced to characterize the number of chemically nonequivalent bonds to atoms with Vibi > 1 at the ith position. To include the presence of heteroatoms, the parameter ei gives the number of different non-hydrogen elements or isotopes involved in the bonding situation, which includes atom i and its direct neighbors. For example, in ethane ei = 1 for each atom, while for ethanol ei = 2 for the oxygen atom and its neighboring methylene group, where two elements are involved in bond formation. Finally, the number of states at an atom position i increases with the number of isomeric possibilities si at this position. If there is only one possible configuration at this atom position, then si = 1. For chiral centers, axially chiral centers (atropisomerism), and geometric isomerism at double bonds, si = 2, corresponding to two possible configurations. For geometric isomerism, the center with the lower partial complexity value is chosen as the isomeric center. To apply more weight to ei and si, these variables are placed as coefficients in front of the logarithm. Because of symmetry considerations of individual atoms in a symmetry center, di also



RESULTS AND DISCUSSION

Complexity in general is context-dependent, and for instance, the synthetic complexity of a compound may change over time with ongoing progress in the field of organic chemistry. In contrast to such extrinsic measures of complexity, molecular complexity is an intrinsic measure that will not change over time or with advancing knowledge.26 To this aim, molecular complexity will be treated here in an abstracted form as the information content of a molecule. Information content according to Shannon is defined by the entropy H according to eq 1:

$$H = -\sum_i p_i \log_x p_i \tag{1}$$

where pi is the probability of an event i. The base x of the logarithm is arbitrary and typically is chosen to be 2, whereby entropy is calculated in bits. In a situation where every event has equal probability, pi corresponds to 1/N, where N is the number of possible different states (e.g., in binary code, with the two states “1” and “0”, N = 2 and pi = 0.5). With this definition, the information content of a system is additive and can be determined by the sum of the information of its subsystems. In consequence, adding a line of binary code


to carboxylate. In the following, carboxylic acids will be treated as symmetric regarding the carboxyl group. Because every atom i with Vibi > 1 (typically all non-hydrogen atoms) contributes to the complexity of a molecule, isotopes are included in the variables di, ei, and si or can be handled by additional symmetry or chirality considerations in analogy to other substituents. Thus, Cm can even be used to address cases of isotope chirality (Figure S4).36 In contrast to other indices, Cm is universally applicable to covalent molecules of any chemistry including polyphosphates, silicones, silanes, boron−nitrogen compounds, and boranes. In the special case of boranes, even three-center two-electron bonds can be considered, with the bridging hydrogens in this case contributing to the molecular complexity because bi = 2 (Figure S3). The complexity scale of Cm is defined in the range [0, +∞). Complexity as calculated by Cm increases linearly within any homologous series of a compound class, including nonsymmetric compounds such as aliphatic amines. However, complexity should not be simply a function of the molecular mass, a criterion that is fulfilled by Cm, as demonstrated by a plot of the complexity values of 51 cellular metabolites against their molecular masses (Figure S5). Symmetry of molecules is a critical factor in the concept presented here, and the importance of symmetry was also recognized previously in topological descriptions.37,38 The more symmetric a molecule, the lower is its molecular complexity. For example, para-substituted aromatic compounds with higher symmetry consistently result in lower complexity values than the corresponding ortho- and meta-substituted analogues (Table S1).
While most topological indices do not address symmetry sufficiently, the index of augmented valence complexity (AVC or ξ) described by Randić and Plavšić correlates, like Cm, with a molecule’s symmetry.28,38 Remarkably, augmented valence complexity largely converges with Cm for simple branched and cyclic hydrocarbons (Table S2), demonstrating the value of Cm even for topological analysis. However, as an entirely topological index, augmented valence complexity does not account for heteroatoms or stereochemistry and is thus not generally applicable. In contrast, Cm combines the advantages of topological symmetry with applicability to any compound class. The complexity of NMR spectra may serve to a limited extent as a proxy of molecular complexity. Within a homologous series of compounds such as aliphatic alcohols, the spectral complexity of 1H and 13C NMR spectra rises monotonically and stepwise with each additional methylene group. Asymmetric compounds with many chemically nonequivalent positions always produce more complex spectra than their corresponding more symmetrical isomers. This relationship is reflected by the good correlation of the molecular complexity Cm with the number of nonequivalent carbons in alkane isomers (Table S3). Spin couplings at chiral centers and diastereotopic groups typically generate more complex multiplicities than comparable nonchiral compounds with homotopic groups. As a suitable test of the new complexity index, the four structural isomers of bromobutane were chosen because their qualitative order of complexity can be easily deduced from the numbers of nonequivalent protons in NMR spectroscopy (Figure 2). A comparison of Cm with the most prominent complexity indices demonstrates that only Cm gives the correct order of complexity for the different isomers, where chiral (R)-2-bromobutane has the highest complexity and 2-bromo-2-methylpropane the lowest complexity. Bertz’s index C(η,ε) overemphasizes quaternary carbons, underrepresents symme-

has to be placed as a coefficient, resulting in eq 2 to give the molecular complexity:

$$C_m = \sum_i d_i e_i s_i \log_2(V_i b_i) \tag{2}$$
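As a minimal sketch of how eq 2 can be evaluated, the Python snippet below stores the five per-atom descriptors in a small record and sums the contributions. The class and function names (`Atom`, `contribution`, `cm_uncorrected`) are hypothetical, and the ethanol descriptors are assigned by hand from the definitions in the text (ei = 2 for the oxygen and its neighboring methylene group); the resulting number is an illustration of the arithmetic, not a value quoted from the article.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Atom:
    """Per-atom descriptors of eq 2 (field names follow the symbols in the text)."""
    V: int        # valence electrons of the neutral element
    b: float      # bonds to non-hydrogen atoms
    d: int = 1    # chemically nonequivalent bonds
    e: int = 1    # different non-hydrogen elements (atom plus direct neighbors)
    s: int = 1    # isomeric possibilities (2 for a stereocenter)

def contribution(atom: Atom) -> float:
    """Single-atom term d * e * s * log2(V * b) of eq 2."""
    return atom.d * atom.e * atom.s * math.log2(atom.V * atom.b)

def cm_uncorrected(atoms) -> float:
    """Sum of per-atom terms (eq 2, before any symmetry correction)."""
    return sum(contribution(a) for a in atoms)

# Ethanol, hydrogens omitted; descriptors assigned by hand (assumption).
ethanol = [
    Atom(V=4, b=1),             # CH3: one bond to a non-hydrogen atom
    Atom(V=4, b=2, d=2, e=2),   # CH2: bonded to C and O
    Atom(V=6, b=1, e=2),        # O: bonded to C
]
total = cm_uncorrected(ethanol)
```

Because the sum runs over independent per-atom terms, adding a substituent only requires adding its atoms and updating the descriptors of the atom it attaches to.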

Herein the symbol Cm is used to denote molecular complexity, for which the unit mcbit is chosen in order to distinguish a bit of molecular complexity from a bit of information. To account for symmetries of a molecule, the corresponding atom positions of chemically equivalent sets of atoms for each symmetric position j are subtracted (eq 3):

$$C_m = \sum_i d_i e_i s_i \log_2(V_i b_i) - \frac{1}{2} \sum_j d_j e_j s_j \log_2(V_j b_j) \tag{3}$$

Detailed guidance for the application of this equation is given in the Supporting Information. The factor of 1/2 ensures that the complexities of molecules with more than two equivalent sets of atoms (e.g., cycloalkanes) increase within a homologous series (Figure S1). This measure decreases the complexity of highly symmetric molecules and ensures that the bias of odd-numbered versus even-numbered symmetric compounds is corrected. For example, symmetric hydrocarbon chains with an odd number of backbone atoms experience at the symmetry center a lower value for di compared with even-numbered compounds, which lack an atom position in the symmetry plane. This correction and reduction of complexity is also in line with NMR spectra, where symmetric or chemically equivalent substituents do not contribute additional signals. In the homologous series of alkanes, after symmetry correction Cm increases linearly by 3 mcbit per additional methylene group (Figure S2). Two detailed examples of molecular complexity calculations are given in Figure 1. As the complexity index of eq 3 is additive, substituents and functional groups can be easily summed up as long as altered bonding situations at the connecting atom, heteroatoms, and potentially arising symmetries are taken into account. Resonance structures of aromatic compounds give identical results for Cm and do not have to be treated differently. Special bond situations such as electron delocalization in the carboxylate ion that results in a new symmetry plane can be addressed by counting 1.5 bonds and including the oxygen atoms in the symmetry correction (Figure S3). After these corrections, the complexity still increases in the order from alcohol via aldehyde

Figure 1. Examples illustrating how to calculate Cm on a per-atom basis for a molecule. Different atom positions i are labeled by letters (a−f), and for tert-butyl acetate the symmetry correction for the positions f is included. In the equations, “log” stands for log2.
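The symmetry correction of eq 3 can be sketched for linear alkanes as follows. This is a hedged illustration: it assumes that the second sum runs over every atom that has at least one symmetry-equivalent partner, a reading chosen here because it reproduces the increment of 3 mcbit per methylene group stated in the text. The function names and the hand-assigned descriptors (di = 1 for terminal carbons and for the central carbon of odd-numbered chains, di = 2 otherwise) are assumptions of this sketch, and the authoritative procedure is the one in the Supporting Information.

```python
import math

def contribution(V, b, d=1, e=1, s=1):
    """Single-atom term d * e * s * log2(V * b)."""
    return d * e * s * math.log2(V * b)

def cm_n_alkane(n: int) -> float:
    """Cm of a linear alkane with n >= 2 carbons under the assumed reading of eq 3."""
    total = 0.0
    equivalent = 0.0  # terms of atoms that have a symmetry-equivalent partner
    for i in range(n):
        terminal = i in (0, n - 1)
        central = (n % 2 == 1) and (i == n // 2)  # atom lying in the symmetry plane
        b = 1 if terminal else 2
        d = 1 if (terminal or central) else 2
        term = contribution(4, b, d)
        total += term
        if not central:
            equivalent += term
    return total - 0.5 * equivalent

values = [cm_n_alkane(n) for n in range(2, 9)]          # ethane to octane
steps = [later - earlier for earlier, later in zip(values, values[1:])]
```

Under these assumptions every entry of `steps` equals 3, for odd- and even-numbered chains alike, which is the behavior the factor of 1/2 is described as ensuring.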


try, and does not include chirality and thus yields the lowest complexity value for 1-bromobutane. The same problem exists for Barone’s index, which includes neither symmetry corrections nor contributions for chirality. Although Whitlock’s index S includes chiral information, it does not employ symmetry corrections and is entirely insensitive to branching and hydrocarbon chain length. The graph-theory-based indices NS and NT indirectly include symmetry but not stereochemistry and display reduced sensitivity to branching, resulting in identical values for 1-bromo-2-methylpropane and 2-bromo-2-methylpropane. An overview of molecular complexity values for a larger number of compounds and a comparison of Cm with various other complexity indices are given in Tables S4 and S5, respectively. According to Whitlock’s index, hydrocarbon chains do not contribute to complexity, and thus, formaldehyde (CH2O) and 4-methyl-3-penten-1-ol (C6H12O) give identical values. With this index, the symmetrical molecule acetylene (C2H2) even scores higher in complexity than chiral (S)-3-methyl-1-

Figure 2. Molecular complexity of bromobutane isomers. The order of the isomers is given from the most complex (left) to the least complex (right) according to the spectral complexity of the 1H NMR spectra as a proxy of molecular complexity. The numbers of nonequivalent protons are in line with the relative complexities given by distinct signals and multiplicities of the corresponding 1H NMR spectra. Complexity values are given for Cm and other indices.

Figure 3. Nonlinear behavior of C(η,ε) in comparison to the linear behavior of Cm. (A) Homologous series from ethanol to C20H41OH, where Cm gives a linear increase and C(η,ε), C(η), and NT give nonlinear increases. A linear extrapolation (lin.) from the first two values of C(η,ε) is given for better visualization. (B) Changes in complexity ΔC(η,ε) and ΔCm for the addition of the nth glycine residue to a polyglycine chain, showing the nonlinear behavior of C(η,ε).


Figure 4. Comparison of the graph-theory-based index NS with Cm. (A) Linear and nonlinear increases within the homologous series of cumulenes for Cm and NS, respectively. (B) Invariance of ΔCm to the chain length of aliphatic alcohols in the oxidation reaction to give the corresponding aldehydes, in contrast to an increase in ΔNS for the same reaction with increasing hydrocarbon chain length.

octadecanol (C19H41O). Bertz’s C(η,ε) and Barone’s empirical index avoid these problems and are qualitatively very similar to each other. However, neither includes stereochemical contributions, and both encounter problems with disproportionate contributions of tertiary and quaternary carbons. The Bertz index C(η,ε) uses symmetry considerations and includes the term C(ε) accounting for the diversity of elements in a molecule. The lack of both symmetry and diversity contributions is a severe shortcoming of Barone’s index, which gives the same complexity for symmetric butane-1,4-diol as for asymmetric 4-amino-1-butanol. In regard to these problems, C(η,ε) is the more accurate index. However, in a homologous series such as linear aliphatic alcohols, C(η,ε) increases nonlinearly because of nonlinearity in the connectivity term C(η) (Figure 3A). For the simple polypeptide polyglycine, calculating ΔC(η,ε) for each added glycine residue to a chain of 1−200 residues demonstrates that linear behavior of C(η,ε) is approached asymptotically only for very large values of η, corresponding to very large molecules (Figure 3B), where Δ2η becomes the dominant term due to eq 4:

$$\lim_{\eta \to \infty} \left[ \log_2(\eta + 1) - \log_2(\eta) \right] = 0 \tag{4}$$
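The limit in eq 4 can be checked numerically with a few lines of Python. This sketch illustrates only the logarithmic difference of eq 4 and is not an implementation of Bertz's index; the function name `log_step` and the sample values of η are arbitrary choices.

```python
import math

def log_step(eta: int) -> float:
    """Difference log2(eta + 1) - log2(eta) that appears in eq 4."""
    return math.log2(eta + 1) - math.log2(eta)

etas = [1, 10, 100, 1_000, 10_000]
steps = [log_step(eta) for eta in etas]

for eta, step in zip(etas, steps):
    # The difference starts at 1 bit for eta = 1 and shrinks toward zero.
    print(f"eta = {eta:>6}: log2(eta + 1) - log2(eta) = {step:.6f}")
```

The slow decay of this difference is consistent with the statement that linear behavior is approached only for very large values of η.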

Thus, the formation of a dipeptide by adding one glycine residue to glycine increases the complexity C(η,ε) by 84.1 while the 200th added glycine residue yields an increase of 199.4. This is a general problem of C(η,ε) that makes all of the chemical transformations of a molecule nonadditive and prevents simple modular calculations for larger molecules. In contrast, Cm is linear and additive over its entire scale, allowing even manual calculation of the complexity for very large molecules (Figure 3). The two most important graph-theoretical indices, NT and NS, have found applications in synthesis planning by examining retrosynthetic disconnections and minimizing the complexity of synthetic routes.22,26,39 A variety of additional graph-theory-based indices have been developed to approximate NT and NS, such as TT and PT and TS and PS, respectively. These indices, however, are less robust and not sensitive to many structural features such as branching, double bonds, and heteroatoms, which is why they will not be discussed here any further.12


While NT was, after Cm, the most useful index for assessing the complexity of isomers of bromobutane, it behaves nonlinearly within a homologous series of aliphatic compounds. In contrast to C(η,ε), complexity measured by NT increases as a quadratic polynomial function (Figure 3A). This means that the longer a hydrocarbon chain, the larger is the contribution of every additional methylene group added. Especially for larger molecules and for comparison of chemical reactions of different molecules, this property of NT is problematic. Similar behavior has been shown for total walk count (twc), another graph-theory-based index.27 Unlike NT, the index for the number of kinds of subgraphs, NS, is linear for aliphatic compounds and meets the requirement of being additive within some compound classes. However, it is problematic that the presence of multiple double bonds leads to a sharp increase in the number of kinds of subgraphs, causing the complexity of cumulenes to increase within their homologous series as a higher-degree polynomial function (Figure 4A). This problem is solved by Cm, which also increases linearly for cumulenes and any other homologous series. Because of its additive character, the new index Cm also is applicable for calculating the molecular complexity of any compound type, including even polymers and large biomolecules such as proteins and nucleic acids. Furthermore, modular calculations allow one to focus on the complexity of relevant substructures during a chemical transformation, which has possible applications in solid-phase synthesis or enzyme-anchored reaction steps such as the biosynthesis of polyketides. Hereby, Cm may be of great advantage compared with graph-theory-based indices. The reason why graph theory is not ideal for assessing the complexity of molecules is that it treats chemical structures as entirely mathematical constructs without a fundamental chemical basis.
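The quadratic growth of a total-subgraph count along an unbranched chain can be illustrated with a short Python sketch. It models the chain as a path of n identical atoms and counts every connected subgraph, single atoms included. Whether this matches the exact counting convention used for NT in the cited work is an assumption of this sketch, and heteroatoms are ignored.

```python
def count_connected_subgraphs_of_path(n: int) -> int:
    """Count connected subgraphs (contiguous runs of atoms) in a chain of n atoms."""
    # Each connected subgraph of a path is fixed by its first and last atom.
    return sum(1 for start in range(n) for end in range(start, n))

counts = {n: count_connected_subgraphs_of_path(n) for n in range(1, 11)}

# Increment caused by adding one more atom to a chain of n atoms.
increments = [counts[n + 1] - counts[n] for n in range(1, 10)]
```

In this model the count equals n(n + 1)/2, so each additional atom contributes more than the previous one, which is the nonadditive behavior described above.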
Various complexity indices have been applied to investigate convergence in chemical synthesis and to assess the complexity of total synthesis of natural products.33,39,40 A major shortcoming of these complexity indices appears in comparisons of chemical reactions with different molecular structures. The oxidation of an aliphatic alcohol to an aldehyde yields an increasing ΔNS with increasing length of the hydrocarbon chain even though the chemical impact of a functional group does not increase with the molecule’s size (Figure 4B). This problem is solved for Cm, where the oxidation reaction results in a constant ΔCm irrespective of the length of the hydrocarbon chain. In contrast to graph-theory-based indices, Cm has the important advantage that changes in molecular complexity for chemical transformations are invariant to distal structural components that are not involved in the reaction unless they have an impact on symmetry or stereochemistry. Thus, values of ΔCm (calculated as the complexity of the products minus the complexity of the reactants) for different chemical reactions can be directly compared even for reactants that differ significantly in molecular complexity, which does not apply for graph-theory-based indices (Table 1 and Figure S6). For example, a stereoselective aldol reaction with any aldehyde and acetone always results in the same ΔCm for every reaction. The aldol reaction generates stereochemical information, yielding a large ΔCm, in contrast to aldol condensation reactions, where a significant part of the functionality and complexity is subsequently lost during dehydration (Figure S6). For some reactions, ΔCm also may become negative, when the overall complexity is decreased in the reaction (Table 1). The 1,3-dipolar cycloaddition reaction of azides and alkynes generates a very large ΔCm by formation

Table 1. Changes in Molecular Complexity for Selected Chemical Reactions

* The reactant and product are considered asymmetric.

of a triazole ring. The ability to generate in one reaction a large increase in molecular complexity may contribute to the reasons why aldol reactions and the Huisgen 1,3-dipolar cycloaddition and its application as a click reaction have become very popular. By the use of Cm, changes in complexity can even be generalized for a given reaction type if the chemical environment at the reaction site and potentially occurring symmetries are comparable between the reactions. An overview of ΔCm values for selected reactions is given in Table 1. This kind of generalization of complexity changes for different reactions has not been possible with other indices. In general, molecular complexity on its own does not quantify distinct structural elements such as defined functional groups, charged residues, hydrogen acceptors, or hydrogen donors. However, the molecular complexity per molar mass unit may provide an integrated measure of the density of molecular complexity in a molecule and thus indirectly of functional groups, heteroatoms, unsaturations, and stereochemistry. Indeed, comparing the molecular complexity densities of different compounds resulted in significant differences that were indicative of fatty acids, amino acids, and sugars, but variations also occurred within compound classes, especially for amino acids (Table S6). For example, higher complexity densities were found in sugars compared with amino acids and fatty acids. Thus, considering molecular complexity in combination with other parameters may provide useful information for medicinal chemistry. 
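The invariance of ΔCm to distal structure can be sketched for the oxidation of primary alcohols to aldehydes. All descriptors below are assigned by hand and are assumptions of this sketch; in particular, the C=O double bond is counted as two bonds in bi, by analogy with the 1.5 bonds mentioned for the carboxylate ion. The function names are hypothetical, and the resulting ΔCm is not claimed to reproduce the value listed in Table 1. The point is only that the chain atoms appear identically in reactant and product and therefore cancel.

```python
import math

def contribution(V, b, d=1, e=1, s=1):
    """Single-atom term d * e * s * log2(V * b)."""
    return d * e * s * math.log2(V * b)

def chain(n_chain: int):
    """Terms of an unbranched alkyl chain CH3-(CH2)k- with n_chain >= 2 carbons."""
    return [contribution(4, 1)] + [contribution(4, 2, d=2)] * (n_chain - 1)

def alcohol(n_chain: int) -> float:
    site = [contribution(4, 2, d=2, e=2),   # CH2-OH carbon
            contribution(6, 1, e=2)]        # hydroxyl oxygen
    return sum(chain(n_chain)) + sum(site)

def aldehyde(n_chain: int) -> float:
    site = [contribution(4, 3, d=2, e=2),   # CHO carbon, C=O counted as two bonds (assumption)
            contribution(6, 2, e=2)]        # carbonyl oxygen
    return sum(chain(n_chain)) + sum(site)

# Delta Cm = complexity of product minus complexity of reactant.
deltas = [aldehyde(n) - alcohol(n) for n in range(2, 10)]
```

Since only the two atoms at the reaction site change their descriptors, every entry of `deltas` is the same regardless of chain length, in line with the behavior described for Figure 4B.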
Central questions in medicinal chemistry are whether a molecule is drug-like and how to select promising leads and develop screening libraries.41 A great deal of effort has been put into developing concepts to address these questions,42 most prominently represented by Lipinski’s rule of five, which has become the standard rule of thumb in drug development.43 However, it is well-known that promising drug candidates may be excluded and that even some approved drugs would be removed by these rules.44,45 Thus, alternative strategies to assess drug-likeness have been proposed.42,45 Instead of defining strict cutoff values for drug development, molecular complexity may help to map a space of drug-like compounds. To test whether a molecular complexity density function can be


Figure 5. Molecular complexity plotted against (A) molecular mass and (B) the combined number of hydrogen-bond donor and acceptor sites of a molecule. In (A), the area shaded in gray maps the distribution space of common antibiotics, which largely overlaps with the space occupied by peptides. In (B), antibiotic classes are clustered, as highlighted by different shadings.

useful in medicinal chemistry, Cm values were calculated for approved drugs of different major classes of antibiotics and plotted against the corresponding molar masses (Figure 5A). In the complexity−molar mass space, the hydrocarbon line approximates the lower limit. Interestingly, antibiotics occupied a clearly distinct space, and the majority of drugs were clustered in the range of short peptides (di- to tetrapeptides) and nucleosides, although their structures are not related. This is perhaps not surprising, as the native substrates of the target enzymes of antibiotics typically are peptides and nucleosides. Consequently, antibiotics bind their targets by mimicking substrate binding, which may be reflected in a similar distribution of complexity in relation to molecular mass. Some antibiotics, however, including aminoglycosides, tetracyclines, and lincosamides, crossed the space of nucleosides, resulting in higher complexity densities due to abundant chiral centers and heteroatoms. This approach may help to define upper and lower limits of molecular complexity for drugs

and to assess the drug-likeness of a compound. Hydrogen-bond formation is one of the major factors contributing to selectivity and efficacy of target binding. To map the correlation of hydrogen bonds and molecular complexity, Cm of the compounds was plotted against the combined number of hydrogen-bond donor and acceptor sites, in the following termed hydrogen-bond sites (Figure 5B). Again, a spatial correlation with peptides and nucleosides was observed whereby antibiotics gave on average higher values of Cm for identical numbers of hydrogen-bond sites. The larger molecular complexity per hydrogen-bond site implies that complexity in commercial antibiotics is gained to a larger extent from non-hydrogen-bond sites. This may be a result of the common efforts in drug development to optimize bioavailability by keeping the numbers of hydrogen-bond donor and acceptor sites rather low and balancing solubility against lipophilicity. Mapping the dependence of the molecular complexity on the number of hydrogen-bond sites may help to define a drug-like

space that could aid in the selection of drug candidates. Thus, in combination with parameters such as molar mass and the number of hydrogen-bond sites, Cm has great potential for applications in medicinal chemistry.
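The two normalizations discussed here, complexity relative to molar mass and complexity per hydrogen-bond site, can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes that "complexity density" means Cm divided by molar mass, that Cm has already been computed elsewhere, and the names `Compound`, `complexity_density`, and `complexity_per_hbond_site` are hypothetical. The numbers in the example are placeholders, not values from this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    """One point in the complexity spaces; all values are supplied by the user."""
    name: str
    cm: float            # molecular complexity C_m, computed elsewhere
    molar_mass: float    # g/mol
    hbond_donors: int
    hbond_acceptors: int

    @property
    def hbond_sites(self) -> int:
        # Combined donor + acceptor count, the "hydrogen-bond sites" of the text.
        return self.hbond_donors + self.hbond_acceptors


def complexity_density(c: Compound) -> float:
    """C_m per unit molar mass (assumed reading of 'complexity density')."""
    if c.molar_mass <= 0:
        raise ValueError("molar mass must be positive")
    return c.cm / c.molar_mass


def complexity_per_hbond_site(c: Compound) -> float:
    """C_m per hydrogen-bond site; undefined when there are no sites."""
    if c.hbond_sites == 0:
        raise ValueError("compound has no hydrogen-bond sites")
    return c.cm / c.hbond_sites


if __name__ == "__main__":
    # Placeholder numbers for illustration only; NOT values from the paper.
    example = Compound("placeholder", cm=300.0, molar_mass=400.0,
                       hbond_donors=3, hbond_acceptors=7)
    print(complexity_density(example))         # 0.75
    print(complexity_per_hbond_site(example))  # 30.0
```

Comparing these two ratios for a candidate against those of reference peptides and nucleosides is one simple way to place it in the two-dimensional spaces of Figure 5; where to draw drug-like limits remains a modeling choice.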



CONCLUSIONS

This work has created a framework for molecular complexity based on site-specific descriptors for the microenvironment of every atom in a molecule. Information-theoretical considerations have led to the conclusion that molecular complexity should be an additive measure. On the basis of these considerations, the universal index Cm provides a general solution for molecular complexity that combines the advantages of topological indices with inherent chemical logic. This new, general index addresses the major shortcomings of other indices in handling stereochemistry, unsaturated bonds, branching, and symmetry, and it is simple enough to be calculated by hand. Its conceptual power has been demonstrated for individual compounds as well as chemical transformations, and it is applicable even to non-carbon chemistries and macromolecular compounds such as synthetic polymers, proteins, and nucleic acids. In combination with other parameters such as molar mass and the number of possible hydrogen bonds, molecular complexity allows clustering of antibiotics in two-dimensional complexity spaces and their correlation with other biomolecules. Thus, the index Cm may be a valuable tool for medicinal chemistry to investigate drug candidates, support the selection of lead structures, and aid in silico development.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.jcim.5b00723. A more detailed guide for calculating molecular complexity and additional tables and figures (PDF)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

I thank Prof. Dr. Andreas Marx and his group for their generous support and gratefully acknowledge funding by the Emmy Noether Program of the Deutsche Forschungsgemeinschaft (DFG), the EU FP7 Marie Curie Zukunftskolleg Incoming Fellowship Program (University of Konstanz Grant 291784), the Fonds der Chemischen Industrie (FCI), and the Konstanz Research School Chemical Biology.




DOI: 10.1021/acs.jcim.5b00723 J. Chem. Inf. Model. 2016, 56, 462−470